Federal Science DataHub

FSDH Cluster Policies

Overview

As part of the Federal Science DataHub, we provide custom Databricks cluster policies that help you get the most out of your Databricks clusters through predefined cluster configurations. We currently offer three cluster policies in addition to Databricks' Personal Compute policy: "Datahub Small Cluster", "Datahub Regular Cluster" and "Datahub Large Cluster". While worker and driver configurations are predefined, the choice of Databricks Runtime is completely up to you. Each policy is described in detail below.

Personal Compute

This is the default cluster policy provided by Databricks. It is designed for personal use and is not recommended for production. It is a good choice if you are just getting started with Databricks and want to get a feel for how it works, or if you are working with small datasets and do not need much compute power. Note that this is a single-node cluster with no separate workers, so work is not distributed across multiple machines. By default, it uses the latest machine learning runtime, so machine learning tools are available on the cluster. The cluster configuration is as follows:

  • Node type: Standard_DS3_v2 (4 CPU, 14 GB memory)

For more details, see the Databricks documentation on the Personal Compute policy.

Datahub Small Cluster

This cluster policy is designed for small production workloads. It is a good choice if you are working with small datasets and need only a small amount of compute power that can scale if needed. It is also a good choice if you are just getting started with Databricks and want to get a feel for how a non-personal cluster works. The cluster configuration is as follows:

  • Worker and driver type: Standard_D4ds_v5 (4 CPU, 16 GB memory)
  • Number of workers: 0 to 2 workers
  • Can use spot instances
  • Can autoscale
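
As an illustration of the limits listed above, the following is a minimal sketch of a cluster definition that stays within the Datahub Small Cluster policy. The field names follow the Azure Databricks Clusters API request body; the function name `small_cluster_spec`, the cluster name, the runtime version string and the `<policy-id>` placeholder are assumptions made for this example, and the actual policy ID must be looked up in your own workspace.

```python
import json


def small_cluster_spec(cluster_name, spark_version, policy_id):
    """Build a cluster definition that fits the Datahub Small Cluster limits.

    This only builds a dictionary; it does not contact Databricks.
    """
    return {
        "cluster_name": cluster_name,
        # The Databricks Runtime is the user's choice under FSDH policies.
        "spark_version": spark_version,
        # Hypothetical placeholder: use the policy ID from your workspace.
        "policy_id": policy_id,
        # Worker and driver type fixed by the Small Cluster policy.
        "node_type_id": "Standard_D4ds_v5",
        "driver_node_type_id": "Standard_D4ds_v5",
        # Autoscaling range chosen inside the 0 to 2 worker limit.
        "autoscale": {"min_workers": 1, "max_workers": 2},
        # Spot instances with on-demand fallback; the first node
        # (the driver) stays on-demand.
        "azure_attributes": {
            "availability": "SPOT_WITH_FALLBACK_AZURE",
            "first_on_demand": 1,
        },
    }


# Example values only: the runtime version and policy ID are assumptions.
spec = small_cluster_spec("my-small-cluster", "15.4.x-scala2.12", "<policy-id>")
print(json.dumps(spec, indent=2))
```

A dictionary like this could be sent as the body of a cluster creation request or adapted for the Databricks UI's JSON editor. Whether a given combination is accepted is ultimately decided by the policy in your workspace.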

Datahub Regular Cluster

This cluster policy is designed for regular production workloads. It is a good choice if you've encountered bottlenecks with the small cluster configuration. The cluster configuration is as follows:

  • Worker and driver type: Choice of the following
    • Standard_D4ds_v5 (4 CPU, 16 GB memory)
    • Standard_D8ds_v5 (8 CPU, 32 GB memory)
    • Standard_D16ds_v5 (16 CPU, 64 GB memory)
  • Number of workers: 0 to 4 workers
  • Can use spot instances
  • Can autoscale
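
When moving up from the small cluster, one simple way to choose among the node types listed above is to pick the smallest one that meets your per-node CPU and memory needs. The sketch below does this for the Regular Cluster choices. The helper name `smallest_fitting_node` and the example requirements are assumptions for illustration, not part of FSDH or Databricks.

```python
# Node types available under the Datahub Regular Cluster policy:
# (name, CPUs, memory in GB), as listed in this page.
REGULAR_NODE_TYPES = [
    ("Standard_D4ds_v5", 4, 16),
    ("Standard_D8ds_v5", 8, 32),
    ("Standard_D16ds_v5", 16, 64),
]


def smallest_fitting_node(min_cpus, min_memory_gb, node_types=REGULAR_NODE_TYPES):
    """Return the smallest node type meeting both requirements, or None."""
    # Check candidates from smallest to largest so the first match is the smallest.
    for name, cpus, memory_gb in sorted(node_types, key=lambda n: (n[1], n[2])):
        if cpus >= min_cpus and memory_gb >= min_memory_gb:
            return name
    return None


# Example: a job that needs at least 6 CPUs and 24 GB of memory per node.
choice = smallest_fitting_node(6, 24)
print(choice)  # Standard_D8ds_v5

# A requirement beyond the Regular policy returns None, which suggests
# looking at the Large Cluster policy instead.
too_big = smallest_fitting_node(32, 64)
print(too_big)  # None
```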

Datahub Large Cluster

This cluster policy is designed for large production workloads. It is a good choice if you are working with extremely large datasets that require maximal parallelization. We do not recommend using this policy unless you are aware of the costs it may entail. The cluster configuration is as follows:

  • Worker and driver type: Choice of the following
    • Standard_D4ds_v5 (4 CPU, 16 GB memory)
    • Standard_D8ds_v5 (8 CPU, 32 GB memory)
    • Standard_D16ds_v5 (16 CPU, 64 GB memory)
    • Standard_D32ds_v5 (32 CPU, 128 GB memory)
    • Standard_D48ds_v5 (48 CPU, 192 GB memory)
    • Standard_D64ds_v5 (64 CPU, 256 GB memory)
  • Number of workers: 0 to 4 workers
  • Can use spot instances
  • Can autoscale
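
To get a rough sense of why cost awareness matters here, the sketch below computes the largest cluster each policy allows, using only the worker limits and node sizes listed on this page. It assumes the driver uses the same node type as the workers and counts one driver plus the maximum number of workers. The names `POLICIES` and `max_capacity` are introduced for this example, and no pricing is included because prices are not stated in this page.

```python
# Upper limits per policy, taken from the lists on this page:
# maximum workers and the largest selectable node as (CPUs, memory in GB).
POLICIES = {
    "Datahub Small Cluster": {"max_workers": 2, "largest_node": (4, 16)},
    "Datahub Regular Cluster": {"max_workers": 4, "largest_node": (16, 64)},
    "Datahub Large Cluster": {"max_workers": 4, "largest_node": (64, 256)},
}


def max_capacity(policy_name):
    """Return (total CPUs, total memory in GB) at the policy's upper limit."""
    policy = POLICIES[policy_name]
    cpus, memory_gb = policy["largest_node"]
    # One driver plus the maximum number of workers, all of the same type.
    nodes = policy["max_workers"] + 1
    return nodes * cpus, nodes * memory_gb


for name in POLICIES:
    total_cpus, total_memory = max_capacity(name)
    print(f"{name}: up to {total_cpus} CPUs, {total_memory} GB memory")
# Datahub Small Cluster: up to 12 CPUs, 48 GB memory
# Datahub Regular Cluster: up to 80 CPUs, 320 GB memory
# Datahub Large Cluster: up to 320 CPUs, 1280 GB memory
```

At its upper limit, the Large Cluster policy allows four times the CPUs and memory of the Regular Cluster policy, so an oversized cluster left running can consume a workspace budget much faster.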

Creating a cluster

For more information on how to create clusters, please refer to the Databricks documentation.

Last Updated: 2026-04-13, 11:39 a.m.