PostgreSQL vs Azure Databricks Database Features

The Federal Science DataHub offers multiple ways to use database functionality. This guide will compare the use of a PostgreSQL database on FSDH with the database features provided by Azure Databricks.

Introduction

When it comes to hosting a database with FSDH, there are multiple options available. One popular choice is using PostgreSQL, while another option is leveraging the database features provided by Azure Databricks. In this comparison, we will explore the key differences and benefits of each approach.

PostgreSQL

PostgreSQL is a fully managed database service that allows you to deploy PostgreSQL databases in the cloud. It provides high availability, automatic backups, and integration with other services. Some of the key features of PostgreSQL include:

  • Easy deployment of PostgreSQL databases
  • Built-in high availability and automatic backups
  • Support for advanced PostgreSQL features such as JSONB, full-text search, and spatial data types
  • Seamless integration with other services like web apps or APIs

Azure Databricks Database Features (Hive)

Azure Databricks is an analytics platform that combines Apache Spark with a collaborative environment for data science and machine learning. While it is primarily known for its big data processing capabilities, Azure Databricks also provides powerful database features. Some of the key database features offered by Azure Databricks include:

  • Delta Lake: A transactional storage layer that provides ACID transactions, schema enforcement, and data versioning for data lakes.
  • Delta Engine: A high-performance query engine optimized for Delta Lake, enabling fast and scalable SQL queries on large datasets.
  • Delta Live Tables: A real-time change data capture (CDC) feature that allows you to capture and process real-time data changes from various sources.
  • Data Lakehouse Architecture: Azure Databricks promotes a unified architecture that combines the best of data lakes and data warehouses, enabling efficient data storage, processing, and analytics.

Databricks leverages the standard Apache Hive (opens new window) project which enables to leverage standard storage and query data using SQL. The underlying model is quite different from a traditional database.

  • Databricks SQL requires the cluster to be enabled to run SQL queries
  • Large datasets can be easily accomodated by simply scaling the cluster size
  • There is only storage costs for data that is not used
  • Large datasets are cost efficient because standard Azure storage is used.

Comparison

This chart outlines the key differences between PostgreSQL and Azure Databricks Database Features:

Feature PostgreSQL Azure Databricks Database Features (Hive Tables)
Database Model Traditional Relational Database Unified data analytics platform based on Lakehouse architecture
Primary Use Case General-purpose database. Web app / API integration. Complex SQL queries. Big data analytics and real-time processing. Machine Learning. Collaborative data exploration.
Schema Strict traditional SQL Schema Flexible and auto-schema management
Performance Good for OLTP workloads. Optimized for big data processing.
Scalability Limited scaling in current setup Horizontal scaling with Delta Engine.
Integration Seamless integration with web apps and APIs. Integration with big data processing and machine learning workflows.

PostgreSQL

  • Web Application
  • Interactive Form
  • API integration
  • Connection to existing application

Databricks SQL Hive Tables

  • Databricks Dashboards
  • Science experiment data
  • Evolving data that requires flexible schemas
  • ETL pipeline intermediary data

Costing Difference

On-Demand Databricks

  • Databricks costing is pay for what you use aka it depends on the cluster size and how long the cluster has been running
  • With light usage, it is possible to keep databricks under $100 in a month and that includes Hive tables
  • It is recommended to perform some small experiments and monitor the costs to precisely estimate the cost

PostgreSQL Cost

  • PostgreSQL in FSDH uses the Azure Flexible Server (opens new window) model which is also based on consumption (burstable model)
  • With very light usage, it costs about $30/month
  • It is recommended to perform experiments as well and monitor the cost to determine more precisely the cost for a specific application

Connectivity

Feature PostgreSQL Azure Databricks Database Features
Connect from Web Application X Not possible
Connect from Power BI X X
Connect from Tableau X X
Connect from Python X Possible but requires cluster to be on

Conclusion

In conclusion, PostgreSQL and Azure Databricks Database Features offer different strengths and use cases. PostgreSQL is well-suited for general-purpose database workloads, web app integration, and complex SQL queries. On the other hand, Azure Databricks database features are optimized for big data analytics, real-time processing, and collaborative data exploration. Depending on your specific use case and requirements, you can choose the database solution that best fits your needs.