Should you use a Data Lakehouse instead of a Data Warehouse and/or Data Lake?
Intro
When using your Data Platform to improve your Business Intelligence with useful dashboards and reports, you’ll more than likely want to use a Data Warehouse. Add a Data Lake for your data science workloads and for storing your raw data cheaply, and the costs soon start to add up. Running both in tandem on a Data Platform can carry serious cost and maintenance overheads.
So, can you have the best of both worlds with the Data Lakehouse? And what is the best Lakehouse to use?
Before we answer those questions, we must ask “What is a Data Warehouse, Data Lake and a Data Lakehouse?”
What is a Data Warehouse, Data Lake and a Data Lakehouse?
A Data Warehouse is a data architecture that has been around since the 1990s and is still relevant today. It is a means of storing tabular data so it can be easily used by business intelligence applications such as Tableau or Power BI, web applications, and even other data warehouses. The three most common Data Warehouse architectures are the Kimball Star Schema, Data Vault and One Big Table.
The name is also confusingly used to identify a type of Database, such as AWS Redshift, Azure Synapse and Snowflake, which specialise in storing and querying large amounts of data.
Data Warehouses have their issues; they can be more expensive than a Data Lake when processing large amounts of data, and work best when data is of reasonable quality and in a tabular structure.
Architecture of a simple Data Platform using just a Data Warehouse.
So, along came the Data Lake to help ease these common pain points:
- Data Scientists needing to be able to process large amounts of raw data of dubious quality.
- Increasing requirements for storage of non-tabular data sources.
- The need for data storage that is more flexible in structure and schema.
- The need to store data that might be needed at a later date, for example for auditing, with a low setup and maintenance cost (little or no ETL process needed compared with a Database).
A Data Lake is, at its heart, just a distributed file system, usually hosted in the cloud on AWS S3 or Azure Data Lake Storage, with large files split (partitioned) by a key so you can save on processing costs by loading only the partitions you need.
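As a minimal sketch of what that partitioning looks like in practice with PySpark (the bucket names, paths and column names here are hypothetical, and depending on your Spark distribution you may need s3a:// rather than s3:// paths):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data-lake-partitioning").getOrCreate()

# Hypothetical raw landing zone in the Data Lake.
orders = spark.read.json("s3://my-raw-bucket/orders/")

# Split the large dataset into folders keyed by order_date, e.g.
# s3://my-lake-bucket/orders/order_date=2023-05-01/part-*.parquet
orders.write.partitionBy("order_date").parquet("s3://my-lake-bucket/orders/")

# A later query that filters on the partition key only reads the matching
# folders, saving compute and I/O.
may_orders = (
    spark.read.parquet("s3://my-lake-bucket/orders/")
    .where("order_date >= '2023-05-01' AND order_date < '2023-06-01'")
)
```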
Data Lakes also generally offer more flexibility, in that they can store an unlimited number of file formats and provide a common storage interface that lets you use many different compute engines. This is often called separating storage from compute, which has become so popular that many Data Warehouses now offer it too. Data Lakes can also easily store non-tabular data (images, videos and music) that Data Warehouses cannot handle without some pre-processing.
However, without a table format such as Delta Lake, a Data Lake cannot easily or efficiently perform the row-level updates and inserts, or connect as easily to business intelligence applications, that a Data Warehouse or Database can.
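To make that concrete, Delta Lake adds transactional upserts on top of plain Data Lake files. A minimal PySpark sketch, assuming Delta Lake is installed and with hypothetical table paths and join keys:

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-upsert").getOrCreate()

# Hypothetical incoming change set landed as plain files.
updates = spark.read.parquet("s3://my-lake-bucket/customer_updates/")

# Existing Delta table in the lake.
customers = DeltaTable.forPath(spark, "s3://my-lake-bucket/customers_delta/")

# Row-level upsert: update matching customers and insert new ones,
# all within a single ACID transaction.
(
    customers.alias("t")
    .merge(updates.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```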
Architecture of an example Data Platform using both a Data Lake and Data Warehouse.
What is a Data Lakehouse?
A Data Lakehouse is an open data management architecture that combines the flexibility, cost-efficiency, and scale of Data Lakes with the data management and ACID transactions of Data Warehouses, enabling business intelligence (BI) and machine learning (ML) on all data.
What is Databricks Lakehouse?
Until a few years ago, Databricks was mainly designed as an easy way to run Spark, a distributed data processing framework for large-scale Data Engineering and Data Science. It worked mainly in tandem with a Data Lake, with similar advantages and drawbacks.
In 2019 Databricks released Delta Lake, a table format with attributes previously found only in Databases and Data Warehouses, as mentioned above. Combined with Spark to process and transform a wide variety of data, this gave birth to the Data Lakehouse.
Today, Databricks has a fully featured SQL Data Warehouse, enterprise security, data governance with Unity Catalog, many data connectors, as well as the ability to output data to Power BI and Tableau, so it can meet all common data use cases.
Architecture of an example Databricks “Lakehouse” using Spark as the processing engine and Delta Lake as storage.
For those looking at building a Data Mesh, Databricks has federated queries in preview, and Delta Lake also has connectors for Trino, Starburst and Dremio, so you can join up many Data Lakes across your organisation:
Architecture of many Lakehouse Data Products in a Data Mesh – the query layer and governance layer will have access to all Data Products, limited by access permissions.
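As a rough illustration of that query layer, an engine such as Trino can join Delta tables that live in different Data Products once they are registered in its catalogs. This is a sketch only; the host, catalog, schema and table names below are hypothetical:

```python
import trino  # Trino Python client: pip install trino

conn = trino.dbapi.connect(
    host="trino.example.internal",  # hypothetical Trino coordinator
    port=8080,
    user="analyst",
    catalog="sales_lakehouse",      # hypothetical Delta-backed catalog
    schema="gold",
)

cur = conn.cursor()
# Join two Data Products (two separate lakes) in a single federated query.
cur.execute(
    """
    SELECT c.region, SUM(o.amount) AS total_sales
    FROM sales_lakehouse.gold.orders o
    JOIN customer_lakehouse.gold.customers c
      ON o.customer_id = c.customer_id
    GROUP BY c.region
    """
)
print(cur.fetchall())
```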
Will I still need a Data Warehouse?
Maybe, but note it may take some time for a data team used to Databases/Data Warehouses and SQL to convert to a Data Lakehouse. Here at Oakland we feel it is still easier to set up and optimise cloud-native Warehouses like Snowflake and Google BigQuery than Databricks, as there are fewer moving parts.
At smaller data scales and lower data complexity, these maintenance costs can far outweigh the benefits of the Lakehouse.
Also, while we’ve seen first-hand that a Lakehouse can be the cheaper and more performant option compared with a Data Warehouse, this hasn’t been the case 100% of the time and you should do your own testing, as performance and cost depend heavily on the data you use and the environment you operate in.
Can I build a Lakehouse somewhere other than Databricks?
Yes, Delta Lake is open source and can be used in many different data compute products, which are listed below. However, Databricks has built special optimisations into its own platform, along with a robust user interface for managing the Lakehouse, so running Delta Lake elsewhere is likely to be slower and could be harder to maintain.
Example of the Databricks user interface for datasets, showing the schema and a sample of the dataset.
Also note that Databricks is a general compute engine rather than a database or programming interface: it can run SQL, Pandas, Ray, Spark and most of the popular data science libraries; handle graph analytics, geospatial, IoT and near-real-time streaming workloads; and import almost any Python, Java, R or Scala library. Databricks’ main benefit to us is its extreme versatility, potentially reducing costs by not having to maintain separate business intelligence and data science data processing applications.
Databricks is also in a strong position for customising Large Language Models (LLMs) like ChatGPT, with its general compute and strong MLflow integration, so you can pick the best open-source AI models and tune them with your organisational data in a highly efficient way using MLOps.
However, if you’re already using one of the Lakehouse alternatives listed below, it may not be worth adding Databricks to your Data Platform.
The alternatives to Databricks Lakehouse are:
- Build your own Lakehouse using open-source Delta Lake, Apache Hudi or Apache Iceberg (see the sketch after this list).
- Iceberg is used by Netflix and has support from a variety of major vendors.
- Hudi is also used by major enterprises, including Walmart.
- Starburst, like Databricks, is a cloud neutral and cloud native compute engine with a full suite of enterprise options and data connectors. It has Delta Lake and Iceberg connectors that can be fully controlled with a SQL API.
- Azure Synapse has the option to use its own Spark engine, can import Java and Python libraries, and has Delta Lake integration too. It has excellent integration with the rest of Azure.
- AWS Glue allows you to use Delta Lake in S3. It has excellent integration with the rest of AWS.
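Taking the first option above, a minimal sketch of a “build your own” Lakehouse with open-source Delta Lake on plain Apache Spark might look like the following; the package setup and paths are illustrative and depend on your Spark and Delta versions:

```python
# Requires the open-source package: pip install delta-spark
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("diy-lakehouse")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Write a small Delta table to local or cloud storage, then read it back.
spark.range(5).write.format("delta").mode("overwrite").save("/tmp/lakehouse/demo")
spark.read.format("delta").load("/tmp/lakehouse/demo").show()
```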
Some may say Pandas or DuckDB can be a Data Lakehouse, though from our research in May 2023 they cannot do transactions or merges on Data Lake table formats (Delta Lake, Iceberg, etc.), so they have been excluded from the list above – they still have their own use cases though.
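For example, lightweight libraries such as delta-rs could, at the time of our research, read a Delta table into pandas for analysis, but did not offer the transactional merge shown earlier. A small sketch with a hypothetical table path and columns:

```python
# Requires the delta-rs Python bindings: pip install deltalake pandas
from deltalake import DeltaTable

dt = DeltaTable("s3://my-lake-bucket/customers_delta/")  # hypothetical table path
df = dt.to_pandas()  # load the Delta table into a pandas DataFrame

# Analysis is straightforward; writing row-level, transactional merges back
# into the table was not supported by these engines at the time of writing.
print(df.groupby("region")["lifetime_value"].mean())
```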
Summary
In short, as with other data products and architectures, the answer is that it depends on the makeup of your data team, your security requirements, the size and structure of your data, and how the data is used, among many other factors.
If you are consuming a lot of data in your data platform, struggling to manage both a Data Lake and a Data Warehouse at the same time, or trying to figure out how to use advanced analytics like Machine Learning with your data, then the Data Lakehouse is, in our opinion, a convincing proposition.
We also find ourselves recommending Databricks more often than the alternatives, as it offers the most complete Lakehouse solution, though competitors are quickly catching up and offering an experience nearly as good as Databricks’, so the choice isn’t as easy to make as it was in 2021 when we first wrote this article.