
The Role of Databricks Architecture in Data Engineering

Since it was founded in 2013, Databricks has revolutionised enterprise data management and analytics. Built around Delta Lake, an open-source storage format, its suite of data engineering tools prides itself on processing enormous amounts of data and transforming it into datasets primed for exploration via machine learning (ML) models.

At Oakland, anything and everything data is at the core of the data and AI consultancy services we provide. When it comes to data engineering, Databricks is a central block in how we build advanced data platforms to provide actionable data insights for clients. 

In this article, we dive into the details of building a data platform with Databricks.

Let’s start.

The Role of Databricks in Data Engineering

Databricks plays a central role in modern data engineering by providing a scalable, high-performance platform. Built on Apache Spark, which can be dramatically faster than traditional SQL databases for large analytical workloads, it enables teams to ingest, transform, and process vast volumes of structured and unstructured data efficiently.

With features like Delta Lake for reliable data storage, and Photon for accelerated SQL performance, Databricks powers robust ETL (extract, transform, load) pipelines, real-time processing, and advanced analytics. Its unified workspace supports collaboration across data engineers, analysts, and data scientists, making it a key component of enterprise data platforms.

Databricks and ETL: A Match Made in Data Heaven

As a cloud-based platform, Databricks lends itself to ETL workflows – in fact, several of its tools and features have been specially designed with ETL pipelines in mind. So if slicker data extraction, transformation, and loading is important to your data activities, a platform engineered using Databricks could be a perfect fit. 

Some of the benefits (and the features that enable them) are listed below:

Easier ETL development

Thanks to Databricks Lakeflow Declarative Pipelines (previously Delta Live Tables), the operational complexities of ETL processes are automated. You define what should happen, not how, which reduces boilerplate code, speeds up development cycles, and lowers operational overhead. Pipelines can be written in either SQL or PySpark, for extra flexibility.
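As a rough illustration, a declarative pipeline written in PySpark might look something like the sketch below. It assumes a Databricks Lakeflow / Delta Live Tables environment (the `dlt` module and the `spark` session are supplied by the platform); the table names and source path are hypothetical.

```python
import dlt  # provided by the Databricks pipeline runtime

@dlt.table(comment="Raw orders ingested incrementally from cloud storage")
def raw_orders():
    # Auto Loader picks up new files from the landing zone as they arrive
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/Volumes/demo/landing/orders"))  # hypothetical path

@dlt.table(comment="Cleaned orders ready for analytics")
def clean_orders():
    # You declare *what* the table should contain; the platform works out
    # dependency order, scheduling, and incremental execution
    return (dlt.read_stream("raw_orders")
            .where("order_id IS NOT NULL")
            .selectExpr("order_id", "CAST(amount AS DOUBLE) AS amount"))
```

Notice there is no orchestration code here: each function simply declares a target table and the query that defines it, and the pipeline engine handles the rest.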

Streamlined workflows

ETL tasks, analytics, and machine learning pipelines are all orchestrated in Databricks Lakeflow Jobs (previously known as Databricks Workflows).

More focus on data quality

Thanks to features like Lakeflow Declarative Pipelines and automated data quality (DQ) testing, Databricks reduces the need for engineers to manage pipeline infrastructure or check DQ, freeing up time to deliver high-quality data.
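For a flavour of how automated DQ testing looks in practice, expectations can be attached directly to a pipeline table. This sketch assumes the same Lakeflow / Delta Live Tables environment as above; the table, rule names, and columns are hypothetical.

```python
import dlt  # provided by the Databricks pipeline runtime

@dlt.table(comment="Orders that pass basic quality rules")
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")  # drop failing rows
@dlt.expect("positive_amount", "amount > 0")  # log violations but keep the rows
def trusted_orders():
    return dlt.read("clean_orders")
```

Each expectation is evaluated on every update, and pass/fail metrics are recorded automatically, so engineers see data quality trends without writing separate test harnesses.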

What is Azure Databricks?

Given its power, it was only a matter of time before Microsoft got on board. In 2017, Microsoft became a first-party provider of Databricks’s cloud-based platform, integrating it with its own Azure cloud services. The result? Azure Databricks, an open analytics platform that allows you to build, deploy, share, and maintain enterprise-grade data, analytics, and AI solutions at scale.

Naturally, as a Microsoft Partner awarded the Analytics on Microsoft Azure specialisation, we were super excited about the integration! Azure Databricks is another building block in our data engineering toolkit, allowing us to engineer data platforms at enterprise scale. Not to mention the immense potential it’s opening up for our customers and their data assets.

Our blog, ‘How to create a secure Azure data platform’, looks at Azure services in more detail.

“Our collaboration with Microsoft builds on our momentum as a leading cloud platform for Apache Spark-based analytics. The ability to provide our Unified Analytics Platform to all Microsoft Azure users in such an integrated fashion is invaluable to end users looking to simplify big data and AI.” 

Ali Ghodsi, Co-Founder and CEO of Databricks

Databricks Use Cases

With this in mind, let’s cut to three recent projects where we used Databricks as part of our advanced data engineering service.

1. Building a sustainable, long-term data platform for Network Rail

Data platform engineering is an investment, so you need to make sure the technology is set up for future success. Databricks enables an open approach, reducing the vendor lock-in that comes with a more traditional platform vendor. Something our client, Network Rail, knew all too well.

Like many other large organisations with legacy data platforms, Network Rail was struggling to access data, which made extending the capabilities of their datasets difficult. Using Databricks, we built an open data platform architecture for the rail services provider.

Yet that’s just the start – click here to read the full case study.

2. Informing sales strategies for a leading provider of IT infrastructure

Sales team struggling to extract data from multiple sources? We recognise the challenge, and it’s one we helped a leading provider of IT infrastructure overcome. 

After we designed the IT service firm’s new data analytics platform, we leveraged the Databricks stack to build a machine learning and data science model. Their sales team now has access to far richer insights, driving better margins for the overall business.

3. Driving an ROI increase of £150m+ for Yorkshire Water

As part of an overall data transformation programme, we developed a new data platform for Yorkshire Water. Databricks was chosen as the enterprise data architecture for the utilities company and was pivotal to the design and build of its new, strategic data platform.

In total, ROI from the overall business transformation has exceeded £150m.

What are the Pros and Cons of Databricks?

It’s fair to say our Databricks and Azure Databricks use cases and results speak for themselves. However, it’s important to weigh up the pros and cons of any data architecture to make sure you’re choosing the best fit for your business needs. We’ve outlined some of the major pros and cons of Databricks below to help you judge whether it’s right for you.

Pros

Advanced data governance capabilities, such as data lineage, roles, and permissions, thanks to the built-in Unity Catalog

Ease of scaling and maintenance

One unified platform for batch, streaming, ML, AI, and analytics

Native integration with all major cloud platforms and PaaS, plus native DevOps and Git support

Eliminates data silos by using Data Lakehouse architecture

Provides a collaborative approach to data warehousing

The ETL process is built in and in one place, removing the need for a separate tool

Features are continuously updated and added 


Cons

Cost, especially at scale

Higher barrier to entry for non-developers

More suitable for bigger datasets

Constantly evolving product, so you need the time and resources to keep up with the changes

Data Engineering Advice

For more advice on Databricks and data engineering, or to speak to us about your data platform needs, please get in touch with our friendly team. That’s what makes us Oakland, everything data.