Taming your data assets with Databricks

Safely, securely, and efficiently handling data at any scale is challenging. Here at Oakland, we’ve had years of experience helping complex organisations tame their vast data assets to draw meaningful insights from them. That experience, together with the fact that we are passionately tech-agnostic, enables us to recommend the right tool for the job.

One of the most impressive tools in our kit bag is Databricks, which has suited our needs in the data landscape for three main reasons: power, flexibility, and a low barrier to entry. What Databricks is has been covered extensively elsewhere (see https://hevodata.com/learn/what-is-databricks/ and https://medium.com/codex/what-is-databricks-and-how-can-it-be-used-for-business-intelligence-6ac62cac198a), but a significantly more interesting question is why it has been so widely adopted.

A History Lesson

Before 2000, the dominant form of data storage was the Relational Database Management System (RDBMS), which was generally created and queried with SQL. However, post-millennium, the huge rise in web traffic created a vast quantity of semi-structured and unstructured data. Everything could be recorded: clicks, user patterns, tweets, purchasing patterns, sound, and video. But that was just the beginning. With the rise of IoT devices and the proliferation of cheap telemetry, this paradigm of collecting a large quantity of dissimilarly structured data is not abating, and it is one of the most significant problems we help our clients with here at Oakland.

We have increasingly seen enterprises turn to the Data Lake to store this data. Unlike an RDBMS, where data is stored neatly in related tables, a Data Lake is effectively an open storage medium, either on-premises or in the cloud. An organisation can pour its data into this Lake to be retrieved, structured, and analysed later. The first and most apparent problem is organisation: the early Data Lakes did not have folder structures. Even now, the folders supported in AWS’s S3 buckets are a naming convention only. A more pressing problem, though, was one of scale.
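To make the point about folders being only a naming convention concrete, here is a minimal Python sketch of a flat object store. It is a toy model, not the S3 API: the `objects` dictionary, the example keys, and the `list_keys` and `list_folders` helpers are all hypothetical names chosen for illustration.

```python
# Toy model of a flat object store: keys are plain strings, and "folders"
# exist only because some keys happen to share a prefix.
# All names and keys here are hypothetical, for illustration only.

objects = {
    "raw/clicks/2023-01-01.json": b"{}",
    "raw/clicks/2023-01-02.json": b"{}",
    "raw/telemetry/device-a.csv": b"",
    "curated/sales.parquet": b"",
}


def list_keys(store, prefix=""):
    """Return the sorted keys that start with the given prefix."""
    return sorted(key for key in store if key.startswith(prefix))


def list_folders(store, prefix="", delimiter="/"):
    """Derive the apparent sub-folders under a prefix from key names alone."""
    folders = set()
    for key in store:
        if key.startswith(prefix):
            rest = key[len(prefix):]
            if delimiter in rest:
                folders.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
    return sorted(folders)


clicks = list_keys(objects, "raw/clicks/")
top_level = list_folders(objects)
raw_subfolders = list_folders(objects, "raw/")
```

Nothing in the store itself knows about `raw/` or `curated/`. The apparent hierarchy is computed from the key names each time it is requested, which is why organising a Lake relies on disciplined naming.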

With growing quantities of data, analytical operations become increasingly difficult. It becomes impossible to load all the data into a single computer’s memory at once. Even if that were possible, the time taken to perform even simple aggregations or analysis would be unacceptably long. Parallel operations are required, where the data is broken up into several chunks and operated on by a cluster of processors, all of which are orchestrated by a central driving processor.
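The chunk-and-combine pattern described above can be sketched in a few lines of Python. This is an illustration only and not how Spark is implemented: a thread pool on one machine stands in for the cluster, and the function names (`split_into_chunks`, `partial_sum_and_count`, `parallel_mean`) are hypothetical.

```python
# A minimal sketch of split-apply-combine: a "driver" breaks the data into
# chunks, hands each chunk to a worker, then merges the partial results.
# ThreadPoolExecutor stands in for a real cluster; this is an assumption
# made so that the example runs on a single machine.
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(values, n_chunks):
    """Split a list into n_chunks roughly equal, contiguous pieces."""
    size, remainder = divmod(len(values), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < remainder else 0)
        chunks.append(values[start:end])
        start = end
    return chunks


def partial_sum_and_count(chunk):
    """Work done independently by each worker on its own chunk."""
    return sum(chunk), len(chunk)


def parallel_mean(values, n_workers=4):
    """Driver: distribute the chunks, then combine the partial results."""
    chunks = split_into_chunks(values, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_sum_and_count, chunks))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


readings = list(range(1, 1001))  # stand-in for a large dataset
mean_reading = parallel_mean(readings)
```

The key design point is that each worker returns a small partial result (a sum and a count) rather than its raw data, so the driver only ever has to combine a handful of numbers.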

The Power and The Storage

To offer anything to the Big Data marketplace, Databricks had to leverage the power of parallel computing. Databricks sits on top of Spark, Apache’s open-source engine for Big Data analysis. More than that, their CTO was the creator of Spark, and Databricks remain significant contributors to the codebase. Databricks also have some closed-source optimisations which can only be accessed through the product itself. With those improvements, Databricks claim processing speeds several times faster than open-source Spark alone.

To properly leverage Spark, Databricks runs on a cluster of compute resources. The amount of memory and number of CPUs the cluster has are configurable, with a balance to be struck between the speed of queries and the cost of maintaining the cluster. Alongside this, Databricks offer several different runtimes, which are the sets of core components running on the compute resources. Runtimes are picked depending on the use case, whether general purpose, ML-specific, or tailored to intensive SQL queries. Importantly though, leveraging this power is easy for the users of the platform. After the initial configuration, data professionals can simply get on with their work.

At Oakland, one of the things we think makes us different is that we pride ourselves on doing work that sets clients up for success in the future. The ease with which Databricks’ power can be stood up and maintained is a key factor in why we use it and recommend it to our clients.

Alongside the processing power to run analytical workloads, Databricks workspaces also include managed storage. In particular, their ‘Delta Lake’ paradigm keeps much of the flexibility of a Data Lake while offering many of the advantages of a traditional relational database. A Delta Lake table is built on Parquet files, which are accompanied by JSON metadata recording all changes (deltas) made to the table. As a result, ACID transactions are possible, and there’s a clear governance record for each piece of data.
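The idea of data files accompanied by a JSON record of changes can be illustrated with a toy transaction log. The sketch below is a deliberately simplified model and not the real Delta Lake log format: the `commit` and `files_at` helpers, the field names, and the file names are all hypothetical.

```python
# Toy model of a transaction log over immutable data files.
# Each commit is a JSON document listing files added and removed; the table's
# current state is found by replaying commits in order. This is a simplified
# illustration, not the real Delta Lake log format.
import json

log = []  # ordered list of JSON strings, one per committed transaction


def commit(added=(), removed=(), operation="WRITE"):
    """Append one atomic commit recording which files were added/removed."""
    entry = {
        "version": len(log),
        "operation": operation,
        "add": list(added),
        "remove": list(removed),
    }
    log.append(json.dumps(entry))
    return entry["version"]


def files_at(version=None):
    """Replay the log up to a version to get the live set of data files."""
    live = set()
    upto = len(log) if version is None else version + 1
    for raw in log[:upto]:
        entry = json.loads(raw)
        live -= set(entry["remove"])
        live |= set(entry["add"])
    return sorted(live)


commit(added=["part-000.parquet", "part-001.parquet"])
# An update rewrites one file: the old file is removed and a new one added
# in the same commit, so readers see either the old state or the new one.
commit(added=["part-002.parquet"], removed=["part-001.parquet"], operation="UPDATE")

current = files_at()
original = files_at(version=0)
history = [json.loads(raw)["operation"] for raw in log]
```

Because a change only becomes visible once its commit is appended, readers never see a half-finished update, and the ordered log doubles as an audit trail of what happened to the data and when.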

Low Barrier To Entry

Another factor in our choice to implement Databricks is often its incredibly low barrier to entry for developers working on the platform. The central component of the workspace, and the one most analysts or developers in any organisation will interact with, is the notebook. It is heavily inspired by the Jupyter notebook, so developing inside it should be familiar to most coding data professionals. There’s also flexibility in the language of development. Python, R, Scala, and SQL are all first-class languages in Databricks, though the Spark bindings of Python and Scala make them arguably the most powerful of the four. This familiarity with both environment and language speeds up both onboarding and development time, adding weight to the choice of Databricks.

We have also found Databricks to have a low barrier to entry from the infrastructure and set-up side. While at Oakland we have created and maintained complex Databricks platforms, it is also possible to create them graphically from a cloud portal – or with a small amount of Infrastructure-As-Code.

Flexibility

The final reason for our clients to choose Databricks is the platform’s flexibility. At the end of the day, the Databricks notebook is effectively just some arbitrary code running on a Linux machine, so it can, in theory, contain any process imaginable. Orchestrating and automating the running of whole notebooks with Databricks Jobs is a potent tool, and one we have taken advantage of multiple times here at Oakland for our clients. With notebook automation, Databricks can stand in as nearly any part of an enterprise’s ETL process.

We have used Databricks as an ingestion engine to drive the aggregation of many disparate data sources into one Data Lake. This leverages the power of Databricks Jobs, running a notebook to pull data from many sources every few minutes. We have also found Databricks helpful on the analytical side when clients have asked us to use notebooks to create custom aggregations and transformations to feed BI dashboards. It has also been possible to use the notebooks themselves as dashboards with widget visualisation functions, even as a quick way to show clients some initial exploration of their data.

With a global recession on the horizon, organisations will be looking to derive valuable insights from an ever-increasing pool of data, quickly and democratically. Here at Oakland, we are seeing increased use of tools like Databricks for the reasons we began this blog with: power, flexibility, and a reasonably low barrier to entry. If you’d like to speak to one of our tech team about how Oakland has successfully implemented Databricks, please email hello@theoaklandgroup.co.uk

Mike Le Galloudec is a Data Engineer at The Oakland Group