Oakland

Prefect: Should you utilise the next generation of data pipelining software?

What is data pipelining, and why does it matter?

Data pipelining, or orchestration, is an everyday activity in which companies move their data from one place, typically the primary storage location, to another, such as a cloud-based data lake, often transforming it along the way. Whilst this is a standard operational activity, the number of pipelines run by a single company has increased as data is used more widely, mainly for reporting, analytics, and machine learning. This is because these tasks are easiest when previously disparate data sources are available in a clean format, all in one place. As such, while the noise around ML and AI grows, the importance of the data pipelines that feed these processes, as well as normal business operations, also increases.
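
To make the idea concrete, here is a minimal sketch of a pipeline that extracts rows from a source, cleans them, and loads them into a destination. It uses only the Python standard library. The sample data, the table name and the function names are all made up for illustration, and an in-memory SQLite database stands in for the data lake.

```python
import csv
import io
import sqlite3

# Hypothetical export from an operational system (illustrative data only).
RAW_CSV = """customer_id,name,spend
1, Alice ,120.50
2,BOB,
3,carol,87.25
"""


def extract(raw):
    """Read rows from the source, here a CSV string."""
    return list(csv.DictReader(io.StringIO(raw)))


def transform(rows):
    """Tidy up names and drop rows that have no spend value."""
    cleaned = []
    for row in rows:
        if not row["spend"].strip():
            continue  # incomplete record, skip it
        cleaned.append(
            (int(row["customer_id"]), row["name"].strip().title(), float(row["spend"]))
        )
    return cleaned


def load(rows, conn):
    """Write the cleaned rows to the destination table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customers (id INTEGER, name TEXT, spend REAL)"
    )
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)
    conn.commit()


def run_pipeline(raw, conn):
    """Run extract, transform and load in order; return the number of rows loaded."""
    rows = transform(extract(raw))
    load(rows, conn)
    return len(rows)


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")  # stand-in for a real destination
    print(run_pipeline(RAW_CSV, connection), "rows loaded")
```

Real pipelines add scheduling, retries and monitoring around these three steps, which is where orchestration tooling comes in.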

So surely this problem has been solved?

Given that moving data around is such a common activity, you might assume that there is an agreed set of tools and standards. At Oakland we’ve learnt never to assume: while the formats in which you move data around are now more or less settled, the tooling to do so is constantly changing, largely driven by the increased usage of cloud services and the move towards “big data”. Common tools these days include Apache Airflow, Azure Data Factory, AWS Glue, Talend, and Informatica, with open-source, bring-your-own-resource, and managed-service mindsets all catered for. As you would expect, all these tools have things for and against them, and as with every technology solution there is no one size fits all. However, many of these tools share a common set of issues that we believe are overcome by a more recently developed tool, Prefect.

What is Prefect?

Prefect is an open-source (free to use and modify) data orchestration library, which is also available as a paid platform. You would use it in place of the tools mentioned in the previous section (AWS Glue, Azure Data Factory, Talend, etc.), which co-ordinate both the movement and transformation of data, often running tasks in parallel where possible.

It can orchestrate batch movements of large amounts of data, including transformation and cleaning operations, running tasks in parallel if required.
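
As a rough illustration of what running tasks in parallel means in an orchestrated batch job, the sketch below cleans several independent sources at the same time and then runs a final step that depends on all of them. It is deliberately library-agnostic and uses only the Python standard library; it does not show Prefect's own API, and the source names, sample records and function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical source tables; in practice each would be read from a database or API.
SOURCES = {
    "orders": [" a1 ", "a2", "", "a3"],
    "customers": ["c1", " c2"],
    "products": ["p1", "", "p3 "],
}


def extract(name):
    """Fetch the raw records for one source."""
    return SOURCES[name]


def clean(records):
    """Strip whitespace and drop empty records."""
    return [r.strip() for r in records if r.strip()]


def process_source(name):
    """Extract and clean a single source; independent of every other source."""
    return name, clean(extract(name))


def run_flow(source_names):
    """Process all sources in parallel, then summarise once every one has finished."""
    with ThreadPoolExecutor(max_workers=3) as pool:
        results = dict(pool.map(process_source, source_names))
    # Downstream step: needs the output of every upstream task.
    return {name: len(rows) for name, rows in results.items()}


if __name__ == "__main__":
    print(run_flow(["orders", "customers", "products"]))
```

An orchestration tool takes on the parts this sketch leaves out, such as working out which tasks can safely run side by side, retrying failures, and recording what ran and when.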

So why should you use it?

A mature market like data orchestration software requires something special to move a company from its current tooling. However, we believe there are several core features that showcase the benefit of using Prefect:

Usability:

Efficiency:

Scalability:

Flexibility:

Maintainability:

And why shouldn’t you?

Prefect won’t be for everyone, and we wouldn’t recommend it to data teams in the following scenarios:

Streaming:

No code solution:

Very simple operations:

In conclusion…

Prefect is a great new tool that can help consolidate data pipelining and orchestration activities, using a very common language, Python, and with the ability to scale out efficiently. This blog only covers a fraction of Prefect’s features. For more complex examples of how we’ve integrated Prefect into our projects, please get in touch by emailing hello@theoaklandgroup.co.uk or calling 0113 234 1944.

Jake Watson is a Senior Data Engineer