Data is nothing new. Long before the ‘Big Data’ explosion, businesses were producing and storing their data. The problem for those with large enterprise systems still in place is extracting that data so they can get the maximum benefit from it today. It can be a painful process.
Having recently been set that challenge by a client, our Lead Data Scientist, Alex Young, shares the approach he took and explains the decisions he faced along the way.
Legacy systems and how we got to where we are today
In the 1990s and early 2000s, companies woke up to the potential of data and began building their own data warehouses. By the early 2010s things had got more serious, and they started looking for ways to get all that data in one place and make it accessible.
The solutions that were popular at the time were great for searching and retrieving records, but they had their limitations when the data grew to the tens or hundreds of gigabytes. They simply weren't designed to carry out massive aggregation queries, which made large-scale data analytics challenging.
For businesses with data still tied to this type of system, that’s a problem that needs resolving.
The obstacles to unlocking the data
Firstly, these systems can be tricky to integrate with newer technology, so you need to find creative solutions that let you work with the best tools on the market.
Toolsets can be locked down and data encrypted with proprietary encoding, which is what makes my job so interesting: finding new ways to bring the data outside the Oracle stack. Aggregation mappings and exporting the data are also tough, and scaling the database can be costly, because scale means more servers and the cost of additional licences soon adds up.
There's also the fact that moving from an on-premises solution to a cloud-based one is a big change for any organisation. If you work on similar projects, I'd suggest doing what I do and spending some time with the client's IT department and suppliers to get them up to speed. It's appreciated, as it's often the case that no one on their team has been through the process before.
What we did and why
My job is to help extract the data sitting in enterprise systems and replicate it in the cloud where it can be utilised more effectively. The first step to this is always to listen to the client’s specific needs and objectives and understand exactly what data they have.
I always stress upfront that there's no silver bullet: no single solution can do everything we need. The truth is that you have to be pragmatic in choosing a mix of components that are right for each individual job.
For example, on the most recent project I worked on, the solution we delivered combined the capabilities of AWS, Databricks, Redshift and Power BI. Databricks was chosen because it not only works well with AWS but also gave us the perfect platform to explore the data. Power BI was chosen because the client was already familiar with it.
It's also ideal for businesses that use Office 365, as it comes with the drivers needed to connect to their data sources and integrates neatly with their existing systems. It has great visualisation and aggregation capabilities too. And we chose Redshift because it, too, works well with AWS and is great for analytic reporting. This gave us the flexibility to serve the data to the client in the way they wanted it.
In terms of the actual process, we first securely extracted the data from Oracle. This was done by batch processing in Spark and then dumping it into an S3 bucket, which is affordable and accessible for the whole development team. We then pre-processed and cleaned the data, which was a task in itself, as much of it had been entered without proper care for its quality. We then served the data back to Power BI.
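To make that step concrete, here's a minimal PySpark sketch of the kind of extract-clean-dump batch described above. The Oracle host, table, columns and S3 bucket are all hypothetical placeholders, and real connection handling, credentials and cleaning rules would be more involved.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("oracle-extract").getOrCreate()

# Batch-read a table from Oracle over JDBC, partitioned so Spark can pull it in parallel
orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//oracle-host:1521/ORCLPDB")  # hypothetical host
    .option("dbtable", "SALES.ORDERS")                              # hypothetical table
    .option("user", "etl_user")
    .option("password", "********")
    .option("partitionColumn", "ORDER_ID")
    .option("lowerBound", "1")
    .option("upperBound", "10000000")
    .option("numPartitions", "16")
    .load()
)

# Basic pre-processing: normalise column names, trim strings, drop obvious junk rows
cleaned = (
    orders.select([F.col(c).alias(c.lower()) for c in orders.columns])
    .withColumn("customer_name", F.trim(F.col("customer_name")))  # hypothetical column
    .dropna(subset=["order_id"])
    .dropDuplicates(["order_id"])
)

# Dump the cleaned batch into S3 as Parquet, where the whole development team can reach it
cleaned.write.mode("overwrite").parquet("s3://example-data-lake/clean/orders/")
```

From there the cleaned Parquet files can be loaded into the warehouse or connected to directly from the reporting layer.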
There's also an AWS Oracle migration tool that works well with the Redshift data warehouse product, which you might want to consider.
The question of which data warehouse to use is a tricky one. For essential data that's queried every day, I suggest using the more expensive Redshift, which we did on this project. For less frequently used data, I recommend making it available in S3 with a Hive schema on top, so you can query it from a Spark engine, which will pull out the aggregations and serve the data back.
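To illustrate the colder-data option, here's a rough sketch of registering the Parquet files in S3 as an external Hive table and aggregating them through Spark SQL. The database, table, columns and bucket are carried over from the hypothetical extract example above; the path scheme and metastore setup vary by platform.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("cold-data-queries")
    .enableHiveSupport()  # use the Hive metastore for table definitions
    .getOrCreate()
)

# Register the Parquet files sitting in S3 as an external Hive table
spark.sql("CREATE DATABASE IF NOT EXISTS analytics")
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS analytics.orders (
        order_id      BIGINT,
        customer_name STRING,
        order_date    DATE,
        amount        DECIMAL(18, 2)
    )
    STORED AS PARQUET
    LOCATION 's3://example-data-lake/clean/orders/'
""")

# The aggregation runs in the Spark engine and only the summary is served back
monthly_totals = spark.sql("""
    SELECT date_trunc('month', order_date) AS month,
           SUM(amount)                     AS total_sales
    FROM analytics.orders
    GROUP BY date_trunc('month', order_date)
    ORDER BY month
""")
monthly_totals.show()
```

The trade-off is simply cost against query latency: the warehouse answers everyday questions quickly, while the S3-plus-Hive route keeps rarely touched data cheap but still queryable.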
The value that the Oakland team can bring
A lot of businesses have put the management of their data in the hands of external consultants. I enjoy removing the barriers to employees using the data for themselves.
We bring speed and transparency to the process. With our help, companies can integrate, access, interrogate and understand their own data in an instant rather than having to rely on a monthly report generated for them that only provides a fraction of the true picture.
It’s also satisfying to help businesses finally use their data in a way that delivers tangible insights and real operational improvements.
More about Alex
Alex is a Lead Data Scientist at The Oakland Group, and he's one of the people we look to for the smart solutions.