
A data scientist's view of Big Data LDN 2019

Anyone who's read The Hitchhiker's Guide to the Galaxy, or seen the film, will know the seminal quote: "Space is big. Really big. You just won't believe how vastly, hugely, mind-bogglingly big it is." Well, apparently space has a new competitor for sheer size, and that's data!

Every year a mass of people descends on Olympia London for Big Data LDN (BDL). This year the Oakland team arrived en masse to find out about the latest developments in the data and analytics world, and to chat to attendees to understand what issues they were facing and how we might be able to help. Below I've summarised a few thoughts from the event, based on the talks, the exhibitors and those conversations!

1: Data quality and access are common blockers in getting the most from data!
The main reason for being at BDL was to chat with potential clients to understand what data-centric issues they were having, from overall strategy down to data quality. This allows Oakland to really get under the skin of our audience and determine what we can deliver to help solve those issues.

I found that the most common issues were around accessing data easily and ensuring data quality. These problems seem largely to be caused by the isolation of data systems from the people trained to use the data (i.e. data held by IT, whilst analysts sit in other functions), or by poor ETL processes. As such, there may be some hope for these companies in the broader push from on-premise to cloud infrastructure, which allows permissions and processes to be reset for a more data-centric organisation!
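As a concrete illustration, the kind of check a sound ETL process should run before handing data to analysts can be very lightweight. Below is a minimal sketch in pandas; the file name, key column and thresholds are all hypothetical.

```python
import pandas as pd

# Hypothetical extract landed by an upstream system.
df = pd.read_csv("customers.csv")

# Share of missing values per column, and count of duplicate keys.
missing = df.isna().mean()
duplicates = df.duplicated(subset="customer_id").sum()

# Fail fast rather than silently passing bad data downstream.
assert missing.max() < 0.05, f"Columns with too many missing values:\n{missing[missing >= 0.05]}"
assert duplicates == 0, f"{duplicates} duplicate customer_id rows found"
```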

2: There appears to be a trend in code-free solutions?!
Of the over 100 vendors on show at BDL, a notable number were offering code-free tools (such as Alteryx or StreamSets), which allow companies to improve their outputs without hiring the more technically skilled staff usually required for dealing with complex data.

I found this surprising, as our experience with these tools has been disappointing. But I can see the merit for companies trying to rapidly upskill their use of data: moving away from an Excel-based solution without jumping into the deeper lake of full-on programming, allowing things to stay somewhat simpler!

My concern around these tools is the ability to optimise processes and run production-scale jobs whilst being locked out of the underlying code. Some companies may find themselves building large processes within these tools, only to have to translate them into code later.
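To make that translation point concrete, here is a hypothetical example of what a typical drag-and-drop workflow (filter, join, aggregate) becomes once rewritten in pandas; the files and columns are made up.

```python
import pandas as pd

# Hypothetical inputs a code-free tool would wire together visually.
orders = pd.read_csv("orders.csv")        # order_id, customer_id, amount, status
customers = pd.read_csv("customers.csv")  # customer_id, region

# Filter -> join -> aggregate: three of the most common visual steps.
completed = orders[orders["status"] == "completed"]
joined = completed.merge(customers, on="customer_id", how="left")
summary = joined.groupby("region", as_index=False)["amount"].sum()

summary.to_csv("revenue_by_region.csv", index=False)
```

Once a workflow lives in code it can be versioned, tested and optimised, which is exactly the flexibility the visual tools trade away.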

3: Managed services are on the rise!
A couple of years ago AWS brought out SageMaker, a product that amalgamated a variety of their offerings into one. SageMaker was designed to give more power to data scientists by taking away some of the pain of developing cloud infrastructure, allowing them to focus on analysis and modelling. It did this by providing an analysis interface which could easily connect to AWS data sources and be configured to automatically deploy a model endpoint in order to serve a production-level model.
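As a rough sketch of that workflow using the SageMaker Python SDK (the training script, S3 path, IAM role and instance types below are placeholders, and parameter names differ slightly between SDK versions):

```python
from sagemaker.sklearn.estimator import SKLearn

# Placeholder IAM role; in practice this comes from your AWS account.
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"

# Train a scikit-learn model defined in a user-supplied script.
estimator = SKLearn(
    entry_point="train.py",      # hypothetical training script
    role=role,
    instance_type="ml.m5.large",
    framework_version="0.23-1",
)
estimator.fit({"train": "s3://my-bucket/train/"})

# One call stands up a managed HTTPS endpoint serving the model.
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.large")
```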

A key thing I noticed at this year's BDL, through both walking around and attending talks, is that these types of service appear to be on the rise. This was clear from the strong presence of Databricks, who provide a managed platform that makes the Spark big data infrastructure easier to use. In addition, several smaller companies (such as Algorithmia and BDB) were offering to reduce the DevOps and architecture burden on data scientists, allowing them to focus on model development.
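The appeal is that on a managed platform the analysis code stays as simple as it would be locally, while the cluster is provisioned behind the scenes. A minimal PySpark sketch, with a hypothetical data path:

```python
from pyspark.sql import SparkSession, functions as F

# Managed platforms typically hand you a configured session;
# locally you build one yourself like this.
spark = SparkSession.builder.appName("sales-summary").getOrCreate()

# Hypothetical input: Parquet files of sales events.
sales = spark.read.parquet("/data/sales/")

# The same dataframe code runs unchanged on a laptop or a large cluster.
summary = sales.groupBy("region").agg(F.sum("amount").alias("total_amount"))
summary.show()
```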

The most obvious reminder of this rise in managed services came at a talk launching Microsoft Azure's new product, Synapse. Synapse is set up in a similar way to SageMaker, allowing a data scientist to quickly develop models connected to various data sources. An additional feature, however, is an extended SQL engine that can more directly query the file formats typically used for model outputs (such as Parquet). As such, analysts relying on the outputs of those models will be able to access the underlying predictions faster.
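To illustrate, here is a hedged sketch of how an analyst might read such predictions straight from Parquet files via the serverless SQL endpoint; the connection details, storage path and column names are all placeholders, and this assumes Synapse's OPENROWSET support for Parquet.

```python
import pyodbc

# Placeholder connection to a Synapse serverless SQL endpoint.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myworkspace-ondemand.sql.azuresynapse.net;"
    "DATABASE=master;UID=analyst;PWD=<password>"
)

# Query model-output Parquet files in place, with no separate load step.
sql = """
SELECT TOP 10 customer_id, prediction
FROM OPENROWSET(
    BULK 'https://myaccount.dfs.core.windows.net/models/predictions/*.parquet',
    FORMAT = 'PARQUET'
) AS predictions;
"""

cursor = conn.cursor()
for row in cursor.execute(sql):
    print(row.customer_id, row.prediction)
```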

Overall, I feel this is a step in the right direction if used properly (i.e. don't just throw models out there), and whilst these services come at a price, they do allow those with more analytics than infrastructure experience to get up and running until they feel they can fly on their own!

4: Everyone loves AI…

Interestingly, this year's BDL fell only a week after a major controversy within the data community: the discovery of a highly gender-biased AI used to determine credit limits for Apple's new credit card (https://observer.com/2019/11/goldman-sachs-bias-detection-apple-card/). It was therefore interesting to contemplate how the mass of AI-based companies would deal with the expected questions from potential customers about how their products handle bias.
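Even a crude first-pass check is easy to run; as a purely illustrative sketch (made-up file and column names, and not a claim about how any vendor's product works):

```python
import pandas as pd

# Hypothetical decision log: one row per credit application.
decisions = pd.read_csv("credit_decisions.csv")  # columns: gender, credit_limit

# Compare average outcomes across groups as a crude disparity check.
by_group = decisions.groupby("gender")["credit_limit"].mean()
ratio = by_group.min() / by_group.max()

print(by_group)
print(f"Worst-off group receives {ratio:.0%} of the best-off group's average limit")
```

A real fairness audit goes much further, controlling for legitimate factors and comparing error rates, but vendors should expect at least this level of scrutiny from customers.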

Aside from this, my main thoughts were about the sheer number of companies touting AI as the solution, fully cementing the data world in the AI hype. My personal feeling, based on my experience dealing with messy, incomplete data over the last few years, is that these types of solution are often unprepared for the state of a company's data. This is especially true for older or larger organisations that have gone through multiple process changes, altering how data is collected, maintained and represented.
As such, whilst I do feel the rise of useful AI is inevitable as companies' use of data matures, for now the better approach is to gain a greater understanding of their current position and properly plan how to achieve their analytic goals.

Dr Richard Louden is a data scientist with The Oakland Group