the data platform

Every self-respecting company has a data lake. And for good reason! Data is recognized as an invaluable asset for boosting your company and products. Data products have proven to be of key importance in a rapidly changing digital world. These products range from BI reports providing important business insights to real-time recommendation algorithms that automate intelligent decisions. However, scaling the process of building these kinds of products turns out to be hard. This story covers a few lessons learned from our data lake infrastructure and gives a high-level overview of the DataHub we are currently building.

Randstad Group has seen several variants of data lakes over the last years. The first real data lake was built on the Hortonworks platform. Hortonworks is essentially a predefined set of tools that together form a data lake / data management solution. We chose to host and operate it ourselves in our own AWS environment. The data engineering team was responsible for getting data in, scheduling ETL jobs and algorithms, exposing results to consumers, and also for operating the Hortonworks components themselves.

Let's highlight two problems that we encountered:

  1. The environment was very complex. Achieving a sufficient level of knowledge and keeping up to date on the many components was a challenge. This led to single points of failure: often only one or two people knew how to work with a specific component.
  2. Since working with the platform was such a specialist affair, all actions in the production environment had to be performed by the data engineering team. This created a bottleneck that prevented fast development of new products.

The next generation was a complete rebuild of the data lake from scratch, this time using AWS managed services wherever possible. We put all the data in Redshift, had ingest and unload areas in S3, and so on; scheduling was done with Airflow. The data engineering team was still responsible for building and operating the platform, but this was easier to manage. The team would also ingest data from the various systems of record and would schedule exports and the occasional transformation.
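To make that second-generation flow concrete, here is a minimal sketch of the kind of load step Airflow would schedule: files land in an S3 ingest area and are loaded into Redshift with a COPY statement. The table, bucket, and role names below are hypothetical, and the real jobs were of course more involved.

```python
def build_copy_statement(table: str, bucket: str, prefix: str, iam_role: str) -> str:
    """Build a Redshift COPY statement loading CSV files from an S3 ingest area."""
    return (
        f"COPY {table} "
        f"FROM 's3://{bucket}/{prefix}' "
        f"IAM_ROLE '{iam_role}' "
        "FORMAT AS CSV IGNOREHEADER 1;"
    )

# An Airflow task would render and execute a statement like this on a schedule.
sql = build_copy_statement(
    table="staging.placements",           # hypothetical target table
    bucket="datalake-ingest",             # hypothetical S3 ingest bucket
    prefix="placements/2019-01-01/",
    iam_role="arn:aws:iam::123456789012:role/redshift-copy",
)
print(sql)
```

The point is not the SQL itself, but that every such step ran through the data engineering team's pipelines rather than being something a producer could set up on their own.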

However, it is interesting to highlight again two problems:

  1. Working with the platform was still a fairly specialist affair: all actions in the production environment had to be performed by the data engineering team. This remained a bottleneck that prevented fast development of new products.
  2. Managing the ingestion and export of data for all the producers and consumers was still time-consuming, and the work rested fully on the shoulders of data engineering. This also reduced the involvement of data producers and consumers to the point where producers, especially, would no longer know that the data lake depended on their systems.

It's not a tool, it's a platform
The big change we are making is to view data infrastructure and services as a product for Randstad Group the Netherlands. If we want data-consuming teams to build data products quickly, we need to bring these consumers together with the producers of that data. Furthermore, the fact that data from a certain source now lives in the data platform does not change the ownership of that data.

We had approached the data lake as a technical tool. Although the tool provided functions that did not exist before, it could not be used directly by end users without hand-holding from data engineering. Why can't someone just push data to it? Why can't someone just deploy an algorithm?

Today we are developing the DataHub as a platform. The DataHub brings data producers and consumers together in a self-service way, taking care of the obstacles that producers and consumers encounter.

On the consumer side we offer self-service data subscriptions and self-service JupyterLab notebook servers, with the platform taking care of the infrastructure, security and privacy, access control, tagging requirements, and so on. For producers we try to make it as effortless as possible to push data into the DataHub while retaining control over their data.
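As an illustration of what a self-service subscription could involve, here is a small sketch of the kind of check a platform can perform before granting access. The field names and the PII rule are hypothetical, not the actual DataHub API; they only show the idea that access control is enforced by the platform while ownership stays with the producing team.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Dataset:
    name: str
    owner_team: str        # ownership stays with the producing team
    contains_pii: bool     # set via the platform's tagging requirements

@dataclass(frozen=True)
class SubscriptionRequest:
    consumer_team: str
    dataset: Dataset

def may_grant(request: SubscriptionRequest, pii_approved_teams: set) -> bool:
    """Grant a subscription unless the dataset holds PII and the team lacks approval."""
    if request.dataset.contains_pii:
        return request.consumer_team in pii_approved_teams
    return True

# Hypothetical example: a team subscribing to a PII-tagged dataset.
vacancies = Dataset("vacancies", owner_team="recruitment-systems", contains_pii=True)
request = SubscriptionRequest(consumer_team="matching-algorithms", dataset=vacancies)
print(may_grant(request, pii_approved_teams={"matching-algorithms"}))  # True
```

Because checks like this run in the platform rather than in ad-hoc pipelines, a request that passes can be granted immediately, without a ticket to the data engineering team.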

Once consumers and producers can help themselves, the data engineering team can focus on growing the platform and providing great value for all data products within Randstad Group the Netherlands. The future is looking great for data within Randstad Group.

tags blog