When I was at Lyst, my line manager looked after our platforms. His team was mostly composed of SREs who did lots of DevOps work and spent a great deal of time in Terraform. However, he also had two software engineers when I joined, who were working on a data project. The idea of the project was to build an architecture that would allow any product or marketing team to send data to a central API, which would then store, envelope and partition the data. This also meant it was very easy for my team of analytics engineers to simply pass configuration into Terraform1 in order to receive, in Snowflake, new data that had been sent to the API.
Looking back, it is clear that my manager had instituted a data platform, and the two software engineers who built it were data platform engineers, at least in activity. This was long before I first heard the term ‘data platform engineering’, at a London Analytics Engineering Meetup at Sainsbury’s, where the speaker was a staff data platform engineer from Deliveroo.
This data platform enabled my analytics engineering team to operate more broadly and independently, without needing data engineers. In fact, many other organisations would describe what they did as data engineering rather than analytics engineering. I’m not precious about whether this work is considered data engineering or analytics engineering. Where a data platform exists, the two roles naturally converge.
The velocity of these analytics engineers, who had a data platform to work from, was much higher than I have seen at other orgs where, instead of this setup, analytics engineers have to wait for data engineers to build the first leg of pipelines for them. If it’s possible to reduce the number of people in the chain to produce something, we should do it - it reduces context loss. If an analytics engineer can understand the semantic and data model a stakeholder needs, acquire the raw data from the data platform using only config (or not much more than config), and then build their transformation models… it’s a much better process.
If there is both a data engineer and an analytics engineer in the value chain, it’s much more complicated. The analytics engineer is close to the stakeholder and distils the stakeholder’s needs into a data and semantic model to build, then has to request the data they need from the data engineer. The data engineer then interprets what the analytics engineer has asked for and creates or edits a pipeline to acquire the data from the third-party or internal source. The data engineer will often make assumptions about what data will be useful and how to store it efficiently. They will then expose raw or cleaned data sources for the analytics engineer to apply business logic on top of.
Relay races are slower than individual sprints, and even the best teams can drop the baton and lose. Making data and analytics engineers work on separate parts of the pipeline in a chain introduces ambiguity, context loss, and delay. When analytics engineers who can’t self-serve their own source data have to request it from data engineers, and the data engineers are busy, they can end up waiting a long time, sharply increasing the average time stakeholders have to wait for what they need.
The data engineers are also disconnected from what the stakeholders and the organisation need, because the analytics engineers mask this from them. They might like this, because it allows them to focus on things that interest them, like new technologies to implement and other technical considerations… but being further away from the value they are supposed to be driving will always make them less valuable. Migrating to an exciting new data warehouse, implementing a new tool, or adopting infrastructure-as-code… delivers nothing for stakeholders by itself.
Compare this to where a data platform and a data platform team are in place. The data platform team accelerates pipeline building by standardising tooling and by allowing many new datasets to be acquired through a simple config change. It enables analytics engineers to self-serve their source data acquisition in a very short time as part of their everyday workflows, and then move straight on to building transformation, data and semantic models. In this case, it is right for the data platform team to focus on improving tooling and infrastructure, as they aren’t in the critical path of delivery for stakeholders. They are there specifically to accelerate the use of data by others, through standardisation and automation.
It also makes data platform engineers care about the developer experience of the analytics engineers. Engineers who build the pipelines themselves become opinionated about the tools they build with; engineers who enable others to build must ensure the tools work for those colleagues. Staying attached to legacy tooling because they are comfortable with it no longer makes sense if it means the analytics engineering team find it difficult, slow or impossible to use on their own.
What’s good enough for data platform engineering becomes what’s good enough for the org, whereas with data engineering, ‘good enough’ only has to be good enough for their own team. Low velocity caused by a lack of modernisation and by disconnection from the business is mostly invisible to a data engineering team that operates this way; they only feel it when it’s too late, pipelines are breaking, and stakeholders are hunting them down on Slack…
In short, I think any organisation looking to reorganise or set up a new data organisation should seriously consider having one single discipline that builds end-to-end pipelines, whether that is called analytics or data engineering, and a data platform engineering discipline that enables, accelerates, and stabilises this.
It’s concerning that what I saw for the first time almost six years ago is still far ahead of how many organisations operate now. With the global forces at play driving data organisations2 to become far more efficient, far faster than ever before, they don’t have the time or the money to stick with what is comfortable. They may not realise their time is running out.
Data platform engineering is no longer a novel and pioneering way to operate; it’s the norm of the day, and the old way is antiquated - perilously so. Within two years, data team leads will have the option of AI-assisted workflows that take them all the way from data sources to modelled data. That way of operating is consistent with data platform engineering: data platform engineers can institute these tools and enhance them with their human abilities. Where data engineers are the first leg of a relay race, they force their data team lead to choose between them and the AI-assisted workflow. I don’t need to tell you what humanity has chosen every time it has had to decide between the old ways and automation that works.
1. We had a Terraform module that accepted an S3 bucket location, plus a few other parameters to determine the table name in Snowflake. Upon applying a new instance of the module, Snowpipe would start to ingest data from the right location in S3 and store it in a standardised raw-input Snowflake table, which became one of the raw sources we built onwards from. It allowed us to go from having no usable data in the data warehouse to having data modelled and exposed in Looker with a couple of hours’ work.
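As a rough illustration of what that looked like from the analytics engineers’ side, a hypothetical use of such a module might be something like the sketch below. The module name, bucket, prefix and table identifiers here are invented for illustration; they are not the actual Lyst code or parameters.

```hcl
# Hypothetical sketch only - module name, bucket and table identifiers are
# invented for illustration; the real module and its parameters differed.
module "marketing_orders_events" {
  source = "./modules/snowpipe_ingestion"

  # Where the producing team drops its files in S3
  s3_bucket = "acme-data-landing"
  s3_prefix = "marketing/orders_events/"

  # Target raw table in Snowflake that Snowpipe loads into
  snowflake_database = "RAW"
  snowflake_schema   = "MARKETING"
  snowflake_table    = "ORDERS_EVENTS"
}
```

In a setup like the one described above, applying a single block of this kind is enough for Snowpipe to start loading the new source into a standardised raw table, which the analytics engineers can then model onwards.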
2. It’s whole orgs, and all orgs, but I’m focused on data in this post.
This blog post summarises my thoughts and my dialogue with various companies over the past couple of months. I often see data organisations split data ingestion, modelling, exposure, and infrastructure into four separate roles and sets of responsibilities.
I believe that having one team of people handle the data pipeline end-to-end, alongside a separate data platform team with no access to the data itself, creates the most efficient, precise, and motivated roles - roles that take responsibility for their solutions.