There has been a lot of talk about unbundling and rebundling on data Twitter and in blog posts.
The timing of this discussion, in relation to the topic of this post, is near perfect. I had decided on this topic on the day Gorkem of fal released his post, and I had spoken to him about potentially using fal to enable asset-based orchestration (this is my new favourite phrase), much like how Dagster have enabled this using Airbyte connectors and dbt models.
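For a flavour of what asset-based orchestration means in practice, here is a minimal Dagster sketch that declares an Airbyte connection and a dbt project as software-defined assets, so the dbt models run downstream of the tables the connection loads. The connection ID, table names and paths are placeholders, and the exact dagster_airbyte/dagster_dbt APIs vary by version.

```python
# A minimal sketch of asset-based orchestration in Dagster (APIs vary by version).
# The connection ID, table names and paths below are placeholders.
from dagster import Definitions
from dagster_airbyte import build_airbyte_assets
from dagster_dbt import load_assets_from_dbt_project

# Declare the tables an Airbyte connection loads as Dagster assets.
airbyte_assets = build_airbyte_assets(
    connection_id="your-airbyte-connection-id",   # placeholder
    destination_tables=["orders", "customers"],   # placeholder
)

# Declare every model in the dbt project as an asset; Dagster wires the
# dependencies so models run after the tables they select from are loaded.
dbt_assets = load_assets_from_dbt_project(
    project_dir="path/to/dbt_project",            # placeholder
    profiles_dir="path/to/profiles",              # placeholder
)

defs = Definitions(assets=[*airbyte_assets, *dbt_assets])
```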
This unbundling, or rebundling into better products than we’ve had before, will have more of an impact on data engineering than on the other modern data stack roles. In my previous role, I remember speaking to a data engineer we had hired about a misalignment of expectations: he had expected to be building data pipelines using technologies like Kafka for streaming and Airflow for orchestration, but had ended up building backend micro-services for data use cases. I remember thinking, while speaking to him, why would we build those kinds of pipelines every time we need to move data? They are not easy or quick to set up and are relatively brittle, as they need data engineers on hand to support upstream changes and unanticipated volume changes in the pipeline.
The Modern Data Stack has a whole host of technologies to automate these kinds of workflows:
CDPs/event trackers like Rudderstack and Snowplow, which provide SDKs in most languages you could think of, so event data collected in your applications can be piped to cloud storage or a DWH in a standard format (see the sketch after this list).
EL tools like Fivetran, Rivery, Airbyte, Gravity Data, Meltano, Stitch… (😅 a crazy amount of VC money has gone here in the last year - the joy of painkiller products)
Of course, dbt for SQL-based transforms, plus metric/feature stores
TLE (reverse ETL) tools like Census and Hightouch
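On the first item in that list, the developer-facing side is roughly a one-line track call, with the SDK handling batching, retries and delivery in a standard format. A minimal sketch, assuming the RudderStack Python SDK’s Segment-style interface; the module and configuration attribute names differ between SDK versions, so treat the specifics as illustrative:

```python
# A minimal sketch of instrumenting an application with a CDP/event-tracker SDK.
# Assumes the RudderStack Python SDK's Segment-style interface; module and
# attribute names vary by SDK version, so treat these specifics as illustrative.
import rudder_analytics

rudder_analytics.write_key = "YOUR_WRITE_KEY"                 # placeholder
rudder_analytics.data_plane_url = "https://your.dataplane"    # placeholder

# One call per event: the SDK batches, retries and ships the event downstream
# in a standard format, so no bespoke collection pipeline is needed.
rudder_analytics.track(
    user_id="user_123",
    event="Order Completed",
    properties={"order_id": "A-1001", "revenue": 49.99},
)
```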
It’s beginning to feel like the unbundling of Airflow has spread wider than its original footprint, to the extent that mainstream data engineering can be achieved mostly through the integration of event collectors in app platforms and the use of off-the-shelf or open source tooling (the open-source → cheap hosted → enterprise sales model gains steam).
I think it raises the question of whether we should build anything bespoke for our data pipelines at all. A lightbulb moment for me recently was when I was working with our CTO on some data for investors: we were working late into the night and got to a point where we decided to get our data out of Amplitude and into our DWH, to enable splitting cohorts by specific segments. We ended up using Fivetran’s free trial to get the data out quickly, and I remember saying on Slack that we should be careful, as it gets expensive if we start to depend on it… the CTO’s response: “cheaper than a data engineer”.
It feels like this principle is even more true for the streaming use case:
Building an endpoint for software engineers to send data to (building SDKs for them to use is even more work)
Infrastructure that can cope with wildly varying volume
On-the-fly event format standardisation and chunking
Schema validation (see the sketch after this list)
Reliable storage of events into a well-structured and configurable folder structure
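To give a taste of just one of those items, here is a hedged sketch of schema validation using the jsonschema library, with an illustrative event shape; a real service still needs the endpoint, auth, buffering, retries and storage layout around it.

```python
# A minimal sketch of one piece of the list above: validating an incoming
# event against a schema before it is accepted. Uses the jsonschema library;
# the event shape and schema here are illustrative placeholders.
from jsonschema import Draft7Validator

EVENT_SCHEMA = {
    "type": "object",
    "required": ["event", "user_id", "timestamp"],
    "properties": {
        "event": {"type": "string"},
        "user_id": {"type": "string"},
        "timestamp": {"type": "string"},
        "properties": {"type": "object"},
    },
    "additionalProperties": False,
}

validator = Draft7Validator(EVENT_SCHEMA)

def validate_event(event: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the event is accepted."""
    return [e.message for e in validator.iter_errors(event)]

# Example: a malformed event is rejected rather than silently loaded.
errors = validate_event({"event": "Order Completed", "user_id": 42})
print(errors)  # e.g. ["'timestamp' is a required property", "42 is not of type 'string'"]
```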
This is not a trivial system to build, which is why the original data engineers who did this with things like Kafka and custom-built services were so sought-after and expensive, yet still slow to deliver. Using a hosted version of an open-source tool like Rudderstack or Snowplow, wrapped in Avo or Iteratively, would provide more in a few days than in-house data engineers could build in months, with higher resilience and much lower total cost.
Rudderstack, Snowplow, Fivetran and more now offer dbt packages that deal with the loaded data in the data warehouse; this makes complete sense, as the data is the same shape for all customers: why should everyone reinvent the wheel building their own dbt models, researching the API to document fields, etc.? You can have well-modelled, clean data that supports your business use cases straight away, rather than worrying about how to incrementally load it or how to build a surrogate key. This feels like a realisation of DataOps: just as terraform providers exist for most things needed in DevOps, we now have dbt packages.
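For a rough sense of the busywork those packages absorb, here is a Python analogue of two of those chores, a hashed surrogate key and an idempotent incremental merge; in reality this lives in the packages’ dbt SQL, and the column names below are illustrative placeholders.

```python
# A rough Python analogue of what vendor dbt packages handle for you in SQL:
# a hashed surrogate key plus an idempotent incremental merge.
# Column names ("account_id", "event_id", "loaded_at") are illustrative placeholders.
import hashlib
import pandas as pd

def surrogate_key(*parts: str) -> str:
    """Hash the natural-key columns into one stable key (md5, as dbt_utils does)."""
    return hashlib.md5("-".join(parts).encode("utf-8")).hexdigest()

def incremental_merge(existing: pd.DataFrame, increment: pd.DataFrame) -> pd.DataFrame:
    """Upsert a new increment into the existing table, keyed on the surrogate key."""
    # The existing table is assumed to already carry an "sk" column.
    increment = increment.assign(
        sk=[surrogate_key(a, e) for a, e in zip(increment["account_id"], increment["event_id"])]
    )
    combined = pd.concat([existing, increment])
    # Keep the latest row per surrogate key, so re-running the same load is a no-op.
    return combined.sort_values("loaded_at").drop_duplicates("sk", keep="last")
```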
Am I the turkey that voted for Christmas? I don’t think so, as we’re still talking about automating how the plumbing is created. The analytics engineering work of composing this data with company-specific product data, in a way that is relevant for your organisation, is still unique to each company. Until we can predict exactly how data will need to be used in relation to its underlying structure, the gap from the semantics layer to clean staging data (let’s call this automated data semantics), analytics engineering will need to happen. It’s probably easier to change companies to operate in the same way, and therefore care about the same metrics, than to automate data semantics. However, the fundraising processes in place today are funnelling private-equity-funded companies towards semantics standardisation.
I do still believe there are use cases that should be built in-house, often those that support your core competencies or product secret sauce (for companies selling ELT, this is pretty much everything). However, the gap between this and backend engineering focused on data is beginning to blur: all cloud providers now offer managed message queues and lambda-function equivalents, so the “data engineering” that still needs to happen in-house looks a lot more like building backend micro-services that process data in a company-specific context. Cloud vendors are also abstracting data streams for analytics engineering, almost entirely skipping data engineering in favour of some modest upfront infrastructure work.
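That kind of in-house work tends to look like the sketch below: a small service glued onto managed primitives. This assumes an SQS-triggered AWS Lambda, with enrich_order() standing in as a hypothetical placeholder for the company-specific processing you cannot buy off the shelf.

```python
# A minimal sketch of in-house "data engineering" as a backend micro-service:
# an AWS Lambda triggered by an SQS queue, applying company-specific logic.
# enrich_order() and the queue wiring are hypothetical placeholders.
import json

def enrich_order(order: dict) -> dict:
    """Stand-in for the company-specific processing that can't be bought off the shelf."""
    order["is_high_value"] = order.get("revenue", 0) > 100
    return order

def handler(event, context):
    # SQS delivers a batch of records; each record body is the raw message payload.
    results = []
    for record in event["Records"]:
        order = json.loads(record["body"])
        results.append(enrich_order(order))
    # In a real service this would be written on to a stream, queue or warehouse.
    print(json.dumps(results))
    return {"processed": len(results)}
```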
There are also companies technically capable and large enough to want to build everything in-house, which is a valid viewpoint if you have the time, funds and engineers; some of them provide the OSS the rest of us use for these very use cases. However, for 95 to 99% of companies, the buy-or-borrow approach is more sensible. Contributing to open source where it doesn’t do everything you need is sensible too: you get the features you need, but they’re maintained by a whole community, for the benefit of all.
There will always be a long tail of remaining sources that no ELT provider supports; I think it’s reasonable to expect these vendors to at least batch-upload to cloud storage in a decent format. Almost all companies who buy from these vendors have a cloud provider, and all who have a cloud DWH do, so every company has one of S3, Azure Blob Storage, or GCS and one of Redshift, BigQuery, Snowflake, or Synapse. That’s only seven main destinations any such vendor would need to support. I think it’s reasonable to expect them to at least do the first three well, as they are simply file systems, if not also the warehouses with proper schema definitions.
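“A decent format” is a low bar: date-partitioned, newline-delimited JSON in object storage is something most warehouses and ELT tools can load directly. A minimal sketch with boto3, using a hypothetical bucket and illustrative records:

```python
# A minimal sketch of the "batch upload to cloud storage in a decent format"
# baseline: newline-delimited JSON under a date-partitioned S3 key.
# The bucket name and records are hypothetical placeholders.
import json
from datetime import date

import boto3

records = [
    {"id": 1, "status": "active"},
    {"id": 2, "status": "churned"},
]

# One JSON object per line, under a dt= partition so downstream loads are incremental.
body = "\n".join(json.dumps(r) for r in records).encode("utf-8")
key = f"exports/my_source/dt={date.today().isoformat()}/batch_0001.jsonl"

s3 = boto3.client("s3")
s3.put_object(Bucket="customer-landing-bucket", Key=key, Body=body)
```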
For most companies, the cost of hiring and keeping data engineers working on pipelines that could be built with specialist “unbundled” tooling is unwarranted. The real value is in the next steps of data-enabled decision making, whether human or automated: Is this sneaker actually the same as this other sneaker? Should we lend to this customer? Should we deploy this product feature? Should we spend more on this marketing channel than the next? The curation of clean, incrementally loaded data into well-defined data models and feature stores, to enable this decision making, is still a key value of analytics engineering and is not easily automated.
I’ve worked at a large listed company where the focus for data was not on enabling these decisions, but purely on building pipelines and infrastructure, with the decision-making use cases left to the end of the roadmap (a team of ~50 in Data Engineering using the Hortonworks stack). The project took years and ultimately failed, with the CDO and other data VPs fired… it’s a reminder that the purpose of data engineering is ultimately to enable decision-making use cases, and therefore, if we can get there faster at the same or better quality, we should try to do so.