There has been criticism of data teams that grew very large during the last tech boom years and have since had to make significant lay-offs. Sometimes whole teams have been removed. Part of why they grew so large was the pattern of how they dealt with data - a pattern imposed upon them, for the most part.
Product teams would make changes with little to no input from a data perspective. They would then send the “exhaust” from product features as events. Sometimes the data team would take extracts from a replica database instead. Either way, the data would end up in a data lake/warehouse without any transformation or cleaning - the now-default ELT pattern. The data would then be cleaned from raw to staging and transformed from staging to marts. Finally, it would be consumed as metrics and dimensions in a semantic layer, exposed in BI dashboards and notebooks. Whether technological or mental, I promise the semantic layer is there. There is also consumption directly from marts into things like Reverse ETL tools.
If you think of the end “sausage” to be consumed as the consumption layer mentioned above, the provenance of what it is composed of is murky at best. Is it 50% pork or 80% pork (third-party vs first-party data)? Did they substitute some pork with horse meat (missing data filled in from the ERP)? Is the rusk gluten-free or not (what do nulls in a specific field mean)? These aren’t gourmet sausages… these are week-old gas station hot dogs that were recovered from being dropped behind the till. One of the problems with the data team being so far from the product engineering team is how much metadata is lost along the way.
Metadata Quality
The data team receives the data with little to no metadata. I don’t mean only data types, but the actual context of the data’s creation: the data model it belongs to, the intention of the engineers who made it, and what it means to the business. This metadata is what allows the data team to use and test the data effectively.
This has led to the topic of data contracts, which I’ve also written about. If we’re going to continue down this blind ELT path, where the data team has to look for value in the data swamp, then data contracts are probably the best solution to this loss of metadata.
TLDR: data contracts are an agreement between data producer and consumer about the nature, completeness, timeliness and format of data shared. If this sounds like an API… that’s because it is.
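To make that concrete, here is a minimal, tool-agnostic sketch in Python of the kind of thing a data contract captures: schema and format (nature), required fields (completeness) and a freshness SLA (timeliness). The dataset, field names and SLA are hypothetical; a real implementation would live in whatever contract tooling or schema registry you already use.

```python
from dataclasses import dataclass, field
from datetime import timedelta


# Hypothetical, tool-agnostic sketch of what a data contract captures:
# the schema (nature/format), required fields (completeness) and a
# freshness SLA (timeliness) agreed between producer and consumer.
@dataclass
class FieldSpec:
    name: str
    dtype: str                # e.g. "string", "timestamp", "decimal(18,2)"
    nullable: bool = False
    description: str = ""     # the metadata that is usually lost


@dataclass
class DataContract:
    dataset: str
    owner: str                # the producing product team
    fields: list[FieldSpec] = field(default_factory=list)
    freshness_sla: timedelta = timedelta(hours=24)
    delivery_format: str = "parquet"

    def validate_record(self, record: dict) -> list[str]:
        """Return a list of contract violations for a single record."""
        violations = []
        for spec in self.fields:
            if record.get(spec.name) is None and not spec.nullable:
                violations.append(f"missing required field: {spec.name}")
        return violations


# Example: the producer publishes the contract alongside the data.
orders_contract = DataContract(
    dataset="orders",
    owner="checkout-team",
    fields=[
        FieldSpec("order_id", "string", description="Unique order key"),
        FieldSpec("amount", "decimal(18,2)", description="Gross order value"),
        FieldSpec("discount_code", "string", nullable=True,
                  description="Null means no discount was applied"),
    ],
)

print(orders_contract.validate_record({"order_id": "o-1", "amount": None}))
# -> ['missing required field: amount']
```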
So why am I saying we don’t need them? They could be the best protection we have for metadata and data quality. I wrote about data being part of the product development process. This preserves metadata and makes data contracts much easier to define and deploy. Could we go further, though?
In the weeks since I wrote that post, I’ve subconsciously been following this way of thinking to its natural conclusion. A product team should include data folks who produce and maintain a data model with a semantic layer. Yes, for internal use by the team, to measure product feature performance, but also for external use by other teams and even customers.
No metadata is lost here during the engineering process. The embedded data people in the product teams live and breathe what their team is building. If they are data engineers, they may even craft part of it. They then bake this knowledge and metadata into the products they offer on to other consumers. Those consumers know what’s on the menu from the semantic layer that’s exposed. They don’t need to invent some way of using the data - the intended use is pre-defined in the semantic layer.
When the data team is separate from the product engineering team, the people involved are on different sides of the interface. The users of the data are separate from the producers, leading to a need for a contract. When embedded data folks are generating metadata and presenting the semantic layer for consumption, a data contract is like having a contract with yourself… it’s not strictly necessary. This is direct-to-consumer, gourmet, organic, hipster sausage - it doesn’t need pumping with antibiotics to be safe to eat.
Data maintained by each product team is easy for other teams to consume and understand. They don’t need to do lots of transformation of their own to get the data ready for their use. The data people in each product team can work according to common standards on common infrastructure, while still being immersed in their product team’s domain and data models.
It starts to become clear which parts of data to centralise. The data people who need to be central in an org are the ones who look after the data platform - the infrastructure the whole org uses for data work. This team builds the common standards and common infrastructure that enable embedded data people to work without worrying about these things, while remaining aligned with data people elsewhere. This is how software engineering has worked in tech companies for some time.
What about the things in data that concern many teams, like identity resolution, revenue recognition, and customer acquisition and attribution? Even these would be better embedded. Identity resolution, for example, can belong in growth or in a product team focused on users/customers.
As the data platform team doesn’t actually use the data, they are never on the other side of the interface from a data producer. Therefore, they don’t need a contract. They can enforce standards, but this should be part of the infrastructure - for example, rejecting events that don’t have the required metadata. These requirements are not specific to any one data producer; they apply to all data producers.
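As a rough illustration of what that infrastructure-level enforcement could look like, here is a small Python sketch of an ingestion gate that rejects events missing org-wide required metadata. The field names are made up for the example; a real platform would do this in the event bus or ingestion service and route rejects to a dead-letter queue with feedback to the producing team.

```python
import json

# Hypothetical sketch of platform-level enforcement: the data platform
# rejects any event missing the org-wide required metadata, regardless
# of which product team produced it. Field names here are illustrative.
REQUIRED_METADATA = {"event_name", "schema_version", "producer_team", "emitted_at"}


def accept_event(raw_event: str) -> bool:
    """Gate applied at ingestion; returns True if the event is admitted."""
    try:
        event = json.loads(raw_event)
    except json.JSONDecodeError:
        return False  # malformed payloads never reach the lake
    missing = REQUIRED_METADATA - event.keys()
    if missing:
        # In a real pipeline this would go to a dead-letter queue,
        # not just be logged and dropped.
        print(f"rejected: missing {sorted(missing)}")
        return False
    return True


accept_event('{"event_name": "signup", "schema_version": 2}')
# -> rejected: missing ['emitted_at', 'producer_team']
```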
The rise of massive dbt projects is a symptom of the blind ELT pattern. dbt has made it easier for centralised data teams to cope with data swamps, but this has, in turn, encouraged these data swamps to expand and remain toxic. This is far from the original vision of what dbt would be used for, which was closer to standardised industry-specific data models.
In the better scenario above, data transformation is much lighter. Data landing from production in the data lake/warehouse will still need incremental loading, but this can follow a common pattern that the data platform team can automate. The data models inside the product teams are already in a good state, so there is little need for extra transformation. Any light filters or abstractions can live in the semantic layer for that data model.
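Here is a minimal sketch of the kind of common incremental-load pattern the platform team could standardise, assuming a watermark column like updated_at. The in-memory source and warehouse are just stand-ins for a production replica and a lake/warehouse.

```python
from datetime import datetime, timezone

# Illustrative watermark-based incremental load: take only rows updated
# since the last high-watermark, append them, then advance the watermark.
source_table = [
    {"id": 1, "updated_at": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"id": 2, "updated_at": datetime(2024, 1, 3, tzinfo=timezone.utc)},
]
warehouse: list[dict] = []
watermark = datetime(2024, 1, 2, tzinfo=timezone.utc)


def incremental_load(rows: list[dict], since: datetime) -> datetime:
    """Append rows newer than the watermark and return the new watermark."""
    new_rows = [r for r in rows if r["updated_at"] > since]
    warehouse.extend(new_rows)
    return max((r["updated_at"] for r in new_rows), default=since)


watermark = incremental_load(source_table, watermark)
print(len(warehouse), watermark)  # -> 1 2024-01-03 00:00:00+00:00
```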
Ah, but what about third-party data, you say? ELT SaaS providers like Fivetran are building standard, well-documented data models for popular data sources like Hubspot and Stripe - including dbt packages. It would only be a small leap to include a semantic layer in a popular open-source format like Cube or dbt/metricflow along with these packages. Interacting with this data then becomes akin to interacting with data from another product team in your own org.
Transformation for product funnel analysis and marketing attribution mostly requires amalgamating event data from different product teams into one place. This could be part of the data platform’s standards and infrastructure for product teams - I’ve seen a central event bus stood up for use by all engineering teams. Metrics and dimensions can then be defined in the semantic layer, and query performance (and therefore cost) can be improved by the semantic layer’s cache. This is an added benefit of a powerful cache in a semantic layer: it can reduce the amount of transformation (and dbt models) needed. A lot of final mart-level data transformation is really a form of caching.
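To illustrate the caching point, here is a toy Python sketch: a metric defined once over amalgamated product events, with the aggregation memoised so repeated dashboard queries don’t recompute it. The event shape and metric name are invented for the example - a real semantic layer (Cube, MetricFlow, etc.) would express this declaratively and cache at the query layer.

```python
from collections import Counter
from functools import lru_cache

# Toy amalgamation of events from different product teams, in one place.
EVENTS = [
    {"team": "checkout", "name": "order_completed", "user": "u1"},
    {"team": "checkout", "name": "order_completed", "user": "u2"},
    {"team": "onboarding", "name": "signup_completed", "user": "u3"},
]


@lru_cache(maxsize=None)
def metric_event_count(event_name: str, group_by: str) -> tuple:
    """Metric: count of events, sliced by a dimension (e.g. producing team).

    The lru_cache stands in for the semantic layer's cache: the same
    slice requested twice is served without re-running the aggregation,
    which is what a final mart model would otherwise exist to do.
    """
    counts = Counter(e[group_by] for e in EVENTS if e["name"] == event_name)
    return tuple(sorted(counts.items()))


print(metric_event_count("order_completed", "team"))  # (('checkout', 2),)
print(metric_event_count.cache_info().hits)           # 0 hits after first call
```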
Revenue recognition operates on small data and is a reduction of revenue events, which may change the state of a revenue entity over time. It may make sense to have a small amount of transformation to codify the definitions of different kinds of revenue and cost. This is more for governance than for performance - you could also do it in the semantic layer to reduce transformations. Finance can own this domain of data and make it available to other consumers, such as Marketing.
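As a sketch of what “reduction of revenue events” means, here is a toy Python fold that moves deferred revenue to recognised revenue as events arrive. The entity shape and amounts are illustrative, not a real accounting model.

```python
from dataclasses import dataclass

# Revenue recognition as a reduction over events: each recognition event
# moves some deferred revenue into recognised revenue for a contract.
@dataclass
class RevenueEntity:
    contract_id: str
    deferred: float
    recognised: float = 0.0


def apply_event(entity: RevenueEntity, amount: float) -> RevenueEntity:
    """Fold one recognition event into the entity's state."""
    amount = min(amount, entity.deferred)  # never recognise more than is deferred
    return RevenueEntity(
        contract_id=entity.contract_id,
        deferred=entity.deferred - amount,
        recognised=entity.recognised + amount,
    )


entity = RevenueEntity(contract_id="c-42", deferred=1200.0)
for monthly in [100.0] * 3:           # recognise 100 per month of service
    entity = apply_event(entity, monthly)

print(entity)
# -> RevenueEntity(contract_id='c-42', deferred=900.0, recognised=300.0)
```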
Identity resolution is the final common type of transformation. It is most valued by Marketing teams (attribution, CAC, ROAS, single customer view). It could easily be pushed down into Marketing from being a central concern, as I said earlier. Cookiemageddon has reduced the ability of internal teams to do identity resolution, so this kind of use case is increasingly outsourced to third parties, who have proprietary ancillary data with which to enrich it. That means even less transformation is done internally than before. Marketing can then own this identity domain of data and make it available to other consumers, like Finance.
The design above is as lean as it gets for data. Where does data leadership, like a VP of Data, fit? They could stay in the data platform team, as I have done before, or belong to some kind of senior leadership team. What about staff/principal-level data/analytics engineers and analysts who work across teams (if you have them)? They could also go to the data platform team, as they provide standardisation of method. They are also well qualified to set requirements for the data platform that will serve their discipline well across the whole org.
Embedded data people are part of those teams’ budgets and therefore provide ROI - the exception being Finance, which, like the data platform, is not a profit centre. Having more embedded data people allows the data platform team to be as lean as possible.
Does this mean we don’t need data contracts in this operating model? Yes.
Is this implicit data contracts via pushing producer and consumer to be as close as possible, along with use of APIs and data platform? Yes.
This is swimming with the current, rather than against it.
Really like this approach, David. It takes a more holistic view of product development. Love the idea of PEDD teams (product mgmt, design, engineering, AND data) focused on achieving outcomes, especially with data-intensive applications or components.
I certainly agree with the subtitle: "We need data to be part of product engineering"!
If data applications are supporting key business processes and driving ML models that power product features, then they should be built in the same way, and with the same discipline, that product engineering teams use for their services.
And of course, data should be owned by the team who produces it.
My goal with data contracts was always to facilitate a move to this model, without changing the organisation structure first.
That's why my book talks mostly about that, and much less on the technology.
Even with the perfect org structure, there is still a need for an interface to access the data.
Often that would be a table in a data warehouse, with historical data, because the people consuming this data are typically using tools like dbt or SQL-based analytics tools like Looker.
And that's the interface that can be driven by a data contract.