The first time I heard the term ‘Data Contracts’ was when I read Andrew Jones’ post on it, relating to the work he was doing in Data Engineering at GoCardless. Andrew spoke at London Data Quality Meetup #2 and I had a chance to speak to him after the event. As I mentioned in my last post, we thought Andrew was the first to coin the term and concept, but he mentioned there was a team in China who had thought of something similar around the same time. This is beginning to sound like academia.
I do genuinely believe that bigger tech companies would have invented something like Data Contracts, but possibly by another name. They may have used APIs without adapting them for Data Contracts specifically, or they may have made something specific to their stacks. I haven’t seen anything open-sourced that fits the description, but part of me wonders, why? Was it too specific to their org infrastructure to share, or was it considered part of their competitive advantage?
I asked some data legends on LinkedIn and got some answers - even one that was very close to home!
As expected, the concept had existed but not been named. It reminded me of how I had been doing entity resolution for my whole career, but only learned it was called that last year:
I’m beginning to think that if we excavated even further… we’d start heading back to the 80s and IBM. Just as data observability has existed for nearly as long as databases, maybe Data Contracts have, too.
One thing really stands out for me: the companies that have “won with data” have been using the concept of Data Contracts for a long time, regardless of what they called it. It’s a logical solution, and possibly one of those things in engineering that has one correct solution (with many monikers), which is probably why many are coming to the same solution at different times. Like many concepts that have come out of Big Tech and proliferated, I think it’s time for Data Contracts to do the same. As an industry, we need to end the disaster of Data Engineering without Data Contracts.
One of my key memories on this topic was when we had an incident with one of our data pipelines. There was a field in one of our events which was boolean and previously not allowed to be empty. An engineer changed the event so that the field could now contain blank strings, which broke our data pipeline. I had implemented data observability, so I was able to pinpoint the source of the issue very fast and start the incident management process. However, what would have been better would have been an enforced Data Contract that prevented the pipeline breakage in the first place.
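To make that concrete, here’s a minimal sketch of the kind of enforced check that would have caught the change at the source. This is not our actual pipeline code, and the event and field names are hypothetical; it just shows a contract rejecting a blank string where a boolean was promised:

```python
# Hypothetical contract for an event: field names are illustrative only.
EVENT_CONTRACT = {
    "is_active": {"type": bool, "nullable": False},  # the boolean field in question
    "user_id": {"type": str, "nullable": False},
}

def validate(event: dict, contract: dict) -> list[str]:
    """Return a list of contract violations for a single event."""
    errors = []
    for field, spec in contract.items():
        if field not in event:
            errors.append(f"missing field: {field}")
            continue
        value = event[field]
        if value is None:
            if not spec["nullable"]:
                errors.append(f"{field} may not be null")
        elif not isinstance(value, spec["type"]):
            errors.append(
                f"{field} must be {spec['type'].__name__}, got {type(value).__name__}"
            )
    return errors

# The upstream change started sending a blank string where a boolean was promised:
bad_event = {"is_active": "", "user_id": "u_123"}
print(validate(bad_event, EVENT_CONTRACT))  # → ['is_active must be bool, got str']
```

Enforced at produce time, a check like this rejects the non-conforming event (or fails the producer’s build) instead of letting it break the pipeline downstream.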
What is a Data Contract?
I wrote this section before some of the archaeology above - I wanted to simply think about what a Data Contract should be and not think too much about execution just yet.
In short, a Data Contract is an enforceable agreement on structure and format between the producer and consumer of data. You could even define it in a simpler way: a Data Contract is a guarantee on structure and format by a producer of data.
There are a lot of technical ideas and terms flying around in relation to the topic, but this is the core of it. A lot of the complication being addressed is in relation to how to execute Data Contracts in practice.
Why are Data Contracts a good thing? Up until we started thinking about this concept, it was accepted practice that upstream data sources from data pipelines could change without any notice. This was just normal. Data Engineers would watch their pipeline break, debug what went wrong, find out it was upstream, ask why they weren’t told, throw their hands in the air in frustration and then fix it. Rinse and repeat. Try to think about that without the context of what you’ve experienced in your career so far… it’s actually ludicrous.
In no other form of engineering would this be tolerated (engineering failure outside software and data can result in deaths). It was tolerated in Software Engineering until people started to quantify the cost of outages and the revenue of tech businesses soared, such that outages were even more expensive. Then it was dealt with - there is a whole field of Site Reliability Engineering with associated CI/CD, Test Driven Development, observability tooling, Infrastructure-as-Code, SLAs, SLOs… and many more principles/tools dedicated to reducing outages to the bare minimum. This has, by and large, been successful - you use many services provided by tech companies with very little expectation that they should go down, because they very rarely do.
One of the core solutions from SWE that is being borrowed from, for Data Contracts, is the API. What is an API?
APIs are mechanisms that enable two software components to communicate with each other using a set of definitions and protocols. For example, the weather bureau’s software system contains daily weather data. The weather app on your phone “talks” to this system via APIs and shows you daily weather updates on your phone.1
The way that an API is guaranteed, based upon its definitions and protocols, is a big source of inspiration for Data Contracts. APIs have changed the world beyond recognition - knowingly or unknowingly, you will have made thousands of API requests today through your use of applications. I hope that Data Contracts will have a profound impact on our industry, but everything I’ve seen to date also suggests bumps in the road. A Data Contract, however, doesn’t carry the uptime or accessibility guarantees that an API does. You can fulfil a Data Contract by sending no data, if no data should have been sent. A Data Contract can be fulfilled as long as data in transit and at rest meets the specification of the Data Contract.
The last big topic in the Data space was that of Semantic Layers. VCs and other industry leaders rightfully slapped this topic down, on the grounds that these concepts are far up the stack, which is currently built on top of wet sand. Part of why I’m excited about Data Contracts is that they are a way for us to go down the stack and change our foundations to be on rock. If Data Contracts proliferate widely and cross the chasm, this enables concepts like Semantic Layers to also proliferate widely and provides them with a bridge across the chasm. It’s about getting our fundamentals right as an industry. If I were a VC, I’d want to invest in Data Contracts-aaS type companies, as it increases the TAM of the rest of my data portfolio, and makes ideas that are not investable now, investable in a few years’ time.
What a Data Contract should contain:
Format and Structure
The fields and nested fields that can or should be contained in the data: which fields should always be present, and which are optional
The types of the fields in the data:
data types like string, boolean, number types and arrays of these
whether they are nullable
whether they are unique
whether they can only take specific values (eg enums and numeric ranges)
whether they have to fit a certain format or regex (eg UUIDs)
Metadata such as descriptions of the data, its grain, its context and also descriptions of its fields
The version of the data: versions of data should be immutable. If the data is changed, then the version is bumped
0.0.1 to 0.0.2 is a small bump for a small change, like adding a field without changing grain
0.0.2 to 0.1.0 is a medium bump for a moderate change that could be breaking, such as removing a field or changing its format
0.1.0 to 1.0.0 for large structural changes to the data, which mean the consumption method has to change materially
Does the data contain sensitive information, and in which fields?
Standards around structure (eg no arrays in the middle of the structure please)
Standards around which fields are required (eg if this is a product clicked event, there should be fields about the product included)
Environment, eg Test, Dev, Prod
SLAs and Owners
When (and only when) this data should be sent
Timeliness
Bifurcation (where data that meets the Contract goes, where data that fails it goes, and how each is treated)
Who is the producer of the data
Who is responsible for the data
Who is accountable for the data
Who should be consulted about the data
Who should be informed about the data
How long will old versions of the data be available
When should the data expire
Semantics
Which entities are at play in this data
Is this data about an entity or an event (all data is about entities, but not all data is event data)
Which metrics can be calculated from this data and how (eg in order state change events: revenue = sum(case when state = 'completed' then order_amount end) - sum(case when state = 'refunded' then order_amount end))
How does the data relate to other data (eg foreign keys)
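To make the checklist above concrete, here is one way it could all be captured in a single contract document, using the order state change example. This is a sketch: the layout and key names are my own illustration, not an established standard.

```python
# Illustrative only: one possible shape for a contract document covering
# format/structure, SLAs and owners, and semantics. Key names are hypothetical.
order_state_change_contract = {
    "version": "1.2.0",  # immutable; bumped per the rules above if the data changes
    "environment": "prod",
    "owners": {
        "producer": "checkout-team",
        "responsible": "checkout-team",
        "accountable": "head-of-engineering",
        "consulted": ["data-platform"],
        "informed": ["analytics"],
    },
    "sla": {
        "timeliness": "within 5 minutes of the state change",
        "retention_of_old_versions": "90 days",
        "bifurcation": "non-conforming events routed to a dead-letter queue",
    },
    "fields": {
        "order_id": {"type": "string", "required": True, "format": "uuid", "unique": True},
        "state": {"type": "string", "required": True,
                  "enum": ["created", "completed", "refunded"]},
        "order_amount": {"type": "number", "required": True, "min": 0},
        "customer_id": {"type": "string", "required": True, "sensitive": True,
                        "foreign_key": "customers.customer_id"},
    },
    "semantics": {
        "entity": "order",
        "kind": "event",
        "metrics": {
            "net_revenue": (
                "sum(case when state = 'completed' then order_amount end) "
                "- sum(case when state = 'refunded' then order_amount end)"
            ),
        },
    },
}
```

Whatever the concrete format (YAML, JSON Schema, protobuf, something else), the point is that all of the above lives in one versioned, machine-readable artefact that tooling can enforce.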
I think this is mostly complete and uncontroversial, except for the last part about Semantics. Some would ask: why should you need to know how to calculate metrics from the data as part of the Contract? The whole point of collecting the data in the first place is to measure something, as we’re not talking about data stores that the application directly needs in order to function. If the aim is to measure something, why not understand up front how to do this?
It’s much easier when the data model is designed and agreed upon, to know what it means and how to use it. Why wait until the data is produced and only then ask the Data Engineer or Analytics Engineer to figure out how to use it? They then have to look at the data, understand it, experiment, test their assumptions and then apply calculations either in Semantic/Transformation layers… this is slow and prone to errors.
Imagine if the metric definitions stored in the Data Contract could be inherited by a Semantic Layer, like imports from a package. In essence, once the Data Contract is implemented, it’s possible to go from data creation to meaning immediately, without further engineering. Of course, there are more nuanced and deeper analyses to be had through Analytics Engineering: things like LTV and attribution span multiple datasets and would be difficult to define, given that the Data Contract is for one type of data.
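A toy sketch of what that inheritance could look like. The contract structure and the SemanticLayer class here are entirely hypothetical, just to show the “imports from a package” idea:

```python
# Hypothetical contract fragment: the producer defines the metric once.
contract = {
    "dataset": "order_state_changes",
    "metrics": {
        "net_revenue": (
            "sum(case when state = 'completed' then order_amount end) "
            "- sum(case when state = 'refunded' then order_amount end)"
        ),
    },
}

class SemanticLayer:
    """Illustrative stand-in for a semantic layer that imports contract metrics."""

    def __init__(self):
        self.metrics: dict[str, str] = {}

    def import_contract_metrics(self, contract: dict) -> None:
        # Inherit every metric the producer defined, with no extra
        # engineering by the consumer.
        for name, expression in contract["metrics"].items():
            self.metrics[name] = (
                f"select {expression} as {name} from {contract['dataset']}"
            )

layer = SemanticLayer()
layer.import_contract_metrics(contract)
print(layer.metrics["net_revenue"])
```

The moment the contract is published, the consumer’s semantic layer can already answer “what is net revenue?” without anyone reverse-engineering the data.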
At this point in time, I believe a Data Contract should be for only one type of data and limited in scope as to the contents of this data (which may include foreign keys). It would be hard and probably wrong to get Application Engineers to think wider than the data they are responsible for.
The relationship between data from different data contracts is something I believe the Consumer is responsible for understanding. This is where the Consumer needs to understand that something like a grain change in Data Contract A could affect how they use it in conjunction with data from Data Contract B. The consumer then has the choice of whether to challenge and block a Data Contract change, or do work up front to mitigate. If data is versioned with overlap in emission, the consumer has time to adjust and dovetail the data from old and new versions of the Data Contract.
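A sketch of that dovetailing, assuming (hypothetically) that the old and new contract versions are emitted side by side for a transition window. The event shapes and field names are invented for illustration:

```python
# Hypothetical: during the overlap window the consumer maps events from
# both contract versions into one internal shape, then deletes the v1
# branch once the old version expires.

def normalise(event: dict) -> dict:
    if event["contract_version"].startswith("1."):
        # v1 of the (hypothetical) contract: flat amount, in pence
        return {"order_id": event["order_id"],
                "amount": event["amount_pence"] / 100}
    # v2: amount moved into a nested object and expressed in pounds
    return {"order_id": event["order_id"],
            "amount": event["payment"]["amount_pounds"]}

v1_event = {"contract_version": "1.0.0", "order_id": "o1", "amount_pence": 1250}
v2_event = {"contract_version": "2.0.0", "order_id": "o1",
            "payment": {"amount_pounds": 12.5}}

assert normalise(v1_event) == normalise(v2_event)
```

Because both versions flow for a while, the consumer can ship and verify the v2 branch before the v1 data ever stops arriving.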
Next Parts
I’ll look at:
What kind of data can have Data Contracts applied
Emerging tools in the space
A roundup of other posts and articles written about Data Contracts and how they align with my thinking, and if they change what I think
It’s not too late to choose to come to Coalesce in New Orleans. I’m speaking and really looking forward to it as a conference (I’ll particularly enjoy it after my talk 😅).
Use code SPEAKERGUEST15 to get 15% off your ticket.
Metaplane are also rolling deep - we’ll have a booth, so come by and say hi! We have some very cool new dbt specific features to show off 🤫. Needless to say, I’ve been spending a lot of time looking at catalog.json, run_results.json and manifest.json 🤖. Tip: don’t try to put these into json crack, it won’t like it.
https://aws.amazon.com/what-is/api/
Great work, David! Love it.
Spot on: 'Data Contracts is that they are a way for us to go down the stack and change our foundations to be on rock' & 'companies that have “won with data” have been using the concept of Data Contracts'. I totally agree.
I would propose these additions to the contents of a data contract:
1) Access: The tech details of where to consume the data and pointers to the process of getting access. These technical aspects like the storage system do matter as they often imply constraints and capabilities of the underlying system and hence of the data.
2) Monitoring: Many (but not all) parts of the data contract can be monitored. That is the big advantage of going down the stack. Executable data contracts imply that monitoring for it can be automated. The data contract should include which aspects are already monitored by the producer. Monitoring may also include a producer vs consumer view. As a producer, you want to monitor what you produce. As you correctly mention, the responsibility for the quality of the beer is with the producer, not the consumer. But the consumer may have specific usage requirements that go beyond the default producer checks. It should be possible to also add usage data contracts that express the specific usage requirements. This also enables better impact analysis.
You mentioned not having seen anything open-sourced. Check out https://docs.soda.io/soda-cl/soda-cl-overview.html It's Apache licensed and a big step towards data contracts imo.
Thanks again for sharing your thorough research!