The third and final post in my series on Data Contracts - covering OSS Tooling and a roundup of other perspectives on the subject.
OSS Tooling
Buz
Buz is an OSS tool that can accept events from different SDKs (it currently supports Snowplow, with others coming soon), pixels, webhooks and direct sends, and can validate them against schemas in a schema registry. It can then send them onwards to other message queues or sinks. Its core maintainer, Jake Thomas, is also speaking at Coalesce, with Emily Hawkins, on a related topic.
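As a rough illustration of that collector flow, here’s a sketch in Go of a direct send: an event posted over HTTP, naming the schema it claims to satisfy, for the collector to validate against its registry and route onwards. The endpoint path, payload shape and schema name are all invented for illustration; they are not Buz’s actual API.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

func main() {
	// A hypothetical event payload: the field names, schema reference and
	// endpoint path below are placeholders, not Buz's actual API.
	event := map[string]any{
		"schema": "checkout/order_completed/1-0-0", // which contract/schema this event claims to satisfy
		"data": map[string]any{
			"order_id": "ord_123",
			"value":    42.50,
		},
	}

	body, err := json.Marshal(event)
	if err != nil {
		log.Fatal(err)
	}

	// Direct send to the collector. The collector would validate the payload
	// against the named schema in its registry, then route it onwards to a
	// message queue or sink (or to a dead-letter path if validation fails).
	resp, err := http.Post("http://localhost:8080/event", "application/json", bytes.NewReader(body))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	fmt.Println("collector responded with:", resp.Status)
}
```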
Benthos
Benthos makes fancy stream processing mundane
Like Buz, Benthos is OSS, written in Go. Its primary aim is to make working with streaming data very easy. However, it also has a json_schema processor:
Checks messages against a provided JSONSchema definition but does not change the payload under any circumstances. If a message does not match the schema it can be caught using error handling methods outlined here.
Here the JSON Schema provided is the definition of the Data Contract. It’s not the sort of workflow that would be easy for a PM to be involved with, but it could be a Data Contract between SWE and DE/AE, given both would be able to read and contribute to the JSON Schema.
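Benthos implements this check itself; the sketch below just illustrates the same idea in plain Go, using the gojsonschema library, with an invented schema and event. The schema plays the role of the contract, and anything failing validation gets flagged for error handling rather than silently passed downstream.

```go
package main

import (
	"fmt"
	"log"

	"github.com/xeipuuv/gojsonschema"
)

// A toy Data Contract expressed as a JSON Schema. In a Benthos pipeline this
// definition would be supplied to the json_schema processor; the fields here
// are invented for illustration.
const contract = `{
  "type": "object",
  "required": ["user_id", "event_type", "occurred_at"],
  "properties": {
    "user_id":     {"type": "string"},
    "event_type":  {"type": "string"},
    "occurred_at": {"type": "string", "format": "date-time"}
  }
}`

func main() {
	// An event that is missing the required occurred_at field.
	event := `{"user_id": "u_42", "event_type": "signup"}`

	result, err := gojsonschema.Validate(
		gojsonschema.NewStringLoader(contract),
		gojsonschema.NewStringLoader(event),
	)
	if err != nil {
		log.Fatal(err)
	}

	if result.Valid() {
		fmt.Println("event satisfies the contract")
		return
	}
	// In Benthos, an invalid message would be marked as errored and could be
	// routed to a dead-letter output instead of the happy path.
	for _, violation := range result.Errors() {
		fmt.Println("contract violation:", violation)
	}
}
```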
Memphis
An open-source, real-time data processing platform for in-app data streaming use cases.
The Memphis team reached out to me to show me what they’re building. Again, OSS, mostly in Go. It doesn’t currently have a Data Contracts feature, but they’ve committed to it on their front page:
Unlike Buz, they are committing to building their own SDKs, for all major languages, to use in app development. There are pros and cons to this: you have to get teams to move off the SDKs they currently have instrumented, whereas Buz steps into existing flows. However, owning your own SDKs, rather than having to adapt to someone else’s, is an advantage.
It’s beginning to feel like Go is a natural programming language for data folks to learn after Python (which many already know). Lots of great data OSS is being built in it, and it’s behind existing projects such as Terraform (and therefore its providers), Docker, Kubernetes and Encore. Type safety, a single opinionated code format, speed, simplicity… there is a lot to love. Neelesh Salian on learning Go:
Schemata
Schemata is an opinionated data modeling framework that programmatically measures the connectivity of the data model and assigns a score to it. We call this Schemata Score.
Observability metrics like SLO & Apdex Score inspired the formation of Schemata Score. A lower Schemata Score means less data connectivity in the data model. It allows teams to collaboratively fix the data model and bring uniformity to the data.
Ananth’s recently announced project is one that I’m really excited about. It’s adjacent to Data Contracts in the way that data observability is adjacent to testing. It suggests that there may be a problem with the Data Contract in terms of connectivity with other data entities. It’s a non-assertion-based way to look at Data Contracts, which is really interesting.
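To make the connectivity idea concrete, here’s a toy sketch in Go. To be clear, this is my own illustration of scoring connectivity over a graph of entities and references, not Schemata’s actual algorithm or formula, and the entity names are invented.

```go
package main

import "fmt"

// connectivityScores gives each entity the fraction of other entities it
// references or is referenced by. A toy illustration of connectivity scoring,
// not Schemata's actual Schemata Score calculation.
func connectivityScores(refs map[string][]string) map[string]float64 {
	// Build an undirected adjacency set over every entity mentioned.
	adj := map[string]map[string]bool{}
	ensure := func(e string) {
		if adj[e] == nil {
			adj[e] = map[string]bool{}
		}
	}
	for from, tos := range refs {
		ensure(from)
		for _, to := range tos {
			ensure(to)
			adj[from][to] = true
			adj[to][from] = true
		}
	}

	n := len(adj)
	scores := map[string]float64{}
	for entity, neighbours := range adj {
		if n < 2 {
			scores[entity] = 0
			continue
		}
		scores[entity] = float64(len(neighbours)) / float64(n-1)
	}
	return scores
}

func main() {
	// Invented entities: Order references User and Product; Refund references
	// Order; PageView is connected to nothing, so it scores 0.
	model := map[string][]string{
		"Order":    {"User", "Product"},
		"Refund":   {"Order"},
		"PageView": {},
	}
	for entity, score := range connectivityScores(model) {
		fmt.Printf("%-8s connectivity: %.2f\n", entity, score)
	}
}
```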
Other voices in the space:
Starting with the post that I found the most fun, but that I also felt I aligned with the most:
I really hope I get to meet Stephen Bailey next week at Coalesce! There is some serious overlap in our humour Venn diagram.
In my post last week, I wrote a small script that I suggested (not seriously) could be a bit of Coalesce performance art. I think the court case from Stephen’s post would be even better! Despite the obvious humour, I feel like I’ve been here before, more than a few times, except I had no recourse and had to pay for my own damages.
Stephen’s post really captures the Data Practitioner’s point of view. There are so many parts I want to quote that resonate with me. So I’ll just leave you to read it - it’s probably one of my favourite Substack posts in a long time. If you’re a SWE, this is genuinely how we feel. If you’re a Data Practitioner, enjoy something very close to home.
Chad Sanderson’s Substack covers Data Contracts in as much depth as any other blog that I’m aware of. There are a few posts to go through, covering the problem space and technical and process implementations. The key takeaway for me is the concept of non-consensual contracts. It’s a really important perspective: that of the data producer. When we rely on the data a producer emits without expressing that dependency (verbal and informal agreements don’t count), yet expect them to conform to a standard, we have a non-consensual contract. We, as Data Practitioners, are expecting data to be the way we want without the producer ever having agreed to it - contracts must be consensual to work.
Benn proposes a self-sufficient model of Data Contracts for Data Practitioners, where the onus isn’t on the producer, but on the Analytics Engineer to ensure that the data is good, as a process between staging and production. I agree that this should be done, and this is where we also need to introduce true unit testing into our use of dbt to ensure that AE work doesn’t also break the contract between AE and the data consumer. However, I think not having a contract between producer and AE, in this context, is asking for problems. Whilst Benn’s point about the danger of turning this into a people problem is also valid:
A smart person once told me that the most foolish thing you could do is turn a technology problem into a people problem. For all their faults, they said, computers aren’t fickle or unpredictable. No matter how painful it is to reconcile mismatched code in a computer or messy data in a database, neither are nearly as hard as getting ten people to agree on anything.
I don’t think this has to be the way. Yes, in my previous post I described a multiplayer game where producer, consumer and other parties agree some kind of contract (as did Stephen). However, I don’t believe this ideal state is necessary for every piece of data or in every situation.
In fact, just having the data producer say that they will produce data to a certain standard and commit to it, maintaining it for a defined time ahead of changes, is much better than what we have today. At least the data consumer would know what they should have and could rely on it. If it didn’t enable every use case, then that’s a problem for the Product Managers to prioritise. In a previous role, before state change data was available for an entity in our data, none of our consumers asked to understand state changes, because they knew the data wasn’t available. Once the business consumers decided they needed it, the work was prioritised by Product Managers, and then we had the data to enable that use case.
Data producers must be able to test whether their changes in development would break a Data Contract. Otherwise, there is no way they will be able to guarantee that they can meet the contract.
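One way that could look in practice (a sketch under assumed conventions, not a prescribed setup): a test in the producer’s repository that validates fixture payloads of the events the service emits against the contract’s JSON Schema, so a breaking change fails CI before it ships. The paths, fixture layout and library choice (gojsonschema) are assumptions for illustration.

```go
package events_test

import (
	"os"
	"path/filepath"
	"testing"

	"github.com/xeipuuv/gojsonschema"
)

// TestEventsMeetContract validates example payloads (fixtures committed
// alongside the producer's code) against the contract schema. If a code or
// schema change breaks the contract, this fails in CI before deployment.
// The paths and fixture layout here are invented for illustration.
func TestEventsMeetContract(t *testing.T) {
	schema, err := os.ReadFile(filepath.Join("contracts", "order_completed.schema.json"))
	if err != nil {
		t.Fatalf("reading contract schema: %v", err)
	}
	schemaLoader := gojsonschema.NewStringLoader(string(schema))

	fixtures, err := filepath.Glob(filepath.Join("testdata", "order_completed", "*.json"))
	if err != nil || len(fixtures) == 0 {
		t.Fatalf("no event fixtures found: %v", err)
	}

	for _, fixture := range fixtures {
		payload, err := os.ReadFile(fixture)
		if err != nil {
			t.Fatalf("reading fixture %s: %v", fixture, err)
		}
		result, err := gojsonschema.Validate(schemaLoader, gojsonschema.NewStringLoader(string(payload)))
		if err != nil {
			t.Fatalf("validating %s: %v", fixture, err)
		}
		for _, violation := range result.Errors() {
			t.Errorf("%s breaks the contract: %s", fixture, violation)
		}
	}
}
```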
Jake Thomas, core maintainer of Buz, has written a great piece on the ideal way things should work with contracts in a data platform - which is essentially a PRD for Buz.
There are many other posts on Data Contracts, good ones, ones I haven’t read, ones I haven’t seen or heard of. These are the handful, in addition to others I’ve mentioned in previous posts, that I particularly wanted to share.
I’ve said my piece on Data Contracts for now. If I write about this topic again, it will be in the context of new ideas, solutions or features. 🚀
If you’re in New Orleans for Coalesce next week, and want to catch up, please reach out on Twitter, LinkedIn or dbt/LO slack! 🛫