We don’t need data contracts

Feb 16, 2024

We need data to be part of product engineering

6 Comments

Feb 16, 2024

Really like this approach David. It takes a more holistic view of product development. Love the idea of PEDD teams (product mgmt, design, engineering, AND data) focused on achieving outcomes, especially with data intensive applications or components.

Expand full comment

Andrew Jones

Feb 18, 2024

I certainly agree with the subtitle: "We need data to be part of product engineering"!

If data applications are supporting key business process, driving ML models that power product features, then they should be built in the same way, and with the same discipline, that product engineering use for their services.

And of course, data should be owned by the team who produces it.

My goal with data contracts was always a way to facilitate a move to this model, without changing the organisation structure first.

That's why my book talks mostly about that, and much less on the technology.

Even with the perfect org structure, there is still a need for an interface to access the data.

Often that would be a table in a data warehouse, with historical data, because the people consuming this data are often using tools like dbt or SQL-based analytic tools like Looker.

And that's the interface that can be driven by a data contract.

Expand full comment

Ivo Albrecht

Feb 17, 2024

Great article! In this setup, would the data platform team own any of the (business logic) or does it only provide the tools, practices and infrastructure for the decentralised data people?

Expand full comment

Reply (1)

David Jayatillake

Feb 17, 2024

Thanks Ivo! I guess typical example could be logic to determine marketing channel - a staff analytics engineer in Data Platform could own it but so could an AE or analyst in marketing. I think the marketing data people should own it with review and guidance from data platform.

Expand full comment

Ergest Xheblati

Feb 16, 2024

I fully endorse this idea!

Expand full comment

Anthony

Mar 13, 2024

Completely agree. I am a dev in a product team and I try to foster a "semantic layer" around our data by listing all the core attributes we use - in code, databases, JSON payloads etc - in our (Confluence) "attribute index" wiki page for our sub organisation. Each attribute is section heading in the page which handily gives every attribute their own unique HTTP anchor link, which fits in well with the "linked data" / RDF / JSON-LD notions of how information really ought to be defined also. This way we can provide the metadata about this attribute - when it was introduced, example values, screen shots of UIs/letters where it is displayed (sometimes with no label), what it is used for etc. We can also reference these attribute definitions again and again in other wiki pages detailing specific contracts for specific APIs etc. Using "attribute first semantics" you can gather people's understanding of what these attributes mean in a context-free crowd-sourced wiki style way without a specific integration or use case in mind, with very little governance, or input from data specialists.

That said, it could certainly be easier to entrench and stick to this approach if we had data specialists in our product teams. Originally I had wanted our business analysts to own our attribute definitions, but they struggled to care sufficiently to really take ownership. With some help from data specialists it might have been seen as less of a burden "just helping out fussy devs" etc.

On the subject of hanging semantic meaning on attributes rather than containers, payloads, tables etc. If I see an attribute called `familyName` I can write down what that means in a wiki page, regardless of whether it is used in an email, a database table, code, an API, JSON, a technical design document etc. It's an attribute of a real person, it is the "family name" often following a first name in western cultures etc, often synonymous with "second name" or "surname". What we can't say is whether it is mandatory/optional or even what form it's values might take. That can be buttoned down more specifically in individual API contracts. Even so people using this attribute anywhere should stick to those context free semantics.. If two parts of your organisation both use `taxCode` in different ways, they could each have their own "attribute index" wiki pages, where `taxCode` is defined differently in each, and they could even cross reference each other to help avoid confusion.

To me the difference between defining attributes first and contracts/payloads later is the difference between data and information. The word "data" literally means (in Latin) "that which was given (when you asked)". The question you asked usually gives meaning to the answer. The word "information" literally means (in Latin/Greek?) "shaping inside (the mind)". This is where the attributes themselves teach you what they mean, so you can make sense of it wherever you find it, even if you had found that information without any specific question in mind, i.e. by browsing code, databases, logs, etc.

Back to the subject of your blog post - in my organisation we have data specialist teams and I could never figure out what they did. They don't seem interested in helping product teams define the data semantics better. Now I see that they are very busy tring to ingest the far from nutritious data sausages our entire organisation is feeding them, and that seems like a very difficult task.

Expand full comment

davidj.substack

We don’t need data contracts