This is the second post in a series on Data Contracts - last week’s post:
As mentioned at the end of the last post, in the next posts in the series I will go through:
What kind of data can have Data Contracts applied
Emerging tools in the space
A roundup of other posts and articles written about Data Contracts, how they align with my thinking, and whether they change it
What kind of data can have Data Contracts?
Andrew’s post was focused initially on data coming from CDC (Change Data Capture) streams on OLTP databases, which is structured tabular data (with document stores like Mongo as exceptions to this). So, clearly, it’s possible to have Data Contracts on structured data. In fact, because the structure is known by definition, that part of the Data Contract (the Format and Structure section in my previous post) is defined by default: we know there will be a list of fields which must be included, with no nesting.
I would argue that extending Data Contracts to CDPs and event tracking systems (eg Rudderstack, Snowplow, Segment) is a logical next step. Most of these systems output events in JSON format, regardless of how the data is stored at rest.
These event systems also give Application Engineers even greater flexibility than most OLTP databases to change the data schema - it’s easy to add or omit fields, or change the data structure entirely, without breaking the application. However, with this greater flexibility comes a greater likelihood that changes will break data pipelines downstream. Semi-structured data like this is often key telemetry on an app or website and is in just as much - or more - need of Data Contracts.
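To make that concrete, here’s a minimal sketch of my own (not taken from any of these tools - the event and field names are made up) of what enforcing a Contract on a JSON event could look like, using Python’s jsonschema package:

```python
# Toy sketch: validating a tracked event against a Contract expressed as
# JSON Schema, via the third-party `jsonschema` package.
# Event and field names here are invented for illustration.
from jsonschema import validate, ValidationError

PRODUCT_VIEWED_CONTRACT = {
    "type": "object",
    "required": ["event", "user_id", "product_id", "timestamp"],
    "properties": {
        "event": {"const": "Product Viewed"},
        "user_id": {"type": "string"},
        "product_id": {"type": "string"},
        "timestamp": {"type": "string"},
    },
    # Extra fields are tolerated; missing or mistyped required fields are not.
    "additionalProperties": True,
}

def check_event(event: dict) -> bool:
    """Return True if the event satisfies the Contract, otherwise log and return False."""
    try:
        validate(instance=event, schema=PRODUCT_VIEWED_CONTRACT)
        return True
    except ValidationError as err:
        print(f"Contract violation: {err.message}")
        return False

# An engineer quietly renaming product_id to productId gets caught here,
# rather than silently breaking a downstream pipeline:
check_event({"event": "Product Viewed", "user_id": "u_123",
             "productId": "p_456", "timestamp": "2022-10-01T12:00:00Z"})
```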
The most common forms of unstructured data produced today are text, images, audio and video. I think Data Contracts could be applied here, too. If you think of an image as a rectangular layer of pixels (a video being a cuboid of pixels, or an array of images), a Data Contract could determine the following (sketched in code after this list):
The dimensions of the layer (resolution)
The detail of colour representation (eg 8-bit or 16-bit RGB)
Metadata fields (eg location, device, timestamp)
The encoding standard (eg PNG, JPEG)
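As a toy illustration of my own (not from any existing tool - the threshold values are invented), checking an image against such a Contract might look like this, using Pillow:

```python
# Toy sketch of a Data Contract check on unstructured (image) data, using Pillow.
# The contract values below are illustrative assumptions, not a real standard.
from PIL import Image

IMAGE_CONTRACT = {
    "min_width": 1024,                   # resolution of the pixel layer
    "min_height": 768,
    "allowed_modes": {"RGB", "RGBA"},    # colour representation
    "allowed_formats": {"PNG", "JPEG"},  # encoding standard
}

def contract_violations(path: str) -> list[str]:
    """Return a list of Contract violations for the image at `path` (empty list = compliant)."""
    violations = []
    with Image.open(path) as img:
        if img.width < IMAGE_CONTRACT["min_width"] or img.height < IMAGE_CONTRACT["min_height"]:
            violations.append(f"resolution {img.width}x{img.height} is below the minimum")
        if img.mode not in IMAGE_CONTRACT["allowed_modes"]:
            violations.append(f"colour mode {img.mode} is not allowed")
        if img.format not in IMAGE_CONTRACT["allowed_formats"]:
            violations.append(f"format {img.format} is not allowed")
    # Metadata fields (location, device, timestamp) could be checked similarly via EXIF.
    return violations
```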
SaaS and OSS Tools in the space
I have previously mentioned that tooling for data quality is relatively thin on the ground. This is still true, but we’re now seeing new entrants into the space, with features related to Data Contracts.
SaaS Tooling
Avo is a data governance platform that error-proofs your analytics events and speeds up implementation.
Avo provides a collaborative way to go through the life cycle of a telemetry change in the context of feature changes, usually involving a Product Manager as protagonist, plus SWEs and Data folks. This allows for a balanced discussion, perhaps like this script (maybe we could turn this into some kind of live performance art at Coalesce 🤼):
[In the context of a new feature in the user interface where a user can click on a product to see more info.]
Analytics Engineer: “I want the product blob in the event to contain: product_id, product_slug, price, stock_status, shipping_cost, promo_code”. They would ask this knowing that these are the desired dimensions for someone wanting to use the data for reporting.
SWE: “product_id, product_slug and price are fine, but size data isn’t available until the user clicks on the size drop-down, and the promo code isn’t available in this context, as the product API just provides the current price. We could choose to load the size data early, but then we may make an unnecessary call to the product API - increasing load. To find the promo code, we’d have to call the promo API, which is entirely additional work that isn’t required to deliver the feature.”
Product Manager: “I know that not having the sizes on the initial product click event will mean we’ll have to do some additional AE work to bring these details in from later events, but we have to be mindful of load on the product API. However, we need to know if a product is on promo to understand its effect on performance, so we’ll have to add the call to the promo API.”
SWE: “We have a record of product price state changes in the data lake - can’t we use this to determine if a product was on promo at the time the user clicked on it?”
AE: “We can, but this is actually difficult and expensive to do after the fact - it’s much better to add the data at the time. We could spend a lot of money on Snowflake credits if we have to join a state change table to the user events.”
PM: “I don’t want to inflate our Snowflake costs unnecessarily. We already spent $4k on one query last week. Let’s do the work to make sure we have the promo data at the time of the event, this way we’ll be able to split by this in Amplitude too. However, stock status will be known on the next event if and when the user clicks the size drop down.”
This discussion, which includes some negotiation, is essentially how the Data Contract is derived. From the thread, you know that the product click event must have user_id, product_id, product_slug, price and promo_code. The type constraints can be defined between the AE and SWE, without the PM having to be involved, and this then becomes the technical form of the Data Contract. From the context of the thread, a lot of the SLA and RACI context is known by default, given who is included in the conversation and who is doing what.
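To sketch what that technical form might look like - only the field names come from the thread, the types and defaults are my assumptions - it could be as simple as a Pydantic model that events are validated against before they land anywhere:

```python
# Sketch of the technical form of the product click Contract derived above.
# Field names come from the discussion; types and defaults are assumptions.
from typing import Optional
from pydantic import BaseModel, ValidationError

class ProductClicked(BaseModel):
    user_id: str
    product_id: str
    product_slug: str
    price: float                      # current price from the product API
    promo_code: Optional[str] = None  # None when the product is not on promo
    # stock_status intentionally omitted: per the PM, it arrives on the
    # later size-drop-down event instead.

# An event missing an agreed field fails validation before reaching the warehouse:
try:
    ProductClicked(user_id="u_1", product_id="p_9", product_slug="blue-shirt")
except ValidationError as err:
    print(err)  # complains that `price` is missing
```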
Avo then has an Inspector tool to test whether events generated in dev, test and prod environments meet the Contract. Personally, I really like the approach Avo takes and it’s as close to a Data Contracts-aaS platform as exists. However, it does take additional instrumentation from engineering teams. Even at Metaplane, where I have considered implementing it, I’ve had questions from engineers wondering whether it’s overkill for an org at our stage, etc… which I’ve accepted.
Segment Protocols
At Metaplane, we use Segment as our CDP (it’s a real CDP, not a reverse-ETL tool with a wig on 🙊). Segment has a relatively new feature called Protocols, where you can provide a tracking plan for your events. Segment can then inspect events for compliance, ie test the Contract. Here, there isn’t the collaborative workflow Avo provides - the Contract is agreed elsewhere and then put into the tracking plan. It’s more likely that there won’t be consensus on a Data Contract put in place through the tracking plan than if Avo had been used. However, it has the excellent benefit of not needing extra implementation in applications.
We have opted for this solution at Metaplane - I’m yet to use it, as I’m working with our engineers to decide which of our many events we need to have in our tracking plan. I’ll share more on this once we’ve had a go at implementation. Segment is known to be on the more expensive end of CDP/event tracking systems, and using Protocols on top of this is additional lock-in. However, using Segment is heavily ingrained in our processes, and we don’t have such a large event volume that the extra expense concerns us for now or the foreseeable future 🤷.
I’m looking forward to trying out Reflekt on the tracking plan, to generate dbt models directly from it. As I mentioned in my previous post, having a Data Contract should make analytics engineering work orders of magnitude quicker and easier… perhaps even partially automatable.
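I haven’t looked at how Reflekt actually does this yet, but to illustrate the “partially automatable” point: once the Contract exists as structured data, generating boilerplate dbt code from it is a small scripting exercise. A toy sketch of my own (definitely not Reflekt’s implementation):

```python
# Toy illustration (not Reflekt's actual approach): turning a tracking-plan style
# Contract into the column list of a dbt staging model. Names are invented.
TRACKING_PLAN = {
    "product_clicked": ["user_id", "product_id", "product_slug", "price", "promo_code"],
}

def staging_model_sql(event_name: str, source_table: str) -> str:
    """Generate a trivial dbt staging model that selects only the contracted columns."""
    columns = ",\n    ".join(TRACKING_PLAN[event_name])
    return f"select\n    {columns}\nfrom {{{{ source('segment', '{source_table}') }}}}\n"

print(staging_model_sql("product_clicked", "product_clicked"))
```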
Iteratively
Iteratively is now owned by Amplitude and is the closest alternative to Avo. It appears to let developers see, in their IDEs during development, whether they’ve broken the Data Contract! This is the right place to start validating a Data Contract: before the data is even created! However, it’s not clear to me exactly how this works, and the docs have been absorbed (hidden) into Amplitude’s.
I imagine, with time, it will integrate more closely with Amplitude and become a source of lock-in. If you’re already all-in with Amplitude, it probably makes sense over Avo, but if you’re not, I’d go for Avo.
Next and Final Part
I’ll look at:
OSS Tooling
A roundup of other posts and articles written about Data Contracts