Why you need both Data Contracts and Data Observability.
As I have covered before in my series on Data Contracts, I believe they should be adopted by any company that wants to “win with data”. The companies that have already won with data have done exactly that.
Data Contracts can exist and be enforced in development (via an SDK or IDE extension), in CI/CD, and in-stream (Buz.dev). Are Data Contracts enough on their own to ensure data quality? In short, no.
Why not?
Data Contracts, and data testing, ensure that specific things about your data are true with regard to completeness, structure, types and SLAs. However, data can meet all of these criteria and still cause silent data issues. For example, if your marketing channel suddenly became set to organic 100% of the time, that wouldn’t necessarily break a Data Contract or a test. The event structure was intact, the fields were present and complete, the types were correct, and the event was delivered on time. The data it contained was simply wrong.
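To make that failure mode concrete, here is a minimal sketch (in Python, with hypothetical field names, not any particular contract tool) of a per-event contract check. Every event below passes, even though the channel distribution has silently collapsed to 100% organic:

```python
# Minimal sketch of a per-event contract check (hypothetical schema).
# Each check is boolean and answerable from a single event.

ALLOWED_CHANNELS = {"organic", "paid", "crm"}

def passes_contract(event: dict) -> bool:
    """Structure, completeness and types are all the contract can see."""
    return (
        isinstance(event.get("order_id"), str)
        and event.get("marketing_channel") in ALLOWED_CHANNELS
        and isinstance(event.get("gross_merchandise_value"), (int, float))
    )

# A silent data issue: tracking for 'paid' broke, so every event
# defaults to 'organic'. The contract still passes for all of them.
events = [
    {"order_id": f"o{i}", "marketing_channel": "organic",
     "gross_merchandise_value": 42.0}
    for i in range(100)
]

assert all(passes_contract(e) for e in events)  # contract: all green
share_organic = sum(e["marketing_channel"] == "organic" for e in events) / len(events)
print(share_organic)  # 1.0: wrong, but invisible to the contract
```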
I believe it would be a mistake to put thresholds and statistics into Data Contracts, such as the expected split of a field’s values (say, 33% organic, 33% paid, 33% CRM for marketing channel):
Firstly, because the contract would then have to be enforced in batch rather than at the level of a single request or event. That would make enforcing the contract far more difficult and complex, and it’s already hard enough.
Secondly, because natural data drift will occur which isn’t wrong, but would mean the contract is regularly breached, resulting in alert fatigue and a loss of trust in the contract.
Finally, because it’s just not explicit enough. A software engineer should be able to tell from the contract exactly what should be present and what values and types each field can take. If a rule can’t be evaluated at the level of a single event, it’s pretty much impossible for them to check it in their dev environment, and I believe it’s crucial for SWEs to know from their dev environments whether they’ve breached the contract.
Implementing Data Observability allows Data Contracts and tests to remain clear and concise. Data Contracts tooling and things like dbt tests should be used to assert known boolean truths about the data and schema. Data Observability shouldn’t be used to test these assertions: its algorithms may learn from history that what should always be true sometimes isn’t, and then accept a certain level of deviance.
Data folks, we do need to invest in our test suites and schema enforcement, and maintain them. Data Observability should never be expected to do this for you, but it will find silent data issues for you, where the outcome is not boolean but scalar in nature. For example:
An event or table with a foreign key that is null 8 to 12% of the time, where the percentage rises at weekends and in the run-up to Christmas. 15% is outside the confidence interval on a Tuesday in June, but not in December.
Where you received 500 events today instead of the usual 820 to 880.
Where the marketing channel split changed from the usual 20 to 30% organic to 60% (usually this means tracking of another channel, like paid, has failed, as organic is the default channel).
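A sketch of how an observability tool might catch the second example above: learn a normal range from recent daily event counts and flag a day that falls outside it. This uses a simple mean plus-or-minus three standard deviations band for illustration; real tools use more sophisticated, seasonality-aware models:

```python
import statistics

def is_anomalous(history: list[int], today: int, k: float = 3.0) -> bool:
    """Flag today's count if it falls outside mean +/- k stdevs of history."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return abs(today - mean) > k * stdev

# The usual 820 to 880 events per day, then a sudden drop to 500.
history = [820, 855, 840, 870, 830, 880, 845, 860, 835, 850]
print(is_anomalous(history, 850))  # False: within the normal band
print(is_anomalous(history, 500))  # True: well outside it
```

No test or contract had to hard-code what “normal” is; the band is derived from the data itself, which is exactly what keeps it out of the contract.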
These issues will be caught by Data Observability if they fall outside expected normal ranges, but they won’t be caught by testing and schema enforcement. You need both approaches to have any hope of high Data Quality.
This is why I believe Data Observability is necessary in every stack. You might think that’s a bit biased, given I work at Metaplane, but I joined Metaplane in part because I already believed this from my years as a Data Practitioner.
I have approached this topic before, when I wrote this post:
However, I’ve recently been making a greater effort to speak to more data leaders from all walks, backgrounds and geographies, and there is real confusion about what the approach should be. To tear another page out of the SWE’s guide to the galaxy: we need to do testing, schema enforcement and observability together, harmoniously.
Software has unit tests, integration tests, end-to-end tests, schema enforcement for APIs, and observability tooling (Datadog, New Relic, Splunk… there are many). Software Engineering is a much more mature field than data, and the approach I’ve seen there is that test suites are built internally, as they are too important and context-specific to buy external tooling for. I could also go into test-driven development (TDD), but that could be a whole book, and there are many options out there. Observability tooling, like the tools I mentioned above, is SaaS or OSS and paid for (OSS ain’t free). I think Data Quality tooling will follow the same pattern as it matures, with test suites being maintained internally and observability being paid for.
This is what makes Data Observability such an attractive segment for Venture Capital: it’s a tool needed by everyone that is a bad idea to build internally, which makes it a perfect fit for SaaS delivery. Observability in SWE is already a proven segment, with multiple unicorns and great exits into the public markets.
You might say that it’s too hard to do TDD for data, as it’s rare that you know exactly what the end goal is without building a fair bit first. I’d say that if we were to start smaller, at the unit test equivalent level for data, it would be possible. Imagine writing a query like this:
select
  date_trunc(event_date, month) as order_month,
  marketing_channel,
  count(distinct order_id) as orders,
  sum(gross_merchandise_value) as revenue
from revenue.all_orders
where 1=1
  and test_order = false
group by 1, 2
order by 1, 2
Trailing commas are right, leading commas are wrong. Yeah, I said it. If you don’t like it, come to data-folks.masto.host and vocally disagree with me 😆
Before writing this query, you could assert that you expect order_month to be no later than the current month and no earlier than the company’s founding date (the number of orders I have found from the future, or from 1970… 🤦‍♂️). You could also assert that you expect it never to be null.
You could then assert that you expect marketing_channel to never be null and to only take values within ['paid', 'organic', 'crm'] (this is contrived; I’m aware that we all have about 5,000 marketing channels). Better yet, you could assert that marketing_channel can only have the values stored in some central store (perhaps a data catalog) that everything uses to find possible marketing channel values.
You could assert that orders is an integer greater than or equal to zero. You could also assert that orders isn’t some unreasonably large value, but what “unreasonably large” means isn’t straightforward to define. This is where I’d let an observability tool track the field’s values and automatically define what normal looks like.
You could assert that revenue isn’t negative (in theory it’s possible, depending on the treatment of refunds in your orders data model, but hopefully not), and that it is a number with two decimal places. You could try to define thresholds for the avg, min, max and variance of revenue, but again, this is where observability shines over tests.
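Pulling those assertions together, here is a unit-test-style sketch in Python over hypothetical result rows from the query above (the founding date and sample values are made up for illustration):

```python
import datetime

COMPANY_FOUNDED = datetime.date(2015, 1, 1)    # assumption for illustration
ALLOWED_CHANNELS = {"paid", "organic", "crm"}  # ideally read from a central store

# Hypothetical rows returned by the query.
rows = [
    {"order_month": datetime.date(2023, 5, 1), "marketing_channel": "organic",
     "orders": 410, "revenue": 12345.67},
    {"order_month": datetime.date(2023, 5, 1), "marketing_channel": "paid",
     "orders": 395, "revenue": 23456.78},
]

def check_row(row: dict) -> None:
    # order_month: not null, not in the future, not before the company existed
    assert row["order_month"] is not None
    assert COMPANY_FOUNDED <= row["order_month"] <= datetime.date.today()
    # marketing_channel: not null, drawn from the allowed set
    assert row["marketing_channel"] in ALLOWED_CHANNELS
    # orders: a non-negative integer (the upper bound is left to observability)
    assert isinstance(row["orders"], int) and row["orders"] >= 0
    # revenue: non-negative, at most two decimal places
    assert row["revenue"] >= 0
    assert round(row["revenue"], 2) == row["revenue"]

for row in rows:
    check_row(row)
print("all assertions passed")
```

Note what is deliberately absent: the channel split and the revenue averages, minimums, maximums and variances. Per the argument above, those belong to observability, not tests.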
As you can see, observability and testing (including schema enforcement) go hand in hand, each applied where the other is not.
I have a thread open called “Quality Duopoly” if you’d like to interact on this subject. Substack app is now available for Android too!