Barr posted an interesting article titled "Data Stewards Have The Worst Seat At The Table" on LinkedIn.
Which member of the data team has the worst seat at the table?
5 years ago, Maxime Beauchemin wrote that this role was held by the data engineer.
Times have changed, and I can’t help but think today it might be the lovable, under-appreciated data steward who now shoulders this burden.
It’s hard to architect systems that can move and derive meaning from data at scale, but it might be even harder to ensure those systems are well-governed.
In his original article, Maxime wrote: “[data engineering] is something that people take for granted and bring their attention to when it’s broken or falling short on its promises.”
I think we can apply this to data stewardship now, too.
Stef of Avo.app followed up with a comment I agree with:
People should not be hired as data stewards (or data governors). It should not be a full-time position. Instead, aim towards proactive data management. Specifically:
1. Creating good data should be a cross-functional collaboration.
2. Implementing analytics should be a core part of the product release cycle. Just like teams make unit tests a part of the release cycle to make sure code (read: product) works, analytics should be a part of the product release cycle to make sure the product delivers value.
Tristan also released a great and relevant post on maintenance and associated pitfalls in data:
My take is that a big part of why data stewardship isn't much fun (regardless of who does it) is that we haven't adopted consistent approaches and tooling for quality assurance and monitoring. These have been present in SWE for over a decade, but are still nascent in Data, which is a less mature field. This also explains some of the maintenance burden behind the traps Tristan highlights.
Observability tools watch your system for data artifacts and tell you if activity and distribution have deviated from the expected. They take a statistical viewpoint and are mostly unsupervised, though they can take user feedback. Examples of alerts can include:
Freshness monitoring (why has this table not updated for 2 days?)
Uniqueness monitoring (eg on suspected primary keys)
Distribution monitoring (it looks like we've got a new marketing channel or one has disappeared)
Data Structure monitoring (the structure of this JSON event has just had a field removed, added, nulled, or its type changed, etc)
Observability tools suggest there could be a problem with quality. Monte Carlo1 and Metaplane.dev are examples; both aim to span the whole data observability area illustrated in the diagram above.
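To make the distinction concrete, here is a minimal sketch of the kind of check an observability tool might run under the hood: a freshness monitor that flags a table which hasn't loaded new rows recently. The warehouse client, table and column names are hypothetical stand-ins, and a real tool would learn the expected cadence statistically rather than use a fixed threshold.

```python
from datetime import datetime, timedelta, timezone


def latest_load_time(warehouse, table: str, loaded_at_column: str) -> datetime:
    """Return the most recent load timestamp for a table.

    `warehouse` is a hypothetical client with a .query() helper -- substitute
    whatever connection object you actually use. The column is assumed to be
    a timezone-aware UTC timestamp.
    """
    row = warehouse.query(f"SELECT MAX({loaded_at_column}) AS latest FROM {table}")[0]
    return row["latest"]


def check_freshness(warehouse, table: str, loaded_at_column: str,
                    max_staleness: timedelta = timedelta(days=1)) -> None:
    """Alert (here: just print) if the table hasn't received new rows recently."""
    latest = latest_load_time(warehouse, table, loaded_at_column)
    staleness = datetime.now(timezone.utc) - latest
    if staleness > max_staleness:
        print(f"ALERT: {table} is stale -- last loaded {staleness} ago "
              f"(threshold {max_staleness}).")


# Example usage (assuming `warehouse` is your connection object):
# check_freshness(warehouse, "analytics.orders", "loaded_at", timedelta(hours=6))
```

The same pattern extends to uniqueness, distribution and structure checks: query a statistic, compare it to what "normal" has looked like, and alert on deviation.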
Quality Assurance tools assert what "good" looks like and test whether this is the case. There are still relatively few off-the-shelf quality assurance tools, especially compared to the number of observability tools in the market. Example methods can include:
Managing the tracking workflow to ensure that complete and correct data is captured. eg Avo.app, Iteratively (acquired by Amplitude)
Testing event data against a data dictionary/JSON schema to validate that it is complete and of the right format and structure. This should deal with versioning of data too - data should be tested against the version of the schema that relates to it. It is usually accomplished with open-source or custom in-house tooling; Segment Protocols is one vendor tool that currently exists (a minimal validation sketch follows this list). Event versions should have major and minor increments, depending on the changes made:
Added a field: 0.0.1 to 0.0.2.
Changed a field from string to boolean: 0.0.2 to 0.1.0.
Changed the structure of the event, eg by putting an array in the middle of the structure: 0.1.0 to 1.0.0!
Making assertions (eg nulls, uniqueness, referential integrity) about tabular data and testing that these hold at the appropriate time, eg dbt tests and Great Expectations
Testing dev/test environment data against prod to ensure that merges to a codebase don't damage data quality. A typical example is an incremental dbt model changed in dev, where the change requires a full refresh: the historical data from before the point of the merge shouldn't change, and this needs to be tested. This is a pain at this point in time! Most Analytics Engineers basically eyeball their pre- and post-merge data, cloning the old data so they can go back if needed. As far as I know there are currently no vendors here, only partial homebrew solutions (a rough pandas sketch of this, together with assertion-style tests, follows this list)
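To illustrate the schema-testing point above, here is a minimal sketch using the open-source jsonschema library. The product_viewed event, the schema_version field and the in-memory SCHEMAS registry are assumptions for the example rather than any particular vendor's design; the 0.1.0 schema reflects the string-to-boolean change described in the versioning examples.

```python
from jsonschema import ValidationError, validate

# Hypothetical in-house schema registry, keyed by (event name, schema version).
# 0.0.2 -> 0.1.0 is the minor bump described above: on_sale changed from
# string to boolean.
SCHEMAS = {
    ("product_viewed", "0.1.0"): {
        "type": "object",
        "properties": {
            "event": {"type": "string"},
            "schema_version": {"type": "string"},
            "product_id": {"type": "string"},
            "price": {"type": "number"},
            "on_sale": {"type": "boolean"},
        },
        "required": ["event", "schema_version", "product_id", "price"],
        "additionalProperties": False,
    },
}


def validate_event(event: dict) -> bool:
    """Validate an event against the schema version it claims to conform to."""
    key = (event.get("event"), event.get("schema_version"))
    schema = SCHEMAS.get(key)
    if schema is None:
        print(f"ALERT: no schema registered for {key}")
        return False
    try:
        validate(instance=event, schema=schema)
        return True
    except ValidationError as err:
        print(f"ALERT: {key} failed validation: {err.message}")
        return False


# A conforming 0.1.0 event passes; an event sending on_sale as a string would not.
validate_event({
    "event": "product_viewed",
    "schema_version": "0.1.0",
    "product_id": "sku-123",
    "price": 19.99,
    "on_sale": True,
})
```

In a real pipeline, events failing validation might be routed to a quarantine table or dead-letter queue rather than just logged.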
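The last two points lend themselves to a homebrew sketch too. Below is a rough pandas version of what that might look like: dbt-style not_null/unique assertions, plus a check that rows from before a merge are identical in dev and prod after a full refresh. The table, column names and cutoff are made up for the example, and a real implementation would run against the warehouse rather than pulling data into pandas.

```python
import pandas as pd


def assert_not_null(df: pd.DataFrame, column: str) -> None:
    """dbt-style not_null test."""
    nulls = df[column].isna().sum()
    assert nulls == 0, f"{column} has {nulls} null values"


def assert_unique(df: pd.DataFrame, column: str) -> None:
    """dbt-style unique test, eg on a suspected primary key."""
    dupes = df[column].duplicated().sum()
    assert dupes == 0, f"{column} has {dupes} duplicate values"


def assert_history_unchanged(prod: pd.DataFrame, dev: pd.DataFrame,
                             key: str, cutoff_column: str, cutoff) -> None:
    """Check that rows from before the merge cutoff are identical in dev and prod.

    This is the 'full refresh of an incremental model' case: new rows may
    differ, but historical rows should not.
    """
    prod_hist = prod[prod[cutoff_column] < cutoff].sort_values(key).reset_index(drop=True)
    dev_hist = dev[dev[cutoff_column] < cutoff].sort_values(key).reset_index(drop=True)
    pd.testing.assert_frame_equal(prod_hist, dev_hist)


# Example usage with hypothetical pre/post-merge extracts of an orders model:
prod_orders = pd.DataFrame({
    "order_id": [1, 2],
    "amount": [10.0, 20.0],
    "created_at": pd.to_datetime(["2022-05-01", "2022-05-02"]),
})
dev_orders = prod_orders.copy()

assert_not_null(prod_orders, "order_id")
assert_unique(prod_orders, "order_id")
assert_history_unchanged(prod_orders, dev_orders, key="order_id",
                         cutoff_column="created_at",
                         cutoff=pd.Timestamp("2022-05-03"))
```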
I honestly don't know of any other vendors in the data quality space beyond those mentioned above - it's truly underserved. Many vendors say they are a data quality solution when they are actually a data observability solution. Maybe this is because it's such a difficult problem to solve.
When quality goes wrong
Where quality issues persist further to the right in the diagram above, the org suffers more broadly. If they persist to the decision stage, the whole company suffers, as stakeholders have to troubleshoot their own data or, worse yet, make the wrong decisions. ML models with data quality issues can make the wrong decisions at scale. Both types of suboptimal decision hurt revenue directly... pounds and pence, dollars and cents, lost due to poor data quality. Loss of trust, as Kevin of Metaplane.dev describes, is another cost:
Trust is easy to lose and hard to regain, especially in the world of data. Data teams are held responsible for maintaining a continuously expanding surface area of tables and dashboards and syncs. Unfortunately, executives only care about one dashboard: the one that they are currently looking at.
At the metric and feature part of the diagram, you get the needle-in-a-haystack problem, leading to data downtime as you figure out what happened and where. The issue could be with tracking or the application itself; it could also be a transformation or ELT problem... I've personally lost many a morning or afternoon dealing with these incidents. This is how I came to bring Monte Carlo into Lyst, as I needed to know about data incidents before my stakeholders told me about them. Having a data observability tool like this was transformational; we were able to be proactive about incidents and zero in on problems faster.
However, what we gained was the ability to understand where quality issues had occurred, and to do so more quickly. If there were issues in both the tracking data and the data warehouse, then the problem was with the application or tracking, not with our transformations. If it was just the data warehouse, then it was a problem with our transformations. While this encouraged quality, it didn't provide a way to assure it. Assertion-based testing, much like unit/integration/e2e testing in SWE, is the only way to assure quality in data.
Where quality issues are discovered at the tracking or EL(T) stages of the diagram, they are most easily pinpointed and resolved. ELT jobs can be re-run or reconfigured, and changes to tracking or applications can be reverted or fixed. Assertion-based testing of tracking, during development and the product management life cycle, further prevents tracking quality issues. As product managers make changes which affect tracking, the structure and allowed schema of the events can be defined up front. The events created can then be tested against this schema in development, testing and production. Where issues are discovered and resolved in development and testing, data with quality issues never enters production data stores! This is quality assurance at its best in data: at the very left of the DAG.
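To show what "at the very left of the DAG" can look like in practice, here is a small, hypothetical pytest sketch: sample events committed alongside a tracking change are validated in CI against the same kind of versioned schema as in the earlier example, so a schema-breaking change fails the build before any bad data is produced. The module, schema and event names are made up for the example.

```python
# test_tracking_events.py -- a hypothetical test module run with pytest as part
# of the release pipeline.
import pytest
from jsonschema import validate

# The registered schema for the event version this release produces (in
# practice this would be loaded from a shared schema registry).
PRODUCT_VIEWED_0_1_0 = {
    "type": "object",
    "properties": {
        "event": {"type": "string"},
        "schema_version": {"type": "string"},
        "product_id": {"type": "string"},
        "price": {"type": "number"},
        "on_sale": {"type": "boolean"},
    },
    "required": ["event", "schema_version", "product_id", "price"],
    "additionalProperties": False,
}

# Sample events committed alongside the tracking change.
SAMPLE_EVENTS = [
    {
        "event": "product_viewed",
        "schema_version": "0.1.0",
        "product_id": "sku-123",
        "price": 19.99,
        "on_sale": True,
    },
]


@pytest.mark.parametrize("event", SAMPLE_EVENTS)
def test_event_matches_registered_schema(event):
    # jsonschema.validate raises ValidationError, failing the build, if a
    # tracking change produces events that no longer match the schema.
    validate(instance=event, schema=PRODUCT_VIEWED_0_1_0)
```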
Two ends of the spectrum
Good tracking from automated data quality assurance leads to faster analytics, leading to faster and better decisions, which lead to changes shipped with more good tracking... you end up being constrained only by how fast dev teams can iterate, because the impact of each change is clearly seen very soon after it is implemented. Progress is as fast as possible within engineering constraints, capital efficiency is maximised, competitiveness is as strong as it can be. This is a fighting chance for success; the best companies in the space are operating this way.
Bad tracking leads to slow analytics, leading to poorer and fewer decisions being made (not making a decision is a decision), which leads to further changes with bad tracking, if any at all... The value of the tracking diminishes, as it's no longer seen as helpful. Decisions are made on shaky data or none at all, so they are made more slowly and with possibly worse outcomes, leading to a situation where opinions easily trump poor data. Changes cause no benefit, or actual detriment, and then deep debriefs are needed to decide what to do next or whether the changes should be scrapped altogether. Progress is slow, capital efficiency is low, competitiveness in the market is weakened. This is a recipe for death, especially in the current climate.
Few companies live at either end; most fluctuate along the spectrum over time, based on many different factors.
The impact of Data Quality compounds over time, both negatively and positively, and this is reflected in the value of the company.
1. Hours after I finished this post, Monte Carlo became a unicorn! https://www.montecarlodata.com/blog-monte-carlo-raises-135m-series-d-to-accelerate-the-rapid-growth-of-the-data-observability-category/ and https://www.forbes.com/sites/kenrickcai/2022/05/23/monte-carlo-series-d-unicorn-fourth-funding-round-in-two-years/?sh=17a6e8cf18fe
2. When software company Unity Technologies held its quarterly earnings call, it announced that it was slashing its yearly revenue target by $110 million, causing its stock to tumble by 37% that day. The company attributed the loss to bad customer data, which led to incorrectly targeted advertisements.
Love the emphasis on compounding at the end. All of this stuff can be a virtuous cycle or a doom loop. Sometimes it's hard to even realize how much better things could be when you're in a bad state because the whole industry is so new that most people don't have the experience of "good" in the first place.
Totally agree that observing != data quality assurance and that data quality is underrated.
Observability is great for getting coverage quickly as it is unsupervised. But it can only learn what normal data looks like. The risk is alert fatigue, as unsupervised approaches tend to produce too many false positives, or too much time is spent analyzing unimportant issues.
In software engineering it's common knowledge that test suites also take time to build and maintain. So I think you make a great point that data quality assurance also requires time. Data producers and consumers should participate in building up and maintaining the picture of what good data looks like.
At Soda, we believe that both approaches need to be combined: observability to get coverage quickly, extended with data quality checks. To make it easier to express what good data looks like, we created a dedicated language, SodaCL: https://docs.soda.io/soda-cl/soda-cl-overview.html