Five Nines, Three Nines, No Nines
Data needs service delivery parameters
Tech service providers often guarantee a number of nines: offering five nines, for example, means guaranteeing 99.999% uptime. Uptime is an example of a Service Level Indicator (SLI), a key metric for determining whether a Service Level Objective (SLO) is being met.
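To make the nines concrete, here is a minimal sketch of the arithmetic, assuming a 365-day year and treating any minute of unavailability as downtime. The function names are my own, for illustration only.

```python
# Illustrative arithmetic: how much downtime does each "number of nines" allow?
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: ignore leap years


def availability_for_nines(nines: int) -> float:
    """Availability as a fraction, e.g. 3 nines is roughly 0.999."""
    return 1 - 10 ** (-nines)


def allowed_downtime_minutes(nines: int, period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Downtime budget (in minutes) over the period for a given number of nines."""
    return period_minutes * 10 ** (-nines)


for n in (2, 3, 4, 5):
    budget = allowed_downtime_minutes(n)
    print(f"{n} nines: about {budget:.2f} minutes of downtime allowed per year")
```

Five nines leaves a budget of only a little over five minutes a year, while three nines allows nearly nine hours. That gap is worth keeping in mind when deciding what a data team could realistically promise.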
It's quite rare for any such guarantees to be given to or by data teams. They often just have to cope with outages and changes from upstream teams or vendors. Equally, they often provide no guarantee to the stakeholders or services that depend on the data they produce.
These guarantees of service to other parties are often called Service Level Agreements (SLAs):
A service-level agreement (SLA) is a commitment between a service provider and a client. Particular aspects of the service – quality, availability, responsibilities – are agreed between the service provider and the service user.
With the SLAs described above, someone is responsible for the service, someone is accountable for it, others are consulted about changes and many are informed about changes and outages. The process around creating the SLA defines what to do in the event things change or go wrong. These four roles (responsible, accountable, consulted and informed) also form the RACI framework.
A (data) team should publish an SLA congruent with the urgency and importance of tests that fail, and have ways to understand how frequently that SLA has been achieved.
In data, we have already talked about SLAs within the context of data contracts, as described here by Andrew Jones. Data mesh, as originally described by Zhamak Dehghani, defined some principles around responsibility and accountability.
Data mesh defines who is responsible and accountable for services which provide data and systems which process data: the data owners. At a previous company I worked for, every product team had its own SWEs who were responsible for the data their services and apps emitted and for getting it to our data lake bucket. This was due to the way we were organised in squads and tribes. We had data mesh by default, which shows how integral data mesh is to organisation and process. It's not really a technology framework; it's a people and process framework.
Data contracts define standards for what these data owners are specifically responsible for providing in terms of: timeliness, cadence, format, completeness, veracity, accessibility, definition. In the example above where I described the data mesh we had in place, we did not have data contracts in place.
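As a rough illustration of what such a contract might contain, here is a minimal sketch in Python. The `DataContract` class, its field names and the `orders` example are hypothetical, chosen to mirror the attributes listed above. This is not the format of any particular data contract tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DataContract:
    """Hypothetical contract a data owner publishes for one dataset."""
    dataset: str
    owner: str                        # accountable team
    definition: str                   # what the data means
    cadence: timedelta                # how often a new batch is emitted
    timeliness: timedelta             # max delay between period end and delivery
    required_fields: dict[str, type]  # format: field names and their types
    min_completeness: float           # share of expected rows that must arrive (0 to 1)
    veracity_checks: list[str]        # plain-language rules the values must satisfy
    accessibility: str                # where consumers can read the data


def is_timely(contract: DataContract, period_end: datetime, delivered_at: datetime) -> bool:
    """True if the batch arrived within the agreed delay after the period ended."""
    return delivered_at - period_end <= contract.timeliness


def is_complete(contract: DataContract, expected_rows: int, received_rows: int) -> bool:
    """True if enough of the expected rows arrived to satisfy the contract."""
    if expected_rows == 0:
        return True
    return received_rows / expected_rows >= contract.min_completeness


# Example values are invented for illustration.
orders_contract = DataContract(
    dataset="orders",
    owner="checkout-squad",
    definition="One row per confirmed customer order",
    cadence=timedelta(hours=1),
    timeliness=timedelta(hours=2),
    required_fields={"order_id": str, "amount": float, "channel": str},
    min_completeness=0.99,
    veracity_checks=["amount is never negative"],
    accessibility="data lake bucket",
)
```

Writing the terms down as data, rather than leaving them in someone's head, means both the owner and the consumers can check them automatically and point to the same definition when something breaks.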
We often collaborated with the product teams who were making or changing data, but they also made changes without consulting us at times. This could lead to breaking changes in our processes (a large part of why I put a data observability tool in our stack):
A field in a JSON structure could be removed or changed without us knowing
A process which ran on a schedule might fail and not emit the data we needed
A categorical column could be grouped up or split down, further breaking historical trend data
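Breakages like the removed field and the regrouped category can be caught with fairly simple checks on incoming records. The sketch below assumes JSON records and uses hypothetical field names and category values (`order_id`, `amount`, `channel`). It is only meant to show the shape of such a check, not how any specific observability tool works.

```python
import json

# Hypothetical expectations for one incoming record type.
EXPECTED_FIELDS = {"order_id": str, "amount": float, "channel": str}
KNOWN_CHANNELS = {"web", "ios", "android"}


def check_record(raw: str) -> list[str]:
    """Return a list of human-readable problems found in one JSON record."""
    record = json.loads(raw)
    problems = []

    # Detect fields that were removed or whose type changed upstream.
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"wrong type for field: {name}")

    # Detect categories that were regrouped or split into unseen values.
    channel = record.get("channel")
    if isinstance(channel, str) and channel not in KNOWN_CHANNELS:
        problems.append(f"unknown category in channel: {channel}")

    return problems


good = '{"order_id": "a1", "amount": 10.5, "channel": "web"}'
bad = '{"order_id": "a1", "channel": "mobile"}'
print(check_record(good))
print(check_record(bad))
```

A check like this only tells you that something changed. Without an agreed contract it cannot tell you whether the change was intended or who should fix it.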
Randy Au's post this week (https://twitter.com/randy_au?s=21&t=X4Au4AdHpkbxDKZzwjJ4hA), covering the difficulty of data pipelines in covid reporting, is a great example of how different teams who touch data at various points of a DAG, but act independently, can cause issues.
Thinking of data as a service or a product, with associated SLA/SLO/SLI/RACI, using frameworks like Data Mesh and Data Contracts, is a way to make data services as reliable as possible. This is not to say things won't go wrong... they will, but it should be much less often with an intentional approach to service throughout the data life cycle. It should also avoid the needle-in-a-haystack problem that many data practitioners are used to dealing with: if data owners look after the data they produce with monitoring and testing, as happens for a service, problems are much more easily pinpointed.
High risk changes are known and stakeholders to the right of the DAG can be informed ahead of time. Teams which are dependent on the associated data can be consulted as to whether or not the risks are acceptable.
An example could be a credit risk team who are dependent on a specific data pipeline for their risk score models. A team which is the data owner of the source data of that pipeline wants to make a change that could cause an outage for a few hours. This would prevent the risk score model from functioning during this time and stop the business from approving credit. This has an associated cost which can be brought into the discussion about the risks of the change. If the risk is too high, then extra care can be built into the deployment process for the change. This has extra cost too, and is brought into the balance for the decision.
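The trade-off in this example can be put into rough numbers. The sketch below compares the expected cost of two ways of deploying the change. Every figure in it (outage probabilities, hours, cost per hour, extra deployment cost) is invented purely for illustration, and a real decision would involve more than expected value.

```python
from dataclasses import dataclass


@dataclass
class DeploymentOption:
    """Hypothetical way of rolling out the upstream change."""
    name: str
    outage_probability: float  # chance the change causes an outage (0 to 1)
    outage_hours: float        # how long credit approvals would be blocked
    extra_cost: float          # added cost of a more careful deployment


def total_expected_cost(option: DeploymentOption, cost_per_outage_hour: float) -> float:
    """Expected outage cost plus any extra cost of the deployment approach."""
    expected_outage_cost = option.outage_probability * option.outage_hours * cost_per_outage_hour
    return expected_outage_cost + option.extra_cost


# Invented numbers: what an hour without credit approvals might cost the business.
COST_PER_OUTAGE_HOUR = 10_000.0

plain = DeploymentOption("plain deploy", outage_probability=0.2, outage_hours=3, extra_cost=0.0)
careful = DeploymentOption("careful deploy", outage_probability=0.02, outage_hours=3, extra_cost=2_000.0)

for option in (plain, careful):
    cost = total_expected_cost(option, COST_PER_OUTAGE_HOUR)
    print(f"{option.name}: expected cost of about {cost:,.0f}")
```

With these made-up numbers the more careful deployment comes out cheaper overall, but the point is the conversation: the data owner and the credit risk team can agree on the inputs together before the change goes out.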
Whatever is decided, any subsequent issues aren't unexpected or unknown. A quick response is possible because the right people are available and ready to act. The strength of the RACI model is evident.
If the worst happens and there is an incident, most of the understanding for the debrief is already in place. Who, what, when, where, how and why are already known and agreed before the incident. The risk appetite and the decision to accept the risk were also agreed beforehand. With data, there are permanent implications for any incident: you can't go back and change the past if tracking is broken or data is lost. Depending on what data is affected, you may choose to have a lower risk acceptance level than you would from a SWE perspective.
The approach I have described above is mostly focused on mid- to large-sized organisations that have multiple teams working together. Early startups can often work without this level of governance, as the whole company knows what's being worked on, what was released in the last few hours and exactly who is responsible and accountable for what. The cost of any issue is often lower and stakeholder tolerance is higher (they know they have bought a product that will change fast and occasionally have bugs). The frameworks become necessary when organisational information osmosis takes too long due to the scale of the organisation and when the number of human interface points is high.