When I heard that Coalesce 2022 was going to be in person… I knew I had to be there. I also felt I should try to serve the community in the best way that I could. As I had recently started this substack and I had been on a panel at Coalesce 2021, it felt like giving a talk would be the logical next step. I’m an introverted person, so this idea is well out of my comfort zone, but perhaps good for me and certainly a way for me to give back to the data community from which I have so greatly benefited.
However, I had no idea what to talk about! As you can see from my substack, I don't have a problem wading into most data topics (except for the black box of ML, I'll leave that to Kevin Hu). So I reached out to Anna Filipova - Director of Community at dbt Labs - to ask what would be helpful for me to speak about.
Anna asked me if I could speak about this topic:
Something we're hearing from folks in the Community that can be a helpful prompt -- how do I know I'm doing data modeling or analytics engineering well?
My immediate thought was: you know you're doing it well when you're having the impact and feeling the benefits of the discipline. This is how I came to the title: "The Return on Analytics Engineering". Looking back at Anna's message 5/6 months down the line, I realise I glossed over the data modelling part. I think this is naturally because of how opinionated people are regarding data modelling - there is the Kimball Klan, the Inmon Inclusion and the Data Vault Diaspora. I truly don't belong to any of these - I like to sit on the fence. My answer is: "It depends". I'll let them duke it out over who's right and wrong, while I do what I think is best at the time.
I aimed to explain to orgs new to the discipline, and those who are a bit further down the line, what the impacts of different aspects of the discipline are and what value they generate for an org. It's clear why I would explain this to an org just starting out with Analytics Engineering or considering it, but many orgs with established AE disciplines may also lose sight of what they were trying to achieve in the first place.
Sometimes they may have even started out in Analytics Engineering in a scrappy way, just trying to deliver the basics, without considering some of the higher level objectives. It's also very easy to become a bit embattled when working in the discipline. Data teams, often being cost centres, are underfunded and AE is more often centralised away from profit centres. AEs can end up spending their time firefighting, despite the efficiencies gained from the discipline.
I grouped my talk into three sections: Collaboration, Quality and Value.
Collaboration
Readability and Navigation
Prior to Analytics Engineering - whether you practise it with dbt or something else (I know there were some homegrown alternatives) - the norm was to use stored procedures to build data models.
There was no inherent lineage or modularity to this way of working. It meant managing long stored procedures which performed tasks serially. New joiners would take a long time mapping the entities and tables that explain org processes and outcomes to the different paths through these stored procedures. I think this is one of the key distinctions between the AE and BI dev disciplines: AEs build modular, easily readable, discoverable models with clear lineage.
Fundamentally, this allows new joiners to become productive more quickly and for everyone in the org to understand the provenance of their data more easily. In the BI dev era, stakeholders and data consumers didn't know anything about the provenance of their data - it would be in the black box of those stored procedures.
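As a sketch of what that modularity looks like in practice (the model and column names here are hypothetical, not from any particular project), a dbt model references its upstream models with `ref()`, which is what gives you lineage for free:

```sql
-- models/staging/stg_orders.sql
-- A staging model: light renaming/casting over a single source table.
select
    id as order_id,
    customer_id,
    created_at,
    amount_usd
from {{ source('shop', 'orders') }}

-- models/marts/fct_orders.sql
-- A mart model: built from staging models via ref(), so dbt knows the DAG.
select
    o.order_id,
    o.customer_id,
    c.customer_segment,
    o.amount_usd
from {{ ref('stg_orders') }} as o
left join {{ ref('stg_customers') }} as c
    on o.customer_id = c.customer_id
```

Because every dependency goes through `ref()`, dbt can render the full lineage graph, and a new joiner can trace any column back to source instead of reading through a monolithic stored procedure.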
Extended and updated by others
Prior to AE, it would only be BI devs who would work on data models, or the data models would be in some other silo like Salesforce or SWE tooling. With AE in place, anyone can contribute. It enables data mesh, as SWEs and other teams can look after their data from source to presentation. I've seen this in the past, where teams which needed to operate independently could do everything they needed, but within a framework that others could also benefit from. This allows for a single source of truth, as different teams collaborate on the same data rather than duplicating it.
Business Continuity
Everything that is in production is in the main branch in git.
Everything that is in flight is in a branch or is in infant state, such that starting it again is not much work.
Anyone can pick up the stack if they know the tooling and understand how it works.
Agile isn't necessarily strictly implemented, but some form of it - or Kanban, etc. - is in place: a formal process for deciding what to work on. It can be overridden, but the process minimises context switching.
I feel that knowing what to work on, and being able to prioritise the work that is most needed, is part of business continuity. Otherwise, important work to keep infrastructure and core models up can be deprioritised in favour of new features. This is part of the SWE workflow that has rightly been inherited by AE.
What's gold and what's not
AE teams know what data is important to their org and it's a fraction of all the data out there. We're talking about the data that the company at large is run on - it's not experimental and it doesn't change that often. AEs are able to keep this data and its pipelines stable and reliable.
This allows the data team to go at two speeds - slow and process driven for gold data and fast and loose with experimental data. It's not appropriate to always be one way or the other.
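One lightweight way to encode the two speeds in dbt (the project and tag names below are my own assumption, purely for illustration) is to tag models and run them on different schedules:

```yaml
# dbt_project.yml (fragment)
models:
  my_project:
    marts:
      +tags: ["gold"]          # stable, process-driven, on the strict schedule
    scratch:
      +tags: ["experimental"]  # fast and loose, run ad hoc
```

Production jobs can then run `dbt build --select tag:gold` on the reliable cadence, while `tag:experimental` models are built whenever someone is iterating on them.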
Increased Development speed and output
It's quicker and easier to ship changes. Changes feel safe because of the CI/CD processes and tests you have in place. You can do much more with fewer people, and you ship in the confidence that you've tested changes to the appropriate level of rigour.
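As an illustration of the kind of CI that makes shipping feel safe (the workflow file, adapter, and artifact path are assumptions, not a prescribed setup), a pull request can build and test just the models affected by the change:

```yaml
# .github/workflows/ci.yml - hypothetical dbt CI job
name: dbt-ci
on: pull_request
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install dbt-core dbt-snowflake  # adapter choice is an assumption
      - run: dbt deps
      # Build modified models plus their downstream dependents,
      # comparing against the production manifest.
      - run: dbt build --select state:modified+ --state ./prod-artifacts
```

Running tests on only the modified subgraph keeps CI fast while still catching breakage downstream of the change.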
Quality
All-encompassing
All data modelling is done within the AE discipline whether it's done by the AEs or not, meaning there aren't siloed data models any more. BI tools don't do much other than present the data with some light casting or formatting.
You are able to have a single source of truth for data. I know that there are things standing in the way of this, like Amplitude/Heap/Mixpanel, where PMs make a schema out of vapour. However, generally speaking, you know how data should be modelled and what it means because it's done once and for all.
DRYness
This allows you to change business logic easily because it's defined once and centrally. If the same logic is defined in multiple places, say goodbye to any hope of data quality - there will always be a place you forget to look when changes are made. DRYness helps you have greater confidence in changes you make and fewer bugs as a result, leading to higher quality.
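One way dbt supports this DRYness is macros: business logic defined once and reused everywhere (the revenue definition below is purely illustrative):

```sql
-- macros/net_revenue.sql
-- Hypothetical: define "net revenue" once, reuse it in every model.
{% macro net_revenue(amount_col, refund_col) %}
    ({{ amount_col }} - coalesce({{ refund_col }}, 0))
{% endmacro %}

-- In any model that needs it:
select
    order_id,
    {{ net_revenue('amount_usd', 'refund_usd') }} as net_revenue_usd
from {{ ref('stg_orders') }}
```

If the definition changes, you edit one macro and every downstream model picks it up - there is no second place to forget.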
Tested and Observed
Tests are used to check boolean outcomes, i.e. this column should be unique/not null, referential integrity, etc. Observability is implemented to monitor scalar trends which fluctuate with seasonality, as well as longer-term trends. This allows for balance in your test suite - you can't test for every eventuality. Where it's certain what should happen, test; where it's not, observe with anomaly detection so you don't get caught out.
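The boolean-outcome tests described above map directly onto dbt's generic schema tests (model and column names here are hypothetical):

```yaml
# models/marts/fct_orders.yml
version: 2
models:
  - name: fct_orders
    columns:
      - name: order_id
        tests:
          - unique      # boolean: every order appears exactly once
          - not_null
      - name: customer_id
        tests:
          - relationships:      # referential integrity
              to: ref('dim_customers')
              field: customer_id
```

The scalar, trend-shaped checks - row counts drifting, metrics moving outside seasonal norms - are then left to an observability tool with anomaly detection rather than pass/fail tests.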
Data Shame is that sinking feeling you get when your stakeholder tells you there is a problem with your data before you knew about it. These incidents should be rare, and much less frequent than before AE was implemented.
Incident Management
Even with everything above in place, things will still go wrong. You will have a calm, blameless incident management process to deal with incidents and learn from them. While there isn't blame, there will be some pain: your team is trusted by the org, so there is trust to erode. That pain is a good motivator to incorporate learnings.
Gold depends on gold
The lineage and provenance of your gold data is well understood and governed and it never depends on some experimental source that marketing somewhat maintains. You can guarantee each step with SLAs and owners.
Your gold data is trusted; other data is not, especially when it disagrees with the gold data.
Value
Safety
Stakeholders feel safe to use the data without needing to double check everything with the data team or, worse yet, other teams. They trust that they would be told if there were any incidents, perhaps with some kind of service update.
Decisions made
Where appropriate, decisions should be made with data - not every decision should be a "data-driven" decision, especially when there isn't sufficient or relevant data. However, you can point to decisions that have been made with data.
You know which business processes are enabled by self-service data or data being operationalised from AE.
You won't have a team, or even team members, who are just data monkeys pulling data for others with no context.
Super Stakeholders
You have some adventurous and empowered stakeholders, enjoying the benefits of some of the advanced data models the AE team has delivered.
Stakeholders who have been around for a while - longer than AE has been at the org - will remember the mess beforehand.
Some of your stakeholders are so empowered that they think of novel advanced use cases to build, in partnership with AEs. BAU is rarely spoken about.
Proactivity
You're not running around stressed out and feeling undervalued. You can deliver the big OKRs and some of the rest. Your data models and services are a product - they aren't something hacked together.
As you proactively watch the new org OKRs, you anticipate your upstream needs and work with engineering teams to ensure data requirements are met, rather than retroactively fixing things - fixing that doesn't always happen at all when the workflow is reactive.
You have sufficient time to innovate and make your infrastructure better, so that you can then deliver more and have sufficient time to innovate and make your infrastructure better... and so on.
You can lose this state from time to time because life happens, but return to it. The implementation of the AE discipline has also baked in some resilience.
I have achieved most of these aspects at different times - rarely has it all been at the same time. We're all human and we have limited focus and time, but we have lofty goals.
Despite being nervous before the talk, I really enjoyed giving it and have been so grateful for all of the positive feedback I’ve had!
So far, Coalesce 2022 has felt like wandering around a village whose inhabitants are all data people. Everywhere you go, you get stopped by someone you know… there's a word for this: community.