2024 - The year of the carrot
Can the promise of AI provide the return required for critical data infrastructure investment?
In the past, when we tried to justify spending on data infrastructure work, there was no juice in it for our stakeholder teams.
Data teams didn’t always quantify what the benefits were for stakeholders, partly because it was genuinely difficult to do. It is hard to say: "We’ll be able to deliver 28 sprints of work in 24 sprints’ worth of time, for sure." Even when they did, a stakeholder who needed something now would choose immediacy over long-term benefits. Chances are, they wouldn’t even be around by the time the better infrastructure paid off. A small carrot offered at the end of a rainbow.
Data leaders with an engineering background may also struggle to form a business case; building one isn’t something their careers or education prepared them for.
Instead, data teams would warn stakeholders of impending doom… the data stack will fall over! The proverbial stick. "You’ll have to go days without your reports while we fix or handhold the data pipelines!" The stakeholders were thinking: "I already wait a couple of days for you to turn things around… in practice I may not end up waiting any longer." They didn’t know what the proverbial stick meant in reality; they didn’t know how bad bad can be. Everyone says data infrastructure is bad, all the time, at every single company they’ve ever worked at. So, if it’s always bad - why care?
The promise of safe, quick and easy access to data with AI could be the carrot we were waiting for. And we don’t have to boil the ocean. Imagine Product wants access to a set of key metrics based on telemetry data, using AI. We can lay out a project plan to deliver the pipelines and build the data model needed. It doesn’t need to be a giant data lake project, where we collect all the data first without knowing what it’s for 🤦. We can keep a very narrow focus, deliver value as soon as possible, then move on.
If stakeholders keep trying to tack things on later, we have a real stick - they won’t be able to get good answers from the AI any more. We need the investment in the pipelines and data model to make changes properly. We can’t just get Teddy from Finance to import a CSV from his spreadsheet into the DWH - cleaning the data by eye beforehand, changing the columns each time - and expect the pipelines to stay up and feed the data models well.
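To make that concrete, here’s a minimal sketch of what replaces Teddy: an ingest step that checks the file against an agreed schema instead of trusting anyone’s eye. Everything here - the column names, types and function - is hypothetical, and pandas is just one way to do it.

```python
# Hypothetical sketch: reject a drifting spreadsheet export loudly,
# instead of letting it silently break the data model downstream.
import pandas as pd

# An agreed contract for the Finance export (illustrative names/types).
EXPECTED_COLUMNS = {
    "invoice_id": "int64",
    "amount_usd": "float64",
    "posted_date": "datetime64[ns]",
}

def load_finance_csv(path: str) -> pd.DataFrame:
    df = pd.read_csv(path)

    # Fail fast if a column was renamed or dropped this month.
    missing = set(EXPECTED_COLUMNS) - set(df.columns)
    if missing:
        raise ValueError(f"Missing expected columns: {sorted(missing)}")

    # Coerce dates; malformed "cleaned by eye" values raise here.
    df["posted_date"] = pd.to_datetime(df["posted_date"], errors="raise")

    # Check that every column landed with the type the model expects.
    for column, dtype in EXPECTED_COLUMNS.items():
        if str(df[column].dtype) != dtype:
            raise TypeError(
                f"Column {column!r} is {df[column].dtype}, expected {dtype}"
            )
    return df
```

The point isn’t the specific checks - it’s that the contract lives in the pipeline, not in Teddy’s head, so the data model keeps getting fed even when the spreadsheet changes.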
Data model and pipeline performance meant very little to stakeholders before. Now they mean: "Your AI will be dumb and unable to answer your questions well or quickly."
The problem before was that the analysts who used to stand in for the AI were incredibly adaptable primates who, over time, could just about manage with the lack of a data model and poor pipelines. They could churn out dashboards under difficult circumstances. However, they needed to eat, they needed to sleep, they needed “free time” away from work, they took vacations at inopportune moments, they decided to leave for better pay, the wrong ones were laid off by accident, they didn’t respond quickly enough to our Slack messages, they made people fill in Jira tickets (oh, who made this awful Jira thing!).
The performance gain from investing in infrastructure around these analysts was never clear, though. They still got the job done - perhaps a bit slower than otherwise, but it still worked. Justifying a multi-month infrastructure project on that basis didn’t make sense.
Now, though, the AI just stops working… you'll have to wait for an analyst! Yes, that means days and not sub-minute responses like you're used to.
This data model thing is beginning to sound important… maybe having a business ontology wasn’t pie in the sky. Oh, we can’t get Teddy to be a human data pipeline any more; we need this thing to be robust, to be on time, to be real-time (I think, although I’m not exactly sure what streaming is… I thought that was to do with Netflix), and we need testing. I can’t go back to making Jira tickets or bothering the overly amenable analyst in Product and waiting days for my data. It needs to be like ordering an Uber: I don’t want to have to be nice to anyone to get my data, I just want it to work, quickly and cheaply, and I don’t want to have to be grateful for it. Well, if that requires investment in engineering, we can afford more nerds on the 2nd floor.
When Looker promised to “end the data breadlines”, it was on the basis that our semantic layers would resemble “mind palaces”, not the Burrow. In reality, they’ve ended up a mess of extension upon extension, duplication upon duplication - data teams aren’t given the time to maintain them, new features are prioritised above all else, and nobody pauses to ask whether refactoring is needed. Let’s get remodelling - there are incentives now. When a stakeholder asks for AI on their data, they’re also asking for a clean, well-documented semantic layer and data model - they just don’t know it.