I remember being a young analyst, spending most of my time either dealing with bad data or fulfilling simple data requests. One week, I managed to crank out 77 of these requests, which was a team record. It made me realise, even then, pre-Tableau and pre-dbt, that this kind of work could be automated… it was deterministic in nature. I would:
Understand the request, clarifying if necessary
Write a SQL query to generate the right dataset
Share the dataset and pivot/graph the data in Excel, if needed
Email over the report with a summary
If I was unsure whether the report met the need, I would check with the user
Doing this week in and week out, I realised it should be possible to automate these simple requests. The users could have an Excel front sheet (the user interface), where they could choose:
The things they wanted to measure
The dimensions they wanted to split these by
Any filters, such as date range or sector, to limit the scope of the data
These inputs could be used to generate a templated SQL query. They would be sufficient to define field selection, aggregates, table selection/joins and group by.
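To make that concrete, here is a minimal sketch of the idea in Python (the sales table, its columns and the filters are all hypothetical, and joins are left out): the front-sheet selections alone are enough to fill in a SQL template deterministically.

```python
# A toy version of the "Excel front sheet -> templated SQL" idea. All names here
# (the sales table, its columns) are hypothetical; joins are omitted for brevity.
from datetime import date

def build_query(measures: dict[str, str], dimensions: list[str],
                date_from: date, date_to: date, sector: str | None = None) -> str:
    """measures maps column -> aggregate, e.g. {"amount": "SUM"}."""
    select_parts = dimensions + [f"{agg}({col}) AS {agg.lower()}_{col}"
                                 for col, agg in measures.items()]
    filters = [f"order_date BETWEEN '{date_from}' AND '{date_to}'"]
    if sector:
        filters.append(f"sector = '{sector}'")
    sql = (f"SELECT {', '.join(select_parts)}\n"
           f"FROM sales\n"
           f"WHERE {' AND '.join(filters)}")
    if dimensions:
        sql += f"\nGROUP BY {', '.join(dimensions)}"
    return sql

# "Total sales by sector for Q1" becomes one deterministic query:
print(build_query({"amount": "SUM"}, ["sector"], date(2009, 1, 1), date(2009, 3, 31)))
```

Everything in it is a lookup or a string template - there is nothing for the machine to interpret creatively, which is exactly what made it feel automatable.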
Yes, I’m describing building a self-serve BI tool using Excel and VBA. It’s not something I would recommend today, given the choice of tools now available.
A short interlude on what was holding us back - in short, data engineering
Most of the complexity preventing self-serve, as described above, lay in dealing with our brittle data pipelines. Our transformation processes weren't much more resilient - often requiring multiple rerun and debug cycles. The tools we had for ELT were poor and ill-suited to the task.
The rest of the complexity lay in knowing how this data mapped to the business. At the time, this was in my head: I was the semantic layer.
Even though I believed self-serve was possible at the time, we had to fill in some missing pieces to achieve it:
A way to get data reliably from production databases and third-party systems - the EL (Fivetran/Portable/Striim/Streamkap/Rivery etc.)
A way to transform data from sources into a useful data model - the T (dbt)
A semantic layer to abstract the data model away from data users, and to define business concepts in data once, consistently, with easy, safe access via APIs (dbt [Transform])
Query engines that handle moderate volumes of data well (Snowflake/BigQuery/Redshift/DuckDB/ClickHouse)
Loosely, you could describe this as a lack of maturity in data and analytics engineering. All of this tooling now exists, as shown in the brackets above, and the pieces integrate well with each other. As I have described before, there is still a desire for bundling to reduce stack size.
While the EL engineering work is now abstracted away from data teams, as is most of the management of query engines, the analytics engineering work of transforming data and defining semantics is code-based and non-trivial. Some are trying to streamline this work by providing no-code/low-code ways of doing analytics engineering; others provide packages that standardise analytics engineering for well-known sources.
Dreams and reality
One of the things I dreamt of, when I thought about automating the simple data requests above, was a machine that could understand natural language well enough that humans wouldn’t have to learn how to use a tool at all. This was really only a daydream - at the time, I couldn’t conceive of a machine handling natural language in the way that is now taking the world by storm.
My wife is a linguistics graduate and has spoken to me at length about the topics she studied, so I imagined that a more deterministic way to derive meaning from language would be possible, rather than the probabilistic approach that is popular today. Given the constraints languages must obey in order to function, I thought it would be possible to write rule-based software to understand them. In hindsight, though, I realise that even between people there is always some probability of misunderstanding and some nuance of interpretation (the legal profession exists in part because of this). Maybe it was inevitable that a probabilistic approach would win… it certainly looks that way now.
All self-serve BI tools work in a similar way to pivot tables, whether the user is dragging and dropping fields (Looker/Lightdash) or typing a sentence in a special syntax that has to contain known fields (Thoughtspot/Veezoo/Metabase). They all require the user to know which explore, workbook or model to use in order to find the metrics they want. Users can’t just express themselves as they naturally would to ask a question of the data, and they have to know where to ask. These tools don’t really understand language; they provide a syntax that allows a machine to interpret the request deterministically.
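To illustrate the point, here is a toy sketch (made-up explore and field names, not any vendor’s actual implementation) of how a constrained "search-style" request gets resolved: every term has to match a known field in a known explore, or the tool simply cannot answer.

```python
# Hypothetical explores and fields; real tools have much richer models, but the
# principle is the same: resolution is lookup, not language understanding.
EXPLORES = {
    "orders": {"measures": {"revenue", "order count"}, "dimensions": {"sector", "month"}},
}

def resolve(explore: str, phrase: str):
    model = EXPLORES[explore]                  # the user has to know which explore to pick
    measure, _, dimension = phrase.partition(" by ")
    if measure not in model["measures"]:
        raise ValueError(f"unknown measure: {measure!r}")       # no fuzzy interpretation
    if dimension and dimension not in model["dimensions"]:
        raise ValueError(f"unknown dimension: {dimension!r}")
    return measure, dimension or None

print(resolve("orders", "revenue by sector"))        # works: both terms are known fields
# resolve("orders", "how did we do last quarter?")   # fails: not expressible in the syntax
```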
Being able to use a pivot table is actually quite a high bar of assumed skill. When I learnt Excel as a graduate, pivot tables were considered an advanced feature. Many people in business would struggle with them, and would therefore struggle to confidently use BI tools that are rooted in them. This is why analysts still field so many requests to build dashboards and pull data, even with self-serve BI tools in place.
Self-serve BI tools today actually suit analysts more than the average business user, because analysts understand how to pivot data - using a tool like this is a no-code version of their normal workflow. They are a great way for analysts to quickly create content for less tech-savvy users, and for certain groups of stakeholders to self-serve.
Enter the LLM
There are huge drivers, from both data teams and stakeholders, for more automation. Analytics teams drowning in a sea of requests never get to show real value at the strategic or tactical level, and their stakeholders wait days, sometimes weeks, for small requests due to a lack of resource. In this era of tighter budgets, data teams are stuck between a rock and a hard place: they need to deliver value for the business, but they need to deliver the basics first.
I've already heard of whole data teams laid off - if they had product-market fit and were already delivering ROI, this may not have happened. Automation is possible - and needed - to deliver the basics, to allow data teams to focus on driving value.
Human data teams are not well suited to dealing with high-concurrency request flows. These kinds of workflows have previously only been solved, with humans, in call centres and through Uberisation - and both of those examples took significant technology investment to make the workflow efficient and to distribute it. Data has never had this investment before.
The approaches above were the only routes to self-serve until the post-GPT-3 era we are now in. I’ve written about approaches that use LLMs to convert natural-language data requests into SQL, loosely classified as text-to-sql, and a large number of companies are entering this space.
However, there are problems with text-to-sql… it doesn’t always give the right answer. LLMs have been known to “hallucinate” and, as I found in my previous post, they will come up with an answer whether it is correct or not - and in analytics, no answer is better than a wrong answer. Trust is King. The more complex and less clean the data model, the more likely this is to happen. If a self-serve analytics tool is only right 70% of the time, I don’t believe users will be confident enough to trust it. They will simply pester analysts to check the output each time it is used, and the analysts then have to debug possibly strange machine-generated SQL.
What would be even worse is text-to-sql that is right often enough to build trust, but interlaces plausible false answers among the good ones - data users could end up making the wrong decisions based on them! Essentially, with text-to-sql the LLM is generating a minimal semantic layer on the fly for every request, so it can vary from request to request! 😨
Data models are created by humans (at least for now). How they fit together and what they mean is also defined by humans, and when humans write transformation code and semantic layers, they encode this knowledge. If we rely on LLMs to infer that knowledge instead, we introduce uncertainty into metadata where we could have certainty. And if you want to argue that in many organisations this knowledge is lost or siloed, then I doubt your data quality and data model are good enough to get good results from text-to-sql anyway: catch-22.
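As a sketch of what that certainty looks like (hypothetical table, metric and dimension names), the knowledge humans already put into transformation code and semantic layers can be written down once as explicit metadata, and a request compiled against it produces the same SQL every time:

```python
# Human-authored semantic metadata: defined once, versioned, reviewed. Nothing below
# is inferred at query time. Table, metric and dimension names are hypothetical.
METRICS = {
    "revenue": {"expr": "SUM(amount)", "table": "fct_orders"},
    "order_count": {"expr": "COUNT(*)", "table": "fct_orders"},
}
DIMENSIONS = {"sector": "sector", "order_month": "DATE_TRUNC('month', order_date)"}

def compile_query(metric: str, group_by: list[str]) -> str:
    """Deterministically compile a metric request against the curated definitions."""
    m = METRICS[metric]                        # unknown metric -> error, not a guess
    dims = [DIMENSIONS[d] for d in group_by]   # unknown dimension -> error, not a guess
    select = ", ".join(dims + [f"{m['expr']} AS {metric}"])
    sql = f"SELECT {select}\nFROM {m['table']}"
    if dims:
        sql += f"\nGROUP BY {', '.join(dims)}"
    return sql

# The same request always compiles to the same SQL:
print(compile_query("revenue", ["sector"]))
```

Text-to-sql, by contrast, has to reconstruct the equivalent of those definitions from scratch on every request - which is exactly where the uncertainty creeps in.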
Is there another way?
Michael Irvine recently shared a Loom of Delphi, a prototype tool that turns text into semantic layer requests, which was featured in the Analytics Engineering Roundup:
Self serve for real this time - stable interfaces and dynamic engagement
It’s not really theoretical at this point how all of this plays out for data teams - Michael Irvine and Delphi have shown us what this looks like. Because data practitioners have a new tool in their toolbox - one that fits in with dynamic chat perfectly and whose strengths are tailor made to help compensate for areas where LLMs and chatbots fall short.
An interface that allows for the creation and curation of business metrics in a way that enables you to query them and trust that you’ll get the same answers every time - no hallucinations. That allows your data team to build a robust and stable interface for your data model on top of a mature transformation framework. The dbt Semantic Layer.
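To make the quoted idea concrete, here is a toy sketch of the pattern - my own illustration, not Delphi’s or dbt’s implementation: confine the fuzzy, language-model step to choosing among the curated metric and dimension names (like those in the compile_query sketch earlier), and ask a clarifying question rather than invent SQL when nothing matches.

```python
# Toy "guardrails" layer: the fuzzy step only chooses among known names; everything
# downstream (the SQL) comes from a deterministic compiler like compile_query above.
KNOWN_METRICS = {"revenue", "order_count"}
KNOWN_DIMENSIONS = {"sector", "order_month"}

def interpret(question: str):
    """Map a question onto curated names, or ask for clarification instead of guessing."""
    words = set(question.lower().rstrip("?").split())
    metrics = sorted(KNOWN_METRICS & words)    # stand-in for an LLM matching words to names
    dims = sorted(KNOWN_DIMENSIONS & words)
    if not metrics:
        return "Which metric do you mean? I know about: " + ", ".join(sorted(KNOWN_METRICS))
    return {"metric": metrics[0], "group_by": dims}   # hand off to the deterministic compiler

print(interpret("what is revenue by sector"))        # {'metric': 'revenue', 'group_by': ['sector']}
print(interpret("how happy are our customers?"))     # falls back to a clarifying question
```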
I was one of the first people to try out Delphi, and I thought what Michael had built had so much potential - the guardrails and rigour of a semantic layer with the accessibility of text-to-sql. I believe it could be part of how the analytics workflow is transformed for both data stakeholders and data teams. I believe it so much that Michael and I are going forward as cofounders, taking Delphi to every data team we can speak to!
Wish us luck on this journey!
While this will be a huge part of my life going forward, I don’t want this Substack to become a “buy Delphi” blog. So, we’ve set up a new blog to talk all things Delphi - follow along if you want to see what we’re building!
To the Metaplane crew, you have been awesome and I hope to remain friends for life! Metaplane is going from strength to strength and will continue to do so, as I transition from being full time staff to an advisor.
Kevin and I will continue to co-host our webinars on LinkedIn, including this week’s exciting episode on Data Contracts, with Ananth Packkildurai!