I’ve written about the semantic layer… a lot. I think only people who build and sell them write about them more. I’ve been saying for some time that, for LLMs to deliver real value from data, we need to pair them with semantic layers. And I am just one voice of many:
> For bots to be successful query writers—and even harder, for them to be proper analysts that can answer questions about a business—LLMs will likely only be a small part of the solution. There will also have to be semantic models, methods for mapping vague requests onto those semantic models, frameworks for governing access control, ways to test if it said the same answer today as it said yesterday, and more.1 -

> LLMs and Generative AI systems are going to need a standard interface to access our organizational data.2 - Jason Ganz

> We can’t feed LLM with database schema and expect it to generate the correct SQL. To operate correctly and execute trustworthy actions, it needs to have enough context and semantics about the data it consumes; it must understand the metrics, dimensions, entities, and relational aspects of the data by which it's powered. Basically—LLM needs a semantic layer.3 -
These are just a few voices; there are many more. That’s why it’s “we” told you so, not just “I.”
The two driving themes for why we need to use semantic layers with LLMs are context and constraint. More and more teams are moving towards providing context through prompt engineering, doing things like giving the LLM a knowledge graph to interface with. This is definitely a step in the right direction, but you also need to constrain the LLM’s output.
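The context half can be as simple as prepending the model of the world to the prompt. A minimal sketch, assuming a hypothetical metrics catalog; the names and descriptions here are illustrative, not any particular product’s API:

```python
# Sketch: injecting semantic-layer context into an LLM prompt.
# The catalog below is hypothetical; in practice it would come from
# a semantic layer's metadata API.
metrics = {
    "total_revenue": "Sum of order amounts, net of refunds (USD).",
    "active_users": "Distinct users with at least one session in the period.",
}
dimensions = {
    "order_date": "Date the order was placed.",
    "country": "Customer billing country (ISO 3166-1 alpha-2).",
}

def build_prompt(question: str) -> str:
    """Prepend the available metrics and dimensions to the user's question."""
    lines = ["You may only use these metrics and dimensions:"]
    lines += [f"- metric {name}: {desc}" for name, desc in metrics.items()]
    lines += [f"- dimension {name}: {desc}" for name, desc in dimensions.items()]
    lines.append(f"Question: {question}")
    return "\n".join(lines)

prompt = build_prompt("What was revenue by country last month?")
```

Note that this only provides context; nothing yet stops the model from answering with a measure that doesn’t exist, which is where constraint comes in.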
Semantic layers provide both a knowledge graph and a constrained interface for an LLM. As a special type of knowledge graph, a semantic layer gives the LLM a model of the world, composed of entities and their measures and dimensions. LLMs have been trained on language, and language is composed of entities (nouns) and their dimensions and attributes (adjectives). Language also contains mathematical terms such as ‘total’, ‘running total’, ‘average’… as a result, knowledge graphs are a natural fit for LLMs.
LLMs are also quite good at SQL because so many documents and articles (think Stack Overflow) contain SQL. Still, SQL is a minuscule fraction of what LLMs have been trained on. An interface closer to natural language is a better fit for them, and a constrained one, which reduces the chance of errors and hallucinations, is better still.
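The constraint can be mechanical: have the LLM emit a structured query against the semantic layer rather than free-form SQL, and reject anything outside the model before it reaches the warehouse. A hedged sketch, with illustrative field names, assuming the LLM returns a simple dict:

```python
# Sketch: validate a structured query from an LLM against the semantic
# layer instead of trusting free-form SQL. All names are illustrative.
ALLOWED_MEASURES = {"total_revenue", "active_users"}
ALLOWED_DIMENSIONS = {"order_date", "country"}

def validate_query(query: dict) -> list[str]:
    """Return a list of violations; an empty list means the query is safe to run."""
    errors = []
    for m in query.get("measures", []):
        if m not in ALLOWED_MEASURES:
            errors.append(f"unknown measure: {m}")
    for d in query.get("dimensions", []):
        if d not in ALLOWED_DIMENSIONS:
            errors.append(f"unknown dimension: {d}")
    return errors

# A hallucinated measure is caught before any SQL is generated or executed.
bad = validate_query({"measures": ["total_profit"], "dimensions": ["country"]})
good = validate_query({"measures": ["total_revenue"], "dimensions": ["country"]})
```

The design choice is that the LLM never writes SQL at all; the semantic layer compiles the validated query, so the model’s output surface is exactly the knowledge graph and nothing more.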
However, none of what I have been preaching had been conclusively proven in public. What we needed was a benchmark: that is how things are tested in engineering. Microprocessors, software, automotive components… engineered things are compared with benchmarks. We didn’t have one, though, until now.
In short, what we see from the benchmark and further replication is:
Semantic Layer > Knowledge Graph > Context with SQL > SQL alone
All of the engineering we have done with Delphi adds further constraint and context on top of the semantic layer, and that is why it outperforms using the semantic layer on its own with LLMs.
It’s great to have this benchmark, but it has shown us that we probably need a more challenging one: one with obstacles, gotchas, purposeful mistakes, and duplication. We need to test these systems on how they deal with bad conditions, not ideal conditions, because in production, semantic layers are like this and worse.
The inverse argument is that if data teams want to use LLMs with semantic layers, some of the uncleanliness that has been tolerated in the past (duplication, poor naming, missing descriptions…) needs to be improved upon. As the benchmark showed, on a clean semantic layer with no duplication and clear naming and descriptions, it’s actually possible for an LLM to answer questions perfectly.
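That hygiene can even be enforced mechanically before the semantic layer is ever handed to an LLM. A minimal sketch of a lint pass over measure definitions; the definition format is hypothetical:

```python
# Sketch: flag undescribed or duplicated measures in a semantic layer.
# The definition format here is illustrative, not any specific tool's.
measures = [
    {"name": "total_revenue", "sql": "SUM(amount)", "description": "Net revenue in USD."},
    {"name": "revenue_total", "sql": "SUM(amount)", "description": ""},  # duplicate, undescribed
]

def lint(measures: list[dict]) -> list[str]:
    """Report measures with missing descriptions or duplicated definitions."""
    problems = []
    seen_sql: dict[str, str] = {}
    for m in measures:
        if not m["description"]:
            problems.append(f"{m['name']}: missing description")
        if m["sql"] in seen_sql:
            problems.append(f"{m['name']}: duplicates {seen_sql[m['sql']]}")
        else:
            seen_sql[m["sql"]] = m["name"]
    return problems

issues = lint(measures)
```

Running a check like this in CI is one way to keep the “clean semantic layer” condition the benchmark relied on.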
I’m pretty sure that if we asked a human to answer all of the questions in the benchmark without a good deal of training on the data model, they would get some of the answers wrong. Where humans outperform LLMs on semantic layers, at least in the short term, is where the human has adapted to the flaws in the semantic layer, thus becoming part of it.
https://cube.dev/blog/semantic-layer-the-backbone-of-ai-powered-data-experiences