Expanding Everywhere
In the first three parts of this series on semantic layers, I covered:
What they are
What benefits they bring in terms of accessibility, governance, consistency, speed and efficiency
What the difference is between a semantic and metrics layer
The existing ones on the market, both attached to BI tools and standalone (the future is standalone)
One key point that I want to reiterate directly is that there is always a semantic layer in play if you are converting data into information or using it for analytics. Even if you don’t have one as a piece of technology, it just means that you have one or more humans acting as semantic layers (probably with varying coverage and definitions).
This leads me to consider what else might be a semantic layer. Is anything that wraps a database or data store, abstracting away the data structure so that people and other systems can work purely with entities, a semantic layer? Could ORMs be considered semantic layers? What about CRMs? A lot of APIs which abstract data stores could also be at least partly considered semantic layers.
If you accept this wider definition, then semantic layers are everywhere. If you only consider the R in CRUD, then ORMs, CRMs and many APIs are semantic layers optimised for row-oriented databases rather than columnar ones. Semantic layers in analytical use cases don’t need the rest of CRUD, because ELT processes are intended to handle that instead.
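To make the ORM comparison concrete, here is a minimal sketch of the “R in CRUD” path, using only Python’s standard library. The customers table, its columns and the Customer entity are purely illustrative assumptions - the point is that the caller works with entities and never sees the underlying row-oriented schema.

```python
import sqlite3
from dataclasses import dataclass
from typing import Optional


@dataclass
class Customer:
    # The entity a caller works with - no knowledge of the underlying schema required
    id: int
    name: str
    lifetime_value: float


class CustomerReader:
    # A read-only (the R in CRUD) mapping from a row-oriented store to entities -
    # the part of an ORM that behaves most like a semantic layer
    def __init__(self, conn: sqlite3.Connection):
        self.conn = conn

    def get(self, customer_id: int) -> Optional[Customer]:
        row = self.conn.execute(
            "SELECT id, name, lifetime_value FROM customers WHERE id = ?",
            (customer_id,),
        ).fetchone()
        return Customer(*row) if row else None


# Illustrative usage against an in-memory database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, lifetime_value REAL)")
conn.execute("INSERT INTO customers VALUES (1, 'Acme Ltd', 1200.0)")
print(CustomerReader(conn).get(1))
```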
PMF
Semantic layers, when applied, have product-market fit - thousands of organisations use them in purely analytical settings. Looker’s success hinged entirely on this PMF. There is a snag, though… the semantic layer needs applications available on top of it for that value to be realised.
As I described in my previous posts in the series, semantic layers provide many benefits to the organisations which deploy them. These benefits accumulate into a competitive advantage - if a company can more quickly use data to make decisions, it will increase its probability of success. This was the whole promise of data - “Winning with data”, “Data is the new oil” etc. Whether through analytics and faster, better human decisions, or ML and faster, better automated decisions, the use of data was always meant to provide an edge over the competition.
Just like anything else, executing well is important, too. There are some principles to stick to, plus some wishes:
Good naming conventions - if your metrics and dimensions are difficult to use and have duplicate names, you lose clarity and trust in the semantic layer.
Well-governed changes - core definitions shouldn’t be changing frequently: users need to be able to trust what something means and be well-informed, with plenty of notice if something is changing.
The opposite is true for experimental datasets, and there is a good argument for them not being in a semantic layer at all. One of the main benefits of semantic layers is not having to write the same code repeatedly in order to answer similar questions. If you’re having to change this code as often as it’s used, the dataset is probably too experimental to be in the semantic layer yet. Worse, stakeholders could accidentally end up using this data, reducing trust in the semantic layer. Just use a notebook to run your SQL and graph your results - it’s easier and avoids all the CI/CD work that will irritate the data team that looks after your semantic layer.
Invariably, you end up needing to create some metrics which are pre-filtered versions of other metrics. This is annoying, but it allows these metrics to be viewed side by side. Try to minimise these, as the more variants that end up in the semantic layer, the more confusing it is to use. Being able to extend metrics from other metrics in a semantic layer, as you can with entities, could be a real benefit. You could have a main metric called ‘revenue’ with subvariants like ‘net of refunds’, ‘with promo codes applied’, ‘without chargebacks’ and so on, where the subvariants just have filters defined in the same place as the main metric (see the sketch after this list).
Owners for objects or, at least, for entities. Being able to define a whole RACI matrix here could be useful - not that you would always want to, but for certain cases where extra governance is required.
Access control - is the data fine for public consumption, company-wide use, specific departmental use or only for certain individuals? The semantic layer should make it clear whether it supports these controls.
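To illustrate the ‘revenue’ example from the list above, here is a minimal sketch of how a base metric and its pre-filtered subvariants might be defined in one place and compiled to SQL. The orders table, column names and filter expressions are assumptions for illustration, and the definition format is deliberately tool-agnostic rather than any real semantic layer’s syntax.

```python
# A hypothetical, tool-agnostic illustration of the 'revenue' example above -
# not the syntax of dbt, Cube, AtScale or any other real semantic layer.
REVENUE = {
    "name": "revenue",
    "sql": "SUM(order_amount)",
    "table": "orders",
    # Subvariants are just filters defined in the same place as the main metric
    "variants": {
        "net_of_refunds": "is_refunded = FALSE",
        "with_promo_codes_applied": "promo_code IS NOT NULL",
        "without_chargebacks": "is_chargeback = FALSE",
    },
}


def compile_metric(metric, variant=None, group_by=None):
    """Compile a metric definition (and optional variant filter) into a SQL string."""
    where = f" WHERE {metric['variants'][variant]}" if variant else ""
    select = f"{metric['sql']} AS {metric['name']}"
    if group_by:
        return (f"SELECT {group_by}, {select} FROM {metric['table']}"
                f"{where} GROUP BY {group_by}")
    return f"SELECT {select} FROM {metric['table']}{where}"


# The base metric and a pre-filtered subvariant, viewed side by side
print(compile_metric(REVENUE, group_by="order_month"))
print(compile_metric(REVENUE, variant="net_of_refunds", group_by="order_month"))
```

The point of the sketch is that each subvariant inherits everything from the base metric except its filter, so the variants stay defined once, in one place, and can be compared side by side.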
The Future
There is a key three-and-a-half-minute stretch at the end of this podcast that I think is worth quoting:
Tristan Handy - question on the impact of semantic layer proliferation:
Let's go 24, 36 months out into the future. What does the world look like? How are practitioners working differently? How do companies interact with their data differently now?
Drew Banin:
I think we've seen this kind of transformational effect with dbt to date. So if you cast your mind back to 2016, 2017, you couldn't really just build a BI experience that queried tables and showed you really great charts and dashboards, things like that. There was this baked-in assumption that you would have to do some sort of transformation.
It was inevitable that you'd have to do some sort of transformation in order to have a table that is ready to analyze. So we saw that with Looker PDTs, and Mode Analytics had some sort of 'create a table on a schedule' feature.
So to me, I think the big longstanding impact of this work is going to be that people building tools can start with a new set of assumptions about all the very important aspects of an organization: the metrics, the entities, how they relate to each other.
These things are already defined, and the challenge is not how do we create a really good interface for people to define this logic; it is how we create the best possible experience for interacting with it, understanding it, and sharing it. And I think it shifts us up this pyramid of needs, if you will.
And I think that there are going to be new types of tools that assume that you will have a semantic layer from the jump. And I think for the people using those tools, it will better connect the actual work that they're doing day-to-day to the experience of interacting with data.
So less of 'open up a link to a dashboard' and much more 'see that data in context in the place where you're already working'.
Nick Handel:
Yeah, I totally agree with that. I think that there are so many companies in the past that have said I need to build a lightweight semantic layer into my product in order to accomplish the product experience that I want to deliver.
And it's catalogs, experimentation tools, BI tools, analytics tools, activation and reverse ETL tools. They have all thought about the concept of metrics. They've thought about dimensions. They've potentially tried to build SQL. It's a hard problem. And having this kind of foundation that allows companies to just say from the beginning, I am going to build the best-in-class version of this product on top of a solid foundation of metrics, I think is going to lead to much better, much more domain-specific data tools in the next few years.
There's also the experience that the analysts have who use this, and I think that's also an interesting thing to talk about. I think part of the role of an analyst is to do a lot of translation work. They interact with the domain expert in the business. They go back, they write some SQL, they show some data, and that person asks a follow-up question.
They're acting as this translation layer between the data warehouse and the person who's trying to make decisions and understand and do analysis. Everyone is capable of doing analysis. Very few people have the ability to freely ask questions and see responses and have that kind of flow to just do analysis and make decisions.
And I think that what this will enable, with some of those new kinds of more dedicated product experiences, is that the analyst will actually be freed up from doing a lot of that translation work. The metrics and dimensions will just find their way into the tools, and instead, they'll get to focus on the really difficult analysis.
The stuff that really enables the business to discover net new opportunities requires disentangling complicated data. And that's the kind of analysis that I think analysts sign up for and get excited about, and then they join companies and their job becomes being this translation layer.
I'm really excited to see that, because it's a real skill that I think people don't have the opportunity to flex enough.
I agree with this quote, but semantic layers have been around for some 30 years, in one form or another. Why haven’t we seen ease of access for all and analysts being freed up from translation work? I think part of this has been that semantic layers have, for the most part to date, been hidden behind the user interfaces of BI tools. We haven’t seen the explosion of applications on top of the semantic layer because they’ve been locked away for the benefit of these BI vendors. This is a different story today with the likes of dbt/Transform, Cube and AtScale available - this move towards standalone analytical semantic layers is very new.
Some of it has also been because we weren’t in the cloud yet. SaaS tools, and the integrations between them, rely on the cloud to handle networking and security for you. In the pre-cloud world, having a bunch of disparate applications leveraging a common semantic layer wasn’t impossible, just improbable. This is also partly why semantic layers have been bundled with BI tools, which come with a user interface to derive value from them - it wasn’t easy in the pre-cloud world to hook something up to another remote system and have it work well.
It’s also partly because these interfaces have previously been limited and have assumed a high level of skill - what’s easy for us is not easy for the majority of non-technical stakeholders. There has been an assumption that everyone in a company should be able to use a BI tool, and that everyone in a company should have a level of proficiency with data. This assumption has been proven wrong time and time again, organisation by organisation.
We don’t even have the requisite number of data folks, let alone stakeholders who are data literate. The BI vendors have assumed that an interface somewhat like a pivot table is a sufficient way to access a semantic layer. You can see why - the number of people with that level of skill is orders of magnitude higher than the number of people who can competently write SQL on a complex data model. So, with this type of interface, they have greatly opened up access to data - it just doesn’t reach anywhere near the majority of people in an organisation.
Next time: LLMs, AI… everybody’s talking about it. The final part in this series.