I have written about semantic layers before, in specific contexts, but I thought it would be worth going back to basics and explaining the “what” and the “why” in depth - as well as the “how” - in this series.
What is a semantic layer?
A semantic layer is a business representation of corporate data that helps end users access data autonomously using common business terms managed through Business semantics management. A semantic layer maps complex data into familiar business terms such as product, customer, or revenue to offer a unified, consolidated view of data across the organization.
By using common business terms, rather than data language, to access, manipulate, and organize information, a semantic layer simplifies the complexity of business data. Business terms are stored as objects in a semantic layer, which are accessed through business views.1
The term been around for a while, as you can see from one of the notes in the Wikipedia article:
On May 29, 1992, Business Objects obtained U.S. Patent 5,555,403, which "provides a new data representation and a query technique which allows information system end users to access (query) relational databases without knowing the relational structure or the structure query language (SQL)".[1] Over time, some competitors like Cognos paid licensing fees.[2] However, in 2003, Microstrategy successfully defended a brought suit by Business Objects alleging patent infringement.[3]
Given the semantic layer was patented in 1992 by Business Objects (the first BI tool I ever saw), let’s assume the concept was at least 10 years older than that… maybe these folks would know. Let’s not make the same mistake as with some other topics of the day, assuming we’ve invented things for the first time. It’s probably safer to assume that, if a concept is related to an underlying technology, then it was probably thought of close to the advent of said technology - which, in this case, is the database.
I have experienced the semantic layer in various forms, almost solely inside other tools. SSAS (MSSQL Server), LookML (Looker), Lightdash, MDX (Power BI/Excel) and pivot tables are a very basic form. I would say it is:
A mapping of real world entities with their associated metrics and attributes, to a logical data structure.
At its most basic, I believe this defines its core principles. Now, without a means of access and most likely a compiler, it’s not as helpful: someone would have to manually apply the definitions each time they accessed the data, most likely in code. Before I had a semantic layer to hand, at work, these mappings would mostly live in an analyst’s head, in an Excel formula/pivot table, in the code that materialised a data object or perhaps in some stale documentation.
I have tried in the past to define each object in the semantic layer as SQL, similar to how you could have the name of the object as a key and the SQL to derive it as a value. It has the advantage of not really needing a compiler. However, you run into problems pretty quickly:
Unless you have just one table/type of data in the logical data structure, you have to define where the data is each time.
The data required to derive an object may be in multiple places and then you have to define how to join the data together.
Without some templating language, defining each thing gets very repetitive, as soon as you try to abstract any part of it… it’s no longer SQL and you then need something to interpret it afterwards into SQL (a compiler).
As each piece of SQL actually ends up being a whole query, you can’t systematically compose them together. If you did try to make something to do this, it would end up needing to parse your SQL into some more useful form for comparison, then composing requires splicing these together 😬.
This is why SQL on its own isn’t really expressive enough to define a semantic layer. It has often ended up being expressed in something like YAML (avoiding using a full blown programming language or an entirely new DSL) with an associated compiler to translate the definitions into SQL that can be run.
It’s possible that someone like Snowflake could extend SQL (as they have already extended it a lot for some of their features) to include ways to define semantics. However, Snowflake would probably be the only existing data warehouse provider big enough to do it, that didn’t have a conflict of interest (GCP owns LookML and BigQuery, Databricks isn’t SQL first).
The final missing piece is the means of access - an API. The API needs to be able to take a request which expects objects from the semantic layer as inputs, with some parameters (for example to define filters). The request is then compiled (often to SQL) and run against the data store. The results are then the API response to the request.
The API may have a caching mechanism where requests which are identical to, or are subsets of, previous requests hit stored results, rather than needing to have a new query run against the data store. The caching purely improves performance/cost and doesn’t provide additional functionality.
Why is it an organisational competitive advantage?
Semantic layers provide a mapping between the real world and data. They are where key objects (entities, metrics and associated dimensions) are defined in terms of meaning and calculation. Without them, an organisation doesn’t have consensus on what any of these mean, or how anything can be measured. Without this consensus, confusion and poor outcomes occur.
When different people and teams within an organisation use varying definitions for the same object, they can’t discuss performance from the same perspective. They spend a huge amount of time debating whose perspective (data) is right. I’ve experienced this first hand on many occasions and at different companies. It makes it very difficult for decisions to be made quickly, and it actually makes decision-makers more cautious than they should be, as they have doubts about their information.
Where there is no explicit, shared semantic layer for an organisation, analysts or other data users end up becoming the semantic layer themselves. They have to remember how to use the data or save queries and code to reuse. This results in inconsistency, which increases exponentially as an organisation scales - the larger the org, the more valuable a semantic layer. Decision making becomes very difficult, and data teams are often forced to highly govern a tiny amount of data used by the C-suite and prevent direct access from other teams to avoid “different” numbers being produced. Everyone else is wrong by default, unless they agree and then they are coincidentally right.
I don’t believe it’s possible for an organisation to have a strong data culture or be truly “data-driven” without a semantic layer. Even with one, it takes dedication and care; without one it’s impossible.
The flipside of this is, where an organisation has a well-maintained2 semantic layer, everyone can know what an object means, how metrics about it are calculated and what their associated dimensions/attributes are and their meaning. It is possible to then make quick decisions with data, without regularly having to question its validity. The more people in an org who can safely access the data and who are familiar with it, the more likely the org can quickly reach consensus on decisions using data.
Making good decisions at a higher volume and lower latency allows an organisation to respond quickly to new threats and opportunities. It also allows organisations to confidently manage higher levels of risk - where risks are chosen, spread and hedged well, this results in higher reward and profit.
For most companies, this is how they will “Win with Data”. Not ML, not real-time… knowing clearly what’s going on and being able to act on it quickly and decisively.
I was almost surprised that Wikipedia had this, but once again I shouldn’t have been… we’ve been reinventing the wheel https://en.wikipedia.org/wiki/Semantic_layer
I may go into this in more depth later, but some core tenets are: simple and clean naming, no duplicates of objects, descriptions where needed, coverage of only the data that should be accessed widely across the org. It should be minimal - think Marie Kondo. Just because someone wants to graph or explore some data, does not mean it should be defined in the semantic layer - this is one of the problems of having them attached to BI tools. Disciplined minimalism.