Last week, I posted the first part in this series on Semantic Layers.
I spoke about some of the benefits of semantic layers at the end of my last post.
Benefits continued…
Keeping you DRY
When an organisation doesn’t have a semantic layer in place, analysts and data science folks end up writing many queries repetitively. The same analyst has to adjust existing queries, or rewrite them outright, for similar purposes.
Different analysts working on similar things end up writing the same queries, as they don’t know how others have done it before. Often, analysts find it easier to write the queries again for themselves than try to find out how someone else might have done it in the past.
In particular, the joins and filters for specific queries become boilerplate code, as this is essentially defining the data model. Semantic layers provide a framework for many variant queries to be run with consistency and a huge reduction in writing boilerplate code.
Imagine if the starting point for your notebook, canvas or dashboard was a semantic layer, instead of having to extract and manipulate data first. Time to value is much lower, and the risk of error is too.
Semantic layers provide access to information - SQL and other data store interfaces provide access to data. By defining how data relates to real-world entities - after it has been transformed into a useful structure using something like dbt - what you access from the semantic layer is data processed into information. You no longer ask for field A from table B; you ask to be shown metrics about entities, by dimensions, with filters and so on. That is information.
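To make that concrete, here’s a minimal sketch in Python. The SemanticLayerClient, its method signature and the metric/dimension names are hypothetical stand-ins, not any particular product’s API - the point is the contrast between a raw SQL request for data and an information-style request:

```python
# A minimal sketch of "data vs information" access. SemanticLayerClient,
# its query signature and the metric/dimension names are hypothetical
# stand-ins, not any particular product's API.
from typing import Any

class SemanticLayerClient:
    def query(self, metrics: list[str], dimensions: list[str],
              filters: list[dict[str, Any]] | None = None) -> list[dict[str, Any]]:
        # A real client would compile this request into SQL using the
        # semantic layer's data model, execute it, and return the results.
        return []

# Data access: you need to know the tables, the joins and the filter columns.
raw_sql = """
    SELECT r.region_name, SUM(o.value) AS revenue
    FROM orders o
    JOIN customers c ON c.customer_id = o.customer_id
    JOIN regions r ON r.region_id = c.region_id
    WHERE o.status = 'complete'
    GROUP BY r.region_name
"""

# Information access: metrics about entities, by dimensions, with filters.
# The boilerplate joins live in the semantic layer, defined once.
semantic_layer = SemanticLayerClient()
result = semantic_layer.query(
    metrics=["revenue"],
    dimensions=["customer.region"],
    filters=[{"member": "order.status", "operator": "equals", "values": ["complete"]}],
)
```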
Governance and Security concerns
One of the topics I’ve been asked about a lot while going through the fundraise for Delphi is governance and security. We have plans on what we specifically will do in this regard, but I keep coming back to what is inherent in interacting with semantic layers (and not directly with data stores).
Semantic layers served via APIs, as a means of access for data, have a number of boons in this area:
As you have to purposefully expose things in your semantic layer, it’s much harder to accidentally expose PII or sensitive data, compared to giving access to your data store directly.
It’s very hard to be sure whether the data being ingested into a data store - be that a warehouse, queue or file system - contains PII or other sensitive data.
Often, even when using off-the-shelf ELT tools like Fivetran, you can end up ingesting custom fields from systems like CRMs, which are usually polluted with PII or sensitive information.
If data is in nested structures or needs parsing before use, it’s very easy for sensitive data to enter your data store unknowingly. For example, I’ve seen email addresses in URLs many times.
Analytical data stores are often used for other purposes such as operations, which require PII or sensitive data in order to be useful.
This makes it hard to give access to data stores without major infosec concerns, but access to semantic layers can still be safe provided that PII isn’t deliberately made accessible.
As analytical data stores become more commonly used with tools like Hightouch and Census for operational ends, this problem will persist and possibly expand, making it harder to give access to analytical data stores directly.
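As a rough sketch of why the explicit-exposure property helps (the definition format below is hypothetical, not any particular tool’s syntax): only fields you deliberately declare exist in the layer, so a PII-polluted source column that is never declared is simply unreachable.

```python
# Hypothetical illustration: the semantic layer only knows about what is
# explicitly declared. The source table may well contain PII (custom CRM
# fields, email addresses hiding in URL columns), but undeclared columns
# are unreachable through the layer.

CUSTOMER_ENTITY = {
    "source_table": "analytics.customers",  # may contain undetected PII
    "dimensions": ["customer_id", "region", "signup_month"],  # deliberate exposure
    "metrics": {"customer_count": "COUNT(customer_id)"},
    # email, phone and free-text CRM fields are never declared here, so no
    # request through the semantic layer can return them.
}

def resolve_dimension(entity: dict, requested: str) -> str:
    """Only declared dimensions can be queried; anything else is rejected."""
    if requested not in entity["dimensions"]:
        raise PermissionError(f"{requested!r} is not exposed by this entity")
    return requested
```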
Some semantic layers today have strong RBAC models built in.
It’s actually easier to think about RBAC with regard to a semantic layer than a data store. The groups and teams in an organisation map naturally onto the entities and metrics they should be allowed to know about.
For example, Finance in listed companies are allowed to know about revenue, but access for other teams needs restriction.
Customer Service are allowed to see a lot of what would be considered PII in relation to the Customer Entity, but other teams should not.
Data tables can be pretty raw and it’s hard to know what they are specifically about. In semantic layers, it’s much clearer.
It would be better to rely upon RBAC for a semantic layer than directly on the data store.
Data store RBAC features vary wildly! Some are on the data store, some are at the cloud provider level… even experienced professionals can get confused and make mistakes.
It’s easy to accidentally give access to unsafe data in a data store.
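Here’s a hypothetical sketch of why RBAC is simpler to reason about at this level: roles are granted entities and metrics, rather than tables, schemas and cloud IAM policies. Role names and grants are illustrative only.

```python
# Hypothetical sketch: RBAC expressed against entities and metrics rather
# than tables and schemas. Roles and grants are illustrative only.

ROLE_GRANTS = {
    "finance":          {"entities": {"order"},    "metrics": {"revenue", "order_count"}},
    "customer_service": {"entities": {"customer"}, "metrics": {"customer_count"}},
}

def can_query(role: str, entity: str, metric: str) -> bool:
    """Allow a request only if the role holds both the entity and the metric."""
    grant = ROLE_GRANTS.get(role)
    return grant is not None and entity in grant["entities"] and metric in grant["metrics"]

assert can_query("finance", "order", "revenue")
assert not can_query("customer_service", "order", "revenue")  # revenue restricted
```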
By definition, semantic layers are minimal: they expose only what you choose to expose. If a new table appears in a schema, it doesn’t automatically become accessible through the semantic layer - whereas it could through a data store.
Lots of data in a data store ends up being experimental, tied to CI/CD processes, or is just an unknown quantity. You don’t really want users accessing this data - it’s only meant for development. It’s hard to guarantee it isn’t accidentally used for the wrong things when giving access to a data store. Not so with a semantic layer.
Some data entities and related metrics are actually regulated by law.
If your company publishes information about these incorrectly, serious trouble could lie ahead. Examples include profit in various forms, bank account holders, debtors… it’s a long list.
Semantic layers provide a way to explicitly define these and prevent them being changed without authorisation, while still allowing for easy access and even easier calculation.
I’d actually love to hear about how some experts in this field think about this - Chris Tabb, Bart Vandekerckhove, looking at folks like you.
Semantic layer, metrics layer… what’s the difference?
There are different opinions on this, but most would consider a metrics layer a subset of a semantic layer. Metrics layers, out of necessity (especially if they are of the approach 2 type, which is now rightfully the dominant one), end up defining how the data model fits together. However, they don’t specifically need to define entities beyond counting them or summing their attributes.
My understanding is that semantic layers should also define entities, and this is probably the biggest difference between a semantic layer and metrics layer. I’m sure there are opinionated voices out there who would want to add to this, but based on a quick Google it seems I’m not far off.
There are different ways to implement this. For example, in Veezoo’s semantic layer, you explicitly map entity definitions to logical data structures, like defining classes in OOP (see the sketch after this list).
MetricFlow assumes entities have already been created during data transformation, which is why it works so well with dbt: to use MetricFlow effectively, rows in certain tables are your entities.
Cube has, well… cubes, which can join other cubes. Each cube is an entity, mapped to a logical data structure using SQL.
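For a flavour of that OOP-like style, here’s a hypothetical Python sketch - deliberately not Veezoo’s, MetricFlow’s or Cube’s actual syntax - of mapping an entity definition onto a logical data structure:

```python
# Hypothetical sketch of an entity definition as a class, mapped onto a
# logical data structure - illustrative only, not any product's syntax.
from dataclasses import dataclass, field

@dataclass
class Entity:
    name: str
    table: str                                   # the logical structure it maps to
    key: str                                     # the id that makes it an entity
    dimensions: list[str] = field(default_factory=list)
    metrics: dict[str, str] = field(default_factory=dict)

user = Entity(
    name="user",
    table="analytics.users",
    key="user_id",
    dimensions=["country", "signup_date"],
    metrics={"user_count": "COUNT(user_id)"},
)
```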
Entities allow for easy segmentation: you can specifically request all the entities that meet certain criteria. In the era of using analytical data stores for operational use cases, entities will be particularly useful. Previously, OLTP databases have served these use cases, but pulling customer/product segments in the millions of rows is not optimal for them and can cause problematic load - they are meant for single-record CRUD at low latency.
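A toy sketch of the idea, with the semantic layer faked as an in-memory list and made-up attribute names - a segment is simply “all entities that satisfy some criteria”:

```python
# Toy sketch: a segment is "all entities meeting criteria". The semantic
# layer is faked here with an in-memory list; attribute names are made up.

customers = [
    {"customer_id": 1, "lifetime_value": 1500, "region": "EMEA"},
    {"customer_id": 2, "lifetime_value": 200,  "region": "EMEA"},
    {"customer_id": 3, "lifetime_value": 900,  "region": "AMER"},
]

def segment(entities, predicate):
    """Return every entity that satisfies the predicate."""
    return [e for e in entities if predicate(e)]

high_value_emea = segment(
    customers,
    lambda c: c["lifetime_value"] > 1000 and c["region"] == "EMEA",
)
# A real semantic layer would compile the criteria into one set-based query
# against the analytical store, instead of millions of single-record reads
# against an OLTP database.
```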
Entities also allow for inheritance, from base entities to derived ones, e.g. customers from users. This in turn allows the code expressing the semantic layer to be more concise. Metrics and dimensions can be inherited from one entity to another - if a count was defined on the user entity as count(user_id), then the same metric can be inherited by the customer entity without needing to redefine it.
Understanding what exactly is or isn’t an entity is key to implementing a semantic layer well. Some of the flaws in data models, leading to issues like fan-outs, stem from not understanding what the entities in the model are and how they relate. For example, revenue is not an entity; revenue is a metric about other entities: sales/orders/bookings. Each sale/order/booking has a value, and the sum of those values is revenue - revenue does not have an id, but those entities do, as sketched below.
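Putting those last two points together in one hypothetical sketch: a customer entity inheriting the user entity’s count metric, and revenue defined as a metric on orders rather than as an entity in its own right. Names and table mappings are illustrative only.

```python
# Hypothetical sketch: entity inheritance, and revenue as a metric rather
# than an entity. Names and table mappings are illustrative only.

class User:
    """Base entity: rows in analytics.users, identified by user_id."""
    table = "analytics.users"
    key = "user_id"
    metrics = {"user_count": "COUNT(user_id)"}

class Customer(User):
    """Inherits user_count - COUNT(user_id) is not redefined here."""
    table = "analytics.customers"  # e.g. the subset of users who have purchased

class Order:
    """Orders are an entity: each row has an order_id and a value."""
    table = "analytics.orders"
    key = "order_id"
    # Revenue has no id of its own; it is the sum of order values.
    metrics = {"order_count": "COUNT(order_id)", "revenue": "SUM(value)"}
```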
Next time, part 3: “State of the Join” 😉