You can take your gold and shove it...

...into my automatic semantic layer generator

Jan 13, 2025

Man flinging a large gold ingot aggressively at the viewer, with the ingot having already left his hand

In this week's post, I want to delve into medallion architecture again, specifically focusing on the gold layer. Medallion architecture - a concept that I’ve written about before - organises data into bronze, silver, and gold layers, each doing a different type of processing along the way. While the bronze and silver layers are clearly needed to use data confidently, the gold layer has side-effects that may now be avoided. Let's explore why we should ditch the gold layer in favour of universal semantic layers.

Modellion

David Jayatillake

November 25, 2024

Let’s start with a TLDR; Medallion Architecture is not a form of data modeling.

Read full story

For most organisations, the functions of the bronze and silver layers are clear:

Bronze Layer: Often referred to as the staging layer, this is where data is initially processed after being ingested. Here, we standardise and rationalise raw data, dealing with formats, data types, duplication, and other issues to make it usable and safe for further processing.
Silver Layer: This is where we abstract data into business entities and events related to the organisation. Entities, like customers and orders in an e-commerce data model, are built here. The silver layer comprises tables representing a relational data model (or otherwise), allowing for safe and easy (at least easier) joins to derive meaningful insights. At the end of silver layer data transformation, you have a data model that can capture the events and entities in your organisation’s world; limited to what exists and is usable in your data.

Despite its structured nature, the silver layer has some limitations:

Complexity for Analysts: Using the silver layer effectively requires a deep understanding of its structure - an ERD etched inside your skull. There are often baked-in assumptions and nuances in how tables should be joined or filtered in queries.
Scalability Issues: As the silver layer grows in size and complexity, it becomes challenging for even experienced analysts to navigate. Determining how to write the best query can be daunting when faced with many tables and relationships.
Cost: Hitting the silver layer every time can be expensive or tricky to keep optimal. Even ensuring the right date filters are used in all tables to prevent unnecessary data from being scanned isn’t trivial. Most of the time, queries to the silver layer don’t require the full grain of data available there, which means data is almost always aggregated from the silver layer. It’s cheaper to aggregate once and then store it, than do it repeatedly for every query.

This complexity makes it difficult to consume data directly from the silver layer, driving the need for an intermediary - the gold layer.

a pile of coins — Photo by benjamin lehman on Unsplash

As expected, the gold layer comprises aggregated tables created from the silver layer's data. Typically structured in star schemas, these aim to make data consumption easier, more presentable, and faster to access. The gold layer simplifies querying and improves performance for common analytical tasks by pre-joining and aggregating data.

Fool’s Gold

While the gold layer serves a purpose, it has side-effects:

Loss of Information

Gold layer tables inherently lose information from the silver layer. Since they focus on specific aggregates and dimensions, they do not cover all the data available in the silver layer. This loss makes it impossible to surface certain pieces of information here, especially when the gold layer is the only source accessible to BI tools or other consumers.

Fixed Grain and Inflexibility

Gold layer tables are built with a fixed grain, including predetermined dimensions and measures. This rigidity resembles building an OLAP cube in the past. When users require new dimensions or different granularities, data/analytics engineers must rebuild or adjust the gold layer tables - a Sisyphean task.

Maintenance Overhead

Maintaining gold layer tables requires continuous effort:

Rebuilding Tables: Adjusting to new analytical requirements means frequently updating the gold layer. While data teams spin their hamster wheels as fast as possible to make these changes, other teams often go straight to raw, staging or silver layer tables and make their own truth instead of waiting.
Querying Complexity: Users still need to write SQL queries against these tables, join fact and dimension tables, and apply correct aggregations - a process prone to errors and inconsistencies. Gold doesn’t remove all the complexity of using silver layer tables.
Variance Risks: Different teams may not query the gold layer tables in the same way, leading to variances in reported numbers and decision-making paralysis. Trigger that meeting where everyone spends the first 25 minutes comparing numbers and the last 5 scheduling the next meeting.

How to Get the Master Sword in Zelda: Breath of the Wild

Modern data infrastructure offers a powerful alternative: universal semantic layers. They provide an abstraction over the silver layer, offering dynamic, flexible data consumption without the drawbacks of the gold layer. Artyom Keydunov recently wrote a neat history of OLAP cubes and semantic layers if you want to know their origin story. They’ve been around for some time in different forms and capability.

How Semantic Layers Work

A semantic layer codifies:

Entities: The entities in the data model are clearly defined and explicitly drawn out of the silver layer tables.
Relationships: Defines how silver layer entities and events are related, including join keys and relationships between tables.
Measures and Dimensions: Specifies how to aggregate columns to derive measures (e.g., summing orders.order_amount for revenue) and identifies dimension fields (e.g., users.user_country).

With this metadata, the semantic layer allows users to request data using business terms without writing complex SQL. For example:

A user asks for "revenue by user country".
The semantic layer codifies how to join the orders and users tables, which columns to aggregate, and which dimensions to include.
It compiles this request into executable SQL, runs it against the data warehouse, and returns the result. Humans aren’t writing SQL whenever they want to use metrics and dimensions from the data model, which leads to consistency, accuracy, accessibility, speed, and enhanced security. The semantic beatitudes.

Semantic layers are not constrained by fixed grains like gold layers. Users can select any combination of measures and dimensions as needed. There's no need to predefine every possible aggregate, eliminating the need to rebuild tables for new analytical requirements.

While concerns about query performance and cost are valid, semantic layers address these through:

Caching and Pre-Aggregations: They can cache query results and pre-aggregate data for common query patterns - much like a gold layer table, but without rigidity.
On-Demand Processing: The semantic layer generates and executes SQL on the fly for less common queries, ensuring users can access any available data without additional engineering work. Where engineers see this happening increasingly often, they can add more pre-aggregations to cover this usage, and the semantic layer provides metadata to make this easy to discover. In the future, AI will add and deprecate pre-aggregations as needed, according to AI product folks at universal semantic layer vendors.

Conclusion

The evolution of data analytics tools has reached a point where we can - and should - rethink traditional architectures. Once a necessary component for performance and usability, the gold layer now presents more drawbacks than benefits. By embracing universal semantic layers on top of our silver layer data, we can provide a more flexible, efficient, and user-friendly data environment.

Let's stop relying on the gold layer and start utilising the power of semantic layers in our medallion architecture. This is the real gold layer we always wanted.

I know there is an elephant in the room: the fact that you have to create a semantic layer, which might be scary and new. However, I’ve shown recently that it’s possible to automatically generate semantic layer code from gold layer SQL queries. Existing gold layer transformation code, or gold-style queries if you don’t have a gold layer, contain the perfect metadata to parse and use to generate a semantic layer automatically.

sqlmesh cube_generate build part 1

David Jayatillake

December 17, 2024

I’ve been hacking away at this today and got so far:

Read full story

sqlmesh cube_generate build part 2

David Jayatillake

December 18, 2024

Yesterday, I got to a point where I got a CLI command to output JSON, which described the relationships between sqlmesh models and how fields were aggregated. This is the fundamental metadata required to build a semantic layer. When I set out to build this integration this week, I knew it was unlikely I would build something that deterministically gener…

Read full story

davidj.substack

Modellion

sqlmesh cube_generate build part 1

sqlmesh cube_generate build part 2

Discussion about this post