Last week, I wrote a post about a core part of the Unified Star Schema - the “Puppini” Bridge. As I mentioned, this approach feels different enough, and potentially valuable enough, that I really want to dive deep into it. It has the potential to solve a lot of the problems that make self-serve analytics more difficult and narrower in scope than desired.
The first eight chapters of the book (Part 1), written by Bill Inmon, really look at the history of data warehousing - why things are the way they are and how they have evolved over time.
These chapters cover how data marts came to exist and the dimensional models that we are familiar with - fact and dimension tables, star schemas and snowflake data models. They also explain how data silos came to be, in terms of orphaned data marts per department/application and how duplicates of relatively inexpensive data marts proliferate over time. I’ve often seen replicas of production databases used as a data mart, too. These factors drive the data chaos that many who read this blog have experienced. Data Quality is still the number one issue faced by data teams.
The book is written from the perspective of data warehousing in relation to applications with production OLTP databases, rather than the event-driven streaming world that I’ve spent the more recent years of my career dealing with. ETL is the presumption, rather than the ELT pattern adopted in recent years - encouraged by cheap cloud storage, highly scalable compute that is separate from storage, and the rise of dbt as a transformation framework on the data warehouse. Nonetheless, the book’s principles regarding data modeling still hold true, and, as an industry, we’re heading back towards deeper consideration of data modeling.
We may have swung too far: inhaling the data exhaust of event-driven systems and failing to handle the onslaught of changes, even with the power of ELT stacks. I’m not sure I want to go back to ETL as a way of doing data pipelines. I want to swing back even further, towards something you could maybe call DEL - design, extract and load. Instead of transforming data emitted as events or replicated from a poor prod OLTP schema, we should design the data model to spec as the application itself is developed - eliminating the need for heavy transformation of the data. You could imagine that the deduplication required by loading patterns, plus casting to data warehouse/lake formats, would be all that’s needed. This is similar to how we go from raw to staging in many data warehouses today; we just wouldn’t need the steps that come afterwards.
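To make that concrete, here’s a minimal sketch of what a DEL-style raw-to-staging step could look like, assuming a pandas DataFrame loaded from a hypothetical raw orders landing table (the column names order_id, amount, ordered_at and loaded_at are made up for illustration):

```python
import pandas as pd

def raw_to_staging(raw: pd.DataFrame) -> pd.DataFrame:
    """Minimal 'DEL-style' staging step: deduplicate rows re-delivered by the
    loader and cast columns to warehouse-friendly types. No remodelling is
    needed, because the source schema was designed to spec up front."""
    staged = (
        raw
        # Loaders often deliver the same record more than once; keep the latest.
        .sort_values("loaded_at")
        .drop_duplicates(subset=["order_id"], keep="last")
        # Cast to the types the warehouse/lake expects.
        .astype({"order_id": "int64", "amount": "float64"})
    )
    staged["ordered_at"] = pd.to_datetime(staged["ordered_at"], utc=True)
    return staged
```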
Of course, as the book is so heavily focused on data modeling, it brings up a lot of thoughts about semantic layers as you read it. The “integrated data mart” approach is particularly interesting: a central data warehouse holds a relational model, with dependent star schema data marts that are, in turn, abstractions of small parts of that central relational model.
Most modern semantic layers assume a central relational model exists in the data warehouse. The dependent data marts, which I assume are materialised as tables, partly duplicate data from the central relational model. Semantic layers can easily remove the need for these materialised dependent data marts by providing a limited-scope abstraction of the central relational model that forms a star schema. This leverages the joins, dimensions and metrics already defined for the central relational model and governed in the semantic layer, avoiding duplication of definitions and data, and gaining the further benefits of a semantic layer in terms of caching, security and APIs.
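As a rough illustration of that idea - using made-up Python structures rather than any particular semantic layer’s syntax, reusing the hypothetical orders table from the earlier sketch plus an equally hypothetical customers table - a dependent data mart collapses into nothing more than a named, limited-scope selection over definitions that already live in the central model:

```python
from dataclasses import dataclass

# Hypothetical structures for illustration only - not any specific semantic layer's API.
@dataclass
class Entity:
    name: str
    table: str
    dimensions: list[str]
    measures: dict[str, str]  # measure name -> aggregation expression

@dataclass
class Join:
    left: str
    right: str
    on: str

# The central relational model: joins, dimensions and metrics defined once and governed here.
orders = Entity(
    "orders", "warehouse.orders",
    dimensions=["order_id", "ordered_at", "customer_id"],
    measures={"revenue": "sum(amount)", "order_count": "count(order_id)"},
)
customers = Entity(
    "customers", "warehouse.customers",
    dimensions=["customer_id", "region", "segment"],
    measures={},
)
joins = [Join("orders", "customers", "orders.customer_id = customers.customer_id")]

# A "dependent data mart" becomes a limited-scope star-schema view over the same
# definitions: no new tables, no re-declared joins or metrics, just a narrower surface.
sales_mart = {
    "entities": ["orders", "customers"],
    "dimensions": ["ordered_at", "region"],
    "measures": ["revenue"],
}
```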
There are also chapters dedicated to the value of logging and metadata, and reading them really drove home to me that this value is hugely enhanced by the semantic layer.
The calls that are made to the data mart are intercepted and read. The responses to those calls are also read. A log records the activity. From the analysis of the calls and responses, a record is made as to what data is being accessed and what data is being returned as a result of the access.
I’ve leaned heavily on the query history of my data warehouse in the past to monitor its usage. Yes, it’s fairly easy to understand which tables are being accessed and which fields are chosen, but complex queries are harder and more time-consuming to decipher. The logs of semantic layer API access are so much clearer: clear entity names and the explicit names of the dimensions and measures pulled from them. These can also be documented in the semantic layer. Rather than knowing what data structures are being accessed, you can know what real-world business objects are being accessed.
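For example, a semantic layer access log might look something like the sketch below - the field names and JSON shape are assumptions, but the level of detail is representative - and summarising usage by business object becomes trivial compared with parsing arbitrary SQL:

```python
import json
from collections import Counter

# Hypothetical semantic layer access log entries - the field names are made up,
# but this is roughly the level of detail such logs give you.
log_lines = [
    '{"user": "finance_analyst", "entity": "orders", "dimensions": ["region"], "measures": ["revenue"]}',
    '{"user": "ops_manager", "entity": "orders", "dimensions": ["ordered_at"], "measures": ["order_count"]}',
]

usage = Counter()
for line in log_lines:
    event = json.loads(line)
    for measure in event["measures"]:
        # Usage is recorded against named business objects, not raw SQL text.
        usage[(event["entity"], measure)] += 1

print(usage)  # Counter({('orders', 'revenue'): 1, ('orders', 'order_count'): 1})
```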
Even though it often does not get much attention, one of the most important components of the data mart environment is the metadata that describes what is in the data mart. The most important reason why metadata is important is that the data mart end-user needs metadata to describe what is in the data mart. Stated differently, the more the end-user can know about the data mart, the more effective the end-user can be in doing his/her analysis.
Every data mart I’ve ever encountered had stale documentation. The documentation is not directly linked to the engineering work done to create and maintain the data mart; it’s additional work, not an artefact generated automatically by that engineering work. The metadata generated as part of maintaining a semantic layer, by contrast, is effectively in production: it is used by the semantic layer’s compiler, so if it’s broken then your production is broken. As a result, the metadata of the semantic layer stays much fresher. This, along with the governed meaning of the data being stored in the semantic layer, is why it’s the ideal place for a catalog that is useful to business users.
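Here’s a toy illustration of that compile-time coupling, reusing the hypothetical tables from the earlier sketches; the point is that a stale reference fails the build rather than quietly rotting in a document nobody reads:

```python
# Toy example: semantic layer definitions are validated against the warehouse
# schema when the model is compiled, so stale metadata fails loudly.
warehouse_schema = {
    "warehouse.orders": {"order_id", "ordered_at", "customer_id", "amount"},
    "warehouse.customers": {"customer_id", "region", "segment"},
}

def compile_entity(entity_name: str, table: str, dimensions: list[str]) -> None:
    columns = warehouse_schema.get(table)
    if columns is None:
        raise ValueError(f"{entity_name}: table {table} does not exist")
    missing = [d for d in dimensions if d not in columns]
    if missing:
        # Documentation drift becomes a build failure, not a silent inaccuracy.
        raise ValueError(f"{entity_name}: unknown columns {missing}")

compile_entity("customers", "warehouse.customers", ["customer_id", "region"])  # passes
compile_entity("customers", "warehouse.customers", ["customer_tier"])          # raises
```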
A combined metadata repository might be useful where there are two or more widely used data marts. Of course, there may be numerous smaller data marts that do not participate in a combined metadata repository.
One of the issues with any data mart metadata repository — either combined or singular — is that of maintenance of the metadata. Over time the data mart and its contents change. In some cases, the change is gradual. In other cases, the change is sudden. The metadata needs to be changed in coordination with the changes to the data mart. Figure 7.5 shows that the metadata that reflects the contents of the data mart needs to be periodically maintained.
The RBAC features of semantic layers allow you to know what type of user is accessing which real-world objects. Not all data warehouses that would host a data model have this level of sophistication in their roles.
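As a hypothetical sketch of what that looks like: roles grant access to named business objects and specific fields, so access can be reasoned about (and logged) in those terms rather than in terms of raw tables:

```python
# Hypothetical semantic-layer RBAC: roles map to named entities and the
# dimensions/measures they may query. Names reuse the earlier made-up model.
role_grants = {
    "finance_analyst": {"orders": {"revenue", "region", "ordered_at"}},
    "support_agent": {"customers": {"customer_id", "segment"}},
}

def can_access(role: str, entity: str, fields: set[str]) -> bool:
    allowed = role_grants.get(role, {}).get(entity, set())
    return fields <= allowed

print(can_access("finance_analyst", "orders", {"revenue", "region"}))  # True
print(can_access("support_agent", "orders", {"revenue"}))              # False
```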
The first part of the book leads us up to the present situation and no further. It does this quite concisely and sets us up to look at how things could be better with a new paradigm.
You can purchase a copy of The Unified Star Schema here.