I remember meeting the founders of Reconfigured.io for lunch at Spitalfields Market one afternoon a couple of years ago. This was the first time I had heard the term “Puppini Bridge”, from Niko, who excitedly told me about how he felt this concept could revolutionise data modeling. Whilst I didn’t forget about it, it seems to have lingered in the back of my mind until now…
Semantic layers, including those bundled with BI tools, invest a lot of time and effort in generating the correct SQL, based upon the inputs to the semantic layer’s APIs via a BI interface or otherwise. There are more complex types of data model that can result in problematic SQL being generated.
A typical star schema with a fact table joining many-to-one onto dimensional tables doesn’t usually cause an issue. However, as soon as the relationship between fact and dimension table changes from many-to-one to many-to-many (usually not allowed in semantic layers) or even one-to-many, fan-outs can become an issue. This is exacerbated by chains of joins.
Fan-outs are where a join can cause duplication, leading to incorrect aggregate results. For example, if you had an orders table which joins a products table. Let’s imagine in this instance that an order can only have one product, but the products table has many rows per product (eg perhaps you have a new row for each product each time the price changes). If you join the orders table to products in a one-to-many fashion (allowed in most BI tools and semantic layers), without taking steps to ensure only one product row is joined to one orders row (such as taking the product row which has the price as at the time of the order), then the query will duplicate the orders rows. Thus, if you were to sum an order value field to calculate revenue, it would probably give you an inflated number.
Where there are multiple fact tables in the data model that need to be joined in order to use measures related to more than one of them, then chasm traps become an issue. A typical example of this could be where you have a fact table for orders which joins a dimension table for products (many-to-one, so far so good) which then joins another fact table for shipments (one-to-many - here comes the problem). In essence, orders now has a many-to-many join with shipments, and the query generated will possibly inflate measures from both fact tables.
Semantic layers and BI tools have had many approaches to solving these issues. Typically, they work well where it is explicit that issues will occur. For example, where a fact table is joined one-to-many with a dimension table and the join type has been set to this, semantic layers have typically had ways to ensure that the query generated has the correct results and doesn’t inflate measures. Chasm traps are generally less well-handled, due to their indirect nature preventing explicit labelling by data teams.
Recently I’ve been working with one of our sales engineers at Cube, Morgan Asher, who has been working on some POCs which have brought the Puppini Bridge to the fore.
Hi Morgan, can you tell us a bit about yourself? Background, what you do now, skillset etc?
I’m a solutions engineer at Cube.dev. I help data teams understand the value of a semantic layer and assist them with implementing one in their data stack. My background is from the Department of Customer Love at Looker, where I helped customers implement and troubleshoot the LookML semantic layer.
Wow, was it really called that? Did you ever do the support work where people would use the chat function in Looker to get help with weird liquid syntax quirks etc? Part of me wonders whether I ever spoke to you!
Yes that's how I got my career started! I was on live chat from 2016-2017, then I was managing a team that was doing chat for a couple years before I moved to a support + engineering liaison role. So unless you were using Looker early we probably didn't talk directly, although you may have talked to my team :)
What problems have you been seeing with data modeling that have led to your research of this solution?
One of the most valuable aspects of a semantic layer is the ability to define join logic once and reuse it over and over again. Joining tables correctly is one of the most important parts of writing a query to get right, and it’s also one of the sneakiest causes of incorrect results.
A really common way that joins can cause bad results is when you are performing an aggregation after joining two tables that have a one-to-many relationship. When one row in the left table matches with two or more rows in the right table, the values in the left table become duplicated for each of the rows that it gets matched with. If you aggregate the data that has been duplicated, you’re often going to get incorrect results.
Fortunately most BI tools and semantic layers have functionality where you can set table relationships as one-to-one, one-to-many, or many-to-many. These tools can then make sure to write SQL queries that acknowledge the data duplication and avoid overcounting. This is great for simpler use cases, but once you start chaining these relationships together with three or more tables, you can run into problems like the fan trap or chasm trap that are very difficult to deal with within a single SQL query. Some semantic layers and BI tools can handle these more complicated relationships by writing complex, unintelligible SQL, but you start to run into limitations on how the data can be displayed, what fields can be queried together, and major performance issues.
What is the Puppini bridge? Why does it resolve the problem?
A Puppini bridge refers to a table in a database that is structured in a particular way - its purpose being to map join keys to one another. This is one part of the “Unified Star Schema”, a structure developed by Francesco Puppini. The idea is that the bridge table sits in the middle of the schema, and all other tables are joined to it directly. It takes some wrangling to create the bridge table, but once it is created, as long as you start your query with the bridge table and define aggregations against the bridge table itself, you can join to your heart’s content without worrying about having to define relationships or deal with chained joins.
How would you implement it?
I've implemented this in the Cube semantic layer by creating the bridge logic in the sql parameter of a cube file. It is also common to implement this in the data warehouse side and materialize it as a table/view.
cubes:
- name: bridge
public: false
sql: >
SELECT
'Line Items' AS stage,
"line_items".ORDER_ID AS _KEY_Orders,
"orders".USER_ID AS _KEY_Users,
"line_items".ID AS _KEY_LineItems,
"line_items".PRODUCT_ID AS _KEY_Products
FROM ECOM.LINE_ITEMS AS "line_items"
LEFT JOIN ECOM.ORDERS AS "orders" ON "orders".ID = "line_items".ORDER_ID
GROUP BY
1,2,3,4,5
UNION ALL
SELECT
'Users' as stage,
null as _KEY_Orders,
"users".ID as _KEY_Users,
null as _KEY_LineItems,
null as _KEY_Products
FROM
ECOM.USERS as "users"
GROUP BY
1,2,3,4,5
UNION ALL
SELECT
'Products' as stage,
null as _KEY_Orders,
null as _KEY_Users,
null as _KEY_LineItems,
"products".ID as _KEY_Products
FROM
ECOM.PRODUCTS as "products"
GROUP BY
1,2,3,4,5
joins:
- name: orders
relationship: many_to_one
sql: _KEY_Orders = {orders.id}
- name: users
relationship: many_to_one
sql: _KEY_Users = {users.id}
- name: line_items
relationship: many_to_one
sql: _KEY_LineItems = {line_items.id}
- name: products
relationship: many_to_one
sql: _KEY_Products = {products.id}
dimensions:
- name: pk
sql: "stage||NVL(_KEY_Orders,0)::VARCHAR||' '||NVL(_KEY_Users,0)::VARCHAR||' '||NVL(_KEY_LineItems,0)::VARCHAR||' '||NVL(_KEY_Products,0)::VARCHAR"
type: string
primary_key: true
public: false
measures:
- name: order_count
type: count_distinct_approx
sql: _KEY_Orders
- name: order_completed_count
type: count_distinct_approx
sql: _KEY_Orders
filters:
- sql: "{orders.status} = 'completed'"
- name: user_count
type: count_distinct_approx
sql: _KEY_Users
- name: total_amount
type: sum
sql: "{line_items.price}"
pre_aggregations:
- name: bridge_rollup
measures:
- bridge.user_count
- bridge.order_count
- bridge.order_completed_count
- bridge.total_amount
time_dimension: orders.created_at
granularity: day
As you can see, all the joins are outwards and many-to-one from the bridge - this eliminates the possibility of chasm traps since there are no chained joins. Fan traps are avoided by defining any metrics at risk of explosion in the bridge cube itself rather than in their own cubes.
Since starting this post, I have received a copy of “The Unified Star Schema” by Bill Inmon and Francesco Puppini. I’m going to write a few more posts on this book as I go through it, and think about the details in more depth. My summer reading!
London Analytics Meetup #7 is also scheduled for 17th September, which is the evening before Big Data London 2024. Francesco will be speaking - don’t miss out!
Sign up for the meetup here.
I remain unconvinced. But I'll have to give it another look.