I recently attended the Future of Metrics Layer and Metadata session hosted by Atlan, which got deep into the weeds on some of the differences between metrics layers - Transform and dbt being two of the foremost organisations in the space.
dbt is, of course, more than a metrics layer: it's a whole data transformation framework that has gained traction in the last few years. That has made building a semantics layer on top of transformation a logical next step - the goal of Analytics Engineering could be described as enabling semantics with data.
Tristan talks about this in last week's Analytics Engineering Roundup:
What I think the article is really advocating for is a semantic layer without joins…! Tables plus metadata to make those tables consumable in a BI layer. Which, notably, is exactly how dbt’s metrics are built :D
I think the conversation we’re starting to have as an industry is “How thick should my semantic layer be?” As the entire ecosystem charges headlong towards the semantic-layer-informed version of the world, this is a critical question. Do we want to attempt to define all of our business logic in the semantic layer, just like we did back in 2015 in LookML? Do we want to try to smash as much as we can into precalculated columns in dbt models?
Neither of these extremes are good answers. Try to do too much in the semantic layer and you lose all of the assurances (testing, CI/CD, lineage, etc.) of dbt. Try to do too much in dbt and you generate a bunch of junk models attempting to anticipate users’ interactive needs.
Transform has really come to the forefront of the metrics/semantics layer space in the last year or so. They recently open-sourced MetricFlow, the framework for defining their metrics layer, and they also have live APIs for accessing metrics data. dbt will be releasing their own APIs in due course (it would seem like Coalesce 2022 in New Orleans could be the perfect stage!).
MetricFlow can be used in conjunction with dbt and, in some ways, feels intended to be so1. It doesn't provide a framework for doing data transformation; it relies on well-modelled, clean data having already been built, with understood granularity and identifiers - and dbt is the most popular framework for delivering this today.
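As a rough illustration of what "understood granularity and identifiers" means in practice, here is the kind of dbt model (all names hypothetical) that a metrics framework wants to sit on top of - one row per order, with explicit keys to aggregate and join by:

```sql
-- Hypothetical dbt model, e.g. models/marts/fct_orders.sql
-- Grain: one row per order. The explicit identifiers (order_id, customer_id,
-- product_id) are what a metrics layer keys off when aggregating or joining.
select
    id          as order_id,       -- primary identifier, defines the grain
    customer_id as customer_id,    -- foreign identifier to the customers entity
    product_id  as product_id,     -- foreign identifier to the products entity
    ordered_at  as ordered_at,     -- time dimension for time-grain rollups
    amount      as order_revenue   -- measure that metrics will aggregate
from {{ ref('stg_orders') }}
```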
In the session linked above, with Nick Handel of Transform and Drew Banin of dbt Labs, there was a section focused on how metrics should be defined. There seem to be two active schools of thought:
Metrics are defined on "wide" pre-joined single table/view models. When metrics are called, the model is queried and aggregated according to the call parameters.
Metrics are defined on top of a relational data model which defines joins. When metrics are called, the compiler decides on an optimal query to run.
With the pre-join approach, all of the joins and the work to produce data that metrics can be defined upon are done up front. Metrics are then associated with only this one model.
With the second approach, how entities in your data model relate is explicitly defined or inferred. This is done in the framework, rather than in SQL. Metrics are defined using properties from any entity in the data model.
Here are some tweets Nick Handel of Transform posted on this topic:
Approach 1
Companies: dbt, Superset/Preset, Metlo, Glean (internal), Avora (internal)2
Pros
Testing is easier - if the dbt model that your metric depends on runs and passes tests, the metric is good to go. Metric lineage is very clear.
It's easier for analysts to pick up and understand how a metric is defined. The underlying model is SQL, in effect creating a dynamic OLAP cube (see the sketch after this list of pros). It's dynamic because the granularity in dimensions and time is defined on request.
Access control is easier. In order to serve a metric, exactly one underlying resource needs to be accessed. Its content can easily be tightly governed to prevent things like PII leakage.
The amount of YAML required to define the metric is limited and clear.
It doesn't require much in the way of data maturity. An org with only a sliver of good data they want to define metrics on can still reasonably benefit. It's very similar to how some of these orgs use a tool like Tableau on top of a query, acting as an OLAP cube.
As only one query - the model query - needs to work, unwanted problems like join dysfunction/fan-outs caused by a poor data model are less likely (but not impossible).
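To make this concrete, here is a minimal sketch (hypothetical names throughout) of the kind of wide, pre-joined model that Approach 1 metrics are defined against. The metric definition then simply points at this one model and picks out a measure, a timestamp and the dimensions it exposes:

```sql
-- Hypothetical dbt model, e.g. models/marts/orders_metrics_wide.sql
-- All joins are done here, up front; the metrics layer only ever aggregates
-- this one table at whatever grain and dimensions the caller requests.
select
    o.order_id,
    o.ordered_at,                       -- timestamp for time-grain rollups
    c.customer_segment,                 -- dimensions exposed to the metric
    p.product_category,
    o.order_revenue                     -- measure the metric aggregates
from {{ ref('fct_orders') }} o
left join {{ ref('dim_customers') }} c on o.customer_id = c.customer_id
left join {{ ref('dim_products') }}  p on o.product_id  = p.product_id
```

A metric like "revenue" would then be declared (in YAML) against orders_metrics_wide alone, with customer_segment and product_category as its only available dimensions.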
Cons
It doesn't solve for semantics in general - it's limited to metrics (currently). An additional concept of “entities” would need to be added. This could then require metrics to be extended to interface with these entities.
It doesn't define a world and what's in it - each model is more like a microcosm that enables specific metrics. Many such models will be required to cover all of an organisation’s metrics.
Each time a dimension needs to be added to a metric model, the model has to be rebuilt.
It does not prevent many similar, throwaway models from being built.
It doesn’t ensure metrics are defined only once and used everywhere, with a single source of truth.
All of the joins in the model are used regardless of the fields used in the metric call, as the model is materialised ahead of the metric call.
Approach 2
Companies: Transform, Looker (internal), Cube, Lightdash3 (internal)
These are mostly the converse of the above but it's interesting to discuss what's possible.
Pros
It aims to be a general semantics layer, not limited to just metrics but covering other entities too.
Such a semantics layer defines all entities and metrics in a data model once: only one is needed per organisation.
Adding dimensions related to an entity used to define a metric doesn't require the model to be rebuilt.
As there is only one semantic layer in the org, it's not really possible for many duplicates to be built. It should allow for metrics to be defined once for the org.
Only the joins required to calculate a metric are used at metric call time, saving table scans and time/cost (see the sketch after this list of pros).
The definition of the semantics layer generates hugely valuable metadata. High quality automated docs could be generated from it. This could be the holy grail of metadata in conjunction with dbt docs (better yet, integrated) - it actually explains meaning in data!
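As an illustration (hypothetical tables and columns, not actual MetricFlow output): if revenue is defined as a measure on the orders entity and a caller asks for it by month and customer segment, a join-aware compiler only needs the orders-to-customers relationship and can generate something like the following - dim_products is never touched, even though it exists in the data model:

```sql
-- Sketch of the kind of query a join-aware semantic layer might generate
-- for "revenue by month by customer_segment".
select
    date_trunc('month', o.ordered_at) as metric_time,
    c.customer_segment,
    sum(o.order_revenue)              as revenue
from analytics.fct_orders o
left join analytics.dim_customers c
    on o.customer_id = c.customer_id   -- join inferred from declared identifiers
group by 1, 2
```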
Cons
Testing is harder, or at least less clear. You end up needing a powerful compiler that checks for issues at compile time. You would also probably need something similar to what Spectacles does for Looker, to ensure that the many permutations of possible queries work, for E2E testing.
It's harder for analysts to pick up as it wouldn't be defined in SQL - it would most likely be pure YAML (maybe CUE in the future!). Many analysts aren't well versed in data modelling and a semantics layer could seem abstract to them.
Access control is much more complicated - you would probably struggle to administer it in the semantics layer and would have to rely on row- and column-based access control elsewhere.
Metric lineage is less clear.
It requires a very high level of data maturity for an organisation to be ready. They need a good relational model and to have dealt with complexities like entity resolution, test coverage and clean data.
Without the data maturity above, it's very easy to generate low-quality or incorrect results. The data needs to be high quality and tested, for every part of the semantics layer, to prevent join dysfunction and fan-outs.
I believe that both of these approaches may continue in parallel for a while. Organisations with high data maturity would probably be better suited to the second approach, and those with low maturity to the first. However, I believe it would actually be beneficial for us to choose one as a community and stick with it, much like how dbt is the most popular OSS data transformation framework. Both dbt metrics and Transform's MetricFlow are OSS. As we've seen with dbt (for data transformation), Terraform, Docker, Kubernetes... it's better for OSS to be the one tool to rule them all and in the data bind them. Prolonged fragmentation will most likely lead to both approaches being adopted less, in total, than one dominant standard would be. Adoption needs to be high for one of them to become the way we do things.
The idealist in me prefers the second approach. The realist in me says the first will also be needed as an intermediate step as we cross the Data Chasm.
I’m a cofounder here.
I’m an investor.
Yes yes! It's not "either / or", it's about staging. Currently the idea of headless BI is confined to, mostly, Substack posts. If there is a single thing that Drew and I have really internalized about how to create standards over the past six years, it's that you have to start with simple, highly useful constructs and get more sophisticated from there as you train your user community to think about their work using the constructs you've already given them.
I absolutely will never advocate that a semantic layer should not have joins *ever*. But I really think a tremendous amount of the value can happen without them to start out.
I agree with approach #1 to start, but would argue that you don't need a "very high level of maturity" for approach #2. Also, the "cons" of #1 quickly lead to an unmanageable environment, not much different from the old paradigm of using views as single reports.
If you're going to take approach #1 so the broader analyst community is able to engage without needing to know how to do joins, then perhaps build these wide one-big-tables from re-usable entities so that you're not constantly re-defining business logic in every model. For example, this single-table metric model is built from a re-usable dimensional model:
https://github.com/flexanalytics/dbt-business-intelligence/blob/main/models/metrics/product_monthly.sql
Another advantage of this approach is that, as analysts gain SQL skills, they can easily create models on their own because all the joins will be simple inner joins on surrogate keys with the same name. Thus, rather than hindering progress, this approach actually enables progress.
Lastly, taking this approach allows the best of both worlds. If you want to use a tool such as Looker, you can do that too. One thing about the BI tools that I'm sure we can agree on is that complex business model logic does not belong in their proprietary metadata modeling components. I get the need for LookML, but let's start to push that logic to dbt, when possible.