I’ve recently been helping Pete Fein with materials for his Carnegie Mellon University course on Data Warehousing, building the dbt models for the course project. I followed the typical path of landing data into raw tables declared as dbt sources, then making staging models from these.
We’re using DuckDB as the data warehouse for the course, and it could well become the go-to choice for education 🐥. It’s free, easy to use and in-process, so there’s no server to run. It can also handle larger-than-memory data, meaning you can do a lot with the power of a modern personal computer. Thanks to Josh Wills for the help with DuckDBT and the dbt-duckdb adapter 😄.
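To give a flavour of why it suits a classroom: DuckDB will happily query files sitting on disk, with no server and no load step. A minimal sketch (the file path and column names are made up for illustration):

```sql
-- DuckDB runs in-process: no server to stand up, and it streams data,
-- so the files being queried can be larger than available memory.
select
    order_id,
    customer_id,
    order_total
from read_parquet('raw/orders/*.parquet')  -- illustrative path
where order_date >= date '2023-01-01';
```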
Usually, you have a clear goal in mind with a data project: report on revenue, create a data upload for Salesforce... I didn’t have this for the course data project. Instead, my goal was to show the students the steps of the data warehousing lifecycle. So, when making my staging models, I used dbt codegen to make base models from the raw tables I had declared as sources. I did this without thinking about which columns I would actually need going forward, bringing all of them into my staging models.
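The base models looked roughly like the sketch below (the source, table and column names here are placeholders, not the real course data):

```sql
-- Generated with something along the lines of:
--   dbt run-operation generate_base_model --args '{"source_name": "raw", "table_name": "orders"}'

with source as (

    select * from {{ source('raw', 'orders') }}

),

renamed as (

    select
        order_id,
        customer_id,
        order_status,
        order_total,
        internal_notes,    -- columns I had no immediate need for...
        legacy_sync_flag   -- ...came along for the ride as well
    from source

)

select * from renamed
```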
If you think of your staging models as public-facing APIs on your raw data, you should expose only what you need and want to. You need to be able to rigorously test and understand the data you expose, so that you can guarantee its quality. What I did above breaks the rules. In an academic context, it's not the end of the world: the data is public and not sensitive, so the risk is low. In a real-world production use case for a data team, the risks are real.
In a normal project, exposing all your data in staging - in theory - shouldn’t happen, as you know what you need up front. You pull forward the fields you want, clean them, test them and understand them. Then you add more if you find you need them in pursuing your goal. I say ‘in theory’, as sometimes data teams, unclear on what they’ll need going forward, do bring everything forward into staging, applying less rigour to the columns they don’t need immediately. After all, with cloud data warehouses like Snowflake, there is a cost to rebuilding every time you want an additional field which you could have included on the last build.
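A deliberately narrow staging model over the same made-up source might look more like this, pulling forward and cleaning only the fields that are needed (with tests on those columns defined in the model's YAML):

```sql
with source as (

    select * from {{ source('raw', 'orders') }}

),

staged as (

    select
        order_id,
        customer_id,
        lower(order_status)                  as order_status,
        cast(order_total as decimal(18, 2))  as order_total
    from source

)

select * from staged
```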
On legacy systems, such as on-prem OLTP databases used as data warehouses, storage was expensive and compute was a sunk cost. This encouraged promoting only the data needed into staging views, and rebuilding as often as required.
In software development, a leaky abstraction is an abstraction that leaks details that it is supposed to abstract away.
As coined by Joel Spolsky, the Law of Leaky Abstractions states:
“All non-trivial abstractions, to some degree, are leaky.”
This statement highlights a particularly problematic cause of software defects: the reliance of the software developer on an abstraction's infallibility.
Spolsky's article gives examples of an abstraction that works most of the time, but where a detail of the underlying complexity cannot be ignored, thus leaking complexity out of the abstraction back into the software that uses the abstraction. [1]
When the data is in the staging model, people and services can end up using it without the data team knowing. The result is unintended data being accessible via your API - a leaky abstraction. To avoid this, as I mention above, you want to restrict what you expose in your public API.
Snowflake’s recent announcement of managed Iceberg really brought this home to me, too. With tables in Iceberg, many other services can use this data without your knowledge. Before, you could have used Snowflake’s query history to show you what was being used. You can manage column-level access control on Iceberg with things like AWS Lake Formation and Tabular, but I bet many won't.
We blew up Airflow - are we blowing up Snowflake, too? Let's save that for another Wednesday and another post…
What are the options for restricting what you expose, using the data warehouse alone?
One method could be to use column-level access control so that, by default, only intentionally exposed columns can be seen and used. However, as I mentioned above, in the era of the Lakehouse this doesn't cover all bases.
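In Snowflake, for instance, this could be done with a masking policy on the columns you don't intend to expose. A rough sketch, with made-up names and roles (check your warehouse's own syntax):

```sql
-- Mask a column for everyone except the role that is meant to see it.
create masking policy hide_unpublished as (val string) returns string ->
    case
        when current_role() in ('DBT_TRANSFORMER') then val
        else null
    end;

alter table staging.stg_orders
    modify column internal_notes
    set masking policy hide_unpublished;
```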
Staging views, which cost nothing to rebuild, are a decent solution. Yet again, they have the drawback above in the Lakehouse context, and materialisation may need to happen downstream before consumption to ensure performance.
Materialising tables with the minimal columns and maximal rigour is the only option which solves for Lakehouse access.
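In dbt, the difference between these last two options is just the materialisation config on the staging model. A minimal sketch, again with illustrative names:

```sql
-- models/staging/stg_orders.sql
-- materialized='view' keeps rebuilds free; materialized='table' writes the
-- minimal, tested columns to storage, which is what other engines reading
-- the underlying Lakehouse files will actually see.
{{ config(materialized='table') }}

select
    order_id,
    customer_id,
    order_total
from {{ source('raw', 'orders') }}
```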
The problem is that the infrastructure storing the raw and staging data is also where people and services access it. Software engineering has moved away from people and services accessing the production database directly: ORMs give many services a single point of access to the data, without granting access to the database itself.
With a data-focused ORM, we could bring all the data we wanted into staging, saving rebuild cost without unintentional API consequences. Just like ORMs, it probably won’t be entirely watertight as an abstraction, but it would be much better than where we are today.
The dbt-core team is working on something pretty interesting in this regard:
Phase 1: Models as APIs
Goal: v1.5 (April)
Develop new constructs that enable dbt developers to create, contract, and communicate data models like software APIs. This work should enable more scalable monorepos, while also laying the foundation for Phase 2.
Phase 2: Extend to many
Goal: v1.6 (July)
This is an ambitious timeline! If the dates need to change, we'll say when & why.
We will extend the constructs above to multiple projects. Cross-project ref is the tip of the iceberg. We must enable seamless experiences around development & deployment, enabled by dbt metadata. Developers in downstream projects do not need access to the full source code of upstream projects. Instead, they should get only & exactly the information they need, when they need it.
There has been some debate about its implementation, which I don't want to spend much time on here. In short: yes, it's more YAML, but if you want to abstract the actual data, you have to define how in some way or another... 🤷. I'll let others who are more qualified than me do the technical due diligence. I'm sensitive to the idea that what sounds easy to a software engineer is not necessarily so for data folks.
The real problem with it, for me, is that it allows abstraction of the data behind an API, but only within dbt. What if we want to make this wider, covering all data access?
I feel like a semantic layer could be a better solution for this. In my prior series about semantic layers, I described a “true” semantic layer as one that defines entities. In this case, we'd need to go even further by allowing access to entities. Most of the use cases for this would be marketing-related, such as: ‘pull me a segment of users with attributes [x, y, z]’. Imagine an Airflow or Dagster job that could simply make an API request like this to get the data it needs, without writing SQL or Python. Imagine the increased utility in Hightouch or Census.
I suppose dbt could offer this as a feature of their semantic layer in the future, using model contracts as a way of describing each API. It would be more natural, though, for the semantic layer to enable this access directly. You could imagine a semantic layer like Cube being used in this way very easily.
As of now, you can request metrics and dimensions with filters and sorts from the Cube APIs. You can also request the catalog of metrics and dimensions that have been defined.
There would need to be a new object type, the entity, which would also appear in the catalog. You could then request an entity with specific attributes, just as you can already request a metric filtered by specific dimensions.
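To make that concrete, here is a rough sketch in Cube-flavoured SQL. The first query is close to what the SQL API supports today; the second, the entity request, is purely hypothetical, and all cube, measure and attribute names are made up:

```sql
-- Today: request a measure, sliced and filtered by dimensions.
select
    plan,
    measure(count) as user_count
from users
where country = 'GB'
group by 1;

-- Hypothetical: pull the members of an entity with specific attributes,
-- the kind of segment request the marketing use case above needs.
select
    user_id,
    email,
    lifetime_value
from entity('users')
where plan = 'pro'
  and last_seen_at >= date '2023-01-01';
```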
The strength of doing this in dbt-core is that you could define the API once, both for further transformation models in dbt and for other services. It is unlikely that the API for further transformation models could exist outside of dbt-core - it needs the tight coupling with SQL and the data warehouse that dbt-core provides.
It's also the case that, just as Cube is compatible with dbt metrics definitions, it could support exposing dbt model contracts as APIs.
I will look to start the “Human Interfaces of Data” series soon. I’ve had a lot of interest from great data folks who want to contribute, so it’s taking a bit more organisation than I had anticipated 😅
[1] https://en.wikipedia.org/wiki/Leaky_abstraction
I’m a table-as-API fan. In my opinion it makes it clear to all possible consumers what the things are. Of course, depending on the case, one can layer different kinds of API layers on top when needed. I think the best way to achieve this is to separate the public API (semantic entities) and internal APIs (stg, int, what-not) into different schemas. Then it’s easy to just grant people access to the public API schema.