Highly Opinionated Integrations
...how any tool should integrate with the interoperability layer
The glue holding the modern data stack together
You might say the data community as a whole is a bit too enamoured with dbt, but that's like saying DevOps engineers like Terraform too much: it enables them to operate in a way that wasn't possible before, and dbt does the same for many data professionals.
I'm old and grey enough to remember what it was like when vendors used interoperability as a lock-in tactic, and I will never willingly go back to that. I started out my career in data working on Oracle and MS SQL Server stacks, often running on servers less powerful than this year's top-spec MacBook Pro. It was so hard to get even EL and a little bit of T done that we would spend a week every month on month-end: copy-pasting from websites into Excel, editing, saving to CSV, running the MS SQL Server Import/Export Wizard, watching a stored procedure fail, fixing the data, trying again… and so on.
Then came the cloud, and with it the cloud data warehouse. Suddenly, what had taken a team of engineers to babysit and keep alive could be spun up in a few clicks by an analyst. Powerful and simple EL tools came out which made this workflow possible with a team of one. However, even with the availability of the cloud DWH and the removal of barriers to data access, analysts were still effectively using stored procedures for data transforms. They needed one more power to become data heroes…
Out of this need formed a community of analysts and engineers who wanted to solve this together, focused on what we now call the Modern Data Stack. The last power needed was dbt.
dbt is a development framework that combines modular SQL with software engineering best practices to make data transformation reliable, fast, and fun.
dbt, like any other tool, isn't perfect, but because dbt Core is open-source we can change it or fork it if needed, or ask dbt Labs for new features and wait for them to be developed. I now expect data vendors to integrate with dbt as a matter of course.
With dbt, at this time, we have a chance to have a shared data interoperability layer that enables a vast number of data vendors to enter the market; the foundations of a data OS.
We started to imagine the specifics of a world Benn painted in broad brush strokes a few weeks ago: one in which a data OS was not just a shared primitive across only data workflows — but a fundamental component of how all applications will be built tomorrow.
Being replaceable is a feature, not a bug!
When a vendor can be replaced easily, it’s a positive feature for the customer. It’s money stashed under the mattress for a rainy day! It’s like an employee who automates themselves - you don’t get rid of them, you give them a wider scope.
Interoperability lock-in is something I feel some vendors still yearn for, which is why you see platforms that try to have a very large footprint and cover as much of the whole stack as possible. It's not possible to be the best at everything in the stack, all at the same time.
The last 10 years have seen specialist software companies, which are very focused on their use cases, succeed over large-footprint software vendors. Some of the biggest movements in our space in the last few years are indicative of this success in focus: Looker, Tableau, Snowflake, Amplitude. This is not even considering the fundraises in the data space in the last two years alone; most of this capital has been invested into specialist software which abstracts complexity away from the consumer. It has been predicted that cloud vendors will be almost entirely abstracted from consumers by specialist software providers. I feel that, in Erik's vision of 2030, interoperability lock-in does not work for any vendor who wants to focus on solving their use cases in the most innovative ways possible. Therefore, it makes sense for there to be a shared open-source interoperability layer.
The issues faced
dbt Labs is a very young company; I still remember the days when Drew manned Intercom on dbt Cloud! This concept of an interoperability layer is nascent, and where it ends and other providers begin still hasn't been defined. dbt Labs must have “short toes” when working with providers who want to integrate with them and work in the space. I think any confusion about who is competing with dbt Labs' proprietary product, dbt Cloud, is partly due to how new some of these concepts are and how undefined the space is.
I think what would really help is to define what a great integration with dbt looks like per category in the Modern Data Stack… here’s my take:
EL(T)
Reference integration:
Should be able to kick off a dbt run via the dbt Cloud admin API after one or many connector jobs have run (a minimal sketch follows this section)
It should be possible to call one or many connectors as a pre-hook in a dbt run or model, at least as a simple API request or preferably with a richer integration (syntactic sugar) where an EL connector can be defined like a dbt source that runs just in time
Uploads metadata about connector runs to the dbt Metadata API to be consumed elsewhere
If SQL or simple transforms are defined in this tool, they must have two-way sync with the customer's dbt repo, where they are defined as high-quality models
Nice to have:
Consumes metadata from dbt Metadata API and pushes this to a customer’s DWH or cloud storage
dbt packages for standardised data models based on ELT outputs into DWH
Boundary of operation:
What happens on the DWH should be defined in dbt, whether or not the ELT tool puts a nice UI on top of it. This would probably discourage ELT vendors from doing it at all, as it then involves building or adopting two-way sync with a dbt repo
What happens before the DWH should not be defined in dbt or run by dbt Cloud; it can be orchestrated by pre-hook API calls as described above. dbt should not move data
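To make the first bullet concrete, here's a minimal sketch in Python, assuming a hypothetical EL vendor API; the dbt Cloud endpoint follows the shape of the v2 admin API, but check the current docs (and use your own account and job ids) before relying on it:

```python
import os

import requests

# Hypothetical example: trigger one or more EL connector syncs, then kick off a dbt
# Cloud job. The connector endpoint is invented for illustration; the dbt Cloud
# endpoint is illustrative of the v2 admin API rather than a guaranteed contract.

DBT_CLOUD_TOKEN = os.environ["DBT_CLOUD_TOKEN"]
ACCOUNT_ID = 12345  # placeholder dbt Cloud account id
JOB_ID = 67890      # placeholder dbt Cloud job id

def sync_connector(connector_id: str) -> None:
    """Ask the EL vendor's (hypothetical) API to run one connector sync."""
    resp = requests.post(
        f"https://api.example-el-vendor.com/v1/connectors/{connector_id}/sync",
        headers={"Authorization": f"Bearer {os.environ['EL_VENDOR_TOKEN']}"},
    )
    resp.raise_for_status()

def trigger_dbt_cloud_job() -> None:
    """Kick off a dbt run via the dbt Cloud admin API once the loads have finished."""
    resp = requests.post(
        f"https://cloud.getdbt.com/api/v2/accounts/{ACCOUNT_ID}/jobs/{JOB_ID}/run/",
        headers={"Authorization": f"Token {DBT_CLOUD_TOKEN}"},
        json={"cause": "Triggered after EL connector sync"},
    )
    resp.raise_for_status()

if __name__ == "__main__":
    for connector in ["salesforce", "stripe"]:  # placeholder connector ids
        sync_connector(connector)
    trigger_dbt_cloud_job()
```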
Reverse ETL
Reference integration:
It should be possible to call one or many connectors as a post-hook in a dbt run or model, at least as a simple API request or preferably with a richer integration, where a reverse ETL connector can be defined like a dbt analysis (a minimal sketch follows this section)
Uploads metadata about connector runs to the dbt Metadata API to be consumed elsewhere
Boundary of operation:
SQL transformations should not happen or be defined in the rETL tool; only things which aren't possible on a DWH, for example conversion of tabular data to a custom/legacy format ahead of use in a connector (but this could be part of the connector itself)
dbt Cloud should not connect to targets which are not DWHs, or move data directly
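And a sketch of the post-run direction, again with an invented rETL vendor endpoint; the dbt Cloud run-status endpoint and fields are illustrative of the v2 admin API, not a guaranteed contract:

```python
import os
import time

import requests

# Wait for a dbt Cloud run to finish, then trigger a reverse ETL sync for the models
# it refreshed. Account, run and sync ids below are placeholders.

ACCOUNT_API = "https://cloud.getdbt.com/api/v2/accounts/12345"  # placeholder account id
HEADERS = {"Authorization": f"Token {os.environ['DBT_CLOUD_TOKEN']}"}

def wait_for_run(run_id: int, poll_seconds: int = 30) -> None:
    """Poll the dbt Cloud admin API until the given run completes."""
    while True:
        run = requests.get(f"{ACCOUNT_API}/runs/{run_id}/", headers=HEADERS).json()["data"]
        if run["is_complete"]:
            if not run["is_success"]:
                raise RuntimeError(f"dbt Cloud run {run_id} failed")
            return
        time.sleep(poll_seconds)

def trigger_retl_sync(sync_id: str) -> None:
    """Push the freshly built model to the downstream SaaS tool (hypothetical vendor API)."""
    requests.post(
        f"https://api.example-retl-vendor.com/v1/syncs/{sync_id}/trigger",
        headers={"Authorization": f"Bearer {os.environ['RETL_VENDOR_TOKEN']}"},
    ).raise_for_status()

if __name__ == "__main__":
    wait_for_run(run_id=111222)        # placeholder dbt Cloud run id
    trigger_retl_sync("crm-accounts")  # placeholder sync id
```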
Orchestration
Reference integration:
One or more dbt runs should be able to be nodes in an orchestrator's DAG, either as a container running dbt Core or as a dbt Cloud API request (both options are sketched after this section)
Uploads metadata about connector runs to the dbt Metadata API to be consumed elsewhere
If the orchestrator provides a tool for more complex transforms, it should be dbt and two-way synced with the customer's repo. If the customer doesn't have a repo, one should be set up in the background for them, with the transforms still defined in dbt
Nice to have:
Wrapper around the dbt Cloud API request to trigger a run, so customers don't necessarily have to write the API request themselves
Boundary of operation:
SQL transformations should not happen or be defined in the orchestration tool; only things which aren't possible on a DWH, for example conversion of tabular data to a custom/legacy format ahead of use in a connector (but this could be part of the connector itself)
dbt Cloud should not try to orchestrate anything beyond the step immediately before or after a dbt run/model (pre-hooks and post-hooks)
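A sketch of what "a dbt run as a DAG node" could look like, with both options; the function names and ids are illustrative rather than any specific orchestrator's API:

```python
import os
import subprocess
from typing import Optional

import requests

# One dbt run as one node in the orchestrator's DAG: either run dbt Core inside the
# task's own container, or trigger the equivalent dbt Cloud job via the admin API.

def run_dbt_core(project_dir: str, select: Optional[str] = None) -> None:
    """Run dbt Core in-container; the orchestrator treats this as a single task."""
    cmd = ["dbt", "run", "--project-dir", project_dir]
    if select:
        cmd += ["--select", select]
    subprocess.run(cmd, check=True)  # raises if the dbt run fails, failing the DAG node

def run_dbt_cloud_job(account_id: int, job_id: int) -> None:
    """Alternatively, trigger a dbt Cloud job and let dbt Cloud own the run itself."""
    requests.post(
        f"https://cloud.getdbt.com/api/v2/accounts/{account_id}/jobs/{job_id}/run/",
        headers={"Authorization": f"Token {os.environ['DBT_CLOUD_TOKEN']}"},
        json={"cause": "Triggered by orchestrator"},
    ).raise_for_status()
```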
Data Warehouses and other storage-decoupled compute applied to analysis (DWH)
Reference integration:
dbt Adapter for the DWH, fully featured (e.g. includes incremental functionality and allows specific features of the DWH to be exposed in config, such as Snowflake warehouses or partitioning)
Connection to the DWH can be abstracted from other tools via dbt Server, so DWHs only need to ensure connectivity with dbt Cloud rather than with every other tool in the ecosystem
Nice to have:
core dbt Adapter, supported by dbt Labs (currently Redshift, BigQuery, Snowflake, Spark/Databricks, Postgres)
Boundary of operation:
DWHs commonly allow for stored procedures, but building in a DAG-based transformation tool similar to dbt will cause fragmentation, so this shouldn't happen. There is nuance between a cloud vendor incorporating this as a built-in feature and offering it as a separate service, as GCP does
dbt Labs should not try to make a true DWH, or even incorporate an open-source one into the dbt Cloud platform, as this will damage the ecosystem. An exception could be allowing something really small like TinyDB or DuckDB, so that smaller orgs with tiny data can enjoy the benefits of dbt via dbt Cloud without needing a cloud vendor
Observability
Reference integration:
Consumes data from the dbt Metadata API about dbt runs, plus other data input there by tools like ELT/rETL/orchestration, and provides alerting on this data with understood lineage (a minimal sketch of consuming the Metadata API follows this section)
SQL-based tests must be pushed down to dbt tests in the customer's dbt repo with two-way sync (so tests defined in the repo are also shown in the observability tool); a repo should be made for the customer if they don't have one, with the tests generated as code there
Two-way sync with dbt docs, should the observability tool allow input of documentation. Otherwise, taking the docs from the repo to display in the UI is sufficient
Nice to have:
Root cause analysis of where in a dbt run a failure or issue occurred and why, with rich but concise logging
Ability to interact with scheduled dbt Jobs in dbt Cloud, to suspend them if more errors or damage will occur. E.g. the observability tool delivers a Slack notification about a dbt Job failure, with a button to suspend the Job from running again
Boundary of operation:
Testing which can be defined in dbt must be defined there, i.e. the tests that dbt Core currently supports. This is reflected in the second bullet of the reference integration above
dbt Labs should not create sophisticated ML-based observability features in dbt Cloud, such as categorical feature entropy, time series analysis, freshness, or root cause analysis
Alerting is a grey area, I feel, as the observability tool is your command station to deal with data incidents. It makes sense for the alerting to be consolidated here. However, at the same time it would be strange to say dbt Cloud should not offer alerting, given it has scheduled runs which can have issues. Also, not every dbt user will have an observability tool
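As a rough illustration of the first reference-integration bullet, here's how an observability tool might poll the dbt Cloud Metadata API (GraphQL); the endpoint and field names are how I understand that API today, so treat them as a sketch rather than a contract, and the job id is a placeholder:

```python
import os

import requests

# Sketch: consume the dbt Metadata API and raise an alert for any model that didn't
# finish successfully in the latest run of a given dbt Cloud job.

QUERY = """
{
  models(jobId: 67890) {
    uniqueId
    status
    executionTime
  }
}
"""

def fetch_model_run_health():
    resp = requests.post(
        "https://metadata.cloud.getdbt.com/graphql",
        headers={"Authorization": f"Token {os.environ['DBT_CLOUD_TOKEN']}"},
        json={"query": QUERY},
    )
    resp.raise_for_status()
    return resp.json()["data"]["models"]

if __name__ == "__main__":
    for model in fetch_model_run_health():
        if model["status"] != "success":
            print(f"ALERT: {model['uniqueId']} finished with status {model['status']}")
```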
Data Catalog/Discovery
Reference integration:
Consumes data from the dbt Metadata API about dbt runs, plus other data input there by tools like ELT/rETL/orchestration, and provides deep exposure of lineage with a broader scope than a dbt DAG (a minimal lineage sketch follows this section)
Two-way sync with dbt docs, should the catalog tool allow input of documentation, plus pushing of other entities like data owners into the dbt entity layer
Nice to have:
Consumes data from dbt Metadata API about dbt Server usage
Boundary of operation:
Documentation about what dbt operates on should be based upon the dbt repo and kept in sync
dbt Labs should not try to enable documentation and discovery about assets outside of what dbt operates on
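For the lineage piece, a catalog tool doesn't even strictly need the hosted Metadata API; here's a minimal sketch that parses the manifest.json artifact any dbt invocation produces (the same metadata dbt Cloud serves), with an illustrative path:

```python
import json

# Turn each dbt node's depends_on list into lineage edges that a catalog/discovery
# tool can stitch into its wider graph.

with open("target/manifest.json") as f:
    manifest = json.load(f)

edges = []
for unique_id, node in manifest["nodes"].items():
    for parent in node.get("depends_on", {}).get("nodes", []):
        edges.append((parent, unique_id))

for parent, child in edges:
    print(f"{parent} -> {child}")
```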
Feature Stores and ML Ops
Reference integration:
Builds feature sets on top of existing dbt models and syncs these models into the dbt repo (this can happen after the experimentation phase, once it's decided which features go into production)
Predictions should also be models synced back to the repo as a new dbt entity, possibly called predictions, which don’t have to be on the left of the DAG like sources
ML Ops logging should be input into the dbt Metadata API for consumption by other providers and services
Use of dbt metrics layer to use already defined metrics as features, or creation of new features layer in dbt for this purpose
Boundary of operation:
All data transformations managed by these platforms should be synced back to the dbt repo as high-quality models (docs, tests, refs, formatting); a minimal sketch of such a sync follows these notes
dbt Labs should not try to offer ML services, but allow dbt to act as a framework to be used by these vendors
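Here's a hypothetical sketch of what "sync the transformation back to the dbt repo" could look like from the ML platform's side; the repo path, model name and SQL are all made up:

```python
import textwrap
from pathlib import Path

# Write the feature query into the customer's dbt repo as a documented model, instead
# of keeping it only in the ML platform's own store. A real sync would go through git.

REPO = Path("analytics-dbt-repo")  # placeholder path to a checkout of the customer's repo

def sync_feature_model(name: str, sql: str, description: str) -> None:
    model_dir = REPO / "models" / "features"
    model_dir.mkdir(parents=True, exist_ok=True)

    # The transformation itself, written with refs so lineage is preserved in the DAG.
    (model_dir / f"{name}.sql").write_text(sql)

    # A minimal schema.yml entry so the model ships with documentation (tests could follow).
    (model_dir / f"{name}.yml").write_text(textwrap.dedent(f"""\
        version: 2
        models:
          - name: {name}
            description: "{description}"
        """))

sync_feature_model(
    name="ftr_customer_order_velocity",
    sql="select customer_id, count(*) / 90.0 as orders_per_day\n"
        "from {{ ref('fct_orders') }}\n"
        "where order_date >= dateadd(day, -90, current_date)\n"
        "group by 1\n",
    description="Rolling 90-day order velocity per customer, synced from the feature store",
)
```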
What about streaming and cold start ML situations, you say? I believe dbt can be used for streaming data too, which I explain below. Then most of these principles remain. With DWHs like Rockset and Materialize emerging, the boundary between streaming and batch is starting to blur already.
Layer.ai has its own declarative ML Ops definitions which are very similar to dbt; for this reason, I think it's the wrong idea and it would be better to extend dbt, especially as Layer.ai is entirely proprietary. However, I definitely feel this is up for debate!
BI Tools
Reference integration:
Sits on top of the dbt metrics layer (a sketch of reading metric definitions follows this section)
If any transformations are possible in the tool, these must be two-way synced with the customer's dbt repo
If it's possible to define metrics in the tool, these must be two-way synced with the metrics layer in the customer's dbt repo
If the customer doesn’t have a dbt repo for metrics or transform definitions, one must be made in the background for them
Consumes and displays metadata related to data health from dbt Metadata API, in a way that informs users of safety in using the data
Nice to have:
Consumes metrics from dbt Server (moves to reference integration in a year or two)
Connects to DWH via dbt Server
Boundary of operation:
BI tools should not try to make a metrics layer or a transform layer
dbt Labs should not try to make a data viz tool
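As a sketch of "sitting on top of the metrics layer", a BI tool could read metric definitions straight out of dbt's manifest rather than redefining them; the exact manifest fields vary by dbt version, so this is illustrative only:

```python
import json

# Recent dbt versions include a "metrics" entry in target/manifest.json; surface those
# definitions in the BI tool instead of maintaining a parallel set of metrics.

with open("target/manifest.json") as f:
    manifest = json.load(f)

for unique_id, metric in manifest.get("metrics", {}).items():
    print(unique_id, metric.get("name"), "-", metric.get("description", ""))
```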
Hosted Notebooks/Analytical Research tools
Reference integration:
Custom queries should allow use of dbt refs and variables (a naive ref-resolution sketch follows this section)
Custom queries, when put into production or published, should be synced back to the dbt repo as dbt analyses, so that dependencies can be understood by analytics engineers. This includes subsequent smaller in-memory steps that are defined in SQL (the first step from the DWH becomes a CTE, which then becomes the query)
Consumes and displays metadata related to data health from dbt Metadata API, in a way that informs users of safety in using the data
Nice to have:
Can consume metrics from dbt Server (moves to reference integration in a year or two)
Connects to DWH via dbt Server
Boundary of operation:
These vendors should not try to make a metrics layer or a transform layer
dbt Labs should not try to make such a tool
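A naive sketch of the ref piece: resolve ref() against the manifest before the query is sent to the DWH. A real integration would lean on dbt's own compilation (or dbt Server), but this shows the idea; the model name in the example query is made up:

```python
import json
import re

# Map model names to fully qualified relation names using target/manifest.json, then
# substitute {{ ref('...') }} in a notebook query before it hits the warehouse.

with open("target/manifest.json") as f:
    nodes = json.load(f)["nodes"]

relations = {
    node["name"]: f"{node['database']}.{node['schema']}.{node.get('alias') or node['name']}"
    for node in nodes.values()
    if node["resource_type"] == "model"
}

def resolve_refs(sql: str) -> str:
    """Swap {{ ref('model_name') }} for the model's fully qualified relation name."""
    return re.sub(
        r"\{\{\s*ref\(\s*'([^']+)'\s*\)\s*\}\}",
        lambda m: relations[m.group(1)],
        sql,
    )

print(resolve_refs("select * from {{ ref('dim_customers') }} limit 10"))
```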
Product Analytics
Reference integration:
Sits on top of one or more dbt-modelled Activity Schemas. If the tool allows automation of this modelling, then it must be two-way synced with the customer's dbt repo
If it's possible to define metrics in the tool, these must be two-way synced with the metrics layer in the customer's dbt repo once the metrics are saved and shared
If the customer doesn’t have a dbt repo for metrics or transform definitions, one must be made in the background for them
Nice to have:
Connects to DWH via dbt Server
Boundary of operation:
Product Analytics tools should not try to make a metrics layer or a transform layer
dbt Labs should not try to make a product analytics tool
This clarity about integrations defines where the “short toes” end.
These integrations may be harder to build the first time round, but if we have shared ways of doing this for everyone, it's not as hard or slow as some may think (e.g. dbt-supported libraries for two-way sync).
Not everyone will agree with my take on these integrations, but what matters more is that whatever is agreed is applied consistently.
Currently, vendors compete on dbt integration, but once they've reached an agreed reference integration, the competition should be on features not related to dbt.
I don't believe dbt's own tooling should use ML in any way; I don't think it fits with clearly defining in code how something should work, which is core to dbt Labs' mission. It also starts to impinge on how vendors can compete in the ecosystem. ML applied on top of dbt constructs is a great space for vendors to exist in (Monte Carlo, Metaplane, Continual, Hex, Deepnote…). dbt is about engineering, and mindfully so; ML naturally introduces uncertainty and risk, which is not compatible. Imagine if HashiCorp's Terraform offered a service which used ML to predict what size resource you needed… suddenly you haven't clearly defined something as code.
Hopefully, if we get to a point where something like my take on the integrations above is agreed upon, it will help the modern data stack avoid fragmentation and interoperability lock-in.
dbt for streaming and beyond SQL
I don't feel dbt is locked into SQL; it's more a standard abstraction over DWH APIs, which usually happen to speak SQL. If DWHs support Python or R or Julia... then dbt can also have these languages directly in models, with some adapter modification. Run operations and headers could also be used where there is no materialization, such as model training. Because of Snowpark and Databricks, Python, Java, Scala and JavaScript can already exist in dbt models.
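To make that concrete, here's a rough sketch of what a Python model could look like on a Snowpark-style adapter; the def model(dbt, session) shape, the model name and the column names are illustrative, not a finished spec:

```python
import pandas as pd  # available inside the warehouse's Python runtime

# The model name "stg_orders" and the uppercase column names are made up; Snowflake
# returns unquoted identifiers in uppercase when converting to pandas.

def model(dbt, session):
    orders = dbt.ref("stg_orders").to_pandas()  # Snowpark DataFrame -> pandas
    orders["ORDER_MONTH"] = pd.to_datetime(orders["ORDER_DATE"]).dt.to_period("M").astype(str)
    # Return a dataframe; dbt materialises it back into the warehouse as a table.
    return orders.groupby("ORDER_MONTH", as_index=False)["ORDER_ID"].count()
```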
I see no reason dbt can't support streaming... Each initial event stream can be seen as a source, and used to make another model materialized as a stream or table while being referenced. Just as we have dbt adapters for the DWHs currently supported, we would have them for the likes of Kinesis, PubSub and Confluent. dbt would probably need to support multi-architecture projects (PubSub and BigQuery in one project, for instance). Parts of the DAG which are streams built on streams would have no need for scheduled runs; the config from the run after a merge would persist in the stream configs.
If you’ve made it this far, thank you for reading! I know it’s been heavy going this week! Dissent and debate are welcome as always.