Some fantastic news last Thursday on what was globally a very dark day. Good luck, dbt Labs people, with the next phase of this journey!
I want to share my take on where this amazing success, built mostly in the last three years, has come from.
I think one way to explore this is to ask the question:
Why hasn't anyone come up with anything able to displace dbt?
GCP acquired their own version (Dataform), which uses JS instead of Jinja. I have yet to meet anyone from an org that uses it. When I started my new role in January, the data team had been using BigQuery for the best part of two years without touching it, relying on scheduled queries instead. The fact that it is proprietary, owned by a specific cloud vendor, and restricted to that vendor’s data warehouse means it won’t ever replace dbt; it’s just too closed and will struggle to innovate at the pace of the dbt community.
As I was writing this, I racked my brain to remember the name of the tool made by some Looker engineers, but couldn’t. I Googled “Looker SQL replacement” to finally find it: Malloy. It assumes that expressing data queries and transformations in SQL is not ideal, and provides a way of expressing queries that is similar but also allows for “modern programming paradigms”. However, it could cause more problems than it solves:
SQL is the most used language by data professionals. Even as someone who uses other languages, and is enthusiastic about them, I understand the value of having as wide a reach as possible. The only way of interacting with data that is more widespread than SQL is Excel. As soon as you move away from SQL, you alienate a lot of the practitioners you want to engage with.
Malloy queries still compile to SQL, which the database engine then interprets into tasks in yet another language. There are already concerns about how far SQL is divorced from what a database actually executes, and the inconsistencies that causes. I can’t help but think that another full layer of abstraction could exacerbate these problems.
It’s also quite late to the scene: it came out at a point where the dbt community was probably 20,000 strong already, many of them Looker users who had accepted dbt and Looker as a great union.
There have been more frameworks and companies than these two, and I’m sure many that I haven’t been aware of, but perhaps that’s a point in itself. A lot of data practitioners have been enjoying taking up dbt and building in it; we haven’t felt the need to look at alternatives which cover the same scope.
I think we can safely say that attempts have been made to create alternatives to dbt, but none has seen any great uptake.
So let’s then ask the question: what’s great about dbt?
It's not the systems and tooling: these are still progressing towards maturity.
Even dbt core, while well loved, has technical limitations in its performance, as you might expect from OSS that has only recently had such focus alongside a huge demand for new features. The dbt core team has made great strides in the last year to hugely reduce compile times, but some practitioners from a software engineering background have tried to recreate dbt core in faster compiled languages like Go (Terraform, Docker and Kubernetes are also written in Go), and I’m sure we’ll see a Rusty fork in the future (#iron-dbt).
So, if it's not the system itself or the tooling, what's so special about dbt that makes it able to raise $222m on ~22/2/22?
It's the metadata! More specifically, the metadata that is created automatically through the activity of analytics engineering.
The code is the docs and the docs are the code. There hasn’t ever been something this elegant available for data people:
The lineage provided by the simplicity of ref()
The assertions about your data that can be made as tests
Documentation in a consistent format that can be synced to your database, with the ability for this documentation to be automatically generated to start with
Data sources which are declared as nodes in the DAG
Analyses and exposures to declare how data models are used, exposing dependencies on analytics engineering, and now the metrics layer too!
Macros for DRYness. I have personally built, and seen others build, macros for configurable multi-touch attribution models, vector similarity, collaborative filtering and more! (A minimal sketch of several of these features follows this list.)
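To make this concrete, here is a minimal sketch of what a few of these features look like inside a dbt project. The model, source and column names (`orders`, `stg_orders`, `shop`, `order_id`) are hypothetical, not from any real project:

```sql
-- models/orders.sql (hypothetical model)
-- ref() declares the dependency on stg_orders, which dbt uses
-- to build the lineage DAG automatically
select
    order_id,
    customer_id,
    amount
from {{ ref('stg_orders') }}
```

```yaml
# models/schema.yml (hypothetical)
version: 2

sources:
  - name: shop                # a data source declared as a node in the DAG
    tables:
      - name: raw_orders

models:
  - name: orders
    description: "One row per order"    # docs that live with the code
    columns:
      - name: order_id
        description: "Primary key of the orders model"
        tests:                          # assertions about your data
          - unique
          - not_null
```

Every one of these declarations is made in the normal course of building the project, and each becomes queryable metadata: lineage, test results, and documentation that can be served up by `dbt docs generate`.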
The act of analytics engineering, and using these features, generates metadata that can be used in many different contexts. dbt has made analytics engineering possible and enjoyable, and given many data people an identity as analytics engineers (myself included) who were data “somethings” before.
This is the first time I've seen "living" documentation actually made, and the first time in my career I've believed documentation could be valuable, because it stays up to date.
Data people provide data for the rest of an organisation or customers to operate, but the data we need to operate is metadata. dbt offers analytics engineers a virtuous cycle where doing their job further enables them to do their job well by providing more metadata for future work. Data people who use dbt are generally more effective than those who don’t.
How could you create this metadata more efficiently than declaratively, as part of how the project is built? If you remove it from this process, it's stale immediately. The fact that this is declared in SQL (with a slice of Jinja and the scale of Python) means that this declarative code-as-docs-as-code can be written by a huge number of people in the world; the macro sketch below shows the Jinja side.
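As a small illustration of that "slice of Jinja", here is a sketch of a DRY macro, loosely modelled on the example in dbt's own docs; the macro, model and column names are hypothetical:

```sql
-- macros/cents_to_dollars.sql (hypothetical macro)
-- Define the conversion once, reuse it in every model that needs it
{% macro cents_to_dollars(column_name, precision=2) %}
    round({{ column_name }} / 100.0, {{ precision }})
{% endmacro %}
```

```sql
-- models/payments.sql (hypothetical usage)
select
    payment_id,
    {{ cents_to_dollars('amount_cents') }} as amount
from {{ ref('stg_payments') }}
```

Because the logic lives in one macro, changing the precision or the conversion itself is a one-line change that propagates through every model that calls it.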
Most characters in a dbt repo are pure SQL or docstrings... and this would have to be true of any system that tried to replace dbt too. dbt is the first of its kind in data (at least at scale), so why would something forced to be so similar be able to replace it, let alone be open-source at its core and create a community so well-loved and large? With every day that goes by, it looks less likely, and this is reflected in the valuation of dbt Labs. Ultimately, dbt Labs as an organisation, and the dbt community at large, can adjust dbt to be what it needs to be and further it as a driving force in how data technology evolves.
The fact that dbt core is OSS forever further cements its place in practitioners’ hearts. I think as long as DWHs are king for processing data, dbt will continue to be the most popular metadata and interoperability layer... and I'm just fine with that. dbt’s leadership in this space is stronger and more secure than Snowflake’s dominance in the data warehouse space.
https://future.a16z.com/sql-needs-software-libraries/
I often like to joke that dbt was analysts’ first good relationship, and so they’re all intensely loyal to dbt.
Data people prefer to get things right first time rather than shipping an MVP with many further iterations; our reputations depend on it. If dbt is our first good relationship, you may also find it to be our last for a long time; we’ll just want to get more out of it:
Strength in community
Quoting Benn about Coalesce:
> If Coalesce didn’t fix my heart, it gave me hope for the industry’s soul.
>
> The analytics community is, if not concentric with the tech industry, directly adjacent to it. Tech companies are the biggest employers of its members. Analytics is as demographically homogeneous as the larger tech industry. And analysts are just as inclined as software developers to wrongly turn emotional problems into empirical ones. The analytics community could’ve easily fermented into another toxic backwater, poisoned with everything from juvenile boorishness to dangerous misogyny.
>
> But it didn’t. Though it’s far from perfect, it blossomed into something better, healthier, and safer. Among all the characters in the community—the jesters, the cheerleaders, the philosophers and deep thinkers, the educators, the entertainers, the eager learners—one is conspicuously missing: the arrogant jerks. And that makes all the difference.
The dbt and Locally Optimistic communities (which heavily intersect) have some of the kindest people I’ve ever come across in professional life. We help each other with nothing expected in return, but of course, when we in turn need help it’s always there. I have met collaborators as part of these communities without ever having synchronously communicated with them, and many I have gone on to know even better in a Zoom box. It feels like the advent of the internet, with everyone talking to other people around the world in chatrooms and mIRC, which they never could have before. This community is why an organisation as small as dbt Labs can have such a large impact, and why its practitioners are so effective beyond the powers afforded by analytics engineering. Commercial and adoption success is hard enough to recreate, but community is even harder. More than not particularly wanting a new metadata or interoperability layer, I really don’t want or need a new data community; I love the one I’m part of.
Our stack may be unbundled but our people are closer than ever.
As probably the oldest continual user of Dataform, and the only one to really write anything about it (thanks for the link!), I can chime in:
> proprietary, owned by a specific cloud vendor, and with restricted use on its own data warehouse
DF has an OSS core (https://github.com/dataform-co/dataform), though that puts to rest any hopes of compatibility, for now. GCP restrictions led me to migrate two clients off Dataform to dbt, and I agree that this is where dbt has the value: in an open ecosystem.
And I agree with you entirely on the point that "it's not the systems and tooling". It was a very sad day for me when Dataform was acquired by GCP, because it ended the healthy dynamic of multiple options and put the writing on the wall for a GCP walled garden.
I will say that the Dataform cloud UX (development stalled since the acquisition) is still a much more pleasant experience than dbt cloud, and maybe (maybe) dbt's value is within the "MDS middleware" layer anyway, as the metrics layer and metrics server are only possible due to its widespread adoption.