Public Transformation

Why we should open-source our dbt repos

Jul 13, 2022

There are probably 10,000+ organisations with active dbt projects, given that the graph below was produced in February. However, relatively few dbt repos have been made public and open-sourced for all. It's my belief that most dbt repos could be made public. There is a big shortage of Analytics Engineers (AEs) out there, so sharing methods that could save other AEs time is a good thing. While I really enjoy being an AE, there is a lot of repetitive work, particularly when setting up a project or onboarding a new source of data.

Once again, tearing a leaf out of the book of the recent history of software engineering... before OSS became prevalent, software was either bought or built internally with no sharing (apart from engineers wandering off with codebases 😮). The proliferation of OSS enabled agility in SWE never seen before - huge parts of startup codebases and tools are OSS. Until recently, we haven't had the ability to share data transformation codebases in a way that others could benefit from. With dbt-core we now have this and I feel it's actually easier to share our data transformations than our software. With software, it's often intellectual property that has been created by an organisation looking to monetise it. Making it OSS could be seen to reduce its value, as, in theory, everyone then has access to it.

In Data, the value lies in the combination of the data itself and the transformation of it. If someone only has the way in which the data is transformed but not the data itself, the organisation which produced the transformation processes hasn't, in theory, lost any IP. This is why it's very possible for AEs to open-source what they've built without making their org less competitive in the market. However, this isn't the case for Data Analysts (DAs). A DA's work combines the data with the processing of it and, ideally, produces commercially valuable insights and recommendations. That work should make their org more competitive, so it is more sensitive.

What about security concerns, you ask? I see a few possible problems:

What if there is hardcoded PII such as email addresses in the repo? Remove it! PII should not be in any git repo.
What if seeds contain sensitive information? Some orgs do store commercially sensitive things, like cost and spend, in seeds, but the same result can be achieved by other means, such as loading that data into the warehouse as a regular source rather than committing it to the repo.
What if the logic of a model or macro itself is sensitive? An example could be in risk modelling, where rules could be built into a dbt model. If bad actors could easily see the logic, they could reverse engineer a way to evade the rules implemented. I think this does preclude some things from being in a public dbt repo. Some larger FinTech orgs who have large dbt projects could split them into private repos which are sensitive and public ones which aren't. I do understand that this comes with its own extra overhead.
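The first of these concerns is the easiest to check mechanically before flipping a repo to public. Below is a minimal sketch in Python, not a complete PII audit: it walks a project directory and reports anything that looks like an email address. The function names, the regular expression and the list of file suffixes are all assumptions made for illustration, and the pattern is deliberately simple, so it will miss some valid addresses and may flag some false positives.

```python
import re
from pathlib import Path

# Deliberately simple pattern (an assumption for this sketch): it catches
# typical addresses, not every address that is technically valid.
EMAIL_PATTERN = re.compile(
    r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}"
)

# File types assumed to matter in a dbt project; adjust to suit your repo.
SCANNED_SUFFIXES = {".sql", ".yml", ".yaml", ".csv", ".md"}


def find_emails(text):
    """Return (line_number, match) pairs for each email-like string in text."""
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for match in EMAIL_PATTERN.findall(line):
            hits.append((line_number, match))
    return hits


def scan_project(root):
    """Map each scanned file under root to its list of email-like hits."""
    results = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in SCANNED_SUFFIXES:
            text = path.read_text(encoding="utf-8", errors="ignore")
            hits = find_emails(text)
            if hits:
                results[str(path)] = hits
    return results
```

Calling `scan_project(".")` from the root of a project returns a dictionary of file paths and the line numbers that need attention. Bear in mind that this only inspects the files as they are now: anything that was committed and later deleted is still present in the git history, which becomes public along with the repo.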

two grey CCTV cameras — Photo by Miłosz Klinowski on Unsplash

I do believe that, for many orgs, the concerns above aren't really an issue, and there would be many benefits to making their repos public:

As AEs, we have all benefited from being part of the dbt community, whether directly or indirectly. We should look to give back to it.
It benefits us as Analytics Engineers in the long run. When we move to a new org we don't have to figure out how to do something we've already done before. We can also learn from how someone else has solved the exact problem we're about to solve. This is the OSS philosophy.
Solving problems together as a community is fun and fulfilling.
It could save us all cost/energy if the most efficient solutions are shared and widely used.
It allows small AE teams to punch above their weight in terms of what they can deliver. Some of these teams work in non-profits and charities, where this kind of efficiency has real benefit in the world.
Some problems are a bit niche and not the core focus of an AE's role. If they can be solved with OSS packages, then this frees up an AE to focus on their main job. SWE has had OSS solutions for some of the most niche use cases for a long time now.

There are some examples of public repos out there, which is great, and I hope to see them increase.

There are also source specific dbt packages becoming increasingly available, and increasingly maintained by the source organisation. This is really positive and fits with a data contracts approach where the source organisation owns the API and provides good tooling to interface with it. In many ways, this is better than documentation and, as they are dbt packages, documentation can also be included:

Rudderstack have published a few packages on dbt hub to solve different use cases, with data output from their platform.
Snowplow also have some.
Fivetran have a huge number, covering many of their most popular sources.

You could argue that vendors are trying to increase their retention by being integrated further up the stack. There is almost certainly an element of this at play, but at the same time I would say it’s a fair trade for the work saved and knowledge gained.

Here are some examples of public dbt repos I've made - mostly basic ones, but I do go back and refer to the odd test and method I've used before:

As described in earlier posts, I was only at Ruby Labs for a short time, but the repo I made here is still something I find helpful to refer to.
As an advisor and investor at Gravity Data, I helped set up their data stack including this repo. While a microcosm of a repo, it enabled them to have reporting in Lightdash going from sources to metrics, defined in dbt.
And I have recently started this repo at Metaplane.dev while I set up the stack here (just an init but watch this space as I will blog the stack set up).

In each of these cases, I have already introduced, or will introduce, dbt to potential Analytics Engineers of tomorrow. I aim to continue to do this at any org where I’m in-house data, advisor or consultant. If all of us do this, we won’t have a shortage of AEs in the market for long.

So I’ve put my money where my mouth is, and I’ve asked previous companies where I set up private repos whether they would be willing to make them public. I would ask all Analytics Engineering teams to consider whether they could feasibly make their dbt repos public.

white and black Together We Create graffiti wall decor — Photo by "My Life Through A Lens" on Unsplash

davidj.substack
