TCO - Total Cost of Ownership
The TCO of a tool is the sum of all the costs of using it: the license or SaaS subscription, labour to host it, labour to use it, cloud instances used for self-hosting, integration costs and opportunity costs.
Using open-source software and self-hosting it eliminates any license and SaaS subscription costs. However, it ramps up labour costs, usually increases cloud costs[1] and often increases integration costs[2].
Using a hosted version of open-source software eliminates the labour to host it[3], often reduces the labour cost to use it because the hosted version has enhanced features (typically RBAC and authorisation protocol support), does not increase cloud costs and has lower integration costs than self-hosted OSS[4].
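To make the comparison concrete, here is a rough back-of-the-envelope sketch of how these components add up. The tco helper and every figure in it are hypothetical placeholders, not benchmarks - the point is only that labour, integration and opportunity costs belong in the sum alongside the license and cloud bills.

```python
# Back-of-the-envelope TCO comparison. All figures are hypothetical
# placeholders (annual, in your currency of choice) - substitute your own.

def tco(license_fee=0, subscription=0, labour_hosting=0, labour_using=0,
        cloud=0, integration=0, opportunity=0):
    """Total cost of ownership: the sum of every cost component."""
    return (license_fee + subscription + labour_hosting + labour_using
            + cloud + integration + opportunity)

# Self-hosted OSS: no license or subscription, but more labour, cloud and
# integration cost, plus the opportunity cost of stakeholder work not done.
self_hosted = tco(labour_hosting=40_000, labour_using=30_000, cloud=15_000,
                  integration=20_000, opportunity=50_000)

# Hosted offering: a subscription replaces the hosting labour, extra cloud
# instances and most of the integration work, freeing time for the "last mile".
hosted = tco(subscription=30_000, labour_using=20_000, integration=5_000,
             opportunity=10_000)

print(f"self-hosted OSS: {self_hosted:,}  vs  hosted: {hosted:,}")
```

On numbers like these the hosted option wins comfortably; with your own estimates it may not, which is exactly why the full sum, not just the software line item, should drive the decision.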
Opportunity cost is the one I want to home in on. In the self-host vs buy decision for data tooling, it is the value of the work that could have been done elsewhere with the time saved, had the data team chosen to buy a tool rather than self-host. It’s hard to measure, but it’s real. This opportunity cost is always borne by your stakeholders - it’s the cost of not running the “last mile” in data. Your stakeholders don’t get what they need to get value from data.
Data teams who take an “IT” view of cost think they’re doing a good job by saving software and cloud costs. It’s easy to take this stance, as they’re the things you can easily measure and control. They don’t account for their own labour costs and the knock-on opportunity cost borne by their stakeholders. These teams act as a cost centre and then are surprised when they are treated as one. In order to be treated as something other than a cost centre, you need to generate or help another team generate profit.
What we’ve seen over the last couple of years is that many data teams were laid off in part or in their entirety because they were seen as bloated and not delivering value. If a data team takes a purist approach - only using open-source software that they self-host and glue together themselves, managing scaling and so on - there is a really big risk that they spend a very long time, years even, setting up their stack without delivering any value to stakeholders. It also means they need a bigger team to manage this platform.
Hiring extra people when you have an alternative is probably the wrong idea - it’s costly and time-consuming to hire well. The new people you bring in initially slow you down, as you have to integrate them into the team too. This exacerbates the increase in time to value for a data team that wants to self-host or build everything themselves.
I think there is a key distinction between product engineers at tech companies, who should build the products that are the core competency of their organisation, and data engineers, who shouldn’t build data infrastructure that already exists and can be hosted or bought. The problem is that these engineers often sit in the same management structures and cultures, which then make the same decisions about build/host/buy. If you don’t work for a giant tech company with thousands of engineers and your company doesn’t sell a data software product, then you probably shouldn’t be trying to build complex data software like an orchestrator, data warehouse or BI tool.
A lot has been said about bloated data stacks too, but if you’re only buying the tools you need - the ones you would self-host anyway if you didn’t buy them - then data stack bloat isn’t really at play.
Post-modernism
Tomasz Tunguz wrote a post on the PostModern Data Stack. I wouldn’t necessarily pick all the same tools, but there is a theme.
These companies start with great open-source offerings, then build services and a cloud offering on top in order to monetise. Cube, where I am today, is a good example of this, but so are Dagster, Motherduck (they have a different but very good relationship with the provider of OSS they build upon), Tobiko, Clickhouse and Lightdash.
Omni does not fit the bill here, as they have the same proprietary approach as Looker before them - Tomasz has invested in them. Lightdash is the true post-modern successor to Looker. I’m not saying this because I invested in them; I invested in them because they are this.
What’s so good about this method of starting with OSS and moving to paid services and cloud offerings? As a data tool buyer, most of the time you can assume that the TCO of self-hosting the OSS will be higher than the TCO of buying the cloud/service offering. You can try before you buy on battle-tested software with a great reputation. And you get an off-ramp back to OSS in the future, if you need it.
Why didn’t I include dbt in the list? Because the situation is less clear now. From a transformation point of view, it used to fit. dbt Cloud used to be an inexpensive way to run dbt transformation jobs, and I’ve bought that version of it a couple of times to avoid Airflow[5]. Since then, however, dbt Cloud has become more expensive and charges differently for running transformation jobs.
Many data teams have chosen to run dbt-core jobs themselves or with other tools like Dagster - evidence that the TCO of running dbt-core elsewhere may be lower than paying for dbt Cloud. Part of this is also the bundling of dbt-core into tools like Fivetran, excellent support from tools like Dagster, and backwards compatibility from Tobiko… meaning it’s easier than ever to run dbt-core without dbt Cloud, which compounds the effect of dbt Cloud’s price increases. This is like being a victim of your own success - every other tool supports dbt-core because of its ubiquity.
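As an illustration of how little is involved, here is a minimal sketch of running a dbt-core job without dbt Cloud, using dbt-core’s programmatic invocation API (available from dbt-core 1.5 onwards). The project directory and target name are placeholder assumptions, and any orchestrator - Dagster, Airflow or even cron - can wrap a function like this.

```python
# Minimal sketch: invoke dbt-core directly, no dbt Cloud required.
# Assumes dbt-core >= 1.5 plus a warehouse adapter (e.g. dbt-snowflake) is
# installed, and a dbt project with a valid profiles.yml lives at project_dir.
from dbt.cli.main import dbtRunner, dbtRunnerResult


def run_dbt_build(project_dir: str = ".", target: str = "prod") -> bool:
    dbt = dbtRunner()
    cli_args = ["build", "--project-dir", project_dir, "--target", target]
    res: dbtRunnerResult = dbt.invoke(cli_args)

    # Log per-node outcomes, then report overall success to the caller.
    if res.result is not None:
        for r in res.result:
            print(f"{r.node.name}: {r.status}")
    return res.success


if __name__ == "__main__":
    raise SystemExit(0 if run_dbt_build() else 1)
```

Returning a boolean keeps the function trivial to wrap in a Dagster op or an Airflow task that should fail loudly when a model fails.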
dbt’s second-generation semantic layer, based on MetricFlow, is not open-source and doesn’t fit this post-modern pattern.
Where tools are not open-source, once they become popular, innovation stalls and prices rise so that the company can achieve operating leverage and provide a higher return for investors upon exit, or they make great acquisition targets where the acquirer can leverage the product and customer base while cutting costs. This is not wrong, but it’s also the end of the great relationship the company once had with its customers. It is also part of the reason why data teams try to self-host and build their entire stack, but this is an over-reaction where post-modern tools exist.
It feels like the open-source offering is the sacrifice a vendor makes for the loyalty and love of customers and users, provided it continues to evolve. Savvy data leaders know they need a plan B for any proprietary tool they use - lock-in has proven to be a disaster with this kind of tool. dbt-core has made it fairly painless to switch between Snowflake, Databricks and BigQuery, which in turn has made competition in the data warehousing space pretty fierce and kept the market healthier. dbt has partly eliminated the lock-in of data warehousing vendors.
Open-source creates internal competition within a company. If the TCO of self-hosting the OSS ever drops below that of the paid offering, the paid offering suffers. If the OSS falls significantly behind competitors, the paid offering can also suffer, because the off-ramp is no longer a credible feature. GTM folks may not like this competition, but product folks know it keeps the company honest.
I remember when I used to know all the companies using Snowflake in London - everyone else was still on SQL Server, or occasionally a Big Data stack. I feel like it’s coming round again: many companies have made it onto MDS tools, and a new generation is coming up with a new paradigm. Unlike the Big Data stacks, which were too hard to run despite being OSS, these new tools are also OSS, but it’s easy to buy a paid version and get going in seconds, just like the MDS tools before them. It’s the best of both worlds.
[1] Oftentimes an organisation has a really big cloud provider commit, so spending some cloud compute cost doesn’t have a financial impact - it’s already paid for. This so often drives tooling choice, as infrastructure managers want to use up their cloud commit.
[2] I’ve seen teams choose an open-source tool because it’s “free” but then spend a lot of time building custom integrations with the rest of their stack when a paid tool wouldn’t have required this.
[3] The exception to this is where a large enterprise requires the software to be run on its own infrastructure, in which case there is labour associated with provisioning that infrastructure for the software provider.
[4] When you pay for a service, you expect it to have the integrations you need now or soon. Companies that provide this hosted software are often willing to build an integration you need in order to win your business.
[5] Airflow is from the Big Data era and is feeling quite ancient today. To boot, Astronomer (the commercial hosted version of Airflow) is very expensive, especially given GCP Cloud Composer and AWS MWAA offer much cheaper hosted versions (pretty much the cost of the instance). I would count cloud-provider-hosted versions of Airflow as a similar concept to self-hosting, and the TCO of these is definitely lower than buying Astronomer.