A Very Modern Data Stack
Interoperability is to 2022 what Observability was to 2021
The Key Component
When I was in the interview process for my new role at Ruby Labs, I received an offer earlier than I expected - I had shared my recent interview with Tristan Handy and Julia Schottenstein on the Analytics Engineering podcast after my first round, and it turned out this answered most of their technical questions for me!
The one part of the stack that I’m sure I want, ahead of starting at Ruby Labs, is dbt. In my role at Lyst, we managed to transform our data transformations (drum sting) using it and it was a platform for success. In the last two years, dbt has grown rapidly on every metric: features, team, community, use… Its adoption has been so strong that many tools now integrate with it as a framework or are built directly on top of it. Its central place in the modern data stack means it now provides more than the transformation layer it started with; it provides a data interoperability framework. Choosing tech and tools which integrate well with dbt makes for a resilient analytics architecture which is very fast to set up and iterate on.
From what I’ve learned about Ruby Labs from my interview process and intro chat with a data team member, everything seems fairly greenfield which is exciting for me, as it means I can be flexible with my solutions and won’t have many existing things to support. Ruby Labs has mParticle for data collection from platforms, backend services written in Go and everything hosted in GCP (there is a collection of other platforms at the bottom of this careers page). Based on this knowledge, I have proposed an initial data stack:
Airflow here is the GCP Cloud Composer hosted version. I haven’t included an ELT tool here, but if we have to integrate data from some third-party platforms or need reverse ETL, I might bring in Gravity Data, which I work with as an investor and advisor (think much cheaper than Fivetran, but with the reliability you need for data replication that full OSS solutions make your responsibility). Gravity Data will have a dbt integration in the next few weeks based on the metadata API and post/pre-hook triggers.
Continual, Lightdash and Hex are fairly new tools but the bet is that their current and future integrations with dbt will enable shorter time to value, lower maintenance and fewer places where things are defined.
There are plenty of things I haven’t mentioned like a catalog such as Atlan, or an observability tool like Monte Carlo or Metaplane. This is very much a foundational stack and tools like these can be brought in as needed in a few months.
Having used both Snowflake and BigQuery in the past, there are features I prefer in Snowflake, but if BigQuery is mostly comparable in terms of features, then the additional interoperability within GCP is something worth trying to retain. In the past with Snowflake I have adopted a pattern of continuously ingesting semi-structured data into raw tables, and then parsing out what is needed afterwards with dbt. BigQuery has just released support for semi-structured data, which will enable me to continue using this flexible pattern. Both of these data warehouses are well supported by dbt with core adapters. At this point in time, I will only choose a data warehouse that has a dbt adapter.
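To make the raw-then-parse pattern concrete, here is a minimal Python sketch. It is only an illustration: in practice the raw table would live in the warehouse and the parsing step would be a dbt model written in SQL. Every name here (`RawEvent`, `ingest`, `parse_events`, the example payload fields) is hypothetical and not taken from any real schema.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class RawEvent:
    """One row of a hypothetical raw landing table."""
    loaded_at: str  # load timestamp, kept alongside the payload
    payload: str    # JSON text stored exactly as it was received


def ingest(raw_table: List[RawEvent], loaded_at: str, payload: str) -> None:
    """Append the payload untouched; no schema is enforced at load time."""
    raw_table.append(RawEvent(loaded_at=loaded_at, payload=payload))


def get_path(doc: Dict[str, Any], path: str) -> Optional[Any]:
    """Walk a dotted path, returning None when any key along it is missing."""
    current: Any = doc
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def parse_events(raw_table: List[RawEvent]) -> List[Dict[str, Any]]:
    """Stand-in for a downstream staging model: extract only the needed fields."""
    return [
        {
            "loaded_at": row.loaded_at,
            "event_name": get_path(doc, "event"),
            "user_id": get_path(doc, "user.id"),
            "plan": get_path(doc, "user.plan"),
        }
        for row in raw_table
        for doc in [json.loads(row.payload)]
    ]


# Two payloads with different shapes land in the same raw table.
raw_events: List[RawEvent] = []
ingest(raw_events, "2022-01-01T00:00:00Z",
       '{"event": "signup", "user": {"id": 1}}')
ingest(raw_events, "2022-01-02T00:00:00Z",
       '{"event": "upgrade", "user": {"id": 1, "plan": "pro"}, "extra": [1, 2]}')

staged = parse_events(raw_events)
```

The point of the design is that a new or missing field in the source never breaks the load: the second payload carries fields the first one lacks, and the parsing step simply yields an empty value where a field is absent. Deciding which fields matter is deferred to the transformation layer, where it can be changed and re-run over the retained raw data.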
There are some new contenders like Rockset and Firebolt, which have recently announced dbt adapters and which I would love to try too. They both have the potential to outperform BigQuery and Snowflake for specific use cases, and Rockset in particular is interesting as it avoids a few of the batch-related steps in typical ingestion patterns for semi-structured data.
Interoperability within data architecture is key and will become something customers normally expect of products. This is why architects have been reluctant to use services outside of their cloud vendor; Snowflake is a good example of a company ensuring interoperability with cloud vendors to make it an easy choice. Integration with dbt is becoming a way to deliver interoperability for any data product.
I really want to support open source like dbt and Lightdash, but I think you have to be pragmatic about what works too, and you don’t want to have to guarantee everything stays up yourself. I’ve found going for the hosted version of open source tools to be a great compromise, as you still use and contribute to something that belongs to the community but don’t have to worry about infrastructure (being responsible for a database or ETL tool staying up is difficult or risky without a lot of resource).
However, one piece of the stack must stay open source to stop the whole stack fragmenting: the interoperability layer/framework. If you’re long enough in the tooth to remember the dark old days of Microsoft, Teradata, and Oracle having completely separate and poorly interoperable stacks, you’ll know why I think that, as a global data discipline, we never want to go back to that place. I can see one or more mega vendors arising that automatically set up that open-source interoperability layer for you and connect it to a few of their preferred tools, with bundled pricing and the choice to swap some parts out for what you might prefer. All three cloud vendors could probably achieve this in the next year or two if they didn’t want to make the interoperability layer proprietary. Listening to this episode of the Analytics Engineering podcast really crystallized the importance of this.
I’m sure that as I learn more upon joining and find out the limitations of some of these new technologies, this stack will change, and I’ll share an updated stack in a future post!