I really enjoyed learning to use Dagster and believe it’s one of the premier orchestration tools out there. However, whenever I do a deep-dive series on a tool like this, I feel like I’m rushing to learn how it works and to do something practical with it. Having had a bit of time since finishing the series, some thoughts have come to mind about future possibilities.
One of the fundamental questions about using dbt at scale is whether you should have a monorepo or many separate repos. Before writing this post, I was firmly in the monorepo camp - I felt that losing lineage by splitting the DAG was too great a loss. Some teams have homemade solutions to this loss of lineage, such as editing the graph.gpickle to point at nodes in other dbt repos.
While these are inventive and creative solutions to the many-repo, broken-lineage problem… I can’t help but feel they are a bit janky and reduce the transparency of your wider dbt usage. You can no longer go to one place to find all your assets and see how they fit together in a human-viewable way. Orchestration across the many repos is also difficult, as you end up needing to understand how they fit together and to select which feeder repo jobs run into the next ones… it’s a bit of a mess. The best case is that you have the sub-DAGs as nodes on an orchestrator DAG, which runs them together in sequence. However, this is far from ideal - the model-level lineage is lost, what the DAG is actually doing is opaque, and no doubt many unnecessary models will be run, which means bloat.
One of the things that really impressed me about Dagster was how well assets are linked. Recall the DAG from my final post on Software Defined Assets:
Dagster is able to automatically link lineage across different systems when the assets at the end of one and the start of another share the same name. So, in the DAG above, Dagster can derive that the Fivetran syncs are loading into specific table names, and that these same table names are the named sources in my dbt DAG, and so it links them in the DAG. I can’t see why this wouldn’t be possible with multiple dbt DAGs, too.
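As a rough sketch of how that cross-tool linking looks in code (the connector ID, table names and project path below are made up, and this assumes the dagster-fivetran and dagster-dbt integrations):

```python
from dagster_dbt import load_assets_from_dbt_project
from dagster_fivetran import build_fivetran_assets

# Fivetran sync assets: the destination tables below give asset keys like
# ["hubspot", "contact"]. The connector ID and table names are placeholders.
fivetran_assets = build_fivetran_assets(
    connector_id="my_hubspot_connector_id",
    destination_tables=["hubspot.contact", "hubspot.deal"],
)

# dbt model assets: a dbt source named "hubspot" with tables "contact" and
# "deal" resolves to the same asset keys as the Fivetran assets above, so
# Dagster joins the two lineages without any extra wiring.
dbt_assets = load_assets_from_dbt_project(
    project_dir="path/to/my_dbt_project",
)
```

Because the Fivetran destination tables and the dbt source definitions resolve to the same asset keys, the lineage is stitched together for free (you still attach the Fivetran and dbt resources before materialising anything).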
A simple example could come from my DAG above. Imagine I wanted the Hubspot dbt models, which come from the Fivetran Hubspot dbt package, to live in a separate dbt repo, with just the models I created myself in another. Provided I named the models I depended on as sources in my repo (or perhaps even without this), Dagster would be able to connect the two DAGs at the asset level. It would be as if you really just had one DAG, but could manage it as multiple.
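A hedged sketch of what that might look like, assuming both dbt projects sit alongside the Dagster code - the paths, key prefixes and source names are illustrative:

```python
from dagster_dbt import load_assets_from_dbt_project

# dbt project 1: the Fivetran Hubspot package models, with their asset keys
# prefixed so they look like ["hubspot_package", "<model_name>"].
hubspot_package_assets = load_assets_from_dbt_project(
    project_dir="dbt_projects/hubspot_package",
    key_prefix=["hubspot_package"],
)

# dbt project 2: my own models, which declare a dbt source named
# "hubspot_package" whose table names match the models above. Those source
# asset keys (["hubspot_package", "<model_name>"]) line up with the model
# asset keys from project 1, so Dagster stitches the two DAGs together.
my_model_assets = load_assets_from_dbt_project(
    project_dir="dbt_projects/my_models",
)
```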
Imagine a more complicated scenario: a Marketing team has Analytics Engineers who have built a Marketing-specific dbt project, and a Finance team whose Analytics Engineers have built a dbt project for Finance. The Finance project depends on a few models from the Marketing project, using some paid search cost dbt models as sources. If these dbt projects both lived in a Dagster repo, separately, the Finance AEs could have a job in Dagster which updates a final asset for key reporting, whose upstream assets clearly include the paid search cost models owned by Marketing. If the Marketing team were to make changes that would break the Finance assets, a CI test which tries to materialise all downstream Dagster assets would tell them. They would be able to understand the impact of their changes across the data lineage of their whole organisation.
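As a sketch, with hypothetical asset names ("finance_kpi_report" for the Finance team’s final reporting model, "paid_search_costs" for the Marketing model it depends on), the Finance job and the Marketing impact check could be expressed as asset selections:

```python
from dagster import AssetSelection, define_asset_job

# Job the Finance AEs can schedule: materialise the reporting model and
# everything upstream of it, including the Marketing-owned cost models.
finance_reporting_job = define_asset_job(
    name="finance_reporting_job",
    selection=AssetSelection.keys("finance_kpi_report").upstream(),
)

# Selection a Marketing CI check could materialise: everything downstream of
# the paid search cost model, to see whether a change breaks Finance's assets.
marketing_impact_selection = AssetSelection.keys("paid_search_costs").downstream()
```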
Benn recently wrote about data contracts and covered an area I hadn’t thought much about before - the contracts between dbt models in the same lineage. These contracts are very hard to manage and enforce with multiple dbt projects in separate repos, but with software-defined assets spanning all of your dbt projects, you can see during development which contracts a change will break, because the downstream models break.
Dagster recently announced (in 1.1.1) that you can use Software Defined Assets within the context of dbt Cloud jobs, down to the model level, and will soon support subsetting of those jobs. What this means is that, as I did in the second post of my series on Dagster, I could have brought in my dbt Cloud job, and all the models run inside it would become available as assets in Dagster. These assets can be composed in the same way as any other Software Defined Assets in Dagster, so you could have a job which runs the upstream assets of one of the models in the dbt Cloud job. This effectively means that Dagster handles the dynamic filtering of dbt Cloud jobs; the dbt Cloud assets are still run on dbt Cloud infrastructure and environments.
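At the time this looked roughly like the sketch below, using the (then experimental) dbt Cloud integration in dagster-dbt - the account and job IDs are placeholders:

```python
from dagster_dbt import dbt_cloud_resource, load_assets_from_dbt_cloud_job

# Connection to dbt Cloud; the token comes from an environment variable and
# the account ID is a placeholder.
dbt_cloud_instance = dbt_cloud_resource.configured(
    {
        "auth_token": {"env": "DBT_CLOUD_API_TOKEN"},
        "account_id": 12345,
    }
)

# Every model run by the dbt Cloud job becomes a software-defined asset in
# Dagster, while the runs themselves still execute on dbt Cloud.
dbt_cloud_assets = load_assets_from_dbt_cloud_job(
    dbt_cloud=dbt_cloud_instance,
    job_id=67890,
)
```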
I believe that, where possible, the orchestrator, the transformer and the engine should remain separate. So, in this case, the orchestrator is Dagster, dbt is the transformer and the engine is Snowflake or whatever DWH you’re using with dbt. The dbt environment is separate and different from that of Dagster, as it is managed in dbt Cloud.
In my final post of the series, I described how I was helped to sync my Metaplane-dbt repo with my Metaplane-Dagster one, so that the copy of the dbt project inside my Dagster repo stayed in sync, with CI/CD workflows testing changes on the Dagster side. I’ve begun to think that if I were to start a brand new dbt project, I might just start it inside a Dagster repo, to avoid the need for this sync. That way, CI/CD is much more straightforward - any change you make to your dbt project triggers a Dagster Cloud CI workflow.
I’ve probably said this multiple times, but learning dbt brought me much closer to the Software Engineering way of working than any of the work I had done before. Previously, data work had always felt like running a bunch of tasks in order (“lining up dominoes”) - Airflow very much feels like this too, but Dagster does not.
In some ways, developing in Dagster peels away more of the facade in front of the Software Engineering workflow than dbt does - you have to think about __init__ files, importing things you’ve made elsewhere in the project, deployment requirements… While I’ve enjoyed this learning experience, there are many who won’t, and Dagster needs to reduce some of the learning required, or provide low/no-code ways to manage things in Dagster Cloud for these users (no-code can still generate code elsewhere for other uses in the org).
I’ve often thought that the main threat to orchestrators is the end of batch: if we don’t have batch jobs to run, then we don’t need them, do we? But for streaming to replace batch entirely, the result of every function would need to be derivable incrementally - can the result of every function be known based on its stored previous results? Take an average: the previous result on its own is not enough to take the next value in the stream and recalculate the average - you would also need to maintain a count of the values seen so far. It gets even more complicated with calculations such as standard deviation, and I’m sure there are some instances where it doesn’t work at all. I think batch transformation will be around for a long time, even if it sits on streaming data.
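To make that concrete, here is a toy sketch (not tied to any particular tool) of the extra state an incremental calculation needs: a count for a running mean, and a count, mean and sum of squared differences (Welford’s algorithm) for a standard deviation.

```python
def update_mean(mean: float, count: int, new_value: float) -> tuple[float, int]:
    """Incrementally update a running mean given one new value from the stream.

    The mean alone is not enough state - the count has to be carried too.
    """
    count += 1
    mean += (new_value - mean) / count
    return mean, count


def update_welford(count: int, mean: float, m2: float, new_value: float):
    """Welford's online update: returns the new count, mean, sum of squared
    differences (m2) and the current variance - even more state than the mean.
    """
    count += 1
    delta = new_value - mean
    mean += delta / count
    m2 += delta * (new_value - mean)
    variance = m2 / count if count > 1 else 0.0
    return count, mean, m2, variance
```

So even a simple aggregate needs auxiliary state to be maintained alongside the result, which is exactly the kind of bookkeeping batch transformation sidesteps.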
It would be easy to imagine Dagster being able to define the lineage of streams flowing into each other.
Dagster Community Day - My Highlights
Introducing declarative scheduling
Freshness policies - specify how up-to-date you expect your assets to be.
Freshness-based scheduling allows you to avoid having the same asset refreshed multiple times by different runs.
Versions of assets allow you to refresh only the assets which have changed.
You can observe a source asset to see if it has changed; when it changes, its downstream dependencies become stale (sketched in code below this list).
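Here is a minimal sketch of these features, based on the asset APIs announced around that time - the asset names and version values are illustrative, and some of these APIs were experimental and may have changed since:

```python
from dagster import DataVersion, FreshnessPolicy, asset, observable_source_asset


# A table loaded by a system outside Dagster: observing it records a data
# version, and downstream assets become stale when that version changes.
# The hard-coded version string here is purely illustrative.
@observable_source_asset
def raw_orders():
    return DataVersion("2022-12-01")


# A freshness policy declares this asset should be at most an hour out of
# date; code_version lets Dagster skip refreshes when nothing has changed.
@asset(
    deps=["raw_orders"],
    freshness_policy=FreshnessPolicy(maximum_lag_minutes=60),
    code_version="1",
)
def orders_summary() -> None:
    ...  # build the summary table here
```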
Serverless Dagster Cloud
Fast deploys - Dagster Cloud doesn’t build a new Docker image on every deploy, it just pushes your Python code to run in an existing container, saving a lot of time. If dependencies are changed, then these are rebuilt and uploaded too.
The Docker image is rebuilt in the background - not holding you up.
Non-isolated runs - previously every run used its own container, which adds overhead as you need a container to be provisioned. You can now choose to just use an existing container, so the run starts immediately.
Infra-as-code for DE + Dagster beliefs
You can define an Airbyte connection in code, then have Dagster check and apply it, much like Terraform plan and apply (see the sketch after this list).
A single source of truth needs config as code, not config stored in the UI of a SaaS application.
Dagster believes data management is a software engineering discipline, which is consistent with how I felt about learning how to use Dagster.
Data assets should be defined in software, i.e. code.
Change is best managed through the Software Development Lifecycle.
You shouldn’t be clicking around a UI to change business logic; it’s fragile and dangerous.
Terraform isn’t best for this: it was designed for infra, not business logic, and data people shouldn’t need to learn it to make changes to business logic. Terraform also has no knowledge of the rest of your data stack and assets.
Why should the orchestrator hold the IaC for data assets? The orchestrator is the ultimate source of truth for your data assets and their dependencies: a single pane of glass for the data team, and the centre of the software engineering lifecycle for data teams.
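For reference, the Airbyte ingestion-as-code demo looked roughly like the sketch below. I haven’t used this API myself, so treat the connector classes and their arguments as assumptions rather than the exact interface:

```python
from dagster_airbyte import (
    AirbyteConnection,
    AirbyteManagedElementReconciler,
    AirbyteSyncMode,
    airbyte_resource,
)

# Connector-specific classes are generated under dagster_airbyte.managed.generated;
# the ones used here, and their constructor arguments, are illustrative assumptions.
from dagster_airbyte.managed.generated.destinations import LocalJsonDestination
from dagster_airbyte.managed.generated.sources import PokeapiSource

# Point at a running Airbyte instance (values are placeholders).
airbyte_instance = airbyte_resource.configured({"host": "localhost", "port": "8000"})

# Declare the connection - source, destination and per-stream sync modes - in code.
connection = AirbyteConnection(
    name="pokeapi_to_json",
    source=PokeapiSource(name="pokeapi", pokemon_name="pikachu"),
    destination=LocalJsonDestination(name="local_json", destination_path="/local/data"),
    stream_config={"pokemon": AirbyteSyncMode.full_refresh_overwrite()},
)

# The reconciler compares this declared state with the live Airbyte instance.
reconciler = AirbyteManagedElementReconciler(
    airbyte=airbyte_instance,
    connections=[connection],
)
```

You would then run something like `dagster-airbyte check --module <your_module>:reconciler` to see the diff against the live Airbyte instance, and `dagster-airbyte apply` to push it - hence the Terraform plan/apply comparison.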
I have an active thread on Dagster if you want to join the discussion!