Going with the Airflow - Part 1
Why Airflow and in what form?
Firstly, what is Airflow and where did it come from?
Apache Airflow is an open-source workflow management platform for data engineering pipelines. It started at Airbnb in October 2014 as a solution to manage the company's increasingly complex workflows. Creating Airflow allowed Airbnb to programmatically author and schedule their workflows and monitor them via the built-in Airflow user interface. From the beginning, the project was made open source, becoming an Apache Incubator project in March 2016 and a top-level Apache Software Foundation project in January 2019. [1]
SeattleDataGuy (now in Denver) on whether you should use Airflow and why people do and don't like it:
I have somehow, throughout my time in Data, managed to avoid having to use Airflow. The first stack I built was composed of dbt (Core and Cloud), Snowflake and Looker. I was really fortunate that this org didn't really use any third-party software at the point I arrived. They had taken the approach of building everything in house (including a CRM system!) or using OSS. All of these services emitted events to S3 if they had data that needed recording and metrics to measure. As a result, we were able to rely entirely on Snowflake's Snowpipe and the Snowflake connector to handle all ingestion into Snowflake from S3.
With data either flowing continuously into Snowflake via Snowpipe or being batch loaded by services using the Snowflake connector, I had no need for Airflow. I was able to simply use dbt Cloud's scheduler to schedule the various dbt jobs we needed to run. I was curious about tools like Airflow, Prefect and Dagster. Internally, our Platform team had decided to use Argo for orchestration, as it fit with their wider strategy. The plan had been for dbt jobs to run as nodes in the Argo DAG at some point in the future, but it hasn't happened so far.
The reasons for wanting dbt jobs to be part of the wider DAG were to do with newer services. We wanted these newer services to be smaller, with fewer dependencies. Having each service depend on the Snowflake connector package, manage transactions and kick off dbt jobs was undesirable complexity. However, this meant we often had to schedule dbt jobs to run after the time we guessed these services would finish. As you can imagine, this created some problems:
Sometimes the service would run late and the dbt job would run, doing expensive work with no new data (or worse, incomplete new data)
Sometimes the service would fail altogether, with the same consequence
Our dbt models had to have logic to find the newest data partition in S3
Our dbt models had to manage changes to our staging data
We even had an issue where the service was scheduled to run in local time while dbt Cloud jobs ran in UTC. Daylight saving time...
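The daylight saving problem in the last point can be reproduced in a few lines. This is a minimal sketch, not the setup described above: the 02:00 local start, the 01:30 UTC dbt schedule and the UK-style offsets are all assumptions made for illustration, and fixed offsets stand in for a real named timezone so the example needs nothing outside the standard library.

```python
from datetime import date, datetime, timedelta, timezone

# Hypothetical local zone, modelled as two fixed offsets (UK-style).
# Real code would use a named zone via zoneinfo instead.
WINTER = timezone(timedelta(hours=0), "GMT")
SUMMER = timezone(timedelta(hours=1), "BST")


def dbt_lag_minutes(day, local_tz, service_local_hour=2, dbt_utc_time=(1, 30)):
    """Minutes from the service starting to the dbt job starting.

    The service is scheduled in local time, the dbt job in UTC.
    A negative result means the dbt job starts before the service.
    """
    service_start = datetime(
        day.year, day.month, day.day, service_local_hour, tzinfo=local_tz
    )
    dbt_start = datetime(
        day.year, day.month, day.day, *dbt_utc_time, tzinfo=timezone.utc
    )
    return (dbt_start - service_start) / timedelta(minutes=1)


# Schedules tuned in summer: dbt starts 30 minutes after the service.
summer_lag = dbt_lag_minutes(date(2023, 7, 15), SUMMER)
# Same schedules after the clocks change: dbt now starts 30 minutes early.
winter_lag = dbt_lag_minutes(date(2023, 1, 15), WINTER)

print(summer_lag, winter_lag)  # 30.0 -30.0
```

Nothing in either schedule changed between the two dates, yet the dbt job went from running after the service to running before it. An orchestrator avoids this by triggering the dbt job when the upstream task finishes rather than at a guessed time.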
These problems showed the need for a larger, engineering-wide DAG, in which a dbt job only runs once the services it depends on have actually finished. This is what a tool like Airflow, Prefect, Dagster or Argo can provide.
I'm currently building Metaplane's data stack with three objectives:
To power our own data use cases, including reporting and analysis
To dogfood our own product
To understand our customers' data stacks and have empathy for the problems they encounter (and therefore how we could solve them)
Worse is better
Funnily enough, in the seconds after writing the section above, I got a notification about this blog post (the strangeness of data twitter unconsciously writing about the same stuff strikes again):
As someone with a background in data engineering but no experience using Airflow, I was pretty confused trying to set it up. I wanted to set it up using Astronomer but even that was just complicated. I had way too many screens popping up every time I tried to launch the application and none of them gave me the information I needed.
Madison goes into why she found Airflow difficult to use, coming to it with fresh eyes (very much the position I’m in). She even tried Astronomer to simplify things before happily settling on Prefect instead. Among our customers, a handful of organisations use Prefect, but probably more than 10x as many use Airflow. Some of the reasons Madison wouldn’t choose it for her own tooling are why I must choose it for ours: I’m trying to gain empathy with our customers. If Airflow is indeed more difficult to use and maintain, then an integration with it could provide more value to its users than one would for users of Prefect and Dagster.
Had powering our own data use cases been my only objective for this data stack, I would have gone straight to a Prefect vs Dagster evaluation, based on Madison's post.
There are some elements of the difficulty around using Airflow that are related to self-hosting it. These issues are ones that we would never be able to help customers with, unless we were to host it for them (this is very much not on the roadmap!). Therefore, I've chosen to use a hosted version. There are three options available:
Astronomer
AWS Managed Workflows for Apache Airflow (MWAA)
GCP Cloud Composer
SeattleDataGuy has also written on the choice between these:
My first DAG is initially going to be very simple. I want to trigger a few Fivetran connectors to run ahead of a dbt job. This is still sufficient for me to gain an understanding of Airflow and also to access some of its metadata. Based on the post above, I will therefore choose MWAA or Composer, as the base cost of Astronomer is quite high whereas the cost of using MWAA or Composer is low. AWS and GCP are happy to provide Airflow as a service because it helps you spend on compute, both on Airflow itself and elsewhere on their platforms. Astronomer, as a VC-backed company in its own right, has to make a profit from providing Airflow as a service alone.
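The shape of that first DAG can be sketched in plain Python before touching Airflow at all. This is not Airflow's API: it uses the standard library's graphlib to order tasks, the task names are hypothetical, and the runner functions are stand-ins for whatever actually triggers Fivetran and dbt. The point is the behaviour that scheduled-by-the-clock jobs lack: the dbt job runs only if every connector sync it depends on succeeded.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it.
# Task names are hypothetical.
DAG = {
    "fivetran_sync_crm": set(),
    "fivetran_sync_billing": set(),
    "dbt_job": {"fivetran_sync_crm", "fivetran_sync_billing"},
}


def run_dag(dag, runners):
    """Run tasks in dependency order and return a status per task.

    A task is skipped when any upstream task failed or was skipped.
    """
    status = {}
    for task in TopologicalSorter(dag).static_order():
        upstream = dag.get(task, set())
        if any(status[up] != "success" for up in upstream):
            status[task] = "skipped"
            continue
        try:
            runners[task]()
            status[task] = "success"
        except Exception:
            status[task] = "failed"
    return status


def succeed():
    """Stand-in for a task that completes normally."""


def fail():
    """Stand-in for a task that errors, e.g. a connector sync failing."""
    raise RuntimeError("sync failed")


healthy = run_dag(DAG, {name: succeed for name in DAG})
broken = run_dag(
    DAG,
    {
        "fivetran_sync_crm": succeed,
        "fivetran_sync_billing": fail,
        "dbt_job": succeed,
    },
)

print(healthy["dbt_job"], broken["dbt_job"])  # success skipped
```

When one sync fails, the dbt job is skipped instead of doing expensive work on stale or incomplete data. Airflow expresses the same dependencies with operators and task ordering, and adds the scheduling, retries and UI on top.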
I will choose MWAA:
Based on the video version of the post above, MWAA seems to have slightly less config than Composer. A “Fisher Price” Airflow sounds great! Even if I can’t use the official Fivetran and dbt Airflow providers with it, it’s only a case of writing three API requests in Python.
Given that most of our infra is on AWS, I’m more likely to get support from our team if I choose MWAA.
Most of our customers are on AWS and could well be using MWAA - or would consider using it, as it’s on the menu (despite my preference for GCP over AWS, yeah I said it 😈).
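On the "three API requests" fallback mentioned in the first reason: the post doesn't say which three, so the sketch below is a guess at what they could look like, namely starting a Fivetran sync, reading the connector's state, and starting a dbt Cloud job. The endpoint paths and auth headers reflect my understanding of the public Fivetran and dbt Cloud REST APIs and should be checked against their current documentation. The function names, IDs and credentials are hypothetical. The code only builds the requests and never sends them.

```python
import base64
import json
from urllib.request import Request

# Base URLs as I understand them; verify against the vendors' docs.
FIVETRAN_BASE = "https://api.fivetran.com/v1"
DBT_CLOUD_BASE = "https://cloud.getdbt.com/api/v2"


def fivetran_headers(api_key, api_secret):
    """HTTP basic auth header built from a Fivetran API key and secret."""
    token = base64.b64encode(f"{api_key}:{api_secret}".encode()).decode()
    return {"Authorization": f"Basic {token}", "Content-Type": "application/json"}


def fivetran_trigger_sync(connector_id, api_key, api_secret):
    """Request 1: ask Fivetran to start syncing one connector."""
    return Request(
        f"{FIVETRAN_BASE}/connectors/{connector_id}/sync",
        data=b"{}",
        headers=fivetran_headers(api_key, api_secret),
        method="POST",
    )


def fivetran_get_connector(connector_id, api_key, api_secret):
    """Request 2: fetch connector details, to poll until the sync is done."""
    return Request(
        f"{FIVETRAN_BASE}/connectors/{connector_id}",
        headers=fivetran_headers(api_key, api_secret),
        method="GET",
    )


def dbt_cloud_trigger_job(account_id, job_id, api_token, cause="Triggered by orchestrator"):
    """Request 3: ask dbt Cloud to start a run of an existing job."""
    return Request(
        f"{DBT_CLOUD_BASE}/accounts/{account_id}/jobs/{job_id}/run/",
        data=json.dumps({"cause": cause}).encode(),
        headers={
            "Authorization": f"Token {api_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder IDs and credentials, for illustration only.
sync_request = fivetran_trigger_sync("my_connector_id", "key", "secret")
status_request = fivetran_get_connector("my_connector_id", "key", "secret")
dbt_request = dbt_cloud_trigger_job(12345, 67890, "dbt-token")

for request in (sync_request, status_request, dbt_request):
    print(request.get_method(), request.full_url)
```

Passing any of these to `urllib.request.urlopen` would send it. In an orchestrated DAG, each request would sit in its own task, with the dbt request placed downstream of the sync and status checks.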
I'm going to do a short course on Airflow to get the basics, either through Datacamp or Coursera (I first learnt Python through Datacamp, so I will definitely take a look there first). As I go through the course and learn about Airflow, I’ll try to implement what I’ve learnt using MWAA.
Part of my aim with this series is to show how to go from zero to one in knowing how to use a piece of tech, for those who feel daunted by picking up something new without a colleague at the next desk to teach them.
The next post in this series will cover spinning up MWAA and finding out how to orchestrate Fivetran connectors.