Thanks for reading this series so far. For those of you who haven’t seen parts 1-3, here are the links:
After I included a quote from Stephen’s post, Ray Buhr, a legend from the dbt community (he’s everywhere at the same time, at least on dbt Slack), responded to it in an insightful way. It’s a really long quote that I agree with on many levels, so I’ll break it up and add my thoughts.
I think you are right on target with the overall message. Airflow was built to string tasks together, not provide an overview of all the ways data is flowing or what’s causing issues.
Sure, Airflow has logs for failed tasks, but logs become diluted as we start using more complex tasks. As we’ve moved towards isolating tasks through the use of operators that run on remote servers (such as database calls, Kubernetes, Databricks, etc.), there’s a level of indirection between the workload and Airflow that’s helpful for reliability but comes at the cost of transparency. To Astronomer’s credit, they’ve done a great job providing Airflow as a scalable tool. You can have a different Airflow instance for each team or project or layer of your data stack, and you get resource monitoring and statistics about performance. But Astronomer doesn’t help you better understand what’s breaking in your pipeline or where you could simplify or consolidate tasks.
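To make Ray’s point concrete, here’s a minimal sketch of that indirection, assuming the common pattern of shipping work off to Kubernetes (the image and script names are hypothetical): Airflow only sees the pod’s exit status and whatever logs it streams back, while the actual workload lives elsewhere.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

with DAG(
    "remote_workload",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
) as dag:
    # The real work runs in a pod on a remote cluster; Airflow is
    # reduced to launching it and watching for an exit code.
    transform = KubernetesPodOperator(
        task_id="transform",
        name="transform",
        image="my-registry/transform-job:latest",  # hypothetical image
        cmds=["python", "run_transform.py"],       # hypothetical script
        get_logs=True,  # stream pod stdout back into the Airflow task log
    )
```

Reliable, yes - but everything interesting about the run happens inside that container, not in Airflow.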
Astronomer is making it easier to use Airflow as software. This is necessary, as it’s a really painful experience to use without any assistance. As I mentioned in the previous posts in the series, I have tried just installing it locally… failed and gave up. I’ve tried pip install, brew install for astro, and Docker - no luck. I know I’m not the most gifted with infra and platform stuff, but it shouldn’t be this hard. I’ve met Data Engineers who’ve said the same thing but found Prefect and Dagster a breeze to use.
I recently had a chat with folks at Astronomer about a possible collaboration between Metaplane and Astronomer, and learned that over the coming months they are working on a bunch of products and features that’ll directly address pain points felt by smaller teams. Of course, I offered myself as a guinea pig, as MWAA, despite being functional, isn’t great to use:
I’d have to run a second MWAA instance for development, which I would stop when not in use
Find a way in my CI/CD process to push my code to the dev S3 bucket
Start the dev MWAA instance
Run the changed DAG
If no errors, then do the same on my prod S3 bucket and refresh my prod MWAA instance upon merge
I think I could figure all of this out with time, but I’m sure it would be as brittle as a test tube. Also, fundamentally, the time I would spend setting this up doesn’t in itself get me closer to my goal - it just removes the roadblocks.
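For illustration, here’s a hedged sketch of the push-and-trigger steps, assuming a dev bucket and environment named as below (both hypothetical) and using only documented AWS APIs (S3 uploads and MWAA’s CLI-token endpoint):

```python
import base64

import boto3
import requests

DEV_BUCKET = "my-dev-dags-bucket"    # hypothetical
DEV_ENV = "my-dev-mwaa-environment"  # hypothetical

# Push the changed DAG file to the dev S3 bucket
s3 = boto3.client("s3")
s3.upload_file("dags/my_dag.py", DEV_BUCKET, "dags/my_dag.py")

# Get a short-lived CLI token for the dev MWAA environment...
mwaa = boto3.client("mwaa")
token = mwaa.create_cli_token(Name=DEV_ENV)

# ...and trigger the changed DAG through MWAA's CLI endpoint
resp = requests.post(
    f"https://{token['WebServerHostname']}/aws_mwaa/cli",
    headers={
        "Authorization": f"Bearer {token['CliToken']}",
        "Content-Type": "text/plain",
    },
    data="dags trigger my_dag",
)
print(base64.b64decode(resp.json()["stdout"]).decode())
```

Even this rosy sketch skips provisioning, waiting for MWAA to sync the new file from S3, and any error handling - exactly the parts that would make it brittle.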
A great dev workflow is something I hope Astronomer Cloud will enable.
All of this doesn’t solve the issue around transparency. As Airflow becomes primarily an orchestrator of remote resources, each node becomes a complex black box that is possibly a DAG of its own. Transparency disappears - each node has its own log stream that isn’t visible in Airflow. This adds extra complexity: because you need sensors for everything, you end up building two nodes for each action you want. One of the great things about dbt models is that you don’t need to build sensors - each model is an operator and sensor in one. This is naturally easier when the data warehouse is the only kind of system executing operators in the DAG.
It would be great if a sensor existed in dbt to trigger a run when data was ready, as this would help with running transformation jobs at the same cadence as the data is refreshed. Often, ELT providers and CDPs only send new batches of data to a customer’s storage, whether cloud storage or DWH, at certain intervals - doing it in real time would be too expensive. If sensors could launch a run whenever this data arrived - an event outside the Data Engineer’s control - that would be a good solution.
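Until dbt grows such a sensor, the closest approximation I can think of is an Airflow sensor gating a dbt run - which also shows the two-nodes-per-action pattern in miniature. A minimal sketch; the bucket, key pattern, and dbt selector are all hypothetical:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

with DAG(
    "run_dbt_when_batch_lands",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@hourly",
) as dag:
    # Node 1: does nothing but watch for the upstream batch to arrive
    wait_for_batch = S3KeySensor(
        task_id="wait_for_batch",
        bucket_key="s3://my-landing-bucket/orders/{{ ds }}/*.parquet",
        wildcard_match=True,
    )
    # Node 2: the action itself - refresh only the affected models
    run_dbt = BashOperator(
        task_id="run_dbt",
        bash_command="dbt run --select tag:orders",
    )
    wait_for_batch >> run_dbt
```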
Back to Ray’s quote:
Think about dbt and one of the best parts of that tool is the `ref` – you can build reusable blocks that the rest of the data pipeline can take advantage of. Airflow doesn’t really have that concept. Sure, you create DAGs that trigger off of other DAGs and tasks that use the output of other tasks, but that’s not really the same thing. You have to go out of your way to try and use anything from outside the current DAG and it’s risky to do so.
A greater ability for dbt as a technology would be to also run sources (connectors, API gets) that live outside of the DWH on demand. This has been called software-defined assets, or reverse orchestration. I alluded to it before, in my just-in-time analytics post:
I had intended that you could run a command like `dbt run -m +asset_to_refresh+` or `dbt get metric -select +metric_called`, but clearly I didn’t explain this well:
We stop orchestrating our data pipelines on the basis of dominoes (= tasks): how well they’re lined up, their spacing, etc. We just think about the ball we want knocked over at the end (probably something in Salesforce, as apparently all roads lead there).
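Dagster’s software-defined assets are the clearest existing expression of this idea, so here’s a hedged sketch in its terms (asset names and bodies are made up): you declare the assets and their dependencies, then ask for the one you want refreshed.

```python
from dagster import asset


@asset
def raw_payments():
    # The "source" step: a stand-in for an API pull that lives
    # outside the warehouse. Hypothetical data for illustration.
    return [{"id": 1, "amount": 42}, {"id": 2, "amount": 0}]


@asset
def payments(raw_payments):
    # The "transform" step: depends on the upstream asset purely
    # by naming it - much like dbt's ref()
    return [p for p in raw_payments if p["amount"] > 0]
```

Asking Dagster to materialise `payments` and everything upstream of it (`+payments` in its selection syntax) is the generalised version of `dbt run -m +asset_to_refresh+` - the ball at the end, rather than the dominoes.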
The final part of Ray’s quote:
The other thing that I want to call out is how weird data engineering teams can be about cost. They will often spend an enormous amount of time and energy reducing infrastructure costs, but that’s only worthwhile if you have enormous data volume and velocity. Tools like Fivetran and Census are providing SaaS solutions to data engineering that cost money, but reduce dev time and remove the cost of maintenance from the tech stack. Still, lots of data engineering teams will look at Astronomer and balk at the cost, especially since running Airflow on Kubernetes is a pretty well solved problem now and AWS has MWAA. MWAA doesn’t offer the control plane over Airflow that Astronomer does, but Astronomer isn’t a full data stack solution either, so is it really worth the extra cost? Calculating the ROI on this stuff boils down to opportunity costs, and from what I see, data engineering teams are sometimes going in pretty wildly different directions from each other. I tend to side with those who believe they should be providing tools and infrastructure that make working with data at the organization easier and more joyful versus, as a single example, creating pristine “gold layer” datasets that analysts and data scientists can use as their starting place.
💯 on the weirdness of trying to build or host everything yourself when an off-the-shelf solution exists that may cost money but is cheaper than spending your time solving problems that have already been solved. Whilst I am saving cost by sticking with MWAA rather than the current enterprise version of Astronomer Cloud, even though the latter could solve many of my issues, the trade-off makes sense given that I am the entire data team at Metaplane. At a bigger org, where the data is much larger and therefore more crucial for making decisions, the choice would be different.
Despite all the problems I had with infrastructure and developer experience, using Airflow and building DAGs in it was actually enjoyable. Less so than with dbt, but it’s still nice to declare a pipeline as code and watch it come to life when you run it.
Astronomer have been hyping up the release of Airflow 2.4 and doing some interesting work with the Astro SDK. It feels like Airflow might be modernising soon, which means current Airflow users may find it easier to benefit by accepting these upgrades than by moving away.
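The headline 2.4 feature is data-aware scheduling via Datasets - Airflow’s first step towards exactly the “run when the data is ready” behaviour I wished for above. A minimal sketch (URIs and task bodies are hypothetical):

```python
from datetime import datetime

from airflow import DAG, Dataset
from airflow.operators.python import PythonOperator

# A Dataset is a logical pointer to data, not a live connection
orders = Dataset("s3://my-lake/orders")

with DAG("producer", start_date=datetime(2022, 1, 1), schedule="@hourly") as producer:
    PythonOperator(
        task_id="load_orders",
        python_callable=lambda: print("loading orders"),
        outlets=[orders],  # declares that this task updates the dataset
    )

# No cron, no sensor: this DAG runs whenever `orders` is updated
with DAG("consumer", start_date=datetime(2022, 1, 1), schedule=[orders]) as consumer:
    PythonOperator(
        task_id="transform_orders",
        python_callable=lambda: print("would run dbt here"),
    )
```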
Given how many Metaplane customers use Airflow, I’m hopeful that Astronomer will get better at abstracting away infrastructure for companies of all sizes, and also modernise Airflow to be closer to Prefect and Dagster.
However, I’m also excited to give Prefect and Dagster a try, to see how they compare. I suspect that, for an org that hasn’t yet chosen an orchestration tool, Prefect or Dagster might be a better place to start.
I recently ran a LinkedIn poll:
It’s possible that more votes come in over the next 5 days and oust Orchestration from the top spot, but it’s clear that orchestration is a very important part of Data Engineering.