5 Comments

I like this idea of a central orchestrator, but wouldn't a pre-hook just move the bottleneck you describe from the realm of Airflow to the realm of dbt? The pipeline still has to run and fetch the data; it's just that dbt would now be calling the pipeline instead of Airflow.

An alternative approach is similar to Meltano's, where _each_ EL pipeline can run the affected dbt transforms right after loading. So rather than doing one Big Batch of dbt transforms, you could do microbatches of transforms just after loading, a la `meltano elt tap-salesforce target-snowflake` followed by `dbt run -m source:salesforce+` (sketched below).
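Roughly, per source (plugin names follow Meltano's Singer convention; `tap-stripe` is just for illustration):

```bash
#!/usr/bin/env bash
# Per-source microbatching sketch: load one source, then run only the dbt
# models downstream of it instead of the whole DAG.
set -euo pipefail

for source in salesforce stripe; do
  # Extract + load just this one source.
  meltano elt "tap-${source}" target-snowflake

  # Transform only the models that read from this source.
  dbt run -m "source:${source}+"
done
```

Each microbatch's transform cost stays proportional to what actually changed, instead of rebuilding everything on every load.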


The big secret of analytical query optimization is that you do as much of it as possible ahead of time. Pre-aggregate, pre-calculate, and preload so that runtime analytics are fast. If your use cases are ad hoc analytics, data discovery, or data science experiments, then the slow performance of data JIT might be fine. If it's streaming data, then by definition the pipeline has to run continuously, again pre-calculating and pre-aggregating as much as possible; otherwise your query will take longer than the data takes to arrive and cease to be real time.
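In this stack, "ahead of time" just means rebuilding the rollups on a schedule rather than at query time, something like this (the `tag:preaggregated` tag and the project path are hypothetical):

```bash
# Hypothetical crontab entry: rebuild pre-aggregated marts nightly at 02:00,
# so dashboard queries hit precomputed tables rather than raw events.
0 2 * * * cd /opt/analytics && dbt run -m tag:preaggregated
```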

I like the architectural approach of removing the Airflow (or other orchestration) bottleneck, because pipelines can be loosely coupled, self-describing, and self-executing. But usually you would trigger on the arrival of data (push) rather than when it is requested by analytics (pull). Doing it JIT would certainly be cheaper in compute, but a hell of a lot slower for users and not a great UX.
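A minimal push-style sketch, watching a hypothetical local landing directory (in practice the trigger would be an S3 event notification, a webhook, or the loader itself):

```bash
#!/usr/bin/env bash
# Trigger transforms on data arrival (push), not on query (pull).
# inotifywait comes from the inotify-tools package; the watched path is made up.
set -euo pipefail

while inotifywait -e close_write /data/landing/salesforce/; do
  # New files landed: refresh only the models fed by this source.
  dbt run -m source:salesforce+
done
```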
