I like this idea of a central orchestrator, but wouldn't a pre-hook just move the bottleneck you describe from the realm of Airflow to the realm of dbt? The pipeline still has to run and fetch the data, it's just that now dbt would be calling the pipeline instead of Airflow?
An alternative approach is similar to Meltano's, where _each_ EL pipeline can run the affected dbt transforms right after loading. So rather than doing one Big Batch of dbt transforms, you could do microbatches of transforms just after loading, a la `meltano elt from-salesforce to-snowflake --> dbt run -m source:salesforce+`
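A rough sketch of that microbatch pattern as two explicit steps, assuming Singer-style plugin names like `tap-salesforce` / `target-snowflake` are already configured in the Meltano project (the names are illustrative):

```
# Extract and load only the Salesforce source
meltano elt tap-salesforce target-snowflake

# Then transform only the dbt models downstream of that source
dbt run -m source:salesforce+
```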
I think what I mean is technically very similar, something like `dbt run -m +salesforce_staging+`, which would run the source too; in practice that's probably identical to the Meltano flow you described. For me the key difference is that the Meltano flow, like the Airflow one, is about updating a source and then its dependencies. With the just-in-time way of thinking, the focus is on the analytics use case: I want this end table in the DAG to be updated, so go and pull from the roots of the tree to get what's needed to do it. The intent points the opposite way.
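To make the difference in intent concrete, the two selections might look like this (the end-table name is illustrative):

```
# Source-first (push): a Salesforce load just finished, so refresh everything downstream of it
dbt run -m source:salesforce+

# Analytics-first (pull): I want this end table fresh, so build it and everything upstream of it
dbt run -m +finance_dashboard
```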
Also, as a separate point: even if the benefit I described above didn't exist, wouldn't it be good to not need something like Airflow at all?
The only reason people stick with tools like Airflow is the desire for control and visibility over the entire mess of pipelines. If instead all data maintained its own lineage (I prefer the term provenance), carrying it with it downstream, then you could monitor at any point in the chain and see that provenance. The key is trust, freshness and correctness: successful analytics projects are those fully trusted by end users. You get there by pre-empting any possible doubt or challenge users could have (why does metric A not add up to metric B? Because they are non-aggregable. Why does your metric not reflect what the source system says? Because the source calculation is wrong, and by the way here is our calculation and proof of correctness). What you lose is central monitoring and observability, so you have to roll something of your own there. You also lose downstream visibility, so that has to be re-cobbled together in separate observability tooling.
What you gain is the ability to pause, debug, modify and resume any section of the DAG without pausing or breaking the entire thing.
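As a rough approximation of that per-node visibility with today's dbt CLI (assuming the sources have `loaded_at_field` / freshness configured; the model name is illustrative):

```
# How stale is each configured source right now?
dbt source freshness

# What is the full upstream provenance of this one model?
dbt ls --select +finance_dashboard
```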
Yeah, I think one of the big related problems I've seen is stakeholders asking, "what does this data represent / when was this loaded?" This approach really tries to bridge that divide.
And yes! Not needing an orchestrator would be great. As more SaaS tools come out, it seems like Airflow is potentially doing less and less.
The big secret of analytical query optimization is that you do it ahead of time as much as possible. Pre-aggregate, pre-calculate and pre-load so that runtime analytics are fast. If your use cases are ad hoc analytics, data discovery or data science experiments, then the slow performance of just-in-time data might be fine. If it's streaming data then by definition it has to run continuously, also doing pre-calculation and pre-aggregation as much as possible; otherwise your query will take longer than the data takes to arrive and cease to be real-time.
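Concretely, that usually means building the rollups on a schedule, ahead of the queries, rather than on demand. A minimal sketch, assuming a dbt project with an aggregate model (the crontab entry, path and model name are all illustrative):

```
# Illustrative crontab entry: rebuild the pre-aggregated rollup (and its upstream
# dependencies) every hour, so dashboard queries hit an already-materialized table.
0 * * * *  cd /opt/analytics && dbt run -m +daily_revenue_rollup
```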
I like the architectural approach of removing the Airflow or other orchestration bottleneck, because pipelines can be loosely coupled, self-describing and self-executing, but usually you would trigger based on arrival of data (push) rather than when it is requested by analytics (pull). It would certainly be cheaper as far as compute goes to do it JIT, but a hell of a lot slower for users and not a great UX.