EL (extract and load, but not transform) in batch is costly in terms of:
- engineering effort
- pipeline resilience
- duplication of data
- security and credentials management
- extra load on source systems
- tooling
There is a movement away from batch EL and towards streaming. Recently, a number of technologies and startups have emerged to address streaming CDC (Change Data Capture).
CDC captures the transaction log data generated by many OLTP databases, which store state for almost all services and software - Postgres, MySQL, SQL Server and Oracle being among the most commonly used. Because this data is generated immediately upon any database transaction, it's a real-time stream of state change on that database.
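To make that concrete, here is a minimal sketch of consuming such a stream from Postgres using psycopg2's logical replication support. The connection string, slot name and the wal2json output plugin are assumptions for illustration; a managed CDC tool would handle this plumbing for you.

```python
import psycopg2
import psycopg2.extras

# Replication-capable connection to the source OLTP database
# (connection details are hypothetical).
conn = psycopg2.connect(
    "dbname=products user=cdc_reader",
    connection_factory=psycopg2.extras.LogicalReplicationConnection,
)
cur = conn.cursor()

# A logical replication slot turns the write-ahead log into a stream of
# row-level insert/update/delete events (assumes the wal2json plugin is installed).
cur.create_replication_slot("cdc_demo_slot", output_plugin="wal2json")
cur.start_replication(slot_name="cdc_demo_slot", decode=True)


def handle_change(msg):
    # Each payload describes a committed transaction's changes, emitted as soon
    # as the commit happens - this is the real-time stream of state change.
    print(msg.payload)
    # Acknowledge the position so the slot doesn't retain WAL forever.
    msg.cursor.send_feedback(flush_lsn=msg.data_start)


cur.consume_stream(handle_change)
```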
In general, the connectors that Fivetran/Airbyte support are ways of extracting data that ultimately originates in other vendors' OLTP databases, just with a few intermediary steps (e.g. outputs to file systems or APIs) which prevent that data from being a real-time stream. Some of the technologies mentioned above could provide a service for any SaaS or software vendor to stream customer data directly to the customer's Cloud Storage or data warehouse. This could eliminate the need for a batch EL connector to get data from these vendors: you would tell vendors where to send your data, then consume it into further streams of your own or run scheduled transformations on it.
A lot of EL from internal databases is still done in batch, without taking advantage of CDC. At a previous company, we had a product catalog on AWS Aurora (Postgres) which handled many millions of insert/update/delete transactions per day. Instead of EL being driven by the transaction log of this database, which could have been filtered to just the meaningful transactions (where an insert, update or delete occurred), we had a separate Spark job to reduce product events - so the product events were effectively processed twice. The output of this Spark job contained many events that weren't meaningful, because they matched the current state of the product. Using this output added significant overhead to later transformation processes: we had to work out whether each reduced event represented a genuine change in product state, so that we could determine when the product had truly changed since the last meaningful event (stock status, price etc). The product events therefore ended up being processed a third time…
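For illustration, here is a rough PySpark sketch of the kind of downstream filtering this forced on us: keeping only events where a meaningful attribute actually changed since the previous event for that product. The table and column names are hypothetical; with CDC driving the EL, much of this would be unnecessary.

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical output of the reduction job described above.
events = spark.table("product_events")

# Compare each event to the previous one for the same product.
w = Window.partitionBy("product_id").orderBy("event_ts")

meaningful = (
    events
    .withColumn("prev_price", F.lag("price").over(w))
    .withColumn("prev_stock_status", F.lag("stock_status").over(w))
    .filter(
        F.col("prev_price").isNull()                              # first event seen
        | (F.col("price") != F.col("prev_price"))                 # price changed
        | (F.col("stock_status") != F.col("prev_stock_status"))   # stock changed
    )
)
```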
It's also possible that, with new HTAP databases, a DWH may not be needed at all: you could use one database for everything, eliminating the EL in ELT altogether, at least for your own production data.
AlloyDB is a good example - it's serverless and would seem (it's very new!) to provide good enough analytical performance for the vast majority of companies. Since originally writing this, I have attended the Snowflake Summit where Unistore was announced for a similar purpose.
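To make the "one database for everything" idea concrete, here is a minimal sketch of a transactional write and an analytical query hitting the same Postgres-compatible endpoint, which is the workload shape an HTAP database is meant to serve well. The connection details, table and column names are assumptions.

```python
import psycopg2

# One Postgres-compatible HTAP endpoint serving both workloads
# (connection details and schema are hypothetical).
conn = psycopg2.connect("host=my-htap-endpoint dbname=products user=app")

with conn, conn.cursor() as cur:
    # OLTP path: the application's normal transactional write.
    cur.execute(
        "UPDATE products SET price = %s WHERE product_id = %s",
        (19.99, 42),
    )

with conn, conn.cursor() as cur:
    # OLAP path: an analytical query over live production state,
    # with no extract/load step in between.
    cur.execute(
        """
        SELECT category, COUNT(*) AS n_products, AVG(price) AS avg_price
        FROM products
        GROUP BY category
        """
    )
    print(cur.fetchall())
```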
With these HTAP databases adopted internally and CDC-based EL adopted by software vendors, the need for batch EL could be greatly reduced, mostly leaving streaming use cases:
- CDC where databases are being used to store state for systems that need to be observed (mostly for third party systems where you can't query their database directly)
- Event-driven architecture (web and app events)
Both would then not only be streaming but at least near real-time. We wouldn't have the issues we do currently, where the cadence of slower-updating batch sources limits how closely our data can represent the present. A typical example is payments data, representing conversion, arriving in batch for the past day (sometimes with just a date and not a timestamp!), while user event data comes in real time from a CDP, always with a timestamp - which means conversions can appear to happen before the lead!
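A small sketch of why date-only batch sources cause this, with hypothetical records: with no timestamp on the payment, the best we can do is treat it as happening at midnight, which can make the conversion appear to precede the lead event it came from.

```python
from datetime import datetime

# Daily batch payments feed carries only a date, so we fall back to midnight.
payment = {"user_id": 1, "paid_at": datetime(2022, 6, 1)}

# CDP event stream carries a full timestamp.
lead = {"user_id": 1, "event_ts": datetime(2022, 6, 1, 14, 37)}

if payment["paid_at"] < lead["event_ts"]:
    print("Conversion appears to happen before the lead that produced it!")
```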
All of this naturally requires better ownership and governance of data by all who create it. The data generated by any software vendor becomes part of the product offering.
I can't see a future where state (as distinct from events) isn't usually stored in databases by applications. As soon as you design a system that stores state in an ACID way, you end up designing a database. There seems to be plenty of capital available for serverless OLTP databases, with some VC funds having multiple such products in their portfolios. Apart from adapting to the Cloud and distributed systems, databases haven't changed in principle for half a century. They aren't going anywhere.
Is this a possible final stack for ELT?
- HTAP internal production database
- CDC from 3rd parties
- Event-driven architecture for apps
If there were a dominant HTAP database in the future, where schemas could be shared between organisations, then this would eliminate CDC from 3rd parties in the stack. This, I imagine, is Snowflake's 10-year plan: no data movement, just data sharing.
While I'm a huge fan of using CDC to copy internal production databases to the warehouse, I have a lot of skepticism around CDC from 3rd parties. In particular, there are a few questions I don't know the answer to:
- Are we sure the vendor source data is actually structured the way it's written to the warehouse, or are there transformations that happen in batch? If the latter, CDC by itself won't be an end-to-end solution.
- If providing CDC as a solution, how will vendors deal with heavier demand and the stress of supporting real-time systems? First, vendors will have to monitor source databases and everything else that comes with scaling CDC, which also has multi- vs single-tenancy implications. Second, any time you say "realtime" there's an SLA around it. With CDC, if the SLA doesn't tighten, there's no point in doing CDC rather than batch; if it does tighten, it will put a lot of pressure on vendors to actually deliver, which has potential financial ramifications.