Hybrid transactional/analytical processing (HTAP) is a data architecture that combines online transactional processing (OLTP) and online analytical processing (OLAP) workloads, allowing one system to support both kinds of workload. Read on as we take a deeper look into what HTAP does.1
HTAP databases have been something of a holy grail for a long time. There are many reasons for this:
Elimination of ETL - you don’t need to move data from an OLTP database to an OLAP one; you can use the same database for both. Copying data from an OLTP database to an OLAP one and keeping it in sync is very costly. It also simplifies architecture: normally an orchestrator, an ELT tool and a transformation tool are needed to manage the synchronisation of data between OLTP and OLAP. With HTAP, it’s possible that just the database is needed.
Avoiding duplication of data - having an OLAP copy of your OLTP data introduces redundant data. Most OLTP databases already have a replica, so this is then a third copy.
Security - as we’ve seen recently, there seems to be less care taken over access to OLAP data stores than OLTP ones, which can create legal risk from data breaches. If you store your data once, then security policy only needs to be applied to that one store.
Freshness - data is naturally fresher for OLAP queries, as they use the same data store; for some operational use cases this is necessary.
Increasingly, Data Products need OLAP queries for production - having to use a different database to handle OLAP queries adds complexity to delivery, for the reasons above and more. It would be better if a data product could use an HTAP database and have consistency of results between OLTP and OLAP queries.
This dream of HTAP has been difficult to realise, primarily because OLTP and OLAP engines work differently. Modern OLAP engines are great at operations over columns and organise their storage that way too, but they aren’t as good at single-record operations. The opposite is true for OLTP engines, so it’s been hard to get one engine and one storage layout to do both well.
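To make that contrast concrete, here is a rough sketch using a hypothetical orders table (the table and columns are made up purely for illustration and reused in later examples):

-- hypothetical table used in the sketches throughout this post
CREATE TABLE orders (
    order_id    bigint PRIMARY KEY,
    customer_id bigint,
    status      text,
    amount      numeric,
    created_at  timestamptz
);

-- typical OLTP work: fetch or change a single record by key
SELECT * FROM orders WHERE order_id = 12345;
UPDATE orders SET status = 'shipped' WHERE order_id = 12345;

-- typical OLAP work: aggregate one or two columns across millions of rows
SELECT date_trunc('month', created_at) AS month, sum(amount)
FROM orders
GROUP BY 1;

A row-oriented engine keeps all of an order’s columns together, which makes the first two statements cheap; a column-oriented engine keeps each column together, which makes the aggregation cheap. Neither layout is ideal for the other workload.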
This week, there has been some big news - Postgres now supports DuckDB with the pg_duckdb extension! Postgres is the most popular OLTP database in the world2 and DuckDB is fast becoming the most popular open-source OLAP query engine in the world.
This is a smart way to overcome some of the problems above: by having two query engines available from the same interface. I’m going to refer to the features section from the readme of pg_duckdb as I go through how it works.
SELECT queries executed by the DuckDB engine can directly read Postgres tables.
Able to read data types that exist in both Postgres and DuckDB. The following data types are supported: numeric, character, binary, date/time, boolean, uuid, json, and arrays.
If DuckDB cannot support the query for any reason, execution falls back to Postgres.
Postgres tables are stored in a row-based way, not a columnar way, as you would expect for an OLTP database. This means there is some inefficiency when DuckDB reads from a Postgres table: an aggregation over one column still reads the full data for every row of the table. With a columnar format, it would only read the data for that column. A lot of the time this may not matter, especially if the data is small. However, if it does affect performance, it may make sense to copy the data into a columnar format instead.
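Reusing the hypothetical orders table from earlier, this is the kind of query where the row-based layout costs you (assuming pg_duckdb is installed and the query is routed to DuckDB; the exact way to enable and route queries is described in the pg_duckdb README):

-- only the amount column is needed, but every full row is read from
-- Postgres' row-based heap storage; a columnar copy would read just that column
SELECT sum(amount) FROM orders;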
Enable the DuckDB Iceberg extension using SELECT duckdb.enable_extension('iceberg') and read Iceberg files with iceberg_scan.
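As a rough sketch, assuming an Iceberg table already sits in object storage - the bucket path and column list below are made up, and the AS column list is my assumption of how the result’s columns are declared to Postgres (the README has the exact syntax):

SELECT duckdb.enable_extension('iceberg');

-- count rows in a hypothetical Iceberg table stored in S3
SELECT count(*)
FROM iceberg_scan('s3://my-analytics-bucket/warehouse/orders_iceberg')
    AS (order_id bigint);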
Write a query - or an entire table - to parquet in object storage.
COPY (SELECT foo, bar FROM baz) TO 's3://...'
COPY table TO 's3://...'
With pg_duckdb it is possible to query a columnar copy of the data: you can copy data from a Postgres table into parquet for faster querying later. Above, I mention how expensive synchronisation between separate OLTP and OLAP databases can be - here it is considerably cheaper, because the copy is just a statement run by the same database.
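A sketch of that round trip, reusing the hypothetical orders table with a made-up bucket (the AS column list is again an assumption about how the parquet columns are exposed back to Postgres):

-- copy the columns you query most into parquet (columnar) in object storage
COPY (SELECT order_id, customer_id, amount, created_at FROM orders)
    TO 's3://my-analytics-bucket/orders/orders.parquet';

-- later OLAP queries read the columnar copy instead of the live table
SELECT customer_id, count(*) AS orders_placed, sum(amount) AS total_spent
FROM read_parquet('s3://my-analytics-bucket/orders/orders.parquet')
    AS (order_id bigint, customer_id bigint, amount numeric, created_at timestamptz)
GROUP BY customer_id;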
It is understandable that the example here stores these parquet files in S3, to avoid using up too much disk on the instance Postgres is running on - if Postgres runs out of disk space, it pretty much falls over. However, there would be benefit in being able to store the parquet files locally on the instance, as that avoids the network/cloud cost and latency of querying. It would be great if there were a way to allocate disk, memory and CPU between standard Postgres usage and DuckDB usage - this would allow teams to confidently use DuckDB entirely locally, on the instance Postgres is on.
As Postgres is a server, it persists and can schedule jobs, for example with pgAgent. Using pgAgent, a regular copy of tables from Postgres to parquet/iceberg would be possible - a sketch of such a job’s step is below. What would also be really useful here is if one table could point to both Postgres and equivalent parquet/iceberg storage, and dynamically change which is read depending on the query engine used. Postgres being a server, now with DuckDB included, makes DuckDB server-based too and no longer only in-process - you can run a DuckDB query any time Postgres is up.
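A sketch of the SQL step such a pgAgent job could run on a schedule - the table, columns, bucket and daily partitioning are all hypothetical, and the job itself would still need to be defined in pgAgent:

-- scheduled step: land yesterday's orders as a parquet file
COPY (
    SELECT order_id, customer_id, amount, created_at
    FROM orders
    WHERE created_at >= date_trunc('day', now() - interval '1 day')
      AND created_at <  date_trunc('day', now())
) TO 's3://my-analytics-bucket/orders/daily/orders_yesterday.parquet';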
Typically, the OLAP queries that DuckDB would handle don’t need to work in a fully ACID way - it doesn’t matter much if there are Postgres transactions in flight that change the state of the data being queried by DuckDB and potentially affect results. It is accepted that OLAP queries are usually out of date to a degree. I’m not sure how DuckDB’s access to Postgres storage is configured, but it should be possible to allow DuckDB queries to read records that are locked by transactions.
Another interesting opportunity is that Postgres emits a write-ahead log (WAL), a record of every state change to the data stored. With pgAgent scheduled jobs and DuckDB, it could be possible to store this WAL data in S3 (it’s not practical to keep it on the database instance, as it becomes very large). With the WAL data and DuckDB, it would then be possible to build a standardised method to understand the state of a table, or a subset of records, at any given point in time. This is a big part of what change data capture (CDC) enables today, and it could become much easier and cheaper with pg_duckdb.
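As a very rough sketch of the point-in-time idea, assuming the WAL changes have already been landed in S3 as parquet with one row per change (the layout, columns and bucket are entirely hypothetical):

-- latest version of each order as it stood at a chosen point in time
SELECT order_id, status, amount
FROM (
    SELECT order_id, status, amount,
           row_number() OVER (PARTITION BY order_id ORDER BY changed_at DESC) AS rn
    FROM read_parquet('s3://my-analytics-bucket/wal/orders_changes/*.parquet')
        AS (order_id bigint, status text, amount numeric, changed_at timestamptz)
    WHERE changed_at <= timestamp '2024-09-01 00:00:00'
) AS latest
WHERE rn = 1;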
The Postgres community is huge and support for extensions is excellent. Not a week goes by when I don’t hear about some new capability that has been made available. I’m looking forward to seeing new extensions built to be used with pg_duckdb, and how existing ones can already work with it. We could see whole architecture patterns from modern data stacks recreated inside Postgres.
Watch out for a future post where I take pg_duckdb for a spin.
1. https://www.snowflake.com/guides/htap-hybrid-transactional-and-analytical-processing/
2. When thinking of databases that are not in-process, like SQLite
3. Maybe take a look at Regatta