dst - Data Stream Tool
What would we need from a data transformation framework for streaming data?
Analytics Engineering tooling is mostly limited to batch data infrastructure: databases and data warehouses. Dealing with streaming data has so far been the domain of Data Engineers. There are some exceptions to this, such as Rockset and Materialize, which are streaming databases with dbt adapters. Snowflake and BigQuery have some provision for streaming too, but all of these feel like database abstractions on top of streams. Couldn't the data transformation framework itself be enough of an abstraction? What if we had adapters for GCP Pub/Sub, AWS Kinesis, Confluent and Kafka, just as we do for BigQuery, Redshift, Snowflake and Postgres in dbt?
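As a rough sketch of what that adapter layer could look like (dst is hypothetical, so every class and method name below is invented for illustration), each streaming backend would implement the same small interface, just as dbt adapters implement a common contract for warehouses:

```python
# Minimal sketch of a hypothetical dst adapter interface. None of this exists;
# it only illustrates the shape a streaming equivalent of dbt adapters might take.
from abc import ABC, abstractmethod
from typing import Callable, Iterable


class StreamAdapter(ABC):
    """Analogous to a dbt adapter, but for a streaming backend."""

    @abstractmethod
    def create_stream(self, name: str) -> None:
        """Create a topic/stream in the underlying system."""

    @abstractmethod
    def subscribe(self, name: str, handler: Callable[[bytes], None]) -> None:
        """Invoke handler for each event arriving on the stream."""

    @abstractmethod
    def publish(self, name: str, events: Iterable[bytes]) -> None:
        """Write transformed events to a downstream stream."""


class PubSubAdapter(StreamAdapter):
    ...  # would wrap google-cloud-pubsub


class KinesisAdapter(StreamAdapter):
    ...  # would wrap boto3's Kinesis client
```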
Permissions
Database and DWH roles and permissions are quite well defined within those systems (BigQuery being an exception). For streams, permissions would need to be dealt with both within and across cloud projects. "dst" would need to be able to consume existing streams, create new ones, and possibly host the functions that transform events.
dbt currently handles a lot of boilerplate transactions in databases and DWHs; dst would have similar commands for creating and managing streams.
The framework would handle all of the config around notification, subscription and stream creation, while the models would contain the transformation logic.
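On GCP Pub/Sub, for example, the boilerplate dst would own might look like the following (the google-cloud-pubsub calls are real, but the project, topic and subscription names are made-up examples):

```python
# The kind of boilerplate a framework like dst would own: creating the topic
# (stream) and the subscription so the model author never has to.
from google.cloud import pubsub_v1

project_id = "my-analytics-project"  # hypothetical project
publisher = pubsub_v1.PublisherClient()
subscriber = pubsub_v1.SubscriberClient()

topic_path = publisher.topic_path(project_id, "orders_enriched")
subscription_path = subscriber.subscription_path(project_id, "orders_enriched-dst")

# dst would issue these on a run, much like dbt issues CREATE TABLE / CREATE VIEW.
publisher.create_topic(request={"name": topic_path})
subscriber.create_subscription(
    request={"name": subscription_path, "topic": topic_path}
)
```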
Sources and Sinks
Sources are streams not created by dst, but which can be referenced by it.
Sinks are where the streams end:
A sink can be a table in a database or data warehouse, which can in turn be a source for batch transformation. Incremental tables could be updated by micro-batching events from the stream: once a certain number or size of events builds up, they would be pushed into the table (see the sketch after this list)
File storage like S3
Another system like a CRM platform
An action, e.g. if this logic holds, send a Slack notification
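A minimal sketch of the table-sink case, assuming BigQuery as the warehouse (insert_rows_json is the real google-cloud-bigquery streaming API, but the table name and flush threshold are made up):

```python
# Micro-batching a stream into an incremental warehouse table: buffer events,
# flush once enough have built up.
from google.cloud import bigquery

client = bigquery.Client()
TABLE_ID = "my-project.analytics.orders"  # hypothetical sink table
FLUSH_AT = 500                            # flush once this many events build up

_buffer: list[dict] = []


def handle_event(event: dict) -> None:
    """Called once per event by the stream subscription."""
    _buffer.append(event)
    if len(_buffer) >= FLUSH_AT:
        errors = client.insert_rows_json(TABLE_ID, _buffer)
        if errors:
            raise RuntimeError(f"sink flush failed: {errors}")
        _buffer.clear()
```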
Refs and Models
Models, with ref functions generating a DAG, are the foundation of dbt. I feel a DAG fits streams just as well as tables, if not better.
A model with materialization of “stream” would represent a new stream being created as a transformation of an existing one. Refs could be used to refer to the event, or micro-batch of events, from the preceding stream. This is very similar to how refs in dbt currently refer to previous tables in the DAG.
These models would always represent nodes in the middle of the DAG. Sources and models materialized as “sink” would be the left and right edges respectively.
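Here is a sketch of what a stream-materialized model could look like. Since dst doesn't exist, ref is stubbed out as a simple registry lookup so the example stands on its own; in the imagined framework it would resolve the upstream stream in the DAG, just as dbt's ref() resolves upstream tables. The stream, field and function names are all invented:

```python
# Hypothetical dst model materialized as a stream.
UPSTREAM = {}  # stand-in registry: stream name -> latest event/micro-batch


def ref(stream_name: str) -> dict:
    """Hypothetical: return the event (or micro-batch) from the upstream stream."""
    return UPSTREAM[stream_name]


# config(materialized="stream"): this model creates a new stream, "orders_enriched"
def orders_enriched() -> dict:
    order = ref("orders_raw")                        # upstream node in the DAG
    order["total_gross"] = order["total_net"] * 1.2  # assumed fields
    return order                                     # published to the new stream
```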
Environments and languages
One of the complications of streaming that doesn't occur in batch is the environment. With batch, the query engine is part of the compute; this is how all databases and data warehouses work. They provide an engine that you don't usually have to configure beyond possible resource sizing (I am excluding configuring data structures like partitions and indexes). Streams don't inherently provide you with an engine, compute or even a language to do transformation with.
The choice of language is restricted by this compute. If it's a container, then nearly anything could be used. If it's a lambda or cloud function, then only specific languages are available, depending on the cloud provider.
You could have:
SQL, using something like DuckDB on each event file (see the sketch after this list)
Python (Pandas/Numpy are fine for data this small)
Julia
Go
Rust
Shell
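For example, the SQL option could look like DuckDB run over a single micro-batch that has been pulled into memory (duckdb and pandas are real libraries; the events and column names are invented):

```python
# Run DuckDB SQL over one micro-batch of events held in a pandas DataFrame.
import duckdb
import pandas as pd

micro_batch = pd.DataFrame([
    {"order_id": 1, "total_net": 10.0, "country": "GB"},
    {"order_id": 2, "total_net": 25.0, "country": "DE"},
])

# DuckDB can query the in-memory DataFrame directly by its variable name.
result = duckdb.sql(
    "SELECT country, SUM(total_net) AS total_net FROM micro_batch GROUP BY country"
).df()
print(result)
```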
As the data size handled by any one model is consistently small, compute only needs to scale horizontally to cope with increased volume. This works best when the models are stateless; otherwise each event or micro-batch would need to call another service to fetch state, which would limit the throughput of the stream to the capacity of that service and its underlying store of state (usually a database or in-memory cache).
It's better for logic to be in the model itself rather than joined in (hydration/enrichment) via a query or call to a service, for example (a sketch follows this list):
a SQL case "when"
a pickled ML model
a function from a package
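A sketch of keeping that logic in the model, assuming a scikit-learn-style pickled model shipped alongside it (the file name, fields and thresholds are all made up; the point is that everything is loaded once and each event is handled statelessly):

```python
# Enrichment logic lives in the model: a CASE WHEN as a pure function, and a
# pickled ML model loaded once at cold start rather than called as a service.
import pickle

with open("churn_model.pkl", "rb") as f:  # hypothetical artifact bundled with the model
    churn_model = pickle.load(f)


def tier(total_net: float) -> str:
    """The SQL CASE WHEN equivalent, expressed as a pure function."""
    if total_net >= 100:
        return "high"
    if total_net >= 20:
        return "mid"
    return "low"


def transform(event: dict) -> dict:
    event["tier"] = tier(event["total_net"])
    event["churn_risk"] = float(churn_model.predict([[event["total_net"]]])[0])
    return event
```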
This does highlight one of the limitations of streaming transformations. One of the benefits of batch is that you can enrich your models through joins. This doesn't work well for streaming if the joined dataset is large: each micro-batch or event would require a scan of it, and doing that millions of times a day is too expensive and slow.
Each model would need to configure things like what language is being used and on what compute. Is it a lambda function, or a container on an instance?
{{ config(materialized="stream", engine="lambda", language="Go") }}
Investment
There is increasing investment and community interest in this space:
Quix.ai (provides framework and engines to run the transformation, filtering, hydration and enrichment as micro-services, in Python) - Project A Ventures
Decodable (provides framework and engines to run the transformation, filtering, hydration and enrichment as micro-services, in SQL) - Bain Capital Ventures
Rockset (streaming database with SQL queries that can be deployed as micro-services on streaming data, with dbt adapter) - Sequoia and Greylock
Benthos (OSS framework able to configure transformation, filtering, hydration and enrichment, written and extendable in Go)
This is without even mentioning all the CDPs that offer parts of this functionality.
It wouldn’t be a huge leap for one of the technologies mentioned above to go further towards being a CDP, with SDKs to use in applications to capture events. Springs as the source of the streams!
One commercial difference between streaming frameworks and batch is that, with batch, the engines are generally provided by another tool, typically a data warehouse. Data warehouses are not easy to build, which is why the industry is worth hundreds of billions, so it makes sense for a batch transformation framework to integrate with existing data warehouses. With streaming, it's possible for the framework provider to also sell the compute engines, as they only need to scale horizontally, which suits serverless; Decodable and Quix indeed do this. It also incentivises writing more efficient code in faster languages like Julia, Go and Rust: if your model/service can process a stream on 2 nodes instead of 20, that's a big saving over a year.
It's possible that, for smaller batch data, compute engines using something lightweight like DuckDB could be bundled with the framework, similarly to streaming; the data would be small enough that whole historical datasets could be queried very quickly on small machines. This would enable a whole basic data stack to be provided for small-data use cases. Although at that point it would no longer really be batch or streaming, as any concept of incrementally dealing with data is gone.