GCP just announced a pretty big new feature that makes Pub/Sub streams directly queryable from BigQuery. The new way of subscribing a BigQuery table to a topic costs nothing on ingestion for BigQuery - you only pay for the use of Pub/Sub. From the announcement:
Better yet, you no longer need to pay for data ingestion into BigQuery when using this new direct method. You only pay for the Pub/Sub you use. Ingestion from Pub/Sub’s BigQuery subscription into BigQuery costs $50/TiB based on read (subscribe throughput) from the subscription. This is a simpler and cheaper billing experience compared to the alternative path via Dataflow pipeline where you would be paying for the Pub/Sub read, Dataflow job, and BigQuery data ingestion.1
As you can see from the quote above, this was possible before via Dataflow, but it was much more complex and more costly. With the new method of subscribing a BigQuery table to a topic, a less technical data practitioner can now get access to streaming data.
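For anyone curious what this looks like in practice, here is a minimal sketch using the Python Pub/Sub client to create one of these BigQuery subscriptions. The project, topic, subscription and table names are placeholders, and the topic and table are assumed to already exist:

```python
from google.cloud import pubsub_v1

# Placeholders - assumes the topic and the BigQuery table already exist.
project_id = "my-project"
topic_path = f"projects/{project_id}/topics/orders"
subscription_path = f"projects/{project_id}/subscriptions/orders-to-bq"
bigquery_table = f"{project_id}.raw.orders_events"

subscriber = pubsub_v1.SubscriberClient()
bigquery_config = pubsub_v1.types.BigQueryConfig(
    table=bigquery_table,
    write_metadata=True,  # also land Pub/Sub metadata (publish time, message id, attributes)
)

with subscriber:
    subscription = subscriber.create_subscription(
        request={
            "name": subscription_path,
            "topic": topic_path,
            "bigquery_config": bigquery_config,
        }
    )
    print(f"BigQuery subscription created: {subscription.name}")
```

That's it - no Dataflow job to write, deploy or pay for.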
This is making the Data Engineering activity of connecting a stream to a Data Warehouse child’s play, similar to what has happened with batch ELT using Fivetran. Until recently, getting streaming data available to use in a Data Warehouse was non-trivial.
I have been at an org where this was implemented using a combination of different tools:
1. We needed to have a piece of infrastructure with an API, where engineering teams could send events.
2. The infrastructure would then normalise the events with a metadata envelope and store them in a standardised way in S3, as Parquet (sketched below).
3. We would then use Snowpipe, as we were on Snowflake, to ingest the data from S3 into Snowflake tables with variant columns for the semi-structured data. We invested in Terraform modules to automate this process, but it still required some care and synchronisation between two Terraform repos - one to create the pipe, table, stage and grants in Snowflake, and one to add the SNS notification to the S3 bucket.
4. We would then have standardised Raw to Staging dbt models that handled this data using a JSON schema.
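To make step 2 concrete, here is a hypothetical sketch of that metadata envelope and the Parquet write - the field names, service name and file path are all made up for illustration:

```python
# Hypothetical sketch of step 2: wrapping raw events in a metadata envelope
# and writing them out as Parquet (which would then land in S3).
import json
import uuid
from datetime import datetime, timezone

import pyarrow as pa
import pyarrow.parquet as pq


def envelope(event: dict, source: str, schema_version: str) -> dict:
    """Wrap a raw event with standard metadata fields."""
    return {
        "event_id": str(uuid.uuid4()),
        "received_at": datetime.now(timezone.utc).isoformat(),
        "source": source,
        "schema_version": schema_version,
        "payload": json.dumps(event),  # keep the raw event as a JSON string
    }


raw_events = [{"order_id": 1, "total": 42.0}]  # whatever an engineering team sends
enveloped = [envelope(e, source="checkout-service", schema_version="1.0") for e in raw_events]

pq.write_table(pa.Table.from_pylist(enveloped), "checkout-service-2022-08-01.parquet")
```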
Those four steps would take about a day's work, even after having invested in the infra and Terraform automation (and maintaining them). While the new integration from GCP won't solve point 4, it does make steps 1 to 3 much easier. It will be interesting to see if Snowflake follows suit in allowing this kind of integration for GCP customers, provided GCP doesn't block them.
The no-code/low-code debate has been active recently, and I accept that using the GUI to set up the integration between Pub/Sub and BigQuery prevents it from being version controlled. However, most GCP tooling is well supported by Terraform providers, so I would expect this to be supported too.
You could categorise data EL into four types:
External streaming data - eg a CDP like Segment, or data made available via webhook
Internal streaming data - eg PubSub topics created by your Engineering team
External batch data - eg Stripe
Internal batch data - eg the output of some batch service as CSV or Parquet in cloud storage
I’ve written about how I think batch is really streaming in disguise.
If external providers of streaming data offered Pub/Sub topics as a way for customers to access it, it would make it very easy for Data teams who use BigQuery to ingest that data. I feel like this is very much part of GCP’s own Data Cloud play - the difference between Snowflake data shares and this feature is very fine. With a data share, there is a stream of data flowing into the shared Snowflake database provided by the third party - the Snowflake database is almost an abstraction on that stream. GCP has made this abstraction unnecessary.
While I’m excited about what GCP has enabled, I know from the Snowflake-based approach above that it can result in problems when the semi-structured data from the stream changes. Of course, there are processes that can help with this:
Having SLAs around notification of changes to data structure and format
Maintaining multiple versions of the same stream, to allow time for teams to transition
Data mesh principles - eg the Engineering team who owns the stream also owns the landing and flattening in far left dbt models
I do think we need tooling around this though - observability tooling that automatically checks for schema changes in this semi-structured data. With the data stored as a JSON-typed column in the Data Warehouse, these raw semi-structured tables are a reflection of the data lake - or are the data lake itself, if the warehouse is the only sink for a stream. Monitoring this semi-structured data is therefore effectively monitoring very close to source: as close as possible without going further up the stack and monitoring the data produced in the application itself.
I also think it’s possible to apply quality approaches here - a tool could use a JSON schema to know “what good looks like” for a piece of semi-structured data (a rough sketch follows the list below):
structure
required fields
field format
envelope fields (metadata)
event versioning - switching the above properties by version, plus monitoring for unexpected versions (with versioning, the schema for a specific version of an event becomes immutable)
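As a rough illustration of the idea (not any particular tool), here is a minimal sketch using the jsonschema library with a made-up order event: one schema per event version defines the envelope and payload, and unknown versions are flagged:

```python
# Hypothetical sketch: one JSON Schema per event version defines
# "what good looks like", including the envelope (metadata) fields.
from jsonschema import Draft7Validator

SCHEMAS = {
    "1.0": {
        "type": "object",
        "required": ["event_id", "received_at", "schema_version", "payload"],
        "properties": {
            "event_id": {"type": "string"},
            "received_at": {"type": "string", "format": "date-time"},
            "schema_version": {"const": "1.0"},
            "payload": {
                "type": "object",
                "required": ["order_id", "total"],
                "properties": {
                    "order_id": {"type": "integer"},
                    "total": {"type": "number"},
                },
            },
        },
    },
}


def check(event: dict) -> list[str]:
    """Return a list of problems for an event, or [] if it looks healthy."""
    version = event.get("schema_version")
    schema = SCHEMAS.get(version)
    if schema is None:
        return [f"unexpected event version: {version!r}"]  # monitor for unknown versions
    validator = Draft7Validator(schema)
    return [error.message for error in sorted(validator.iter_errors(event), key=str)]
```

Run over the raw semi-structured column, a check like this is effectively a data health monitor for the stream itself.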
The tooling applied to this semi-structured data, whether using ML or JSON schema validation, is effectively monitoring the data health of a stream. This is enabled because the stream is accessible or stored in the Data Warehouse, showing again some of the benefits of the centre of gravity being around the Data Warehouse.
In addition to the news above, which is now a couple of weeks old, GCP have announced a tighter integration between BigQuery and Bigtable:
Modern businesses are increasingly relying on real-time insights to stay ahead of their competition. Whether it's to expedite human decision-making or fully automate decisions, such insights require the ability to run hybrid transactional analytical workloads that often involve multiple data sources.
BigQuery is Google Cloud’s serverless, multi-cloud data warehouse that simplifies analytics by bringing together data from multiple sources. Cloud Bigtable is Google Cloud's fully-managed, NoSQL database for time-sensitive transactional and analytical workloads.
Customers use Bigtable for a wide range of use cases such as real time fraud detection, recommendations, personalization and time series. Data generated by these use cases has significant business value.
Historically, while it has been possible to use ETL tools like Dataflow to copy data from Bigtable into BigQuery to unlock this value, this approach has several shortcomings, such as data freshness issues and paying twice for the storage of the same data, not to mention having to maintain an ETL pipeline. Considering the fact that many Bigtable customers store hundreds of Terabytes or even Petabytes of data, duplication can be quite costly. Moreover, copying data using daily ETL jobs hinders your ability to derive insights from up-to-date data which can be a significant competitive advantage for your business.
Today, with the General Availability of Bigtable federated queries with BigQuery, you can query data residing in Bigtable via BigQuery faster, without moving or copying the data, in all Google Cloud regions with increased federated query concurrency limits, closing a longstanding gap between operational data and analytics.2
I’ve quoted a lot of the post, as it is well worded and sets the context nicely. It even mentions HTAP as a concept. Even though these are two different databases, allowing one to access the other lets them operate as one, with the benefits of both.
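To give a flavour of what this looks like from the BigQuery side, here is a rough sketch using the BigQuery Python client to define an external table over a hypothetical Bigtable table and query it in place. Project, instance, dataset and table names are all placeholders, and a real setup would normally also map out the Bigtable column families:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical Bigtable source - project, instance and table names are placeholders.
bigtable_uri = (
    "https://googleapis.com/bigtable/projects/my-project/"
    "instances/my-instance/tables/transactions"
)

external_config = bigquery.ExternalConfig("BIGTABLE")
external_config.source_uris = [bigtable_uri]
external_config.options.read_rowkey_as_string = True  # expose the rowkey as a STRING column

table = bigquery.Table("my-project.analytics.transactions_external")
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)

# The Bigtable data is now queryable in place, alongside native BigQuery tables.
query = """
    SELECT rowkey
    FROM `my-project.analytics.transactions_external`
    LIMIT 10
"""
for row in client.query(query).result():
    print(row.rowkey)
```

No copy, no pipeline - the query runs against the data where it already lives.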
I think this is the second HTAP database configuration we’ve seen from GCP in recent months, the first being AlloyDB. This one, though, gives organisations the power of Bigtable for OLTP use cases and BigQuery for OLAP - and both are market-leading serverless, near-unlimited-scale solutions. Unistore from Snowflake, at this time (going by the presentations at Summit), is not ready to handle the kind of volume that Bigtable can - not by a few orders of magnitude. This puts GCP ahead in the HTAP race without question.
It may feel like Snowflake Summit was a long time ago now (at least it does to me!), but this is actually a swift response from GCP to address some of the shortcomings of their analytical stack.
We shouldn’t just be talking about Databricks vs Snowflake… GCP is the dark horse in this race, with more money than either (or both combined, for that matter) and a clear determination to push forwards.
With further support for Python/Spark within the BigQuery context, it could pull ahead of either.
As you can see from the recently released dbt documentation for Python models, although GCP as a platform supports Python for data, it’s not native on BigQuery compute. If it were, and stayed true to BigQuery’s automated horizontal and vertical scaling philosophy3, it would have a big technical advantage over Snowflake’s Snowpark and Databricks.
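For reference, a dbt Python model is just a function like the hypothetical sketch below (model and column names are made up). On BigQuery today, dbt submits this to Dataproc/serverless Spark rather than running it on BigQuery’s own compute, which is exactly the gap described above:

```python
# A hypothetical dbt Python model (names are illustrative).
# On BigQuery, dbt currently ships this off to Dataproc/serverless Spark
# rather than running it natively on BigQuery compute.
import pyspark.sql.functions as F


def model(dbt, session):
    dbt.config(materialized="table", submission_method="serverless")

    orders = dbt.ref("stg_orders")  # a PySpark DataFrame when run via Dataproc

    return (
        orders
        .withColumn("order_month", F.date_trunc("month", F.col("ordered_at")))
        .groupBy("order_month")
        .agg(F.sum("order_total").alias("revenue"))
    )
```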
Game on.
1. https://cloud.google.com/blog/products/data-analytics/pub-sub-launches-direct-path-to-bigquery-for-streaming-analytics
2. https://cloud.google.com/blog/products/data-analytics/bigtable-bigquery-federation-brings-hot--cold-data-closer
3. Since finishing this post, GCP announced Dataflow Prime (https://cloud.google.com/dataflow/docs/guides/enable-dataflow-prime), which has “right-fitting”, meaning automated horizontal and vertical scaling. With tighter integration to BigQuery, they will have a technically superior product to Snowpark and Databricks in this regard.