Approaches at different levels of the data stack
In last week's post I wrote about metrics layers and their approaches to joining data.
Another topic that is relevant when talking about metrics layers is caching. Caching is a big deal! It's widely considered one of the harder problems in software engineering to get right.
First of all - what is a cache?
In computing, a cache (/kæʃ/ KASH) is a hardware or software component that stores data so that future requests for that data can be served faster; the data stored in a cache might be the result of an earlier computation or a copy of data stored elsewhere. A cache hit occurs when the requested data can be found in a cache, while a cache miss occurs when it cannot. Cache hits are served by reading data from the cache, which is faster than recomputing a result or reading from a slower data store; thus, the more requests that can be served from the cache, the faster the system performs.
The definition above talks about how cache hits are faster than reading from a slower data store. This is indeed one of the main benefits, but another big one in the cloud data warehouse (CDW) world is cost. If you can avoid running an expensive query full of table scans, joins, window functions/grouping and sorting, it's worth doing so. It's not uncommon for organisations to spend around $1m per year on their CDW. This spend splits between scheduled data transformation work and unscheduled ad hoc querying. Repeated, slow, similar queries are responsible for a huge part of unscheduled spend.
There is also an additional cost: time and decisions lost. Often, when data teams try to manage their CDW spend, they do things such as reducing cluster size. This saves money by reducing throughput: users actually run fewer queries, as they have a finite amount of time to wait for them. They also have a finite amount of attention: if a query takes many minutes to run, they often context switch. All avoidable context switching reduces productivity! The users who context switch are looking to use data to inform or build; increasing their query times by 4X possibly slowed their inform/build times by more than 4X, because they context switched in the meantime. The waiting threshold for context switching is probably somewhere between 10 seconds and 1 minute (how long does your Looker query run for before you open another tab or application?). Ideally, mean and median response times should be well under 1 minute.
Speeding up ad hoc queries has a twofold benefit: saving CDW compute cost and also increasing inform/build throughput. Thus, the importance of caching: if we can effectively use caching to speed up ad hoc queries, we can realise this benefit.
CDWs also have minimum compute cluster uptimes, ranging from seconds to minutes. If caching can prevent a CDW cluster from being spun up at all, you can save more than 100% of the true CDW cost of handling that query, because you also avoid paying for unused minimum uptime. For example, if a CDW cluster had a 1 minute minimum uptime and your query took 15 seconds to run, you would be billed for four times the compute you actually used; avoiding the query saves 400% of its true CDW cost. My guess is that Snowflake makes a substantial part of its revenue from unused compute during cluster minimum uptime. I wouldn't be surprised if it was more than 10% of revenue.
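The minimum-uptime arithmetic above can be sketched as follows. The per-second billing model and the 60-second minimum are illustrative assumptions, not any vendor's actual pricing:

```python
# Hypothetical illustration of minimum-uptime billing, not real CDW pricing.

def billed_seconds(query_seconds: float, min_uptime_seconds: float = 60) -> float:
    """A cluster bills at least its minimum uptime, even for short queries."""
    return max(query_seconds, min_uptime_seconds)

def overpay_ratio(query_seconds: float, min_uptime_seconds: float = 60) -> float:
    """Billed cost as a multiple of the 'true' cost of the compute actually used."""
    return billed_seconds(query_seconds, min_uptime_seconds) / query_seconds

# A 15-second query on a cluster with a 1-minute minimum uptime:
print(overpay_ratio(15))  # 4.0 -> avoiding the query saves 400% of its true cost
```

A query longer than the minimum uptime bills at its actual duration, so the ratio falls back to 1.0 and the cache saves "only" the query's true cost.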
Are ad hoc queries really ad hoc?
The truth is that ad hoc queries aren't that ad hoc (not such a hot take). Companies are organised into disciplines such as product, marketing and finance. These disciplines have well-defined metrics that matter, and dimensions to drill into.
Once users have found a way to pull these metrics, filtered to how they want, they save these queries into dashboards or scheduled reports. They may also just refresh their dashboard whenever they want to see if there is new data. If they are running something that looks different, it may just be a filter of their usual query. Analytics tooling then further standardises these queries in terms of format.
The Pareto principle begins to apply... at least 80% of "ad hoc" analytics queries are very similar or identical to queries that have run before, perhaps even within a 24-hour period. It also becomes clear that there is a cadence of known queries that will run. This allows scheduled running and caching of their results so that users get near-instant results. A number of tools and metrics layers do this.
From an analytics perspective, there is caching present at each level of the architecture:
Data Warehouse (results cache) - It's common for data warehouses to have a results cache which lasts 24 hours. It stores the results of all queries run in that time, keyed by the query string, as key-value pairs. If an identical query string is run again, the cached results are retrieved and no actual query execution happens. This is why you sometimes see query results much faster than expected (milliseconds). This cache is highly performant: in this Snowflake example, it's 10,000 times faster than running the query without any cache available! Slight changes to the query string can stop this cache from being hit, though.
Data Warehouse (compute coupled) - While a compute cluster is up for a CDW, it has associated memory and disk. If tables have been scanned into these and another query hits the compute cluster using the same data, the tables aren't rescanned and the data in the compute cluster is reused. This is also a very valuable form of caching: in the Snowflake example it was roughly 20 times faster than running the query without any cache. With auto-shutdown of clusters, though, the cluster either has to be heavily used (like a cluster serving BI for a large number of users) or has to have a long, and therefore expensive, minimum uptime.
Metrics Layer (stored in CDW) - Scheduled queries and ad hoc requests to the metrics layer could have their results materialised as temporary tables in the CDW. Serving them requires small table scans, but no real work other than retrieval. Results are still served quickly, but it may be a few seconds rather than sub-second.
Metrics Layer or Tool Specific (speed layer) - This would be a fast key-value store (Redis, RocksDB), an OLTP database (Postgres, MySQL) or a document store (MongoDB/CouchDB) which can deliver fast retrieval. This would only work well for small dataset retrieval (e.g. for graph input).
Tool Specific (in the browser) - Data for graphing, table calcs and the current pagination is held in the browser itself; very fast to the point of being instant on tiny data. No compute is required from vendors, as it's on the user's machine.
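The exact-string sensitivity of the results cache described above can be sketched in a few lines. The dictionary store and the `run_query` stand-in are illustrative, not any vendor's API:

```python
# Minimal sketch of a results cache keyed on the exact query string,
# mimicking the CDW behaviour described above (hypothetical, not a real API).

results_cache: dict[str, list] = {}
executions = 0  # counts how many times we actually "hit the warehouse"

def run_query(sql: str) -> list:
    """Stand-in for real warehouse execution."""
    global executions
    executions += 1
    return [("result",)]

def query_with_cache(sql: str) -> list:
    if sql in results_cache:      # identical string -> cache hit, no execution
        return results_cache[sql]
    rows = run_query(sql)         # any difference in the string -> cache miss
    results_cache[sql] = rows
    return rows

query_with_cache("SELECT count(*) FROM orders")   # miss: executes on the CDW
query_with_cache("SELECT count(*) FROM orders")   # hit: served in milliseconds
query_with_cache("select count(*) from orders")   # miss: casing changed the key
print(executions)  # 2 -> the lowercase variant re-ran the whole query
```

This is exactly why "slight changes to the query string" defeat the cache: the key is the literal string, not the query's meaning.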
How could we improve this?
I believe there are opportunities for smarter caching, at the cost of increased logic in the caching service. If the service had a level of interpretation to understand the components of a call, more cached data could be reused.
If, instead of a pure SQL string, the cache key were a collection of metadata (e.g. an array of dimensions, an array of filters, a time range and a time grain), then cached data could be reused when a coarser grain is requested: the cached data can be aggregated up to the new level of grouping required, saving a CDW query.
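A hypothetical sketch of such a metadata-keyed cache entry: results are stored at a fine grain and re-aggregated locally when a coarser grouping is asked for, instead of issuing a new CDW query. The entry shape, field names and revenue figures are all assumptions for illustration, and this only works for additive metrics (sums, counts), not for averages or distinct counts:

```python
from collections import defaultdict

# Cached result stored with its metadata, not just a SQL string (hypothetical shape)
cached = {
    "metric": "revenue",
    "dimensions": ["country", "month"],
    "rows": [
        ("UK", "2020-01", 100), ("UK", "2020-02", 120),
        ("US", "2020-01", 300), ("US", "2020-02", 280),
    ],
}

def rollup(entry: dict, keep_dimensions: list) -> list:
    """Aggregate a cached additive metric up to a coarser set of dimensions."""
    idx = [entry["dimensions"].index(d) for d in keep_dimensions]
    totals = defaultdict(int)
    for *dims, value in entry["rows"]:
        totals[tuple(dims[i] for i in idx)] += value  # sum within the new grouping
    return sorted(totals.items())

# A request for revenue by country alone is served from cache, with no CDW query:
print(rollup(cached, ["country"]))  # [(('UK',), 220), (('US',), 580)]
```

Because the cache knows which dimensions and grain it holds, a request at a coarser grain becomes a cheap local aggregation rather than a warehouse round trip.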
This would also allow for additive behaviour to save CDW querying. For example, if the cache held data for 2020 and a new query for 2020 and 2021 hit the cache, then just the 2021 data could be queried from the CDW, which would be faster and cheaper.
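The additive behaviour above can be sketched like this: only the missing years are fetched from the warehouse, then merged with the cached slice. `fetch_from_cdw` and the revenue figures are hypothetical stand-ins:

```python
# Additive cache sketch: reuse the cached 2020 slice, query only what's missing.

cache = {2020: [("2020", 1000)]}   # 2020 results already cached

def fetch_from_cdw(year: int) -> list:
    """Stand-in for a real warehouse query (hypothetical)."""
    return [(str(year), 1250)]

def revenue_for_years(years: list) -> list:
    missing = [y for y in years if y not in cache]
    for y in missing:              # query only the uncached portion
        cache[y] = fetch_from_cdw(y)
    return [row for y in years for row in cache[y]]

print(revenue_for_years([2020, 2021]))  # cached 2020 merged with fresh 2021
```

The warehouse query now scans one year instead of two, and the newly fetched slice is cached too, so a follow-up request for 2020-2021 is served entirely from cache.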
Could we even begin to predict queries that will run and then cache them in advance in batch for efficiency, without being asked to schedule them?