I have been interested in Firebolt as a technology for a couple of years. I was leading a Data team where we had implemented Snowflake as a replacement for Redshift. We had initially achieved cost parity when doing a like-for-like migration, but costs soon spiralled.
The thing about Snowflake is that queries run pretty much without fail, if resourced reasonably well. The separation of storage and compute, in conjunction with (horizontal) auto-scaling of warehouses, makes the concurrency ability of the platform excellent - you have to push a warehouse very hard, with data far larger than it is sized for, before it struggles: queries usually just take longer. The unintended consequence of this is a relatively new phenomenon in data - you stop worrying about handholding your tool. You start to think: “Everything I've built for now is working OK, what else could I build?”. You stop saying no to stakeholders on the basis that your data warehouse can't handle it; it becomes a cost/benefit/prioritisation trade-off.
We also introduced dbt at the same time as Snowflake. Slowly but surely, the other Engineering teams in the business, who had been using Looker PDTs, realised dbt was better than LookML for data transformation. By the end of my time there, there were more contributors to our dbt repo from outside the Data team than from inside it. I count this as a success, as we democratised access to data in a way not seen before.
More people were doing more with data, using dbt and Snowflake, than ever before. Of course, this meant our costs, which I had carefully chiselled down to be less than Redshift at the point of migration, increased. It got to the point where our Snowflake costs were rising every month without fail. Our Finance team asked us to start assigning costs to each of the teams using Snowflake.
When Snowflake warehouses are shared between teams, it's very hard to do this accurately. It's much easier to give teams their own warehouses, as cost is then cleanly separated. We went down the route of building Terraform modules for Snowflake that constructed everything a team needed; a team only had to provide a list of its users and the warehouses it wanted.
This, again, meant our costs took an upwards step-change. Sharing auto-scaling warehouses had given us an efficiency that we then lost by splitting them up. However, costs started to become more explainable and controllable. The teams to which the costs were assigned often had very large budgets, and their new Snowflake costs weren't that big a deal.
These spiralling costs with Snowflake have resulted in a cottage industry of companies who specialise purely in reducing Snowflake costs! I remember walking past at least three booths at Summit for such companies. Even though I was impressed at Summit, it's clear that Snowflake is constrained by the “innovator's dilemma”.
It was around this time that my honeymoon period with Snowflake ended. Once something costs in the $100ks, and looks like it might hit $1m in the not too distant future, you start to consider alternatives. It was also around this time that I started speaking to the Firebolt team.
This was quite early in the Firebolt journey, so my call was with two of the cofounders: Saar Bitner and Boaz Farkash. Boaz showed me some information about the platform and I was getting excited, until I found out that Firebolt couldn't do delete or merge DML statements. It also didn't have a dbt adapter or Terraform provider: the latter not being such a big deal, but the former being a deal breaker - at least in terms of Firebolt being our main data warehouse.
For me, the main joy of Firebolt being much faster than Snowflake when querying the same data with the same compute was the huge cost saving we would receive. I didn't think much about the fact that our dashboards would often take a minute or so to run (more for the bigger datasets) - this had always been the speed at which stakeholders had access to BI. We had previously used smaller Snowflake warehouses for Looker in order to keep costs down, and increased their sizes on special occasions like Black Friday, when teams needed faster access to data. It didn't bother me that access was slower the rest of the time; I wasn't used to an expectation that access should be fast. After all, we had managed to make it much better than when we were on Redshift 🤷.
Data stakeholders have come to expect more in the last couple of years. I anticipate that expecting sub-second dashboard response times is not too far away. If next-generation tooling like ClickHouse, Firebolt, Rockset, Transform (cache), dbt Semantic API (cache) etc. enables this kind of speed, then the expectation will follow. Data teams who stick with today's SLAs on latency will fail to meet expectations - human attention spans are in decline. This, and my history with Firebolt, made me keen to accept the invite to the lunch and learn session!
I was particularly interested to see how far the platform had come since I last spoke to Firebolt, and also to consider whether it was a data warehouse worth integrating with, for our roadmap at Metaplane.
Below, I'll share some of the slides that were shown, along with my comments on their content and on some of the things the presenters said in response to questions.
There is no magic in Engineering - speed can be increased by using more resources or doing less work, almost like a computer science equivalent of the laws of thermodynamics. One of Firebolt’s main ways of providing faster query performance than Snowflake is their proprietary file format, F3. While this has been an advantage in speed, I believe this could be the reason why Firebolt haven’t been able to support the delete and merge DML statements.
If sparse indexes point to small data ranges within files, delete and merge statements will more or less ruin these indexes. Each time a delete or merge statement needed to be executed, it would end up being better to recreate the whole table and its indexes! At least, this is my assumption. This is why it has been suggested that Firebolt could work well as a cheaper, faster cache for consumption-layer tables produced using another data warehouse such as Snowflake or BigQuery.
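To make that assumption concrete, here is a minimal Python sketch - entirely my own illustration, not anything from Firebolt - of a sparse index as per-file min/max ranges: queries can skip files whose range cannot match, but deleting rows from inside a range means the file, and its index entry, has to be rewritten.

```python
from dataclasses import dataclass

@dataclass
class FileRange:
    """Sparse index entry: one immutable data file and the min/max of its sort key."""
    file_id: str
    min_key: int
    max_key: int

# Hypothetical table made of immutable files, indexed by order_id ranges.
sparse_index = [
    FileRange("f3_000", min_key=1, max_key=10_000),
    FileRange("f3_001", min_key=10_001, max_key=20_000),
    FileRange("f3_002", min_key=20_001, max_key=30_000),
]

def files_for_key(index: list[FileRange], key: int) -> list[str]:
    """Prune: only read files whose key range could contain the value."""
    return [f.file_id for f in index if f.min_key <= key <= f.max_key]

# A point lookup touches one file instead of scanning the whole table.
print(files_for_key(sparse_index, 15_000))  # ['f3_001']

# Deleting order_id 15_000 invalidates f3_001: the file (and its index entry)
# must be rewritten to keep the sparse index truthful, which is my guess at
# why delete support has been slow to arrive.
```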
My questions:
Iceberg? “No support for querying at this time but connector coming to ingest”. This is not surprising, given Iceberg hasn’t been shown to be more performant than existing Snowflake micro-partitions. It wouldn’t be possible for Firebolt to provide their current performance using it in its current form.
Delete/Merge? “There is a beta of delete internally, at partition level only. Aiming for Q3 release of delete to GA. Soft deletes are also an option”. I’m pleased to hear that progress on delete (and therefore merge, either in its own right or as a two-step delete and insert) has been made.
Given the difficulty I anticipated with F3’s sparse indexing, which is intrinsic to how Firebolt works, it doesn’t surprise me that it’s taken this long to solve delete. Soft deletes are an interesting concept, too. My understanding is that they work like this (a sketch follows the list):
An existing table is composed of multiple F3 files like Snowflake micro-partitions.
To run deletes or merges on this table, it is possible to understand which F3 files contain data which needs to be removed or changed.
A new table is created which maintains the unchanged files, and has new files which contain the updated and inserted data.
The old table is dropped and the new table replaces it.
The table is compacted over time to prevent suboptimal performance.
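Sketching that flow in Python, with an idealised file-level copy-on-write model and invented names - this is my reading of the process above, not Firebolt’s actual mechanics:

```python
# Toy copy-on-write "soft delete": a table is just a collection of immutable
# files. Only files containing affected rows are rewritten; the rest are
# carried over untouched. Illustrative only - not Firebolt's internals.

def soft_delete(table_files: dict[str, list[dict]], predicate) -> dict[str, list[dict]]:
    """Return a new 'table' in which rows matching predicate are removed."""
    new_table = {}
    for file_id, rows in table_files.items():
        if any(predicate(row) for row in rows):
            # Rewrite only the touched file, keeping its surviving rows.
            new_table[file_id + "_v2"] = [row for row in rows if not predicate(row)]
        else:
            # Untouched files are reused as-is, so there is no rewrite cost.
            new_table[file_id] = rows
    return new_table

orders = {
    "f3_000": [{"order_id": 1, "status": "ok"}, {"order_id": 2, "status": "cancelled"}],
    "f3_001": [{"order_id": 3, "status": "ok"}],
}

# "Delete" cancelled orders: f3_000 is rewritten, f3_001 is reused verbatim;
# the new table then replaces the old one, and compaction tidies up later.
orders = soft_delete(orders, lambda row: row["status"] == "cancelled")
print(list(orders.keys()))  # ['f3_000_v2', 'f3_001']
```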
This is exactly the kind of process that can be abstracted in a dbt adapter. I can’t remember the last time I used DML statements directly on production and development data - I’ve used dbt to handle these. It would be fine in my mind for delete and merge to not exist as actual statements if they can be enacted through the soft delete method. With ad hoc work, I can always drop and recreate tables.
This is how the query performance really improves 👆.
Question: Is an aggregating index just an OLAP cube?
SE: “Not really, but you can query original tables without needing to know about them. No lengthy build processes on a cube.”
Question: Does the join index interactivity with the aggregating index mean you can have fields from other tables in your aggregating index on a specific table?
SE: “No, aggregating indexes must contain fields from the base table only. However, you can still utilise the join index to speed up joins for dimension value lookups etc. So basically, your Fact tables can be properly modelled with just the business keys in, and then you would use your join index to retrieve the display value.”
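My mental model of an aggregating index, sketched in Python. This is a deliberately naive illustration of “a rollup maintained at write time that the optimiser can answer from”, not how Firebolt actually stores or names anything:

```python
from collections import defaultdict

class AggregatingIndexSketch:
    """Keeps a rollup (sum and count per dimension) in sync with the base rows,
    so grouped queries can be answered without scanning the base table."""

    def __init__(self):
        self.base_rows = []                          # the "fact table"
        self.rollup = defaultdict(lambda: [0.0, 0])  # country -> [sum(revenue), count]

    def insert(self, country: str, revenue: float) -> None:
        self.base_rows.append((country, revenue))
        agg = self.rollup[country]                   # maintained on every write
        agg[0] += revenue
        agg[1] += 1

    def revenue_by_country(self) -> dict[str, float]:
        # Served from the rollup: cost scales with distinct countries, not rows.
        return {country: agg[0] for country, agg in self.rollup.items()}

idx = AggregatingIndexSketch()
for country, revenue in [("UK", 10.0), ("UK", 5.0), ("DE", 7.5)]:
    idx.insert(country, revenue)

print(idx.revenue_by_country())  # {'UK': 15.0, 'DE': 7.5}
```

The SE’s answer maps onto this: you still write your query against the base table, and the optimiser answers from the rollup when the grouping and measures match - unlike a cube, there is no separate object to query or to rebuild in a lengthy batch process.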
Firebolt engines have warmup - you can choose whether to preload indexes on an engine. This is less user-friendly than Snowflake, although more powerful. Auto-scaling doesn't happen at the moment, but it is arguably not needed as much, given that high-concurrency BI queries have known joins and aggregations, which should be sped up by the relevant indexes.
Firebolt lets you choose memory/compute/storage-optimised engine options for different workloads. They don’t obscure what the underlying instance is in the same way that Snowflake does, making pricing more transparent.
This level of configuration is appreciated, but it makes me wonder: why should the customer have to right-size through trial and error? Snowflake’s query acceleration service should provide something similar soon, and BigQuery already doesn’t require users to right-size their resources for a query. A query is planned before it's executed - the plan outputs how much data will be scanned and put into memory, plus an estimate of how much extra will be needed for each join and function afterwards. I believe it’s possible to right-size either to the maximum cluster the query needs, or to bring in nodes during query execution as needed.
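As a toy illustration of what “right-sizing from the plan” might look like - the engine sizes, thresholds and per-join growth factor here are entirely invented, and this is not how BigQuery, Snowflake or Firebolt actually allocate resources:

```python
# Hypothetical sizing heuristic: pick the smallest engine whose memory covers
# the planner's scan estimate plus a crude per-join overhead.

ENGINE_MEMORY_GB = {"S": 32, "M": 128, "L": 512, "XL": 2048}

def pick_engine(scan_gb: float, join_count: int) -> str:
    estimated_gb = scan_gb * (1.5 ** join_count)  # invented growth factor per join
    for name, memory_gb in ENGINE_MEMORY_GB.items():
        if estimated_gb <= memory_gb:
            return name
    return "XL"  # fall back to the largest size

print(pick_engine(scan_gb=20, join_count=2))  # 'M' (20 * 2.25 = 45 GB)
```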
Firebolt is ANSI-SQL compliant and has a good set of functions compared to some other data warehouses, particularly with regard to array functions. This has come a long way since I last saw a demo. The syntactic sugar for dealing with semi-structured data is also in place now, which wasn't the case before.
Conclusion
Firebolt is arguably a very good platform that doesn't yet have the completeness of Snowflake or BigQuery. Its aggregating and join indexes can guarantee that certain query patterns will return sub-second responses, eliminating the need for analytics caching in some circumstances. The metadata these indexes contain could, in theory, mean Metaplane never needs to do any table scans for our tests.
If you have very heavy BI usage, it looks like a good option even now, even if you will have to regularly sync data to and from another data warehouse used for ELT. However, without the delete DML statement and merge-based dbt incremental models, I would only recommend its use alongside another data warehouse, and only at an org with huge data. For smaller data use cases, it's like taking a sledgehammer to a picture hook.
With a delete functionality of some kind on the way, Firebolt becomes more interesting… perhaps as a complete data warehouse.