As I mentioned last week, just before I departed for Snowflake Summit in Las Vegas, I'd be checking back in this week with what I saw and what I thought of it. I'm writing and publishing this at my gate at Harry Reid Intl, so please excuse any typos or grammar errors!
If you recall my post from last year, I felt disappointed with the progress Snowflake had made over the previous few years, but was impressed by the releases at last year's Summit. Some of those features haven't lived up to the hype, like Unistore, but others, like Snowpark, seem to be well used. The stats shared this year (30% of Snowflake orgs use Snowpark, and 10m queries a day run through it) show a decent level of usage, still with a lot of room to grow.
If I was more impressed than I expected to be last year, I've been blown away this year. If, and it's a big if, most of these features become usable and performant soon, then not only have Snowflake innovated to a tremendous level, it's also a clear indicator that we're at an inflection point in how we work with data.
The list below isn't comprehensive, but here's what stood out to me:
Unified Iceberg Tables - a single table type for interacting with external data, with an unmanaged mode where another system coordinates changes and a managed mode where Snowflake looks after them. Managed Iceberg performs as well as native Snowflake tables. This is a big deal: it means Snowflake is a Data Lakehouse too. You don't need to store everything in Snowflake if you don't want to; you can use Snowflake compute to query and manage your data lake.
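To make this concrete, here's a rough sketch of what a Snowflake-managed Iceberg table looks like from Snowpark. The external volume (lake_volume), the table, and the connection details are all hypothetical, and the syntax is my reading of the documentation rather than anything I copied from a demo:

```python
from snowflake.snowpark import Session

# Hypothetical connection details -- fill in your own.
session = Session.builder.configs({
    "account": "<account>", "user": "<user>", "password": "<password>",
    "warehouse": "<warehouse>", "database": "<database>", "schema": "<schema>",
}).create()

# A Snowflake-managed Iceberg table: the data files live in your own cloud
# storage (referenced via an external volume), while Snowflake manages the
# metadata and the table performs like a native one.
session.sql("""
    CREATE ICEBERG TABLE orders_iceberg (
        order_id   INT,
        ordered_at TIMESTAMP_NTZ,
        amount     NUMBER(10, 2)
    )
    CATALOG = 'SNOWFLAKE'            -- managed mode
    EXTERNAL_VOLUME = 'lake_volume'  -- hypothetical external volume
    BASE_LOCATION = 'orders/'
""").collect()

# Query it with the same compute and SQL you'd use for a native table.
session.sql("SELECT COUNT(*) FROM orders_iceberg").collect()
```

As I understand it, pointing CATALOG at a catalog integration instead gives you the unmanaged mode, where another engine owns the table's metadata.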
Unstructured Data - Document AI. Ask questions in natural language of documents you have stored in Snowflake, including images and PDFs; it's clear this will allow unstructured data to become semi-structured data in Snowflake. It's possible to build pipelines that convert this unstructured data as it comes in. No fine-tuning of the model is required to start with, but you can if you want to: give feedback on the veracity of the model's output, retrain, and then share the fine-tuned model. This is very much a service approach, as opposed to providing infrastructure for an engineering approach, which is what I expected last week.
Geospatial support in GA, for both spherical (GEOGRAPHY) and flat (GEOMETRY) surfaces. You can switch between the two systems easily, and there's support for intersecting coordinates and shapes.
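A tiny example of the kind of check this enables, using functions that already exist (ST_INTERSECTS, TO_GEOGRAPHY); the coordinates and connection details are made up:

```python
from snowflake.snowpark import Session

# Hypothetical connection details -- fill in your own.
session = Session.builder.configs({
    "account": "<account>", "user": "<user>", "password": "<password>",
    "warehouse": "<warehouse>", "database": "<database>", "schema": "<schema>",
}).create()

# Does a point fall inside a polygon? GEOGRAPHY treats coordinates as points
# on a sphere; the equivalent planar check would use TO_GEOMETRY instead.
result = session.sql("""
    SELECT ST_INTERSECTS(
        TO_GEOGRAPHY('POINT(-0.1278 51.5074)'),                     -- London
        TO_GEOGRAPHY('POLYGON((-1 51, 1 51, 1 52, -1 52, -1 51))')  -- rough bounding box
    ) AS point_in_box
""").collect()
print(result)  # [Row(POINT_IN_BOX=True)]
```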
ML-powered functions - forecasting and anomaly detection without needing to know how ML works. These are accessible from Snowflake SQL and look very easy to use.
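Here's a sketch of what the forecasting function looks like, based on my reading of the preview docs; the daily_sales table, its columns and the connection details are hypothetical, and the exact argument names may well shift before GA:

```python
from snowflake.snowpark import Session

# Hypothetical connection details -- fill in your own.
session = Session.builder.configs({
    "account": "<account>", "user": "<user>", "password": "<password>",
    "warehouse": "<warehouse>", "database": "<database>", "schema": "<schema>",
}).create()

# Train a forecasting model on a (hypothetical) daily_sales table -- no
# feature engineering or model selection required.
session.sql("""
    CREATE OR REPLACE SNOWFLAKE.ML.FORECAST sales_forecaster(
        INPUT_DATA => SYSTEM$REFERENCE('TABLE', 'daily_sales'),
        TIMESTAMP_COLNAME => 'sale_date',
        TARGET_COLNAME => 'total_sales'
    )
""").collect()

# Ask the trained model for the next 14 days.
forecast = session.sql(
    "CALL sales_forecaster!FORECAST(FORECASTING_PERIODS => 14)"
).collect()
```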
Snowflake Performance Index - stable workloads have improved in performance by 15%. This should mean a saving of roughly that much in credit spend too, so while I do talk about a lack of focus on efficiency later, this is definitely better than nothing. Quite a few changes released over the last year have contributed to this improvement.
Core budgeting capability, allowing you to see whether specific assets are tracking against a budget.
Warehouse utilisation metric - how utilised a warehouse is at a given point in time. This helps you decide whether to increase your warehouse size: because of how queries use memory and spill to local disk or S3, it's not always true that running a query on a smaller warehouse is cheaper. Knowing whether a warehouse is 100% utilised on a query, or much less, is helpful for optimising workloads.
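Until that metric is available, the spill columns that already exist in ACCOUNT_USAGE give a rough proxy for whether a warehouse is undersized. A quick sketch (the connection details are placeholders):

```python
from snowflake.snowpark import Session

# Hypothetical connection details -- fill in your own.
session = Session.builder.configs({
    "account": "<account>", "user": "<user>", "password": "<password>",
    "warehouse": "<warehouse>", "database": "SNOWFLAKE", "schema": "ACCOUNT_USAGE",
}).create()

# Queries that spill heavily to local disk or remote storage suggest the
# warehouse may be undersized -- a larger size could finish faster and,
# counter-intuitively, cost less overall.
spilling = session.sql("""
    SELECT warehouse_name,
           COUNT(*)                             AS queries,
           SUM(bytes_spilled_to_local_storage)  AS local_spill_bytes,
           SUM(bytes_spilled_to_remote_storage) AS remote_spill_bytes
    FROM snowflake.account_usage.query_history
    WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
    GROUP BY warehouse_name
    ORDER BY remote_spill_bytes DESC
""").collect()
```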
Cybersyn - AI-generated content for the Snowflake Marketplace, making the world's economic data available. Exclusive to Snowflake. (See the note from Cybersyn at the end of this post.)
Snowflake Native App Framework - custom event billing, including consumption-based billing.
Snowflake VS Code extension - I’m not sure if this is new but I didn’t know about it, and it looks decent.
Developer Experience - native Python API, REST API and a new Snowflake CLI.
Git integration to keep code in sync for applications deployed on Snowflake. You could see this as the first necessary step for version control in Snowflake, and one that could spread elsewhere.
Snowpark enhancements: granular control of Python packages, Python runtime 3.9/3.10, external network access, vectorised UDFs, and more that went off the screen before I could note them down. External network access is particularly interesting, as it opens up a whole host of new possibilities.
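Of those, vectorised UDFs are the easiest to show. A minimal sketch using Snowpark's pandas-typed registration; the function, table and connection details are all invented:

```python
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col
from snowflake.snowpark.types import IntegerType, PandasSeriesType

# Hypothetical connection details -- fill in your own.
session = Session.builder.configs({
    "account": "<account>", "user": "<user>", "password": "<password>",
    "warehouse": "<warehouse>", "database": "<database>", "schema": "<schema>",
}).create()

# A vectorised UDF receives a whole pandas Series per batch instead of one
# row at a time, which cuts per-row overhead for numeric work like this.
def add_vat(prices):
    return (prices * 1.2).astype(int)

add_vat_udf = session.udf.register(
    add_vat,
    name="add_vat",
    return_type=PandasSeriesType(IntegerType()),
    input_types=[PandasSeriesType(IntegerType())],
    packages=["pandas"],
    replace=True,
)

# Use it like any other UDF in a DataFrame expression.
df = session.create_dataframe([[100], [250]], schema=["price_pence"])
df.select(add_vat_udf(col("price_pence")).alias("price_inc_vat")).show()
```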
Streaming - Snowpipe Streaming feeds Dynamic Tables, now in public preview (declaratively transform data using SQL), and dynamic tables can be chained together sequentially.
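Here's roughly what that chaining looks like; the tables, warehouse and lag values are hypothetical, but TARGET_LAG and WAREHOUSE are the documented parameters:

```python
from snowflake.snowpark import Session

# Hypothetical connection details -- fill in your own.
session = Session.builder.configs({
    "account": "<account>", "user": "<user>", "password": "<password>",
    "warehouse": "<warehouse>", "database": "<database>", "schema": "<schema>",
}).create()

# First dynamic table: clean the raw rows landed by Snowpipe Streaming.
session.sql("""
    CREATE OR REPLACE DYNAMIC TABLE orders_clean
      TARGET_LAG = '1 minute'
      WAREHOUSE = transform_wh
      AS SELECT order_id, ordered_at, amount
         FROM raw_orders
         WHERE amount IS NOT NULL
""").collect()

# Second dynamic table built on the first -- Snowflake keeps the chain fresh
# without any orchestration code.
session.sql("""
    CREATE OR REPLACE DYNAMIC TABLE daily_revenue
      TARGET_LAG = '5 minutes'
      WAREHOUSE = transform_wh
      AS SELECT DATE_TRUNC('day', ordered_at) AS day, SUM(amount) AS revenue
         FROM orders_clean
         GROUP BY 1
""").collect()
```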
Text-to-code - SQL on Snowflake generated by an LLM, even including DDL. This lives inside your Snowflake worksheets and is analyst/dev tooling much like Copilot. While I wondered whether Snowflake would release a text-to-SQL feature for data consumers, optimised for Snowflake, this didn't happen. It makes sense, as Snowflake are so enterprise focused: text-to-SQL, with its issues around consistency and reliability, could cause stakeholders to have a poor experience with Snowflake.
AI/ML feature engineering, training, inference, monitoring and consumption: snowflake.ml.preprocessing and snowflake.ml.modeling, similar to scikit-learn, etc.
Snowpark Model Registry - store, publish, discover and deploy models.
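To give a flavour of the scikit-learn-style API, here's a sketch using snowflake-ml-python's preprocessing and modelling wrappers. The tables, columns and connection details are invented, and the package was still evolving at the time, so treat the details loosely:

```python
from snowflake.snowpark import Session
from snowflake.ml.modeling.preprocessing import StandardScaler
from snowflake.ml.modeling.xgboost import XGBRegressor

# Hypothetical connection details -- fill in your own.
session = Session.builder.configs({
    "account": "<account>", "user": "<user>", "password": "<password>",
    "warehouse": "<warehouse>", "database": "<database>", "schema": "<schema>",
}).create()

# A hypothetical feature table already sitting in Snowflake.
train_df = session.table("CUSTOMER_FEATURES")
feature_cols = ["TENURE_DAYS", "ORDERS_90D", "AVG_ORDER_VALUE"]
scaled_cols = [f"{c}_SCALED" for c in feature_cols]

# scikit-learn-style preprocessing, executed against Snowpark DataFrames.
scaler = StandardScaler(input_cols=feature_cols, output_cols=scaled_cols)
scaler.fit(train_df)
train_df = scaler.transform(train_df)

# Train where the data lives -- no extraction to a separate ML platform.
model = XGBRegressor(
    input_cols=scaled_cols,
    label_cols=["LIFETIME_VALUE"],
    output_cols=["PREDICTED_LTV"],
)
model.fit(train_df)

# Score a (hypothetical) new table with the same interface.
new_df = scaler.transform(session.table("NEW_CUSTOMERS"))
predictions = model.predict(new_df)
```

The model registry would then be where a model like this gets stored, versioned and deployed; I've left it out here as that API was still in preview.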
This goes head to head with DataRobot, and it will be a hard sell for auto-ML platforms like H2O and DataRobot to win Snowflake business going forward.
To enable Streamlit-powered LLM apps, there are new native chat elements: st.chat_input() and st.chat_message().
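These are standard Streamlit elements (available from Streamlit 1.24), so a toy chat UI is only a few lines. The echo "assistant" below is a stand-in for wherever you'd actually call an LLM:

```python
import streamlit as st

st.title("Toy chat UI")

# Keep the running conversation in session state across reruns.
if "messages" not in st.session_state:
    st.session_state.messages = []

# Replay the history so far.
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

# st.chat_input renders a chat box pinned to the bottom of the app.
if prompt := st.chat_input("Ask something"):
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.markdown(prompt)

    # Placeholder response -- a real app would call an LLM here.
    reply = f"You said: {prompt}"
    st.session_state.messages.append({"role": "assistant", "content": reply})
    with st.chat_message("assistant"):
        st.markdown(reply)
```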
Snowpark Container Services - this is huge! If it works the way it should, it's essentially running Kubernetes on Snowflake. You can build whole applications on Snowflake the way other cloud vendors enable. There were 10 demos from 10 vendors who managed to move their entire application onto Snowflake: Astronomer, Alteryx, SAS, Dataiku, Hex, Nvidia NeMo, Pinecone, Carto, Weights and Biases and RelationalAI - the most interesting for data folks being Hex. The whole workflow and every runtime stays inside Snowflake, making security considerations very light for any Snowflake customer.
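For a sense of the shape of it, here's a heavily hedged sketch of standing up a compute pool and a service. The image path, names and spec are all hypothetical, and the syntax is my best reading of the preview documentation rather than anything I've run:

```python
from snowflake.snowpark import Session

# Hypothetical connection details -- fill in your own.
session = Session.builder.configs({
    "account": "<account>", "user": "<user>", "password": "<password>",
    "warehouse": "<warehouse>", "database": "<database>", "schema": "<schema>",
}).create()

# A compute pool provides the nodes the containers run on.
session.sql("""
    CREATE COMPUTE POOL IF NOT EXISTS app_pool
      MIN_NODES = 1
      MAX_NODES = 1
      INSTANCE_FAMILY = CPU_X64_XS
""").collect()

# A service runs a container image (pushed to a Snowflake image repository)
# inside that pool; the inline spec here is deliberately minimal.
session.sql("""
    CREATE SERVICE IF NOT EXISTS my_api
      IN COMPUTE POOL app_pool
      FROM SPECIFICATION $$
        spec:
          containers:
          - name: api
            image: /my_db/my_schema/my_repo/my_api:latest
          endpoints:
          - name: http
            port: 8080
            public: true
      $$
""").collect()
```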
This sort of possibility brings Snowflake much closer to being a true cloud vendor. In some ways, this is the vision GCP started out with: make apps and abstract away the infrastructure. However, I think what Snowflake are doing here is closer to the GCP dream than where GCP are today. Snowflake customers store most, if not all, of their data in Snowflake or in data lakes (which, as described above, will now perform the same). The theme of “bring the compute to the data, bring the application to the data” was hammered home throughout the conference. Last year it felt like a nice idea; this year it feels more real. When you don't have to move data around to get applications to work, a whole host of features and services that AWS/GCP/Azure offer aren't necessary.
Nvidia AI Enterprise available on Snowflake, including LLMs and the infrastructure they require.
A platform for LLMs - open-source LLMs, Snowflake LLMs and partner models.
Capacity commitment - you can buy apps from the Marketplace by drawing down from this commitment, similar to committed spend being usable on the AWS Marketplace.
Once again, significant cost savings aren't delivered; instead, powerful new use cases are, and the vision of the Data Cloud seems much closer to complete.
AI is proving a catalyst to accelerate change, even at organisations of Snowflake’s scale.
Elsewhere near the Summit
Mode has been acquired by Thoughtspot! This was honestly a bit shocking, as they have seemed very different in so many ways: philosophy, commercial approach, use case…
As soon as I heard the news, I thought of the Sisense <> Periscope acquisition, which by all accounts didn’t go well. It’s not the case that the same has to happen again, but it will certainly be interesting to see how it unfolds.
Will Mode customers have to pay Thoughtspot prices on renewal? How deeply integrated will Mode be with Thoughtspot?
The rationale of providing analyst tooling on Thoughtspot does make sense, but the synergy value mostly comes through integration. However, the other edge of the sword is that very deep integration between two products of this maturity will almost certainly fail (think migrating Mode dashboards to Thoughtspot ones after somehow making them compatible, amongst other difficult things). I could be wrong though.
Update from Alex Izydorczyk of Cybersyn: "Hey there :wave: - nice summary of Summit! I did want to point out re:Cybersyn that we are not necessarily AI-generated content but rather a lot of our content gets used for AI use cases and/or we may use AI in the future for data cleaning. At the core, we're a data-as-a-service provider (which basically means we make datasets and Native Applications available on the Marketplace) - these content sets are all focused on where business & consumers are spending money and time (ie. economic data)"