I finished looking at model kinds in sqlmesh yesterday and will begin today by finally running the sqlmesh plan command!
The command ran fine after killing my Harlequin SQL editor (DuckDB is single-player, so only one process can have the database file open). You can choose the start and end dates between which you want to backfill an incremental model - I didn't specify them in this case, as it wasn't necessary.
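If you do want to bound the backfill, the plan command accepts start and end options; the dates below are placeholders, not values from my project:
sqlmesh plan --start 2024-11-01 --end 2024-11-30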
So, let's fire up Harlequin again and see what's happened.
You can see the raw_http schema created from running the dlt pipelines, as well as three new schemas:
sqlmesh - this contains metadata about sqlmesh activity and assets.
sqlmesh__raw_http_sqlmesh - this contains the sqlmesh snapshots, the tables that store the data processed. These tables are named similarly to the models, except with a fingerprint suffix.
raw_http_sqlmesh__dev - as you can see, this schema only contains views, but it is the schema that holds the actual model objects for the dev environment just created. The views all look like this, pointing to the snapshot table:
INFO - Updating view 'bluesky.raw_http_sqlmesh__dev.incremental_followers' to point at table 'bluesky.sqlmesh__raw_http_sqlmesh.raw_http_sqlmesh__incremental_followers__1284607026'
INFO - Executing SQL: CREATE OR REPLACE VIEW "bluesky"."raw_http_sqlmesh__dev"."incremental_followers" AS SELECT * FROM "bluesky"."sqlmesh__raw_http_sqlmesh"."raw_http_sqlmesh__incremental_followers__1284607026"
These are from the log output for the plan. You can see that the log line before the view creation statement explicitly says it is updating the model view to point at the snapshot table.
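You can check this from inside Harlequin too. A query along these lines, using DuckDB's duckdb_views() function and assuming the schema name above, lists the dev views together with the SELECT statement each one wraps:
SELECT schema_name, view_name, sql FROM duckdb_views() WHERE schema_name = 'raw_http_sqlmesh__dev';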
So what are snapshots in sqlmesh? They are part of a broader concept called virtual data environments:
Iaroslav Zeigerman of Tobiko wrote an excellent post explaining what virtual data environments are in sqlmesh and why they surpass their forebears. It explains most of the benefits that sqlmesh has over previous tools.
I won't try to cover all of the details of the post, so here's the TL;DR:
Virtual data environments provide:
…a way to reuse existing data when appropriate to get an accurate preview of proposed changes in a fully isolated manner…
Every time a change to an existing model is made, a new snapshot of this model gets created and associated with a unique fingerprint. The fingerprint itself is a combination of hashes computed on the attributes that constitute a model. By default, each model snapshot writes into its own unique table (or updates its own unique view), so multiple versions of a model can exist at the same time without causing conflicts.
The platform doesn't expose datasets (physical tables or views) populated with model snapshots directly. Instead, it provides access to them through a layer of indirection implemented using views. This way, updating a dataset version in production becomes the atomic and almost instantaneous operation of updating a view associated with it by swapping the source it points to. The best part is that this operation is completely transparent to downstream consumers who always refer to a view and never to an underlying physical table. I refer to the layer of indirection powered by views as the virtual layer, while the layer of tables and views populated directly with model snapshots is called the physical layer.
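To make the physical/virtual split concrete, here is a minimal DuckDB sketch of the swap mechanic - the schema, table, and fingerprint names are invented for illustration and are not SQLMesh's real naming scheme:
CREATE SCHEMA IF NOT EXISTS physical_layer;
CREATE SCHEMA IF NOT EXISTS virtual_layer;
-- each model version writes to its own fingerprinted table in the physical layer
CREATE TABLE physical_layer.followers__fp111 AS SELECT 1 AS follower_count;
CREATE TABLE physical_layer.followers__fp222 AS SELECT 2 AS follower_count;
-- consumers only ever query the view in the virtual layer
CREATE OR REPLACE VIEW virtual_layer.followers AS SELECT * FROM physical_layer.followers__fp111;
-- promotion is just repointing the view at the new snapshot
CREATE OR REPLACE VIEW virtual_layer.followers AS SELECT * FROM physical_layer.followers__fp222;
Anything querying virtual_layer.followers never has to change; only the view definition does, which is why promotion is near-instant and cheap.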
This idea of exposing a view rather than a table has existed before. Before the Big Data era, BI developers did the work analytics engineers do today. They often used the Microsoft SQL Server stack, with its considerable limitations, but to a very high level of proficiency. They were very often certified and had a "right" way of doing things (not too dissimilar to medallion architecture), and also used OLAP cubes built with SSAS.
I remember that when interacting with datasets those BI developers had curated, it was rare to query a table directly. They would say they couldn't guarantee the table wouldn't change, but they would promise to maintain the view exposing it. They could then change the underlying data at will without breaking downstream dependencies. Sounds like maintaining an API, right?
However, what is new here is the automation: creating and maintaining the underlying data tables, fingerprinting them per environment, and promoting logic from one environment to another. So, if you're feeling like "this all sounds great, but I'm not sure I can manage it", don't worry - you don't have to. This explains how the technology works, so it ceases to be unknowable magic. From an analytics engineering perspective, everything actually becomes much easier:
Creating a new environment is as simple as running sqlmesh plan dev (or whatever you want to call your environment). Data is immediately available, because the whole of prod is pointed to from your dev views - that's the separation between virtual and physical layers at work. This is cheap and easy.
Let's see what happens when I run sqlmesh plan:
Well, that was painless and easy. The data I created in my dev virtual environment already exists, and the prod views simply point to it now. This is the same as step 2 in the diagram above, except I started with dev instead of prod.
As you can see, the fingerprints of the model snapshots have remained the same because no logic has been changed for any models. I've just promoted my dev changes to prod virtually.
INFO - Updating view 'bluesky.raw_http_sqlmesh.incremental_followers' to point at table 'bluesky.sqlmesh__raw_http_sqlmesh.raw_http_sqlmesh__incremental_followers__1284607026'
INFO - Executing SQL: CREATE OR REPLACE VIEW "bluesky"."raw_http_sqlmesh"."incremental_followers" AS SELECT * FROM "bluesky"."sqlmesh__raw_http_sqlmesh"."raw_http_sqlmesh__incremental_followers__1284607026"
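The same kind of check works after promotion. This query (again a sketch against DuckDB's duckdb_views(), assuming the object names above) should show the dev and prod views side by side, both wrapping the same fingerprinted snapshot table:
SELECT schema_name, view_name, sql FROM duckdb_views() WHERE view_name = 'incremental_followers';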
This is by far the best and most painless way I've ever seen to manage the creation and promotion of environments in analytics engineering. I've had to do it before, and, thankfully, we had Snowflake with zero-copy cloning, but even then…
We had to build a job to create the environment.
We'd have to make more than one if multiple engineers worked on different tasks.
Over time these environments would drift from prod and would become expensive to store.
Promotion to prod was tricky. You had to be sure that your dev environment matched prod at the time, then run your final code and zero-copy clone your changed models into prod.
All of this was not cheap in terms of compute used.
As this was difficult and unreliable, circumventing the environments and developing on prod was often tempting.
To be honest, I never felt 100% sure we wouldn't break anything in prod after merge because of the quality and drift of our dev environments.
We had to establish all of these processes ourselves because dbt supports environments by allowing different targets, which are often schemas or databases within your data warehouse. However, there is no built-in method for provisioning these environments, and we utilised dbt run-operation to create our development and CI environments in Snowflake.
The only extra thing I would want, relating to sqlmesh virtual environments, would be to have the environment coupled with git branches and PRs. So, if I opened a new branch to work on and then ran sqlmesh plan, it would automatically create a dev environment for that branch to use for that development work.
When development is finished and a PR is made, a CI process should run sqlmesh plan on an environment created from prod during the CI run (very freshly created from prod, therefore with no logic change; and if prod changed while CI was running, CI would rebuild that environment and re-run).
Once CI succeeds on an approved PR, it should auto-merge to prod, and a sqlmesh plan should be applied on prod to promote the changes virtually. When the branch is deleted, the dev and CI environments should be destroyed.
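Sketched as commands, and assuming a BRANCH_NAME variable supplied by git or the CI system (the environment naming here is hypothetical), that flow could look roughly like this:
# branch-scoped dev environment, created the first time you plan on the branch
sqlmesh plan "dev_${BRANCH_NAME}" --auto-apply
# fresh environment created from prod during the CI run
sqlmesh plan "ci_${BRANCH_NAME}" --auto-apply
# after the approved PR merges: promote to prod, which is mostly virtual view swaps
sqlmesh plan --auto-apply
# once the branch is deleted, mark the temporary environments for cleanup
sqlmesh invalidate "dev_${BRANCH_NAME}"
sqlmesh invalidate "ci_${BRANCH_NAME}"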
The speed of running CI/CD workflows with sqlmesh virtual updates should be excellent - the build with dependencies will take far longer than the sqlmesh virtual updates. Due to the fingerprints of the snapshots, other environments can find data already processed in development without the dev environment name or metadata needing to be shared.
A CI/CD bot template is already available, and I expect more to come that will cover the git alignment above.
Tomorrow, I'll cover how virtual environments help with managing change.
First of all, thank you for sharing your journey with SQLMesh with us; it's helping a lot on top of the official documentation, thanks to the many examples and comparisons.
I am still struggling with one point when thinking about virtual environments: wouldn't it be a risk to have many physical snapshots of the same model? I am thinking about working with big tables stored in a cloud warehouse, where creating a new physical copy of a table for dev takes a lot of time.
Is there anything smart done behind the scenes by SQLMesh to avoid multiplying storage and compute costs when creating the snapshots?
Thanks!