Yesterday, I got my advent project to a point where I had ingested data from the Bluesky API endpoints for followers and follows using dlt. Then, I used the integration between dlt and sqlmesh to automatically generate sqlmesh models from the tables created by the dlt pipeline in DuckDB.
If you look at the way the integration works, it uses the metadata and labelling created by dlt, but reads it only from the data warehouse rather than consuming any dlt-generated pipeline state from elsewhere - thus preserving loose coupling by design. All of the dlt tables are converted to incremental_by_time_range sqlmesh models.
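To give a flavour, a generated model looks roughly like the sketch below - the model name is illustrative, but dlt really does stamp every row with a _dlt_load_id metadata column (a unix timestamp stored as a string), which is what drives the incremental time range:

```sql
MODEL (
  name bluesky_dataset_sqlmesh.incremental_followers, -- illustrative name
  kind INCREMENTAL_BY_TIME_RANGE (
    time_column _dlt_load_time
  )
);

SELECT
  *,
  -- derive a proper timestamp from dlt's load id
  TO_TIMESTAMP(CAST(_dlt_load_id AS DOUBLE)) AS _dlt_load_time
FROM bluesky_dataset.followers
WHERE TO_TIMESTAMP(CAST(_dlt_load_id AS DOUBLE)) BETWEEN @start_ts AND @end_ts;
```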
Before we go into using the models and further into using sqlmesh, I thought it was well worth covering the different model kinds and how sqlmesh supports environments.
Let’s cover the basics:
Full
This is the sqlmesh equivalent of a dbt table materialisation for a model. As expected, it fully refreshes the table it represents on each run.
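In sqlmesh, the kind is declared in the MODEL block at the top of the file. A minimal sketch, with made-up names:

```sql
MODEL (
  name bluesky.follower_counts, -- illustrative name
  kind FULL
);

-- the table is rebuilt from scratch on every run
SELECT
  handle,
  COUNT(*) AS follower_count
FROM bluesky.followers
GROUP BY handle;
```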
View
This is like for like with the dbt view materialisation for a model, creating a regular database view unless you set the materialized parameter to true - then it corresponds to dbt’s separate materialized_view materialisation.
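A sketch covering both variants, again with illustrative names:

```sql
MODEL (
  name sales.pipeline_summary, -- illustrative name
  kind VIEW (
    materialized true -- omit this parameter for a regular view
  )
);

SELECT
  stage,
  SUM(amount) AS total_amount
FROM sales.opportunities
GROUP BY stage;
```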
It’s an interesting choice between these two configs. At Cube, we’re currently ingesting Salesforce data using Fivetran. We want the sales data feeding our sales dashboard to be as fresh as possible. Fivetran, in theory, updates our raw Salesforce data every 15 minutes - if, and it’s a big if, there have been any changed rows. We can easily go 12 hours of the day with only a handful of row changes, as they are driven by human interaction. Still, it means that our source data is pretty fresh.
The sales dashboard is the most important internal data asset at Cube, for obvious reasons, but because of our size, it is unlikely to be used more than a few times a day. If we were to use materialised views for the models downstream of the raw Salesforce data managed by Fivetran, we could end up refreshing those models far more often than we need to, and spending more than if they were plain views computed only when the dashboard runs. It’s not certain without testing, though, and it could be the case that BigQuery efficiently updates materialised views for small changes. However, I would imagine that even small changes cause partition rewrites, which are usually expensive.
Embedded
These are like ephemeral models in dbt: they don’t end up as database objects, but are instead injected as subqueries into the models that reference them.
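A sketch, with illustrative names:

```sql
MODEL (
  name bluesky.recent_followers, -- illustrative name
  kind EMBEDDED
);

-- no database object is created; any model selecting from
-- bluesky.recent_followers gets this query injected as a subquery
SELECT *
FROM bluesky.followers
WHERE followed_at >= CURRENT_DATE - INTERVAL 7 DAY;
```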
Seeds also exist and work exactly the same as dbt seeds.
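A seed model simply points at a CSV file instead of a query - something like this, with a hypothetical path:

```sql
MODEL (
  name bluesky.country_codes, -- illustrative name
  kind SEED (
    path 'seeds/country_codes.csv'
  )
);
```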
SCD Type 2
I’m not going to go into a huge amount of detail on this model kind, as the sqlmesh docs are great at explaining the two model kinds which support Type 2 slowly changing dimensions. I’ve had to work with SCDs a number of times in my career and they’re never a simple thing to manage. I have always stuck with Type 2, though, and I’ve never needed to learn how to implement another kind - they are probably the most common by far, which is why Tobiko have chosen to support them.
I, or one of my analytics engineers, have had to hack this together in the past using dbt incremental models, long before dbt snapshots had matured into what they are now. It’s difficult, brittle and can be very expensive when it goes wrong.
When I was at Lyst, we used this kind of method to understand the state of our product catalog, which had tens of millions of product variants to observe. This may not sound like a lot of data, but when the records change four times a day or more (we tracked stock levels per size), it starts to become a lot - and a lot = expensive!
The first kind, by time, supports input data that has an updated_at timestamp alongside a unique key.
The second kind, by column, generates the valid from and valid to timestamps automatically based on run time, for input data that doesn’t have an update timestamp; invalidation is triggered by changes to the values of specific column(s) between runs.
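A sketch of both kinds, using hypothetical table and column names:

```sql
-- by time: the input's updated_at column drives the validity windows
MODEL (
  name bluesky.profile_history, -- illustrative name
  kind SCD_TYPE_2_BY_TIME (
    unique_key id
  )
);

SELECT id, handle, display_name, updated_at
FROM raw_bluesky.profiles;
```

```sql
-- by column: no update timestamp; a new version is recorded whenever
-- any of the listed columns change between runs
MODEL (
  name bluesky.profile_history, -- illustrative name
  kind SCD_TYPE_2_BY_COLUMN (
    unique_key id,
    columns [handle, display_name]
  )
);

SELECT id, handle, display_name
FROM raw_bluesky.profiles;
```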
I have had to use both of these methods in the past, and it’s a real boon to have them as simple-to-use model kinds in sqlmesh. This is the kind of work where you might expect an engineer to spend a sprint building, testing and making it available - much more than building another kind of model. Having it as a ready-made model kind makes the work much faster and easier. If you deal with SCD Type 2s a lot in your data model, this model kind existing in sqlmesh could be a good reason to adopt it.
This is where better data modeling in product engineering helps more than tooling for the data or analytics engineer. If a record in the original source data carries a version that increments whenever its state changes, then dealing with SCD Type 2s becomes trivial.
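To illustrate, with a versioned source the validity windows fall straight out of a window function - a minimal sketch with hypothetical columns:

```sql
SELECT
  id,
  version,
  updated_at AS valid_from,
  -- the next version's timestamp closes this one; NULL marks the current row
  LEAD(updated_at) OVER (PARTITION BY id ORDER BY version) AS valid_to
FROM raw.versioned_records;
```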
External Models
sqlmesh has an equivalent concept to dbt sources called external models, and they can be generated automatically by running the create_external_models command. Because sqlmesh actually parses SQL, external models don’t need to be explicitly denoted the way dbt sources do: they are any database object queried by the project that is not managed by the project. The generated yaml contains useful metadata from the database, like types and descriptions.
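For example, an entry in the generated file looks something like this (the column names and types are illustrative, though dlt does add the _dlt_load_id and _dlt_id metadata columns):

```yaml
- name: bluesky_dataset.followers
  description: Followers loaded by the dlt pipeline
  columns:
    handle: text
    did: text
    _dlt_load_id: text
    _dlt_id: text
```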
sqlmesh models have some other benefits, too - a schema.yaml file isn’t needed to document your models, as in-line -- comments next to your fields are picked up. You can also run pre-statements and post-statements as normal SQL, separated from the model’s query by semi-colons.
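Putting those together, a model file might look like this - the statements themselves are just placeholders:

```sql
MODEL (
  name bluesky.follower_summary, -- illustrative name
  kind FULL
);

-- pre-statement: any SQL before the main query, ended with a semi-colon
CREATE TEMP TABLE IF NOT EXISTS scratch AS SELECT 1 AS one;

SELECT
  handle,                   -- the follower's Bluesky handle
  COUNT(*) AS follow_count  -- number of follow events observed
FROM bluesky.followers
GROUP BY handle;

-- post-statement: runs after the main query
DROP TABLE IF EXISTS scratch;
```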
I have left incremental models for my next post, as they are a deeper topic.