I wanted to mop up some remaining sqlmesh features in bulk before I move on to doing something else with it next week.
Let’s look at the remaining CLI commands that I think are important or useful:
clean
- I haven’t encountered a need for this yet, but I could imagine a situation where something has gone wrong, and then sqlmesh plan or run doesn’t work properly due to broken state. This would be the first command I would try to fix things before taking more drastic measures!
dag
- refer to the picture at the top generated by this command. Compared to the lineage in the UI, it’s quite basic, but it might be useful in contexts where you cannot show someone the UI or need to embed it.
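For reference, generating that picture is a one-liner. As far as I remember, you pass the output path you want the HTML written to (the path here is just an example):

    sqlmesh dag dag.html   # writes the DAG to an HTML file you can embed or share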
dlt_refresh
- you can use this to incorporate new dlt data tables created since you initialised your sqlmesh project with dlt. It would be very useful for me to start using dlt for other Bluesky endpoints. This command type is a big inspiration for what I will try to build next week.
evaluate
- returns a sample of model data. This is helpful when you’re developing and want to check if the data looks as expected. Similar to dbt show.
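A sketch of how I use it while developing; the model name is a placeholder, and the start/end/limit options are spelled as I remember them, so treat the exact flags as an assumption:

    # preview a small sample of what the model would produce for a given interval
    sqlmesh evaluate my_schema.my_model --start 2025-01-01 --end 2025-01-02 --limit 10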
format
- Data folks can be quite opinionated about SQL formatting, and what I appreciate about this format command is that it settles the debate. As an organisation, you can choose how to implement this command (including opting for leading commas 🤮), then run it as part of CD to ensure that the code maintains a consistent format. It’s SQL aware, so it is even opinionated about how you should use casts and other functions for your warehouse’s accepted type of SQL.
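For illustration, the formatting choices live in the project config and then everyone just runs the same command. The exact keys below are from memory, so treat them as an assumption rather than gospel:

    # config.yaml - project-wide formatting rules (keys are my best recollection)
    format:
      normalize: true        # normalize identifier and keyword casing
      leading_comma: false   # settle the comma debate once, for everyone
      indent: 2

    # then locally, or as a step in your pipeline:
    sqlmesh format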
invalidate
- this is how you label a development environment as no longer needed once development is finished. The obvious time to run it is after merging the PR associated with that dev environment.
janitor
- this cleans up snapshots and schemas related to invalidated environments, preventing clutter in your data warehouse from development. We used to have a big problem where we would have a new Snowflake database for each development environment, and we had no automatic mechanism for deleting them - we had to ask if anyone was using it and then delete it… which invariably led to wasted credit spend. janitor can be run after invalidation following the PR merge to remove the development environment that is no longer needed.
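Putting the two together, the post-merge cleanup looks roughly like this (the environment name is hypothetical):

    # after the PR for this dev environment has been merged
    sqlmesh invalidate my_feature_env
    # then drop the snapshots and schemas belonging to invalidated environments
    sqlmesh janitor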
table_diff
- Datafold’s previous product iteration focused on providing good development environments and a CI process which diffed dev environment models against prod to check the logic hadn’t broken anything. I know plenty of analytics engineers who loved this product, but the truth is that it should have been part of the dbt developer experience. With sqlmesh virtual environments and this command, you have the same functionality. It’s an obvious thing for an engineer to run when testing their changes, and the output can also be surfaced in CI for the PR being reviewed.
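A sketch of what I mean; I’m going from memory on the exact argument form (source and target environments separated by a colon, then the model), so check the docs before copying this:

    # diff a model between prod and my dev environment while the PR is reviewed
    sqlmesh table_diff prod:my_dev my_schema.my_model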
render
- this is incredibly useful and similar to dbt compile. However, the rendered model code is output in the terminal instead of into a file, which helps you avoid hopping around the IDE. It’s also more powerful than dbt compile in that you can set an execution time, as well as start and end times for incremental models, to see what code would actually run in a given circumstance.
You can also use the expand option to display all or some materialised models as code, providing a query that can be executed purely against external models. This way, you could run it against a source database like Postgres and use the dialect option to render the SQL so that it works on that database! This could be incredibly helpful for testing or if you simply need the data exactly as it exists right now in the production OLTP database.
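For example, something along these lines (model names and dates are placeholders; the expand and dialect options are the ones mentioned above, spelled as I remember them):

    # show the exact SQL that would run for a given interval of an incremental model
    sqlmesh render my_schema.my_model --start 2025-01-01 --end 2025-01-02

    # expand a referenced model inline and transpile the result for a Postgres source
    sqlmesh render my_schema.my_model --expand my_schema.upstream_model --dialect postgres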
ui
- this starts a browser-based UI for sqlmesh; see a picture of it below. It’s the only place you can see column-level lineage and the enhanced lineage visuals in general. You can also use it as an IDE: make code changes, create environments and run plan at the click of a button. It doesn’t feel complete as an IDE yet, and perhaps that is something that will be more feature-complete in Tobiko Cloud. To be better than using your own IDE, it would need version control linked to sqlmesh environments. Given how easy it has been to use Windsurf to develop sqlmesh models, it will also need an excellent copilot in short order.
The UI is the only feature I think may be a bad idea. We’ve seen with the dbt Cloud IDE that engineers prefer to use their own development environment. I personally would prefer a great VSCode extension that linked the virtual environment to my git branch and showed lineage for a model or column depending on where my cursor was.
However, for enterprise customers, having an IDE in the application (you can’t just sell them a CLI) is probably necessary. Many data folks in large enterprises aren’t allowed to have an IDE, Python, or DuckDB… installed on their machines, but IT can’t stop them from having a browser. It’s a tough balance to strike to ensure engineers don’t feel forced to use the UI, and can use their IDE of choice, while also supporting enterprise and less technically able users.
Cube Cloud has an IDE, and part of why engineers actually use it is that they can see the effects of their changes in real time using our playground. However, for a tool like dbt or sqlmesh, the effect of your changes shows up in data; you really want a fully featured command line as you write your code, so you can run, test and so on. It’s not impossible to make the UI more responsive (maybe it could run evaluate, test and table_diff(parent_branch/env, dev_branch/env) in real time using DuckDB and test data). If it also provided a great database explorer and SQL runner, that could be another driver towards adoption. I’d be happy to be proven wrong here, but I doubt the UI will ever be the best choice for someone like me.
There are also a bunch of other features not related to the CLI I didn’t cover:
Python models - I don’t use these in dbt (therefore wouldn’t use them in sqlmesh either) and don’t see the point… I still see Python as something you would run elsewhere from a data warehouse 🤷1. In short, Python models can run locally in sqlmesh and don’t require your data warehouse to support a Python runtime.
If it wasn’t clear from my previous posts, sqlmesh handles data contracts within its purview automatically by explaining breaking and non-breaking changes to the engineer. You have to manually specify this with dbt, with more YAML…
sqlmesh supports Iceberg format as a table_format property for a model.
Incremental models have other properties that look very useful: batch_size, batch_concurrency, forward_only, on_destructive_change and disable_restatement. They protect and optimise how you use incremental models on large, expensive datasets. These will be useful in a real-world production project and are harder to show in my Bluesky project, so I’ve included a rough sketch after this list.
sqlmesh supports multiple query engines for cost and performance optimisation on a model level.
sqlmesh supports multi-repo setups in open source; dbt supports this in Cloud only.
Although this was apparent from previous posts, sqlmesh has the plan/apply way of working like Terraform. This means that you don’t need to run anything to test whether your code changes will work, saving cloud data warehouse costs from many partial runs that fail in development.
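To make the table_format and incremental properties above concrete, here’s a rough sketch of a model definition. It isn’t from my project, and the values are purely illustrative, but the property names are the ones listed above:

    MODEL (
      name my_schema.events_incremental,   -- hypothetical model
      kind INCREMENTAL_BY_TIME_RANGE (
        time_column event_ts,
        batch_size 7,                  -- backfill at most 7 intervals per query
        batch_concurrency 2,           -- how many backfill batches can run at once
        forward_only true,             -- changes apply going forward, no historical rebuild
        on_destructive_change error,   -- block changes that would drop existing data
        disable_restatement true       -- don't allow old intervals to be restated
      ),
      table_format iceberg
    );

    SELECT
      event_id,
      event_ts
    FROM raw.events
    WHERE event_ts BETWEEN @start_ts AND @end_ts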
It’s also worth noting what Tobiko Cloud has in features that sqlmesh doesn’t:
Tobiko Cloud complements SQLMesh, supporting companies that need enterprise-level features like scalability, observability, and cost optimization.
Here’s a comparison:
1. Deployment: Tobiko Cloud simplifies SQLMesh deployment by hosting it on our infrastructure.
It provides enterprise-grade hosting and scalability for complex data transformations, freeing teams from managing infrastructure themselves.
2. Observability and Insights: Tobiko Cloud integrates deeply with SQLMesh, providing instant visibility into pipeline versions, code changes, and errors.
This allows teams to monitor their pipelines, detect changes in pipeline behavior, and rapidly trace the root causes of data issues.
3. Efficiency: SQLMesh's built-in features like virtual data environments and automatic change classification reduce computational costs and improve processing speeds.
Tobiko Cloud's enhanced change classification identifies even more scenarios where code changes don't require rerunning downstream models.
4. Cost monitoring: Tobiko Cloud automatically tracks costs per model execution for BigQuery and Snowflake.
This allows teams to rapidly detect anomalous spending and to identify the models driving cloud costs.
As I mentioned in an earlier post, having an environment state hosted centrally and reliably outside of your data warehouse is a valuable service. If Tobiko Cloud can, through its semantic understanding of SQL, help teams troubleshoot production issues more rapidly and monitor pipelines for cost and deviation, it's providing a lot of value. The enhanced change classification (partially breaking changes I looked at in a previous post, and possibly further features) on a big project will speed up engineering work, too.
It seems to eliminate three or four tools that Modern Data Stack teams were buying separately in the ZIRP era. Bundling is for real. Your CTO and CFO will gladly accept the argument for buying one tool that replaces one tool and removes the need for three others, especially if it’s cheaper than the four bills combined. This is almost certain to be true, but I need better insight into Tobiko Cloud pricing to be sure.
Having the functionality of these tools in one place will enhance developer experience, too, although it’s likely dedicated tools may have richer feature sets. For example, data observability tools use probabilistic methods to monitor data pipelines that I doubt Tobiko Cloud would offer.
Why have I covered sqlmesh in such depth? Because of how foundational a technology dbt has become for people who work in data. I had been putting off looking at sqlmesh for some time because I knew it would be a big project to properly evaluate it and compare it to dbt. Backwards compatibility with dbt was what finally tipped the balance and made me take a look.
You might think it’s perhaps unkind to promote sqlmesh in this manner because it poses a threat to dbt, but I don’t perceive it that way. Without competition, dbt operates in a market of one, and such markets are less valuable because risks can’t be adequately mitigated. With competition, the market could hold significantly more value. It’s preferable to have 50% of 10X than 90% of X.
I genuinely believe that if Tobiko became a highly successful company and sqlmesh/Tobiko Cloud turned into a very successful product, dbt Labs could gain more than anyone else. It’s not a zero-sum game.
There was no incentive for dbt to improve rapidly when they were growing so fast without competition. I have noticed an uptick in feature releases in recent quarters for them - competition is healthy and makes you fitter.
I am confident now that I can advise a company that sqlmesh is ready for prime time. Putting my money where my mouth is, we are just about to move Cube’s internal analytics from dbt core to sqlmesh.
Certainly, there will be a cost associated with switching from dbt if it is already in place, and this cost may not always be justified, particularly if you are using dbt Cloud enterprise, which alleviates some of the differences I have pointed out in this series. The larger the vessel, the more difficult it is to alter its course; the challenge is that the largest analytics engineering teams will gain the most from sqlmesh’s features.
There will also be resistance from analytics engineers who appreciate dbt’s community and movement and feel that moving away from it2, even to something better, doesn’t align with their career paths. I’m uncertain how often we’ll see sqlmesh mentioned in job descriptions in the near future; dbt appears in many of them these days.
Some might also be reluctant to learn something new if what they currently have is functioning well enough - not everyone is eager to take on what I’ve just done and enjoy it. After all, as my career has demonstrated, you can effectively use dbt at scale; that hasn’t changed.
If you’re building a fresh data stack and haven’t implemented dbt yet, then you should evaluate sqlmesh when choosing. There is no doubt that it’s a better tool. dbt core isn’t superior to sqlmesh in any area; at best it’s on par in some. sqlmesh has addressed all the major issues I’ve encountered running dbt at scale, and it had to, really. It couldn’t take a shot at dbt without this being true - if it were worse in any way, it wouldn’t fly, such is dbt’s dominance in the space.
When learning how sqlmesh incremental by time models worked, their complexity-equals-power did make me wish I could make a simpler, dbt-style incremental model that just ran the data from the last seen insert timestamp to the latest. But now that I’ve learned how they work, I know how to do this and can also enjoy the benefits of their extra features. Using render with the start and end options is really helpful for knowing what an incremental model will actually do.
The thing we should bear in mind is that without dbt, would we have had sqlmesh? sqlmesh is undoubtedly inspired by dbt and intentionally diverges where its creators have strong opinions on what should be different, shaped by real trigger time managing the complexity of Netflix-scale data. You can’t have V2 without V1, but it’s time, and we’re ready for V2. The rough edges of dbt’s developer experience are something I’ll find hard to return to now that I’ve seen what’s possible.
Some of you may feel I’ve been disloyal for writing this series after years of being a “dbt zealot.” I’m still a big fan of dbt and consider many people who work there to be friends. sqlmesh is simply a better tool. Therefore, it’s now my preferred tool for transformation in my toolbox until something better comes along… and then I’ll switch to that 🤷.
I’m searching for productivity and efficiency. Yes, I love the data community, too, but not at the expense of professionalism and potentially higher competence. It’s wrong to choose a second-best tool just because you like the vendor more. The vendor’s name isn’t on your paycheck.
1. Exceptions could be to run an ML model or make a request to an external API to do something that is effectively reverse ETL (think pushing some data back to Salesforce). That said, I think these could be done in another tool kicked off after the sqlmesh run has finished, and would be less brittle for it.
2. Although you don’t have to just because you’ve changed tool!