When I first spoke to Toby Mao from Tobiko, we discussed whether integrating sqlmesh and Cube was possible. It seemed possible, at least to a limited extent. After all, a DAG generates a data model, but it does not directly express how that data model should be used, what relationships exist within it, or necessarily how to perform the aggregations that calculate measures.
When we discussed this, I briefly considered who would build it—someone from Tobiko or one of the engineers at Cube… but I didn’t give it much thought. All I knew was that it wouldn’t be me. I’m not a software engineer, so I wouldn’t know how to build such an integration in code, although I understand how to plan one architecturally and from a product point of view.
A couple of things have changed since then. Building connectors earlier in this series, with dlt and without, using Windsurf has pushed my capabilities further in the last two weeks than I would have thought possible. The other thing that tempted me was looking at the PR that showed how the dlt integration with sqlmesh was built. It's not just the capability; it's also the speed of development. Things that I know how to do, but don't do often, can take ages to re-learn and build; some of them take seconds to build and validate in Windsurf.
So, I will have a go; this idea has been brewing in my mind since the start of last week. I have no idea whether I will succeed. This could blow up on the launchpad - I may fail to determine a good place to start. This could blow up upon ignition - even if I devise a plan, I may be unable to make it work. This could crash and burn after take-off - what I create may not be suitable for acceptance and integration into sqlmesh.
However, if I don't attempt it, how will I know if it's possible? I don't expect what I produce to be production-quality code of the kind a professional software engineer would have written had they taken this on. I just want to explore what's possible. So, follow along with the series this week; I genuinely don't know how it will end!
Another driver is that I've been thinking about how to solve the cold start problem for a company beginning with Cube. It's actually something I had been considering long before I joined Cube, back when I was running Delphi Labs. Most of our customers used a framework like dbt or sqlmesh to perform data transformations, and the metadata generated by the DAGs in these tools is incredibly rich, especially for sqlmesh, which uses sqlglot to parse SQL.

It's Monday, and I'm feeling audacious. Let's go windsurfing!
My rough plan so far:
1. Understand how to access sqlmesh's metadata programmatically. If you look at what it is possible to model in sqlmesh, it holds much of the metadata needed to build a semantic layer. I know access is possible because the sqlmesh UI does this to show the lineage visual, and sqlmesh uses it to determine breaking and non-breaking changes in plan. Specifically (see the sketches after this list):
    - It knows which models have which parents. When a model has multiple parents, the parents have been joined or unioned. I am unfamiliar with sqlglot at this point, but I hope it surfaces join keys.
    - If data has been aggregated for use in a model, that aggregation pattern could be harvested as metadata to define a measure in Cube. For companies that don't use a semantic layer, their gold-layer aggregates are ideal and complete enough to define the measures in Cube.
    - Through the column-level lineage afforded by sqlmesh (via sqlglot), any field that is a primary key in one model but also exists in another denotes a primary key - foreign key relationship.
    - The types of fields are available throughout the models.
    - The DAG understands aliasing; a field may be renamed from model to model, and tracking these changes can uncover more primary key - foreign key relationships.
2. Using the metadata found in 1, build a process that takes this metadata, reduces it where needed (the same relationship may appear many times in one DAG, for example), and then outputs Cube YAML into a folder.
3. Using the dlt PR as a guide, make a sqlmesh command that generates Cube YAML based on the metadata in a DAG (a placeholder skeleton for this appears further below).
    - The command should have options to let you choose where to output the Cube YAML folder.
    - Bonus - the command should have an option to generate models only for specific folders or tags in the project (e.g. a gold layer).
    - Bonus - the command should have an option to upload the project to a Cube Cloud deployment, accepting a token to do so.
    - Bonus - the command should have an option to accept a query history file that can then be parsed with sqlglot to harvest more metadata to splice with the metadata from the sqlmesh project, in order to generate a richer semantic layer.
    - Bonus - the command should have an option to accept an OpenAI/Anthropic token, which then drives a step that feeds the Cube YAML already generated, the reduced metadata, some prompt instructions about how to use that metadata, and Cube data modeling documentation into a prompt for a powerful LLM like Sonnet/Opus/o1, with final prompt instructions to enhance the deterministically produced Cube semantic layer YAML.
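To make step 1 concrete, here is a minimal exploration sketch of the kind of access I have in mind. It assumes the sqlmesh Python API (Context, models, depends_on, columns_to_types, grains) works the way I currently understand it to from the docs; it is my own illustration, not code from sqlmesh.

```python
# Exploration sketch (mine, not sqlmesh's): walk a project's models and
# print the metadata a semantic layer would need. Assumes a sqlmesh
# project lives in the current directory.
from sqlmesh import Context

ctx = Context(paths=".")  # load the project, its models, and the DAG

for name, model in ctx.models.items():
    # Multiple parents hint at a join or union inside the model's query.
    print(name, "<- parents:", model.depends_on)

    # Column names and types, resolved via sqlglot where possible.
    print("  columns:", dict(model.columns_to_types or {}))

    # A declared grain is a strong primary-key candidate for a cube.
    print("  grain:", model.grains)
```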
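For the join keys and aggregation patterns, sqlglot should let me walk each model's rendered query rather than parsing SQL myself. Another sketch, assuming render_query() hands back a sqlglot expression and that exp.Join and exp.AggFunc are the node types to search for:

```python
# Sketch: harvest join conditions and aggregate expressions from one
# model's query via sqlglot's AST. `model` comes from the loop above.
from sqlglot import exp

query = model.render_query()  # the model's SELECT as a sqlglot AST
if query is not None:
    # Join conditions -> candidate cube joins (and their join keys).
    for join in query.find_all(exp.Join):
        on = join.args.get("on")
        if on is not None:
            print("join condition:", on.sql())

    # Aggregate functions (SUM, COUNT, ...) -> candidate cube measures.
    for agg in query.find_all(exp.AggFunc):
        print("measure candidate:", agg.sql())
```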
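The output side of step 2 is then just dictionary-to-YAML serialisation. The structure below follows Cube's documented YAML syntax; the "orders" cube itself is entirely made up for illustration:

```python
# Minimal sketch of the output step: turn harvested metadata into Cube
# YAML. The cube contents here are illustrative, not harvested.
import yaml

cube = {
    "cubes": [
        {
            "name": "orders",
            "sql_table": "analytics.orders",
            "dimensions": [
                {"name": "id", "sql": "id", "type": "number", "primary_key": True},
                {"name": "status", "sql": "status", "type": "string"},
            ],
            "measures": [
                {"name": "count", "type": "count"},
                {"name": "total_amount", "sql": "amount", "type": "sum"},
            ],
            "joins": [
                {
                    "name": "customers",
                    "sql": "{CUBE}.customer_id = {customers}.id",
                    "relationship": "many_to_one",
                }
            ],
        }
    ]
}

with open("model/cubes/orders.yml", "w") as f:
    yaml.safe_dump(cube, f, sort_keys=False)
```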
The dlt to sqlmesh PR here will be very useful in my Windsurf Cascade instructions.
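Before prompting, it helps to picture the rough skeleton of the command, modeled on how sqlmesh registers its CLI commands with click. The command and option names here are my own placeholders; none of this exists in sqlmesh today:

```python
# Placeholder skeleton for the command I want Cascade to build,
# following sqlmesh's click-based CLI conventions.
import click


@click.command("cube_generate")
@click.option(
    "--output",
    default=None,
    help="Folder to write the Cube YAML into (defaults to the project root).",
)
@click.option(
    "--select",
    multiple=True,
    help="Bonus: limit generation to specific folders or tags (e.g. a gold layer).",
)
def cube_generate(output: str | None, select: tuple[str, ...]) -> None:
    """Generate Cube YAML from the metadata in a sqlmesh project's DAG."""
    ...  # 1) load project metadata, 2) reduce it, 3) write cube YAML
```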
So let’s build a prompt to put into Cascade - I will try to build everything except the bonus parts in one go first and see how it fares. I’ve forked the sqlmesh repo and have opened it in Windsurf to make changes.
Prompt:
Create a new CLI command that is used by typing: sqlmesh cube_generate.
This command will generate a Cube project in a specified folder - it should have an option to choose the folder location, or, by default, it will be set to the root of the sqlmesh project. It will also generate the YAML for Cubes, including their joins, dimensions, and measures, as outlined in the documentation here: https://cube.dev/docs/product/data-modeling/syntax.
The command should not attempt to generate Pre-aggregations.
It must utilize the metadata available from the sqlmesh project, as used by the sqlmesh UI and plan commands, or alternatively use sqlglot to parse the models, in order to infer the join relationships between models and establish the corresponding joins between the cubes.
The CLI command should leverage the same metadata to determine if aggregations have occurred in the data model, thereby suggesting measures to create for the cubes. Other fields will be utilized to develop cubes' dimensions and specify details such as primary key parameters.
This PR serves as a good example of how to create a new CLI command in sqlmesh: https://github.com/TobikoData/sqlmesh/pull/3218/files
It did surprisingly well; after a lot of cajoling and fixing of its own solution, it produced something that generated cubes for the sqlmesh models but couldn't handle the joins.

It also went partway to building its own SQL parser - something I specifically didn't want.
So when I try this again tomorrow, I will go step by step: pull one thing I want out of the metadata and test it, then pull out another and test it, and so on until I have all the metadata I need. Then the collected metadata will be used to generate cubes, and finally I'll create the CLI command.