Yesterday, I got to the point where a CLI command could output JSON describing the relationships between sqlmesh models and how fields are aggregated. This is the fundamental metadata required to build a semantic layer. When I set out to build this integration this week, I knew it was unlikely I would build something that deterministically generates a Cube data model perfectly. I expected that, at some point, what I had built would harvest the relevant metadata from a sqlmesh project, and I could then use AI to enhance the output.
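To make that concrete, here is one guess at the shape of that metadata; this is purely illustrative, not the actual schema the command emits, and the model and field names are hypothetical:

```python
# Purely illustrative: a plausible shape for the harvested metadata,
# not the actual schema the command emits.
model_metadata = {
    "model": "gold.daily_orders",  # hypothetical Gold layer model
    "depends_on": ["silver.orders", "silver.customers"],
    "joins": [
        {
            "left": "silver.orders",
            "right": "silver.customers",
            "on": "orders.customer_id = customers.id",
        }
    ],
    "fields": [
        {"name": "order_count", "aggregation": "COUNT", "source": "silver.orders.id"},
        {"name": "customer_region", "aggregation": None, "source": "silver.customers.region"},
    ],
}
```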
After my progress last night, I tried to enrich the metadata further so it could be used deterministically to generate a Cube semantic layer. However, I reached a stage where minor improvements were consuming excessive time, and I'm not aiming to create something perfect here! Therefore, I believe it's time to delve into some AI engineering.
My next goal is to add options to the cube_generate CLI command that accept OpenAI credentials and an LLM selection. If these credentials are provided, a prompt should be sent to the selected LLM containing the metadata the command already generates, the rendered SQL for the models, Cube's documentation on data modeling syntax, and instructions on what to do.
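The call itself is simple. Here is a minimal sketch using the official openai Python client; the function name, prompt wording, and argument layout are my own illustration rather than the actual implementation:

```python
import os
from openai import OpenAI


def enhance_with_llm(metadata_json: str, rendered_sql: str, cube_docs: str,
                     model: str = "gpt-4", api_key: str | None = None) -> str:
    """Ask an LLM to turn harvested sqlmesh metadata into a Cube data model."""
    client = OpenAI(api_key=api_key or os.environ["OPENAI_API_KEY"])
    # The prompt bundles everything the LLM needs: summarised Cube docs,
    # the harvested metadata, and the rendered SQL for the selected models.
    prompt = (
        "Using the Cube data modeling syntax summarised below, generate Cube "
        "YAML (snake_case) for the following sqlmesh models.\n\n"
        f"## Cube docs summary\n{cube_docs}\n\n"
        f"## Model metadata (JSON)\n{metadata_json}\n\n"
        f"## Rendered SQL\n{rendered_sql}\n"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```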
The Gold layer tables in my synthetic sqlmesh example project are a great microcosm of a semantic layer: event and dimension tables joined together, with field aggregations creating metrics. Cube has the concept of Views, which are like star schemas that remove the need for fixed-grain Gold layer tables, where you still need knowledge of what to aggregate to derive metrics. It would be great if a Cube View could be generated for each Gold layer model, referring to Silver layer Cubes.
I've written this far without attempting any work at all, so let's see how it goes!
So this was how I started out, and after some additions and a few minor fixes, I got something that worked! Compared to yesterday's hard problem of recursive AST walks, which I wasn't actually sure how to solve on a conceptual level myself, Cascade laughed (metaphorically, of course; it's not sentient) at the simplicity of this task.
Here is my new sqlmesh cube_generate CLI command:
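In outline, it looks something like the sketch below. This is a simplified rendering rather than the real implementation; `harvest_metadata`, `render_sql`, and `CUBE_DOCS_SUMMARY` are hypothetical stand-ins for the project's internals, and `enhance_with_llm` is the sketch from earlier:

```python
import click

# Hypothetical stand-ins for the project's internals:
def harvest_metadata(selector: str) -> str: ...
def render_sql(selector: str) -> str: ...
CUBE_DOCS_SUMMARY = "..."


@click.command("cube_generate")
@click.option("--select-models", default="*", help="Model selector, e.g. 'gold.*'.")
@click.option("--generate-yaml", is_flag=True,
              help="Ask the LLM for Cube YAML instead of emitting raw metadata JSON.")
@click.option("--output", type=click.Path(), default=None,
              help="File to write the result to.")
@click.option("--openai-api-key", envvar="OPENAI_API_KEY", default=None,
              help="OpenAI credentials; providing them enables the LLM step.")
@click.option("--llm-model", default="gpt-4", help="Which OpenAI model to use.")
def cube_generate(select_models, generate_yaml, output, openai_api_key, llm_model):
    """Harvest semantic-layer metadata from a sqlmesh project and optionally
    enhance it into a Cube data model via an LLM."""
    result = harvest_metadata(select_models)
    if generate_yaml and openai_api_key:
        result = enhance_with_llm(
            result,
            render_sql(select_models),
            CUBE_DOCS_SUMMARY,
            model=llm_model,
            api_key=openai_api_key,
        )
    if output:
        with open(output, "w") as f:
            f.write(result)
    else:
        click.echo(result)
```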
A lot of what I actually ended up playing with was the prompt text itself. Windsurf built the original prompt really well, summarising the content of the Cube docs URLs into a good structure. The summarisation introduced a few errors, though, like using camelCase instead of snake_case in many places (although this may be valid in an older Cube version). It also picked up some old ways of describing relationships between Cubes, which are still valid but rarely used anymore. Most of my time today has been editing and adjusting the summarised Cube docs that feed into the prompt.
Here is the output of running `sqlmesh cube_generate --generate-yaml --select-models "gold.*" --output test_output.yaml`:
I'm really impressed with the results. I haven't tried to compile it on a Cube instance yet, and I'm sure there will be some errors, but I had a decent look, and it looks very good! It shows how much can be done with relative ease using Windsurf, especially where the task is 'workflowy'.
I know that allowing only OpenAI is a bit limiting, but it's the provider most people have access to. I don't want to build too much or use LangChain here; a simple API request to OpenAI with a prompt is enough.
LLMs are great at being creative; I asked Windsurf Cascade to make me a new sqlmesh project to put in the examples folder, based on a typical e-commerce business with a relational data model and at least 20 models. I only needed to make slight adjustments before I could run the `sqlmesh plan` command successfully. Over time, I added things like seeds and more detail to model configs, and again, Windsurf was very helpful at making bulk changes that would normally take a human a long time to make without errors; Cascade did them within a minute.
A lot of this has been mucking about with env vars. When Cascade runs a shell command, it can't access my env vars, but when I run the same command in my terminal, it works!!! 🤬 Obviously, this is a good safety feature to stop an AI from reading an API key. Still, I think that if Cascade could just use the terminal like a tool, the way I do, it could use an env var without ever needing to read its value.
It was one of those moments when I wasn't quite sure what I wanted to see or what the truth even is. If a field is derived from two other fields defined in a CTE, to which original table does the derived field belong? The truth is that there is no table, much like the floating metrics in Power BI. Perhaps this is why Power BI doesn't assign metrics to any table in its data modeler.