Recently, the Malloy team moved from Google to Meta. Malloy provides a language for expressing analytical and transformation queries that compiles to SQL:
- Malloy queries are concise and reader-friendly, and translate seamlessly into SQL optimized for your database.
- Malloy’s semantic data modeling makes it easy to build and reuse complex business logic.
- Daunting, intricate SQL problems become straightforward in Malloy.
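To give a flavour, here is a minimal sketch of a Malloy query - my own illustrative example, assuming a hypothetical flights source defined elsewhere in a Malloy model:

run: flights -> {
  where: carrier = 'WN'                -- filter
  group_by: destination                -- dimension
  aggregate: flight_count is count()   -- measure
}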
Malloy is a project that generated a lot of excitement. However, in all honesty, it hasn’t lived up to the hype. It hasn’t been adopted as widely as it would need to be by this stage to become a standard. Part of the reason, I think, is that you have to run your Malloy code somewhere other than where you run your SQL code: Malloy is not natively supported by any data warehouse today. It also requires users to learn a new way of writing queries. Users who already know SQL probably won’t, most of the time, find enough utility to warrant learning Malloy - I fit into this demographic.
You might ask whether it matters if Malloy becomes a standard or not - maybe its creators were always aiming for something more niche? The problem is that when something isn’t adopted widely enough, the creators have to do all the integration work themselves, while the tools and systems they integrate with progress without concern for the project. This is a really difficult position to be in: it leads to fewer, more brittle integrations, which in turn means fewer people can adopt the project.
So, between forcing users to learn a whole new way of querying and not running natively on a data warehouse, Malloy has been stuck between a rock and a hard place. Even though I liked it as a technology, and I’m obviously a big supporter of semantic modeling, I never thought it would succeed at scale. Seen in that light, the move to Meta makes a tremendous amount of sense. Analysts and data scientists at Meta are very technically capable, and if something becomes the standard at Meta, they will readily adopt it, regardless of whether the industry as a whole does. If the Meta internal data toolchain supports Malloy, then with some internal education and encouragement, data folks at Meta probably will start using it and getting the benefits.
Even though Malloy has attributes of a semantic layer, it’s not designed to be one in the way something like Cube is - it’s really meant to help analysts and data scientists write reusable analytical queries efficiently, consistently and modularly. It has a wider scope of use than a semantic layer like Cube, which is not meant for exploration beyond what is purposefully exposed, but it is also much less scalable in terms of who can use it: it does not enable self-serve analytics by non-technical stakeholders, and only the more technically-able analysts would pick it up. Again, this fits a move to Meta very well.
If you look at the structure of the Malloy query above, it is similar to a Cube REST API request, with filters (where), dimensions (group_by) and measures (aggregate). However, there is no Malloy server: it isn’t designed to run in production anywhere; it’s designed for exploratory data work.
Dplyr in R and Pandas¹ in Python did become the standards for querying data in their ecosystems, but neither ecosystem had an established method before them. Their users were also typically more technically-skilled than the average analyst, and they were working in an era of experimentation with new methods and tech. Pushing SQL off its stoop in the database world is a completely different task - at this point, I don’t believe it’s really feasible. Lindy’s Law applies: the longer SQL remains the standard, the longer we should expect it to remain so.
This is part of why DuckDB has been so successful - rather than inventing a new DSL for their query engine, they used SQL, so most data folks can use it immediately, and in more places than their previous data warehouse. It’s the same reason Snowflake was so successful before it. It’s partly the same reason Databricks is catching up, and is no longer something data folks think of as a platform for “people smarter than them”. One of the core differences between data engineering and analytics engineering is that analytics engineering is oriented around SQL² and data engineering is not.
A lot of the steps you can pipe together in Dplyr and Pandas actually feel like running components of SQL on their own. Dplyr was the last tool I used that felt nicer than SQL for expressing data manipulation and querying.
Recently, GCP announced pipe syntax in BigQuery. It almost felt like a response to the Malloy team moving to Meta, though of course it’s probably unrelated. The problem statement in the announcement is really great:
Since its invention 50 years ago, SQL has become the universal language used across the database industry. Its familiar syntax and well-established community have truly democratized how people access data everywhere. But let's face it, SQL isn't perfect. SQL's syntax has several issues that make it harder to read and write:
- Rigid structure: A query must follow a particular order (SELECT … FROM … WHERE … GROUP BY …), and doing anything else requires subqueries or other complex patterns.
- Awkward inside-out data flow: A query starts in the middle and then logic builds outwards, starting with FROM clauses embedded in subqueries or common table expressions (CTEs).
- Verbose, repetitive syntax: Tired of listing the same columns over and over in SELECT, GROUP BY, and ORDER BY, and in every subquery? We hear you.
These issues can make SQL more difficult for new users. Even for expert users, reading or writing SQL takes more work than should be necessary. A more convenient syntax would help everyone.
Over the years, several alternative languages and APIs have been proposed, and some have had some success in narrow applications. Many of these, including Apache Beam and Python DataFrames, use piped data flow, which makes it easy to compose queries arbitrarily. Many users find this syntax easier to understand and more usable than SQL.
But SQL is ubiquitous, and unlikely to be replaced any time soon. Learning a new language and migrating workloads to new tools is simply too much for most users.
Here is an example of the syntax from the announcement:
-- Pipe Syntax
FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
|> EXTEND EXTRACT(YEAR FROM trip_start_timestamp) AS year
|> AGGREGATE COUNT(*) AS num_trips
   GROUP BY year, payment_type
|> AGGREGATE AVG(num_trips) AS avg_trips_per_year
   GROUP BY payment_type ASC;
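For contrast, here is roughly the same query in standard syntax - my own sketch of an equivalent, showing the inside-out structure the announcement complains about:

-- Standard Syntax (a rough equivalent, for comparison)
SELECT
  payment_type,
  AVG(num_trips) AS avg_trips_per_year
FROM (
  -- the logic starts here, in the middle of the query
  SELECT
    EXTRACT(YEAR FROM trip_start_timestamp) AS year,
    payment_type,
    COUNT(*) AS num_trips
  FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
  GROUP BY year, payment_type
)
GROUP BY payment_type
ORDER BY payment_type;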
DuckDB already allows some of this piped behaviour, and will allow more in the future:

FROM my_table SELECT my_column;

SELECT
    ('Make it so')
        .upper()
        .string_split(' ')
        .list_aggr('string_agg', '.')
        .concat('.') AS im_not_messing_around_number_one;
I do believe that efforts to make SQL “friendlier”, or to allow richer, more efficient expression, will work, provided they can be used in the data warehouse itself, with and alongside existing SQL. If you’re getting a human to interact with a database, get them to write SQL++, and they can use as much or as little of the new syntactic sugar as they like.
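DuckDB’s GROUP BY ALL is a good example of the kind of opt-in sugar I mean - a sketch, assuming a hypothetical local taxi_trips table:

-- GROUP BY ALL is DuckDB sugar; everything else is plain SQL
SELECT
    payment_type,
    EXTRACT(YEAR FROM trip_start_timestamp) AS year,
    COUNT(*) AS num_trips
FROM taxi_trips
GROUP BY ALL;  -- infers the grouping columns, no repetition

A user who ignores the sugar loses nothing - GROUP BY payment_type, year works exactly as before.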
If you’re getting a tool, system or AI to interact with a data warehouse, you have a lot more freedom in terms of proxying and abstraction - you don’t necessarily need SQL at the interface; it just needs to be the compilation target in order to run on the data warehouse.
Claude 3.5 Sonnet can now use computers - we’re moving toward getting AI to use our existing infrastructure, rather than building dedicated new infrastructure for it. If AI builds on top of SQL, SQL may outlive us!
¹ Pandas isn’t really for serious work any more - it’s better to quack. It’s more for cute small-data use cases and demos. Although Pandas-style APIs to more powerful tooling are now very commonplace, so the Pandas API is the standard more than Pandas itself. This is similar to how SQL started off as a way to query on-prem databases but can now be used with massive clusters. Modern data warehouses are pretty far from old OLTP database technology - SQL is just the standard API.
² When hiring analytics engineers, SQL was the one technical skill I couldn’t compromise on. Everything else: git, dbt, shell, Python… were nice-to-haves that could be learned quickly on the job. SQL as a skill also shows the ability to understand and use a relational data model, which is the other side of the coin in terms of necessary technical ability for an analytics engineer. Data engineers often need to know many more technical skills, but fewer soft skills.
Right on.