Database legends, Michael Stonebraker and Andy Pavlo, released an interesting paper in SIGMOD Record last month: "What Goes Around Comes Around... And Around..."
I have read the paper, and there are multiple parts I want to explore (and quote) below, but for your convenience here is the TL;DR, courtesy of Perplexity:
1. Dominance of the Relational Model (RM) and SQL:
Despite numerous attempts to replace or augment them, the relational model and SQL have remained the dominant paradigms in database systems. The paper argues that these technologies have proven to be resilient and adaptable to changing requirements.
2. Evolution of Data Models:
The paper discusses various data models that emerged to address perceived limitations of the relational model, including:
- Object-oriented databases
- XML databases
- Key-value stores
- Document databases
- Graph databases
However, most of these alternative models have either been absorbed into relational systems or have found niche applications rather than replacing SQL databases entirely.
3. Query Languages:
While new query languages were developed for different data models, SQL has continued to evolve and incorporate features to handle diverse data types and structures. The paper notes that many NoSQL systems have added SQL-like query capabilities over time.
4. Convergence Towards Multi-Model Databases:
The trend in recent years has been towards multi-model databases that can handle different data models within a single system. This approach allows organizations to use a single database for various data types and querying needs.
5. Cloud-Native Databases:
The paper discusses the rise of cloud-native databases, which are designed to take advantage of cloud infrastructure for scalability and elasticity. These systems often combine features from both SQL and NoSQL paradigms.
6. Specialized Systems:
While general-purpose relational databases remain dominant, the paper acknowledges the emergence of specialized systems for specific use cases, such as time-series databases or graph processing engines.
7. Future Trends:
The paper speculates on future developments, including:
- Further integration of machine learning capabilities into database systems
- Continued evolution of SQL to handle more complex data types and operations
- Increased focus on ease of use and automation in database management
8. Conclusion:
The authors conclude that while there have been many innovations in data models and query languages, the fundamental principles of the relational model and SQL have proven to be enduring. They suggest that future developments are likely to build upon these foundations rather than replace them entirely.
This paper provides valuable insights into the historical development and current state of database technologies, offering a perspective on why certain approaches have persisted while others have been less successful in displacing established paradigms.
SQL incorporated the best query language ideas to expand its support for modern applications and remain relevant.
Regarding SQL absorbing many of the great features of other languages, we can see this happening in Snowflake, BigQuery and DuckDB in particular.
Snowflake’s ability to handle semi-structured data in SQL was groundbreaking when it came out. I personally relied upon this, along with Snowpipe, to ingest data from our S3 Parquet data lake when I was at Lyst back in 2019, avoiding the need for another ELT process. BigQuery has incorporated the ability to call ML models from BQ SQL, which has now been extended to include LLMs. DuckDB has consistently added new functionality, and therefore power, to SQL.
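To make that concrete, here is a rough sketch of what each of these looks like; the table, model and bucket names are made up, and the exact syntax can vary by version and account setup:

```sql
-- Snowflake: query a hypothetical VARIANT column holding JSON events,
-- flattening a nested array into rows without a separate ELT step.
SELECT
    e.payload:order_id::NUMBER AS order_id,
    item.value:sku::STRING     AS sku
FROM raw_events AS e,
     LATERAL FLATTEN(input => e.payload:items) AS item;

-- BigQuery: call an ML model from SQL against a hypothetical feature table.
SELECT *
FROM ML.PREDICT(
    MODEL `my_project.my_dataset.churn_model`,
    (SELECT * FROM `my_project.my_dataset.customer_features`)
);

-- DuckDB: query Parquet files directly (S3 paths need the httpfs extension).
SELECT sku, COUNT(*) AS orders
FROM read_parquet('s3://my-bucket/events/*.parquet')
GROUP BY sku;
```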
Hadoop’s limitations opened the door for other data processing platforms, namely Spark [201] and Flink [109]. Both systems started as better implementations of MR with procedural APIs but have since added support for SQL [105].
Databricks did not initially support SQL in their platform, but has since released several SQL interfaces across its query engines over the years. This is the reverse of SQL absorbing great features from other languages: a non-SQL platform bringing SQL in. The gravity of SQL was too strong, and Databricks had to support it to compete with the likes of Snowflake. It’s clear they have gone from strength to strength since that point. When I chose Snowflake at Lyst in 2019, I didn’t even consider Databricks, as it was something that was “for people much more technically capable than me”.
Today it’s a different story, with Databricks going toe to toe with Snowflake and the two platforms becoming increasingly similar in capability. Snowflake is bringing Databricks’ abilities into Snowflake SQL, while Databricks is adding Snowflake’s SQL abilities to its already advanced ML and AI offerings. Both are investing heavily in AI, and both have recently released AI analyst experiences, which I will explore soon.
Others provide a non-SQL veneer over a RDBMS (e.g., PRQL [64], Malloy [39]). Although these languages deal with some of SQL’s shortcomings, they are not compelling enough to overcome its entrenched userbase and ecosystem.
I agree with this: even though I like Malloy (I haven't looked at PRQL in any great depth), I think it will remain niche. The only thing that ever tempted me away from SQL was dplyr in R, where I genuinely thought the expression of some analyses was better than in SQL. Pandas, in comparison (which I've spent a lot more time with since R's decline), is far less appealing. With DuckDB going from strength to strength, I can’t see why I would want to go back to Pandas1.
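As a small illustration of why I don't feel the pull back to Pandas, here is the sort of "latest record per entity" question that DuckDB SQL answers in one statement, where I'd otherwise be chaining sort/groupby/drop_duplicates calls. The file and column names are hypothetical:

```sql
-- DuckDB: latest order per customer, straight from a Parquet file.
SELECT customer_id, order_id, ordered_at, amount
FROM read_parquet('orders.parquet')
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY customer_id
    ORDER BY ordered_at DESC
) = 1;
```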
However, if a project like Apache Calcite, which extends SQL rather than replacing it (a strategy the paper illustrates is proven), could gain momentum, I think this approach could catch on. I don’t actually know why the main DWH vendors haven’t offered Calcite, or their own version of it, in their SQL variants.
The summary section of the paper is very interesting, and there are some parts I want to look at in the light of OLAP.
Developers need to query their database directly.
Most OLTP applications created in the last 20 years primarily interact with databases via an abstraction layer, such as an endpoint API (e.g., REST, GraphQL) or an object-relational mapper (ORM) library. Such layers translate an application’s high-level requests into database queries. ORMs also automatically handle maintenance tasks, such as schema migrations. One could argue that since OLTP developers never write raw SQL in their applications, it does not matter what data model their DBMS uses as these layers hide it.2
ORMs are a vital tool for rapid prototyping. But they often sacrifice the ability to push logic into the DBMS in exchange for interoperability with multiple DBMSs. Developers fall back to writing explicit database queries to override the poor auto-generated queries. This is why using a RDBMS that supports SQL is the better choice.
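A quick sketch of what the paper is getting at here, with hypothetical tables: an ORM lazily loading a relationship will often issue one query per parent row (the classic N+1 pattern), and the explicit query developers fall back to does the same work in a single pass:

```sql
-- What a lazily-loading ORM often emits (one query per order, the "N+1" pattern):
-- SELECT * FROM orders WHERE customer_id = 42;
-- SELECT * FROM order_items WHERE order_id = 1001;
-- SELECT * FROM order_items WHERE order_id = 1002;
-- ... one round trip per order ...

-- The explicit query a developer falls back to writing instead:
SELECT o.order_id, o.ordered_at, i.sku, i.quantity
FROM orders AS o
JOIN order_items AS i ON i.order_id = o.order_id
WHERE o.customer_id = 42;
```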
With OLAP and data analysis, the reverse of the above has been true: data folks have mostly been writing their own SQL to interact with data warehouses for a long time. I really believe this pattern of always writing custom SQL to solve any piece of data analysis doesn't scale.
When application developers write a transaction directly against a database, the development and testing process ensures that this SQL is correct. If they treated production data the wrong way, someone would know very quickly.
It's also simpler: they aren't doing aggregations or using CTEs and subqueries just to retrieve a record3. The main complexity, as described in the paper, is that relational models can require many joins, or subsequent queries/transactions, to retrieve the attributes for an entity, which can lead to production performance problems if managed poorly.
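To show the difference in shape (tables and columns hypothetical): the OLTP query is a point lookup across a few joins, while even a modest analytical question immediately pulls in aggregation, a CTE and a window function:

```sql
-- OLTP: fetch one entity's attributes, spread across related tables.
SELECT c.customer_id, c.name, a.city, p.plan_name
FROM customers AS c
JOIN addresses AS a ON a.customer_id = c.customer_id
JOIN plans     AS p ON p.plan_id     = c.plan_id
WHERE c.customer_id = 42;

-- OLAP: even a simple business question needs aggregation and a CTE.
WITH monthly AS (
    SELECT DATE_TRUNC('month', ordered_at) AS order_month,
           SUM(amount)                     AS revenue
    FROM orders
    GROUP BY 1
)
SELECT order_month,
       revenue,
       revenue - LAG(revenue) OVER (ORDER BY order_month) AS revenue_change
FROM monthly
ORDER BY order_month;
```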
With analytical queries, the possible complexity is huge, which is why we mostly haven't had an abstraction layer in place. Universal semantic layers can cover a big part of the complexity surface area, providing consistency by compiling high-level requests into complex SQL. They will never cover every possible type of analytical query, but we should use them to cover many, or most, of them.
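Roughly what I mean by compiling high-level requests to complex SQL, using a made-up request shape and schema: the analyst asks for a metric by some dimensions, and the semantic layer generates the joins, grain and filters from engineer-defined definitions:

```sql
-- Hypothetical high-level request to a semantic layer:
--   metric:     net_revenue
--   dimensions: [region, order_month]
--   filter:     ordered_at >= '2024-01-01'

-- The kind of SQL the semantic layer might compile it to:
SELECT
    r.region,
    DATE_TRUNC('month', o.ordered_at) AS order_month,
    SUM(o.amount - o.discount)        AS net_revenue
FROM orders AS o
JOIN customers AS c ON c.customer_id = o.customer_id
JOIN regions   AS r ON r.region_id   = c.region_id
WHERE o.ordered_at >= DATE '2024-01-01'
GROUP BY 1, 2;
```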
So my conclusion is the reverse for data folks: analysts should move away from querying their database directly where possible. This is especially true for standardised reporting and for queries that are not ad hoc but operational in nature.
For OLAP databases, NL could prove helpful in constructing the initial queries for exploratory analysis. However, these queries should be exposed to a dashboard-like refinement tool since English and other NLs are rife with ambiguities and impreciseness.
There is a reluctance to depend on current LLM technology for decision-making inside the enterprise, especially with financial data. The biggest issue is that the output of an LLM is not explainable to a human. Second, LLM systems require more training data than “traditional” ML systems (e.g., random forests, Bayesian models). Companies generally cannot outsource the creation of training data for these models to unskilled people. For these reasons, the uptake of LLMs for enterprise data will be cautiously slow.
The first paragraph here is a position I’ve written a lot about. Text-to-SQL tools can be a useful copilot for experienced data folks. The fact that “English and other NLs are rife with ambiguities and impreciseness” is the main reason semantic layers are needed to enable the use of AI with data.
The second paragraph describes the consequences of not having a semantic layer. Transparency into what an LLM has generated is possible when it makes a high-level request to a semantic layer, which then compiles to SQL using engineer-defined logic. You don’t need a vast training dataset to achieve high accuracy if your LLM data application uses this text-to-semantic-layer method, as I’ve written about before.
Universal semantic layers don’t try to replace SQL for OLAP. They provide high-level APIs and compilers that abstract the writing of this SQL. Engineers can codify how the data model fits together, and what it means, so that it can be used many times by less technical users: OLAP SQL with scale.
The statement here is reversed for data analysts. As they nearly always write raw SQL, it hugely matters that they have a strong data model. I’ll let Joe do the evangelism around data models, and I’ll stick with semantic layers. I will assume you have listened to Joe already and made a good data model to put your semantic layer on! The argument for tolerating poor data modeling doesn’t really exist, a point the paper broaches many times when discussing the move towards the relational model by tech that initially shunned it.
At least I hope not.
Semantic layers are dope too. I guess we can also debate what a semantic layer is this week ;(