Recently, DeepSeek unexpectedly¹ open-sourced Smallpond, which provides a way to scale out DuckDB to run queries on a cluster of machines rather than on a single machine or instance. Tools like MotherDuck don’t currently do this; they use bigger single instances to run bigger workloads, an approach that has proven effective for almost all workloads.
However, I do think this exposes a need. DuckDB has avoided building any feature to make it easy to scale out. Instead, its developers have focused on making it the fastest and best in-process OLAP query engine for a single machine, relying on the vendor community to handle applying DuckDB to bigger-data scenarios on their own infrastructure.
Nonetheless, the question remains: how do we run DuckDB on many machines? And if we don’t need to, why is someone doing it anyway? Admittedly, companies that train LLMs have bigger data needs than most, but it was still done. DeepSeek also chose to do this rather than use existing technologies like Spark, Dask or Modin, which, unlike DuckDB, are designed to run on many machines. Has DuckDB become so performant that scaling it out is more attractive than reaching for something like Spark?
Before DeepSeek launched Smallpond, the announcement of Polars Cloud, which I discussed a few weeks back, clarified this need for me. Like DuckDB, Polars has aimed to maximise performance on a single machine or instance. However, as I noted in my post on it, the focus has now shifted specifically towards scaling Polars out.
Last week, Polars announced Polars Cloud, which enables Polars to be scaled out across a cluster of machines rather than running on just a single machine.
…we have built a distributed scheduler-worker architecture in Rust that can analyze these queries, runs an adapted Polars optimizer and, depending on the query, comes up with a physical plan that scales horizontally or vertically. Polars is very strong in vertical scaling and we recognize that many queries will be most cost effective on a single machine. It however, is also very common that queries start with a set of operations that reduce the dataset size (think of group-by’s and filters). When querying large data sizes from cloud storage, you are bound to the IO limit of a single node. With horizontal scaling we can drastically increase those download limits and after the size reductions finish off on a single machine. We are building an architecture that shines in both horizontally and vertically scaling, dubbed diagonal scaling, choosing the optimal strategy dynamically…
They have described a state-of-the-art approach, dubbed diagonal scaling, in which many machines can execute a single query. The appropriate machines for each step of the query execution plan are chosen for their different properties and in varying numbers. For example, many smaller machines can provide higher combined I/O for the initial table scan steps, while larger high-memory machines perform intensive operations like sorts and window functions.
I don’t think anyone expects DuckDB Labs to follow suit and provide a similar cloud offering. That would go against their whole foundation setup, which I think is great and has given us the DuckDB we enjoy today.
However, could DuckDB add functionality to run queries on many machines where the user or the vendor provides the infrastructure? This would mean the query execution plan would need to become aware of available infrastructure and choose to create instances to execute different steps in the query, similar to the scheduler-worker architecture described in the Polars Cloud quote above.
Could there be a future where you provide DuckDB with AWS credentials? It could then create EC2 instances as needed to run query steps, pass the data on to further newly created instances to perform subsequent steps, and destroy instances as it works through the query, minimising cloud spend for the user or vendor.
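To make the idea concrete, here is a hypothetical sketch of that lifecycle using boto3. Everything here is an assumption for illustration: the AMI, the instance types, the per-step sizing, and above all the notion of DuckDB driving any of this, which no current API supports.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

def launch_step_instance(instance_type: str) -> str:
    """Launch a short-lived instance sized for one query step (hypothetical)."""
    resp = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",  # placeholder AMI with DuckDB baked in
        InstanceType=instance_type,
        MinCount=1,
        MaxCount=1,
    )
    return resp["Instances"][0]["InstanceId"]

# Imagined sizing: many small instances for the I/O-heavy scan,
# one large high-memory instance for the sort that follows.
scan_ids = [launch_step_instance("m7g.large") for _ in range(8)]
sort_id = launch_step_instance("r7g.4xlarge")

# ... hand intermediate results from the scan fleet to the sort node ...

# Tear everything down as the query completes, minimising spend.
ec2.terminate_instances(InstanceIds=scan_ids + [sort_id])
```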
I have been interested in Snowflake’s query acceleration service because I’ve watched hundreds, probably thousands, of Snowflake queries execute. I’ve seen them go through scan, filter, sort, join, aggregate… steps. I’ve seen where those steps are I/O-restricted, where they are compute-restricted, and where they run out of memory and spill to local disk or, worse yet, to S3.
Choosing one machine to run all the query steps has always seemed odd in the cloud, where you can quickly and easily spin up instances in many configurations. Each step has different needs and calls for different resources. And with Apache Arrow as mature as it is now, moving data between steps is no longer a serious objection either.
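As a toy illustration of how cheap that hand-off has become, here is a minimal sketch: a DuckDB result goes out as an Arrow table, travels as an IPC stream (here just a buffer; in practice it would cross the network, e.g. via Arrow Flight), and is queried again on the other side.

```python
import duckdb
import pyarrow as pa

# Pull a query result out of DuckDB as an Arrow table (no row-by-row conversion).
table = duckdb.sql("SELECT range AS id, range % 10 AS grp FROM range(1000)").arrow()

# Serialize with Arrow's IPC stream format, as you would before shipping an
# intermediate result to the machine running the next query step.
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)
buf = sink.getvalue()

# The "receiving machine": deserialize and keep querying where the last step left off.
received = pa.ipc.open_stream(buf).read_all()
print(duckdb.sql("SELECT grp, count(*) FROM received GROUP BY grp").fetchall())
```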
Smallpond uses Ray to distribute work to different instances, each running DuckDB. See the post below for more details.

This is an ingenious way to scale out DuckDB, but it also feels crude. Don’t get me wrong; I think Ray is great, but the problem here is that you’re running many full queries to execute the steps of one query. Operating this way will inevitably produce waste and inefficiency. We also know from the history of the data stack that not all operations split easily, if at all, into many separate queries.
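To illustrate the pattern I’m describing (this is not smallpond’s actual API; the paths and column names are invented): each worker runs a complete DuckDB query over one partition, and then yet another query merges the partial results.

```python
import duckdb
import pyarrow as pa
import ray

ray.init()

# Each Ray task runs a full DuckDB query over a single partition of the data.
@ray.remote
def partial_agg(path: str) -> pa.Table:
    return duckdb.sql(
        f"SELECT grp, count(*) AS n FROM read_parquet('{path}') GROUP BY grp"
    ).arrow()

# Hypothetical partitioned dataset.
paths = [f"s3://bucket/data/part-{i}.parquet" for i in range(8)]
partials = pa.concat_tables(ray.get([partial_agg.remote(p) for p in paths]))

# A second, separate query merges the partials: nine full queries in total
# to answer what is logically a single GROUP BY.
result = duckdb.sql("SELECT grp, sum(n) AS n FROM partials GROUP BY grp").arrow()
```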
The core metadata needed to divide work among many machines, as I alluded to above, is the query execution plan. For each stage of the query, the plan details the number of rows that will be scanned (I/O), and estimates how many rows will survive filtering, how many rows will be joined, how many operations will be performed (compute), and how many rows a given operation will need to hold (memory). These figures can be used to calculate, with reasonable accuracy, what machine configuration, or set of machines, is needed to perform each step efficiently.
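DuckDB already surfaces much of this. A quick illustration (the table and query are throwaway examples): EXPLAIN prints the physical plan, operator by operator, with the cardinality estimates a scheduler could size machines against.

```python
import duckdb

con = duckdb.connect()
con.execute(
    "CREATE TABLE t AS SELECT range AS id, range % 100 AS grp FROM range(1000000)"
)

# EXPLAIN returns the physical plan: scan, filter and aggregate operators,
# each annotated with estimated cardinalities.
for _, plan in con.sql(
    "EXPLAIN SELECT grp, count(*) FROM t WHERE id > 900000 GROUP BY grp"
).fetchall():
    print(plan)
```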
The ideal approach would be for the query execution plan to output, for each query step, the data moved, the memory required, and the operations to perform, and for another component to provision the appropriate machine or machines for each step and then execute it. As I understand it, this is what the new engine in Polars Cloud will do, and it is the right way to scale out query execution.
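A hypothetical sketch of what that second component’s core decision might look like; the step names, thresholds, and instance types are all invented for illustration and are not any real DuckDB or Polars Cloud API.

```python
from dataclasses import dataclass

@dataclass
class StepEstimate:
    name: str          # e.g. "scan", "filter", "sort"
    rows: int          # estimated rows flowing through the step
    bytes_moved: int   # estimated I/O volume
    peak_memory: int   # estimated working-set size in bytes

def choose_machines(step: StepEstimate) -> list[str]:
    """Map one plan step's estimates to a machine configuration (invented heuristics)."""
    if step.name == "scan":
        # I/O-bound: fan out across many small instances, ~10 GiB per node.
        return ["m7g.large"] * max(1, step.bytes_moved // (10 * 2**30))
    if step.peak_memory > 64 * 2**30:
        # Memory-bound sorts and window functions get one big high-memory node.
        return ["r7g.8xlarge"]
    return ["m7g.xlarge"]

plan = [
    StepEstimate("scan", rows=2_000_000_000, bytes_moved=400 * 2**30, peak_memory=2**30),
    StepEstimate("sort", rows=50_000_000, bytes_moved=8 * 2**30, peak_memory=96 * 2**30),
]
for step in plan:
    print(step.name, "->", choose_machines(step))
```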
With DuckDB, I would expect the component that provisions machines for each step and executes the plan to be built by a vendor or user. Regular DuckDB users would see no difference in how it works in this regard - if no special infra is provided, the query executes the same way it does today.
Today, it is not possible to run separate query steps on different machines in DuckDB, and this is probably the biggest change that would be needed to achieve what I described above.
The counterargument to this whole topic is that the right-sized machine is the one sitting in front of me, with plenty of power and RAM to execute most queries I would want to run with DuckDB. So, do we need to consider the complexity of splitting the query across machines at all? I instinctively want to say no, but something is nagging me, making me think that no will eventually become yes.
¹ Unexpectedly, because we weren’t expecting the disruptors of OpenAI to open-source a data warehouse. However, this probably shows how closely linked data and AI engineering activities are.