Last week, Polars announced Polars Cloud, which enables Polars to be scaled up across a cluster of machines rather than just on a single machine.
Polars is one of a handful of open-source high-performance querying projects, including DuckDB and Apache DataFusion. Like DataFusion, Polars is written in Rust and can often perform 10 to 100 times faster than its precursor, Pandas.
I’ve always preferred SQL to Pandas; part of that is familiarity. I learned SQL first, then R with dplyr and the Tidyverse, and then Python and Pandas. While dplyr does give SQL a run for its money as a way of expressing queries, I never found Pandas to be as good. It felt clunky at the time, and although it has improved over the years and now boasts better ergonomics, I was never certain of the best way to achieve something. There always seemed to be many different ways to accomplish the same task in Pandas, perhaps due to years of bloat and accumulated dependencies, when really there should be only one.
As you may have seen in my recent advent series on SQLMesh, I utilised the dlt framework to build pipelines from the outset, and alongside this, I also used Polars. Polars felt significantly simpler to use than Pandas; there appeared to be one correct way to accomplish tasks. It was also much more performant than Pandas, and the integration with PyArrow is truly excellent. AI still seems to suggest using Pandas unless instructed otherwise, at which point it defaults to Polars - likely due to the abundance of StackOverflow posts and similar resources that AI has been trained on regarding Pandas.
While I prefer expressing queries in SQL to using dataframes, I recognise that there are situations where it is more appropriate to utilise dataframes. This often appears to be the case in ML pipelines where SQL may have already been employed to present a flat dataset for ingestion into the ML pipeline, managing all the relational logic beforehand. At this point, using dataframe methods to manipulate your data, add columns, and so on becomes easier.
Frequently, ML workloads do not operate on cloud services in the same way that SQL queries do when using a cloud data warehouse. Instead, they are either running locally or on a provisioned cloud instance that has been optimised for the task in terms of memory, computing power, and disk space. ML workloads are notoriously sensitive to out-of-memory issues.
ML training libraries are not easy to use outside of this local or single cloud instance context; therefore, it seems that dataframe libraries have also been forced to live here. Polars has greatly improved the execution of the data transformation steps on a single machine prior to the ML training steps. However, data transformation steps can often be more challenging than ML training to predict resource requirements, as it’s easy to temporarily explode the size of your data while performing a join or creating copies to join to themselves, among other operations.
SQL has had the opposite problem: It has become easy to run it on a server or in a cloud data warehouse, but it is difficult to run locally in a performant way for analytical workloads. That has really changed only since DuckDB.
Polars was already easy to adopt for anyone, or any workload, that previously used Pandas. There are even ways to use Polars through a Pandas-compatible API, so code that relies on Pandas can keep working while gaining the performance of Polars.
This step to scale out Polars for use with a cloud cluster is a natural progression, although I’m still uncertain how auto-scaling will function for Polars Cloud. It would be ideal if users didn’t have to worry about right-sizing the instance, but that may be a dream. Polars already maximises the utility of your single machine or instance. Where that isn’t sufficient, you can now utilise Polars Cloud without significantly changing how you use it.
As the Polars team describe it in the announcement:

Cluster side, we have built a distributed scheduler-worker architecture in Rust that can analyze these queries, run an adapted Polars optimizer and, depending on the query, come up with a physical plan that scales horizontally or vertically. Polars is very strong in vertical scaling, and we recognize that many queries will be most cost effective on a single machine. However, it is also very common that queries start with a set of operations that reduce the dataset size (think of group-bys and filters). When querying large data sizes from cloud storage, you are bound to the IO limit of a single node. With horizontal scaling we can drastically increase those download limits and, after the size reductions, finish off on a single machine. We are building an architecture that shines at both horizontal and vertical scaling, dubbed diagonal scaling, choosing the optimal strategy dynamically…
…Besides multiple scaling strategies, we are committed to running open source Polars as our engine on the worker nodes. This ensures our incentives are aligned and that the semantics of Polars Cloud will not deviate. Polars Cloud will allow you to run all engines. That means there will also be GPU support. You will be able to spin up a machine with a high-end GPU and connect locally in interactive mode. Our new Streaming Engine has an out-of-core design and will be able to spill to disk in an efficient manner. Together with distributed queries, this will truly scale Polars to any dataset. We already have the first preliminary results on the PDS-H benchmark, and they look very promising. We already beat our in-memory engine by a factor of ~3 (and there are still a lot of performance opportunities), and it goes without saying that the memory characteristics are much better.
With Polars Cloud, I can foresee the data transformation stages of ML pipelines becoming less of a limiting factor, delivering rapid transformations for a pipeline. However, I find that the ML training aspect still does not scale easily.
Part of SQL's success and renewed dominance was due to the failure of Pandas and R dataframe methods to scale effectively (yes, I’m aware of Dask, Modin, etc.). With cloud data warehouses like BigQuery, Snowflake, and Databricks, you could execute virtually any data pipeline you desired if you were prepared to pay for it. This was not the case for Pandas and R. This advantage, coupled with the user-friendliness of SQL for a broader audience of data practitioners, contributed to the dominance of the cloud data warehouses we observe today. I wonder: had something like Polars Cloud been available 10 years ago when Snowflake and BigQuery emerged, would SQL have achieved such dominance again?