Python Data Analysis #7 A Taste of Polars: Your Next Move When pandas Slows Down

This is the final post of the series. Over the past six posts we covered pandas end to end, from loading data to visualization. For the finale, let’s widen the view one notch: when exactly does pandas start to struggle, and what is Polars, the next move you can reach for when it does? We’ll explore it mostly through side-by-side code comparisons. This isn’t a post for memorizing new syntax — it’s a post that puts a map in your hands showing that this option exists.

The moments when pandas struggles

Up to a few hundred thousand rows, pandas works well with barely a second thought. The trouble starts when the data grows. You typically hit three walls.

  • Row count: at millions to tens of millions of rows, a single groupby or merge starts taking tens of seconds.
  • Memory: pandas loads the entire file into memory. A 2 GB CSV balloons to several times that size in RAM, and your laptop greets you with a MemoryError.
  • Single core: most pandas operations use just one CPU core. On an 8-core machine, seven cores sit idle.

Design choices that were harmless on small data all come due the moment the data gets big.
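To see the memory wall coming before it hits, you can ask pandas how much RAM a DataFrame actually occupies. The sketch below is only an illustration: the city names and prices are made-up sample values, not the series' real sales.csv.

```python
import pandas as pd

# Made-up sample data (assumption): three repeated rows, 300,000 rows in total.
df = pd.DataFrame({
    "city": ["Seoul", "Busan", "Daegu"] * 100_000,
    "price": [12000, 9000, 15000] * 100_000,
})

# deep=True also counts the memory held by the string values themselves,
# not just the pointers to them.
per_column = df.memory_usage(deep=True)
total_mb = per_column.sum() / 1e6

print(per_column)
print(f"total: {total_mb:.1f} MB")
```

Running the same check on a sample of your own file gives a rough idea of how far the full dataset will stretch your RAM.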

Polars: the DataFrame rebuilt in Rust

Polars is a DataFrame library written in Rust. You use it from Python with a plain import polars, and its design targets exactly the three walls above.

  • Rust core: the heavy operations run as compiled native code.
  • Multicore: it uses every core in parallel with zero configuration.
  • Arrow memory format: it uses Apache Arrow, a standard columnar format, so memory efficiency is high and data can often move to other Arrow-aware tools without copying.

Installation is one line.

install polars
uv add polars

The same tasks, side by side

To see how similar the syntax is and where it differs, let’s give both libraries the same tasks on the sales data we’ve used all series, sales.csv (columns: city, category, price, quantity).

First, reading.

read: pandas
import pandas as pd

df = pd.read_csv("sales.csv")

read: polars
import polars as pl

df = pl.read_csv("sales.csv")

So far, nearly identical. Polars’s read_csv parses on multiple cores, so the same file feels noticeably faster. Next, filtering.

filter: pandas
expensive = df[df["price"] > 10000]

filter: polars
expensive = df.filter(pl.col("price") > 10000)

pandas puts a boolean mask inside square brackets, while Polars passes an expression, pl.col("price") > 10000, to the filter method. This difference is the heart of Polars syntax. Finally, a groupby aggregation.

groupby: pandas
result = (
    df[df["price"] > 10000]
    .groupby("city")["price"]
    .mean()
    .sort_values(ascending=False)
)

groupby: polars
result = (
    df.filter(pl.col("price") > 10000)
    .group_by("city")
    .agg(pl.col("price").mean())
    .sort("price", descending=True)
)

Read them through and the job is the same: keep only transactions priced above 10,000, compute the average price per city, and sort in descending order.

The expression API as a way of thinking

The difference the code above reveals fits in one sentence: pandas “keeps reshaping a DataFrame object through methods”, while Polars “assembles what to compute as expressions and hands them over in one go”.

pl.col("price").mean() computes nothing by itself. It is merely a computation plan — “the mean of the price column” — and it only executes when plugged into a slot like agg(). Because expressions are independent building blocks, they combine freely.

several expressions at once
result = df.group_by("city").agg(
    pl.col("price").mean().alias("avg_price"),
    pl.col("price").max().alias("max_price"),
    pl.col("quantity").sum().alias("total_qty"),
)

What you wrote as a dictionary in pandas, like agg({"price": ["mean", "max"], ...}), becomes a flat list of expressions in Polars. And Polars executes the expressions it receives in parallel internally.

Lazy mode: build a plan, run it once

Polars’s real weapon is lazy mode. All the code so far ran eagerly, line by line, the moment you typed it. Lazy mode is different. Start with scan_csv and operations aren’t executed immediately — they only stack up as a plan — until you call collect(), at which point the entire plan runs at once.

lazy mode
result = (
    pl.scan_csv("sales.csv")
    .filter(pl.col("price") > 10000)
    .group_by("city")
    .agg(pl.col("price").mean())
    .collect()
)

A grocery-shopping analogy: the eager style makes a separate trip to the store every time someone says “buy milk” or “buy eggs”. The lazy style writes the full list down first, plans the route, and makes one trip.

When you call collect(), Polars’s query optimizer reviews and rewrites the plan before anything runs. For the example above, it automatically applies optimizations like “we only ever use the city and price columns, so don’t read the others” (projection pushdown) and “push the price > 10000 filter down into the file-reading step so only the needed rows are loaded” (predicate pushdown). That’s one reason files that wouldn’t fit in memory as a whole can often still be processed in lazy mode.

Which one should you use?

It’s less a matter of picking one over the other and more about choosing the right default for each situation.

Criterion                             | pandas                     | Polars
Ecosystem, search results             | Overwhelmingly more        | Growing fast
Visualization and library integration | Mostly built around pandas | Possible via conversion
Speed and multicore                   | Single core                | Parallel by default
Data larger than memory               | Hard                       | Possible with lazy mode

For exploratory work under a few hundred thousand rows, and anything that ties into other libraries, pandas is still the comfortable choice. At millions of rows and up, or for pipelines you run repeatedly, Polars saves serious time. And the two are not enemies. One line of conversion moves you between them, so a combination that’s common in practice is doing the heavy processing in Polars and converting to pandas at the end for visualization.

moving between the two
df_pl = pl.from_pandas(df_pd)   # pandas -> polars
df_pd = df_pl.to_pandas()       # polars -> pandas

A map toward even bigger data

If you ever outgrow Polars too, there are just two keywords to remember for what comes next. Parquet is a columnar file format that can replace CSV, typically storing the same data in far less space and letting you read only the columns you need. DuckDB is an analytical database that runs SQL directly on files, and it has become a widely used choice for handling larger-than-memory data on a single laptop. Both connect naturally to Polars through the Arrow format, so the way of thinking you learned in this series carries straight over.

Closing the series

A one-line look back at each of the seven posts.

We’ve talked a lot about tools, but the essence of analysis is not the tool — it’s the question. “Which city’s sales are slipping?”, “Is this change a coincidence?” The question comes first; pandas or Polars is merely the means of fetching the answer. The loop you practiced in this series — load, explore, select, transform, summarize, plot — stays with you even when the tools change.

For what to learn next, I recommend two paths. If you want to harden recurring data work into code, the Python Automation series follows naturally; if you want to build muscle in the Python language itself, Modern Python Intermediate is the next step. When a good question finds you, the tools for answering it are now in your hands.
