Python Data Analysis #7 A Taste of Polars: Your Next Move When pandas Slows Down
This is the final post of the series. Over the past six posts we covered pandas end to end, from loading data to visualization. For the finale, let’s widen the view one notch: when exactly does pandas start to struggle, and what is Polars, the next move you can reach for when it does? We’ll explore it mostly through side-by-side code comparisons. This isn’t a post for memorizing new syntax — it’s a post that puts a map in your hands showing that this option exists.
The moments when pandas struggles #
Up to a few hundred thousand rows, pandas works well with barely a second thought. The trouble starts when the data grows. You typically hit three walls.
- Row count: at millions to tens of millions of rows, a single groupby or merge starts taking tens of seconds.
- Memory: pandas loads the entire file into memory. A 2 GB CSV balloons to several times that size in RAM, and your laptop greets you with a MemoryError.
- Single core: most pandas operations use just one CPU core. On an 8-core machine, seven cores sit idle.
Design choices that were harmless on small data all come due the moment the data gets big.
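One way to see the memory wall for yourself is to compare the size of a dataset as CSV text with the size pandas reports once it is loaded. The sketch below uses a small, made-up DataFrame; the sample values and the names csv_bytes and memory_bytes are assumptions for illustration, not part of the series data. The exact ratio depends on your pandas version and column types, so treat the output as something to measure rather than a rule.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for a much larger sales.csv.
n = 10_000
df = pd.DataFrame(
    {
        "city": ["Seoul", "Busan", "Daegu", "Incheon"] * (n // 4),
        "category": ["food", "toys"] * (n // 2),
        "price": list(range(n)),
        "quantity": [1, 2, 3, 4, 5] * (n // 5),
    }
)

# Size of the data as CSV text, in bytes.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_bytes = len(buffer.getvalue().encode("utf-8"))

# Size of the same data once loaded; deep=True also counts string contents.
memory_bytes = int(df.memory_usage(deep=True).sum())

print(f"CSV text:  {csv_bytes:,} bytes")
print(f"In memory: {memory_bytes:,} bytes")
```

Running the same two measurements on your own file is a quick way to estimate whether it will fit in RAM before you commit to loading it.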
Polars: the DataFrame rebuilt in Rust #
Polars is a DataFrame library written in Rust. You use it from Python with a plain import polars, and its design targets exactly the three walls above.
- Rust core: the heavy operations run as compiled native code.
- Multicore: it uses every core in parallel with zero configuration.
- Arrow memory format: it follows Apache Arrow, a standard columnar format, so memory use is efficient and data can often move to other Arrow-aware tools without copying.
Installation is one line.
uv add polars

The same tasks, side by side #
To see how similar the syntax is and where it differs, let’s give both libraries the same tasks on the sales data we’ve used all series, sales.csv (columns: city, category, price, quantity).
First, reading.
# pandas
import pandas as pd
df = pd.read_csv("sales.csv")

# Polars
import polars as pl
df = pl.read_csv("sales.csv")

So far, nearly identical. Polars’s read_csv parses on multiple cores, so the same file often loads noticeably faster. Next, filtering.
# pandas
expensive = df[df["price"] > 10000]

# Polars
expensive = df.filter(pl.col("price") > 10000)

pandas puts a boolean mask inside square brackets, while Polars passes an expression, pl.col("price") > 10000, to the filter method. This difference is the heart of Polars syntax. Finally, a groupby aggregation.
# pandas
result = (
    df[df["price"] > 10000]
    .groupby("city")["price"]
    .mean()
    .sort_values(ascending=False)
)

# Polars
result = (
    df.filter(pl.col("price") > 10000)
    .group_by("city")
    .agg(pl.col("price").mean())
    .sort("price", descending=True)
)

Read them through and the job is the same: keep only transactions priced above 10,000, compute the average price per city, and sort in descending order.
The expression API as a way of thinking #
The difference the code above reveals fits in one sentence: pandas “keeps reshaping a DataFrame object through methods”, while Polars “assembles what to compute as expressions and hands them over in one go”.
pl.col("price").mean() computes nothing by itself. It is merely a computation plan — “the mean of the price column” — and it only executes when plugged into a slot like agg(). Because expressions are independent building blocks, they combine freely.
result = df.group_by("city").agg(
    pl.col("price").mean().alias("avg_price"),
    pl.col("price").max().alias("max_price"),
    pl.col("quantity").sum().alias("total_qty"),
)

What you wrote as a dictionary in pandas, like agg({"price": ["mean", "max"], ...}), becomes a flat list of expressions in Polars. And Polars executes the expressions it receives in parallel internally.
Lazy mode: build a plan, run it once #
Polars’s real weapon is lazy mode. All the code so far ran eagerly, line by line, the moment you typed it. Lazy mode is different. Start with scan_csv and operations aren’t executed immediately — they only stack up as a plan — until you call collect(), at which point the entire plan runs at once.
result = (
    pl.scan_csv("sales.csv")
    .filter(pl.col("price") > 10000)
    .group_by("city")
    .agg(pl.col("price").mean())
    .collect()
)

A grocery-shopping analogy: the eager style makes a separate trip to the store every time someone says “buy milk” or “buy eggs”. The lazy style writes the full list down first, plans the route, and makes one trip.
When you call collect(), Polars’s query optimizer reviews and rewrites the plan before running it. For the example above, it can automatically apply optimizations like “we only ever use the city and price columns, so don’t read the others” and “push the price > 10000 filter down into the file-reading step so only the needed rows are kept”. That’s a large part of why files that wouldn’t fit in memory if loaded whole can often still be processed in lazy mode.
Which one should you use? #
It’s less a matter of picking one over the other and more about choosing the right default for each situation.
| Criterion | pandas | Polars |
|---|---|---|
| Ecosystem, search results | Overwhelmingly more | Growing fast |
| Visualization and library integration | Mostly built around pandas | Possible via conversion |
| Speed and multicore | Mostly single core | Parallel by default |
| Data larger than memory | Hard | Possible with lazy mode |
For exploratory work under a few hundred thousand rows, and anything that ties into other libraries, pandas is still the comfortable choice. At millions of rows and up, or for pipelines you run repeatedly, Polars saves serious time. And the two are not enemies. One line of conversion moves you between them, so a combination that’s common in practice is doing the heavy processing in Polars and converting to pandas at the end for visualization.
df_pl = pl.from_pandas(df_pd)  # pandas -> polars
df_pd = df_pl.to_pandas()      # polars -> pandas

A map toward even bigger data #
If you ever outgrow Polars too, there are just two keywords to remember for what comes next. Parquet is a columnar file format that replaces CSV, typically storing the same data in far less space and letting you read only the columns you need. DuckDB is an analytical database that runs SQL directly on files, and it has become a popular choice for handling larger-than-memory data on a single laptop. Both connect naturally to Polars through the Arrow format, so the way of thinking you learned in this series carries straight over.
Closing the series #
A one-line look back at each of the seven posts.
- #1 Getting started: what data analysis is and how to set up the environment.
- #2 Loading and exploring data: opening data with read_csv and forming a first impression with head, info, and describe.
- #3 Selecting and filtering: pulling out exactly the rows and columns we wanted with loc and boolean masks.
- #4 Transforming and missing data: creating new columns and setting policies for empty values.
- #5 Grouping and joining: summarizing with groupby and combining tables with merge.
- #6 Visualization: turning numeric summaries into charts and confirming patterns by eye.
- #7 A taste of Polars: pandas’s limits and the tools beyond them.
We’ve talked a lot about tools, but the essence of analysis is not the tool — it’s the question. “Which city’s sales are slipping?”, “Is this change a coincidence?” The question comes first; pandas or Polars is merely the means of fetching the answer. The loop you practiced in this series — load, explore, select, transform, summarize, plot — stays with you even when the tools change.
For what to learn next, I recommend two paths. If you want to harden recurring data work into code, the Python Automation series follows naturally; if you want to build muscle in the Python language itself, Modern Python Intermediate is the next step. When a good question finds you, the tools for answering it are now in your hands.