Python Data Analysis #5 Grouping and Joining: groupby, pivot_table, merge
The questions you get asked in data analysis tend to converge on the same shape: “How much did each branch sell?”, “How did it change month over month?”, “What does it look like cross-tabulated by branch and month?” These are questions about summaries of groups, not individual rows, and the pandas tools that answer them are groupby and pivot_table. If the information you need is split across two tables, you first combine them with merge. We cleaned the data in #4 Transforming and missing data; in this post we group and join that data to answer the questions.
- #1 Getting started
- #2 Loading and exploring data
- #3 Selecting and filtering
- #4 Transforming and missing data
- #5 Grouping and joining ← this post
- #6 Visualization
- #7 A taste of Polars (wrap-up)
Example data: sales by branch and month #
Let’s build a small sales table to use throughout this post. Figures are in millions; the Daegu branch has no March data yet.
import pandas as pd
sales = pd.DataFrame({
"branch": ["Seoul", "Seoul", "Seoul", "Busan", "Busan", "Busan", "Daegu", "Daegu"],
"month": [1, 2, 3, 1, 2, 3, 1, 2],
"sales": [120, 135, 150, 90, 95, 110, 70, 80],
})The groupby mental model: split, apply, combine #
Internally, groupby works in three steps. It splits rows that share a key value, applies an aggregation function to each group, and combines the results into one table. The code is a single line, but it helps to picture these three steps in your head.
sales.groupby("branch")["sales"].sum()
# branch
# Busan 295
# Daegu 150
# Seoul 405Rows with the same "branch" value are grouped, and sum is applied to each group’s sales column. Pass a list of keys and you get a multi-key group, where the result index becomes a two-level MultiIndex of branch and month. If that shape is awkward, append .reset_index() to flatten it back into ordinary columns.
sales.groupby(["branch", "month"])["sales"].sum()
# branch month
# Busan 1 90
# 2 95
# Daegu 1 70
# ...
# Seoul 3 150agg: several statistics at once #
If you want the mean and the count alongside the sum, pass a list of function names to agg.
sales.groupby("branch")["sales"].agg(["sum", "mean", "count"])
# sum mean count
# branch
# Busan 295 98.333333 3
# Daegu 150 75.000000 2
# Seoul 405 135.000000 3To name the result columns yourself, use the name=("column", "function") form, as in sales.groupby("branch").agg(total_sales=("sales", "sum"), avg_sales=("sales", "mean")). Since this produces column names that can go straight into a report, you will end up reaching for this form more often in practice.
pivot_table: the same thing as an Excel pivot table #
If you have used pivot tables in Excel, pivot_table is exactly that feature. You designate a row axis and a column axis to build a cross-tab.
sales.pivot_table(index="branch", columns="month", values="sales", aggfunc="sum")
# month 1 2 3
# branch
# Busan 90.0 95.0 110.0
# Daegu 70.0 80.0 NaN
# Seoul 120.0 135.0 150.0The contents are the same as the multi-key groupby result; the difference is shape. groupby gives you a tall, vertical result, while pivot_table spreads the second key across the columns into a grid that humans read easily. The empty NaN in Daegu’s March cell jumps out at a glance. To fill blanks with zero, add the fill_value=0 argument. In short: use groupby for computation, pivot_table for cross-tabs meant for human eyes.
merge: combining two tables side by side #
Suppose there is a separate table of branch managers, and you want to attach a manager column to the sales table.
managers = pd.DataFrame({
"branch": ["Seoul", "Busan", "Gwangju"],
"manager": ["Kim", "Lee", "Park"],
})
pd.merge(sales, managers, on="branch")
# branch month sales manager
# 0 Seoul 1 120 Kim
# 1 Seoul 2 135 Kim
# ...
# 5 Busan 3 110 Leemerge is the same concept as SQL’s JOIN, and the default behavior is an inner join. Only keys present in both tables survive, so Daegu (sales table only) and Gwangju (managers table only) silently disappeared from the result. Eight rows became six. To preserve every row of the left table, use how="left". Change the call to pd.merge(sales, managers, on="branch", how="left") and Daegu’s two rows survive with NaN in the manager column. If the sales data is the baseline and the manager is supplementary information, "left" is the right choice.
Duplicate keys multiply rows. The real trap in merge is not rows disappearing but rows multiplying. If Seoul appears twice in the managers table, every Seoul sales row gets paired with both manager rows, and the result inflates — an 8-row sales table becomes 11 rows. Aggregate that combined table and Seoul’s sales count double. The tool that blocks this trap in code is the validate argument.
pd.merge(sales, managers, on="branch", how="left", validate="many_to_one")
# MergeError: Merge keys are not unique in right dataset; not a many-to-one mergeBy declaring the expectation “many sales rows map to one manager row”, you get an error from pandas the moment that expectation breaks. An error is far better than a wrong number flowing all the way into a report.
concat: stacking same-shaped tables vertically #
If merge attaches columns side by side, concat stacks tables of the same format vertically. It’s what you use in the common situation where each month lives in its own file.
from pathlib import Path
frames = [pd.read_csv(p) for p in sorted(Path("data").glob("sales_*.csv"))]
total = pd.concat(frames, ignore_index=True)ignore_index=True discards each file’s own 0, 1, 2 index and renumbers from scratch. Collecting files in a folder with glob is the same consolidation flow we covered in Python Automation #2 Excel automation. That post moved cells directly with openpyxl, but if aggregation is the goal, stacking with pandas like this is far less code.
After any join, check the row count first #
Right after using merge or concat, make a habit of checking the row count, because joins are operations that fail silently. Whether rows vanish or double, you get a result with no error. A single assert line is enough. If the row count after a left join differs from the original, suspect duplicate keys; if the total after concat doesn’t add up, suspect a missing file.
assert len(merged) == len(sales) # left join: left row count must be preserved
assert len(total) == sum(len(f) for f in frames) # concat: must equal the sum before stackingWrap-up #
What this post covered:
groupbyworks in three steps — split, apply, combine — andaggcomputes several statistics at oncepivot_tablebuilds the same cross-tab as an Excel pivot table; the computation itself is identical togroupbymergecorresponds to SQL JOIN: inner drops rows and duplicate keys multiply them. Declaring key relationships withvalidatekeeps you safeconcatstacks same-shaped tables vertically, and right after any join you verify row counts withassert
In the next post (#6 Visualization), we turn the aggregates we built here into pictures — drawing bar and line charts with pandas’s plot method and matplotlib, including how to fix broken fonts when labeling charts in CJK languages.