Python Data Analysis #5 Grouping and Joining: groupby, pivot_table, merge

Programming Language Python pandas Data Analysis

Monday, May 25, 2026

6 min read

The questions you get asked in data analysis tend to converge on the same shape: “How much did each branch sell?”, “How did it change month over month?”, “What does it look like cross-tabulated by branch and month?” These are questions about summaries of groups, not individual rows, and the pandas tools that answer them are groupby and pivot_table. If the information you need is split across two tables, you first combine them with merge. We cleaned the data in #4 Transforming and missing data; in this post we group and join that data to answer the questions.

#1 Getting started
#2 Loading and exploring data
#3 Selecting and filtering
#4 Transforming and missing data
#5 Grouping and joining ← this post
#6 Visualization
#7 A taste of Polars (wrap-up)

Example data: sales by branch and month #

Let’s build a small sales table to use throughout this post. Figures are in millions; the Daegu branch has no March data yet.

example data

import pandas as pd

sales = pd.DataFrame({
    "branch": ["Seoul", "Seoul", "Seoul", "Busan", "Busan", "Busan", "Daegu", "Daegu"],
    "month":  [1, 2, 3, 1, 2, 3, 1, 2],
    "sales":  [120, 135, 150, 90, 95, 110, 70, 80],
})

The groupby mental model: split, apply, combine #

Internally, groupby works in three steps. It splits rows that share a key value, applies an aggregation function to each group, and combines the results into one table. The code is a single line, but it helps to picture these three steps in your head.

single-key groupby

sales.groupby("branch")["sales"].sum()
# branch
# Busan    295
# Daegu    150
# Seoul    405

Rows with the same "branch" value are grouped, and sum is applied to each group’s sales column. Pass a list of keys and you get a multi-key group, where the result index becomes a two-level MultiIndex of branch and month. If that shape is awkward, append .reset_index() to flatten it back into ordinary columns.

multi-key groupby

sales.groupby(["branch", "month"])["sales"].sum()
# branch  month
# Busan   1         90
#         2         95
# Daegu   1         70
# ...
# Seoul   3        150

agg: several statistics at once #

If you want the mean and the count alongside the sum, pass a list of function names to agg.

multi-stat aggregation with agg

sales.groupby("branch")["sales"].agg(["sum", "mean", "count"])
#         sum        mean  count
# branch
# Busan   295   98.333333      3
# Daegu   150   75.000000      2
# Seoul   405  135.000000      3

To name the result columns yourself, use the name=("column", "function") form, as in sales.groupby("branch").agg(total_sales=("sales", "sum"), avg_sales=("sales", "mean")). Since this produces column names that can go straight into a report, you will end up reaching for this form more often in practice.

pivot_table: the same thing as an Excel pivot table #

If you have used pivot tables in Excel, pivot_table is exactly that feature. You designate a row axis and a column axis to build a cross-tab.

pivot_table

sales.pivot_table(index="branch", columns="month", values="sales", aggfunc="sum")
# month       1      2      3
# branch
# Busan    90.0   95.0  110.0
# Daegu    70.0   80.0    NaN
# Seoul   120.0  135.0  150.0

The contents are the same as the multi-key groupby result; the difference is shape. groupby gives you a tall, vertical result, while pivot_table spreads the second key across the columns into a grid that humans read easily. The empty NaN in Daegu’s March cell jumps out at a glance. To fill blanks with zero, add the fill_value=0 argument. In short: use groupby for computation, pivot_table for cross-tabs meant for human eyes.

merge: combining two tables side by side #

Suppose there is a separate table of branch managers, and you want to attach a manager column to the sales table.

basic merge (inner join)

managers = pd.DataFrame({
    "branch": ["Seoul", "Busan", "Gwangju"],
    "manager": ["Kim", "Lee", "Park"],
})

pd.merge(sales, managers, on="branch")
#   branch  month  sales manager
# 0  Seoul      1    120     Kim
# 1  Seoul      2    135     Kim
# ...
# 5  Busan      3    110     Lee

merge is the same concept as SQL’s JOIN, and the default behavior is an inner join. Only keys present in both tables survive, so Daegu (sales table only) and Gwangju (managers table only) silently disappeared from the result. Eight rows became six. To preserve every row of the left table, use how="left". Change the call to pd.merge(sales, managers, on="branch", how="left") and Daegu’s two rows survive with NaN in the manager column. If the sales data is the baseline and the manager is supplementary information, "left" is the right choice.

Duplicate keys multiply rows. The real trap in merge is not rows disappearing but rows multiplying. If Seoul appears twice in the managers table, every Seoul sales row gets paired with both manager rows, and the result inflates — an 8-row sales table becomes 11 rows. Aggregate that combined table and Seoul’s sales count double. The tool that blocks this trap in code is the validate argument.

validating key relationships with validate

pd.merge(sales, managers, on="branch", how="left", validate="many_to_one")
# MergeError: Merge keys are not unique in right dataset; not a many-to-one merge

By declaring the expectation “many sales rows map to one manager row”, you get an error from pandas the moment that expectation breaks. An error is far better than a wrong number flowing all the way into a report.

concat: stacking same-shaped tables vertically #

If merge attaches columns side by side, concat stacks tables of the same format vertically. It’s what you use in the common situation where each month lives in its own file.

combining monthly files

from pathlib import Path

frames = [pd.read_csv(p) for p in sorted(Path("data").glob("sales_*.csv"))]
total = pd.concat(frames, ignore_index=True)

ignore_index=True discards each file’s own 0, 1, 2 index and renumbers from scratch. Collecting files in a folder with glob is the same consolidation flow we covered in Python Automation #2 Excel automation. That post moved cells directly with openpyxl, but if aggregation is the goal, stacking with pandas like this is far less code.

After any join, check the row count first #

Right after using merge or concat, make a habit of checking the row count, because joins are operations that fail silently. Whether rows vanish or double, you get a result with no error. A single assert line is enough. If the row count after a left join differs from the original, suspect duplicate keys; if the total after concat doesn’t add up, suspect a missing file.

row-count checks

assert len(merged) == len(sales)  # left join: left row count must be preserved
assert len(total) == sum(len(f) for f in frames)  # concat: must equal the sum before stacking

Wrap-up #

What this post covered:

groupby works in three steps — split, apply, combine — and agg computes several statistics at once
pivot_table builds the same cross-tab as an Excel pivot table; the computation itself is identical to groupby
merge corresponds to SQL JOIN: inner drops rows and duplicate keys multiply them. Declaring key relationships with validate keeps you safe
concat stacks same-shaped tables vertically, and right after any join you verify row counts with assert

In the next post (#6 Visualization), we turn the aggregates we built here into pictures — drawing bar and line charts with pandas’s plot method and matplotlib, including how to fix broken fonts when labeling charts in CJK languages.