Python Data Analysis #1: Getting Started with pandas — Notebooks and the DataFrame
If you open the same Excel file every month, apply the same filters, and build the same pivot table, that work can be moved into code. This series is for people who have handled data in Excel or SQL and want to learn how to do the same work in code with pandas, across seven posts. For Python syntax, the level covered in Modern Python Basics is more than enough.
- #1 Getting started with pandas: notebooks and the DataFrame ← this post
- #2 Loading data and first exploration
- #3 Selecting and filtering
- #4 Transforming data and missing values
- #5 Grouping, aggregation, and joins
- #6 Visualization
- #7 A taste of polars and wrap-up
What you gain by moving Excel work into code #
Excel is a great tool. But as analysis grows, you hit limits in three places — and these three are the motivation that carries the whole series.
- Reproducibility is hard. Results built by clicking leave no trace of the process, so a month later it’s hard to answer “where did this number come from?” Code is its own record of the process: run it again and you get the same result.
- Big data won’t open. An Excel sheet caps out at 1,048,576 rows, and well before that, scrolling and formula recalculation become painful. pandas handles millions of rows with the same code, as long as memory allows.
- Repetition is manual. If 100 files need the same cleanup, Excel means opening 100 files. In code, one loop does it.
Environment setup: installing pandas and Jupyter with uv #
In this post we’ll set up a working environment and build pandas’ two core structures — Series and DataFrame — by hand. The environment takes just one tool: uv. Initialize a project, add pandas and JupyterLab, then start the notebook server.
uv init data-analysis --python 3.14
cd data-analysis
uv add pandas jupyterlab
uv run jupyter lab
# When the browser opens, create a new Python 3 notebook from the launcher on the leftThe notebook way of working #
A .py script runs top to bottom in one shot, but a notebook splits code into small blocks called cells — run a cell and its result appears right below it. Variables survive after a cell finishes, so the next cell can keep using them. Data analysis is a loop of “load, look, try a change, check the result,” and a notebook fits that flow well: re-run only the cell you changed and inspect the output with your eyes. All code in this series is written as notebook cells.
Series and DataFrame #
You only need to know two structures in pandas.
- Series: a one-dimensional array of values. It corresponds to one column in Excel.
- DataFrame: a two-dimensional table made of multiple Series side by side. It corresponds to one Excel sheet.
That’s as far as the analogy safely goes, and there’s one difference. Every row in pandas carries a name tag called the index. An Excel row number is just a position, but a pandas index is part of the data: it follows each row through sorting and filtering, and if you use dates or IDs as the index, you can look up rows by name instead of position. For now, just remember that every row has a name tag.
Building your first DataFrame #
Let’s build one from the simplest ingredient: a dict. Keys become column names, value lists become column contents.
import pandas as pd
df = pd.DataFrame({
"item": ["Americano", "Latte", "Croissant", "Sandwich"],
"price": [4500, 5000, 3800, 6500],
"qty": [120, 95, 40, 25],
"category": ["Drink", "Drink", "Bakery", "Food"],
})
df item price qty category
0 Americano 4500 120 Drink
1 Latte 5000 95 Drink
2 Croissant 3800 40 Bakery
3 Sandwich 6500 25 FoodWrite just the variable name on the last line of a cell and the notebook renders it as a table — no print() needed. The 0–3 on the far left is the index we just talked about. If you don’t specify one, numbers starting from 0 are assigned automatically.
Column access and dtypes #
Pull out a single column and you get a Series.
df["price"]
# 0 4500
# 1 5000
# 2 3800
# 3 6500
# Name: price, dtype: int64Series can be combined in whole-column operations, like an Excel formula applied everywhere at once. No dragging a formula down row by row — a new column takes one line.
df["revenue"] = df["price"] * df["qty"]
dfA revenue column appears on the right of the table, with price × qty computed for every row. Check each column’s type with dtypes.
df.dtypes
# item object
# price int64
# qty int64
# category object
# revenue int64Numeric columns came out as int64, string columns as object. In real analysis, a column that should contain numbers often gets read in as object — and when that happens, dtypes is the first place you look.
The first three moves on any data: head, info, describe #
There are three things you run reflexively whenever you receive data you’ve never seen. Let’s look at just the output of describe() for a moment.
df.head() # preview the first 5 rows
df.info() # row count, per-column types, missing-value counts
df.describe() # summary stats for numeric columns: mean, min/max, etc. price qty revenue
count 4.00 4.00 4.00
mean 4950.00 70.00 332375.00
min 3800.00 25.00 152000.00
max 6500.00 120.00 540000.00On a four-row table this looks trivial, but with 100,000 rows, this one line gives you the overall shape of the data. How to properly read the output of all three moves is covered in #2 Loading data and first exploration with a real CSV file.
The notebook trap: cell execution order #
Notebooks have one trap. Cells take effect in the order you ran them, not the order they’re written. Edit and re-run a cell near the top, and the cells below still hold variables built from the old results; delete a cell, and the variables it created stay in memory. As these mismatches pile up, you end up with a notebook that “worked last time but breaks when reopened.” The prevention is simple: keep cells organized so they read top to bottom, and at the end of each working session, run Restart Kernel and Run All Cells to execute everything from scratch. Only a notebook that runs cleanly from top to bottom is a reproducible analysis. The “reproducibility” we listed as the first motivation doesn’t come for free just by using a notebook — it comes from this habit.
Wrap-up #
The foundation built in this post:
- The advantages of code over Excel: reproducibility, large data, automated repetition
- Environment setup with
uv add pandas jupyterlab+uv run jupyter lab - A Series is one column, a DataFrame is one table, and the index is each row’s name tag
- Building a DataFrame from a dict, column access, column arithmetic, and checking
dtypes - The first three moves — head / info / describe — and the cell execution order trap
In the next post (#2 Loading data and first exploration), instead of tables we build by hand, we cover reading real CSV and Excel files: the key options of read_csv, encoding problems, and the routine of sizing up unfamiliar data with head / info / describe.