Python Data Analysis #3: Selecting and Filtering — loc, iloc, and Boolean Indexing
Once the data is loaded, the next task is fixed: picking out just the parts you need. Selections like “only this column,” “only the rows that match this condition,” or “both at once” make up half of real analysis code. In this post we go through pandas’ selection tools — column selection, loc, iloc, and boolean indexing — and finish with the SettingWithCopyWarning everyone runs into at least once. In #2 Loading and exploration we read a CSV and looked at the shape of the data. This time, so you can follow along from anywhere, we’ll build a small DataFrame in code and start from there.
import pandas as pd
df = pd.DataFrame({
"name": ["Kimchi Stew", "Bibimbap", "Bulgogi", "Naengmyeon", "Pasta", "Pizza"],
"category": ["Korean", "Korean", "Korean", "Korean", "Western", "Western"],
"price": [9000, 10000, 15000, 11000, 14000, 18000],
"qty": [42, 35, 21, 18, 27, 30],
})
print(df)
# name category price qty
# 0 Kimchi Stew Korean 9000 42
# 1 Bibimbap Korean 10000 35
# 2 Bulgogi Korean 15000 21
# 3 Naengmyeon Korean 11000 18
# 4 Pasta Western 14000 27
# 5 Pizza Western 18000 30Column selection: one bracket vs. two #
The most basic operation is column selection. But whether you use one pair of brackets or two changes the type of the result.
df["price"] # Series (1-dimensional)
df[["name", "price"]] # DataFrame (2-dimensional table)df["price"] returns a Series — a 1-D array with an index attached, on which you can call aggregations like .mean() directly. df[["price"]], on the other hand, is a DataFrame even with a single column, so the table shape is preserved. If you don’t know this difference, you’ll hit situations where you selected the same column but some code works and some doesn’t. To work with the values of one column, use one bracket; to keep the table shape, use two.
loc and iloc: labels and positions #
To pick rows, use loc and iloc. The distinction fits in one line: loc selects by label (index value), iloc by position (integer offset). For example, df.loc[2] and df.iloc[2] both return the Bulgogi row. They look the same right now because the index runs sequentially from 0, but after filtering, labels and positions drift apart.
sub = df[df["category"] == "Western"] # only rows with labels 4 and 5 remain
sub.loc[4] # label 4: Pasta
sub.iloc[0] # first row: Pasta
sub.loc[0] # KeyError! there is no row with label 0If you need “the nth row,” it’s iloc; if you need “the row whose index value is X,” it’s loc. With a comma, you can select rows and columns at the same time — rows before the comma, columns after.
df.loc[1:3, ["name", "price"]] # rows with labels 1–3, name and price columns
df.iloc[1:3, 0:2] # rows at positions 1–2, columns 0–1There’s a trap here. A loc slice includes the end label, while an iloc slice excludes the end position, like a Python list. So df.loc[1:3] gives 3 rows and df.iloc[1:3] gives 2.
Boolean indexing: a condition becomes a mask #
This is the heart of this post. The pandas model for “only the rows that match a condition” is simple. Apply a comparison to a column and you get a Series of True/False values, called a mask.
df["price"] >= 12000
# 0 False
# 1 False
# 2 True
# 3 False
# 4 True
# 5 True
# Name: price, dtype: boolPut that mask inside the brackets — df[df["price"] >= 12000] — and you have a filter. Only the True rows remain: Bulgogi, Pasta, and Pizza. To combine conditions, use &, |, and ~ instead of and, or, and not, and each condition must be wrapped in parentheses.
df[(df["category"] == "Korean") & (df["price"] >= 12000)] # and
df[(df["price"] <= 9000) | (df["qty"] >= 30)] # or
df[~(df["category"] == "Western")] # notThe parentheses are mandatory because of operator precedence. & is evaluated before >=, so without parentheses the conditions group the wrong way and you get a TypeError. Use and and you get a ValueError: “The truth value of a Series is ambiguous.” These are the two most common errors in boolean indexing, so it’s worth making parentheses around every condition a habit.
Tools for sharpening conditions: isin, between, str.contains #
Frequently used condition patterns have dedicated methods.
df[df["category"].isin(["Korean", "Snack"])] # is the value in this list?
df[df["price"].between(10000, 15000)] # is it in this range? (inclusive)
df[df["name"].str.contains("Stew")] # does the string contain this pattern?isin is shorter and easier to read than chaining comparisons on the same column with |. between is a range condition that includes both endpoints, and str.contains checks for partial matches on string columns. All three return masks, so they slot straight into &, |, ~ combinations.
query: when readability matters #
As conditions get long, the parentheses and repeated df[...] make the expression hard to read. That’s where the query method comes in. Write the condition as a string, and you can use plain and and or.
df.query("price >= 12000 and category == 'Western'")
min_price = 10000
df.query("price >= @min_price and qty > 20") # reference external variables with @The same condition as boolean indexing would be df[(df["price"] >= 12000) & (df["category"] == "Western")] — query is clearly easier on the eyes. The trade-off is that the condition is a string, so IDE autocompletion and type checking can’t reach it, and column names containing spaces have to be wrapped in backticks. A good balance: build your fundamentals on boolean indexing, and bring out query as a readability tool when a condition grows past three clauses.
SettingWithCopyWarning: views and copies #
Use selection and filtering long enough and you will meet this warning.
cheap = df[df["price"] < 12000]
cheap["price"] = 0
# SettingWithCopyWarning: A value is trying to be set on a copy of a slice...The root cause is that pandas doesn’t guarantee whether the result of df[...] is a view (a window onto the same data) or a copy of the original. If it’s a view, modifying cheap also changes df; if it’s a copy, only cheap changes. Which one you get depends on the situation, so the outcome of the assignment is unpredictable — the warning is a signal that “this assignment may or may not propagate to the original.” The fix splits in two based on your intent.
# If the goal is to modify the original: don't slice then assign — do it in one .loc
df.loc[df["price"] < 12000, "price"] = 0
# If the subset will be used as independent data: make copy() explicit from the start
cheap = df[df["price"] < 12000].copy()
cheap["price"] = 0 # no warning, df is untouchedFor reference, starting with pandas 3.0, Copy-on-Write becomes the default behavior, and chained assignments like the one above are settled as simply never affecting the original — no warning needed. The ambiguity goes away, but the habit of “modify the original in a single .loc, take subsets with copy()” stays valid in every version.
Wrap-up #
What this post covered:
- Column selection: one bracket gives a Series, two give a DataFrame
locis labels,ilocis positions, and the two drift apart after filtering- A condition becomes a mask; combine with
&,|,~and mandatory parentheses isin,between,str.containsexpress common conditions concisely- When conditions get long,
queryis the readability alternative - SettingWithCopyWarning signals view/copy ambiguity; the fixes are a single
.locassignment orcopy()
In the next post (#4 Transforming data and missing values) we move on to changing the data we’ve selected: creating new columns, apply vs. vectorized operations, and handling the missing values that make up half of real-world data work (dropna, fillna).