Python Data Analysis #3: Selecting and Filtering — loc, iloc, and Boolean Indexing

7 min read

Once the data is loaded, the next task is fixed: picking out just the parts you need. Selections like “only this column,” “only the rows that match this condition,” or “both at once” make up half of real analysis code. In this post we go through pandas’ selection tools — column selection, loc, iloc, and boolean indexing — and finish with the SettingWithCopyWarning everyone runs into at least once. In #2 Loading and exploration we read a CSV and looked at the shape of the data. This time, so you can follow along from anywhere, we’ll build a small DataFrame in code and start from there.

example data
import pandas as pd

df = pd.DataFrame({
    "name": ["Kimchi Stew", "Bibimbap", "Bulgogi", "Naengmyeon", "Pasta", "Pizza"],
    "category": ["Korean", "Korean", "Korean", "Korean", "Western", "Western"],
    "price": [9000, 10000, 15000, 11000, 14000, 18000],
    "qty": [42, 35, 21, 18, 27, 30],
})
print(df)
#           name category  price  qty
# 0  Kimchi Stew   Korean   9000   42
# 1     Bibimbap   Korean  10000   35
# 2      Bulgogi   Korean  15000   21
# 3   Naengmyeon   Korean  11000   18
# 4        Pasta  Western  14000   27
# 5        Pizza  Western  18000   30

Column selection: one bracket vs. two #

The most basic operation is column selection. But whether you use one pair of brackets or two changes the type of the result.

Series vs DataFrame
df["price"]            # Series (1-dimensional)
df[["name", "price"]]  # DataFrame (2-dimensional table)

df["price"] returns a Series — a 1-D array with an index attached, on which you can call aggregations like .mean() directly. df[["price"]], on the other hand, is a DataFrame even with a single column, so the table shape is preserved. If you don’t know this difference, you’ll hit situations where you selected the same column but some code works and some doesn’t. To work with the values of one column, use one bracket; to keep the table shape, use two.

loc and iloc: labels and positions #

To pick rows, use loc and iloc. The distinction fits in one line: loc selects by label (index value), iloc by position (integer offset). For example, df.loc[2] and df.iloc[2] both return the Bulgogi row. They look the same right now because the index runs sequentially from 0, but after filtering, labels and positions drift apart.

after a filter, they differ
sub = df[df["category"] == "Western"]   # only rows with labels 4 and 5 remain
sub.loc[4]    # label 4: Pasta
sub.iloc[0]   # first row: Pasta
sub.loc[0]    # KeyError! there is no row with label 0

If you need “the nth row,” it’s iloc; if you need “the row whose index value is X,” it’s loc. With a comma, you can select rows and columns at the same time — rows before the comma, columns after.

selecting rows and columns together
df.loc[1:3, ["name", "price"]]   # rows with labels 1–3, name and price columns
df.iloc[1:3, 0:2]                # rows at positions 1–2, columns 0–1

There’s a trap here. A loc slice includes the end label, while an iloc slice excludes the end position, like a Python list. So df.loc[1:3] gives 3 rows and df.iloc[1:3] gives 2.

Boolean indexing: a condition becomes a mask #

This is the heart of this post. The pandas model for “only the rows that match a condition” is simple. Apply a comparison to a column and you get a Series of True/False values, called a mask.

a condition produces a mask
df["price"] >= 12000
# 0    False
# 1    False
# 2     True
# 3    False
# 4     True
# 5     True
# Name: price, dtype: bool

Put that mask inside the brackets — df[df["price"] >= 12000] — and you have a filter. Only the True rows remain: Bulgogi, Pasta, and Pizza. To combine conditions, use &, |, and ~ instead of and, or, and not, and each condition must be wrapped in parentheses.

combining conditions, with parentheses
df[(df["category"] == "Korean") & (df["price"] >= 12000)]  # and
df[(df["price"] <= 9000) | (df["qty"] >= 30)]              # or
df[~(df["category"] == "Western")]                         # not

The parentheses are mandatory because of operator precedence. & is evaluated before >=, so without parentheses the conditions group the wrong way and you get a TypeError. Use and and you get a ValueError: “The truth value of a Series is ambiguous.” These are the two most common errors in boolean indexing, so it’s worth making parentheses around every condition a habit.

Tools for sharpening conditions: isin, between, str.contains #

Frequently used condition patterns have dedicated methods.

isin, between, str.contains
df[df["category"].isin(["Korean", "Snack"])]   # is the value in this list?
df[df["price"].between(10000, 15000)]          # is it in this range? (inclusive)
df[df["name"].str.contains("Stew")]            # does the string contain this pattern?

isin is shorter and easier to read than chaining comparisons on the same column with |. between is a range condition that includes both endpoints, and str.contains checks for partial matches on string columns. All three return masks, so they slot straight into &, |, ~ combinations.

query: when readability matters #

As conditions get long, the parentheses and repeated df[...] make the expression hard to read. That’s where the query method comes in. Write the condition as a string, and you can use plain and and or.

the query method
df.query("price >= 12000 and category == 'Western'")
min_price = 10000
df.query("price >= @min_price and qty > 20")   # reference external variables with @

The same condition as boolean indexing would be df[(df["price"] >= 12000) & (df["category"] == "Western")]query is clearly easier on the eyes. The trade-off is that the condition is a string, so IDE autocompletion and type checking can’t reach it, and column names containing spaces have to be wrapped in backticks. A good balance: build your fundamentals on boolean indexing, and bring out query as a readability tool when a condition grows past three clauses.

SettingWithCopyWarning: views and copies #

Use selection and filtering long enough and you will meet this warning.

code that triggers the warning
cheap = df[df["price"] < 12000]
cheap["price"] = 0
# SettingWithCopyWarning: A value is trying to be set on a copy of a slice...

The root cause is that pandas doesn’t guarantee whether the result of df[...] is a view (a window onto the same data) or a copy of the original. If it’s a view, modifying cheap also changes df; if it’s a copy, only cheap changes. Which one you get depends on the situation, so the outcome of the assignment is unpredictable — the warning is a signal that “this assignment may or may not propagate to the original.” The fix splits in two based on your intent.

fix by intent
# If the goal is to modify the original: don't slice then assign — do it in one .loc
df.loc[df["price"] < 12000, "price"] = 0

# If the subset will be used as independent data: make copy() explicit from the start
cheap = df[df["price"] < 12000].copy()
cheap["price"] = 0   # no warning, df is untouched

For reference, starting with pandas 3.0, Copy-on-Write becomes the default behavior, and chained assignments like the one above are settled as simply never affecting the original — no warning needed. The ambiguity goes away, but the habit of “modify the original in a single .loc, take subsets with copy()” stays valid in every version.

Wrap-up #

What this post covered:

  • Column selection: one bracket gives a Series, two give a DataFrame
  • loc is labels, iloc is positions, and the two drift apart after filtering
  • A condition becomes a mask; combine with &, |, ~ and mandatory parentheses
  • isin, between, str.contains express common conditions concisely
  • When conditions get long, query is the readability alternative
  • SettingWithCopyWarning signals view/copy ambiguity; the fixes are a single .loc assignment or copy()

In the next post (#4 Transforming data and missing values) we move on to changing the data we’ve selected: creating new columns, apply vs. vectorized operations, and handling the missing values that make up half of real-world data work (dropna, fillna).

X