Python Data Analysis #4: Transforming Data — New Columns, Dates, and Missing Values
Almost no dataset is ready for analysis straight out of the CSV. Strings have stray whitespace around them, dates are stored as strings, values are missing all over, and the same row appears twice. In real work, more than half of analysis time goes into this cleanup. The previous part covered picking out the rows and columns you want; this part covers turning what you picked into something analyzable. The example data has whitespace, missing values, and duplicates mixed in on purpose.
import pandas as pd
import numpy as np
df = pd.DataFrame({
"order_id": [1001, 1002, 1003, 1004, 1004],
"product": [" Keyboard", "Mouse", "USB Hub", "Monitor", "Monitor"],
"unit_price": [49000, 15000, None, 320000, 320000],
"qty": [2, 1, 3, 1, 1],
"order_date": ["2026-04-01", "2026-04-03", "2026-04-05", "2026-04-10", "2026-04-10"],
"coupon": ["SUMMER10", None, "SUMMER10", None, None],
})New columns: one line of arithmetic #
Let’s create an amount column by multiplying unit price and quantity. No for loop required.
df["amount"] = df["unit_price"] * df["qty"]
# 0 98000.0
# 1 15000.0
# 2 NaN
# 3 320000.0
# 4 320000.0df["unit_price"] * df["qty"] multiplies the entire columns at once. This is the mindset called vectorization. Instead of a Python loop walking row by row, pandas processes the whole column in one pass through its internal C code — the same one line works on a million rows. The key to getting comfortable with pandas is shifting your thinking from “how do I process each row” to “what operation do I apply between columns.” The row that had a missing value produced NaN in the result, too. Missing values stay missing through arithmetic; we’ll deal with them below.
apply is a last resort #
Once a condition is involved, it’s tempting to reach for apply with a function.
def shipping(row):
if row["amount"] >= 100000:
return 0
return 3000
df["shipping"] = df.apply(shipping, axis=1) # called once per rowIt works, but apply calls a Python function for every row. Vectorization breaks down and it becomes a Python loop in practice: a million rows means a million function calls. The same logic as a vectorized operation looks like this.
df["shipping"] = np.where(df["amount"] >= 100000, 0, 3000)np.where(condition, value_if_true, value_if_false) is applied to the whole column at once, and as the data grows, the gap widens to tens or hundreds of times. The remaining cases for apply are complex logic that’s hard to express as vector operations, or situations where you must call an external library function per row. If it can be written as a vectorized operation, write it that way — keep apply for last.
String handling: the str accessor #
For string columns, attach .str to apply Python string methods across the entire column.
df["product"] = df["product"].str.strip() # trim surrounding whitespace
df["coupon"] = df["coupon"].str.replace("10", "") # substitution
df["product"].str.split(" ").str[0] # split, then first piecestrip in particular gets used at least once in nearly every real dataset. “Keyboard” and " Keyboard" look identical to the eye but are different values, and if you don’t clean up the whitespace, the group aggregations in the next part will split into two groups.
Dates: to_datetime and the dt accessor #
Right now the order_date column is just strings. Convert it to a real date type (datetime64) with pd.to_datetime, and the .dt accessor gives you year, month, and weekday directly.
df["order_date"] = pd.to_datetime(df["order_date"])
df["weekday"] = df["order_date"].dt.day_name()
# 0 Wednesday
# 1 Friday
# 2 Sundaydt.year, dt.month, and dt.day work the same way. Subtracting dates gives you durations: df["days_elapsed"] = (pd.Timestamp("2026-04-15") - df["order_date"]).dt.days subtracts from a reference date and pulls out just the day counts with .dt.days, yielding 14, 12, 10, 5, 5.
Finding missing values: what NaN really is #
Empty values show up in pandas as NaN. It stands for Not a Number — originally a special float value from the floating-point standard. That’s why a single missing value in an integer column turns the whole column into floats, and why the amounts above displayed with decimal points like 98000.0. NaN also has one strange property: it isn’t even equal to itself — np.nan == np.nan is False. So you can’t find missing values with ==; you use the dedicated method isna.
df.isna().sum()
# unit_price 1
# coupon 3
# all other columns are 0isna() returns True/False for missingness, and sum() counts the Trues as 1s, giving you missing counts per column. It’s the first line to run on any new dataset.
Fill or drop: dropna and fillna #
Handling missing values comes down to two choices: drop them or fill them.
df.dropna() # drop every row with any missing value
df.dropna(subset=["unit_price"]) # drop only rows where unit_price is missing
df["coupon"] = df["coupon"].fillna("none") # fill missing values with a given valueWhich one to choose depends on what the empty cell means. A missing coupon carries the information “no coupon was used,” so filling it with “none” is correct. A missing unit price, on the other hand, means the value is unknown — fill it with 0 and the average price and total amounts all get distorted. Zero and missing are different values. Zero is the fact “the value is 0”; missing is the state “the value is unknown.” Filling with the mean is another approach you’ll see often, but mean imputation shrinks the apparent spread of the data, and the more values are missing, the more the distribution gets pulled toward the mean. Use it lightly only when the missing rate is low and a simple summary is the goal; otherwise, dropping is often the safer choice.
Changing types: astype #
Even if you want to turn unit_price back into integers, the conversion errors out as long as NaN remains — NaN itself is a float and can’t live in an integer column. Clean up the missing values first, then convert.
df = df.dropna(subset=["unit_price"])
df["unit_price"] = df["unit_price"].astype("int64") # float64 → int64
df["product"] = df["product"].astype("category") # categoricalFor string columns with a fixed set of possible values, converting to the categorical type like the last line is worthwhile. Instead of storing the same string over and over, pandas stores numbered tickets internally, which cuts memory significantly and speeds up group aggregations. On data with millions of rows and only a few dozen product names, the effect is unmistakable.
Removing duplicates: duplicated and drop_duplicates #
Order 1004 appeared twice. Check with duplicated and remove with drop_duplicates.
df.duplicated().sum() # 1
df = df.drop_duplicates()duplicated treats rows as duplicates when every column matches exactly. To compare on specific columns only, pass subset=["order_id"]; the default keeps the first occurrence, and keep="last" keeps the last instead. Sum up data without knowing about duplicates and your revenue gets inflated — make checking for them, along with the missing-value count, a habit early in every analysis.
Wrap-up #
What this part covered:
- Create new columns with one line of column arithmetic, thinking in vectorized operations instead of row loops
applycalls a Python function per row and is slow — keep it as a last resort- The
.straccessor applies trimming, substitution, and splitting across whole columns to_datetimebuilds a real date type, and.dtextracts year, month, weekday, and date differences- Count missing values with
isna().sum(), then choosedropnaorfillnabased on what the empty cell means - Tidy types with
astype, convert repeated strings to categoricals, and remove duplicate rows withdrop_duplicates
In the next part (#5 Grouping, aggregation, and joins) we move on to actually summarizing the cleaned data: per-group sums and means with groupby, plus stitching multiple tables together with merge and concat.