Python Data Analysis #4: Transforming Data — New Columns, Dates, and Missing Values

7 min read

Almost no dataset is ready for analysis straight out of the CSV. Strings have stray whitespace around them, dates are stored as strings, values are missing all over, and the same row appears twice. In real work, more than half of analysis time goes into this cleanup. The previous part covered picking out the rows and columns you want; this part covers turning what you picked into something analyzable. The example data has whitespace, missing values, and duplicates mixed in on purpose.

example data
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "order_id": [1001, 1002, 1003, 1004, 1004],
    "product": ["  Keyboard", "Mouse", "USB Hub", "Monitor", "Monitor"],
    "unit_price": [49000, 15000, None, 320000, 320000],
    "qty": [2, 1, 3, 1, 1],
    "order_date": ["2026-04-01", "2026-04-03", "2026-04-05", "2026-04-10", "2026-04-10"],
    "coupon": ["SUMMER10", None, "SUMMER10", None, None],
})

New columns: one line of arithmetic #

Let’s create an amount column by multiplying unit price and quantity. No for loop required.

column arithmetic
df["amount"] = df["unit_price"] * df["qty"]
# 0     98000.0
# 1     15000.0
# 2         NaN
# 3    320000.0
# 4    320000.0

df["unit_price"] * df["qty"] multiplies the entire columns at once. This is the mindset called vectorization. Instead of a Python loop walking row by row, pandas processes the whole column in one pass through its internal C code — the same one line works on a million rows. The key to getting comfortable with pandas is shifting your thinking from “how do I process each row” to “what operation do I apply between columns.” The row that had a missing value produced NaN in the result, too. Missing values stay missing through arithmetic; we’ll deal with them below.

apply is a last resort #

Once a condition is involved, it’s tempting to reach for apply with a function.

apply (the slow way)
def shipping(row):
    if row["amount"] >= 100000:
        return 0
    return 3000

df["shipping"] = df.apply(shipping, axis=1)  # called once per row

It works, but apply calls a Python function for every row. Vectorization breaks down and it becomes a Python loop in practice: a million rows means a million function calls. The same logic as a vectorized operation looks like this.

np.where (the fast way)
df["shipping"] = np.where(df["amount"] >= 100000, 0, 3000)

np.where(condition, value_if_true, value_if_false) is applied to the whole column at once, and as the data grows, the gap widens to tens or hundreds of times. The remaining cases for apply are complex logic that’s hard to express as vector operations, or situations where you must call an external library function per row. If it can be written as a vectorized operation, write it that way — keep apply for last.

String handling: the str accessor #

For string columns, attach .str to apply Python string methods across the entire column.

the str accessor
df["product"] = df["product"].str.strip()             # trim surrounding whitespace
df["coupon"] = df["coupon"].str.replace("10", "")     # substitution
df["product"].str.split(" ").str[0]                   # split, then first piece

strip in particular gets used at least once in nearly every real dataset. “Keyboard” and " Keyboard" look identical to the eye but are different values, and if you don’t clean up the whitespace, the group aggregations in the next part will split into two groups.

Dates: to_datetime and the dt accessor #

Right now the order_date column is just strings. Convert it to a real date type (datetime64) with pd.to_datetime, and the .dt accessor gives you year, month, and weekday directly.

date conversion and the dt accessor
df["order_date"] = pd.to_datetime(df["order_date"])
df["weekday"] = df["order_date"].dt.day_name()
# 0    Wednesday
# 1       Friday
# 2       Sunday

dt.year, dt.month, and dt.day work the same way. Subtracting dates gives you durations: df["days_elapsed"] = (pd.Timestamp("2026-04-15") - df["order_date"]).dt.days subtracts from a reference date and pulls out just the day counts with .dt.days, yielding 14, 12, 10, 5, 5.

Finding missing values: what NaN really is #

Empty values show up in pandas as NaN. It stands for Not a Number — originally a special float value from the floating-point standard. That’s why a single missing value in an integer column turns the whole column into floats, and why the amounts above displayed with decimal points like 98000.0. NaN also has one strange property: it isn’t even equal to itself — np.nan == np.nan is False. So you can’t find missing values with ==; you use the dedicated method isna.

counting missing values
df.isna().sum()
# unit_price    1
# coupon        3
# all other columns are 0

isna() returns True/False for missingness, and sum() counts the Trues as 1s, giving you missing counts per column. It’s the first line to run on any new dataset.

Fill or drop: dropna and fillna #

Handling missing values comes down to two choices: drop them or fill them.

dropna and fillna
df.dropna()                        # drop every row with any missing value
df.dropna(subset=["unit_price"])   # drop only rows where unit_price is missing
df["coupon"] = df["coupon"].fillna("none")   # fill missing values with a given value

Which one to choose depends on what the empty cell means. A missing coupon carries the information “no coupon was used,” so filling it with “none” is correct. A missing unit price, on the other hand, means the value is unknown — fill it with 0 and the average price and total amounts all get distorted. Zero and missing are different values. Zero is the fact “the value is 0”; missing is the state “the value is unknown.” Filling with the mean is another approach you’ll see often, but mean imputation shrinks the apparent spread of the data, and the more values are missing, the more the distribution gets pulled toward the mean. Use it lightly only when the missing rate is low and a simple summary is the goal; otherwise, dropping is often the safer choice.

Changing types: astype #

Even if you want to turn unit_price back into integers, the conversion errors out as long as NaN remains — NaN itself is a float and can’t live in an integer column. Clean up the missing values first, then convert.

astype
df = df.dropna(subset=["unit_price"])
df["unit_price"] = df["unit_price"].astype("int64")   # float64 → int64
df["product"] = df["product"].astype("category")      # categorical

For string columns with a fixed set of possible values, converting to the categorical type like the last line is worthwhile. Instead of storing the same string over and over, pandas stores numbered tickets internally, which cuts memory significantly and speeds up group aggregations. On data with millions of rows and only a few dozen product names, the effect is unmistakable.

Removing duplicates: duplicated and drop_duplicates #

Order 1004 appeared twice. Check with duplicated and remove with drop_duplicates.

removing duplicates
df.duplicated().sum()        # 1
df = df.drop_duplicates()

duplicated treats rows as duplicates when every column matches exactly. To compare on specific columns only, pass subset=["order_id"]; the default keeps the first occurrence, and keep="last" keeps the last instead. Sum up data without knowing about duplicates and your revenue gets inflated — make checking for them, along with the missing-value count, a habit early in every analysis.

Wrap-up #

What this part covered:

  • Create new columns with one line of column arithmetic, thinking in vectorized operations instead of row loops
  • apply calls a Python function per row and is slow — keep it as a last resort
  • The .str accessor applies trimming, substitution, and splitting across whole columns
  • to_datetime builds a real date type, and .dt extracts year, month, weekday, and date differences
  • Count missing values with isna().sum(), then choose dropna or fillna based on what the empty cell means
  • Tidy types with astype, convert repeated strings to categoricals, and remove duplicate rows with drop_duplicates

In the next part (#5 Grouping, aggregation, and joins) we move on to actually summarizing the cleaned data: per-group sums and means with groupby, plus stitching multiple tables together with merge and concat.

X