Python Data Analysis #2: Loading Data — CSV, Excel, and First Exploration
In part 1 we set up the environment and saw what a DataFrame is. This time we cover the real starting point of any analysis: loading files. Tutorial datasets are always clean, but the files you receive at work are not. The encoding is different, the delimiter is a tab, the first two lines are a decorative title, and the numeric columns have commas mixed in. In this post we’ll learn how to read those files with pandas, and lock in the verification routine you should run the moment a file is loaded.
Analysis starts with someone else’s file #
The first input in most data analysis work is a file you didn’t create. An Excel sheet from a business partner, a CSV downloaded from an internal system, JSON returned by an API. These files carry their creator’s environment with them, so the reader has to adapt to the format. That’s why pandas’ reading functions take dozens of arguments. You don’t need to memorize them all — knowing the four you’ll run into most often solves the majority of cases.
read_csv basics #
This is the function you’ll use the most. With a well-formed CSV, one line is all it takes.
import pandas as pd
df = pd.read_csv("sales.csv")The problem is CSVs that aren’t well formed. Let’s go through the frequently used arguments one by one.
encoding: the legacy codepage trap #
Files saved with “Save as CSV” from Excel on Windows are often not UTF-8 but a legacy system codepage — cp949 on Korean Windows (an EUC-KR extension), cp1252 in many Western locales, and so on. Read such a file as-is and this happens.
df = pd.read_csv("sales.csv")
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 0 ...When you see UnicodeDecodeError, a legacy-codepage file is almost always the culprit. The encoding argument fixes it.
df = pd.read_csv("sales.csv", encoding="cp949")The opposite case exists too. If a UTF-8 file shows garbled non-ASCII text, try encoding="utf-8-sig". It handles UTF-8 files that carry an invisible marker called a BOM at the very beginning.
sep: when the delimiter isn’t a comma #
Files labeled as CSVs that are actually tab- or semicolon-separated are common. If you read a file and all the columns come out mashed into one, it’s a delimiter problem.
df = pd.read_csv("sales.tsv", sep="\t") # tab-separated
df = pd.read_csv("export.csv", sep=";") # semicolon-separatedheader: when the first line isn’t the header #
Files exported from internal systems sometimes have a title like “Sales Report, July 2026” in the first line. If the actual column names are on the third line, pass header=2. Note that counting starts from 0.
df = pd.read_csv("report.csv", header=2) # use the 3rd line as column names
df = pd.read_csv("raw.csv", header=None) # file with no header at alldtype: pinning a type up front #
Some columns — product codes, for example — “look like numbers but aren’t.” Read a code like 001234 naively and pandas converts it to the integer 1234, dropping the leading zeros. dtype prevents that.
df = pd.read_csv("sales.csv", dtype={"product_code": str})read_excel: files with sheets #
Excel files are read with read_excel. The biggest difference from CSV is one extra dimension: sheets.
df = pd.read_excel("report.xlsx") # first sheet
df = pd.read_excel("report.xlsx", sheet_name="July") # by name
df = pd.read_excel("report.xlsx", sheet_name=1) # by position (from 0)
# All sheets at once: returns a dict keyed by sheet name
sheets = pd.read_excel("report.xlsx", sheet_name=None)It requires the openpyxl package, so add it to the project from part 1 with uv add openpyxl. For the automation angle — reading, writing, and even formatting Excel files — there’s a separate write-up in part 2 of the automation series. In this series we only go as far as “bring it into a DataFrame for analysis.”
read_json and the clipboard #
JSON files saved from API responses are read with read_json.
df = pd.read_json("data.json")And there’s one feature that gets used surprisingly often. Copy a table from Excel or a web page, run this, and the clipboard contents become a DataFrame.
df = pd.read_clipboard()It’s handy for quickly checking a small table that isn’t quite worth saving to a file.
The verification routine right after loading #
Successfully reading the file isn’t the end. The habit of checking the data’s state the moment it’s loaded is the single thing this series most wants to drive home. Let’s follow the flow with a small sales dataset.
df = pd.read_csv("sales.csv", encoding="cp949")
df.head() date product region qty price
0 2026-04-01 Keyboard Seoul 3 45000
1 2026-04-01 Mouse Busan 5 12000
2 2026-04-02 Monitor Seoul 1 320,000
3 2026-04-02 Keyboard Daegu 2 45000
4 2026-04-03 Mouse Seoul NaN 12000head() shows the first 5 rows, tail() the last 5. Two issues are visible from the output alone: price has values with commas mixed in, and qty has a NaN (missing value). Next is info().
df.info()<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 12 non-null object
1 product 12 non-null object
2 region 12 non-null object
3 qty 11 non-null float64
4 price 12 non-null object
dtype: memory usage: 608.0+ bytesOne info() call gives you three reads.
- Row count: 12 rows total. Compare against the row count of the source file
- Missing values: qty is 11 non-null, so 1 value is empty
- dtype: each column’s type. This is where the warning signs show up
Check the distribution of numeric columns with describe().
df.describe() qty
count 11.000000
mean 2.818182
std 1.401298
min 1.000000
25% 2.000000
50% 3.000000
75% 3.500000
max 5.000000Scanning the mean, min, and max lets you catch outliers early — negative quantities, absurdly large values. To see just the row and column counts quickly, use shape.
df.shape
# (12, 5)The sign that a dtype is wrong #
The most important discovery in the info() output above is that price is object. A price should obviously be numeric, yet it was read as a string. And price is entirely absent from the describe() output — it was excluded from the statistics because it isn’t a numeric column.
The cause is the 320,000 we saw in the head output. If even one value in a column contains a comma, pandas reads the entire column as strings. Stray whitespace (" 45000") produces the same symptom. There are two fixes.
df = pd.read_csv("sales.csv", encoding="cp949", thousands=",")thousands="," tells pandas to interpret commas as thousands separators. It resolves the issue at read time, which is the cleanest option. If you’ve already loaded the data, clean the strings and convert.
df["price"] = df["price"].str.replace(",", "").astype(int)
df.info()
# ...
# 4 price 12 non-null int64price is now int64, and it’s included in describe() too. Either way, re-checking the result with info() after the conversion is part of the same move.
Saving: to_csv and to_excel #
To export a cleaned DataFrame to a file, use to_csv and to_excel. There’s one trap here.
df.to_csv("sales_clean.csv")Save it like this and the index — 0, 1, 2, ... — gets written as the file’s first column. Read that file back and you get a mystery column called Unnamed: 0. If a file you received from someone else has an Unnamed: 0 column, it’s almost certainly the residue of this mistake. Make index=False your default habit.
df.to_csv("sales_clean.csv", index=False, encoding="utf-8-sig")
df.to_excel("sales_clean.xlsx", index=False, sheet_name="clean")Saving with encoding="utf-8-sig" keeps non-ASCII text intact even when the file is opened by double-clicking in Excel on Windows. If you’re handing the result to Excel users, it’s safer to include this option as well.
Wrap-up #
The flow covered in this part:
- Real-world data starts with messy files, and the reader adapts to the format
- The four key
read_csvarguments:encoding(the legacy codepage trap),sep,header,dtype read_excelselects sheets withsheet_name, andsheet_name=Nonereturns them all as a dict- The post-load routine:
head()/tail()→info()(row count, missing values, dtype) →describe()→shape - If a column that should be numeric is object, commas or whitespace are the cause — fix with
thousands=","orstr.replace+astype - Save with
index=Falseby default, and addutf-8-sigwhen handing files to Excel users
In the next part (#3 Selecting and filtering) we cover pulling out just the parts you need from loaded data: column selection, the difference between loc and iloc, and boolean indexing for filtering rows by condition.