Python Data Analysis #2: Loading Data — CSV, Excel, and First Exploration

7 min read

In part 1 we set up the environment and saw what a DataFrame is. This time we cover the real starting point of any analysis: loading files. Tutorial datasets are always clean, but the files you receive at work are not. The encoding is different, the delimiter is a tab, the first two lines are a decorative title, and the numeric columns have commas mixed in. In this post we’ll learn how to read those files with pandas, and lock in the verification routine you should run the moment a file is loaded.

Analysis starts with someone else’s file #

The first input in most data analysis work is a file you didn’t create. An Excel sheet from a business partner, a CSV downloaded from an internal system, JSON returned by an API. These files carry their creator’s environment with them, so the reader has to adapt to the format. That’s why pandas’ reading functions take dozens of arguments. You don’t need to memorize them all — knowing the four you’ll run into most often solves the majority of cases.

read_csv basics #

This is the function you’ll use the most. With a well-formed CSV, one line is all it takes.

basic read
import pandas as pd

df = pd.read_csv("sales.csv")

The problem is CSVs that aren’t well formed. Let’s go through the frequently used arguments one by one.

encoding: the legacy codepage trap #

Files saved with “Save as CSV” from Excel on Windows are often not UTF-8 but a legacy system codepage — cp949 on Korean Windows (an EUC-KR extension), cp1252 in many Western locales, and so on. Read such a file as-is and this happens.

encoding error
df = pd.read_csv("sales.csv")
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 0 ...

When you see UnicodeDecodeError, a legacy-codepage file is almost always the culprit. The encoding argument fixes it.

read with cp949
df = pd.read_csv("sales.csv", encoding="cp949")

The opposite case exists too. If a UTF-8 file shows garbled non-ASCII text, try encoding="utf-8-sig". It handles UTF-8 files that carry an invisible marker called a BOM at the very beginning.

sep: when the delimiter isn’t a comma #

Files labeled as CSVs that are actually tab- or semicolon-separated are common. If you read a file and all the columns come out mashed into one, it’s a delimiter problem.

specifying the delimiter
df = pd.read_csv("sales.tsv", sep="\t")       # tab-separated
df = pd.read_csv("export.csv", sep=";")       # semicolon-separated

header: when the first line isn’t the header #

Files exported from internal systems sometimes have a title like “Sales Report, July 2026” in the first line. If the actual column names are on the third line, pass header=2. Note that counting starts from 0.

specifying the header position
df = pd.read_csv("report.csv", header=2)   # use the 3rd line as column names
df = pd.read_csv("raw.csv", header=None)   # file with no header at all

dtype: pinning a type up front #

Some columns — product codes, for example — “look like numbers but aren’t.” Read a code like 001234 naively and pandas converts it to the integer 1234, dropping the leading zeros. dtype prevents that.

specifying a type
df = pd.read_csv("sales.csv", dtype={"product_code": str})

read_excel: files with sheets #

Excel files are read with read_excel. The biggest difference from CSV is one extra dimension: sheets.

reading Excel
df = pd.read_excel("report.xlsx")                       # first sheet
df = pd.read_excel("report.xlsx", sheet_name="July")    # by name
df = pd.read_excel("report.xlsx", sheet_name=1)         # by position (from 0)

# All sheets at once: returns a dict keyed by sheet name
sheets = pd.read_excel("report.xlsx", sheet_name=None)

It requires the openpyxl package, so add it to the project from part 1 with uv add openpyxl. For the automation angle — reading, writing, and even formatting Excel files — there’s a separate write-up in part 2 of the automation series. In this series we only go as far as “bring it into a DataFrame for analysis.”

read_json and the clipboard #

JSON files saved from API responses are read with read_json.

reading JSON
df = pd.read_json("data.json")

And there’s one feature that gets used surprisingly often. Copy a table from Excel or a web page, run this, and the clipboard contents become a DataFrame.

reading from the clipboard
df = pd.read_clipboard()

It’s handy for quickly checking a small table that isn’t quite worth saving to a file.

The verification routine right after loading #

Successfully reading the file isn’t the end. The habit of checking the data’s state the moment it’s loaded is the single thing this series most wants to drive home. Let’s follow the flow with a small sales dataset.

sample data
df = pd.read_csv("sales.csv", encoding="cp949")
df.head()
head() output
         date  product   region  qty    price
0  2026-04-01  Keyboard   Seoul    3    45000
1  2026-04-01  Mouse      Busan    5    12000
2  2026-04-02  Monitor    Seoul    1  320,000
3  2026-04-02  Keyboard   Daegu    2    45000
4  2026-04-03  Mouse      Seoul  NaN    12000

head() shows the first 5 rows, tail() the last 5. Two issues are visible from the output alone: price has values with commas mixed in, and qty has a NaN (missing value). Next is info().

info()
df.info()
info() output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   date     12 non-null     object
 1   product  12 non-null     object
 2   region   12 non-null     object
 3   qty      11 non-null     float64
 4   price    12 non-null     object
dtype: memory usage: 608.0+ bytes

One info() call gives you three reads.

  • Row count: 12 rows total. Compare against the row count of the source file
  • Missing values: qty is 11 non-null, so 1 value is empty
  • dtype: each column’s type. This is where the warning signs show up

Check the distribution of numeric columns with describe().

describe()
df.describe()
describe() output
             qty
count  11.000000
mean    2.818182
std     1.401298
min     1.000000
25%     2.000000
50%     3.000000
75%     3.500000
max     5.000000

Scanning the mean, min, and max lets you catch outliers early — negative quantities, absurdly large values. To see just the row and column counts quickly, use shape.

shape
df.shape
# (12, 5)

The sign that a dtype is wrong #

The most important discovery in the info() output above is that price is object. A price should obviously be numeric, yet it was read as a string. And price is entirely absent from the describe() output — it was excluded from the statistics because it isn’t a numeric column.

The cause is the 320,000 we saw in the head output. If even one value in a column contains a comma, pandas reads the entire column as strings. Stray whitespace (" 45000") produces the same symptom. There are two fixes.

fix 1: handle it at read time
df = pd.read_csv("sales.csv", encoding="cp949", thousands=",")

thousands="," tells pandas to interpret commas as thousands separators. It resolves the issue at read time, which is the cleanest option. If you’ve already loaded the data, clean the strings and convert.

fix 2: convert after reading
df["price"] = df["price"].str.replace(",", "").astype(int)
df.info()
# ...
#  4   price    12 non-null     int64

price is now int64, and it’s included in describe() too. Either way, re-checking the result with info() after the conversion is part of the same move.

Saving: to_csv and to_excel #

To export a cleaned DataFrame to a file, use to_csv and to_excel. There’s one trap here.

saving
df.to_csv("sales_clean.csv")

Save it like this and the index — 0, 1, 2, ... — gets written as the file’s first column. Read that file back and you get a mystery column called Unnamed: 0. If a file you received from someone else has an Unnamed: 0 column, it’s almost certainly the residue of this mistake. Make index=False your default habit.

saving with index=False
df.to_csv("sales_clean.csv", index=False, encoding="utf-8-sig")
df.to_excel("sales_clean.xlsx", index=False, sheet_name="clean")

Saving with encoding="utf-8-sig" keeps non-ASCII text intact even when the file is opened by double-clicking in Excel on Windows. If you’re handing the result to Excel users, it’s safer to include this option as well.

Wrap-up #

The flow covered in this part:

  • Real-world data starts with messy files, and the reader adapts to the format
  • The four key read_csv arguments: encoding (the legacy codepage trap), sep, header, dtype
  • read_excel selects sheets with sheet_name, and sheet_name=None returns them all as a dict
  • The post-load routine: head() / tail()info() (row count, missing values, dtype) → describe()shape
  • If a column that should be numeric is object, commas or whitespace are the cause — fix with thousands="," or str.replace + astype
  • Save with index=False by default, and add utf-8-sig when handing files to Excel users

In the next part (#3 Selecting and filtering) we cover pulling out just the parts you need from loaded data: column selection, the difference between loc and iloc, and boolean indexing for filtering rows by condition.

X