Python Basics #17 — File Read/Write Vol. 1

Sunday, May 10, 2020

In this lesson we’ll cover reading and writing files.

When developing, you’ll often read data from files, write results to files, or log errors to files when something goes wrong.

File formats commonly handled in Python

TEXT
CSV
JSON
YAML
Excel
PDF
Image

The file formats you’ll meet most often in Python work are plain text, CSV, JSON, YAML, Excel, PDF, and image files. Since this is a basics course, we’ll cover only plain text files.

First, reading a file.

To read a file you must first open it. Use the built-in open function. open has many parameters; for now, remember three: file, mode, and encoding.

Python code

open(file='data.txt', mode='r', encoding='utf8')

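The file parameter takes a file path, either absolute or relative. A relative path is resolved against the current working directory (the folder you run the script from), not against the location of the Python file. If the file is in the same folder as the script and you run the script from that folder, the filename alone is enough.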
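To make the difference between relative and absolute paths concrete, here is a minimal sketch. It assumes a throwaway temporary folder and a hypothetical file name data.txt, and it changes the working directory only for the duration of the test:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    # Hypothetical file used only for this demonstration.
    path = os.path.join(folder, 'data.txt')
    with open(file=path, mode='w', encoding='utf8') as file:
        file.write('hello')

    original_cwd = os.getcwd()
    os.chdir(folder)  # make the temporary folder the working directory
    try:
        # A relative path is looked up in the current working directory.
        with open(file='data.txt', mode='r', encoding='utf8') as file:
            relative_result = file.read()
    finally:
        os.chdir(original_cwd)  # always restore the working directory

    # An absolute path works no matter what the working directory is.
    with open(file=path, mode='r', encoding='utf8') as file:
        absolute_result = file.read()

print(relative_result, absolute_result)
```

If a script has to work regardless of where it is started from, building an absolute path is usually the safer choice.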

Now the mode parameter options:

File read/write modes

Mode	Description
r	Read-only (default — can be omitted).
w	Write-only. Creates a new file if one with the same name doesn’t exist; otherwise overwrites. Overwriting deletes existing data, so be careful.
x	Write-only, but raises an error if a file with the same name already exists. Think of this as a safer `w`.
a	Append. Creates a new file if one with the same name doesn’t exist; otherwise appends to the end.
t	Text mode (default — can be omitted).
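b	Binary mode. Combine it with one of the modes above, for example rb or wb. Use this for non-text files like PDFs and images.
r+	Read and write. The file must already exist; existing data is kept and the cursor starts at the beginning.
w+	Read and write. Creates a new file if one with the same name doesn’t exist; otherwise empties the existing file first, so be careful.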

In this lesson we’ll only cover r, w, x, and a.
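As a quick preview of how w, a, and x differ, here is a small sketch. The file name memo.txt and the temporary folder are assumptions made for the demonstration:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'memo.txt')  # hypothetical file name

    # w: creates the file, or overwrites it if it already exists.
    with open(file=path, mode='w', encoding='utf8') as file:
        file.write('first\n')
    with open(file=path, mode='w', encoding='utf8') as file:
        file.write('second\n')  # 'first' is gone after this

    # a: keeps the existing data and adds to the end.
    with open(file=path, mode='a', encoding='utf8') as file:
        file.write('third\n')

    # x: refuses to touch a file that already exists.
    x_error = None
    try:
        with open(file=path, mode='x', encoding='utf8') as file:
            file.write('never written\n')
    except FileExistsError as error:
        x_error = error

    with open(file=path, mode='r', encoding='utf8') as file:
        content = file.read()

print(repr(content))
print(type(x_error).__name__)
```

The file ends up holding only the second and third lines, and the x attempt fails with FileExistsError instead of silently destroying data.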

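The default for the encoding parameter depends on your platform. It is taken from the system locale, so it is "cp932" on Japanese Windows, for example, and usually UTF-8 on macOS and Linux. To reliably read or write files containing non-ASCII characters (Korean, Chinese, Japanese, etc.), pass "utf8" explicitly.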
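The sketch below shows why the encoding matters. It writes a short text as UTF-8 and reads it back twice: once with the same encoding, and once with a mismatched one. Here latin-1 is only a stand-in for "whatever wrong default your platform might pick", and cities.txt is a hypothetical file name:

```python
import locale
import os
import tempfile

# The encoding that open normally falls back to when none is passed.
print('Default encoding here:', locale.getpreferredencoding(False))

text = '서울, 東京, 北京'

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'cities.txt')  # hypothetical file name

    with open(file=path, mode='w', encoding='utf8') as file:
        file.write(text)

    # Same encoding for reading: the text comes back intact.
    with open(file=path, mode='r', encoding='utf8') as file:
        right = file.read()

    # Mismatched encoding: no error here, but the characters are garbled.
    with open(file=path, mode='r', encoding='latin-1') as file:
        wrong = file.read()

print(right == text)
print(ascii(wrong))  # ascii() keeps the output printable on any console
```

With other mismatched encodings the read may raise UnicodeDecodeError instead of returning garbled text, so the exact failure depends on the pair of encodings involved.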

Save a text file country.txt containing the following data — country, capital, and region:

country.txt

South Korea,Seoul,Asia
Japan,Tokyo,Asia
China,Beijing,Asia
United Kingdom,London,Europe
France,Paris,Europe
Italy,Rome,Europe

First, check what open returns. Since the file is in the same folder as the script, only the filename is needed. Use mode r and encoding utf8:

Python code

file = open(file='country.txt', mode='r', encoding='utf8')
print(file)

Result

<_io.TextIOWrapper name='country.txt' mode='r' encoding='utf8'>

open returned a TextIOWrapper object. This object has many methods — let’s start with read.

Python code

file = open(file='country.txt', mode='r', encoding='utf8')
print(file.read())
file.close()

Result

South Korea,Seoul,Asia
Japan,Tokyo,Asia
China,Beijing,Asia
United Kingdom,London,Europe
France,Paris,Europe
Italy,Rome,Europe

The contents of country.txt were printed.

After opening a file with open, you should call close when you’re done with it. The file does close automatically when the program ends, but until then the program keeps holding it open. That ties up system resources, and on some platforms (notably Windows) it can prevent other programs from deleting or renaming the file. To avoid this, prefer a context manager instead.

Python code

with open(file='country.txt', mode='r', encoding='utf8') as file:
    print(file.read())  # body block
# file.close()  # not needed when using `with`

A context manager starts with the with keyword. The variable that holds the object goes after the as keyword. The indented block below is the body. When the body exits, whether normally or because of an error, the context manager closes the file automatically.

The file object has a closed attribute that tells you whether the file is closed. Let’s verify:

Python code

with open(file='country.txt', mode='r', encoding='utf8') as file:
    print('[Inside the with block]', file.closed)
    print(file.read())

print('[Outside the with block]', file.closed)

Result

[Inside the with block] False
South Korea,Seoul,Asia
Japan,Tokyo,Asia
China,Beijing,Asia
United Kingdom,London,Europe
France,Paris,Europe
Italy,Rome,Europe
[Outside the with block] True

Inside with, the file is open; outside, it’s closed.
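For comparison, here is roughly what you would have to write by hand to get the same guarantee without with. This is a sketch, not the exact internals of the context manager; the ValueError simulates something going wrong in the body:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'country.txt')
    with open(file=path, mode='w', encoding='utf8') as file:
        file.write('South Korea,Seoul,Asia')

    # Manual version: close in finally so it runs even after an error.
    manual = open(file=path, mode='r', encoding='utf8')
    try:
        manual_text = manual.read()
    finally:
        manual.close()

    # with version: the file is closed even though the body raises.
    try:
        with open(file=path, mode='r', encoding='utf8') as managed:
            raise ValueError('something went wrong')  # simulated failure
    except ValueError:
        pass

print(manual.closed, managed.closed)  # True True
```

Both files end up closed, but the with version needs fewer lines and is harder to get wrong.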

read accepts an optional integer parameter. When you pass n, at most n characters are read and returned.

Pass 4 to read just the first four characters:

Python code

with open(file='country.txt', mode='r', encoding='utf8') as file:
    print(file.read(4))

Result

Sout

Internally the file object uses a cursor. When the file opens, the cursor sits at the beginning. When you call read, the cursor moves forward by that many characters as it reads. After reading 4 characters, the cursor has moved 4 places. The next call to read continues from wherever the cursor stopped.

By chaining several read calls, you can read the words and commas of the first line one piece at a time:

Python code

with open(file='country.txt', mode='r', encoding='utf8') as file:
    print(file.read(11))
    print(file.read(1))
    print(file.read(5))
    print(file.read(1))
    print(file.read(4))

Result

South Korea
,
Seoul
,
Asia

Now what happens if you call read twice in a row? Will it print the entire file twice?

Python code

with open(file='country.txt', mode='r', encoding='utf8') as file:
    print(file.read())
    print(file.read())

Result

South Korea,Seoul,Asia
Japan,Tokyo,Asia
China,Beijing,Asia
United Kingdom,London,Europe
France,Paris,Europe
Italy,Rome,Europe

Only once. Why? After the first read, the cursor is at the end of the file. The second read has nothing left to read, so it returns an empty string, and print outputs only a blank line for it. To re-read the whole file, use seek. Calling seek(0) moves the cursor back to the beginning.

Python code

with open(file='country.txt', mode='r', encoding='utf8') as file:
    print(file.read())
    file.seek(0)
    print(file.read())

Result

South Korea,Seoul,Asia
Japan,Tokyo,Asia
China,Beijing,Asia
United Kingdom,London,Europe
France,Paris,Europe
Italy,Rome,Europe
South Korea,Seoul,Asia
Japan,Tokyo,Asia
China,Beijing,Asia
United Kingdom,London,Europe
France,Paris,Europe
Italy,Rome,Europe

The file is printed twice.
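The cursor behavior also makes it possible to walk through a file in fixed-size pieces: keep calling read with a size until it returns an empty string. A minimal sketch, assuming a chunk size of 8 characters and a shortened two-line copy of country.txt:

```python
import os
import tempfile

content = 'South Korea,Seoul,Asia\nJapan,Tokyo,Asia\n'

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'country.txt')
    with open(file=path, mode='w', encoding='utf8') as file:
        file.write(content)

    chunks = []
    with open(file=path, mode='r', encoding='utf8') as file:
        while True:
            chunk = file.read(8)  # at most 8 characters per call
            if chunk == '':       # empty string: the cursor reached the end
                break
            chunks.append(chunk)

print(chunks)
```

Joining the chunks back together gives the original text, because each call picks up exactly where the previous one stopped.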

Next, the readlines method. It reads each line into a list item and returns the entire file as one list:

Python code

with open(file='country.txt', mode='r', encoding='utf8') as file:
    lines = file.readlines()

print(lines)

Result

['South Korea,Seoul,Asia\n', 'Japan,Tokyo,Asia\n', 'China,Beijing,Asia\n', 'United Kingdom,London,Europe\n', 'France,Paris,Europe\n', 'Italy,Rome,Europe']

Each line is stored neatly in a list. The typical reason to do this is to iterate over it with a for loop. Let’s try:

Python code

with open(file='country.txt', mode='r', encoding='utf8') as file:
    lines = file.readlines()

for line in lines:
    print(line)

Result

South Korea,Seoul,Asia

Japan,Tokyo,Asia

China,Beijing,Asia

United Kingdom,London,Europe

France,Paris,Europe

Italy,Rome,Europe

Compared to the actual file content, there is now a blank line between the lines. That’s because each list item ends with a newline character and print adds another one, so every line ends with two newlines.

Fix this by passing an empty string to print’s end parameter so it doesn’t add a newline:

Python code

with open(file='country.txt', mode='r', encoding='utf8') as file:
    lines = file.readlines()

for line in lines:
    print(line, end='')

Result

South Korea,Seoul,Asia
Japan,Tokyo,Asia
China,Beijing,Asia
United Kingdom,London,Europe
France,Paris,Europe
Italy,Rome,Europe
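Another option is to remove the newline from the data itself rather than changing how print behaves. The sketch below shows two ways that should give the same list: rstrip on each line from readlines, and splitlines on the result of read. It uses a shortened copy of country.txt in a temporary folder:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'country.txt')
    with open(file=path, mode='w', encoding='utf8') as file:
        file.write('South Korea,Seoul,Asia\nJapan,Tokyo,Asia\nChina,Beijing,Asia')

    # Option 1: strip the trailing newline from every line.
    with open(file=path, mode='r', encoding='utf8') as file:
        stripped = [line.rstrip('\n') for line in file.readlines()]

    # Option 2: read everything, then split on line boundaries.
    with open(file=path, mode='r', encoding='utf8') as file:
        split = file.read().splitlines()

print(stripped)
print(split)
```

With clean lines like these, a plain print(line) no longer produces blank lines.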

Now let’s filter countries by region. First, split each line by comma and store in a list:

Python code

with open(file='country.txt', mode='r', encoding='utf8') as file:
    lines = file.readlines()

countries = []

for line in lines:
    countries.append(line.split(','))

print(countries)

Result

[['South Korea', 'Seoul', 'Asia\n'], ['Japan', 'Tokyo', 'Asia\n'], ['China', 'Beijing', 'Asia\n'], ['United Kingdom', 'London', 'Europe\n'], ['France', 'Paris', 'Europe\n'], ['Italy', 'Rome', 'Europe']]

Every region except the last still ends with a newline character. Use strip to remove it:

Python code

with open(file='country.txt', mode='r', encoding='utf8') as file:
    lines = file.readlines()

countries = []

for line in lines:
    split_items = [x.strip() for x in line.split(',')]
    countries.append(split_items)

print(countries)

Result

[['South Korea', 'Seoul', 'Asia'], ['Japan', 'Tokyo', 'Asia'], ['China', 'Beijing', 'Asia'], ['United Kingdom', 'London', 'Europe'], ['France', 'Paris', 'Europe'], ['Italy', 'Rome', 'Europe']]
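Since country.txt is comma-separated, the standard library’s csv module can do the splitting and newline handling for you. This is an alternative to the split and strip approach above, not a replacement for understanding it. The sketch uses a shortened copy of the file in a temporary folder:

```python
import csv
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'country.txt')
    with open(file=path, mode='w', encoding='utf8') as file:
        file.write('South Korea,Seoul,Asia\nJapan,Tokyo,Asia\nFrance,Paris,Europe')

    # The csv documentation recommends newline='' for files given to csv.reader.
    with open(file=path, mode='r', encoding='utf8', newline='') as file:
        countries = list(csv.reader(file))  # one list of fields per line

print(countries)
```

Unlike a plain split(','), csv.reader also copes with quoted fields that contain commas.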

Cleaned up. Now use if to filter only Asian countries:

Python code

with open(file='country.txt', mode='r', encoding='utf8') as file:
    lines = file.readlines()

countries = []

for line in lines:
    split_items = [x.strip() for x in line.split(',')]
    countries.append(split_items)

print('Asian countries:')
for country in countries:
    if country[2] == 'Asia':
        print('- {}'.format(country[0]))

Result

Asian countries:
- South Korea
- Japan
- China

Reuse the countries data to also filter European countries:

Python code

with open(file='country.txt', mode='r', encoding='utf8') as file:
    lines = file.readlines()

countries = []

for line in lines:
    split_items = [x.strip() for x in line.split(',')]
    countries.append(split_items)

print('Asian countries:')
for country in countries:
    if country[2] == 'Asia':
        print('- {}'.format(country[0]))

print('\nEuropean countries:')
for country in countries:
    if country[2] == 'Europe':
        print('- {}'.format(country[0]))

Result

Asian countries:
- South Korea
- Japan
- China

European countries:
- United Kingdom
- France
- Italy
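Writing one loop per region gets repetitive as the number of regions grows. One possible alternative is to group the countries by region in a dictionary first. The sketch below starts from the countries list printed earlier, and the name regions is just an illustrative choice:

```python
countries = [
    ['South Korea', 'Seoul', 'Asia'],
    ['Japan', 'Tokyo', 'Asia'],
    ['China', 'Beijing', 'Asia'],
    ['United Kingdom', 'London', 'Europe'],
    ['France', 'Paris', 'Europe'],
    ['Italy', 'Rome', 'Europe'],
]

# Map each region to the list of country names that belong to it.
regions = {}
for name, capital, region in countries:
    regions.setdefault(region, []).append(name)

for region, names in regions.items():
    print('{}:'.format(region))
    for name in names:
        print('- {}'.format(name))
```

A new region in the file then shows up in the output without any change to the code.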

There’s one problem: read and readlines load the entire file into memory at once. For a large file, that can mean gigabytes of memory spent just on reading it, and it can slow the program down as well. A program like that isn’t going to win any fans. read and readlines are fine for small files, but for large files use a different approach.

The file object can be used directly in a for loop, returning one line at a time. This way you can read the whole file without loading all the data into memory:

Python code

with open(file='country.txt', mode='r', encoding='utf8') as file:
    for line in file:
        print(line, end='')

Result

South Korea,Seoul,Asia
Japan,Tokyo,Asia
China,Beijing,Asia
United Kingdom,London,Europe
France,Paris,Europe
Italy,Rome,Europe
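The same line-by-line loop can do the filtering from earlier without ever building a list of all lines. A minimal sketch, where asian_countries is a hypothetical helper name and the file is a shortened copy of country.txt:

```python
import os
import tempfile


def asian_countries(path):
    """Return the names of countries whose region is Asia, reading line by line."""
    names = []
    with open(file=path, mode='r', encoding='utf8') as file:
        for line in file:
            if not line.strip():
                continue  # skip blank lines
            country, capital, region = [x.strip() for x in line.split(',')]
            if region == 'Asia':
                names.append(country)
    return names


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'country.txt')
    with open(file=path, mode='w', encoding='utf8') as file:
        file.write('South Korea,Seoul,Asia\nFrance,Paris,Europe\nJapan,Tokyo,Asia\n')

    result = asian_countries(path)

print(result)
```

Only the matching names are kept, so memory use depends on the size of the result rather than the size of the file.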

Let’s compare memory use and speed between read and iterating over the file object directly. I prepared a ~135 MB file for the test. Use the time module to measure execution time: subtract the start time from the end time to get the elapsed seconds.

First, test read:

Python code

import time

start = time.time()

with open(file='big_file.txt', mode='r', encoding='utf8') as file:
    text = file.read()
    for line in text.splitlines():
        print(line)

end = time.time()
time_took = end - start
print('Elapsed: {} sec'.format(time_took))

Result

line 1
line 2
...
line 9999998
line 9999999
Elapsed: 407.7338275909424 sec

read took about 407 seconds. While running, it consumed 824.4 MB of memory.

Now test using the file object directly in a for loop:

Python code

import time

start = time.time()

with open(file='big_file.txt', mode='r', encoding='utf8') as file:
    # text = file.read()
    for line in file:
        print(line, end='')

end = time.time()
time_took = end - start
print('Elapsed: {} sec'.format(time_took))

Result

line 1
line 2
...
line 9999998
line 9999999
Elapsed: 405.07602643966675 sec

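Iterating over the file object directly used much less memory than read, with similar (slightly faster) execution time. The two times are so close most likely because both tests spend the bulk of their time printing about ten million lines rather than reading the file.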
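If you want to reproduce the comparison on your own machine without a memory monitor, the standard library’s tracemalloc module can report the peak memory allocated by Python. The sketch below is a scaled-down variant, not the test used above: it generates a 200,000-line file in a temporary folder, counts lines instead of printing them, and uses time.perf_counter for timing. The function names are illustrative, and tracing slows the code down, so treat the times as rough.

```python
import os
import tempfile
import time
import tracemalloc


def count_with_read(path):
    with open(file=path, mode='r', encoding='utf8') as file:
        text = file.read()          # whole file in memory at once
    return len(text.splitlines())


def count_with_loop(path):
    count = 0
    with open(file=path, mode='r', encoding='utf8') as file:
        for line in file:           # one line at a time
            count += 1
    return count


def measure(func, path):
    """Return (result, elapsed seconds, peak bytes traced by tracemalloc)."""
    tracemalloc.start()
    start = time.perf_counter()
    result = func(path)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'big_file.txt')
    with open(file=path, mode='w', encoding='utf8') as file:
        for number in range(1, 200001):
            file.write('line {}\n'.format(number))

    read_result = measure(count_with_read, path)
    loop_result = measure(count_with_loop, path)

for label, (lines, elapsed, peak) in (('read', read_result), ('for loop', loop_result)):
    print('{}: {} lines, {:.3f} sec, peak {:.1f} KB'.format(label, lines, elapsed, peak / 1024))
```

The exact numbers vary by machine, but the peak for read should grow with the file size while the peak for the loop stays small.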

In the next lesson we’ll cover file writing. Thanks.