Python — Generators

Tuesday, October 4, 2016

Following the previous post on Python — Decorators, this post covers Python’s generators.

If you’ve done any programming you probably know Python is an easy language to pick up. But there are still a few things most beginners say are hard to wrap their heads around — and one of them is the concept of generators and yield.

The dictionary definition of “generator” is “a machine that produces electricity” or “a person or thing that produces something.” In computer science, Wikipedia describes it like this:

A generator is a special function or routine used to control loop behavior, similar to an iterator. In fact, every generator is an iterator. A generator resembles a function that returns an array or list: it takes parameters and produces a sequence of values. Instead of building the entire array in memory and returning it all at once, it uses yield to return one value at a time, so it requires far less memory than a function that returns the whole list. Simply put, a generator is a function that acts like an iterator.
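The claim that every generator is an iterator can be checked directly. The following is a minimal sketch, assuming Python 3; count_up_to is a hypothetical generator function invented for this illustration.

```python
from collections.abc import Iterator, Generator


def count_up_to(limit):
    """Hypothetical generator function: yields 1, 2, ..., limit."""
    n = 1
    while n <= limit:
        yield n
        n += 1


gen = count_up_to(3)

# A generator object satisfies the iterator protocol.
is_iterator = isinstance(gen, Iterator)
is_generator = isinstance(gen, Generator)

# Calling iter() on an iterator returns the same object.
same_object = iter(gen) is gen

# A plain list is iterable, but it is not itself an iterator.
list_is_iterator = isinstance([1, 2, 3], Iterator)

print(is_iterator, is_generator, same_object, list_is_iterator)
```

The list case is included only for contrast: a list can be looped over, but it has to be handed to iter() first to get an iterator.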

When a regular function is called, execution starts at the first line and continues until it hits a return statement, an exception, or (if it doesn’t return) the last line, then returns control to the caller. All of the function’s inner functions and local variables disappear from memory. If the function is called again, everything starts fresh from the beginning.

But at some point, programmers wanted smarter functions: functions that don’t disappear after one call but instead remember where they left off, wait, and resume on the next call. That’s exactly what generators are. Because a generator produces values on demand instead of building them all up front, it can use far less memory than a regular function that returns a complete list. Enough preamble. Let’s see what they look like.
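As a rough preview of “remember where they left off,” here is a minimal sketch. The countdown function is a hypothetical example, and next(), which this post introduces further down, is used to request each value.

```python
def countdown(start):
    """Hypothetical generator: counts down from start to 1."""
    current = start          # local state that survives between requests
    while current > 0:
        yield current        # pause here and hand a value to the caller
        current -= 1         # execution resumes here on the next request


counter = countdown(3)

first = next(counter)    # runs the body until the first yield
second = next(counter)   # resumes after the yield, with current intact
third = next(counter)

print(first, second, third)
```

The local variable current is never reset between requests, which is the difference from a regular function that starts fresh on every call.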

Create a file called generator.py in any directory you like and save this code:

generator.py

def square_numbers(nums):
    result = []
    for i in nums:
        result.append(i * i)
    return result

my_nums = square_numbers([1, 2, 3, 4, 5])

print(my_nums)

This is a simple function that loops over the input list with for, builds a new list of i * i values, and returns it.

Open a terminal or command prompt, navigate to the directory containing generator.py, and run:

$ python generator.py
[1, 4, 9, 16, 25]

The new list is returned. Now let’s turn this into a generator.

generator.py

def square_numbers(nums):
    for i in nums:
        yield i * i

my_nums = square_numbers([1, 2, 3, 4, 5])  #1

print(my_nums)

$ python generator.py
<generator object square_numbers at 0x0000016B17E19EB0>

A generator object is returned. Generators don’t store all of their values in memory, which is why we don’t see a list like before. A generator yields one value at a time, when asked. In other words, at #1 above no computation has happened yet: the generator is just sitting there waiting for someone to ask for the next value. Let’s verify:

generator.py

def square_numbers(nums):
    for i in nums:
        yield i * i

my_nums = square_numbers([1, 2, 3, 4, 5])

print(next(my_nums))

$ python generator.py
1

Using next() we asked for the next value, and got 1. Let’s keep asking:

generator.py

def square_numbers(nums):
    for i in nums:
        yield i * i

my_nums = square_numbers([1, 2, 3, 4, 5])

print(next(my_nums))
print(next(my_nums))
print(next(my_nums))
print(next(my_nums))
print(next(my_nums))

$ python generator.py
1
4
9
16
25

Every value that the original list-returning version produced has now been printed. What if we call next() one more time?

generator.py

def square_numbers(nums):
    for i in nums:
        yield i * i

my_nums = square_numbers([1, 2, 3, 4, 5])

print(next(my_nums))
print(next(my_nums))
print(next(my_nums))
print(next(my_nums))
print(next(my_nums))
print(next(my_nums))

$ python generator.py
1
4
9
16
25
Traceback (most recent call last):
  File "generator.py", line 12, in <module>
    print(next(my_nums))
StopIteration

A StopIteration exception is raised — that’s the generator’s way of saying “I have nothing left.”
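When next() is called by hand, the exhausted case can be handled in two ways. The sketch below is one possible approach, not the only one: it reuses square_numbers from above with a shorter input list, and the names values, exhausted, and fallback are made up for the illustration.

```python
def square_numbers(nums):
    for i in nums:
        yield i * i


my_nums = square_numbers([1, 2])

values = [next(my_nums), next(my_nums)]

# Option 1: catch the exception explicitly.
try:
    next(my_nums)
    exhausted = False
except StopIteration:
    exhausted = True

# Option 2: pass a default, which next() returns instead of raising.
fallback = next(my_nums, None)

print(values, exhausted, fallback)
```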

Generators are usually consumed via a for loop. Let’s see that:

generator.py

def square_numbers(nums):
    for i in nums:
        yield i * i

my_nums = square_numbers([1, 2, 3, 4, 5])

for num in my_nums:
    print(num)

$ python generator.py
1
4
9
16
25

This time all values are printed and there’s no StopIteration — because the for loop knows when to stop.
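How the for loop “knows when to stop” can be sketched by writing the loop out by hand. This is an approximation of what the for statement does, and manual_for is a hypothetical helper written only for this illustration.

```python
def square_numbers(nums):
    for i in nums:
        yield i * i


def manual_for(iterable):
    """Hypothetical helper: collect items the way a for loop fetches them."""
    collected = []
    iterator = iter(iterable)        # a for loop first asks for an iterator
    while True:
        try:
            item = next(iterator)    # then requests items one by one
        except StopIteration:
            break                    # and exits quietly once the iterator is done
        collected.append(item)
    return collected


result = manual_for(square_numbers([1, 2, 3, 4, 5]))
print(result)
```

The StopIteration is still raised by the generator; the loop simply catches it and treats it as the signal to finish.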

Here’s one advantage of generators over regular functions: the code is simpler. Item #3 of The Zen of Python says “Simple is better than complex.” Right — given a choice, simpler code wins.

We can make this even shorter using a list comprehension:

generator.py

my_nums = [x*x for x in [1, 2, 3, 4, 5]]

print(my_nums)

for num in my_nums:
    print(num)

$ python generator.py
[1, 4, 9, 16, 25]
1
4
9
16
25

We get the same list as the very first example. With a tiny change we can turn this into a generator:

generator.py

my_nums = (x*x for x in [1, 2, 3, 4, 5])  #1

print(my_nums)

for num in my_nums:
    print(num)

$ python generator.py
<generator object <genexpr> at 0x1007c8f50>
1
4
9
16
25

Just by changing [] to () at #1, we get a generator. This form is called a generator expression. Easy.
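Two properties of this parenthesized form are worth a quick sketch: it can be passed straight to a function such as sum(), and it can be consumed only once. The variable names below are chosen for the illustration.

```python
nums = [1, 2, 3, 4, 5]

# When the generator is the only argument of a call,
# the extra pair of parentheses is optional.
total = sum(x * x for x in nums)

# A generator can be consumed only once.
squares = (x * x for x in nums)
first_total = sum(squares)    # consumes every value
second_total = sum(squares)   # already exhausted, so nothing is left to add

print(total, first_total, second_total)
```

If the values are needed more than once, the generator has to be created again or the results stored in a list.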

What if you want to look at all the generator’s data at once without a for loop? Just convert it to a list:

generator.py

my_nums = (x*x for x in [1, 2, 3, 4, 5])  # create generator

print(my_nums)
print(list(my_nums))  # convert generator to list

$ python generator.py
<generator object <genexpr> at 0x0000026FD7A99EB0>
[1, 4, 9, 16, 25]

Easy conversion to a list. One catch: once you convert to a list, you lose the generator’s main advantage, because every value is held in memory at once again. As mentioned, generators are cheap to create because they don’t store all results in memory. Let’s verify with a benchmark:

generator.py

from __future__ import division
import os
import psutil
import random
import time

names = ['Choi Yongho', 'Ji Giljeong', 'Jin Yeonguk', 'Kim Sehun', 'Oh Sehun', 'Kim Minu']
majors = ['Computer Science', 'Korean Literature', 'English Literature', 'Math', 'Politics']

process = psutil.Process(os.getpid())
mem_before = process.memory_info().rss / 1024 / 1024


def people_list(num_people):
    result = []
    for i in range(num_people):
        person = {
            'id': i,
            'name': random.choice(names),
            'major': random.choice(majors)
        }
        result.append(person)
    return result


def people_generator(num_people):
    for i in range(num_people):
        person = {
            'id': i,
            'name': random.choice(names),
            'major': random.choice(majors)
        }
        yield person

t1 = time.time()
people = people_list(1000000)  # 1 call people_list
t2 = time.time()
mem_after = process.memory_info().rss / 1024 / 1024
total_time = t2 - t1

print('Memory before: {} MB'.format(mem_before))
print('Memory after: {} MB'.format(mem_after))
print('Total time: {:.6f} sec'.format(total_time))

$ python generator.py
Memory before: 13.76171875 MB
Memory after: 284.30078125 MB
Total time: 1.215000 sec

At #1, we called people_list(1000000) to build a list of one million students. Memory grew from about 14 MB to about 284 MB, and the call took about 1.2 seconds. Let’s swap people_list(1000000) for people_generator(1000000) and benchmark the generator:

generator.py

from __future__ import division
import os
import psutil
import random
import time

names = ['Choi Yongho', 'Ji Giljeong', 'Jin Yeonguk', 'Kim Sehun', 'Oh Sehun', 'Kim Minu']
majors = ['Computer Science', 'Korean Literature', 'English Literature', 'Math', 'Politics']

process = psutil.Process(os.getpid())
mem_before = process.memory_info().rss / 1024 / 1024


def people_list(num_people):
    result = []
    for i in range(num_people):
        person = {
            'id': i,
            'name': random.choice(names),
            'major': random.choice(majors)
        }
        result.append(person)
    return result


def people_generator(num_people):
    for i in range(num_people):
        person = {
            'id': i,
            'name': random.choice(names),
            'major': random.choice(majors)
        }
        yield person

t1 = time.time()
people = people_generator(1000000)  # 1 call people_generator
t2 = time.time()
mem_after = process.memory_info().rss / 1024 / 1024
total_time = t2 - t1

print('Memory before: {} MB'.format(mem_before))
print('Memory after: {} MB'.format(mem_after))
print('Total time: {:.6f} sec'.format(total_time))

$ python generator.py
Memory before: 13.75390625 MB
Memory after: 13.7578125 MB
Total time: 0.000000 sec

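Memory did not grow, and the elapsed time was below the timer’s resolution (it printed as 0.000000 sec). That is because calling people_generator(1000000) only creates a generator object; none of the one million records has been built yet. So creating a generator object is both lighter on memory and faster than building a list object.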
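The near-zero time is easier to believe once you see that the function body does not run at all when the generator is created. Here is a minimal sketch; people_ids is a hypothetical stand-in for people_generator that records its own progress in an events list.

```python
events = []


def people_ids(num_people):
    """Hypothetical stand-in for people_generator that logs when it runs."""
    events.append('body started')
    for i in range(num_people):
        events.append('producing {}'.format(i))
        yield i


gen = people_ids(1000000)
events_after_call = list(events)     # still empty: the body has not started

first_id = next(gen)
events_after_next = list(events)     # the body ran only up to the first yield

print(events_after_call)
print(events_after_next)
```

Only the first next() triggers any work, and even then just enough to produce one value.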

But what about when we actually process the data? Let’s iterate over the list version with a for loop:

generator.py

from __future__ import division
import os
import psutil
import random
import time

names = ['Choi Yongho', 'Ji Giljeong', 'Jin Yeonguk', 'Kim Sehun', 'Oh Sehun', 'Kim Minu']
majors = ['Computer Science', 'Korean Literature', 'English Literature', 'Math', 'Politics']

process = psutil.Process(os.getpid())
mem_before = process.memory_info().rss / 1024 / 1024


def people_list(num_people):
    result = []
    for i in range(num_people):
        person = {
            'id': i,
            'name': random.choice(names),
            'major': random.choice(majors)
        }
        result.append(person)
    return result


def people_generator(num_people):
    for i in range(num_people):
        person = {
            'id': i,
            'name': random.choice(names),
            'major': random.choice(majors)
        }
        yield person

t1 = time.time()

people = people_list(1000000)

# iterate the list with a for loop
for p in people:
    print(p)

t2 = time.time()
mem_after = process.memory_info().rss / 1024 / 1024
total_time = t2 - t1

print('Memory before: {} MB'.format(mem_before))
print('Memory after: {} MB'.format(mem_after))
print('Total time: {:.6f} sec'.format(total_time))

$ python generator.py
{'id': 999998, 'name': 'Jin Yeonguk', 'major': 'English Literature'}
{'id': 999999, 'name': 'Jin Yeonguk', 'major': 'Computer Science'}
Memory before: 13.7578125 MB
Memory after: 285.84765625 MB
Total time: 97.907999 sec

Memory usage didn’t change further, but this took 97.9 seconds, most of which is likely spent printing one million lines to the terminal.

Now let’s iterate the generator version:

generator.py

from __future__ import division
import os
import psutil
import random
import time

names = ['Choi Yongho', 'Ji Giljeong', 'Jin Yeonguk', 'Kim Sehun', 'Oh Sehun', 'Kim Minu']
majors = ['Computer Science', 'Korean Literature', 'English Literature', 'Math', 'Politics']

process = psutil.Process(os.getpid())
mem_before = process.memory_info().rss / 1024 / 1024


def people_list(num_people):
    result = []
    for i in range(num_people):
        person = {
            'id': i,
            'name': random.choice(names),
            'major': random.choice(majors)
        }
        result.append(person)
    return result


def people_generator(num_people):
    for i in range(num_people):
        person = {
            'id': i,
            'name': random.choice(names),
            'major': random.choice(majors)
        }
        yield person

t1 = time.time()

people = people_generator(1000000)  # 1 call people_generator

# iterate the generator with a for loop
for p in people:
    print(p)

t2 = time.time()
mem_after = process.memory_info().rss / 1024 / 1024
total_time = t2 - t1

print('Memory before: {} MB'.format(mem_before))
print('Memory after: {} MB'.format(mem_after))
print('Total time: {:.6f} sec'.format(total_time))

$ python generator.py
{'id': 999997, 'name': 'Jin Yeonguk', 'major': 'Computer Science'}
{'id': 999998, 'name': 'Oh Sehun', 'major': 'Computer Science'}
{'id': 999999, 'name': 'Ji Giljeong', 'major': 'English Literature'}
Memory before: 13.76171875 MB
Memory after: 13.75390625 MB
Total time: 102.774121 sec

Again no memory growth, but the elapsed time was 102.7 seconds, about 5 seconds slower than the list version in this single run.
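Because printing a million lines probably dominates both timings, it may be worth re-measuring without any output. The sketch below is one way to do that with the standard timeit module; the trimmed-down records, the consume helper, and the size N are assumptions for the illustration, and the numbers will differ from machine to machine.

```python
import random
import timeit

names = ['Kim Sehun', 'Oh Sehun', 'Kim Minu']


def people_list(num_people):
    result = []
    for i in range(num_people):
        result.append({'id': i, 'name': random.choice(names)})
    return result


def people_generator(num_people):
    for i in range(num_people):
        yield {'id': i, 'name': random.choice(names)}


def consume(people):
    """Hypothetical helper: visit every record without printing; return the count."""
    count = 0
    for _ in people:
        count += 1
    return count


N = 100000  # assumed size, smaller than the post's one million

# Each call builds and consumes the data 3 times and returns the total seconds.
list_time = timeit.timeit(lambda: consume(people_list(N)), number=3)
gen_time = timeit.timeit(lambda: consume(people_generator(N)), number=3)

print('list:      {:.3f} sec'.format(list_time))
print('generator: {:.3f} sec'.format(gen_time))
```

No expected output is shown here on purpose, since the result depends on the interpreter and hardware.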

The takeaway: when you need to save memory, use a generator. When you need to save execution time more than memory, use a list.
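The memory side of this trade-off can also be checked without psutil. The following is a rough sketch, assuming Python 3 and its standard tracemalloc module; peak_bytes is a hypothetical helper and N is an arbitrary size picked for the illustration.

```python
import tracemalloc


def peak_bytes(func):
    """Hypothetical helper: peak traced allocation while func() runs."""
    tracemalloc.start()
    func()
    _, peak = tracemalloc.get_traced_memory()   # returns (current, peak)
    tracemalloc.stop()
    return peak


N = 100000  # assumed size for illustration

# Same computation twice: once through a full list, once through a generator.
list_peak = peak_bytes(lambda: sum([x * x for x in range(N)]))
gen_peak = peak_bytes(lambda: sum(x * x for x in range(N)))

print('list peak:      {} bytes'.format(list_peak))
print('generator peak: {} bytes'.format(gen_peak))
```

tracemalloc only counts allocations made by Python itself, so the figures are not directly comparable to the RSS values reported by psutil above.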

That said, when you work with much larger datasets or run many jobs in parallel, using limited resources efficiently usually matters more than saving a few seconds. In those cases generators are the right choice.
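For large inputs, generators are often chained so that each record flows through every step before the next one is read. Below is a minimal sketch of that idea; read_records, only_even_ids, and the “id,name” line format are all hypothetical, and the lines generator stands in for a large file.

```python
from itertools import islice


def read_records(lines):
    """Hypothetical parser: turn 'id,name' lines into dicts, one at a time."""
    for line in lines:
        record_id, name = line.strip().split(',')
        yield {'id': int(record_id), 'name': name}


def only_even_ids(records):
    """Hypothetical filter: pass through records whose id is even."""
    for record in records:
        if record['id'] % 2 == 0:
            yield record


# Stand-in for a large file; a real file object could be passed the same way.
lines = ('{},person{}\n'.format(i, i) for i in range(1000000))

pipeline = only_even_ids(read_records(lines))
first_three = list(islice(pipeline, 3))   # only the lines needed are processed

print(first_three)
```

Nothing here ever holds the full million records: asking for three results pulls only the first five lines through the pipeline.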

The next post covers Object-Oriented Programming (OOP).