Iterables, generators, yield from
How for actually works. The iterable protocol, generator functions and expressions, delegation with yield from, and send/throw — all in one place.
At the end of Chapter 4 (collections and comprehensions) we glimpsed the generator expression (x for x in iter). This chapter covers that topic in full. We start with how for actually works, then cover user-defined iterables, generator functions, and yield from.
The generators in this chapter come back in two places. First, the @contextmanager from Chapter 10 context managers is in fact built on generators — we show that at the end of this chapter. Second, the async def / await of Chapter 14 asyncio intro follows the same pause/resume model.
What for in really is — the iterable protocol
for x in [1, 2, 3]:
    print(x)

What that loop is doing internally:
items = [1, 2, 3]
it = iter(items)          # 1) iterable → iterator
while True:
    try:
        x = next(it)      # 2) request the next value
    except StopIteration:
        break             # 3) stop when finished
    print(x)

Two key steps:
- iter(obj): get an iterator from an iterable (calls __iter__)
- next(it): request the next value (calls __next__); raises StopIteration when done
Iterable vs iterator #
The terms can be confusing, so let’s sort them out.
| Term | Definition | Methods | Examples |
|---|---|---|---|
| Iterable | anything iter() can be called on | __iter__ | list, dict, str, range, files, generators |
| Iterator | something that can give “the next value” | __next__ (and __iter__) | the result of iter([1,2,3]), generators |
Every iterator is also an iterable (it has an __iter__ that returns itself). The reverse isn’t true: a list is iterable but isn’t an iterator, so next(my_list) raises TypeError.
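The distinction above can be checked with nothing but built-ins. This is a minimal sketch; the variable names are illustrative only.

```python
numbers = [1, 2, 3]

# A list is iterable but not an iterator: next() rejects it.
try:
    next(numbers)
    list_is_iterator = True
except TypeError:
    list_is_iterator = False

# iter() on an iterable hands back a separate iterator object each time.
first = iter(numbers)
second = iter(numbers)
separate_iterators = first is not second

# iter() on an iterator returns the iterator itself.
returns_itself = iter(first) is first

# An iterator is single-use: once exhausted, it stays empty.
drained = list(first)        # consumes everything
after_drain = list(first)    # nothing left

print(list_is_iterator, separate_iterators, returns_itself)  # False True True
print(drained, after_drain)                                  # [1, 2, 3] []
```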
User-defined iterables — via a class #
class MyRange:
    def __init__(self, start: int, stop: int):
        self.start = start
        self.stop = stop

    def __iter__(self):
        return MyRangeIterator(self.start, self.stop)


class MyRangeIterator:
    def __init__(self, current: int, stop: int):
        self.current = current
        self.stop = stop

    def __iter__(self):
        return self

    def __next__(self):
        if self.current >= self.stop:
            raise StopIteration
        value = self.current
        self.current += 1
        return value


for x in MyRange(0, 3):
    print(x)
# 0, 1, 2

Two classes: the iterable and the iterator are separate. The iterable can be iterated multiple times, but once an iterator is exhausted, it’s done.
r = MyRange(0, 3)
list(r)   # [0, 1, 2]
list(r)   # [0, 1, 2] ← a fresh iterator each time

Generator functions: the same job in one function #
We can shrink those two classes above into a single function with yield in it.
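def my_range(start: int, stop: int):
    current = start
    while current < stop:
        yield current
        current += 1

for x in my_range(0, 3):
    print(x)
# 0, 1, 2

If the function body contains yield even once, calling the function returns a generator object instead of running the body and returning a value. It does the same job as the two classes above, with one difference: a generator object is itself an iterator, so it can be consumed only once. Call my_range again to get a fresh pass.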
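To see what "returns a generator object" means in practice, the sketch below inspects my_range with the standard inspect module. The variable names are illustrative only.

```python
import inspect
from collections.abc import Iterator

def my_range(start: int, stop: int):
    current = start
    while current < stop:
        yield current
        current += 1

g = my_range(0, 3)

is_gen_function = inspect.isgeneratorfunction(my_range)  # the def contains yield
is_gen_object = inspect.isgenerator(g)                   # the call returned a generator
is_iterator = isinstance(g, Iterator)                    # it has __iter__ and __next__
returns_self = iter(g) is g                              # like any iterator

first_pass = list(g)                # [0, 1, 2]
second_pass = list(g)               # [] because the same object is already exhausted
fresh_pass = list(my_range(0, 3))   # [0, 1, 2] from a new generator object

print(is_gen_function, is_gen_object, is_iterator, returns_self)  # True True True True
print(first_pass, second_pass, fresh_pass)
```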
How yield works
This is the most confusing part.
def gen():
    print("step 1")
    yield 1
    print("step 2")
    yield 2
    print("step 3")

g = gen()
# the function body has not yet executed!

g = gen() alone does not execute the function body. The first next(g) is what starts it.
print(next(g))
# step 1
# 1
print(next(g))
# step 2
# 2
print(next(g))
# step 3
# StopIteration ← raised once there are no more yields

The function pauses at each yield. The next next() call resumes from there. This interleaving of the function’s execution with the caller’s is the heart of generators. The same pause / resume model returns with await in Chapter 14 (asyncio intro).
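The pause / resume cycle can also be observed from the outside. As a small sketch, inspect.getgeneratorstate from the standard library reports which phase a generator is in; the states list is just an illustrative name.

```python
import inspect

def gen():
    yield 1
    yield 2

g = gen()
states = [inspect.getgeneratorstate(g)]       # created, body not started yet

next(g)
states.append(inspect.getgeneratorstate(g))   # paused at the first yield

next(g)
states.append(inspect.getgeneratorstate(g))   # paused at the second yield

try:
    next(g)                                   # the body runs off the end
except StopIteration:
    pass
states.append(inspect.getgeneratorstate(g))   # finished

print(states)
# ['GEN_CREATED', 'GEN_SUSPENDED', 'GEN_SUSPENDED', 'GEN_CLOSED']
```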
How does this differ from a generator expression? #
It does the same thing as the (x for x in iter) we saw in Chapter 4, but the function form is more expressive.
# Expressible on one line: the expression form fits
squares = (x ** 2 for x in range(10))

# Complex logic: the function form fits
def squares_evens_only():
    for x in range(10):
        if x % 2 != 0:
            continue
        yield x ** 2

The value of laziness: memory and speed #
The biggest value of generators is that they do not produce every value at once.

# list comprehension: creates a million values right away, uses lots of memory
squares_list = [x ** 2 for x in range(1_000_000)]

# generator: produces values on demand, uses almost no memory
squares_gen = (x ** 2 for x in range(1_000_000))
total = sum(squares_gen)   # done in one pass

Infinite sequences are possible #
from itertools import islice

def counter(start: int = 0):
    n = start
    while True:
        yield n
        n += 1

# Only use the first 5
first_five = list(islice(counter(), 5))
print(first_five)   # [0, 1, 2, 3, 4]

This is not possible with a list. A generator produces only as many values as are requested, so an infinite sequence is fine.
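To make "only as many values as requested" concrete, the sketch below adds a bookkeeping list to the counter. The produced list is an assumption added for illustration; it is not part of the original counter.

```python
from itertools import islice

produced = []   # records every value the generator actually computes

def counter(start: int = 0):
    n = start
    while True:
        produced.append(n)
        yield n
        n += 1

first_three = list(islice(counter(), 3))

print(first_three)   # [0, 1, 2]
print(produced)      # [0, 1, 2]
```

Although the loop inside counter is infinite, only three values were ever computed, because islice stopped asking after the third.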
Pipelines — chaining generators #
A data-processing pipeline where each stage is a generator is memory-efficient and makes the intent clear.
def read_lines(path: str):
    with open(path) as f:
        for line in f:
            yield line.rstrip()

def filter_errors(lines):
    for line in lines:
        if "ERROR" in line:
            yield line

def parse_timestamp(lines):
    for line in lines:
        ts, _, msg = line.partition(" ")
        yield (ts, msg)

# Combine and use
errors = parse_timestamp(filter_errors(read_lines("app.log")))
for ts, msg in errors:
    print(ts, msg)

Each stage processes one line at a time, so even a 100 GB file is never loaded into memory all at once.
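The one-line-at-a-time behavior can be traced without a real log file. In this sketch, emit_lines is a hypothetical stand-in for read_lines that yields from an in-memory list, and trace is an illustrative list that records what each stage does.

```python
trace = []   # order in which the stages touch the data

def emit_lines(lines):
    # Hypothetical stand-in for read_lines: no file involved.
    for number, line in enumerate(lines):
        trace.append(f"read {number}")
        yield line

def filter_errors(lines):
    for line in lines:
        if "ERROR" in line:
            trace.append("keep")
            yield line

def parse_timestamp(lines):
    for line in lines:
        ts, _, msg = line.partition(" ")
        yield (ts, msg)

sample = [
    "10:00:00 INFO service started",
    "10:00:05 ERROR disk full",
    "10:00:09 ERROR retry failed",
]

errors = parse_timestamp(filter_errors(emit_lines(sample)))
ran_on_construction = bool(trace)   # False: building the pipeline reads nothing

first = next(errors)
print(first)   # ('10:00:05', 'ERROR disk full')
print(trace)   # ['read 0', 'read 1', 'keep']
```

Asking for one result pulled exactly two lines through the pipeline; the third line has not been read yet.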
yield from — generator delegation
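Use yield from when you want to pass through every value from another iterable. The two versions below behave the same for plain iteration.

# Without yield from
def chain_two(a, b):
    for x in a:
        yield x
    for y in b:
        yield y

# With yield from
def chain_two(a, b):
    yield from a
    yield from b

Same job, but yield from is shorter and has two extra benefits: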
- send / throw are automatically delegated (covered below)
- You can receive the return value of the sub-generator
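The second benefit is easy to miss, so here is a minimal sketch. The average and report functions are hypothetical names chosen for illustration; the point is that a return inside the sub-generator becomes the value of the yield from expression.

```python
def average(values):
    total = 0
    count = 0
    for v in values:
        total += v
        count += 1
        yield v                # pass each value through to the consumer
    return total / count       # becomes the value of `yield from` (assumes non-empty input)

def report(values):
    mean = yield from average(values)   # delegates, then receives the return value
    yield f"mean={mean}"

result = list(report([2, 4, 6]))
print(result)   # [2, 4, 6, 'mean=4.0']
```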
Natural for tree / recursive traversal #
def flatten(items):
    for item in items:
        if isinstance(item, list):
            yield from flatten(item)
        else:
            yield item

result = list(flatten([1, [2, [3, [4]], 5]]))
print(result)   # [1, 2, 3, 4, 5]

A single yield from flatten(...) line expresses the recursion naturally.
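The same shape works for trees. The sketch below assumes a small hypothetical Node class (not from this chapter) and walks a binary tree in order.

```python
from collections.abc import Iterator
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    # Hypothetical binary-tree node used only for this illustration.
    value: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def in_order(node: Optional[Node]) -> Iterator[int]:
    if node is None:
        return                          # empty subtree: yield nothing
    yield from in_order(node.left)      # everything smaller first
    yield node.value
    yield from in_order(node.right)     # then everything larger

tree = Node(4, Node(2, Node(1), Node(3)), Node(6, Node(5)))
visited = list(in_order(tree))
print(visited)   # [1, 2, 3, 4, 5, 6]
```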
send, throw, close — coroutine features
Generators can also receive values. If you bind the result of yield to a variable, external code can push values in with send.
def echo():
    while True:
        received = yield
        print(f"received: {received}")

g = echo()
next(g)            # advance to the first yield (priming)
g.send("hello")    # received: hello
g.send("world")    # received: world

Asynchrony (Chapter 14, asyncio intro) and cooperative multitasking are built on top of this mechanism. That said, in regular code you almost never deal with send directly. Just be aware of the concept.
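A yield can both hand a value out and take one in. As a small sketch of that two-way traffic, the hypothetical running_average below returns the current average each time a new value is sent.

```python
def running_average():
    total = 0.0
    count = 0
    average = None
    while True:
        value = yield average   # hand back the current average, wait for the next value
        total += value
        count += 1
        average = total / count

avg = running_average()
next(avg)   # prime: run to the first yield (which yields None)

results = [avg.send(10), avg.send(20), avg.send(60)]
print(results)   # [10.0, 15.0, 30.0]
```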
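g.throw(ValueError("inject exception"))   # raises ValueError inside the generator, at the paused yield
g.close()                                 # terminates the generator (raises GeneratorExit inside it)

If the generator does not catch the exception, throw() propagates it back to the caller. Pass an exception instance; the older throw(type, value) form is deprecated since Python 3.12.

close() matters most for cleanup: if a generator holds a resource, put the cleanup code in try/finally and it runs when the generator is closed.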
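def read_lines(path):
    f = open(path)
    try:
        for line in f:
            yield line
    finally:
        f.close()

Even if you don’t iterate this generator to the end (you break out), the file is still closed: when the abandoned generator is garbage-collected, Python calls close() on it, which runs the finally block. The timing of garbage collection is not guaranteed on every Python implementation, so call close() explicitly when the cleanup must happen at a definite point.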
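The cleanup behavior can be demonstrated without touching the file system. In this sketch, numbers is a hypothetical generator and events is an illustrative list standing in for the open and close of a real resource.

```python
events = []

def numbers():
    events.append("open")          # stands in for acquiring a resource
    try:
        for n in range(100):
            yield n
    finally:
        events.append("cleanup")   # stands in for releasing it

g = numbers()
for n in g:
    if n == 2:
        break                      # stop early; g is left paused at a yield

before_close = list(events)        # finally has not run yet, g is still referenced
g.close()                          # raises GeneratorExit at the paused yield

print(before_close)   # ['open']
print(events)         # ['open', 'cleanup']
```

When the close should be tied to a block, wrapping the generator in contextlib.closing inside a with statement calls close() on exit.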
itertools — gem of the standard library
The tools you use a lot in data pipelines live in itertools.
from itertools import (
    count, cycle, repeat,                          # infinite
    islice,                                        # slicing
    chain,                                         # concatenation
    groupby,                                       # grouping
    accumulate,                                    # accumulation
    combinations, permutations, product,           # combinatorics
    starmap, filterfalse, dropwhile, takewhile,    # transform / filter
)

# First N
list(islice(count(), 5))          # [0, 1, 2, 3, 4]

# Chain multiple iterables
list(chain([1, 2], [3, 4]))       # [1, 2, 3, 4]

# Running total
list(accumulate([1, 2, 3, 4]))    # [1, 3, 6, 10]

# Grouping (input must already be sorted by the grouping key)
data = [("a", 1), ("a", 2), ("b", 3)]
for key, group in groupby(data, key=lambda x: x[0]):
    print(key, list(group))
# a [('a', 1), ('a', 2)]
# b [('b', 3)]

If you build data pipelines often, skimming this module once pays off for a long time.
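The sorting requirement for groupby deserves one concrete look, because breaking it fails silently. The sketch below feeds the same records in unsorted and sorted order; first is a hypothetical key helper.

```python
from itertools import groupby

data = [("a", 1), ("b", 3), ("a", 2)]   # the "a" records are not adjacent

def first(item):
    return item[0]

# groupby only merges consecutive items with equal keys.
unsorted_groups = [(key, list(group)) for key, group in groupby(data, key=first)]

# Sorting by the same key first brings equal keys together.
sorted_groups = [
    (key, list(group))
    for key, group in groupby(sorted(data, key=first), key=first)
]

print(unsorted_groups)
# [('a', [('a', 1)]), ('b', [('b', 3)]), ('a', [('a', 2)])]
print(sorted_groups)
# [('a', [('a', 1), ('a', 2)]), ('b', [('b', 3)])]
```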
Standard library collections use the same protocol #
The ABCs in collections.abc formalize this protocol. They are used in the same places as the Protocol types we saw in Chapter 9 (typing in earnest).
from collections.abc import Iterable, Iterator

def consume(items: Iterable[int]) -> int:
    total = 0
    for x in items:
        total += x
    return total

# list, tuple, set, generator, range, ... all pass
consume([1, 2, 3])
consume(range(10))
consume(x for x in [1, 2, 3])

Typing a function parameter as Iterable[T] accepts the widest range of arguments, which makes it the safest default when the function loops over the input only once. There’s no need to narrow to list[T]: whether the caller passes a generator or a set, it works the same.
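One caveat is worth a sketch: Iterable[T] also admits one-shot iterators such as generators, so a function that needs two passes over its input should materialize it first. The spread and spread_safe functions are hypothetical examples.

```python
from collections.abc import Iterable

def spread(items: Iterable[int]) -> int:
    # Two passes over the same argument: fine for a list, broken for a generator.
    return max(items) - min(items)

def spread_safe(items: Iterable[int]) -> int:
    values = list(items)   # materialize once so both passes see all the data
    return max(values) - min(values)

data = [3, 9, 4]
from_list = spread(data)   # 6

try:
    spread(v for v in data)   # max() drains the generator, min() then finds nothing
    second_pass_failed = False
except ValueError:
    second_pass_failed = True

from_generator = spread_safe(v for v in data)   # 6

print(from_list, second_pass_failed, from_generator)   # 6 True 6
```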
@contextmanager is in fact a generator
Now you can see how the @contextmanager from Chapter 10 works.
import os
from contextlib import contextmanager

@contextmanager
def chdir(path):
    old = os.getcwd()
    os.chdir(path)
    try:
        yield path
    finally:
        os.chdir(old)

This function has a yield, so it’s a generator function. @contextmanager takes that generator and wraps it in an object whose __enter__ calls next() once, running the body up to the yield, and whose __exit__ either calls next() a second time or, if the with block raised, calls throw() to re-raise that exception at the yield. Context managers written this way are built directly on top of generators.
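To make the wrapping less abstract, here is a stripped-down sketch of such a wrapper. SimpleGeneratorContextManager and tracked are hypothetical names, and this is a simplification under stated assumptions, not the actual contextlib implementation, which handles more edge cases.

```python
class SimpleGeneratorContextManager:
    """Hypothetical, simplified stand-in for what @contextmanager builds."""

    def __init__(self, gen):
        self.gen = gen

    def __enter__(self):
        return next(self.gen)        # run up to the yield; its value feeds `as`

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:
            try:
                next(self.gen)       # resume after the yield; the body should finish
            except StopIteration:
                return False
            raise RuntimeError("generator didn't stop")
        try:
            self.gen.throw(exc)      # re-raise the with-block's exception at the yield
        except StopIteration:
            return True              # the generator handled it: suppress
        except BaseException as raised:
            if raised is exc:
                return False         # same exception came back out: let it propagate
            raise
        raise RuntimeError("generator didn't stop after throw()")

events = []

def tracked():
    events.append("enter")
    try:
        yield "resource"
    finally:
        events.append("exit")

with SimpleGeneratorContextManager(tracked()) as value:
    events.append(f"body got {value}")
normal_path = list(events)

events.clear()
try:
    with SimpleGeneratorContextManager(tracked()):
        raise KeyError("boom")
except KeyError:
    events.append("propagated")
error_path = list(events)

print(normal_path)   # ['enter', 'body got resource', 'exit']
print(error_path)    # ['enter', 'exit', 'propagated']
```

In both paths the finally block of the generator runs, which is why try/finally around the yield is the usual shape for these functions.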
Exercises #
- Write an infinite generator with signature def fibonacci() -> Iterator[int]:. Starting from the first two values 0 and 1, each yielded value is the sum of the previous two. After from itertools import islice, confirm that list(islice(fibonacci(), 10)) is [0, 1, 1, 2, 3, 5, 8, 13, 21, 34].
- Write def read_log_errors(path: str) -> Iterator[str]:. Open the file, read it line by line, and yield only lines containing “ERROR”. Use with open(...) inside the function to guarantee the file is closed. Write it so that memory usage stays constant even for a hypothetical 100 GB file.
- Write def flatten(items) recursively with yield from. Confirm that list(flatten([1, [2, [3, [4]], 5], 6])) is [1, 2, 3, 4, 5, 6]. Also write a version without yield from, using for ... yield ..., and compare line counts.
In one line:
for is sugar for iter() + next() + StopIteration. Iterable (__iter__) ⊃ iterator (__iter__ + __next__). One yield makes a generator function; each yield pauses execution. Laziness enables memory savings / infinite sequences / pipelines. yield from is natural for delegation and recursion. Type function arguments broadly as Iterable[T]. @contextmanager is sugar on top of generators.
Next chapter #
In Chapter 12 decorator patterns we cover the tool for wrapping functions and classes — every pattern of decorators. @contextmanager and @dataclass were both instances of that pattern.