Python Testing #6: Test Design — Good Tests and How to Read Coverage
Some teams have hundreds of tests and still ship bugs with every release. Coverage is above 90%, yet the bugs customers actually hit slip right past the test suite. Open up such a team’s test code and the same patterns appear: tests that call code without asserting anything, tests that transcribe the implementation line by line, tests that pass one day and fail the next. The series so far has taught the tools for writing tests; this post is about the design principles that determine the quality of your tests rather than their quantity — and about how to read coverage numbers.
The AAA pattern: the basic skeleton of a test
A good test function splits into three sections: Arrange, Act, Assert.

    def test_apply_coupon_reduces_total():
        # Arrange: set up the subject under test and its inputs
        cart = Cart()
        cart.add(Item("Keyboard", price=50_000))
        coupon = Coupon(discount_rate=0.1)

        # Act: perform the behavior under test, once
        total = cart.checkout(coupon)

        # Assert: check the result
        assert total == 45_000

The comments aren’t mandatory, but a test whose three sections are separated, even just by blank lines, is far easier to read when it fails. If the arrange section grows so long it buries the point, that’s a signal to extract it into a fixture (covered in part 2).
The “one assertion per test” principle comes along with this. What it really means is not “one assert line” but “one behavior.” If verifying the outcome of one behavior takes three assert lines, write three lines. What you should avoid is chaining two distinct behaviors into one test. If a single test verifies adding to the cart and then removing from it, a failure in the first check means the second behavior never even runs, and you get only half a failure report. Split it in two and each fails independently — the test name alone tells you which behavior broke.
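As a rough sketch of that split, the add-then-remove scenario can be written as two independent tests. The `Cart` below is a minimal hypothetical stand-in (not the cart from the earlier example), with an assumed `items` list and `add`/`remove` methods, just enough to make the tests runnable.

```python
class Cart:
    """Minimal hypothetical cart, just enough to run the tests below."""

    def __init__(self):
        self.items = []

    def add(self, name):
        self.items.append(name)

    def remove(self, name):
        self.items.remove(name)


def test_add_puts_item_in_cart():
    cart = Cart()

    cart.add("Keyboard")

    assert cart.items == ["Keyboard"]


def test_remove_takes_item_out_of_cart():
    cart = Cart()
    cart.add("Keyboard")  # arrange: removal needs an existing item

    cart.remove("Keyboard")

    assert cart.items == []
```

If `add` breaks, both tests fail, but the names still show that removal was exercised separately rather than silently skipped.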
Test behavior, not implementation
If your tests collapse in a heap every time you refactor, that’s a sign they’re coupled to the implementation.

    def test_get_user_uses_cache(mocker):
        service = UserService()
        spy = mocker.spy(service, "_load_from_db")

        service.get_user(1)
        service.get_user(1)

        spy.assert_called_once_with(1)  # asserts on internal method call count

Rename the internal _load_from_db method or change the caching strategy, and this test breaks even though the externally visible behavior is unchanged. Assert on observable outcomes and the test survives refactoring.

    def test_get_user_returns_same_user_for_same_id():
        service = UserService()

        first = service.get_user(1)
        second = service.get_user(1)

        assert first == second

The litmus test is simple: if changing the implementation leaves user-visible behavior the same, the test should still pass.
Test double terminology
In part 4 we used mock and monkeypatch as practical tools; pinning down the vocabulary now makes code-review conversations much faster. The umbrella term for anything standing in for a real object is a test double.
| Term | Role | Example |
|---|---|---|
| dummy | fills a signature, never actually used | a None passed for an argument |
| stub | returns a predetermined value | a fake response that always returns 200 |
| fake | a simplified real implementation | in-memory SQLite, a dict-backed store |
| mock | also verifies calls and their arguments | assert_called_once_with(...) |
The practical rule of thumb: when you can check the result (state) directly, stubs and fakes produce tests that break less. Save mock-style call verification for cases like “was the email sent” where there’s no other way to observe the outcome. Overuse call verification and you end up with the implementation-coupled tests we saw above.
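To make the rule of thumb concrete, here is a minimal sketch that uses a fake for the state check and a mock only for the email call. `InMemoryUserStore`, `SignupService`, and the `mailer.send(...)` signature are all hypothetical names invented for this illustration; only `unittest.mock.Mock` is a real standard-library API.

```python
from unittest.mock import Mock


class InMemoryUserStore:
    """Fake: a working, dict-backed stand-in for a real database store."""

    def __init__(self):
        self._users = {}

    def save(self, user_id, name):
        self._users[user_id] = name

    def get(self, user_id):
        return self._users.get(user_id)


class SignupService:
    """Hypothetical service under test; names are illustrative only."""

    def __init__(self, store, mailer):
        self._store = store
        self._mailer = mailer

    def register(self, user_id, name, email):
        self._store.save(user_id, name)
        self._mailer.send(email, "Welcome!")


def test_register_stores_user():
    # State check through the fake: no call verification needed.
    store = InMemoryUserStore()
    service = SignupService(store, mailer=Mock())

    service.register(1, "Ada", "ada@example.com")

    assert store.get(1) == "Ada"


def test_register_sends_welcome_email():
    # Sending mail leaves no state to inspect here, so verify the call.
    mailer = Mock()
    service = SignupService(InMemoryUserStore(), mailer)

    service.register(1, "Ada", "ada@example.com")

    mailer.send.assert_called_once_with("ada@example.com", "Welcome!")
```

The first test keeps passing if `SignupService` changes how or how often it talks to the store, as long as the user ends up saved. The second is deliberately tied to the `send` call because that call is the behavior.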
Flaky tests: tests that fail only sometimes
A test that passes on some runs and fails on others — with the same code — is called a flaky test. The cause is almost always one of three things.
| Cause | Typical symptom | Fix |
|---|---|---|
| Time dependency | fails just before midnight or at month-end | freeze time with monkeypatch or freezegun |
| Order dependency | passes alone, fails in the full run | remove global state, isolate with fixtures |
| External dependency | fails depending on network conditions | block with responses and fakes from part 5 |
To surface order dependencies early, the pytest-randomly plugin is useful: it shuffles test order on every run, so hidden order dependencies show up sooner. And the worst possible response is “it passes on re-run, so leave it.” Once you build the habit of ignoring red, you’ll ignore the signal from real bugs too. When you find a flaky test, fix the cause immediately — or if you can’t, quarantine it (skip + file an issue).
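For the time-dependency row, one possible sketch is shown below. Note that it uses a different technique from the monkeypatch and freezegun fixes named in the table: the current date is passed in as a parameter, so tests can pin it without patching anything. The function `is_last_day_of_month` is a hypothetical example, not code from this series.

```python
from datetime import date, timedelta
from typing import Optional


def is_last_day_of_month(today: Optional[date] = None) -> bool:
    """Return True if `today` is the final day of its month.

    Production callers omit `today` and get the real clock;
    tests pass a fixed date so the result never depends on when they run.
    """
    if today is None:
        today = date.today()
    return (today + timedelta(days=1)).month != today.month


def test_leap_day_is_last_day_of_february():
    assert is_last_day_of_month(date(2024, 2, 29))


def test_day_before_leap_day_is_not_last_day():
    assert not is_last_day_of_month(date(2024, 2, 28))


def test_december_31_is_last_day():
    # Year rollover: the next day's month is 1, which differs from 12.
    assert is_last_day_of_month(date(2023, 12, 31))
```

A version of these tests that called `date.today()` directly would pass on most days and fail at month-end, which is exactly the symptom in the table.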
Measuring coverage with pytest-cov
Coverage is the tool that measures which parts of your code the tests executed. In pytest you use the pytest-cov plugin, installed with pip install pytest-cov.

    pytest --cov=myapp --cov-report=term-missing

    Name              Stmts   Miss  Cover   Missing
    -----------------------------------------------
    myapp/cart.py        40      4    90%   52-55
    myapp/coupon.py      18      0   100%
    -----------------------------------------------
    TOTAL                58      4    93%

The key here isn’t the 93%; it’s the Missing column. Open up lines 52–55 and you discover facts like “the error-handling path has never been tested once.” Adding the --cov-branch option measures branches in addition to lines, and turning it on is recommended.
The trap in coverage numbers
100% line coverage is not 100% verification.

    def test_discount():
        apply_discount(10_000, rate=0.1)  # no assert

This test executes every line of the function, so coverage counts it, but it passes even when the result is wrong. Coverage measures only “did it run,” never “did you check.”
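For contrast, a version that actually checks the result might look like the sketch below. The body of `apply_discount` is an assumption made up for this example (the post only shows the call), so treat the rounding behavior as illustrative.

```python
def apply_discount(price: int, rate: float) -> int:
    # Assumed implementation for illustration: price reduced by `rate`,
    # rounded to a whole currency unit.
    return round(price * (1 - rate))


def test_discount_reduces_price_by_rate():
    discounted = apply_discount(10_000, rate=0.1)

    assert discounted == 9_000


def test_zero_rate_leaves_price_unchanged():
    assert apply_discount(10_000, rate=0.0) == 10_000
```

Both versions of the test produce the same coverage figure; only this one fails when the arithmetic is wrong.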
Branches and boundaries slip through too.

    def fee(age: int) -> int:
        discount = 0
        if age >= 65:
            discount = 2_000
        return 5_000 - discount

Test only fee(70) and every line executes, giving 100% line coverage. But the under-65 path was never verified. --cov-branch will catch that omission. Yet an off-by-one bug where the condition was mistakenly written as age > 65 only surfaces when you test the boundary values (64 and 65) directly.
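The boundary tests described above can be sketched as follows. `fee` is repeated from the text so the block runs on its own; `fee_off_by_one` is a hypothetical buggy variant added only to show which input tells the two apart.

```python
def fee(age: int) -> int:
    discount = 0
    if age >= 65:
        discount = 2_000
    return 5_000 - discount


def fee_off_by_one(age: int) -> int:
    # Hypothetical buggy variant: ">" where ">=" was intended.
    discount = 0
    if age > 65:
        discount = 2_000
    return 5_000 - discount


def test_fee_just_below_boundary_pays_full_price():
    assert fee(64) == 5_000


def test_fee_at_boundary_gets_discount():
    assert fee(65) == 3_000
```

`fee(70)` and `fee_off_by_one(70)` return the same value, so a test at 70 cannot distinguish them, with or without branch coverage. Only the check at exactly 65 would fail against the buggy variant.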
That’s why making the coverage number a team target backfires. The moment the number becomes a KPI, assert-less tests and getter-poking tests multiply: the number climbs while quality stays flat. Use coverage as a map for finding where tests haven’t reached, and keep any enforced threshold to a modest gate like “80% on newly added code.”
How far to go: start where value per cost is highest
Not all code deserves the same test density. The investment priority looks like this.
- Places that have had bugs: code that broke once breaks again. The habit with the highest return is writing a reproduction test first whenever you fix a bug
- Boundary values: zero, the empty list, maximums, the values just before and after a boundary
- Core domain logic: money calculations, permission checks — anywhere a mistake is expensive
Conversely, simple delegation code, behavior the framework already guarantees, and prototypes about to be thrown away come last. Tests are maintenance burden too; piling up low-value tests accumulates debt, not assets.
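As an illustration of the first priority in the list above, here is a sketch of a reproduction test. The scenario is entirely hypothetical: assume a bug report said that `average_price` crashed with `ZeroDivisionError` on an empty list. The test is written first, fails against the old code, and stays in the suite after the fix.

```python
def average_price(prices: list[int]) -> float:
    # Fixed version. The hypothetical original divided by len(prices)
    # unconditionally and raised ZeroDivisionError for [].
    if not prices:
        return 0.0
    return sum(prices) / len(prices)


def test_average_price_of_empty_list_is_zero():
    # Reproduces the hypothetical bug report; failed before the fix.
    assert average_price([]) == 0.0


def test_average_price_of_two_items():
    assert average_price([40_000, 60_000]) == 50_000
```

The empty list is also a boundary value, so this one test covers two of the three priorities at once.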
Recap
- Structure tests into three sections: Arrange, Act, Assert
- One behavior per test; be flexible about the number of assert lines
- Verify observable behavior, not implementation internals
- Distinguish test doubles — dummy, stub, fake, mock — and use mock’s call verification sparingly
- Flakiness comes from time, order, or external dependencies; don’t paper over it with re-runs
- Coverage is a map: read the Missing column and branches rather than chasing the number
- Invest first in past bug sites, boundary values, and core domain logic
In the next post (#7 CI integration) we’ll wrap up the series by building a pipeline that runs tests automatically in GitHub Actions and reports coverage.