Python Testing #6: Test Design — Good Tests and How to Read Coverage


Some teams have hundreds of tests and still ship bugs with every release. Coverage is above 90%, yet the bugs customers actually hit slip right past the test suite. Open up such a team’s test code and the same patterns appear: tests that call code without asserting anything, tests that transcribe the implementation line by line, tests that pass one day and fail the next. The series so far has taught the tools for writing tests; this post is about the design principles that determine the quality of your tests rather than their quantity — and about how to read coverage numbers.

The AAA pattern: the basic skeleton of a test #

A good test function splits into three sections: Arrange, Act, Assert.

a test where AAA is visible
def test_apply_coupon_reduces_total():
    # Arrange: set up the subject under test and its inputs
    cart = Cart()
    cart.add(Item("Keyboard", price=50_000))
    coupon = Coupon(discount_rate=0.1)

    # Act: perform the behavior under test, once
    total = cart.checkout(coupon)

    # Assert: check the result
    assert total == 45_000

The comments aren’t mandatory, but a test whose three sections are separated even by blank lines is far easier to read when it fails. If the arrange section grows so long it buries the point, that’s a signal to extract it into a fixture from part 2.

The “one assertion per test” principle goes hand in hand with AAA. What it really means is not “one assert line” but “one behavior.” If verifying the outcome of one behavior takes three assert lines, write three lines. What you should avoid is chaining two distinct behaviors into one test. If a single test verifies adding to the cart and then removing from it, a failure in the first check stops the test before the second behavior runs, so you get only half a failure report. Split it in two and each fails independently, and the test name alone tells you which behavior broke.
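As a rough sketch of that split, here is what the two tests might look like. The Cart class below is a hypothetical stand-in with add, remove, and items methods, written only so the example runs; it is not the Cart from the checkout example above.

```python
class Cart:
    """Hypothetical minimal cart, just enough to make the tests runnable."""

    def __init__(self):
        self._items = []

    def add(self, name):
        self._items.append(name)

    def remove(self, name):
        self._items.remove(name)

    def items(self):
        # Return a copy so callers cannot mutate internal state
        return list(self._items)


def test_add_puts_item_in_cart():
    cart = Cart()

    cart.add("Keyboard")

    assert cart.items() == ["Keyboard"]


def test_remove_takes_item_out_of_cart():
    # add() is part of Arrange here: it is setup, not the behavior under test
    cart = Cart()
    cart.add("Keyboard")

    cart.remove("Keyboard")

    assert cart.items() == []
```

If add breaks, both tests go red, but the names still point at the behavior each one owns, and a bug in remove alone fails only the second test.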

Test behavior, not implementation #

If your tests collapse in a heap every time you refactor, that’s a sign they’re coupled to the implementation.

🚫 a test coupled to the implementation
def test_get_user_uses_cache(mocker):
    service = UserService()
    spy = mocker.spy(service, "_load_from_db")
    service.get_user(1)
    service.get_user(1)
    spy.assert_called_once_with(1)   # asserts on internal method call count

Rename the internal _load_from_db method or change the caching strategy, and this test breaks even though the externally visible behavior is unchanged. Assert on observable outcomes and the test survives refactoring.

✅ asserting on behavior
def test_get_user_returns_same_user_for_same_id():
    service = UserService()
    first = service.get_user(1)
    second = service.get_user(1)
    assert first == second

The litmus test is simple: if changing the implementation leaves user-visible behavior the same, the test should still pass.
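One way to apply the litmus test is to imagine two implementations and run the same check against both. The sketch below assumes two hypothetical service classes, one uncached and one cached with a renamed internal method; neither is the article's real UserService.

```python
class UncachedUserService:
    """Hypothetical: hits the 'database' on every call."""

    def _load_from_db(self, user_id):
        return {"id": user_id, "name": f"user-{user_id}"}

    def get_user(self, user_id):
        return self._load_from_db(user_id)


class CachedUserService:
    """Hypothetical refactor: internal method renamed, results cached."""

    def __init__(self):
        self._cache = {}

    def _fetch(self, user_id):
        return {"id": user_id, "name": f"user-{user_id}"}

    def get_user(self, user_id):
        if user_id not in self._cache:
            self._cache[user_id] = self._fetch(user_id)
        return self._cache[user_id]


def check_same_user_for_same_id(service):
    # Only the public get_user() is touched, so internals are free to change
    first = service.get_user(1)
    second = service.get_user(1)
    assert first == second


# The same behavioral check passes against both implementations
for service in (UncachedUserService(), CachedUserService()):
    check_same_user_for_same_id(service)
```

A spy on _load_from_db would fail against the second class even though callers see no difference.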

Test double terminology #

In part 4 we used mock and monkeypatch as practical tools; pinning down the vocabulary now makes code-review conversations much faster. The umbrella term for anything standing in for a real object is a test double.

| Term | Role | Example |
|------|------|---------|
| dummy | fills a signature, never actually used | a None passed for an argument |
| stub | returns a predetermined value | a fake response that always returns 200 |
| fake | a simplified real implementation | in-memory SQLite, a dict-backed store |
| mock | also verifies calls and their arguments | assert_called_once_with(...) |

The practical rule of thumb: when you can check the result (state) directly, stubs and fakes produce tests that break less. Save mock-style call verification for cases like “was the email sent” where there’s no other way to observe the outcome. Overuse call verification and you end up with the implementation-coupled tests we saw above.
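To make the rule of thumb concrete, here is a minimal sketch using only the standard library's unittest.mock. The register function, InMemoryUserStore, and the send_welcome method are all hypothetical names invented for this illustration.

```python
from unittest.mock import Mock


class InMemoryUserStore:
    """Fake: a dict-backed stand-in for a real database (hypothetical)."""

    def __init__(self):
        self._rows = {}

    def save(self, user_id, name):
        self._rows[user_id] = name

    def get(self, user_id):
        return self._rows.get(user_id)


def register(store, mailer, user_id, name):
    """Hypothetical code under test: store the user, then send a welcome mail."""
    store.save(user_id, name)
    mailer.send_welcome(name)


def test_register_stores_user():
    store = InMemoryUserStore()

    register(store, Mock(), 1, "Ada")  # the mailer is irrelevant to this test

    # State check against the fake: no knowledge of how save() was called
    assert store.get(1) == "Ada"


def test_register_sends_welcome_email():
    mailer = Mock()

    register(InMemoryUserStore(), mailer, 1, "Ada")

    # Call verification: the outgoing email has no other observable result
    mailer.send_welcome.assert_called_once_with("Ada")
```

The first test keeps passing if register switches to a different store method internally, as long as the user ends up retrievable; the second is deliberately tied to the call because the call is the outcome.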

Flaky tests: tests that fail only sometimes #

A test that passes on some runs and fails on others — with the same code — is called a flaky test. The cause is almost always one of three things.

| Cause | Typical symptom | Fix |
|-------|-----------------|-----|
| Time dependency | fails just before midnight or at month-end | freeze time with monkeypatch or freezegun |
| Order dependency | passes alone, fails in the full run | remove global state, isolate with fixtures |
| External dependency | fails depending on network conditions | block with responses and fakes from part 5 |

To surface order dependencies early, the pytest-randomly plugin is useful: it shuffles test order on every run, so hidden order dependencies show up sooner. And the worst possible response is “it passes on re-run, so leave it.” Once you build the habit of ignoring red, you’ll ignore the signal from real bugs too. When you find a flaky test, fix the cause immediately — or if you can’t, quarantine it (skip + file an issue).
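For the time-dependency row, here is one possible shape of a fix. Note that this sketch swaps in plain parameter injection instead of monkeypatch or freezegun: the hypothetical days_until_month_end function accepts the date as an argument, so the test never reads the real clock.

```python
from datetime import date


def days_until_month_end(today=None):
    """Hypothetical function: days remaining after `today` in its month.

    `today` is injectable; only production callers fall back to the real clock.
    """
    today = today or date.today()
    if today.month == 12:
        first_of_next = date(today.year + 1, 1, 1)
    else:
        first_of_next = date(today.year, today.month + 1, 1)
    return (first_of_next - today).days - 1


def test_days_until_month_end_mid_month():
    # A fixed date gives the same result on every run
    assert days_until_month_end(date(2024, 1, 15)) == 16


def test_days_until_month_end_on_last_day_of_year():
    # The month-end and year-end edge is pinned explicitly
    assert days_until_month_end(date(2024, 12, 31)) == 0
```

A test that called days_until_month_end() with no argument and compared against a hard-coded number would pass only on certain days, which is exactly the flaky pattern in the table.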

Measuring coverage with pytest-cov #

Coverage is the tool that measures which parts of your code the tests executed. In pytest you use the pytest-cov plugin, installed with pip install pytest-cov.

pytest --cov=myapp --cov-report=term-missing
sample output
Name              Stmts   Miss  Cover   Missing
-----------------------------------------------
myapp/cart.py        40      4    90%   52-55
myapp/coupon.py      18      0   100%
-----------------------------------------------
TOTAL                58      4    93%

The key here isn’t the 93%; it’s the Missing column. Open up lines 52–55 and you discover facts like “the error-handling path has never been tested once.” Adding the --cov-branch option measures branch coverage on top of line coverage, so a branch that was never taken gets reported even when every line ran. Turning it on is recommended.

The trap in coverage numbers #

100% line coverage is not 100% verification.

100% coverage, 0% verification
def test_discount():
    apply_discount(10_000, rate=0.1)   # no assert

This test executes every line of the function, so coverage counts it — but it passes even when the result is wrong. Coverage measures only “did it run,” never “did you check.”
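For contrast, a version that actually checks the result might look like the sketch below. The body of apply_discount is an assumption made up for this example, since the real implementation is not shown.

```python
def apply_discount(price, rate):
    """Hypothetical implementation, assumed only for this example."""
    return round(price * (1 - rate))


def test_discount_reduces_price_by_rate():
    result = apply_discount(10_000, rate=0.1)

    # Same lines executed as the assert-less test, but now a wrong result fails
    assert result == 9_000
```

Both versions produce identical coverage numbers; only this one can fail when the arithmetic is wrong.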

Branches and boundaries slip through too.

every line ran, but
def fee(age: int) -> int:
    discount = 0
    if age >= 65:
        discount = 2_000
    return 5_000 - discount

Test only fee(70) and every line executes, giving 100% line coverage. But the under-65 path was never verified. --cov-branch will catch that omission — yet an off-by-one bug where the condition was mistakenly written as age > 65 only surfaces when you test the boundary values (64 and 65) directly.
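A sketch of the boundary tests follows, with a deliberately buggy variant for comparison. The name fee_with_off_by_one is hypothetical and exists only to show what the boundary tests would catch.

```python
def fee(age: int) -> int:
    discount = 0
    if age >= 65:
        discount = 2_000
    return 5_000 - discount


def fee_with_off_by_one(age: int) -> int:
    """Hypothetical buggy variant: `>` written where `>=` was intended."""
    discount = 0
    if age > 65:
        discount = 2_000
    return 5_000 - discount


def test_fee_just_below_threshold_has_no_discount():
    assert fee(64) == 5_000


def test_fee_at_threshold_gets_discount():
    assert fee(65) == 3_000


# A test at age 70 cannot tell the two versions apart...
assert fee(70) == fee_with_off_by_one(70)
# ...but the value exactly on the boundary can
assert fee(65) != fee_with_off_by_one(65)
```

Swap the buggy variant into test_fee_at_threshold_gets_discount and it fails, while line and branch coverage would look the same for both versions.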

That’s why making the coverage number a team target backfires. The moment the number becomes a KPI, assert-less tests and getter-poking tests multiply: the number climbs while quality stays flat. Use coverage as a map for finding where tests haven’t reached, and keep any enforced threshold to a modest gate like “80% on newly added code.”

How far to go: start where value per cost is highest #

Not all code deserves the same test density. The investment priority looks like this.

  1. Places that have had bugs: code that broke once breaks again. The habit with the highest return is writing a reproduction test first whenever you fix a bug.
  2. Boundary values: zero, the empty list, maximums, the values just before and after a boundary
  3. Core domain logic: money calculations, permission checks — anywhere a mistake is expensive

Conversely, simple delegation code, behavior the framework already guarantees, and prototypes about to be thrown away come last. Tests are a maintenance burden too; piling up low-value tests accumulates debt, not assets.
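The first two priorities often show up together. As an illustration, assume a hypothetical average_order_value function that once crashed with ZeroDivisionError on an empty list; the bug, the function, and the chosen fix (returning 0) are all invented for this sketch.

```python
def average_order_value(totals):
    """Hypothetical function, shown after the fix."""
    if not totals:  # the fix: guard the empty-list boundary
        return 0
    return sum(totals) / len(totals)


def test_average_order_value_of_no_orders_is_zero():
    # Reproduction test: written first, it would fail against the unfixed code
    assert average_order_value([]) == 0


def test_average_order_value_of_some_orders():
    assert average_order_value([10_000, 20_000]) == 15_000
```

Writing the reproduction test before the fix confirms that the test can actually fail, and it stays in the suite afterward as a guard against the same bug returning.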

Recap #

  • Structure tests into three sections: Arrange, Act, Assert
  • One behavior per test; be flexible about the number of assert lines
  • Verify observable behavior, not implementation internals
  • Distinguish test doubles — dummy, stub, fake, mock — and use mock’s call verification sparingly
  • Flakiness comes from time, order, or external dependencies; don’t paper over it with re-runs
  • Coverage is a map — read the Missing column and branches rather than chasing the number
  • Invest first in past bug sites, boundary values, and core domain logic

In the next post (#7 CI integration) we’ll wrap up the series by building a pipeline that runs tests automatically in GitHub Actions and reports coverage.
