Contents
31 Chapter

Logging and observability

Pitfalls of the standard logging module, structured logging with structlog, distributed tracing with OpenTelemetry, and error tracking with Sentry.

In Chapter 30 you automated static correctness of the code. This chapter takes one more step — the tools that make what’s happening in production visible: how to use the standard logging module safely, structured logging, distributed tracing with OpenTelemetry, and error tracking with Sentry.

“You can’t fix what you can’t see” is the one-line motivation for this chapter. This chapter is the next step after Chapter 28 Testing and deployment.

Why logging instead of print #

For small scripts, print is fine. But once it’s running in production, you need:

  • Levels — not every message carries the same importance.
  • Timestamp and location — when and where it happened.
  • Routed destinations — stdout / file / external systems (e.g., Datadog, CloudWatch).
  • Uniform format — for humans to read or machines to parse.
  • On/off control — turn DEBUG off in production, turn it on for debugging.

The standard logging module gives you all of this. Its mental model is complex enough that it also has its own set of pitfalls.

The mental model of the standard logging module #

logging is made of four objects.

ObjectRole
LoggerThe call site. The logger in logger.info(...)
HandlerThe output destination. StreamHandler / FileHandler / RotatingFileHandler
FormatterThe shape of the message. Time, level, message into one line
FilterConditional pass-through. Only certain modules, only certain levels

The flow:

logger.info("hello") is called
  → Logger hands the LogRecord to its handlers
  → Each Handler formats it through its Formatter
  → The Handler emits the final output (stdout / file / network)

First use #

log_hello.py
import logging

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)

handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter("%(asctime)s [%(levelname)s] %(name)s: %(message)s")
)
logger.addHandler(handler)

logger.info("hello")
logger.warning("something off")
2026-05-17 12:00:00,123 [INFO] __main__: hello
2026-05-17 12:00:00,124 [WARNING] __main__: something off

Pitfall 1 — propagate and duplicate output #

Loggers are hierarchical. app.services.user is a child of app.services and app, and it propagates messages up to its parents (propagate).

# main.py
import logging
logging.basicConfig(level=logging.INFO)

logger = logging.getLogger("app.services.user")
logger.addHandler(logging.StreamHandler())  # added a handler

logger.info("hello")
INFO:app.services.user:hello   ← root logger emitted this
hello                          ← the handler we added emitted this

The same message appears twice. Both the root logger’s handler and ours emitted output.

Fix: set propagate = False on the child logger, or only attach handlers in one place (root).

logger.propagate = False

Principle: in application code, only call logger = logging.getLogger(__name__) and put handler setup in one place at app startup.

Pitfall 2 — the cost of % formatting #

There’s a difference between logger.info("user %s logged in", user_id) and logger.info(f"user {user_id} logged in").

The former only formats when the level is INFO or higher. The latter always evaluates the f-string at call time. Stuff a heavy object into a DEBUG message and you pay the cost in production (where DEBUG is off) too.

# Recommended
logger.debug("user info: %s", expensive_serialize(user))

# Not recommended (expensive_serialize runs even with DEBUG off)
logger.debug(f"user info: {expensive_serialize(user)}")

Pitfall 3 — exceptions and tracebacks #

When logging an exception, use logger.exception or exc_info=True.

try:
    do_something()
except Exception:
    logger.exception("do_something failed")  # traceback included automatically

logger.error(str(e)) loses the traceback.

Structured logging — the value of JSON output #

Logs that humans read line by line are hard to search, aggregate, and alert on. Output JSON and your log collector (Datadog, Loki, CloudWatch Logs Insights) can index and query at the field level.

{"ts": "2026-05-17T12:00:00Z", "level": "info", "logger": "app.services.user", "msg": "user logged in", "user_id": 42, "ip": "1.2.3.4"}

Building this with the standard logging module means customizing a Formatter — tedious. structlog is the common choice.

Configuring structlog #

uv add structlog
app/core/logging.py
import logging
import sys

import structlog


def configure_logging(level: str = "INFO") -> None:
    """Call once at app startup."""
    timestamper = structlog.processors.TimeStamper(fmt="iso")

    shared_processors = [
        structlog.contextvars.merge_contextvars,
        structlog.stdlib.add_logger_name,
        structlog.stdlib.add_log_level,
        timestamper,
        structlog.processors.StackInfoRenderer(),
        structlog.processors.format_exc_info,
    ]

    structlog.configure(
        processors=[
            *shared_processors,
            structlog.processors.JSONRenderer(),
        ],
        wrapper_class=structlog.make_filtering_bound_logger(
            getattr(logging, level)
        ),
        logger_factory=structlog.PrintLoggerFactory(file=sys.stdout),
        cache_logger_on_first_use=True,
    )

Use #

app/services/user.py
import structlog

logger = structlog.get_logger(__name__)


async def login(user_id: int, ip: str) -> None:
    logger.info("user logged in", user_id=user_id, ip=ip)

Output:

{"event": "user logged in", "user_id": 42, "ip": "1.2.3.4", "level": "info", "logger": "app.services.user", "timestamp": "2026-05-17T12:00:00.123Z"}

Key idea: the message is the event name (a short, invariant label) and the variable data is split off as kwargs. Embed variables in the body like f"user {id} logged in" and you can no longer aggregate instances of the same event.

Log levels — practical guidance #

LevelWhen
DEBUGOnly viewed during development. Off in production. Variable dumps, trace info
INFOMeaningful events in the normal flow. Login, payment, job completed
WARNINGAbnormal but auto-recovered. Retry succeeded, external API slow, fallback path taken
ERRORA failure. The current request / job failed. Alert-worthy
CRITICALSystem-wide danger. DB down, disk full. Immediate page

Baseline: production runs at INFO and alerts on ERROR. WARNING is collected and reviewed for patterns.

Correlation ID — per-request tracing #

One user’s one request flows through multiple modules and multiple services. Attach the same ID to every log line and you can view everything that happened to that request together.

structlog’s contextvars solves this.

app/api/middleware.py
import uuid

import structlog
from starlette.middleware.base import BaseHTTPMiddleware


class RequestIdMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request, call_next):
        request_id = request.headers.get("X-Request-ID", str(uuid.uuid4()))
        structlog.contextvars.clear_contextvars()
        structlog.contextvars.bind_contextvars(request_id=request_id)
        response = await call_next(request)
        response.headers["X-Request-ID"] = request_id
        return response

Now every logger.info(...) emitted while handling the same request automatically includes request_id.

app/main.py
from fastapi import FastAPI

from app.api.middleware import RequestIdMiddleware
from app.core.logging import configure_logging

configure_logging()
app = FastAPI()
app.add_middleware(RequestIdMiddleware)

OpenTelemetry — the standard for distributed tracing #

When several services handle one request (e.g., web → API → DB → external API), to see how long each step took and where it failed you need traces.

OpenTelemetry (OTel from here on) is the standard in this space. Instrument once and you can send to Jaeger, Tempo, Datadog, Honeycomb, etc.

Mental model #

  • Trace — the full flow of one request. Unique ID.
  • Span — a single unit of work inside a Trace. Has a start time, end time, attributes.
  • Span hierarchy — parent-child relationships form a call tree.
  • Metric — numerical values over time (request count, latency percentiles, CPU).
  • Log — the logs we covered above. Higher value when correlated to a trace.

FastAPI auto-instrumentation #

uv add opentelemetry-distro opentelemetry-exporter-otlp \
       opentelemetry-instrumentation-fastapi \
       opentelemetry-instrumentation-sqlalchemy \
       opentelemetry-instrumentation-httpx
app/core/telemetry.py
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import (
    OTLPSpanExporter,
)
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor


def configure_telemetry(service_name: str, app, engine) -> None:
    resource = Resource.create({"service.name": service_name})
    provider = TracerProvider(resource=resource)
    provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
    trace.set_tracer_provider(provider)

    FastAPIInstrumentor.instrument_app(app)
    SQLAlchemyInstrumentor().instrument(engine=engine)
    HTTPXClientInstrumentor().instrument()

The OTEL_EXPORTER_OTLP_ENDPOINT env var points to the collector address (something like http://localhost:4317). Start by running Jaeger locally.

docker run --rm -p 16686:16686 -p 4317:4317 jaegertracing/all-in-one

Open http://localhost:16686 in your browser to view traces.

Manual spans #

Wrap business logic that auto-instrumentation can’t reach by hand.

from opentelemetry import trace

tracer = trace.get_tracer(__name__)


async def calculate_invoice(user_id: int) -> Invoice:
    with tracer.start_as_current_span("calculate_invoice") as span:
        span.set_attribute("user.id", user_id)
        # ... compute
        span.set_attribute("invoice.total", total)
        return invoice

Sentry — error tracking #

If OTel is “overall flow and performance,” Sentry focuses deeply on the errors themselves.

  • Auto-groups the same error (by stack-trace signature).
  • Auto-aggregates frequency, first seen, number of users affected.
  • Tracks per release — “this error first showed up in v1.2.3.”
  • Breadcrumbs — what happened just before the error.

Install #

uv add sentry-sdk
app/core/sentry.py
import sentry_sdk
from sentry_sdk.integrations.fastapi import FastApiIntegration
from sentry_sdk.integrations.sqlalchemy import SqlalchemyIntegration


def configure_sentry(dsn: str, environment: str, release: str) -> None:
    sentry_sdk.init(
        dsn=dsn,
        environment=environment,
        release=release,
        traces_sample_rate=0.1,         # only 10% traced
        profiles_sample_rate=0.1,       # only 10% profiled
        integrations=[
            FastApiIntegration(),
            SqlalchemyIntegration(),
        ],
        send_default_pii=False,         # don't send default PII
    )

Pass the Git SHA or version tag as release and Sentry shows you “new errors since this commit” automatically. Reading it from a CI env var is the common pattern.

export SENTRY_RELEASE=$(git rev-parse HEAD)

What you must never log #

This is a security and legal-liability area, so automated blocking is mandatory.

PII (personally identifiable information) #

  • Full email, national ID, phone number, card number
  • Address, date of birth, location coordinates
  • Original photo / audio / video

Internal IDs (DB primary keys) usually aren’t PII and are fine. External identifiers (emails, etc.) are the risk.

Secrets #

  • API tokens, passwords (even hashed)
  • TLS private keys
  • Database connection strings (with passwords)
  • AWS access keys

Large payloads #

Logging full request/response bodies brings two problems together.

  • Log cost balloons.
  • If PII / secrets are mixed in, they leak together.

Automatic blocking pattern #

Structured logging shines here. Register a processor that auto-masks sensitive keys.

extending app/core/logging.py
SENSITIVE_KEYS = {"password", "token", "authorization", "secret", "card_number"}


def mask_sensitive(logger, method_name, event_dict):
    for key in list(event_dict.keys()):
        if key.lower() in SENSITIVE_KEYS:
            event_dict[key] = "***"
    return event_dict
structlog.configure(
    processors=[
        *shared_processors,
        mask_sensitive,
        structlog.processors.JSONRenderer(),
    ],
    ...
)

Now a mistake like logger.info("login", password="hunter2") comes out masked.

Local dev vs production — config differences #

ItemLocalProduction
Output formatHuman-readable colored textJSON
LevelDEBUGINFO (or WARNING)
Destinationstdoutstdout (container logs → collector)
Sentrydev DSN or disabledproduction DSN, release tag
OTel sample rate100%1~10%

Branch on the ENV env var.

app/core/logging.py
import os

def configure_logging() -> None:
    env = os.getenv("ENV", "local")
    renderer = (
        structlog.dev.ConsoleRenderer(colors=True)
        if env == "local"
        else structlog.processors.JSONRenderer()
    )
    level = "DEBUG" if env == "local" else "INFO"
    # ... structlog.configure(processors=[..., renderer], ...)

Applying it to the Chapter 29 capstone #

Integrate this chapter’s tools into the health check and main of Chapter 29 TODO API.

extending app/main.py
from contextlib import asynccontextmanager
from fastapi import FastAPI

from app.core.config import settings
from app.core.logging import configure_logging
from app.core.telemetry import configure_telemetry
from app.core.sentry import configure_sentry
from app.api.middleware import RequestIdMiddleware
from app.db import engine


@asynccontextmanager
async def lifespan(app: FastAPI):
    configure_logging()
    if settings.sentry_dsn:
        configure_sentry(settings.sentry_dsn, settings.env, settings.release)
    yield


app = FastAPI(lifespan=lifespan)
app.add_middleware(RequestIdMiddleware)

if settings.otel_enabled:
    configure_telemetry("todo-api", app, engine.sync_engine)

Now every request gets a request_id, errors flow to Sentry, and traces flow to Jaeger / Tempo automatically.

Exercises #

  1. Strengthen structlog masking — This chapter’s mask_sensitive only inspects top-level keys. To also mask keys inside nested dicts like user = {"email": "...", "password": "..."}, how would you make it recursive? And where do you think that processing starts to become a performance problem on large dicts?
  2. Limits of OTel auto-instrumentation — FastAPI’s auto-instrumentation creates per-route spans but doesn’t reach into business logic inside a route. Pick the slowest route in your project, add manual spans in 3–4 places inside it, and look at the time share of those spans in Jaeger. Which step took longer than you expected?
  3. Sentry release and regressions — Register a release in Sentry and it shows “errors that first appeared in this release.” Pass SENTRY_RELEASE=$(git rev-parse HEAD) from your GitHub Actions release job and deliberately introduce an error that lives only in one version. Does Sentry correctly classify it as “new in release X”?
Note
In one line — Emit structured JSON logs with structlog instead of print, tie one request together with request_id, trace flow and performance through OpenTelemetry, and capture errors in Sentry. Use a masking processor to automatically block PII, secrets, and large payloads.

The next chapter is publishing a library with uv. Now that the operational tooling is in place, we publish code to PyPI so other people can use it.

X