Chapter 19

Observability

We organize the three axes that give a production cluster visibility: metrics (Prometheus + kube-state-metrics + node-exporter), logs (Loki), and traces (OpenTelemetry + Tempo), together with the standard visualization stack (Grafana) and alerting (Alertmanager). We cover the ServiceMonitor and PrometheusRule pieces of kube-prometheus-stack, examples of PromQL and LogQL, and four operational guardrails: cardinality, retention, alert signal-to-noise ratio (SNR), and golden signals.

Chapters 15 through 18 (CNI in Depth, RBAC / ServiceAccount in Depth, Admission Controller, and the CRD and Operator pattern) all covered components that shape the cluster’s behavior. We need one more layer to see whether those components are running well: observability. How much memory is each Pod using? Which Service’s latency is higher than usual? How did one request flow between microservices? These three questions map to metrics, logs, and traces, and the standard Kubernetes operations stack sits on top of them.

By the end of this chapter you’ll have a single tool stack that brings together the signals of Chapter 11 (resources.requests / limits), Chapter 12 (Health check), and Chapter 13 (Autoscaling). It’s the starting point for widening the cluster’s field of vision with a single manifest.

The three axes of observability #

Observability is commonly discussed in terms of three kinds of data.

| Axis | What it is | Question |
| --- | --- | --- |
| Metrics | numeric time series over time | “what is happening right now” |
| Logs | text records of events | “what are the detailed circumstances of that event” |
| Traces | the path of one request through several services | “why was this request slow” |

These three are complementary. The everyday debugging pattern is to discover an anomaly with metrics, pin down the detailed circumstances with logs, and use traces to narrow down which segment of the request path is at fault. Running all three in a production cluster is standard, and the K8s tooling for each axis has largely settled. The diagnostic tree of Chapter 27 kubectl debugging patterns also relies on looking at these three axes together.

Metrics — the Prometheus-centered standard stack #

The de facto standard for K8s metrics is Prometheus. It’s a CNCF graduated project, and K8s’s own components (API server, kubelet, controller-manager, scheduler) all expose metrics in a format Prometheus can understand. Prometheus’s model is simple.

  • Pull-based — Prometheus periodically scrapes each target’s /metrics endpoint over HTTP.
  • Time-series database — the scraped data is stored as labeled time series.
  • PromQL — its own query language for querying time series.
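
To make the pull model concrete, here is a minimal Python sketch of the kind of text a /metrics endpoint returns when Prometheus scrapes it. The helper names (render_sample, render_metrics) and the sample values are hypothetical, and real applications would normally use an official client library rather than formatting lines by hand.

```python
def render_sample(name: str, labels: dict[str, str], value: float) -> str:
    """Render one sample as a line of the Prometheus text exposition format.

    Simplification: label values are not escaped, so they must not contain
    quotes, backslashes, or newlines.
    """
    if not labels:
        return f"{name} {value}"
    pairs = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
    return f"{name}{{{pairs}}} {value}"


def render_metrics(samples: list[tuple[str, dict[str, str], float]]) -> str:
    """Build the body that a scrape of /metrics would return."""
    return "".join(render_sample(*sample) + "\n" for sample in samples)


# Hypothetical samples for illustration.
body = render_metrics([
    ("http_requests_total", {"method": "GET", "status": "200"}, 1027),
    ("http_requests_total", {"method": "GET", "status": "500"}, 3),
    ("process_open_fds", {}, 42),
])
print(body, end="")
```

Each distinct label combination is its own line here, which is exactly what Prometheus later stores as a separate labeled time series.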

The components of the standard stack #

When you install Prometheus on a production cluster, the following components almost always come along.

| Component | Role |
| --- | --- |
| Prometheus Server | metric collection + storage + querying |
| kube-state-metrics | exposes the state of K8s objects (Deployment, Pod, Node, etc.) as metrics |
| node-exporter | exposes each node’s system metrics (CPU, memory, disk). One per node via a Chapter 8 DaemonSet |
| Alertmanager | alert routing, grouping, silencing |
| Pushgateway (optional) | receives metrics from short-lived Jobs via push |

The standard way to install this bundle at once is the kube-prometheus-stack Helm chart. It’s the de facto first step of adoption in a production cluster. The hands-on EKS setup is covered once more in Chapter 25 Monitoring · Alerts.

ServiceMonitor / PodMonitor — the role of the Prometheus Operator #

Instead of having you write scrape targets directly into Prometheus’s configuration file, the Prometheus Operator introduces two CRDs, ServiceMonitor and PodMonitor. It’s an example of the model from Chapter 18 the CRD and Operator pattern applied directly to the observability stack.

servicemonitor-app.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: my-app
  labels:
    release: prometheus  # matches kube-prometheus-stack's selector
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics

Apply this ServiceMonitor and the Prometheus Operator automatically updates the Prometheus config so it scrapes metrics from that Service’s Pods every 30 seconds. It’s the K8s-native way of declaring scrape targets with a manifest.

PodMonitor is a variant that attaches directly to Pods without a Service. Thanks to these two CRDs, the application team only needs to write one ServiceMonitor next to the Service and metric collection starts automatically.
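
The label matching behind this can be sketched in a few lines of Python. This is only an illustration of matchLabels semantics under simplifying assumptions: the dictionaries are hypothetical stand-ins for real objects, and namespace selection and port-name resolution are left out.

```python
def matches(match_labels: dict[str, str], object_labels: dict[str, str]) -> bool:
    """matchLabels semantics: every selector pair must appear on the object."""
    return all(object_labels.get(k) == v for k, v in match_labels.items())


# Level 1: Prometheus picks up ServiceMonitors by label (values are illustrative).
prometheus_selector = {"release": "prometheus"}
service_monitor = {
    "labels": {"release": "prometheus"},
    "selector": {"app": "my-app"},
}

# Level 2: the ServiceMonitor picks up Services by label.
services = {
    "my-app": {"app": "my-app", "tier": "backend"},
    "my-app-admin": {"app": "my-app-admin"},
    "unlabeled": {},
}

scraped = []
if matches(prometheus_selector, service_monitor["labels"]):
    scraped = [
        name for name, labels in services.items()
        if matches(service_monitor["selector"], labels)
    ]
print(scraped)  # ['my-app']
```

The two levels are worth keeping apart when debugging: a ServiceMonitor that Prometheus never selects and a ServiceMonitor that selects no Service both end in "no targets," but for different reasons.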

One-line PromQL examples #

PromQL is a deep subject in itself, but let’s pin down a few patterns used most often in operations.

Memory usage per Pod in a namespace
# container!="" skips the pod-level cgroup series so usage is not counted twice
sum(container_memory_usage_bytes{namespace="payments", container!=""}) by (pod)

5xx response ratio over the last 5 minutes
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

P95 latency (histogram)
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)

rate() computes the per-second average rate of increase of a counter over the given window, and histogram_quantile() estimates a specific quantile from a histogram’s buckets. These two functions cover a large share of everyday PromQL use. The container_memory_working_set_bytes time-series analysis we touched on in §“Memory usage vs memory limits” of Chapter 11 becomes practical on top of this PromQL.
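
To see what these two functions do with the raw numbers, here is a simplified Python sketch. It is not Prometheus’s implementation: simple_rate and simple_histogram_quantile are hypothetical names, the real rate() also extrapolates to the window edges, and the sample data is made up.

```python
import math


def simple_rate(samples: list[tuple[float, float]]) -> float:
    """Per-second average increase of a counter over (timestamp, value) samples.

    Simplification: no extrapolation to the window edges.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter reset (e.g. a restart): count from zero.
        increase += cur - prev if cur >= prev else cur
    return increase / (samples[-1][0] - samples[0][0])


def simple_histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank and count > lower_count:
            if math.isinf(upper_bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


counter = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]  # reset at t=45
print(simple_rate(counter))

buckets = [(0.1, 50), (0.25, 80), (0.5, 95), (1.0, 99), (math.inf, 100)]
print(simple_histogram_quantile(0.95, buckets))
```

The interpolation step is why the result is an estimate: the accuracy of a P95 depends on how finely the bucket boundaries are chosen around the latencies you care about.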

Logs — the new Loki-centered stack #

The old standard for log collection was the EFK stack (Elasticsearch + Fluentd + Kibana). Powerful but heavy — Elasticsearch full-text-indexes every log body, so it consumes a lot of disk and memory.

Loki is a lightweight alternative built by Grafana Labs. Its model is different: it indexes only the labels, not the log body. At search time it narrows the candidates by label and then scans the matching bodies grep-style. In effect it brings a label model similar to Prometheus over to logs.

The components of the Loki stack #

| Component | Role |
| --- | --- |
| Loki | log storage + querying |
| Promtail (or Fluent Bit) | reads logs from each node and sends them to Loki. A DaemonSet |
| Grafana | LogQL queries + visualization |

Promtail reads the node’s /var/log/containers/, automatically attaches K8s metadata (Pod name, namespace, container name, labels) as labels, and sends the lines to Loki. Every container’s stdout / stderr is collected as-is, with no application change. Where the kubectl logs command of Chapter 3 kubectl and your first Pod shows logs one Pod at a time, Loki + Promtail extends that view to the entire cluster.

LogQL — Loki’s query language #

It follows the same pattern as PromQL.

ERROR logs in the payments namespace
{namespace="payments"} |= "ERROR"

A specific Pod's error rate (converted to a metric)
sum(rate({pod="checkout-abc123"} |= "ERROR" [5m]))

{...} filters by label, |= matches a substring in the body, and |~ matches a regex. Wrapped in rate(), a log query behaves like a metric, so it can be drawn alongside metric charts on a Grafana dashboard.
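
The "labels first, then scan the body" order can be sketched conceptually in Python. This is an illustration only, assuming an in-memory list of lines; the LogLine type, the helper names, and the sample data are hypothetical, and Loki itself selects whole streams through its label index rather than checking labels line by line.

```python
from dataclasses import dataclass


@dataclass
class LogLine:
    ts: float                 # seconds
    labels: dict[str, str]    # stream labels attached by the collector
    body: str                 # raw log text, not indexed


def select(lines, matchers, needle):
    """Sketch of {matchers} |= "needle": label step first, then substring scan."""
    return [
        line for line in lines
        if all(line.labels.get(k) == v for k, v in matchers.items())
        and needle in line.body
    ]


def log_rate(lines, matchers, needle, now, window_s):
    """Sketch of rate(... [window]): matching lines per second in the window."""
    hits = [
        line for line in select(lines, matchers, needle)
        if now - window_s < line.ts <= now
    ]
    return len(hits) / window_s


lines = [
    LogLine(10, {"namespace": "payments", "pod": "checkout-abc123"}, "ERROR card declined"),
    LogLine(20, {"namespace": "payments", "pod": "checkout-abc123"}, "INFO order accepted"),
    LogLine(30, {"namespace": "search", "pod": "indexer-xyz"}, "ERROR timeout"),
    LogLine(250, {"namespace": "payments", "pod": "checkout-abc123"}, "ERROR upstream 503"),
]
errors = select(lines, {"namespace": "payments"}, "ERROR")
print([line.body for line in errors])
print(log_rate(lines, {"pod": "checkout-abc123"}, "ERROR", now=300, window_s=300))
```

The cost model follows from the order: a tight label selector keeps the body scan small, while a broad selector forces a scan over far more text.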

Loki vs EFK — the tradeoff #

| Dimension | Loki | EFK |
| --- | --- | --- |
| Indexing | labels only | the full body |
| Disk cost | low | high |
| Full-text search | grep-style scan (slow) | fast |
| Operational burden | low | high (operating an Elasticsearch cluster) |
| Grafana integration | first-class | possible |

For a new adoption, Loki is closer to the standard. If full-text search is a core requirement, EFK or OpenSearch is a better fit, but for the everyday debugging of K8s operations, Loki’s label + grep model is enough.

Traces — the OpenTelemetry-centered convergence #

The old standard for distributed tracing was split into two branches, OpenTracing and OpenCensus. The merger of the two projects produced OpenTelemetry (OTel). It’s now a single standard that handles traces, metrics, and logs together, and one of the most active projects in the CNCF.

OpenTelemetry has three core pieces.

  • Instrumentation library: a per-language SDK is added to the application code to create traces. In many cases an auto-instrumentation tool (a Java agent, for example) attaches without a code change.
  • OpenTelemetry Collector: receives the data the application sends, then processes and routes it. In K8s it typically runs as a DaemonSet or Deployment.
  • Backend: actually stores and visualizes traces. Examples are Jaeger, Tempo, Datadog, and Honeycomb.

The trace model — a tree of Spans #

The unit datum of distributed tracing is the Span. As one request passes through several services, one span is created at each step, and they’re bound in parent-child relationships to form a tree.

An example span tree of one request
[gateway] /api/orders POST  (200ms)
 ├─ [orders-service] create order   (180ms)
 │   ├─ [postgres] INSERT orders    (15ms)
 │   ├─ [postgres] INSERT items     (12ms)
 │   └─ [kafka] publish order.created (45ms)
 └─ [auth-service] verify token     (10ms)

Looking at this tree, you can see at a glance which segment of the 200ms took the most time. When P99 latency is higher than usual, a trace is decisive for narrowing down which service is the cause.
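
As a rough sketch of how that reading can be automated, the Python below rebuilds the example tree and computes each span’s self time (its duration minus the time spent in its direct children). The Span class and self_times helper are hypothetical, and the calculation assumes child spans run one after another without overlapping, which real traces do not guarantee.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    service: str
    name: str
    duration_ms: int
    children: list["Span"] = field(default_factory=list)


def self_times(span: Span):
    """Yield (span, self_time_ms) for the span and all of its descendants."""
    child_total = sum(child.duration_ms for child in span.children)
    yield span, span.duration_ms - child_total
    for child in span.children:
        yield from self_times(child)


root = Span("gateway", "/api/orders POST", 200, [
    Span("orders-service", "create order", 180, [
        Span("postgres", "INSERT orders", 15),
        Span("postgres", "INSERT items", 12),
        Span("kafka", "publish order.created", 45),
    ]),
    Span("auth-service", "verify token", 10),
])

slowest, self_ms = max(self_times(root), key=lambda pair: pair[1])
print(slowest.service, slowest.name, self_ms)  # orders-service create order 108
```

Under that assumption, most of the 200ms is spent inside orders-service itself rather than in its database or Kafka calls, which is the kind of conclusion a metric alone cannot give you.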

Tempo — a trace store in the same direction as Loki #

Grafana Labs built its trace store, Tempo, on a lightweight model too. It does for traces what Loki did for logs: it minimizes the index and stores the trace body in object storage (S3 / GCS). It’s optimized for direct lookup by trace ID, and when used together with Loki and Prometheus, Grafana lets you move naturally from metric to log to trace.

Grafana — the standard for visualization #

The tool for looking into the data of all three axes in one place is Grafana. It can tie nearly every data source — Prometheus, Loki, Tempo, Elasticsearch, CloudWatch, etc. — into one dashboard, and each panel fetches data with its own query language.

A production cluster’s standard dashboard set is usually composed roughly as follows.

  • Cluster overview: CPU, memory, and disk per node, Pod count, resource use per namespace
  • Workload overview: replica state per Deployment / StatefulSet, restart count, OOMKilled
  • API server health: request rate, error rate, P99 latency, etcd latency
  • Each application: business metrics + the 4 golden signals (latency, traffic, errors, saturation)

When you adopt kube-prometheus-stack, the cluster · workload · API server dashboards come along pre-configured. You only need to newly build the application dashboard to fit your domain.

Alerting — the role of Alertmanager #

Prometheus itself only evaluates alert rules; delivering the resulting notifications is Alertmanager’s job. Prometheus sends each firing alert to Alertmanager, and Alertmanager handles routing, grouping, silencing, and repeat notifications.

PrometheusRule — the CRD for alert definitions #

prometheusrule-high-error-rate.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: app-alerts
  namespace: my-app
  labels:
    release: prometheus
spec:
  groups:
    - name: my-app
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{status=~"5..", app="my-app"}[5m]))
              / sum(rate(http_requests_total{app="my-app"}[5m])) > 0.05
          for: 5m
          labels:
            severity: warning
            team: payments
          annotations:
            summary: "High 5xx rate on my-app ({{ $value | humanizePercentage }})"
            description: "5xx rate over the last 5 minutes is above 5%."

expr is the condition evaluated in Prometheus, and for: 5m means “this condition must hold for 5 minutes straight before the alert fires.” Until then the alert stays in the pending state. The severity and team labels are used as routing keys in Alertmanager.
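
The effect of for can be sketched as a small state machine in Python. This is a simplified model, not Prometheus’s code: alert_states is a hypothetical helper, and the 60-second evaluation interval and the sequence of condition results are assumptions chosen for illustration.

```python
def alert_states(condition_by_tick: list[bool], tick_s: int, for_s: int) -> list[str]:
    """Return inactive / pending / firing for each rule evaluation."""
    states = []
    active_since = None  # time at which the condition first became true
    for i, condition in enumerate(condition_by_tick):
        now = i * tick_s
        if not condition:
            # Any false evaluation resets the clock.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        states.append("firing" if now - active_since >= for_s else "pending")
    return states


# A two-minute spike, a recovery, then a sustained problem.
results = [True, True, False, True, True, True, True, True, True, True]
states = alert_states(results, tick_s=60, for_s=300)
print(states)
```

The short spike at the start never gets past pending, and the sustained problem only fires once it has held for the full five minutes. That delay is the price paid for filtering out noise.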

Alertmanager’s routing #

In Alertmanager’s config, you decide where to send alerts based on labels.

alertmanager.yaml — simplified
route:
  receiver: default
  group_by: ['alertname', 'team']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # matchers replaces the deprecated match / match_re fields
    - matchers:
        - severity="critical"
      receiver: pagerduty
    - matchers:
        - team="payments"
      receiver: payments-slack

receivers:
  - name: default
    slack_configs:
      - channel: '#alerts'
  - name: pagerduty
    pagerduty_configs:
      - service_key: ...
  - name: payments-slack
    slack_configs:
      - channel: '#payments-alerts'

Thanks to this model, alert routing is managed as code (a manifest) and branches to channels like Slack / PagerDuty / Email.
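
One detail of this routing that is easy to miss is that child routes are tried in order and the first match wins. The Python below is a minimal sketch of that behavior for the simplified config above; the route function and the dictionary layout are hypothetical, and nested routes, regex matchers, and grouping are left out.

```python
def route(alert_labels: dict[str, str], routes: list[dict], default_receiver: str) -> str:
    """Return the receiver of the first child route whose labels all match."""
    for child in routes:
        if all(alert_labels.get(k) == v for k, v in child["labels"].items()):
            return child["receiver"]
    return default_receiver


# Mirrors the two child routes of the simplified config, in the same order.
ROUTES = [
    {"labels": {"severity": "critical"}, "receiver": "pagerduty"},
    {"labels": {"team": "payments"}, "receiver": "payments-slack"},
]

print(route({"severity": "warning", "team": "payments"}, ROUTES, "default"))   # payments-slack
print(route({"severity": "critical", "team": "payments"}, ROUTES, "default"))  # pagerduty
print(route({"severity": "warning", "team": "search"}, ROUTES, "default"))     # default
```

A critical payments alert therefore goes to PagerDuty only. If it should also reach the payments Slack channel, Alertmanager’s continue: true setting on the first route lets matching carry on to later siblings.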

Principles to pin down in operations #

Once well set up, the observability stack deepens the cluster’s field of vision in one stroke, but operated poorly it becomes a tool that eats a large part of the cluster’s resources. It’s good to pin down the following four.

1. Beware metric cardinality explosion #

Prometheus creates a separate time series for every label combination. Put high-cardinality values such as a user ID, request ID, or UUID into a label and the number of time series explodes, quickly exhausting Prometheus’s memory. The first operational rule is: don’t put high-cardinality values in labels.

Cardinality check — the metrics with the most time series
topk(10, count by (__name__)({__name__=~".+"}))

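Running this query periodically to see which metrics have the most time series is part of routine operations. Note that it touches every series, so it can be expensive on a large Prometheus server.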
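
The arithmetic behind the explosion is just a product. Here is a small Python sketch, assuming every label value can combine with every other; the label names and counts are made up for illustration.

```python
from math import prod


def worst_case_series(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


bounded = worst_case_series({"method": 5, "status": 10, "service": 20})
exploded = worst_case_series(
    {"method": 5, "status": 10, "service": 20, "user_id": 100_000}
)
print(bounded)   # 1000
print(exploded)  # 100000000
```

One extra label with 100,000 distinct values turns a thousand series into a hundred million in the worst case, which is why a single careless label can take down a Prometheus server.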

2. Retention period and remote storage #

Prometheus’s local storage retains 15 days by default. To keep more than that, you must also send it to remote storage (Thanos, Cortex, Mimir, VictoriaMetrics). Likewise, it’s standard to put long-term storage for Loki and Tempo in object storage (S3 / GCS).

The retention period is directly tied to cost. Metrics 6 months, logs 30 days, traces 7 days is a common starting point, and you can set each axis’s retention separately. The cost-perspective guardrails are covered once more in Chapter 28 Cost Optimization.
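
To get a feel for how retention translates into disk, the Prometheus storage documentation gives a rough sizing formula: retention time in seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample on average. The Python sketch below applies it; the function name and the workload numbers (one million active series, 30-second scrape interval) are assumptions for illustration.

```python
def needed_disk_bytes(
    retention_days: float,
    active_series: int,
    scrape_interval_s: float,
    bytes_per_sample: float = 2.0,  # upper end of the documented 1-2 byte average
) -> float:
    """Rough local-storage estimate for a Prometheus server."""
    samples_per_second = active_series / scrape_interval_s
    return retention_days * 86_400 * samples_per_second * bytes_per_sample


default_retention = needed_disk_bytes(15, 1_000_000, 30)
six_months = needed_disk_bytes(180, 1_000_000, 30)
print(f"{default_retention / 1e9:.1f} GB")
print(f"{six_months / 1e9:.1f} GB")
```

The estimate scales linearly with both retention and series count, so trimming cardinality and shortening retention reduce cost in the same way.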

3. Alert SNR — too many alerts is the same as no alerts #

Create too many alerts and the operator starts to ignore them, and ends up missing even the important ones. The standard principle of alert design is “one alert = one immediate human response.”

  • Symptom-based: alert on the symptom, not the cause. Not “the DB connection pool is 80% full” but “the API’s 5xx rate exceeds 5%.”
  • Remove noise with the for period: so the alert doesn’t fire on a short spike.
  • Severity branches: critical is what should wake you up, warning is what you can look at tomorrow morning. When the branches blur, alerts start to be ignored.

4. Make golden signals the standard #

The 4 golden signals, originating from Google’s SRE culture, are the starting point of nearly all workload monitoring.

| Signal | Meaning |
| --- | --- |
| Latency | request-handling time (P50 / P95 / P99) |
| Traffic | requests per second |
| Errors | failure rate |
| Saturation | resource saturation (CPU, memory, queue length) |

Expose these four in the same form on every service and the dashboards and alerts get standardized too. Domain metrics layer on top.
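
As a rough illustration of what "the same form on every service" means, the Python below derives all four signals from one window of request records. It is a sketch under stated assumptions: the Request type, the golden_signals helper, the nearest-rank P95, and the use of CPU as the saturation measure are all choices made for this example, and in practice these values come from PromQL over exported metrics.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_s: float
    status: int


def golden_signals(requests, window_s, cpu_used_cores, cpu_limit_cores):
    """Compute latency, traffic, errors, and saturation for one time window."""
    if not requests:
        raise ValueError("no requests in window")
    durations = sorted(r.duration_s for r in requests)
    # Nearest-rank P95 using integer arithmetic: ceil(0.95 * n).
    rank = (95 * len(durations) + 99) // 100
    return {
        "latency_p95_s": durations[rank - 1],
        "traffic_rps": len(requests) / window_s,
        "error_ratio": sum(r.status >= 500 for r in requests) / len(requests),
        "saturation": cpu_used_cores / cpu_limit_cores,
    }


# Hypothetical one-minute window: 18 fast requests, one slow, one failed.
window = [Request(0.1, 200)] * 18 + [Request(0.4, 200), Request(2.0, 500)]
signals = golden_signals(window, window_s=60, cpu_used_cores=0.6, cpu_limit_cores=1.0)
print(signals)
```

Because every service reports the same four keys, one dashboard template and one set of alert rules can be reused across services with only the label selectors changed.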

Tying it together with HPA, probes, and debugging #

This chapter’s observability stack becomes the one place where the signals of the whole book gather. Let’s tie it up briefly.

  • Chapter 11 resources.requests / limits: the container_memory_working_set_bytes and container_cpu_cfs_throttled_seconds_total time series land in Prometheus. Postmortem analysis of OOMKilled and CPU throttling is done with this chapter’s PromQL.
  • Chapter 12 Health check: probe failures show up both as a metric (kube_pod_container_status_ready) and as logs. The first line of an alert rule is usually this readiness metric.
  • Chapter 13 Autoscaling: when the HPA needs input metrics beyond what metrics-server provides, they come through the Prometheus Adapter, which ties directly to this chapter’s ServiceMonitor. KEDA sits in the same position.
  • Chapter 17 Admission Controller: wiring webhook P99 latency and rejection rate into Alertmanager is standard practice with this chapter’s stack.
  • Chapter 27 kubectl debugging patterns: the diagnostic tree that starts with describe and logs extends naturally into the observability stack’s metric and log time series.

Exercises #

  1. Install kube-prometheus-stack on your cluster with Helm, then check which ServiceMonitors and PrometheusRules are pre-configured with kubectl get servicemonitor -A and kubectl get prometheusrule -A. Read the expr of one of those PrometheusRules yourself, and summarize in one paragraph what symptom that alert watches, measured against the Symptom-based principle of §“Alert SNR.”
  2. Query histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) directly on one workload of your cluster to check P95 latency. Look at it in the same time window as Chapter 11’s CPU throttling time series (container_cpu_cfs_throttled_seconds_total), and describe in one paragraph how throttling stretching response latency shows up in PromQL.
  3. Write one PrometheusRule yourself: a rule like “alert if a Pod is OOMKilled 3 or more times within 5 minutes.” Choose which time series to use in expr (for example kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}), and set for and severity to match the guardrails of §“Principles to pin down in operations.” Also decide, in one paragraph, which channel Alertmanager’s routing should send this alert to.

In one line: the standard bundle of K8s observability is Prometheus + kube-state-metrics + node-exporter for metrics, Loki for logs, OpenTelemetry + Tempo for traces, Grafana for visualization, and Alertmanager for alerting. The kube-prometheus-stack Helm chart is the first step of adoption, and the ServiceMonitor, PodMonitor, and PrometheusRule CRDs build manifest-level observability on top of Chapter 18’s Operator model. The four operational guardrails are avoiding cardinality explosion, retention period and remote storage, alert SNR, and golden signals.

Next chapter #

Up through this chapter we’ve organized the tools that build the cluster’s field of vision. The next chapter covers the operational model of how that cluster is changed. So far, the flow has been that a person applies a manifest with kubectl apply. But in operations where several people, several environments, and several clusters run together, the source of truth for manifests must be in git, not in a person’s hands.

Chapter 20 GitOps covers the model of ArgoCD and Flux. It wraps up Part 3, organizing the flow where git’s manifest is automatically synced to the cluster, drift detection, sync policy, multi-cluster operational patterns, and how Chapter 18’s status subresource meshes with ArgoCD’s drift detection.
