K8s Advanced #5: Observability — Prometheus / Grafana / Loki / OpenTelemetry

The fifth post in the K8s Advanced series. The topics covered so far (CNI, RBAC / IRSA, Admission, and CRD / Operator) all shape cluster behavior. One more layer is needed on top: a way to see whether all those components are actually running well, which is observability. Which Pod uses how much memory? Which Service has higher-than-usual latency? How did a single request flow through the microservices? These questions map to three kinds of data (metrics, logs, and traces), and the standard K8s operational stack is built on top of them.

The K8s Advanced series consists of 6 posts.

Three axes of observability #

Observability is commonly split into three kinds of data.

Axis | What it is | Question
Metrics | Numerical time series over time | “What’s happening right now”
Logs | Text records of events | “What were the details of that event”
Traces | A single request’s path through multiple services | “Why was this request slow”

These three are complementary. A typical debugging pattern is to spot an anomaly via metrics, examine the details via logs, and then narrow down which segment of the request path is the problem via traces. Having all three in operational clusters is the standard, and the K8s tools for each axis are nearly settled.

Metrics — the standard stack centered on Prometheus #

The de facto standard for K8s metrics is Prometheus. It is a CNCF graduated project, and K8s’s own components (API server, kubelet, controller-manager, scheduler) all expose metrics in a format Prometheus understands. Prometheus’s model is simple.

  • Pull-based — Prometheus periodically scrapes each target’s /metrics endpoint via HTTP.
  • Time-series database — scraped data is stored as labeled time series.
  • PromQL — its own query language for querying time series.
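To make the pull model concrete, here is a minimal sketch of what a target's /metrics endpoint returns: plain text in the Prometheus exposition format, one line per labeled series. The helper name render_counter and the sample values are hypothetical, and label-value escaping is left out for brevity.

```python
def render_counter(name, help_text, samples):
    """Render one counter family in the Prometheus text exposition format.

    samples maps a tuple of (label, value) pairs to the current counter value.
    Label values are not escaped here; a real client library handles that.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in sorted(samples.items()):
        label_str = ",".join(f'{k}="{v}"' for k, v in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical request counts, keyed by label set.
samples = {
    (("method", "GET"), ("status", "200")): 1027,
    (("method", "POST"), ("status", "500")): 3,
}
text = render_counter("http_requests_total", "Total HTTP requests.", samples)
print(text)
```

Each scrape stores these values with the scrape timestamp, which is how a single text page turns into labeled time series. In practice an official client library generates this output.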

Standard stack components #

Installing Prometheus on an operational cluster nearly always brings these components together.

Component | Role
Prometheus Server | Metric collection + storage + querying
kube-state-metrics | Exposes K8s objects (Deployment, Pod, Node, etc.) state as metrics
node-exporter | Exposes each node’s system metrics (CPU, memory, disk). DaemonSet, one per node.
Alertmanager | Alert routing, grouping, silencing
Pushgateway (optional) | Receives metrics via push from short-lived Jobs

The standard way to install this bundle at once is the kube-prometheus-stack Helm chart, which has become the de facto first step for operational cluster observability.

ServiceMonitor / PodMonitor — the role of Prometheus Operator #

Rather than writing Prometheus scrape targets directly in config files, Prometheus Operator introduces two CRDs.

servicemonitor-app.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: my-app
  labels:
    release: prometheus  # matches kube-prometheus-stack's selector
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics

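Applying this ServiceMonitor causes Prometheus Operator to automatically update the Prometheus configuration so that the Pods behind the matching Service are scraped every 30 seconds. The port field refers to a named port on the Service, so the Service must expose a port called metrics. Scrape targets are thus declared the K8s-native way, as manifests.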

PodMonitor is a variant attached directly to Pods without a Service. Thanks to these two CRDs, application teams only need to write a ServiceMonitor alongside their Service, and metric collection starts automatically.
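The selector in a ServiceMonitor follows ordinary K8s label-selector semantics: every key/value pair under matchLabels must be present on the Service. A minimal sketch of that matching rule, with hypothetical Service names and labels (matchExpressions is not modeled):

```python
def matches(selector, labels):
    """True when every key/value pair in the selector is present in labels."""
    return all(labels.get(key) == value for key, value in selector.items())


# Hypothetical Services and their labels.
services = {
    "my-app": {"app": "my-app", "tier": "backend"},
    "other": {"app": "other"},
}

selector = {"app": "my-app"}  # same shape as spec.selector.matchLabels
selected = [name for name, labels in services.items() if matches(selector, labels)]
print(selected)
```

Extra labels on the Service do not prevent a match. A missing or different value does, which is a common reason a ServiceMonitor silently selects nothing.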

One-line PromQL examples #

PromQL is a deep topic in its own right, but a few patterns come up frequently in operations:

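Per-Pod memory usage in a namespace
# container!="" skips the pod-level aggregate series that cAdvisor also exports, which would otherwise be double counted
sum by (pod) (container_memory_usage_bytes{namespace="payments", container!=""})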
5xx response rate over the last 5 minutes
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))
P95 latency (histogram)
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)

rate() computes the per-second average rate of increase of a counter over the given window, and histogram_quantile() estimates a quantile from histogram buckets. Between them, these two functions cover the bulk of day-to-day PromQL.
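The arithmetic behind these two functions can be sketched in a few lines. This is a simplified illustration, not Prometheus's implementation: simple_rate skips the extrapolation to window edges that the real rate() performs, and simple_histogram_quantile assumes 0 < q <= 1 and at least one finite bucket. Both function names and all sample values are hypothetical.

```python
import math


def simple_rate(samples):
    """Per-second average increase of a counter over a window.

    samples: list of (timestamp_seconds, value), oldest first.
    A drop in value is treated as a counter reset (the counter restarted at 0).
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # After a reset, the whole new value counts as increase.
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


def simple_histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative buckets [(upper_bound, count), ...].

    Buckets are sorted by upper bound and end with math.inf.
    Observations are assumed to be spread evenly inside each bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for i, (bound, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(bound):
                # Quantile lands in the +Inf bucket: report the highest finite bound.
                return buckets[i - 1][0]
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count


# A counter sampled every 60s, with one restart between t=60 and t=120.
counter = [(0, 100), (60, 160), (120, 40), (180, 100)]
print(simple_rate(counter))

# Cumulative latency buckets in seconds: 50 requests <= 0.1s, 90 <= 0.5s, 100 <= 1s.
latency = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
print(simple_histogram_quantile(0.95, latency))
```

Because the quantile is interpolated within a bucket, its accuracy depends on where the bucket boundaries sit. A P95 of 0.75s here only means "somewhere between 0.5s and 1s".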

Logs — the new stack centered on Loki #

The older standard for log collection was the EFK stack (Elasticsearch + Fluentd + Kibana). Powerful, but heavy — Elasticsearch full-text indexes every log body, consuming considerable disk and memory.

Loki is a lightweight alternative made by Grafana Labs. The model is different — log bodies aren’t indexed, only labels are. At search time, narrow down by label and then grep the body. It brings a label model similar to Prometheus to logs.
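The "index labels, grep bodies" idea can be illustrated with a toy in-memory store. This is only a sketch of the concept, not of Loki's storage engine; TinyLogStore and the sample log lines are hypothetical.

```python
from collections import defaultdict


class TinyLogStore:
    """Toy log store: streams are indexed by label set, lines are stored raw."""

    def __init__(self):
        # frozenset of (label, value) pairs -> list of raw log lines
        self.streams = defaultdict(list)

    def push(self, labels, line):
        self.streams[frozenset(labels.items())].append(line)

    def query(self, selector, contains):
        """Pick streams by label equality, then substring-scan their lines."""
        wanted = set(selector.items())
        hits = []
        for stream_labels, lines in self.streams.items():
            if wanted <= stream_labels:  # the label index narrows the search
                hits.extend(line for line in lines if contains in line)  # brute-force scan
        return hits


store = TinyLogStore()
store.push({"namespace": "payments", "pod": "checkout-abc123"}, "ERROR card declined")
store.push({"namespace": "payments", "pod": "checkout-abc123"}, "INFO order created")
store.push({"namespace": "shipping", "pod": "label-xyz789"}, "ERROR printer offline")

result = store.query({"namespace": "payments"}, "ERROR")
print(result)
```

The cost model follows directly: ingest is cheap because lines are never tokenized, and query speed depends on how far the labels narrow the set of streams before the scan starts.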

Components of the Loki stack #

Component | Role
Loki | Log storage + querying
Promtail (or Fluent Bit) | Reads logs from each node and sends to Loki. DaemonSet.
Grafana | LogQL queries + visualization

Promtail reads each node’s /var/log/containers/, automatically attaches K8s metadata (Pod name, namespace, container name, labels) as labels, and sends to Loki. Without separate application changes, every container’s stdout/stderr is collected as is.

LogQL — Loki’s query language #

LogQL is modeled on PromQL and reads much the same way.

ERROR logs in the payments namespace
{namespace="payments"} |= "ERROR"
Specific Pod's error rate (converted to metric)
sum(rate({pod="checkout-abc123"} |= "ERROR" [5m]))

{...} for label filtering, |= for body substring matching, |~ for regex matching. With rate() it can be treated like a metric, so it can be drawn alongside metric charts in Grafana dashboards.

Loki vs EFK selection criteria #

Dimension | Loki | EFK
Indexing | Labels only | Whole body
Disk cost | Low | High
Full-text search | Grep (slow) | Fast
Operational burden | Low | High (operating Elasticsearch cluster)
Grafana integration | First-class | Possible

For new deployments, Loki is closer to the standard. If full-text search is a core requirement, EFK or OpenSearch is more suitable, but for the day-to-day debugging of K8s operations, Loki’s label + grep model is sufficient.

Traces — integration centered on OpenTelemetry #

The old standards for distributed tracing were split into two — OpenTracing and OpenCensus. The two projects merged to create OpenTelemetry (OTel). It’s now a single standard handling distributed tracing, metrics, and logs together, and one of the most active projects in CNCF.

OpenTelemetry’s core concepts are:

  • Instrumentation libraries — per-language SDKs inserted into application code to create traces. Auto-instrumentation tools (Java agents, etc.) often attach without code changes.
  • OpenTelemetry Collector — receives data sent by applications, processes and routes it. Generally deployed as DaemonSet or Deployment in K8s.
  • Backend — actually stores and visualizes traces. Jaeger, Tempo, Datadog, Honeycomb, etc.
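What instrumentation libraries do across service boundaries is pass trace context along with each request. OpenTelemetry SDKs commonly use the W3C Trace Context traceparent header for this, which has the shape version-traceid-parentid-flags. The sketch below only illustrates that header format. The function names are hypothetical, and a real SDK generates and parses the header for you.

```python
import secrets


def new_traceparent():
    """Start a new trace: random 16-byte trace ID and 8-byte span ID, sampled flag set."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    return f"00-{trace_id}-{span_id}-01"


def child_traceparent(incoming_header):
    """Continue an incoming trace: keep the trace ID, mint a new span ID."""
    version, trace_id, _caller_span_id, flags = incoming_header.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


gateway_header = new_traceparent()                    # created at the edge
orders_header = child_traceparent(gateway_header)     # sent on the downstream call

print(gateway_header)
print(orders_header)
```

The shared trace ID is what lets a backend stitch work from different services into one trace, and the changing parent ID is what gives that trace its parent-child structure.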

Trace model — a tree of Spans #

The unit data of distributed tracing is the Span. As a single request passes through multiple services, one span is created at each step, bound in parent-child relationships forming a tree.

Example span tree of one request
[gateway] /api/orders POST  (200ms)
 ├─ [orders-service] create order   (180ms)
 │   ├─ [postgres] INSERT orders    (15ms)
 │   ├─ [postgres] INSERT items     (12ms)
 │   └─ [kafka] publish order.created (45ms)
 └─ [auth-service] verify token     (10ms)

Looking at this tree, you can see at a glance which segment of the 200ms took the most time. Traces are invaluable for narrowing down which service is the cause when P99 latency is elevated.
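One way to read such a tree programmatically is to compute each span's self time: its own duration minus the time spent in its children. The sketch below uses the durations from the example tree and assumes child spans run one after another without overlapping. The Span class and self_times helper are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    duration_ms: int
    children: list = field(default_factory=list)


def self_times(span):
    """Yield (name, self_time_ms) for every span in the tree, parents first."""
    yield span.name, span.duration_ms - sum(c.duration_ms for c in span.children)
    for child in span.children:
        yield from self_times(child)


root = Span("gateway /api/orders POST", 200, [
    Span("orders-service create order", 180, [
        Span("postgres INSERT orders", 15),
        Span("postgres INSERT items", 12),
        Span("kafka publish order.created", 45),
    ]),
    Span("auth-service verify token", 10),
])

slowest = max(self_times(root), key=lambda item: item[1])
print(slowest)  # ('orders-service create order', 108)
```

In this example the largest share, 108ms, is spent inside orders-service itself rather than in its database or Kafka calls, which is the kind of conclusion that is hard to reach from metrics alone.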

Tempo: the trace-store counterpart to Loki #

Grafana Labs also makes a lightweight trace store: Tempo. What Loki is for logs, Tempo is for traces. It keeps indexing to a minimum and stores trace data on object storage (S3 / GCS). It is optimized for direct lookup by trace ID, and when used with Loki and Prometheus, moving from metric to log to trace becomes a natural workflow in Grafana.

Grafana — the standard for visualization #

The tool that looks at the three axes’ data in one place is Grafana. Almost every data source — Prometheus, Loki, Tempo, Elasticsearch, CloudWatch — can be bound into a single dashboard, with each panel pulling data via its own query language.

The standard dashboard set for an operational cluster usually consists of:

  • Cluster overview — per-node CPU/memory/disk, Pod count, per-namespace resource use
  • Workload overview — per-Deployment/StatefulSet replica state, restart count, OOMKilled
  • API server health — request rate, error rate, P99 latency, etcd lag
  • Per application — business metrics + 4 golden signals (latency, traffic, errors, saturation)

Adopting kube-prometheus-stack brings cluster, workload, and API server dashboards together pre-configured. Only application dashboards need to be newly made tailored to the domain.

Alerting — the role of Alertmanager #

Sending alerts when metrics meet certain conditions is handled by Alertmanager, not Prometheus itself. Prometheus evaluates alert rules and sends triggered alerts to Alertmanager, which handles routing, grouping, silencing, and repetition.

PrometheusRule — CRD for alert definition #

prometheusrule-high-error-rate.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: app-alerts
  namespace: my-app
  labels:
    release: prometheus
spec:
  groups:
    - name: my-app
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{status=~"5..", app="my-app"}[5m]))
              / sum(rate(http_requests_total{app="my-app"}[5m])) > 0.05
          for: 5m
          labels:
            severity: warning
            team: payments
          annotations:
            summary: "High 5xx rate on my-app ({{ $value | humanizePercentage }})"
            description: "5xx rate over the last 5 minutes is above 5%."

expr is the condition Prometheus evaluates, and for: 5m means “the alert triggers when this condition is true for 5 consecutive minutes.” severity and team in labels are used as routing keys in Alertmanager.
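The effect of for can be sketched as a small state machine: the alert is pending from the first evaluation where the condition holds, and becomes firing once the condition has held continuously for the configured duration. This is a simplified model under the assumption of one evaluation per minute. The function name and the timeline are hypothetical.

```python
def alert_states(evaluations, hold_seconds):
    """Return the alert state after each rule evaluation.

    evaluations: list of (timestamp_seconds, condition_is_true), in time order.
    States are "inactive", "pending", or "firing".
    """
    states = []
    active_since = None
    for ts, is_true in evaluations:
        if not is_true:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = ts
        states.append("firing" if ts - active_since >= hold_seconds else "pending")
    return states


# Condition is true at t=0 and t=60, drops at t=120, then stays true from t=180.
timeline = [(0, True), (60, True), (120, False), (180, True), (240, True),
            (300, True), (360, True), (420, True), (480, True)]
states = alert_states(timeline, hold_seconds=300)
print(states)
```

The short spike at the start never fires, and the sustained problem fires only at t=480, five minutes after it began. Only firing alerts are sent on to Alertmanager.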

Alertmanager routing #

Alertmanager configuration decides where to send alerts based on labels.

alertmanager.yaml — simplified
route:
  receiver: default
  group_by: ['alertname', 'team']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Routes are tried in order; the first match wins unless continue: true is set.
    - matchers:
        - severity="critical"
      receiver: pagerduty
    - matchers:
        - team="payments"
      receiver: payments-slack

receivers:
  - name: default
    slack_configs:
      - channel: '#alerts'
  - name: pagerduty
    pagerduty_configs:
      - service_key: ...
  - name: payments-slack
    slack_configs:
      - channel: '#payments-alerts'

Thanks to this model, alert routing is managed as code (manifest), branching to channels like Slack / PagerDuty / Email.
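The routing decision in the configuration above can be sketched as a first-match walk over the child routes, falling back to the top-level receiver. This simplified model handles only equality matching on a single level and ignores continue, nested routes, and regex matchers. The pick_receiver helper is hypothetical.

```python
def pick_receiver(alert_labels, routes, default_receiver):
    """Return the receiver of the first route whose labels all match the alert."""
    for route in routes:
        if all(alert_labels.get(k) == v for k, v in route["match"].items()):
            return route["receiver"]
    return default_receiver


# Same order as the routes in the configuration.
routes = [
    {"match": {"severity": "critical"}, "receiver": "pagerduty"},
    {"match": {"team": "payments"}, "receiver": "payments-slack"},
]

critical_payments = pick_receiver({"severity": "critical", "team": "payments"}, routes, "default")
warning_payments = pick_receiver({"severity": "warning", "team": "payments"}, routes, "default")
warning_other = pick_receiver({"severity": "warning", "team": "search"}, routes, "default")
print(critical_payments, warning_payments, warning_other)
```

Route order matters: a critical alert from the payments team goes to PagerDuty only, because the first matching route ends the search.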

Operational principles to lock in #

The observability stack, once well set up, dramatically deepens cluster visibility, but if operated poorly it consumes a large share of cluster resources. Locking in the following four principles is recommended.

1. Beware of metric cardinality explosion #

Prometheus creates a separate time series for each unique combination of labels. Putting high-cardinality values like user IDs, request IDs, or UUIDs into labels causes the series count to explode, and Prometheus burns through memory quickly. The cardinal rule is: don’t put high-cardinality values in labels.

Cardinality check — metrics with the most series
topk(10, count by (__name__)({__name__=~".+"}))

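Periodically checking which metrics have the most series via this query is part of operations. The query touches every series, so on a large Prometheus it is itself expensive and best run sparingly.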
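Since one series exists per unique label combination, the worst-case series count for a metric is the product of the number of distinct values per label. A quick sketch with hypothetical label counts shows why a single high-cardinality label is enough to cause trouble:

```python
import math


def max_series(distinct_values_per_label):
    """Upper bound on series for one metric: product of distinct values per label."""
    return math.prod(distinct_values_per_label.values())


# Hypothetical label cardinalities for a request counter.
base = {"method": 5, "status": 10, "pod": 20}
with_user_id = {**base, "user_id": 100_000}

print(max_series(base))          # 1000
print(max_series(with_user_id))  # 100000000
```

The real count is usually below this bound because not every combination occurs, but the multiplication explains why series counts grow so fast.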

2. Retention period and remote storage #

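Prometheus’s local storage retention defaults to 15 days. It can be raised with the --storage.tsdb.retention.time flag, but local storage lives on a single node’s disk, so longer retention is usually handled by shipping data to a long-term store (Thanos, Cortex, Mimir, VictoriaMetrics). Similarly, backing Loki and Tempo with long-term object storage (S3 / GCS) is the standard.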

Retention is directly tied to cost. A typical starting point is 6 months for metrics, 30 days for logs, and 7 days for traces — each axis configured independently.
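For the metrics axis, the link between retention and cost can be estimated with the rule of thumb from the Prometheus storage documentation: disk needed is roughly retention time in seconds times ingested samples per second times bytes per sample, at about 1 to 2 bytes per sample. The ingestion rate below is a hypothetical figure, and the function name is illustrative.

```python
def prometheus_disk_bytes(retention_days, samples_per_second, bytes_per_sample=2.0):
    """Rough disk estimate: retention_seconds * samples/s * bytes/sample."""
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


# Hypothetical cluster ingesting 100,000 samples per second.
default_retention = prometheus_disk_bytes(retention_days=15, samples_per_second=100_000)
six_months = prometheus_disk_bytes(retention_days=180, samples_per_second=100_000)

print(f"{default_retention / 1e9:.1f} GB")  # 259.2 GB
print(f"{six_months / 1e9:.1f} GB")         # 3110.4 GB
```

Disk use scales linearly with retention, so a twelvefold longer window costs roughly twelve times the storage unless older data is downsampled or moved to cheaper object storage.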

3. Alert SNR — too many alerts equals no alerts #

Too many alerts cause operators to start ignoring them, and the truly important ones get missed. The guiding principle of alert design is “one alert = one immediate human action required.”

  • Symptom-based: alert on symptoms, not causes. Not “DB connection pool is 80% full” but “API’s 5xx rate exceeds 5%.”
  • Eliminate noise with the for period, so that alerts don’t ring on short spikes.
  • Severity tiers: critical is what wakes you up, warning is what can wait until morning. When the tiers blur, both start being ignored.

4. Golden signals as standard #

The 4 golden signals from Google’s SRE practice are the usual starting point for monitoring almost any workload.

Signal | Meaning
Latency | Request processing time (P50 / P95 / P99)
Traffic | Requests per second
Errors | Failure rate
Saturation | Resource saturation (CPU, memory, queue length)

Exposing these four signals in a consistent form across all services standardizes both dashboards and alerts, with domain-specific metrics layered on top.

Closing #

That completes one pass through the K8s operational observability stack: Prometheus + kube-state-metrics + node-exporter for metrics, Loki (or EFK) for logs, OpenTelemetry + Tempo for traces, Grafana for visualization, and Alertmanager for alerting. This bundle is close to a settled standard, and kube-prometheus-stack along with the Loki / Tempo Helm charts are the natural first step of adoption. Locking in the four principles of cardinality / retention / alert SNR / golden signals as operational guardrails keeps the stack from consuming cluster resources unnecessarily while still deepening visibility. The next post, the last in the K8s Advanced series, covers GitOps with ArgoCD and Flux: the operational model that places the source of truth for manifests in git.
