K8s Advanced #5: Observability — Prometheus / Grafana / Loki / OpenTelemetry
This is the fifth post in the K8s Advanced series. The topics covered so far (CNI, RBAC / IRSA, Admission, and CRD / Operator) all shape cluster behavior. One more layer is needed on top: a way to see whether all those components are actually running well, which is observability. It answers questions such as which Pod uses how much memory, which Service has higher-than-usual latency, and how a single request flowed through the microservices. These three kinds of questions map to metrics, logs, and traces, and the standard K8s operational stack is built on top of them.
The K8s Advanced series has six posts.
- #1 CNI in depth — Calico / Cilium / eBPF
- #2 RBAC / ServiceAccount in depth — Aggregated ClusterRole / Impersonation / IRSA / Workload Identity
- #3 Admission Controller — OPA Gatekeeper / Kyverno
- #4 CRD and the Operator pattern — controller-runtime
- #5 Observability — Prometheus / Grafana / Loki / OpenTelemetry ← this post
- #6 GitOps — ArgoCD / Flux
Three axes of observability #
Observability is commonly split into three kinds of data.
| Axis | What it is | Question |
|---|---|---|
| Metrics | Numeric values sampled over time (time series) | “What’s happening right now” |
| Logs | Text records of events | “What were the details of that event” |
| Traces | A single request’s path through multiple services | “Why was this request slow” |
These three are complementary. A typical debugging pattern is to spot an anomaly via metrics, examine the details via logs, and then narrow down which segment of the request path is the problem via traces. Having all three in operational clusters is the standard, and the K8s tools for each axis are nearly settled.
Metrics — the standard stack centered on Prometheus #
The de facto standard for K8s metrics is Prometheus. It is a CNCF graduated project, and K8s’s own components (API server, kubelet, controller-manager, scheduler) all expose metrics in a format Prometheus understands. Prometheus’s model is simple.
- Pull-based: Prometheus periodically scrapes each target’s /metrics endpoint via HTTP.
- Time-series database: scraped data is stored as labeled time series.
- PromQL: its own query language for querying time series.
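To make the pull model concrete, here is a minimal sketch of what a target’s /metrics response body looks like in the Prometheus text exposition format. The `render_metric` helper and the sample label values are hypothetical, and a real application would normally use an official client library rather than hand-rendering this text.

```python
# Toy renderer for the Prometheus text exposition format.
# render_metric and the sample values are hypothetical illustrations.

def render_metric(name, help_text, metric_type, samples):
    """Render one metric family; samples is a list of (labels_dict, value)."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        # Each unique label combination becomes its own time series.
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


body = render_metric(
    "http_requests_total",
    "Total HTTP requests.",
    "counter",
    [
        ({"method": "GET", "status": "200"}, 1027),
        ({"method": "POST", "status": "500"}, 3),
    ],
)
print(body)
```

Prometheus fetches text like this on every scrape and stores each line as one sample of a labeled series.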
Standard stack components #
Installing Prometheus on an operational cluster nearly always brings these components together.
| Component | Role |
|---|---|
| Prometheus Server | Metric collection + storage + querying |
| kube-state-metrics | Exposes the state of K8s objects (Deployment, Pod, Node, etc.) as metrics |
| node-exporter | Exposes each node’s system metrics (CPU, memory, disk). DaemonSet, one per node. |
| Alertmanager | Alert routing, grouping, silencing |
| Pushgateway (optional) | Receives metrics via push from short-lived Jobs |
The standard way to install this bundle at once is the kube-prometheus-stack Helm chart, which has become the de facto first step for operational cluster observability.
ServiceMonitor / PodMonitor — the role of Prometheus Operator #
Rather than writing Prometheus scrape targets directly in config files, Prometheus Operator introduces two CRDs.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: my-app
  labels:
    release: prometheus   # matches kube-prometheus-stack's selector
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics

Applying this ServiceMonitor causes Prometheus Operator to automatically update the Prometheus configuration to scrape metrics from that Service’s Pods every 30 seconds. It is a K8s-native way of declaring scrape targets via manifest.
PodMonitor is a variant that selects Pods directly, without going through a Service. Thanks to these two CRDs, application teams only need to write a ServiceMonitor alongside their Service, and metric collection starts automatically.
One-line PromQL examples #
PromQL is a deep topic in its own right, but a few patterns come up frequently in operations:
# Memory usage per Pod in the payments namespace
sum(container_memory_usage_bytes{namespace="payments"}) by (pod)

# Ratio of 5xx responses to all requests over the last 5 minutes
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# P95 request latency per service
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)

rate() computes the per-second rate of increase of a counter, and histogram_quantile() estimates a specific quantile from histogram buckets. These two functions cover about 70% of PromQL usage.
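As a rough illustration of what these two functions compute, the sketch below reimplements them in plain Python. It is a simplification under stated assumptions: `simple_rate` handles counter resets but omits the window-edge extrapolation Prometheus applies, and `histogram_quantile` covers only classic cumulative buckets with linear interpolation. The sample data is made up.

```python
import math


def simple_rate(samples):
    """Per-second increase of a counter over a window.

    samples: list of (timestamp_seconds, value), oldest first.
    Simplified: handles counter resets, no window-edge extrapolation.
    """
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop in value means the counter restarted from zero.
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative buckets [(le, count), ...].

    Buckets must be sorted by le and end with (math.inf, total).
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_le, prev_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank:
            if math.isinf(le) or count == prev_count:
                # In the +Inf bucket, report the highest finite bound.
                return prev_le
            # Linear interpolation inside the matching bucket.
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return math.nan


# Hypothetical counter samples: the value drops at t=30 (process restart).
counter = [(0, 100), (15, 130), (30, 10), (45, 40)]
print(simple_rate(counter))  # 70 requests over 45 seconds

# Hypothetical latency buckets: 90 of 100 requests finished within 0.5s.
latency = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
print(histogram_quantile(0.95, latency))  # lands inside the 0.5s to 1.0s bucket
```

The interpolation step is why quantile accuracy depends on bucket boundaries: the estimate can never be more precise than the width of the bucket it falls into.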
Logs — the new stack centered on Loki #
The older standard for log collection was the EFK stack (Elasticsearch + Fluentd + Kibana). It is powerful but heavy: Elasticsearch builds a full-text index over every log body, which consumes considerable disk and memory.
Loki is a lightweight alternative made by Grafana Labs. Its model is different: log bodies aren’t indexed, only labels are. At query time, you narrow down by label and then grep the bodies of the matching streams. It brings a label model similar to Prometheus’s to logs.
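The “index labels, grep bodies” idea can be sketched with a toy in-memory store. This is not how Loki is implemented internally; `ToyLogStore` and its methods are hypothetical names used only to show the two-step lookup.

```python
from collections import defaultdict


class ToyLogStore:
    """Hypothetical label-indexed log store; bodies are never indexed."""

    def __init__(self):
        self.index = defaultdict(set)  # (label, value) -> set of stream ids
        self.streams = {}              # stream id -> list of raw log lines

    def push(self, labels, line):
        # A stream is identified by its full label set.
        stream_id = tuple(sorted(labels.items()))
        self.streams.setdefault(stream_id, []).append(line)
        for pair in stream_id:
            self.index[pair].add(stream_id)

    def query(self, selector, contains):
        # Step 1: narrow to matching streams using the label index only.
        candidates = [self.index.get(pair, set()) for pair in selector.items()]
        matching = set.intersection(*candidates) if candidates else set(self.streams)
        # Step 2: brute-force scan ("grep") the bodies of those streams.
        return [
            line
            for stream_id in sorted(matching)
            for line in self.streams[stream_id]
            if contains in line
        ]


store = ToyLogStore()
store.push({"namespace": "payments", "pod": "checkout-abc123"}, "ERROR card declined")
store.push({"namespace": "payments", "pod": "checkout-abc123"}, "INFO order accepted")
store.push({"namespace": "search", "pod": "indexer-1"}, "ERROR timeout")

hits = store.query({"namespace": "payments"}, "ERROR")
print(hits)
```

The index stays small because it grows with the number of label combinations, not with log volume. The cost moves to query time, where every line in the selected streams is scanned.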
Components of the Loki stack #
| Component | Role |
|---|---|
| Loki | Log storage + querying |
| Promtail (or Fluent Bit) | Reads logs from each node and sends to Loki. DaemonSet. |
| Grafana | LogQL queries + visualization |
Promtail reads each node’s /var/log/containers/, automatically attaches K8s metadata (Pod name, namespace, container name, labels) as labels, and sends the logs to Loki. Every container’s stdout/stderr is collected as is, with no application changes required.
LogQL — Loki’s query language #
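LogQL is modeled on PromQL.

# Lines containing "ERROR" in the payments namespace
{namespace="payments"} |= "ERROR"

# Per-second rate of "ERROR" lines for one Pod over the last 5 minutes
sum(rate({pod="checkout-abc123"} |= "ERROR" [5m]))

{...} is for label filtering, |= for body substring matching, and |~ for regex matching. Wrapped in rate(), a log query can be treated like a metric, so it can be drawn alongside metric charts in Grafana dashboards.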
Loki vs EFK: how to choose #
| Dimension | Loki | EFK |
|---|---|---|
| Indexing | Labels only | Whole body |
| Disk cost | Low | High |
| Full-text search | Grep (slow) | Fast |
| Operational burden | Low | High (operating Elasticsearch cluster) |
| Grafana integration | First-class | Possible |
For new deployments, Loki is closer to the standard. If full-text search is a core requirement, EFK or OpenSearch is more suitable, but for the day-to-day debugging of K8s operations, Loki’s label + grep model is sufficient.
Traces — integration centered on OpenTelemetry #
The old standards for distributed tracing were split into two — OpenTracing and OpenCensus. The two projects merged to create OpenTelemetry (OTel). It’s now a single standard handling distributed tracing, metrics, and logs together, and one of the most active projects in CNCF.
OpenTelemetry’s core concepts are:
- Instrumentation libraries — per-language SDKs inserted into application code to create traces. Auto-instrumentation tools (Java agents, etc.) often attach without code changes.
- OpenTelemetry Collector — receives data sent by applications, processes and routes it. Generally deployed as DaemonSet or Deployment in K8s.
- Backend — actually stores and visualizes traces. Jaeger, Tempo, Datadog, Honeycomb, etc.
Trace model — a tree of Spans #
The basic unit of data in distributed tracing is the span. As a single request passes through multiple services, one span is created at each step, and the spans are linked in parent-child relationships that form a tree.
[gateway] /api/orders POST (200ms)
├─ [orders-service] create order (180ms)
│ ├─ [postgres] INSERT orders (15ms)
│ ├─ [postgres] INSERT items (12ms)
│ └─ [kafka] publish order.created (45ms)
└─ [auth-service] verify token (10ms)

Looking at this tree, you can see at a glance which segment of the 200ms took the most time. Traces are invaluable for narrowing down which service is the cause when P99 latency is elevated.
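Reading the tree by eye can be turned into a small calculation: subtract the children’s durations from each span to get its self time. The sketch below uses the durations from the tree above and assumes child spans run sequentially without overlapping, which real traces do not guarantee. The `Span` class and `self_times` function are hypothetical, not an OpenTelemetry API.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    service: str
    name: str
    duration_ms: int
    children: list = field(default_factory=list)


def self_times(span, out=None):
    """Map (service, name) to the time not covered by child spans.

    Assumes children run sequentially and do not overlap.
    """
    if out is None:
        out = {}
    child_total = sum(child.duration_ms for child in span.children)
    out[(span.service, span.name)] = span.duration_ms - child_total
    for child in span.children:
        self_times(child, out)
    return out


trace = Span("gateway", "/api/orders POST", 200, [
    Span("orders-service", "create order", 180, [
        Span("postgres", "INSERT orders", 15),
        Span("postgres", "INSERT items", 12),
        Span("kafka", "publish order.created", 45),
    ]),
    Span("auth-service", "verify token", 10),
])

times = self_times(trace)
slowest = max(times, key=times.get)
print(slowest, times[slowest])
```

In this example most of the 200ms is spent inside orders-service itself rather than in its database or Kafka calls, which is the kind of conclusion a flat latency metric cannot give.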
Tempo, a trace store in the same mold as Loki #
Grafana Labs also makes Tempo, a trace store with a similarly lightweight model. What Loki is for logs, Tempo is for traces. It minimizes indexing and stores trace data on object storage (S3 / GCS). It is optimized for direct lookup by trace ID, and when used with Loki and Prometheus, moving from metric to log to trace becomes a natural path in Grafana.
Grafana — the standard for visualization #
The tool that looks at the three axes’ data in one place is Grafana. Almost every data source — Prometheus, Loki, Tempo, Elasticsearch, CloudWatch — can be bound into a single dashboard, with each panel pulling data via its own query language.
The standard dashboard set for an operational cluster usually consists of:
- Cluster overview — per-node CPU/memory/disk, Pod count, per-namespace resource use
- Workload overview — per-Deployment/StatefulSet replica state, restart count, OOMKilled
- API server health — request rate, error rate, P99 latency, etcd lag
- Per application — business metrics + 4 golden signals (latency, traffic, errors, saturation)
Adopting kube-prometheus-stack brings pre-configured cluster, workload, and API server dashboards with it. Only application dashboards need to be built from scratch, tailored to the domain.
Alerting — the role of Alertmanager #
Sending alerts when metrics meet certain conditions is handled by Alertmanager, not Prometheus itself. Prometheus evaluates alert rules and sends triggered alerts to Alertmanager, which handles routing, grouping, silencing, and repetition.
PrometheusRule — CRD for alert definition #
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: app-alerts
  namespace: my-app
  labels:
    release: prometheus
spec:
  groups:
    - name: my-app
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{status=~"5..", app="my-app"}[5m]))
            / sum(rate(http_requests_total{app="my-app"}[5m])) > 0.05
          for: 5m
          labels:
            severity: warning
            team: payments
          annotations:
            summary: "High 5xx rate on my-app ({{ $value | humanizePercentage }})"
            description: "5xx rate over the last 5 minutes is above 5%."

expr is the condition Prometheus evaluates, and for: 5m means “the alert triggers when this condition is true for 5 consecutive minutes.” severity and team in labels are used as routing keys in Alertmanager.
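The effect of the for clause is easiest to see as a small state machine: an alert becomes pending when the expression first evaluates true, and fires only once it has stayed true for the whole duration. The sketch below is a simplified model, not Prometheus code. The `alert_states` function and the 60-second evaluation interval are assumptions for illustration.

```python
def alert_states(evaluations, for_seconds):
    """Return the alert state after each rule evaluation.

    evaluations: list of (timestamp_seconds, condition_is_true), oldest first.
    """
    states = []
    active_since = None
    for ts, is_true in evaluations:
        if not is_true:
            # Any false evaluation resets the clock.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = ts
        held_for = ts - active_since
        states.append("firing" if held_for >= for_seconds else "pending")
    return states


# Hypothetical evaluations every 60s: a short spike, then a sustained problem.
evals = [
    (0, True), (60, True), (120, False),
    (180, True), (240, True), (300, True),
    (360, True), (420, True), (480, True),
]
states = alert_states(evals, for_seconds=300)
print(states)
```

The two-minute spike at the start never fires. Only the condition that holds from t=180 to t=480 reaches the firing state and is sent to Alertmanager.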
Alertmanager routing #
Alertmanager configuration decides where to send alerts based on labels.
route:
  receiver: default
  group_by: ['alertname', 'team']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: pagerduty
    - match:
        team: payments
      receiver: payments-slack
receivers:
  - name: default
    slack_configs:
      - channel: '#alerts'
  - name: pagerduty
    pagerduty_configs:
      - service_key: ...
  - name: payments-slack
    slack_configs:
      - channel: '#payments-alerts'

Thanks to this model, alert routing is managed as code (manifest), branching to channels like Slack / PagerDuty / Email.
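One detail of this routing tree is worth spelling out: child routes are checked in order, and by default the first match wins. The sketch below models only that behavior for the configuration above. It ignores grouping, timing, regex matchers, and the continue option, and `pick_receiver` is a hypothetical helper, not part of Alertmanager.

```python
# Simplified copy of the routing tree from the configuration above.
ROUTE_TREE = {
    "receiver": "default",
    "routes": [
        {"match": {"severity": "critical"}, "receiver": "pagerduty"},
        {"match": {"team": "payments"}, "receiver": "payments-slack"},
    ],
}


def pick_receiver(route, labels):
    """Return the receiver of the first matching child route, else this route's."""
    for child in route.get("routes", []):
        wanted = child.get("match", {})
        if all(labels.get(key) == value for key, value in wanted.items()):
            return pick_receiver(child, labels)
    return route["receiver"]


print(pick_receiver(ROUTE_TREE, {"severity": "warning", "team": "payments"}))
print(pick_receiver(ROUTE_TREE, {"severity": "critical", "team": "payments"}))
print(pick_receiver(ROUTE_TREE, {"severity": "warning", "team": "search"}))
```

A critical alert from the payments team goes to PagerDuty only, because the severity route is listed first. If both channels should receive it, the route order or the continue setting has to be adjusted.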
Operational principles to lock in #
The observability stack, once well set up, dramatically deepens cluster visibility, but if operated poorly it consumes a large share of cluster resources. Locking in the following four principles is recommended.
1. Beware of metric cardinality explosion #
Prometheus creates a separate time series for each unique combination of labels. Putting high-cardinality values like user IDs, request IDs, or UUIDs into labels causes the series count to explode, and Prometheus burns through memory quickly. The cardinal rule is: don’t put high-cardinality values in labels.
topk(10, count by (__name__)({__name__=~".+"}))

Periodically running this query to see which metrics have the highest series counts is part of routine operations.
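The arithmetic behind the explosion is a simple product: the worst-case series count for one metric is the number of distinct values of each label multiplied together. The label names and counts below are made-up numbers chosen only to show the scale.

```python
import math


def worst_case_series(label_cardinalities):
    """Upper bound on series for one metric: product of distinct label values."""
    return math.prod(label_cardinalities.values())


# Hypothetical label cardinalities for a single HTTP counter.
base = {"method": 5, "status": 10, "pod": 50}
with_user_id = {**base, "user_id": 100_000}

print(worst_case_series(base))          # bounded, low-cardinality labels
print(worst_case_series(with_user_id))  # same metric with one unbounded label added
```

Not every combination occurs in practice, so real counts are lower, but a single unbounded label still multiplies whatever was there before.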
2. Retention period and remote storage #
Prometheus’s local storage retention defaults to 15 days. To retain data beyond that, you must ship to remote storage (Thanos, Cortex, Mimir, VictoriaMetrics). Similarly, backing Loki and Tempo with long-term object storage (S3 / GCS) is the standard.
Retention is directly tied to cost. A typical starting point is 6 months for metrics, 30 days for logs, and 7 days for traces — each axis configured independently.
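For metrics, the link between retention and cost can be estimated with the sizing rule from the Prometheus storage documentation: disk space is roughly retention time in seconds, times ingested samples per second, times bytes per sample. The sketch below applies it. The series count, scrape interval, and the 2 bytes per sample figure are assumptions for illustration, and actual usage depends on how well the data compresses.

```python
def estimate_disk_bytes(active_series, scrape_interval_s, retention_days,
                        bytes_per_sample=2.0):
    """Rough local storage estimate for Prometheus.

    bytes_per_sample is an assumed average after compression.
    """
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


# Hypothetical cluster: one million active series scraped every 30 seconds.
fifteen_days = estimate_disk_bytes(1_000_000, 30, 15)
six_months = estimate_disk_bytes(1_000_000, 30, 180)

print(f"15 days:  {fifteen_days / 1e9:.1f} GB")
print(f"180 days: {six_months / 1e9:.1f} GB")
```

Because the estimate is linear in retention, extending from 15 days to 6 months multiplies storage by twelve. That is why long-term data is usually moved to remote storage, often downsampled, rather than kept on local disk.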
3. Alert SNR — too many alerts equals no alerts #
Too many alerts cause operators to start ignoring them, and the truly important ones get missed. The guiding principle of alert design is “one alert = one immediate human action required.”
- Symptom-based: alert on symptoms, not causes. Not “DB connection pool is 80% full” but “API’s 5xx rate exceeds 5%.”
- Eliminate noise with the for period: alerts don’t ring on short spikes.
- severity tiers: critical is what wakes you up, warning is what can wait until morning. When the tiers blur, both start being ignored.
4. Golden signals as standard #
The four golden signals from Google’s SRE practice are the usual starting point for monitoring almost any workload.
| Signal | Meaning |
|---|---|
| Latency | Request processing time (P50 / P95 / P99) |
| Traffic | Requests per second |
| Errors | Failure rate |
| Saturation | Resource saturation (CPU, memory, queue length) |
Exposing these four signals in a consistent form across all services standardizes both dashboards and alerts, with domain-specific metrics layered on top.
Closing #
That completes one pass through the K8s operational observability stack. Prometheus + kube-state-metrics + node-exporter for metrics, Loki (or EFK) for logs, OpenTelemetry + Tempo for traces, Grafana for visualization, and Alertmanager for alerting: this bundle is nearly a settled standard, and kube-prometheus-stack along with the Loki / Tempo Helm charts are the natural first step of adoption. Locking in the four principles of cardinality, retention, alert SNR, and golden signals as operational guardrails keeps the stack from consuming cluster resources unnecessarily while still deepening visibility. The next post, the last in the K8s Advanced series, covers GitOps with ArgoCD and Flux, an operational model that places the source of truth for manifests in git.