K8s Practice #5: Monitoring & Alerting — Prometheus / CloudWatch / Alertmanager

The fifth post in the K8s Practice series. Through #4, myshop-api now has automated version delivery, but half of operations is being able to see what that system is actually doing. Without visibility into how CPU, memory, request latency, and error rate change, neither canary auto-promotion nor incident response can keep up. This post lays out the observability stack on EKS. It takes the standard stack covered in Advanced #5 (Prometheus + Grafana + Loki + Alertmanager), adapts it to the EKS environment, and adds the AWS-managed CloudWatch Container Insights path alongside it.

This post is part 5 of the six-post K8s Practice series.

Combining two axes — in-cluster Prometheus + managed CloudWatch #

Observability in EKS environments typically combines two axes.

| Axis | Responsibility |
| --- | --- |
| In-cluster (Prometheus + Grafana + Loki) | Workload metrics, business metrics, alerts, dashboards |
| CloudWatch (Container Insights + Logs) | AWS managed metrics, log long-term retention, AWS console integration |

Using only one is possible, but the standard for production clusters is to combine both. Prometheus is the source of truth for operational metrics and alerts, while CloudWatch serves as the integration point for long-term retention and for metrics from AWS-native resources (RDS, ALB, EBS). Amazon Managed Service for Prometheus (AMP) and Amazon Managed Grafana (AMG) are increasingly viable options for reducing in-cluster operational overhead, but this post focuses on the most common in-cluster model.

kube-prometheus-stack — the standard bundle installed at once #

The standard Helm chart covered in Advanced #5. One command brings in Prometheus + Grafana + Alertmanager + kube-state-metrics + node-exporter + Prometheus Operator CRDs all at once.

Installation #

Install kube-prometheus-stack
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack \
  -n monitoring --create-namespace \
  --values prometheus-values.yaml
prometheus-values.yaml — key parts
prometheus:
  prometheusSpec:
    retention: 30d
    retentionSize: "50GB"

    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 100Gi

    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false
    ruleSelectorNilUsesHelmValues: false

    additionalScrapeConfigs:
      - job_name: ec2-spot-instance
        ec2_sd_configs:
          - region: ap-northeast-2

    remoteWrite:
      - url: https://aps-workspaces.ap-northeast-2.amazonaws.com/workspaces/ws-xxx/api/v1/remote_write
        sigv4:
          region: ap-northeast-2

grafana:
  adminPassword: ""
  ingress:
    enabled: true
    ingressClassName: alb
    annotations:
      alb.ingress.kubernetes.io/scheme: internet-facing
      alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'
      alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:...
    hosts:
      - grafana.myshop.example.com
  persistence:
    enabled: true
    storageClassName: gp3
    size: 10Gi

alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          resources:
            requests:
              storage: 10Gi
  config:
    route:
      receiver: default
    receivers:
      - name: default

A few key settings to call out:

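  • retention: 30d with storageSpec gives 30-day metric retention on a 100Gi gp3 EBS volume, and retentionSize: "50GB" caps the TSDB well below the volume size. To retain metrics beyond 30 days, also send them to AMP or Thanos via remoteWrite.
  • serviceMonitorSelectorNilUsesHelmValues: false lets Prometheus select ServiceMonitors regardless of the Helm release label, across all namespaces. ServiceMonitors in the myshop namespace are picked up even though they are not in the monitoring namespace. The podMonitor and rule variants do the same for PodMonitors and PrometheusRules.
  • remoteWrite sends a copy of the metrics to Amazon Managed Service for Prometheus (AMP) for long-term storage, for cases that need analysis beyond 30 days. The sigv4 block assumes the Prometheus service account has IAM permission to write to the workspace.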
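
As a rough sanity check on the retention and volume sizes above, the Prometheus documentation gives a back-of-the-envelope formula: retention seconds × ingested samples per second × bytes per sample (typically 1 to 2 bytes). The sketch below applies it. The series count and scrape interval are hypothetical inputs, not measurements from myshop.

```python
def estimate_tsdb_bytes(active_series: int, scrape_interval_s: float,
                        retention_days: float, bytes_per_sample: float = 2.0) -> float:
    """Rough Prometheus disk estimate.

    needed = retention_seconds * ingested_samples_per_second * bytes_per_sample
    """
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


if __name__ == "__main__":
    # Hypothetical cluster: 200k active series, 30s scrape interval, 30 days.
    needed = estimate_tsdb_bytes(200_000, 30, 30)
    print(f"estimated TSDB size: {needed / 1e9:.1f} GB")
```

With these assumed numbers the estimate lands at roughly 34.6 GB, which fits under the 50GB retentionSize cap. A real cluster should plug in its own active series count.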

Checks right after installation #

Basic health check
kubectl get pods -n monitoring
kubectl get servicemonitors -A
kubectl get prometheusrules -A
Expected output
NAME                                                 READY   STATUS    RESTARTS
prometheus-grafana-xxx                               3/3     Running   0
prometheus-kube-prometheus-operator-xxx              1/1     Running   0
prometheus-kube-state-metrics-xxx                    1/1     Running   0
prometheus-prometheus-kube-prometheus-prometheus-0   2/2     Running   0
prometheus-prometheus-node-exporter-xxx              1/1     Running   0
alertmanager-prometheus-kube-prometheus-alertmanager-0  2/2  Running   0

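The chart ships with a large set of default alerting and recording rules (well over 100). K8s-native alerts for node down, kubelet issues, and similar conditions are pre-defined, so cluster incidents surface as alerts without additional configuration. On EKS the control plane is managed, so etcd, kube-scheduler, and kube-controller-manager are generally not scrapable from inside the cluster; the corresponding default targets and rules are commonly disabled in the values file.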

Adding metric exposure to myshop-api #

Here is the full cycle for exposing myshop-api’s metrics on a cluster with the standard stack installed.

1. Application exposes /metrics #

Prometheus client libraries exist in nearly every language. For Python (FastAPI), it starts with a single line:

myshop-api/main.py — Prometheus metrics exposure
from fastapi import FastAPI
from prometheus_fastapi_instrumentator import Instrumentator

app = FastAPI()
Instrumentator().instrument(app).expose(app)

This single line auto-exposes the following metrics:

  • http_requests_total{handler, method, status} — request counter
  • http_request_duration_seconds_bucket{handler, method} — latency histogram
  • http_request_size_bytes / http_response_size_bytes — payload size
  • Standard Python runtime and process metrics from the client library (GC, CPU time, memory, open file descriptors)
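
The latency histogram in the list above is exposed as cumulative buckets, and PromQL's histogram_quantile estimates a percentile from them by linear interpolation inside the bucket that contains the target rank. The following is a simplified sketch of that estimation for positive-valued buckets, not the Prometheus implementation itself. The bucket boundaries and counts are made up for illustration.

```python
import math


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate a quantile from cumulative (le, count) buckets.

    `buckets` must include the +Inf bucket; counts are cumulative.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan

    rank = q * total
    # First bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]
    if math.isinf(upper):
        # Rank falls in the +Inf bucket: the estimate saturates at the
        # highest finite bucket boundary.
        return buckets[-2][0]
    lower, below = (0.0, 0.0) if index == 0 else buckets[index - 1]
    return lower + (upper - lower) * (rank - below) / (count - below)


if __name__ == "__main__":
    inf = math.inf
    fine = [(0.1, 50), (0.5, 80), (1.0, 94), (2.5, 99), (inf, 100)]
    coarse = [(0.1, 50), (0.5, 80), (1.0, 90), (inf, 100)]
    print(histogram_quantile(0.95, fine))    # interpolated inside (1.0, 2.5]
    print(histogram_quantile(0.95, coarse))  # saturates at 1.0
```

The second case matters for alerting: if the highest finite bucket is 1.0, the estimate can never exceed 1.0, so a threshold such as "P95 > 1.0" would never be crossed. It is worth verifying that the exposed buckets extend beyond any latency threshold used in alert rules.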

Domain-specific metrics (e.g., orders created counter, payment success rate) can be added on top.

Adding domain metrics
from prometheus_client import Counter, Histogram

orders_created = Counter(
    "myshop_orders_created_total",
    "Total orders created",
    ["status"]
)

checkout_duration = Histogram(
    "myshop_checkout_duration_seconds",
    "Checkout flow duration"
)
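
To make concrete what Prometheus actually scrapes once these metrics are incremented, here is a small stdlib-only stand-in for a labeled counter that renders the text exposition format. The LabeledCounter class is hypothetical and exists only for illustration; in the application itself the prometheus_client objects defined above do this work.

```python
class LabeledCounter:
    """Tiny illustrative stand-in for a labeled Prometheus counter."""

    def __init__(self, name: str, help_text: str, label_names: list):
        self.name = name
        self.help_text = help_text
        self.label_names = tuple(label_names)
        self._values = {}  # label-value tuple -> running total

    def inc(self, amount: float = 1.0, **labels: str) -> None:
        if set(labels) != set(self.label_names):
            raise ValueError(f"expected labels {self.label_names}")
        if amount < 0:
            raise ValueError("counters can only go up")
        key = tuple(labels[name] for name in self.label_names)
        self._values[key] = self._values.get(key, 0.0) + amount

    def render(self) -> str:
        """Render in the Prometheus text exposition format."""
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for key in sorted(self._values):
            pairs = ",".join(
                f'{name}="{value}"' for name, value in zip(self.label_names, key)
            )
            lines.append(f"{self.name}{{{pairs}}} {self._values[key]}")
        return "\n".join(lines) + "\n"


if __name__ == "__main__":
    orders = LabeledCounter("myshop_orders_created_total",
                            "Total orders created", ["status"])
    orders.inc(status="paid")
    orders.inc(status="paid")
    orders.inc(status="failed")
    print(orders.render(), end="")
```

With the real library, the equivalent calls would be orders_created.labels(status="paid").inc() and checkout_duration.observe(seconds). Each distinct label value becomes its own time series, which is why label values should stay within a small, bounded set.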

2. ServiceMonitor manifest #

Create a ServiceMonitor that Prometheus Operator will watch.

charts/myshop-api/templates/servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: {{ include "myshop-api.fullname" . }}
  namespace: {{ .Release.Namespace }}
  labels:
    app.kubernetes.io/name: myshop-api
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: myshop-api
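  endpoints:
    - port: http
      interval: 30s
      path: /metrics
      relabelings:
        # Copy the Service's app.kubernetes.io/name label into `app`,
        # so queries can select on app="myshop-api".
        - sourceLabels: [__meta_kubernetes_service_label_app_kubernetes_io_name]
          targetLabel: app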

From the moment this manifest is applied, Prometheus starts scraping /metrics on all myshop-api Pods every 30 seconds. Once data is visible in Grafana’s Explore using http_requests_total{namespace="myshop"}, metric collection is confirmed to be working.
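
When a ServiceMonitor is applied but no targets show up, the usual causes are a label mismatch, a namespace mismatch, or an unnamed Service port. The sketch below models that selection logic in simplified form, assuming equality-only matchLabels and no namespaceSelector (in which case only Services in the ServiceMonitor's own namespace are considered). The dict shapes and the function name are hypothetical simplifications of the real objects.

```python
def selected_endpoints(service_monitor: dict, services: list) -> list:
    """Return (service name, port number) pairs the ServiceMonitor would scrape."""
    wanted_labels = service_monitor["selector"]["matchLabels"]
    wanted_ports = {endpoint["port"] for endpoint in service_monitor["endpoints"]}
    hits = []
    for svc in services:
        if svc["namespace"] != service_monitor["namespace"]:
            continue  # no namespaceSelector: same namespace only
        labels = svc.get("labels", {})
        if any(labels.get(key) != value for key, value in wanted_labels.items()):
            continue  # label selector does not match this Service
        for port in svc.get("ports", []):
            # `port:` in the ServiceMonitor refers to the Service port *name*.
            if port.get("name") in wanted_ports:
                hits.append((svc["name"], port["port"]))
    return hits


if __name__ == "__main__":
    monitor = {
        "namespace": "myshop",
        "selector": {"matchLabels": {"app.kubernetes.io/name": "myshop-api"}},
        "endpoints": [{"port": "http"}],
    }
    services = [
        {"name": "myshop-api", "namespace": "myshop",
         "labels": {"app.kubernetes.io/name": "myshop-api"},
         "ports": [{"name": "http", "port": 8000}]},
        {"name": "myshop-api-unnamed-port", "namespace": "myshop",
         "labels": {"app.kubernetes.io/name": "myshop-api"},
         "ports": [{"port": 8000}]},
    ]
    print(selected_endpoints(monitor, services))
```

Only the first Service is selected here; the second has matching labels but no port named http, so it yields no scrape target.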

4 golden signals — the skeleton of the alert rule set #

Below are the 4 golden signals (Latency / Traffic / Errors / Saturation) covered in Advanced #5, written as a PrometheusRule for myshop-api.

charts/myshop-api/templates/prometheusrule.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: {{ include "myshop-api.fullname" . }}
  namespace: {{ .Release.Namespace }}
  labels:
    release: prometheus
spec:
  groups:
    - name: myshop-api.golden-signals
      interval: 30s
      rules:
        # Errors — 5xx rate
        - alert: MyshopApiHighErrorRate
          expr: |
            sum(rate(http_requests_total{app="myshop-api",status=~"5.."}[5m]))
              / sum(rate(http_requests_total{app="myshop-api"}[5m])) > 0.05
          for: 5m
          labels:
            severity: critical
            team: backend
          annotations:
            summary: "myshop-api 5xx rate > 5% ({{ "{{ $value | humanizePercentage }}" }})"
            description: "5xx rate stayed above 5% for 5+ minutes."
            runbook_url: "https://runbooks.myshop.example.com/myshop-api-5xx"

        # Latency — P95
        - alert: MyshopApiHighLatencyP95
          expr: |
            histogram_quantile(0.95,
              sum by (le) (rate(http_request_duration_seconds_bucket{app="myshop-api"}[5m]))
            ) > 1.0
          for: 10m
          labels:
            severity: warning
            team: backend
          annotations:
            summary: 'myshop-api P95 latency > 1s ({{ "{{ $value | printf \"%.2f\" }}" }}s)'

        # Traffic — sudden traffic drop (downstream failure signal)
        - alert: MyshopApiTrafficDrop
          expr: |
            sum(rate(http_requests_total{app="myshop-api"}[5m]))
              < 0.3 * sum(rate(http_requests_total{app="myshop-api"}[5m] offset 1h))
          for: 10m
          labels:
            severity: warning
            team: backend
          annotations:
            summary: "myshop-api traffic drop (less than 30% of last hour)"

        # Saturation — Pod memory utilization
        - alert: MyshopApiPodMemoryHigh
          expr: |
            sum by (pod) (
              container_memory_working_set_bytes{namespace="myshop",pod=~"myshop-api-.*",container!=""}
            ) / sum by (pod) (
              kube_pod_container_resource_limits{namespace="myshop",pod=~"myshop-api-.*",resource="memory"}
            ) > 0.85
          for: 10m
          labels:
            severity: warning
            team: backend
          annotations:
            summary: "myshop-api Pod memory > 85% of limit"

Three key patterns to note across each rule:

  • The for clause requires the condition to hold for 5–10 minutes, so alerts do not fire on short-lived spikes.
  • The severity label is critical for an immediate page and warning for next-business-day review. This label drives Alertmanager routing.
  • runbook_url links to the response procedure that the alert recipient can follow immediately, embodying the principle of one alert = one clear response. Only the first rule above carries one; the other rules need the same annotation.
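
The effect of the for clause is easiest to see as a small state machine: a rule whose expression is true goes to pending first and only becomes firing after it has stayed true for the full duration. A minimal simulation of that behavior follows; the sample values and the 60-second evaluation interval are hypothetical.

```python
def alert_states(samples: list, threshold: float,
                 for_seconds: int, interval_seconds: int) -> list:
    """Return 'inactive' / 'pending' / 'firing' for each evaluation."""
    states = []
    active_since = None  # time at which the condition last became true
    for i, value in enumerate(samples):
        now = i * interval_seconds
        if value > threshold:
            if active_since is None:
                active_since = now
            held_for = now - active_since
            states.append("firing" if held_for >= for_seconds else "pending")
        else:
            active_since = None  # any dip resets the timer
            states.append("inactive")
    return states


if __name__ == "__main__":
    # 5xx ratio per evaluation: a short spike, a dip, then a sustained breach.
    ratios = [0.02, 0.08, 0.09, 0.03, 0.07, 0.07, 0.07, 0.07, 0.07, 0.07]
    for t, state in enumerate(alert_states(ratios, 0.05, 300, 60)):
        print(f"t={t * 60:>3}s  {state}")
```

The two-evaluation spike never leaves pending, and the later breach only fires once it has held for the full 300 seconds. That delay is the price of filtering noise, which is why critical rules tend to use shorter for values than warnings.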

Alertmanager routing — Slack and PagerDuty branching #

The alert flow is Prometheus → Alertmanager → channel. Alertmanager inspects labels to determine routing.

alertmanager.yaml — operational routing
route:
  receiver: default
  group_by: ['alertname', 'team', 'namespace']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  routes:
    - matchers:
        - severity = "critical"
      receiver: pagerduty-backend
      continue: true
      routes:
        - matchers:
            - severity = "critical"
            - team = "backend"
          receiver: pagerduty-backend
        - matchers:
            - severity = "critical"
            - team = "platform"
          receiver: pagerduty-platform

    - matchers:
        - severity = "warning"
      receiver: slack-warnings
      group_wait: 1m
      repeat_interval: 12h

receivers:
  - name: default
    slack_configs:
      - api_url: ${SLACK_WEBHOOK}
        channel: '#alerts'

  - name: slack-warnings
    slack_configs:
      - api_url: ${SLACK_WEBHOOK}
        channel: '#alerts-warning'
        title: '⚠️  {{ "{{ .GroupLabels.alertname }}" }}'

  - name: pagerduty-backend
    pagerduty_configs:
      - service_key: ${PAGERDUTY_BACKEND_KEY}

  - name: pagerduty-platform
    pagerduty_configs:
      - service_key: ${PAGERDUTY_PLATFORM_KEY}

inhibit_rules:
  - source_matchers: [severity = "critical"]
    target_matchers: [severity = "warning"]
    equal: [alertname, namespace]

Three key patterns:

  • Branching by severity — critical pages PagerDuty; warning notifies Slack.
  • Branching by team — even within critical, backend and platform teams are paged separately.
  • inhibit_rules: while a critical alert is firing, warning alerts with the same alertname and namespace are suppressed. This prevents alert flooding, but it only takes effect when the same alert name is defined at both severities.
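
To check which receiver a given alert ends up at, it helps to walk the routing tree by hand. The sketch below models the documented matching behavior in simplified form: children are tried in order, the first match wins unless continue is set, and a node with no matching child handles the alert itself. It supports equality matchers only, and the Route class is a hypothetical stand-in, not Alertmanager code.

```python
from dataclasses import dataclass, field


@dataclass
class Route:
    receiver: str
    matchers: dict = field(default_factory=dict)  # equality matchers only
    continue_: bool = False
    routes: list = field(default_factory=list)


def resolve(route: Route, labels: dict) -> list:
    """Return the receivers that would be notified for an alert's labels."""
    if any(labels.get(key) != value for key, value in route.matchers.items()):
        return []
    matched = []
    for child in route.routes:
        hits = resolve(child, labels)
        matched.extend(hits)
        if hits and not child.continue_:
            break  # first matching child wins unless it sets continue
    # No child matched: this node's own receiver handles the alert.
    return matched or [route.receiver]


# The routing tree from the configuration above.
ROOT = Route("default", routes=[
    Route("pagerduty-backend", {"severity": "critical"}, continue_=True, routes=[
        Route("pagerduty-backend", {"severity": "critical", "team": "backend"}),
        Route("pagerduty-platform", {"severity": "critical", "team": "platform"}),
    ]),
    Route("slack-warnings", {"severity": "warning"}),
])

if __name__ == "__main__":
    for labels in (
        {"severity": "critical", "team": "platform"},
        {"severity": "critical", "team": "data"},
        {"severity": "warning", "team": "backend"},
        {"severity": "info"},
    ):
        print(labels, "->", resolve(ROOT, labels))
```

Two details become visible this way: a critical alert from a team without its own sub-route falls back to pagerduty-backend, and continue: true on the critical route has no practical effect here because the only sibling route matches warnings.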

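The secrets (SLACK_WEBHOOK, PAGERDUTY_*) are injected via External Secrets, as covered in #3. Alertmanager does not expand environment variables in its configuration, so the ${...} placeholders must be rendered before the file is loaded, for example by templating the whole alertmanager.yaml into a Secret. Recent Alertmanager versions also offer file-based fields such as api_url_file as an alternative.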

Loki — adding the log stack #

Beyond metrics, capturing logs alongside them is standard. Apply the Loki stack from Advanced #5 as-is.

Loki + Promtail Helm install
helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki-stack \
  -n monitoring \
  --set promtail.enabled=true \
  --set loki.persistence.enabled=true \
  --set loki.persistence.storageClassName=gp3 \
  --set loki.persistence.size=100Gi

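After installation, the chart's data source ConfigMap is picked up by the Grafana sidecar, so Loki appears as a data source and LogQL queries become available in Explore. If Grafana then complains about more than one default data source, setting loki.isDefault=false on this chart is the usual fix.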

myshop-api ERROR logs
{namespace="myshop", app="myshop-api"} |= "ERROR"
ERROR log rate as a metric (LogQL metric query)
sum(rate({namespace="myshop", app="myshop-api"} |= "ERROR" [5m]))

For long-term retention, set Loki’s storage backend to S3. This is the standard operational setup.

CloudWatch Container Insights — the second axis #

Installing CloudWatch Container Insights on the same EKS cluster lets you immediately check cluster, node, Pod, and container metrics from the AWS console. For teams familiar with the AWS console, this reduces the daily monitoring burden.

CloudWatch Container Insights — Helm
helm repo add aws-observability https://aws-observability.github.io/helm-charts
helm install amazon-cloudwatch-observability \
  aws-observability/amazon-cloudwatch-observability \
  -n amazon-cloudwatch --create-namespace \
  --set clusterName=myshop-prod \
  --set region=ap-northeast-2

This chart deploys Fluent Bit as a DaemonSet to ship each node’s container stdout/stderr to CloudWatch Logs, and the CloudWatch Agent to collect metrics.

The role of Fluent Bit #

The standard setup has Fluent Bit read each node’s /var/log/containers/ and route logs to the following two destinations.

Fluent Bit's two outputs
container logs
   ├─→ Loki (in-cluster, short-term search)
   └─→ CloudWatch Logs (S3 export, long-term retention)

The same logs go to two destinations because their responsibilities differ: Loki for daily debugging, CloudWatch for compliance, auditing, and long-term analysis. Using only Loki is cheaper, but in regulated environments CloudWatch is typically added as well. If Fluent Bit ships to Loki directly, the Promtail agent from the earlier install should be disabled so the same lines are not ingested twice.

Grafana dashboard standards #

The standard dashboard set for a production cluster’s Grafana:

| Dashboard | Source |
| --- | --- |
| Kubernetes / Compute Resources / Cluster | kube-prometheus-stack default (ID 7249) |
| Kubernetes / Compute Resources / Namespace (Workloads) | default (ID 7250) |
| Kubernetes / Compute Resources / Pod | default (ID 7251) |
| Kubernetes / Networking / Cluster | default (ID 7253) |
| Node Exporter / Nodes | default (ID 1860) |
| myshop-api operational dashboard | self-authored: golden signals + business metrics |

The 5 default dashboards are auto-registered by kube-prometheus-stack and are immediately visible without any extra work. Adding a single self-authored dashboard tailored to the domain nearly completes the daily monitoring view.

Standard panel set of self-authored dashboard #

myshop-api dashboard — 9 panels
Row 1: Latency P50 / P95 / P99
Row 2: Request rate (per domain, per status)
Row 3: Error rate (4xx / 5xx)
Row 4: Pod CPU / memory utilization
Row 5: HPA current replicas
Row 6: PgBouncer active connections / wait queue
Row 7: business metrics (orders/min, checkout success rate)
Row 8: top ERROR logs (Loki)
Row 9: recent deploys (annotations)

The last panel, “recent deploy annotations,” injects ArgoCD or GitHub Actions events into Grafana as annotations. Deploy timestamps appear as vertical lines on metric graphs, so it is easy to see at a glance which deploy immediately preceded a latency spike.
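
A deploy annotation can be pushed from a CI job through Grafana's HTTP API (POST /api/annotations). The following is a minimal sketch under stated assumptions: the tag names, the token variable, and both helper functions are hypothetical, and the request is only built here, not sent.

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_deploy_annotation(service: str, version: str, deployed_at: datetime) -> dict:
    """Build the JSON body for a Grafana annotation marking a deploy."""
    return {
        "time": int(deployed_at.timestamp() * 1000),  # epoch milliseconds
        "tags": ["deploy", service],
        "text": f"{service} {version} deployed",
    }


def build_request(grafana_url: str, token: str, annotation: dict) -> urllib.request.Request:
    """Prepare (but do not send) the POST request to /api/annotations."""
    return urllib.request.Request(
        url=f"{grafana_url.rstrip('/')}/api/annotations",
        data=json.dumps(annotation).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    body = build_deploy_annotation(
        "myshop-api", "v1.4.2", datetime(2024, 1, 1, tzinfo=timezone.utc)
    )
    request = build_request("https://grafana.myshop.example.com", "<token>", body)
    print(request.get_method(), request.full_url)
    print(json.dumps(body))
    # Sending would be: urllib.request.urlopen(request, timeout=10)
```

On the dashboard side, an annotation query filtered by the deploy tag then draws these events on every panel.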

on-call flow — together with runbooks #

An alert firing is not the end of the story. The recipient needs to know exactly where to look within 5 minutes. The standard on-call flow:

The first 5 minutes after receiving an on-call alert
1. Check alert body in PagerDuty (alertname, team, severity)
2. Click runbook_url in annotation
3. Follow Runbook's "primary check" section — relevant Grafana dashboard / log query / kubectl commands presented
4. Primary response (scale up, restart, traffic block, etc.)
5. Share status in Slack incident channel

Managing runbooks as Markdown in a separate git repo is the standard. Each alert rule’s runbook_url points to a page in that repo, and adding a new alert means the corresponding runbook is brought in via a PR at the same time.
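
One way to enforce the pairing of alerts and runbooks in CI is a small lint over the rule file. A minimal sketch, assuming the PrometheusRule spec has already been parsed into a dict (for example by a YAML parser); the function name and the https-only check are illustrative choices.

```python
def rules_missing_runbook(rule_spec: dict) -> list:
    """Return names of alerting rules that lack an https runbook_url annotation."""
    missing = []
    for group in rule_spec.get("groups", []):
        for rule in group.get("rules", []):
            if "alert" not in rule:
                continue  # recording rules carry no runbook
            url = rule.get("annotations", {}).get("runbook_url", "")
            if not url.startswith("https://"):
                missing.append(rule["alert"])
    return missing


if __name__ == "__main__":
    spec = {
        "groups": [{
            "name": "myshop-api.golden-signals",
            "rules": [
                {"alert": "MyshopApiHighErrorRate",
                 "annotations": {
                     "runbook_url": "https://runbooks.myshop.example.com/myshop-api-5xx"}},
                {"alert": "MyshopApiHighLatencyP95",
                 "annotations": {"summary": "myshop-api P95 latency > 1s"}},
            ],
        }]
    }
    print(rules_missing_runbook(spec))
```

A CI step can fail the PR whenever the returned list is non-empty, which keeps the rule and its runbook arriving together.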

Checks after first operational cycle #

Items to check after the stack has been installed and running for a few days.

Prometheus time series count (cardinality check)
kubectl exec -n monitoring prometheus-prometheus-kube-prometheus-prometheus-0 -c prometheus -- \
  promtool tsdb analyze /prometheus | head -50
Currently firing alerts (a starting point for spotting rules that never fire or fire too often)
kubectl exec -n monitoring alertmanager-prometheus-kube-prometheus-alertmanager-0 -c alertmanager -- \
  amtool alert query --alertmanager.url=http://localhost:9093
Loki disk usage
kubectl exec -n monitoring loki-0 -- df -h /data

After a month of operations, cardinality explosion, alert signal-to-noise degradation, and log disk pressure tend to surface at least once each. Scheduling these as periodic checks is standard practice.
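
For the cardinality check in particular, a quick first look is to count series per metric name directly from an application's /metrics output. The sketch below does this for text in the exposition format; the sample payload is made up, and the parsing is deliberately naive (it ignores edge cases such as escaped characters in label values).

```python
from collections import Counter


def series_per_metric(exposition: str) -> Counter:
    """Count time series per metric name in Prometheus text exposition output."""
    counts = Counter()
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The metric name ends at the first '{' (labels) or space (value).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        counts[name] += 1
    return counts


if __name__ == "__main__":
    sample = """\
# HELP http_requests_total Total requests
# TYPE http_requests_total counter
http_requests_total{handler="/orders",method="GET",status="2xx"} 10
http_requests_total{handler="/orders",method="POST",status="2xx"} 4
http_requests_total{handler="/orders",method="POST",status="5xx"} 1
process_cpu_seconds_total 12.5
"""
    for name, count in series_per_metric(sample).most_common():
        print(f"{count:>4}  {name}")
```

A metric whose count grows with traffic rather than with the number of routes usually has an unbounded label (a user ID or a raw URL path), and that is the typical source of a cardinality explosion.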

Closing #

We walked through the full cycle of laying an observability stack on top of myshop-api: installing Prometheus + Grafana + Alertmanager in one shot with kube-prometheus-stack, standardizing myshop-api’s 4 golden signals alerts via ServiceMonitor + PrometheusRule, and locking in Slack/PagerDuty routing by severity and team through Alertmanager. We also added the dual log axes of Loki and CloudWatch and established the operational pattern of tying every alert to a runbook URL. At this point, the full loop from code through deploy, operations, and observation is automated. The next and final post in the series covers the periodic operational cycle for running this cluster safely across months, quarters, and years — EKS upgrades, RDS backup/recovery, cost management, and security checks — along with a retrospective of the entire K8s Practice series.
