Chapter 25

Monitoring · Alerts

The myshop-api built through Chapter 24 is automated from code to deployment, but a service whose behavior you cannot see cannot be operated. This chapter layers an observability stack onto the EKS cluster. We install Prometheus · Grafana · Alertmanager in one step with kube-prometheus-stack, standardize myshop-api metrics and the 4 golden signals alerts with ServiceMonitor / PrometheusRule, collect logs with Loki, cover AWS-side metrics and long-term retention with CloudWatch Container Insights, and organize the Slack / PagerDuty on-call flow with routing by severity and team.

After the CI / CD pipeline of Chapter 24, new versions of myshop-api roll out automatically, but half of operations work is seeing how those versions behave. If you cannot see where and how CPU, memory, request latency, and error rate change, canary promotion cannot be judged and incident response is slow. That visibility is what this chapter adds on top of EKS.

Here the standard stack (Prometheus + Grafana + Loki + Alertmanager) covered in Chapter 19 Observability becomes a full operational setup integrated with EKS and AWS. Where Chapter 19 introduced the three kinds of data (metric · log · trace) at the object level, this chapter adds the operational layer: alert rule sets, routing, and on-call procedures. The goal is to have the 4 golden signals alerts for myshop-api in place, with critical alerts paged to PagerDuty and warnings sent to Slack.

Combining the two axes — in-cluster Prometheus + managed CloudWatch #

EKS-environment observability is usually achieved by combining two axes.

Axis	Responsibility
In-cluster (Prometheus + Grafana + Loki)	Workload metrics, business metrics, alerts, dashboards
CloudWatch (Container Insights + Logs)	AWS managed metrics, long-term log retention, AWS console integration

Using only one of the two is possible, but the standard for a production cluster is to combine them. Prometheus is the primary source of operational metrics and alerts, and CloudWatch is the integration point for long-term retention and for metrics of AWS’s own resources (RDS · ALB · EBS). The decision in Chapter 21 EKS setup to send PostgreSQL logs to CloudWatch with RDS’s enabled_cloudwatch_logs_exports fits naturally into this second axis.

AWS’s managed Prometheus (AMP) and managed Grafana (AMG) are establishing themselves as options that reduce the in-cluster operational burden, but in this chapter we look mainly at the most common in-cluster model and touch on the remote write option to AMP alongside it.

kube-prometheus-stack — the standard bundle that installs at once #

This is the standard Helm chart touched on in Chapter 19 Observability §“kube-prometheus-stack.” One command installs Prometheus, Grafana, Alertmanager, kube-state-metrics, node-exporter, and the Prometheus Operator together with its CRDs. The Operator model of Chapter 18 CRD and Operator is what drives the full metric stack here.

Install #

Install kube-prometheus-stack

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack \
  -n monitoring --create-namespace \
  --values prometheus-values.yaml

prometheus-values.yaml — the key part

prometheus:
  prometheusSpec:
    retention: 30d
    retentionSize: "50GB"

    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 100Gi

    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false
    ruleSelectorNilUsesHelmValues: false

    additionalScrapeConfigs:
      - job_name: ec2-spot-instance
        ec2_sd_configs:
          - region: ap-northeast-2

    remoteWrite:
      - url: https://aps-workspaces.ap-northeast-2.amazonaws.com/workspaces/ws-xxx/api/v1/remote_write
        sigv4:
          region: ap-northeast-2

grafana:
  adminPassword: ""
  ingress:
    enabled: true
    ingressClassName: alb
    annotations:
      alb.ingress.kubernetes.io/scheme: internet-facing
      alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'
      alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:...
    hosts:
      - grafana.myshop.example.com
  persistence:
    enabled: true
    storageClassName: gp3
    size: 10Gi

alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          resources:
            requests:
              storage: 10Gi
  config:
    route:
      receiver: default
    receivers:
      - name: default

We point out the three key settings.

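retention: 30d + storageSpec keeps 30 days of metrics on a 100Gi EBS volume, with retentionSize capping the TSDB at 50GB so the disk does not fill up. To retain data beyond 30 days, ship it to AMP with remoteWrite or to a long-term store such as Thanos. The gp3 StorageClass of Chapter 9 PV / PVC / StorageClass backs a real production PV here.
serviceMonitorSelectorNilUsesHelmValues: false makes Prometheus pick up every ServiceMonitor instead of only those carrying the chart’s release label. Combined with the chart’s default of watching all namespaces, the ServiceMonitor in the myshop namespace is discovered even though it lives outside the monitoring namespace. The podMonitor and rule variants do the same for PodMonitor and PrometheusRule objects.
remoteWrite stores metrics long-term in Amazon Managed Service for Prometheus (AMP). This is the option for cases that need analysis beyond 30 days.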
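The retention and volume numbers above can be sanity-checked with the rough sizing rule from the Prometheus storage documentation: disk needed is roughly retention time in seconds, times ingested samples per second, times bytes per sample (typically 1 to 2 bytes after compression). The sketch below applies that rule; the series count and scrape interval are illustrative assumptions, not measurements from myshop-api.

```python
def estimate_tsdb_bytes(active_series: int, scrape_interval_s: float,
                        retention_days: int, bytes_per_sample: float = 2.0) -> float:
    """Rough Prometheus disk estimate: retention * samples/sec * bytes/sample."""
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


# Hypothetical cluster: 300,000 active series scraped every 30 s, kept for 30 days.
estimate = estimate_tsdb_bytes(active_series=300_000, scrape_interval_s=30, retention_days=30)
print(f"{estimate / 1e9:.1f} GB")
```

With these assumed inputs the estimate comes out slightly above the 50GB retentionSize, which would mean the size limit starts trimming old data a little before the 30-day limit does. Running the same arithmetic with your own series count is a quick way to decide whether the volume or the limits need adjusting.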

Grafana’s Ingress is turned into an ALB by the AWS Load Balancer Controller set up in Chapter 22 App deployment skeleton, so the same component serves the three entry points of myshop-api, ArgoCD, and Grafana.

Checks right after install #

Basic health check

kubectl get pods -n monitoring
kubectl get servicemonitors -A
kubectl get prometheusrules -A

Expected output

NAME                                                      READY   STATUS    RESTARTS
prometheus-grafana-xxx                                    3/3     Running   0
prometheus-kube-prometheus-operator-xxx                   1/1     Running   0
prometheus-kube-state-metrics-xxx                         1/1     Running   0
prometheus-prometheus-kube-prometheus-prometheus-0        2/2     Running   0
prometheus-prometheus-node-exporter-xxx                   1/1     Running   0
alertmanager-prometheus-kube-prometheus-alertmanager-0    2/2     Running   0

On the order of a hundred default alert rules come in automatically, packaged as PrometheusRule objects. Kubernetes-level alerts such as node down, etcd failure, and kubelet issues are predefined among them, so a cluster incident becomes an alert right away with no separate work. The state of objects covered in Chapter 14 RBAC / NetworkPolicy / ResourceQuota, such as ResourceQuota usage against its limits, is also exposed as metrics through kube-state-metrics.

Adding metric exposure to myshop-api #

This is the flow for exposing myshop-api’s metrics on a cluster where the standard stack is installed.

1. The application exposes `/metrics` #

A Prometheus client library exists for almost every language. For Python (FastAPI), you start with the following one line.

myshop-api/main.py — exposing Prometheus metrics

from fastapi import FastAPI
from prometheus_fastapi_instrumentator import Instrumentator

app = FastAPI()
Instrumentator().instrument(app).expose(app)

This one line automatically exposes the following metrics.

http_requests_total{handler, method, status} — request counter
http_request_duration_seconds_bucket{handler, method} — latency histogram
http_request_size_bytes / http_response_size_bytes — payload size
Standard Python runtime metrics (GC, threads, memory)

Domain metrics (order-creation counter, payment success rate, etc.) are added on top of that.

Adding domain metrics

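from prometheus_client import Counter, Histogram

# Counter: only ever increases; label values must stay a small, fixed set.
orders_created = Counter(
    "myshop_orders_created_total",
    "Total orders created",
    ["status"]
)

# Histogram: observations land in buckets, so quantiles can be computed in PromQL.
checkout_duration = Histogram(
    "myshop_checkout_duration_seconds",
    "Checkout flow duration"
)

# In a request handler (the "paid" status value is illustrative):
#   orders_created.labels(status="paid").inc()
#   with checkout_duration.time(): ...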
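A quick way to confirm that both the automatic and the domain metrics are really exposed is to read the /metrics response and list the metric names in it. The sketch below parses the Prometheus text format with the standard library only; the SAMPLE text is hand-written and stands in for a real response, which you would fetch from the running Pod.

```python
def metric_families(exposition_text: str) -> set[str]:
    """Return the metric names found in Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # A sample line looks like: name{label="v"} value   or   name value
        name = line.split("{", 1)[0].split(" ", 1)[0]
        names.add(name)
    return names


# Hand-written sample, standing in for the real response of GET /metrics.
SAMPLE = '''\
# HELP http_requests_total Total number of requests
# TYPE http_requests_total counter
http_requests_total{handler="/orders",method="POST",status="2xx"} 42.0
myshop_orders_created_total{status="paid"} 17.0
myshop_checkout_duration_seconds_bucket{le="0.5"} 12.0
'''

expected = {"http_requests_total", "myshop_orders_created_total"}
missing = expected - metric_families(SAMPLE)
print("missing:", sorted(missing))
```

A check like this fits naturally into a smoke test that runs after each deployment, so that a renamed or dropped metric is noticed before an alert rule silently stops matching.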

This is where the business metrics described in Chapter 19 §“RED · USE · 4 golden signals” turn into actual code, one line at a time.

2. ServiceMonitor manifest #

We make a ServiceMonitor for the Prometheus Operator to watch.

charts/myshop-api/templates/servicemonitor.yaml

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: {{ include "myshop-api.fullname" . }}
  namespace: {{ .Release.Namespace }}
  labels:
    app.kubernetes.io/name: myshop-api
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: myshop-api
  endpoints:
    - port: http
      interval: 30s
      path: /metrics

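Once this manifest is applied, Prometheus scrapes /metrics on every myshop-api Pod every 30 seconds. If http_requests_total{namespace="myshop"} returns data in Grafana’s Explore, metric collection is healthy. Note that targets discovered this way carry labels such as namespace, service, pod, and job by default; the app="myshop-api" selector used in the alert rules later in this chapter assumes an app label is also attached to the series, for example by copying it from the Service with the ServiceMonitor’s targetLabels. The ServiceMonitor CRD of Chapter 18 thus becomes one manifest in a full metric pipeline.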
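When a target does not show up in Prometheus, the two usual causes are that the ServiceMonitor’s matchLabels do not match the Service’s labels, or that the endpoint’s port name does not match a named port on the Service. The sketch below models those two checks; the Service labels and port names are hypothetical values of the kind `kubectl get svc -o yaml` would show.

```python
def selector_matches(match_labels: dict[str, str], service_labels: dict[str, str]) -> bool:
    """matchLabels semantics: every key/value in the selector must be on the object."""
    return all(service_labels.get(k) == v for k, v in match_labels.items())


def scrape_port_found(endpoint_port: str, service_port_names: list[str]) -> bool:
    """A ServiceMonitor endpoint refers to a Service port by its name."""
    return endpoint_port in service_port_names


# Hypothetical Service metadata for myshop-api.
service_labels = {
    "app.kubernetes.io/name": "myshop-api",
    "app.kubernetes.io/instance": "myshop",
}
service_ports = ["http"]

ok = (selector_matches({"app.kubernetes.io/name": "myshop-api"}, service_labels)
      and scrape_port_found("http", service_ports))
print("would be scraped:", ok)
```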

4 golden signals — the skeleton of the alert rule set #

We write the 4 golden signals (Latency / Traffic / Errors / Saturation) covered in Chapter 19 as myshop-api’s PrometheusRule.

charts/myshop-api/templates/prometheusrule.yaml

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: {{ include "myshop-api.fullname" . }}
  namespace: {{ .Release.Namespace }}
  labels:
    release: prometheus
spec:
  groups:
    - name: myshop-api.golden-signals
      interval: 30s
      rules:
        # Errors — 5xx ratio
        - alert: MyshopApiHighErrorRate
          expr: |
            sum(rate(http_requests_total{app="myshop-api",status=~"5.."}[5m]))
              / sum(rate(http_requests_total{app="myshop-api"}[5m])) > 0.05
          for: 5m
          labels:
            severity: critical
            team: backend
          annotations:
            summary: "myshop-api 5xx rate > 5% ({{ "{{ $value | humanizePercentage }}" }})"
            description: "5xx ratio has stayed above 5% for over 5 minutes."
            runbook_url: "https://runbooks.myshop.example.com/myshop-api-5xx"

        # Latency — P95
        - alert: MyshopApiHighLatencyP95
          expr: |
            histogram_quantile(0.95,
              sum by (le) (rate(http_request_duration_seconds_bucket{app="myshop-api"}[5m]))
            ) > 1.0
          for: 10m
          labels:
            severity: warning
            team: backend
          annotations:
            # Single-quoted so the rendered printf "%.2f" does not end the YAML string.
            summary: 'myshop-api P95 latency > 1s ({{ "{{ $value | printf \"%.2f\" }}" }}s)'

        # Traffic — sharp traffic drop (a downstream-failure signal)
        - alert: MyshopApiTrafficDrop
          expr: |
            sum(rate(http_requests_total{app="myshop-api"}[5m]))
              < 0.3 * sum(rate(http_requests_total{app="myshop-api"}[5m] offset 1h))
          for: 10m
          labels:
            severity: warning
            team: backend
          annotations:
            summary: "myshop-api traffic dropped below 30% of the rate one hour earlier"

        # Saturation — Pod memory usage
        - alert: MyshopApiPodMemoryHigh
          expr: |
            sum by (pod) (
              container_memory_working_set_bytes{namespace="myshop",pod=~"myshop-api-.*"}
            ) / sum by (pod) (
              kube_pod_container_resource_limits{namespace="myshop",pod=~"myshop-api-.*",resource="memory"}
            ) > 0.85
          for: 10m
          labels:
            severity: warning
            team: backend
          annotations:
            summary: "myshop-api Pod memory > 85% of limit"
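The Latency rule relies on histogram_quantile, which estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The sketch below is a simplified version of that calculation for classic histograms, assuming 0 < q <= 1 and at least one finite bucket; the bucket rates are made-up numbers.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets."""
    buckets = sorted(buckets)
    total = buckets[-1][1]  # the +Inf bucket holds the total count
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for i, (bound, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[i - 1][0]
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Made-up per-second bucket rates, as rate(..._bucket[5m]) summed by le would give.
rates = [(0.1, 50.0), (0.5, 80.0), (1.0, 94.0), (2.5, 99.0), (math.inf, 100.0)]
p95 = histogram_quantile(0.95, rates)
print(f"P95 = {p95:.2f}s, alert condition: {p95 > 1.0}")
```

Because the result is interpolated, its accuracy depends on where the bucket boundaries sit: a threshold such as 1.0 is judged most reliably when one of the buckets ends exactly at that value.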

Three patterns recur across these rules.

The for duration requires the condition to hold for 5 to 10 minutes, so a short spike does not fire an alert.
The severity label decides urgency: critical is paged immediately, warning is reviewed the next business day. It is the key that Alertmanager routes on.
runbook_url points to the response procedure that whoever receives the alert can follow right away. It follows the principle of one alert = one clear response.
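The effect of the for duration is easiest to see as a small state machine: an alert becomes pending when its expression first evaluates to true and turns to firing only after it has stayed true for the whole duration. The sketch below simulates that behavior for a rule evaluated every 30 seconds with for: 5m; it is a simplified model, not Prometheus code.

```python
def alert_states(condition_results: list[bool], eval_interval_s: int, for_s: int) -> list[str]:
    """Return the alert state after each rule evaluation.

    inactive -> pending as soon as the expression is true,
    pending  -> firing once it has stayed true for at least `for_s` seconds,
    anything -> inactive as soon as one evaluation is false.
    """
    states = []
    true_since = None  # index of the evaluation where the condition became true
    for i, is_true in enumerate(condition_results):
        if not is_true:
            true_since = None
            states.append("inactive")
            continue
        if true_since is None:
            true_since = i
        held_for = (i - true_since) * eval_interval_s
        states.append("firing" if held_for >= for_s else "pending")
    return states


# A 2-minute spike never fires; a sustained breach fires at the 5-minute mark.
spike = alert_states([True] * 4 + [False], eval_interval_s=30, for_s=300)
sustained = alert_states([True] * 11, eval_interval_s=30, for_s=300)
print(spike[-2], spike[-1])
print(sustained[9], sustained[10])
```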

The Argo Rollouts AnalysisTemplate of Chapter 24 CI / CD pipeline uses the same PromQL expression as this chapter’s first rule (5xx ratio) as the input to automation. The operational value of observability lies in this dual use: the same metric feeds a human’s alert and code’s promote decision at once.
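The dual use can be pictured as one function feeding two decisions. In the sketch below the alert threshold comes from MyshopApiHighErrorRate, while the promotion threshold is a hypothetical value chosen for illustration and is not taken from Chapter 24. One simplification to note: this function returns 0.0 when there is no traffic, whereas the PromQL division would yield no usable result in that case.

```python
ALERT_THRESHOLD = 0.05    # from MyshopApiHighErrorRate
PROMOTE_THRESHOLD = 0.01  # hypothetical canary gate, for illustration only


def error_ratio(total_rate: float, error_rate: float) -> float:
    """Share of 5xx responses among all requests; 0.0 when there is no traffic."""
    return error_rate / total_rate if total_rate > 0 else 0.0


def should_alert(ratio: float) -> bool:
    """Human-facing decision: page someone."""
    return ratio > ALERT_THRESHOLD


def can_promote(ratio: float) -> bool:
    """Automation-facing decision: let the canary proceed."""
    return ratio <= PROMOTE_THRESHOLD


# 200 req/s in total, 4 of them failing with 5xx: a 2% error ratio.
ratio = error_ratio(total_rate=200.0, error_rate=4.0)
print(should_alert(ratio), can_promote(ratio))
```

In this example the ratio is too low to page anyone but too high to promote, which illustrates why the automated gate is usually stricter than the alert.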

Alertmanager routing — the split between Slack and PagerDuty #

The flow of an alert is Prometheus → Alertmanager → channel. Alertmanager looks at labels and decides routing.

alertmanager.yaml — production routing

route:
  receiver: default
  group_by: ['alertname', 'team', 'namespace']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  routes:
    - matchers:
        - severity = "critical"
      receiver: pagerduty-backend
      continue: true
      routes:
        - matchers:
            - severity = "critical"
            - team = "backend"
          receiver: pagerduty-backend
        - matchers:
            - severity = "critical"
            - team = "platform"
          receiver: pagerduty-platform

    - matchers:
        - severity = "warning"
      receiver: slack-warnings
      group_wait: 1m
      repeat_interval: 12h

receivers:
  - name: default
    slack_configs:
      - api_url: ${SLACK_WEBHOOK}
        channel: '#alerts'

  - name: slack-warnings
    slack_configs:
      - api_url: ${SLACK_WEBHOOK}
        channel: '#alerts-warning'
        title: '{{ "{{ .GroupLabels.alertname }}" }}'

  - name: pagerduty-backend
    pagerduty_configs:
      - service_key: ${PAGERDUTY_BACKEND_KEY}

  - name: pagerduty-platform
    pagerduty_configs:
      - service_key: ${PAGERDUTY_PLATFORM_KEY}

inhibit_rules:
  - source_matchers: [severity = "critical"]
    target_matchers: [severity = "warning"]
    equal: [alertname, namespace]

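Three key patterns stand out.

Split by severity: critical is paged to PagerDuty, warning is posted to Slack.
Split by team: even for the same critical severity, the backend and platform teams are paged separately. A critical alert with any other team label falls back to the parent route’s receiver, pagerduty-backend.
inhibit_rules: while a critical alert is firing, warning alerts with the same alertname and namespace are muted. This is the standard pattern for preventing an alert storm, and it only takes effect when the critical and warning variants of a condition share one alertname.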
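To reason about where a given alert ends up, it helps to trace the routing tree by hand. The sketch below is a simplified model of Alertmanager’s matching (equality matchers only, no grouping or timing): a matching child overrides its parent, and the first matching sibling wins unless it sets continue. The inhibited function models the inhibit rule in the same spirit.

```python
from dataclasses import dataclass, field


@dataclass
class Route:
    receiver: str
    match: dict = field(default_factory=dict)   # equality matchers only
    continue_: bool = False
    routes: list = field(default_factory=list)  # child Route objects


def resolve(route: Route, labels: dict) -> list:
    """Receivers for an alert; a matching child overrides its parent."""
    if any(labels.get(k) != v for k, v in route.match.items()):
        return []
    found = []
    for child in route.routes:
        matched = resolve(child, labels)
        found += matched
        if matched and not child.continue_:
            break  # first matching sibling wins unless it sets continue
    return found or [route.receiver]


def inhibited(target: dict, firing: list, source: dict, target_match: dict, equal: list) -> bool:
    """True if a firing alert matching `source` mutes `target`."""
    if any(target.get(k) != v for k, v in target_match.items()):
        return False
    return any(
        all(alert.get(k) == v for k, v in source.items())
        and all(alert.get(k) == target.get(k) for k in equal)
        for alert in firing
    )


# The routing tree from alertmanager.yaml.
tree = Route("default", routes=[
    Route("pagerduty-backend", {"severity": "critical"}, continue_=True, routes=[
        Route("pagerduty-backend", {"severity": "critical", "team": "backend"}),
        Route("pagerduty-platform", {"severity": "critical", "team": "platform"}),
    ]),
    Route("slack-warnings", {"severity": "warning"}),
])
print(resolve(tree, {"severity": "critical", "team": "platform"}))
print(resolve(tree, {"severity": "warning", "team": "backend"}))
print(resolve(tree, {"severity": "info"}))

# The inhibit rule: critical mutes warning when alertname and namespace are equal.
rule = dict(source={"severity": "critical"}, target_match={"severity": "warning"},
            equal=["alertname", "namespace"])
firing = [{"alertname": "MyshopApiHighErrorRate", "severity": "critical", "namespace": "myshop"}]
same_name = {"alertname": "MyshopApiHighErrorRate", "severity": "warning", "namespace": "myshop"}
other_name = {"alertname": "MyshopApiHighLatencyP95", "severity": "warning", "namespace": "myshop"}
print(inhibited(same_name, firing, **rule), inhibited(other_name, firing, **rule))
```

The last line shows that a warning with a different alertname is not muted, so the equal list is worth reviewing against the alert names actually in use.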

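Secrets (SLACK_WEBHOOK, PAGERDUTY_*) are injected with the External Secrets Operator (ESO) covered in Chapter 23 DB integration. If you set up a myshop/${env}/alerting/* pattern in Secrets Manager, ESO automatically materializes it as a K8s Secret. Alertmanager does not expand ${...} references on its own, so treat the placeholders above as values that are rendered into the configuration before Alertmanager loads it.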
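The rendering step can be pictured with Python’s string.Template, which uses the same ${NAME} placeholder syntax. This is only an illustration of the idea; in the cluster the substitution would be done by whatever tool builds the Secret, and the webhook URL below is a placeholder, not a real endpoint.

```python
from string import Template


def render_config(template_text: str, secrets: dict[str, str]) -> str:
    """Replace ${NAME} placeholders; raises KeyError if a secret is missing."""
    return Template(template_text).substitute(secrets)


snippet = (
    "slack_configs:\n"
    "  - api_url: ${SLACK_WEBHOOK}\n"
    "    channel: '#alerts'\n"
)

# Placeholder value; the real one would come from Secrets Manager.
rendered = render_config(snippet, {"SLACK_WEBHOOK": "https://hooks.example.invalid/placeholder"})
print(rendered)
```

Using the strict substitute call rather than safe_substitute means a missing secret fails loudly at render time instead of producing a configuration that still contains a literal placeholder.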

Loki — adding the log stack #

Securing logs together with metrics is standard. We apply the model seen in Chapter 19 Observability §“Loki — the lightweight log stack” directly.

Loki + Promtail Helm install

helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki-stack \
  -n monitoring \
  --set promtail.enabled=true \
  --set loki.persistence.enabled=true \
  --set loki.persistence.storageClassName=gp3 \
  --set loki.persistence.size=100Gi

After installation, a Loki data source is added to Grafana automatically, and LogQL queries become possible in Explore.

myshop-api's ERROR logs

{namespace="myshop", app="myshop-api"} |= "ERROR"

Converting to an error-rate metric (Loki -> metric-like)

sum(rate({namespace="myshop", app="myshop-api"} |= "ERROR" [5m]))
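What the metric query computes is simple: count the log lines in the trailing 5-minute window that contain the string, then divide by the window length in seconds. The sketch below reproduces that arithmetic on a hand-written list of (timestamp, line) pairs; it models a single stream and is not Loki code.

```python
def error_log_rate(entries: list[tuple[float, str]], now_s: float,
                   window_s: float = 300.0, needle: str = "ERROR") -> float:
    """Lines containing `needle` per second over the trailing window."""
    matching = sum(
        1 for ts, line in entries
        if now_s - window_s < ts <= now_s and needle in line
    )
    return matching / window_s


# Hand-written sample: timestamps in seconds, the first entry is outside the window.
entries = [
    (650.0, "ERROR old entry"),
    (710.0, "INFO request ok"),
    (800.0, "ERROR db timeout"),
    (900.0, "ERROR db timeout"),
    (990.0, "ERROR upstream returned 500"),
]
rate = error_log_rate(entries, now_s=1000.0)
print(f"{rate:.3f} error lines per second")
```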

For long-term retention, set Loki’s storage backend to S3, which is a standard operational setup. In Chapter 28 Cost optimization we look at how Loki’s S3 backend changes the cost profile compared to EBS.

CloudWatch Container Insights — the second axis #

If you also install CloudWatch Container Insights on the same EKS cluster, you can check cluster · node · Pod · container metrics directly in the AWS console. For an operations team that is comfortable with the AWS console, this lightens daily checks.

CloudWatch Container Insights — Helm

helm repo add aws-observability https://aws-observability.github.io/helm-charts
helm install amazon-cloudwatch-observability \
  aws-observability/amazon-cloudwatch-observability \
  -n amazon-cloudwatch --create-namespace \
  --set clusterName=myshop-prod \
  --set region=ap-northeast-2

This chart runs Fluent Bit as a DaemonSet that sends each node’s container stdout / stderr to CloudWatch Logs, and collects metrics with the CloudWatch Agent. The DaemonSet pattern of Chapter 8 StatefulSet · DaemonSet · Job is used here as a full log-collection agent.

The DaemonSet’s ServiceAccount receives CloudWatch permissions such as PutLogEvents and PutMetricData via IRSA, following the same approach as the EBS CSI IRSA pattern of Chapter 21.

The two exits of Fluent Bit #

It’s standard for Fluent Bit to read the node’s /var/log/containers/ and route it to the following two places.

The two exits of Fluent Bit

container logs
   |
   |-> Loki (in-cluster, short-term search)
   `-> CloudWatch Logs (S3 export, long-term retention)

The same logs go to two places because the responsibilities differ: Loki is for daily debugging, CloudWatch for compliance · audit · long-term analysis. Using only Loki is cheaper, but in regulated environments CloudWatch usually comes along too. CloudWatch’s log retention policy is set according to the audit procedure of Chapter 26 Operations checklist.

Grafana dashboard standards #

We organize the standard dashboard set that goes into a production cluster’s Grafana.

Dashboard	Source
Kubernetes / Compute Resources / Cluster	kube-prometheus-stack default (ID 7249)
Kubernetes / Compute Resources / Namespace (Workloads)	default (ID 7250)
Kubernetes / Compute Resources / Pod	default (ID 7251)
Kubernetes / Networking / Cluster	default (ID 7253)
Node Exporter / Nodes	default (ID 1860)
myshop-api operational dashboard	self-written — golden signals + business metrics

The 5 default dashboards are registered automatically by kube-prometheus-stack, so they show up right away with no separate work. If you make just one self-written dashboard fitted to your domain, the field of view for daily checks is nearly complete.

The standard panel set of the self-written dashboard #

myshop-api dashboard — 9 panels

Row 1: Latency P50 / P95 / P99
Row 2: Request rate (per domain, per status)
Row 3: Error rate (4xx / 5xx)
Row 4: Pod CPU / memory usage
Row 5: HPA current replicas
Row 6: PgBouncer active connections / wait queue
Row 7: Business metrics (orders/min, checkout success rate)
Row 8: Top ERROR logs (Loki)
Row 9: Recent deployments (annotations)

The last row, recent deployments, shows ArgoCD or GitHub Actions events injected into Grafana as annotations. Each deployment appears as a vertical line over the metric graphs, so you can see at a glance which deployment a latency spike followed. This is where the CI / CD of Chapter 24 and this chapter’s observability meet on one screen.
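Injecting a deployment event comes down to posting a small JSON body to Grafana’s annotations API (POST /api/annotations), with the time in epoch milliseconds and tags that the dashboard’s annotation query filters on. The sketch below only builds that body; the tag names and the version string are assumptions for illustration, and the actual HTTP call with an API token is left to the pipeline.

```python
import json
from datetime import datetime, timezone


def deployment_annotation(service: str, version: str, deployed_at: datetime) -> dict:
    """Build the JSON body for Grafana's POST /api/annotations."""
    return {
        "time": int(deployed_at.timestamp() * 1000),  # epoch milliseconds
        "tags": ["deployment", service],              # hypothetical tag scheme
        "text": f"{service} {version} deployed",
    }


# Hypothetical deployment of an illustrative version.
body = deployment_annotation(
    "myshop-api", "v1.4.2",
    datetime(2025, 1, 15, 9, 30, tzinfo=timezone.utc),
)
print(json.dumps(body))
```

Leaving out a dashboard identifier makes the annotation organization-wide, so any dashboard can show it by querying for the deployment tag.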

The on-call flow — with the runbook #

An alert firing is not the end of the story. It has to be clear where the person who received it should look within the first 5 minutes. The standard operational flow is as follows.

The standard 5 minutes right after receiving an on-call alert

1. Check the alert body in PagerDuty (alertname, team, severity)
2. Click the runbook_url in the annotation
3. Follow the Runbook's "first checks" section — it presents related Grafana dashboards / log queries / kubectl commands
4. First response (scale up, restart, block traffic, etc.)
5. Share status in the Slack incident channel

It is standard to manage runbooks as markdown in a separate git repo. Each alert rule’s runbook_url points to a page in that repo, and a new alert arrives together with its runbook in the same PR. The git single-source model of Chapter 20 GitOps thus extends from alert definitions to operational-procedure documents. Chapter 27 kubectl debugging patterns plays part of this runbook role at the level of the book.
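The rule that every alert ships with a runbook can be enforced in CI with a small lint over the rule groups. The sketch below works on already-parsed data (the structure under spec.groups of a PrometheusRule); loading the YAML is left out so that only the standard library is needed, and the sample groups are abbreviated from this chapter’s rules.

```python
def alerts_missing_runbook(rule_groups: list[dict]) -> list[str]:
    """Names of alerting rules whose annotations lack a runbook_url."""
    missing = []
    for group in rule_groups:
        for rule in group.get("rules", []):
            if "alert" not in rule:
                continue  # recording rules need no runbook
            if not rule.get("annotations", {}).get("runbook_url"):
                missing.append(rule["alert"])
    return missing


# Abbreviated from the PrometheusRule in this chapter.
groups = [{
    "name": "myshop-api.golden-signals",
    "rules": [
        {"alert": "MyshopApiHighErrorRate",
         "annotations": {"runbook_url": "https://runbooks.myshop.example.com/myshop-api-5xx"}},
        {"alert": "MyshopApiHighLatencyP95",
         "annotations": {"summary": "myshop-api P95 latency > 1s"}},
    ],
}]
print(alerts_missing_runbook(groups))
```

Run against the rules as written in this chapter, a check like this would flag the warning rules, which carry a summary but no runbook_url yet.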

Checks after the first operational cycle #

These are the items to check at the point when you’ve installed the stack and run it for a few days.

Prometheus's time-series count (cardinality check)

kubectl exec -n monitoring prometheus-prometheus-kube-prometheus-prometheus-0 -c prometheus -- \
  promtool tsdb analyze /prometheus | head -50

Currently firing alerts (a starting point for spotting rules that never fire / rules that fire too often)

kubectl exec -n monitoring alertmanager-prometheus-kube-prometheus-alertmanager-0 -c alertmanager -- \
  amtool alert query --alertmanager.url=http://localhost:9093

Loki's disk usage

kubectl exec -n monitoring loki-0 -- df -h /data

After a month of operations, cardinality explosion, a drop in alert SNR (signal-to-noise ratio), and log disk pressure each tend to show up at least once. It is standard to make these checks routine, and the monthly-check section of Chapter 26 Operations checklist includes these three commands directly.

The trap of cardinality explosion #

The most common operational incident is high-cardinality values (user ID, request ID, UUID, etc.) going into metric labels. The time-series count explodes and Prometheus is OOM-killed, or the AMP cost suddenly grows by dozens of times. The principle is to put only values from a small, finite set (status code, handler name, environment) in labels, and to send detailed identifiers to logs or traces.
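The explosion is plain multiplication: the upper bound on time series for one metric is the product of the number of distinct values of each label. The sketch below compares a bounded label set with the same set plus a user ID label; all counts are illustrative assumptions.

```python
from math import prod


def series_count(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


# Illustrative counts: 20 handlers, 4 methods, 5 status groups, 50,000 users.
bounded = series_count({"handler": 20, "method": 4, "status": 5})
with_user_id = series_count({"handler": 20, "method": 4, "status": 5, "user_id": 50_000})
print(bounded, with_user_id, with_user_id // bounded)
```

Under these assumptions a single extra label turns a few hundred series into tens of millions, and a histogram multiplies that again by its number of buckets.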

Exercises #

Install this chapter’s kube-prometheus-stack on the dev EKS cluster and apply a bundle of ServiceMonitor + PrometheusRule to myshop-api. Deliberately generate 5xx responses for a while and observe the MyshopApiHighErrorRate alert firing once the for: 5m duration has passed. Watch the alert’s state move from pending to firing on the Alerts page of the Prometheus UI (pending alerts are not sent to Alertmanager), confirm that it then appears in the Alertmanager UI, and compare in one paragraph how the Argo Rollouts AnalysisTemplate of Chapter 24 uses the same PromQL for analysis.
Redesign Alertmanager’s routing tree to fit your own organization’s team composition. Draw the matrix of severity (critical / warning / info) × team (backend / platform / data) as a single table, and fill in which channel (PagerDuty / Slack channel name) goes into each cell. Write two or three of the alert storms that inhibit_rules would prevent, with your own scenarios.
Deliberately create and deploy a poorly designed metric that puts a user ID in a label, then observe the change in the time-series count with the cardinality check command. Summarize in one paragraph how Prometheus’s memory usage changes and how AMP’s remote write cost can explode, from the point of view of Chapter 28 Cost optimization.

In summary: a production cluster’s observability usually spans two layers, in-cluster Prometheus and managed CloudWatch. kube-prometheus-stack bundles Prometheus · Grafana · Alertmanager · kube-state-metrics · node-exporter in one command. With ServiceMonitor + PrometheusRule you write myshop-api’s 4 golden signals alerts, and Alertmanager routes them to PagerDuty / Slack by severity and team labels. Loki handles logs for daily debugging, and CloudWatch Logs handles long-term retention and audit. Observability is valuable because the same metric can drive both a human alert and a code-driven promote decision, and linking alerts to response procedures with runbook_url is standard on-call practice. High label cardinality is one of the most common operational failure modes.

Next chapter #

At this point myshop-api is in a state where the flow from code to deployment · operations · observation is all automated. In the next chapter, also the last chapter of Part 4, we cover the regular operations cycle of running this cluster safely on a monthly, quarterly, and yearly basis.

In Chapter 26 Operations checklist we organize the standard procedures of EKS upgrade, RDS backup · recovery, cost check, and security check, and wrap up with a retrospective on the 6 chapters of Part 4 (EKS in Production). After that we move on to the full operational scope of Part 5 (Operations · Debugging · Cost).