K8s Practice #5: Monitoring & Alerting — Prometheus / CloudWatch / Alertmanager
The fifth post in the K8s Practice series. Through #4, myshop-api now has automated version delivery, but half of operations is being able to see what that system is actually doing. Without visibility into how CPU, memory, request latency, and error rate change, neither canary auto-promotion nor incident response can keep up. This post lays out the observability stack on EKS. It takes the standard stack covered in Advanced #5 (Prometheus + Grafana + Loki + Alertmanager), adapts it to the EKS environment, and adds the AWS-managed CloudWatch Container Insights path alongside it.
The K8s Practice series consists of six posts.
- #1 EKS Cluster Setup — Terraform / eksctl / IRSA / Addons
- #2 App deployment skeleton — Deployment / Service / Ingress / Helm
- #3 DB integration — RDS / Secrets Manager / External Secrets / connection pool
- #4 CI/CD pipeline — GitHub Actions / ECR / ArgoCD
- #5 Monitoring & Alerting — Prometheus / CloudWatch / Alertmanager ← this post
- #6 Operations checklist — upgrades / backup and recovery / cost / security
Combining two axes — in-cluster Prometheus + managed CloudWatch #
Observability in EKS environments typically combines two axes.
| Axis | Responsibility |
|---|---|
| In-cluster (Prometheus + Grafana + Loki) | Workload metrics, business metrics, alerts, dashboards |
| CloudWatch (Container Insights + Logs) | AWS managed metrics, log long-term retention, AWS console integration |
Using only one is possible, but production clusters typically combine both. Prometheus is the source of truth for operational metrics and alerts, while CloudWatch serves as the integration point for long-term retention and for metrics from AWS-native resources (RDS, ALB, EBS). Amazon Managed Service for Prometheus (AMP) and Amazon Managed Grafana (AMG) are increasingly viable options for reducing in-cluster operational overhead, but this post focuses on the most common in-cluster model.
kube-prometheus-stack — the standard bundle installed at once #
The standard Helm chart covered in Advanced #5. One command brings in Prometheus + Grafana + Alertmanager + kube-state-metrics + node-exporter + Prometheus Operator CRDs all at once.
Installation #
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack \
  -n monitoring --create-namespace \
  --values prometheus-values.yaml

# prometheus-values.yaml
prometheus:
  prometheusSpec:
    retention: 30d
    retentionSize: "50GB"
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 100Gi
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false
    ruleSelectorNilUsesHelmValues: false
    additionalScrapeConfigs:
      - job_name: ec2-spot-instance
        ec2_sd_configs:
          - region: ap-northeast-2
    remoteWrite:
      - url: https://aps-workspaces.ap-northeast-2.amazonaws.com/workspaces/ws-xxx/api/v1/remote_write
        sigv4:
          region: ap-northeast-2
grafana:
  adminPassword: ""
  ingress:
    enabled: true
    ingressClassName: alb
    annotations:
      alb.ingress.kubernetes.io/scheme: internet-facing
      alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'
      alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:...
    hosts:
      - grafana.myshop.example.com
  persistence:
    enabled: true
    storageClassName: gp3
    size: 10Gi
alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          resources:
            requests:
              storage: 10Gi
  config:
    route:
      receiver: default
    receivers:
      - name: default

A few key settings to call out:
- retention: 30d + storageSpec: 30-day metric retention on a 100Gi EBS (gp3) volume, with retentionSize capping TSDB usage at 50GB. To keep metrics longer than 30 days, also ship them to AMP or Thanos via remoteWrite.
- serviceMonitorSelectorNilUsesHelmValues: false: automatically discovers ServiceMonitors in all namespaces. ServiceMonitors in the myshop namespace are picked up even though they do not live in the monitoring namespace.
- remoteWrite: long-term metric storage in Amazon Managed Service for Prometheus (AMP), for cases that need analysis beyond 30 days.
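Whether 30 days of retention actually fits under the 50GB retentionSize cap depends on the ingestion rate. The Prometheus storage documentation gives a rough capacity formula: needed disk = retention seconds × ingested samples per second × bytes per sample, at roughly 1 to 2 bytes per sample. The sketch below applies that arithmetic. The series count and the 30s scrape interval are illustrative assumptions, not measurements from the myshop cluster.

```python
# Rough Prometheus disk sizing based on the capacity formula in the
# Prometheus storage docs. All inputs are illustrative assumptions.

SECONDS_PER_DAY = 24 * 60 * 60
GIB = 1024 ** 3  # Prometheus size units (e.g. "50GB") are powers of 2


def estimate_tsdb_bytes(active_series: int,
                        scrape_interval_s: float,
                        retention_days: float,
                        bytes_per_sample: float = 2.0) -> float:
    """Return the estimated on-disk size in bytes for the retention window."""
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * SECONDS_PER_DAY
    return retention_seconds * samples_per_second * bytes_per_sample


def fits_retention_size(estimated_bytes: float, retention_size_gb: float) -> bool:
    """True when the estimate stays under the retentionSize cap."""
    return estimated_bytes <= retention_size_gb * GIB


if __name__ == "__main__":
    # Hypothetical cluster: 200k active series scraped every 30 seconds.
    size = estimate_tsdb_bytes(active_series=200_000,
                               scrape_interval_s=30,
                               retention_days=30)
    print(f"estimated: {size / GIB:.1f} GiB")
    print("fits under the 50GB cap:", fits_retention_size(size, 50))
```

When the estimate exceeds the cap, Prometheus removes the oldest data first, so the effective retention quietly becomes shorter than 30 days.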
Checks right after installation #
kubectl get pods -n monitoring
kubectl get servicemonitors -A
kubectl get prometheusrules -A

NAME                                                     READY   STATUS    RESTARTS
prometheus-grafana-xxx                                   3/3     Running   0
prometheus-kube-prometheus-operator-xxx                  1/1     Running   0
prometheus-kube-state-metrics-xxx                        1/1     Running   0
prometheus-prometheus-kube-prometheus-prometheus-0       2/2     Running   0
prometheus-prometheus-node-exporter-xxx                  1/1     Running   0
alertmanager-prometheus-kube-prometheus-alertmanager-0   2/2     Running   0

Over 100 default PrometheusRules are included automatically. K8s-native alerts for node down, etcd failure, and kubelet issues are pre-defined, so cluster incidents surface as alerts immediately without any additional configuration.
Adding metric exposure to myshop-api #
Here is the full cycle for exposing myshop-api’s metrics on a cluster with the standard stack installed.
1. Application exposes /metrics #
Prometheus client libraries exist in nearly every language. For Python (FastAPI), it starts with a single line:
from fastapi import FastAPI
from prometheus_fastapi_instrumentator import Instrumentator

app = FastAPI()
# Instrument every route and serve the result on /metrics.
Instrumentator().instrument(app).expose(app)

This single line auto-exposes the following metrics:
- http_requests_total{handler, method, status}: request counter
- http_request_duration_seconds_bucket{handler, method}: latency histogram
- http_request_size_bytes / http_response_size_bytes: payload size
- Standard Python runtime metrics (GC, threads, memory)
Domain-specific metrics (e.g., orders created counter, payment success rate) can be added on top.
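from prometheus_client import Counter, Histogram

# Counter with a "status" label: one series per order status.
orders_created = Counter(
    "myshop_orders_created_total",
    "Total orders created",
    ["status"],
)

# Histogram with the client's default buckets.
checkout_duration = Histogram(
    "myshop_checkout_duration_seconds",
    "Checkout flow duration",
)

# Usage inside a request handler (sketch):
#   orders_created.labels(status="created").inc()
#   with checkout_duration.time():
#       ...  # checkout logic

2. ServiceMonitor manifest #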
Create a ServiceMonitor that Prometheus Operator will watch.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: {{ include "myshop-api.fullname" . }}
  namespace: {{ .Release.Namespace }}
  labels:
    app.kubernetes.io/name: myshop-api
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: myshop-api
  endpoints:
    - port: http
      interval: 30s
      path: /metrics

From the moment this manifest is applied, Prometheus starts scraping /metrics on all myshop-api Pods every 30 seconds. Once the query http_requests_total{namespace="myshop"} returns data in Grafana’s Explore, metric collection is confirmed to be working.
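The link between the ServiceMonitor and the workload is purely label-based: Services whose labels contain every pair in selector.matchLabels are selected (in the ServiceMonitor's own namespace unless a namespaceSelector says otherwise), and the named port behind them is scraped. A minimal sketch of that subset check follows, useful when a target fails to show up. The Service names and label sets are hypothetical.

```python
# matchLabels semantics: every key/value pair in the selector must be present
# on the object; extra labels on the object are ignored.

def matches_labels(selector: dict[str, str], labels: dict[str, str]) -> bool:
    return all(labels.get(key) == value for key, value in selector.items())


if __name__ == "__main__":
    selector = {"app.kubernetes.io/name": "myshop-api"}

    # Hypothetical Services in the myshop namespace.
    services = {
        "myshop-api": {"app.kubernetes.io/name": "myshop-api",
                       "app.kubernetes.io/instance": "myshop"},
        "myshop-worker": {"app.kubernetes.io/name": "myshop-worker"},
        "legacy-api": {"app": "myshop-api"},  # different label key, not selected
    }
    selected = [name for name, labels in services.items()
                if matches_labels(selector, labels)]
    print(selected)
```

A missing target is most often a mismatch between these labels and the Service's labels, or a port name on the Service that differs from the one in endpoints.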
4 golden signals — the skeleton of the alert rule set #
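Below are the 4 golden signals (Latency / Traffic / Errors / Saturation) covered in Advanced #5, written as a PrometheusRule for myshop-api. The expressions filter on an app label. Targets discovered through a ServiceMonitor do not carry Pod labels by default, so this assumes the label is copied onto the series (for example with podTargetLabels in the ServiceMonitor); otherwise, select on job or service instead.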
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: {{ include "myshop-api.fullname" . }}
  namespace: {{ .Release.Namespace }}
  labels:
    release: prometheus
spec:
  groups:
    - name: myshop-api.golden-signals
      interval: 30s
      rules:
        # Errors: 5xx rate
        - alert: MyshopApiHighErrorRate
          expr: |
            sum(rate(http_requests_total{app="myshop-api",status=~"5.."}[5m]))
            / sum(rate(http_requests_total{app="myshop-api"}[5m])) > 0.05
          for: 5m
          labels:
            severity: critical
            team: backend
          annotations:
            summary: "myshop-api 5xx rate > 5% ({{ "{{ $value | humanizePercentage }}" }})"
            description: "5xx rate stayed above 5% for 5+ minutes."
            runbook_url: "https://runbooks.myshop.example.com/myshop-api-5xx"
        # Latency: P95
        - alert: MyshopApiHighLatencyP95
          expr: |
            histogram_quantile(0.95,
              sum by (le) (rate(http_request_duration_seconds_bucket{app="myshop-api"}[5m]))
            ) > 1.0
          for: 10m
          labels:
            severity: warning
            team: backend
          annotations:
            # Single-quoted so the rendered printf "%.2f" does not end the YAML string early.
            summary: 'myshop-api P95 latency > 1s ({{ "{{ $value | printf \"%.2f\" }}" }}s)'
        # Traffic: sudden traffic drop (downstream failure signal)
        - alert: MyshopApiTrafficDrop
          expr: |
            sum(rate(http_requests_total{app="myshop-api"}[5m]))
            < 0.3 * sum(rate(http_requests_total{app="myshop-api"}[5m] offset 1h))
          for: 10m
          labels:
            severity: warning
            team: backend
          annotations:
            summary: "myshop-api traffic drop (less than 30% of last hour)"
        # Saturation: Pod memory utilization
        # container!="" drops the pod-level cgroup series so usage is not double-counted.
        - alert: MyshopApiPodMemoryHigh
          expr: |
            sum by (pod) (
              container_memory_working_set_bytes{namespace="myshop",pod=~"myshop-api-.*",container!=""}
            ) / sum by (pod) (
              kube_pod_container_resource_limits{namespace="myshop",pod=~"myshop-api-.*",resource="memory"}
            ) > 0.85
          for: 10m
          labels:
            severity: warning
            team: backend
          annotations:
            summary: "myshop-api Pod memory > 85% of limit"

Three key patterns to note across each rule:
- for period: requires the condition to hold for 5 to 10 minutes so alerts do not fire on short-lived spikes.
- severity label: critical for an immediate page, warning for next-business-day review. This label drives Alertmanager routing.
- runbook_url: a link to the response procedure that the alert recipient can follow immediately. It embodies the principle of one alert = one clear response.
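The for clause can be read as a small state machine: a rule becomes pending the first time its expression is true, and fires only once the expression has stayed true on every evaluation for the whole duration. A single false evaluation resets it. The sketch below replays that logic for the 5xx rule (threshold 0.05, for 5m) over a series of evaluations. The sample ratios and the 60-second evaluation interval are invented, and the function is a simplified model, not Prometheus's implementation.

```python
# Simplified model of Prometheus alert states with a `for` duration.
from typing import Iterable


def alert_states(values: Iterable[float], threshold: float,
                 for_seconds: int, interval_seconds: int) -> list[str]:
    """Return 'inactive' / 'pending' / 'firing' for each evaluation."""
    states = []
    active_since = None  # evaluation time at which the condition became true
    for i, value in enumerate(values):
        now = i * interval_seconds
        if value > threshold:
            if active_since is None:
                active_since = now
            held = now - active_since
            states.append("firing" if held >= for_seconds else "pending")
        else:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
    return states


if __name__ == "__main__":
    # Invented 5xx ratios, one per evaluation: a short spike, then a real incident.
    ratios = [0.02, 0.09, 0.08, 0.01, 0.07, 0.07, 0.07, 0.07, 0.07, 0.07]
    for ratio, state in zip(ratios, alert_states(ratios, 0.05, 300, 60)):
        print(f"{ratio:.2f} -> {state}")
```

The two-evaluation spike at the start never leaves pending, which is exactly the noise the for period is meant to absorb. The cost is that a real incident is also reported at least that much later.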
Alertmanager routing — Slack and PagerDuty branching #
The alert flow is Prometheus → Alertmanager → channel. Alertmanager inspects labels to determine routing.
route:
  receiver: default
  group_by: ['alertname', 'team', 'namespace']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - severity = "critical"
      receiver: pagerduty-backend
      continue: true
      routes:
        - matchers:
            - severity = "critical"
            - team = "backend"
          receiver: pagerduty-backend
        - matchers:
            - severity = "critical"
            - team = "platform"
          receiver: pagerduty-platform
    - matchers:
        - severity = "warning"
      receiver: slack-warnings
      group_wait: 1m
      repeat_interval: 12h
receivers:
  - name: default
    slack_configs:
      - api_url: ${SLACK_WEBHOOK}
        channel: '#alerts'
  - name: slack-warnings
    slack_configs:
      - api_url: ${SLACK_WEBHOOK}
        channel: '#alerts-warning'
        title: '⚠️ {{ "{{ .GroupLabels.alertname }}" }}'
  - name: pagerduty-backend
    pagerduty_configs:
      - service_key: ${PAGERDUTY_BACKEND_KEY}
  - name: pagerduty-platform
    pagerduty_configs:
      - service_key: ${PAGERDUTY_PLATFORM_KEY}
inhibit_rules:
  - source_matchers: [severity = "critical"]
    target_matchers: [severity = "warning"]
    equal: [alertname, namespace]

Three key patterns:
- Branching by severity: critical pages PagerDuty; warning notifies Slack.
- Branching by team: even within critical, backend and platform teams are paged separately.
- inhibit_rules: while a critical alert is firing, warning alerts with the same alertname and namespace are suppressed. This prevents alert flooding.
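How an alert travels through the routing tree above can be traced by hand: a node must match to be entered, children are tried in order, the first matching child wins unless it sets continue, and a node with no matching child handles the alert itself. The sketch below models that walk for equality matchers only, with the tree transcribed from the config. It is a simplified illustration and ignores grouping, timing and regex matchers.

```python
# Simplified model of Alertmanager's routing tree (equality matchers only).
from dataclasses import dataclass, field


@dataclass
class Route:
    receiver: str
    match: dict[str, str] = field(default_factory=dict)  # empty matches everything
    continue_: bool = False
    routes: list["Route"] = field(default_factory=list)

    def matches(self, labels: dict[str, str]) -> bool:
        return all(labels.get(k) == v for k, v in self.match.items())


def resolve(route: Route, labels: dict[str, str]) -> list[str]:
    """Return the receivers an alert with `labels` is delivered to."""
    if not route.matches(labels):
        return []
    receivers: list[str] = []
    for child in route.routes:
        matched = resolve(child, labels)
        receivers.extend(matched)
        if matched and not child.continue_:
            break  # first matching child wins unless it sets continue
    # No child matched: the alert stays with the current node.
    return receivers or [route.receiver]


# The tree from the config above.
TREE = Route(
    receiver="default",
    routes=[
        Route("pagerduty-backend", {"severity": "critical"}, continue_=True, routes=[
            Route("pagerduty-backend", {"severity": "critical", "team": "backend"}),
            Route("pagerduty-platform", {"severity": "critical", "team": "platform"}),
        ]),
        Route("slack-warnings", {"severity": "warning"}),
    ],
)

if __name__ == "__main__":
    for labels in ({"severity": "critical", "team": "platform"},
                   {"severity": "critical", "team": "data"},
                   {"severity": "warning", "team": "backend"},
                   {"severity": "info"}):
        print(labels, "->", resolve(TREE, labels))
```

Tracing it this way also shows two details of this config: a critical alert from a team without its own child route falls back to pagerduty-backend, and continue: true on the critical route has no visible effect because the only sibling after it matches warnings.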
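The secrets (SLACK_WEBHOOK, PAGERDUTY_*) are injected via External Secrets, as covered in #3. Alertmanager does not expand ${...} placeholders itself, so they must be substituted before the config is loaded, or the secret must be mounted and referenced through a file-based field such as api_url_file.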
Loki — adding the log stack #
Beyond metrics, capturing logs alongside them is standard. Apply the Loki stack from Advanced #5 as-is.
helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki-stack \
  -n monitoring \
  --set promtail.enabled=true \
  --set loki.persistence.enabled=true \
  --set loki.persistence.storageClassName=gp3 \
  --set loki.persistence.size=100Gi

After installation, the Loki data source is automatically added to Grafana and LogQL queries become available in Explore.

{namespace="myshop", app="myshop-api"} |= "ERROR"

sum(rate({namespace="myshop", app="myshop-api"} |= "ERROR" [5m]))

For long-term retention, set Loki’s storage backend to S3. This is the standard operational setup.
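The second LogQL query above turns log lines into a metric: it counts the lines containing ERROR in a 5-minute window and divides by the window length in seconds. A stdlib sketch of that computation over in-memory lines makes the unit (lines per second) explicit. The timestamps and messages are invented, and a single log stream is assumed.

```python
# What rate({...} |= "ERROR" [5m]) computes, in miniature:
# matching lines inside the window divided by the window length in seconds.

def error_rate(lines: list[tuple[float, str]], now: float,
               window_seconds: float = 300.0, needle: str = "ERROR") -> float:
    """`lines` holds (unix_timestamp, message) pairs for one log stream."""
    start = now - window_seconds
    matching = sum(1 for ts, msg in lines
                   if start < ts <= now and needle in msg)  # |= is case-sensitive
    return matching / window_seconds


if __name__ == "__main__":
    now = 1_700_000_300.0
    lines = [
        (now - 400, "ERROR payment gateway timeout"),   # outside the window
        (now - 250, "INFO order created"),
        (now - 200, "ERROR payment gateway timeout"),
        (now - 120, "error lowercase does not match"),
        (now - 30, "ERROR db connection reset"),
        (now - 5, "ERROR db connection reset"),
    ]
    print(f"{error_rate(lines, now):.4f} error lines/s")
```

Multiplying the result by 60 gives errors per minute, which is often the easier number to read on a dashboard panel.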
CloudWatch Container Insights — the second axis #
Installing CloudWatch Container Insights on the same EKS cluster lets you immediately check cluster, node, Pod, and container metrics from the AWS console. For teams familiar with the AWS console, this reduces the daily monitoring burden.
helm repo add aws-observability https://aws-observability.github.io/helm-charts
helm install amazon-cloudwatch-observability \
  aws-observability/amazon-cloudwatch-observability \
  -n amazon-cloudwatch --create-namespace \
  --set clusterName=myshop-prod \
  --set region=ap-northeast-2

This chart deploys Fluent Bit as a DaemonSet to ship each node’s container stdout/stderr to CloudWatch Logs, and the CloudWatch Agent to collect metrics.
The role of Fluent Bit #
The standard setup has Fluent Bit read each node’s /var/log/containers/ and route logs to the following two destinations.
container logs
  │
  ├─→ Loki (in-cluster, short-term search)
  └─→ CloudWatch Logs (S3 export, long-term retention)

The same logs go to two destinations because their responsibilities differ: Loki for daily debugging, CloudWatch for compliance, auditing, and long-term analysis. Using only Loki is cheaper, but in regulated environments CloudWatch is typically added as well.
Grafana dashboard standards #
The standard dashboard set going into operational cluster Grafana:
| Dashboard | Source |
|---|---|
| Kubernetes / Compute Resources / Cluster | kube-prometheus-stack default (ID 7249) |
| Kubernetes / Compute Resources / Namespace (Workloads) | default (ID 7250) |
| Kubernetes / Compute Resources / Pod | default (ID 7251) |
| Kubernetes / Networking / Cluster | default (ID 7253) |
| Node Exporter / Nodes | default (ID 1860) |
| myshop-api operational dashboard | self-authored — golden signals + business metrics |
The 5 default dashboards are auto-registered by kube-prometheus-stack and are immediately visible without any extra work. Adding a single self-authored dashboard tailored to the domain nearly completes the daily monitoring view.
Standard panel set for the self-authored dashboard #
Row 1: Latency P50 / P95 / P99
Row 2: Request rate (per domain, per status)
Row 3: Error rate (4xx / 5xx)
Row 4: Pod CPU / memory utilization
Row 5: HPA current replicas
Row 6: PgBouncer active connections / wait queue
Row 7: business metrics (orders/min, checkout success rate)
Row 8: top ERROR logs (Loki)
Row 9: recent deploys (annotations)

The last panel, “recent deploy annotations,” injects ArgoCD or GitHub Actions events into Grafana as annotations. Deploy timestamps appear as vertical lines on metric graphs, making it easy to see at a glance which deploy a latency spike followed.
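One way to produce those vertical lines is to have the deploy job post to Grafana's annotations HTTP API (POST /api/annotations) when a rollout finishes. The sketch below only builds the request and does not send it. The Grafana host reuses the one from the values file, while the token, the tag names and the build_deploy_annotation helper are hypothetical.

```python
import json
import urllib.request


def build_deploy_annotation(grafana_url: str, api_token: str,
                            service: str, version: str,
                            deployed_at_epoch_s: float) -> urllib.request.Request:
    """Build (but do not send) a POST /api/annotations request."""
    payload = {
        "time": int(deployed_at_epoch_s * 1000),  # Grafana expects epoch milliseconds
        "tags": ["deploy", service],
        "text": f"{service} {version} deployed",
    }
    return urllib.request.Request(
        url=f"{grafana_url.rstrip('/')}/api/annotations",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_token}",  # hypothetical service account token
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    request = build_deploy_annotation(
        "https://grafana.myshop.example.com",
        api_token="<token>",
        service="myshop-api",
        version="v1.2.3",  # made-up version string
        deployed_at_epoch_s=1_700_000_000,
    )
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
```

From the pipeline, sending it is a single urllib.request.urlopen(request) call. A dashboard annotation query filtered on the deploy tag then draws the markers on every panel.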
on-call flow — together with runbooks #
An alert firing is not the end of the story. The recipient needs to know exactly where to look within 5 minutes. The standard on-call flow:
1. Check alert body in PagerDuty (alertname, team, severity)
2. Click runbook_url in annotation
3. Follow Runbook's "primary check" section — relevant Grafana dashboard / log query / kubectl commands presented
4. Primary response (scale up, restart, traffic block, etc.)
5. Share status in Slack incident channel

Managing runbooks as Markdown in a separate git repo is the standard. Each alert rule’s runbook_url points to a page in that repo, and a new alert is added together with its runbook in the same PR.
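Because the runbook arrives together with the alert, the pairing is easy to enforce mechanically. Below is a minimal sketch of such a check over already-parsed rule groups (plain dicts, as a YAML loader would return them). The function name and the idea of running it in CI are suggestions, not something the series prescribes.

```python
def rules_missing_runbook(groups: list[dict]) -> list[str]:
    """Return the names of alerting rules without a runbook_url annotation."""
    missing = []
    for group in groups:
        for rule in group.get("rules", []):
            if "alert" not in rule:
                continue  # recording rules have no runbook to link
            if not rule.get("annotations", {}).get("runbook_url"):
                missing.append(rule["alert"])
    return missing


if __name__ == "__main__":
    # Trimmed-down copy of two rules from the golden-signals group.
    groups = [{
        "name": "myshop-api.golden-signals",
        "rules": [
            {"alert": "MyshopApiHighErrorRate",
             "annotations": {
                 "summary": "myshop-api 5xx rate > 5%",
                 "runbook_url": "https://runbooks.myshop.example.com/myshop-api-5xx",
             }},
            {"alert": "MyshopApiHighLatencyP95",
             "annotations": {"summary": "myshop-api P95 latency > 1s"}},
        ],
    }]
    print(rules_missing_runbook(groups))
```

Run against the full rule set from this post, a check like this would flag the latency, traffic and saturation alerts, which currently carry only a summary.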
Checks after first operational cycle #
Items to check after the stack has been installed and running for a few days.
# 1. Cardinality: which metrics and labels dominate the TSDB (pass the data directory, not the WAL)
kubectl exec -n monitoring prometheus-prometheus-kube-prometheus-prometheus-0 -c prometheus -- \
  promtool tsdb analyze /prometheus | head -50

# 2. Alert noise: what is currently firing
kubectl exec -n monitoring alertmanager-prometheus-kube-prometheus-alertmanager-0 -c alertmanager -- \
  amtool alert query --alertmanager.url=http://localhost:9093

# 3. Log disk pressure: Loki volume usage
kubectl exec -n monitoring loki-0 -- df -h /data

After a month of operations, cardinality explosion, alert signal-to-noise degradation, and log disk pressure tend to surface at least once each. Scheduling these as periodic checks is standard practice.
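Cardinality explosion usually traces back to a label whose value set is unbounded, because the series count of a metric is bounded by the product of its distinct label value counts. A small sketch of that arithmetic shows why something like a user ID should not become a label. The label names and counts are illustrative.

```python
from math import prod


def max_series(label_value_counts: dict[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct label values."""
    return prod(label_value_counts.values())


if __name__ == "__main__":
    # Illustrative label sets for a request counter.
    bounded = {"handler": 20, "method": 4, "status": 5}
    unbounded = {**bounded, "user_id": 50_000}

    print("bounded labels:", max_series(bounded))
    print("with user_id:  ", max_series(unbounded))
```

For a histogram, the bound is multiplied again by the number of buckets, so latency metrics are the first place to look when the analyze output shows a sudden jump.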
Closing #
We walked through the full cycle of laying an observability stack on top of myshop-api: installing Prometheus + Grafana + Alertmanager in one shot with kube-prometheus-stack, standardizing myshop-api’s 4 golden signals alerts via ServiceMonitor + PrometheusRule, and locking in Slack/PagerDuty routing by severity and team through Alertmanager. We also added the dual log axes of Loki and CloudWatch and established the operational pattern of tying every alert to a runbook URL. At this point, the full loop from code through deploy, operations, and observation is automated. The next and final post in the series covers the periodic operational cycle for running this cluster safely across months, quarters, and years — EKS upgrades, RDS backup/recovery, cost management, and security checks — along with a retrospective of the entire K8s Practice series.