Monitoring · Alerts
The myshop-api built through Chapter 24 is automated from code to deployment, but operations stall if you cannot see how it behaves. This chapter layers an observability stack onto the EKS cluster. We install Prometheus · Grafana · Alertmanager in one step with kube-prometheus-stack, standardize myshop-api metrics and the 4 golden signals alerts with ServiceMonitor / PrometheusRule, capture logs with Loki, cover AWS-coupled metrics and long-term retention with CloudWatch Container Insights, and organize the Slack / PagerDuty on-call flow with routing by severity and team.
Having gone through Chapter 24 CI / CD pipeline, myshop-api already has new versions coming in automatically, but half of the operations work is seeing that behavior. If you cannot see where and how CPU · memory · request latency · and error rate change, canary promotion is impossible and incident response is slow. This chapter layers the observability stack on top of EKS.
It is the stage where the standard stack (Prometheus + Grafana + Loki + Alertmanager) covered in Chapter 19 Observability becomes a full EKS · AWS-coupled operational setup. If Chapter 19 identified the three data kinds — metric · log · trace — at the object level, this chapter adds the operational layer of alert rule sets · routing · on-call procedures. The goal is to have myshop-api’s 4 golden signals alerts in place, so critical ones are paged to PagerDuty and warnings go to Slack.
Combining the two axes — in-cluster Prometheus + managed CloudWatch #
EKS-environment observability is usually achieved by combining two axes.
| Axis | Responsibility |
|---|---|
| In-cluster (Prometheus + Grafana + Loki) | Workload metrics, business metrics, alerts, dashboards |
| CloudWatch (Container Insights + Logs) | AWS managed metrics, long-term log retention, AWS console integration |
Using only one of the two is possible, but the standard for a production cluster is to combine them. Prometheus is the baseline source of operational metrics and alerts, and CloudWatch is the integration point for long-term retention and for metrics of AWS’s own resources (RDS · ALB · EBS). The decision in Chapter 21 EKS setup to send PostgreSQL logs to CloudWatch with RDS’s enabled_cloudwatch_logs_exports joins naturally at this chapter’s second axis.
AWS’s managed Prometheus (AMP) and managed Grafana (AMG) are establishing themselves as options that reduce the in-cluster operational burden, but in this chapter we look mainly at the most common in-cluster model and touch on the remote write option to AMP alongside it.
kube-prometheus-stack — the standard bundle that installs at once #
This is the standard Helm chart touched on in Chapter 19 Observability §“kube-prometheus-stack.” One command installs Prometheus + Grafana + Alertmanager + kube-state-metrics + node-exporter + the Prometheus Operator, together with the Operator’s CRDs. The Operator model of Chapter 18 CRD and Operator leads into the full metric stack here.
Install #
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack \
  -n monitoring --create-namespace \
  --values prometheus-values.yaml

# prometheus-values.yaml
prometheus:
  prometheusSpec:
    retention: 30d
    retentionSize: "50GB"
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 100Gi
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false
    ruleSelectorNilUsesHelmValues: false
    additionalScrapeConfigs:
      - job_name: ec2-spot-instance
        ec2_sd_configs:
          - region: ap-northeast-2
    remoteWrite:
      - url: https://aps-workspaces.ap-northeast-2.amazonaws.com/workspaces/ws-xxx/api/v1/remote_write
        sigv4:
          region: ap-northeast-2
grafana:
  adminPassword: ""
  ingress:
    enabled: true
    ingressClassName: alb
    annotations:
      alb.ingress.kubernetes.io/scheme: internet-facing
      alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'
      alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:...
    hosts:
      - grafana.myshop.example.com
  persistence:
    enabled: true
    storageClassName: gp3
    size: 10Gi
alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          resources:
            requests:
              storage: 10Gi
  config:
    route:
      receiver: default
    receivers:
      - name: default

We point out the three key settings.
- `retention: 30d` + `storageSpec`: 30 days of metric retention on a 100 GiB EBS volume, with `retentionSize` capping the stored data at 50 GB. To retain metrics beyond 30 days, send them to AMP or Thanos with `remoteWrite`. The gp3 StorageClass of Chapter 9 PV / PVC / StorageClass becomes the source of a full production PV.
- `serviceMonitorSelectorNilUsesHelmValues: false`: Prometheus picks up ServiceMonitors in every namespace instead of only those carrying this Helm release's label. The ServiceMonitor in the myshop namespace is therefore scraped even though it lives outside the monitoring namespace.
- `remoteWrite`: stores metrics long-term in AWS Managed Prometheus (AMP). This is the option for cases that need analysis beyond 30 days.
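As a rough check on whether 30 days of retention fits under the 50 GB `retentionSize`, the Prometheus storage documentation gives a rule of thumb: needed disk space = retention time in seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample. The sketch below applies that formula. The series count and scrape interval are hypothetical inputs, not measurements from the myshop cluster, and `estimate_disk_bytes` is an illustrative helper name.

```python
# Rough Prometheus TSDB disk estimate based on the rule of thumb in the
# Prometheus storage documentation. All input numbers are assumptions.

SECONDS_PER_DAY = 24 * 60 * 60


def estimate_disk_bytes(active_series: int,
                        scrape_interval_s: float,
                        retention_days: int,
                        bytes_per_sample: float = 2.0) -> float:
    """retention_seconds * samples_per_second * bytes_per_sample."""
    samples_per_second = active_series / scrape_interval_s
    return retention_days * SECONDS_PER_DAY * samples_per_second * bytes_per_sample


if __name__ == "__main__":
    # Hypothetical cluster: 200k active series, 30s scrape interval, 30 days.
    needed = estimate_disk_bytes(200_000, 30, 30)
    print(f"estimated: {needed / 1e9:.1f} GB")
```

Under these assumed inputs the estimate stays below the 50 GB cap; doubling the series count would not, which is one reason to watch cardinality before raising retention.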
Grafana’s Ingress is resolved to an ALB by the AWS Load Balancer Controller created in Chapter 22 App deployment skeleton — it’s the shape where the same component handles the three entry points of myshop-api, ArgoCD, and Grafana.
Checks right after install #
kubectl get pods -n monitoring
kubectl get servicemonitors -A
kubectl get prometheusrules -A

NAME                                                     READY   STATUS    RESTARTS
prometheus-grafana-xxx                                   3/3     Running   0
prometheus-kube-prometheus-operator-xxx                  1/1     Running   0
prometheus-kube-state-metrics-xxx                        1/1     Running   0
prometheus-prometheus-kube-prometheus-prometheus-0       2/2     Running   0
prometheus-prometheus-node-exporter-xxx                  1/1     Running   0
alertmanager-prometheus-kube-prometheus-alertmanager-0   2/2     Running   0

About 100 or so default PrometheusRules come in automatically. K8s’s own alerts like node down, etcd failure, and kubelet issues are predefined inside them, so a cluster incident becomes an alert right away with no separate work. Policy violations covered in Chapter 14 RBAC / NetworkPolicy / ResourceQuota are also automatically turned into metrics through kube-state-metrics.
Adding metric exposure to myshop-api #
This is the flow for exposing myshop-api’s metrics on a cluster where the standard stack is installed.
1. The application exposes /metrics #
A Prometheus client library exists for almost every language. For Python (FastAPI), you start with the following one line.
from fastapi import FastAPI
from prometheus_fastapi_instrumentator import Instrumentator

app = FastAPI()
# Instrument every route and serve the collected metrics on /metrics.
Instrumentator().instrument(app).expose(app)

This one line automatically exposes the following metrics.
- `http_requests_total{handler, method, status}`: request counter
- `http_request_duration_seconds_bucket{handler, method}`: latency histogram
- `http_request_size_bytes` / `http_response_size_bytes`: payload size
- Standard Python runtime metrics (GC, threads, memory)
Domain metrics (order-creation counter, payment success rate, etc.) are added on top of that.
from prometheus_client import Counter, Histogram

# Counter of created orders, split by outcome via the "status" label.
orders_created = Counter(
    "myshop_orders_created_total",
    "Total orders created",
    ["status"],
)
# Histogram of how long the checkout flow takes, in seconds.
checkout_duration = Histogram(
    "myshop_checkout_duration_seconds",
    "Checkout flow duration",
)

This is where the business metrics pointed out in Chapter 19 §“RED · USE · 4 golden signals” turn into actual code, one line at a time.
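Whichever library records them, these metrics reach Prometheus as plain text lines served on /metrics. The sketch below renders one counter by hand in the Prometheus text exposition format, using only the standard library, to show what a scrape returns. `render_counter` is a hypothetical helper for illustration; in the real service `prometheus_client` produces this output (plus extra series such as `_created`), and the sample values are made up.

```python
# Hand-rolled rendering of a counter in the Prometheus text exposition format.
# Illustration only: in the real service prometheus_client generates this text.

def render_counter(name: str, help_text: str,
                   samples: dict[tuple[tuple[str, str], ...], float]) -> str:
    """Render one counter family; samples maps label pairs to values."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in sorted(samples.items()):
        label_str = ",".join(f'{k}="{v}"' for k, v in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = render_counter(
        "myshop_orders_created_total",
        "Total orders created",
        {
            (("status", "success"),): 42.0,
            (("status", "failed"),): 3.0,
        },
    )
    print(text, end="")
```

Each distinct label combination becomes its own line, and therefore its own time series, which is why label choice matters later in the chapter.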
2. ServiceMonitor manifest #
We make a ServiceMonitor for the Prometheus Operator to watch.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: {{ include "myshop-api.fullname" . }}
  namespace: {{ .Release.Namespace }}
  labels:
    app.kubernetes.io/name: myshop-api
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: myshop-api
  endpoints:
    - port: http
      interval: 30s
      path: /metrics

From the moment this manifest is applied, Prometheus starts scraping /metrics of all of myshop-api’s Pods every 30 seconds. If data shows up in Grafana’s Explore with http_requests_total{namespace="myshop"}, metric collection is healthy. It’s the shape where the ServiceMonitor CRD of Chapter 18 settles in as one manifest of a full metric pipeline.
4 golden signals — the skeleton of the alert rule set #
We write the 4 golden signals (Latency / Traffic / Errors / Saturation) covered in Chapter 19 as myshop-api’s PrometheusRule.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: {{ include "myshop-api.fullname" . }}
  namespace: {{ .Release.Namespace }}
  labels:
    release: prometheus
spec:
  groups:
    - name: myshop-api.golden-signals
      interval: 30s
      rules:
        # Errors: 5xx ratio
        - alert: MyshopApiHighErrorRate
          expr: |
            sum(rate(http_requests_total{app="myshop-api",status=~"5.."}[5m]))
              / sum(rate(http_requests_total{app="myshop-api"}[5m])) > 0.05
          for: 5m
          labels:
            severity: critical
            team: backend
          annotations:
            summary: "myshop-api 5xx rate > 5% ({{ "{{ $value | humanizePercentage }}" }})"
            description: "5xx ratio has stayed above 5% for over 5 minutes."
            runbook_url: "https://runbooks.myshop.example.com/myshop-api-5xx"
        # Latency: P95
        - alert: MyshopApiHighLatencyP95
          expr: |
            histogram_quantile(0.95,
              sum by (le) (rate(http_request_duration_seconds_bucket{app="myshop-api"}[5m]))
            ) > 1.0
          for: 10m
          labels:
            severity: warning
            team: backend
          annotations:
            # Single-quoted so the rendered printf "%.2f" does not end the YAML string early.
            summary: 'myshop-api P95 latency > 1s ({{ "{{ $value | printf \"%.2f\" }}" }}s)'
        # Traffic: sharp traffic drop (a downstream-failure signal)
        - alert: MyshopApiTrafficDrop
          expr: |
            sum(rate(http_requests_total{app="myshop-api"}[5m]))
              < 0.3 * sum(rate(http_requests_total{app="myshop-api"}[5m] offset 1h))
          for: 10m
          labels:
            severity: warning
            team: backend
          annotations:
            summary: "myshop-api traffic sharp drop (below 30% vs the past 1 hour)"
        # Saturation: Pod memory usage
        # container!="" skips the pod-level cgroup series so usage is not counted twice.
        - alert: MyshopApiPodMemoryHigh
          expr: |
            sum by (pod) (
              container_memory_working_set_bytes{namespace="myshop",pod=~"myshop-api-.*",container!=""}
            ) / sum by (pod) (
              kube_pod_container_resource_limits{namespace="myshop",pod=~"myshop-api-.*",resource="memory"}
            ) > 0.85
          for: 10m
          labels:
            severity: warning
            team: backend
          annotations:
            summary: "myshop-api Pod memory > 85% of limit"

We point out the three key patterns of each rule.
- The `for` duration: requires the condition to hold for 5 to 10 minutes so an alert doesn’t fire on a short spike.
- The `severity` label: `critical` is paged immediately, `warning` is reviewed the next business day. It’s the key for Alertmanager routing.
- `runbook_url`: the response-procedure document that the person who received the alert can follow right away. It follows the principle of one alert = one clear response.
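The effect of the `for` clause can be modeled as a small state machine: while the expression stays true the alert is pending, and it moves to firing only once it has been true for the full duration. The sketch below is a simplified model of that behavior, not Prometheus's implementation; `alert_states` and its parameters are illustrative names.

```python
# Minimal model of how a `for:` clause gates an alert: the expression must be
# true on every evaluation for the whole duration before the alert fires.

def alert_states(expr_results: list[bool], interval_s: int, for_s: int) -> list[str]:
    """Return the alert state after each rule evaluation."""
    states = []
    active_since = None  # evaluation time at which the expression became true
    for i, is_true in enumerate(expr_results):
        now = i * interval_s
        if not is_true:
            # Any false evaluation resets the timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        states.append("firing" if now - active_since >= for_s else "pending")
    return states


if __name__ == "__main__":
    # A one-minute spike with interval 30s and for: 5m never fires.
    print(alert_states([True, True, False], interval_s=30, for_s=300))
    # A sustained breach fires at the evaluation 300 seconds after it began.
    print(alert_states([True] * 11, interval_s=30, for_s=300)[-1])
```

With the 30s evaluation interval used in the rule group, `for: 5m` means the eleventh consecutive true evaluation is the first one that fires.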
The Argo Rollouts AnalysisTemplate of Chapter 24 CI / CD pipeline uses the same PromQL expression as this chapter’s first rule (5xx ratio) as the input of automation. The shape where the same metric is used in two places at once — a human’s alert and code’s promote decision — is the operational value of observability.
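To make the "one metric, two consumers" idea concrete, the sketch below computes a 5xx ratio once and feeds it to both an alert check and a promote check. It is an illustration in plain Python, not the PromQL or the AnalysisTemplate themselves. The 5% alert threshold comes from the rule above; the 1% promote limit and all function names are assumptions.

```python
# One ratio, two consumers: the alert threshold and a canary promote gate.
# Only the 5% alert level comes from the chapter; the rest is assumed.

def error_ratio(requests_by_status: dict[str, float]) -> float:
    """Share of 5xx among all requests; 0.0 when there is no traffic."""
    total = sum(requests_by_status.values())
    if total == 0:
        return 0.0
    errors = sum(v for status, v in requests_by_status.items()
                 if status.startswith("5"))
    return errors / total


def should_alert(ratio: float, threshold: float = 0.05) -> bool:
    """Mirrors the `> 0.05` comparison of MyshopApiHighErrorRate."""
    return ratio > threshold


def should_promote(ratio: float, max_ratio: float = 0.01) -> bool:
    """Assumed canary gate; the real limit lives in the AnalysisTemplate."""
    return ratio <= max_ratio


if __name__ == "__main__":
    ratio = error_ratio({"2xx": 90.0, "4xx": 0.0, "5xx": 10.0})
    print(ratio, should_alert(ratio), should_promote(ratio))
```

One difference from PromQL worth noting: with zero traffic the PromQL division yields no usable value and the alert stays silent, while this sketch returns 0.0 to keep the example simple.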
Alertmanager routing — the split between Slack and PagerDuty #
The flow of an alert is Prometheus → Alertmanager → channel. Alertmanager looks at labels and decides routing.
route:
  receiver: default
  group_by: ['alertname', 'team', 'namespace']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - severity = "critical"
      receiver: pagerduty-backend
      continue: true
      routes:
        - matchers:
            - severity = "critical"
            - team = "backend"
          receiver: pagerduty-backend
        - matchers:
            - severity = "critical"
            - team = "platform"
          receiver: pagerduty-platform
    - matchers:
        - severity = "warning"
      receiver: slack-warnings
      group_wait: 1m
      repeat_interval: 12h
receivers:
  - name: default
    slack_configs:
      - api_url: ${SLACK_WEBHOOK}
        channel: '#alerts'
  - name: slack-warnings
    slack_configs:
      - api_url: ${SLACK_WEBHOOK}
        channel: '#alerts-warning'
        title: '{{ "{{ .GroupLabels.alertname }}" }}'
  - name: pagerduty-backend
    pagerduty_configs:
      - service_key: ${PAGERDUTY_BACKEND_KEY}
  - name: pagerduty-platform
    pagerduty_configs:
      - service_key: ${PAGERDUTY_PLATFORM_KEY}
inhibit_rules:
  - source_matchers: [severity = "critical"]
    target_matchers: [severity = "warning"]
    equal: [alertname, namespace]

The three key patterns.
- Split by severity — critical is paged to PagerDuty, warning is notified to Slack.
- Split by team — even the same critical pages the backend / platform teams separately.
inhibit_rules— while a critical of the same alertname is firing, the warning of the same namespace is bundled and muted. The standard pattern for preventing an alert storm.
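The routing tree above can be walked by hand, but a small model makes the rules explicit: child routes are tried in order, the first match wins unless `continue: true` is set, and a node handles the alert itself when none of its children match. The sketch below is a simplified model supporting equality matchers only, not Alertmanager's code; the `Route` class and `resolve` function are illustrative names.

```python
# Simplified model of Alertmanager's routing walk (equality matchers only).
from dataclasses import dataclass, field


@dataclass
class Route:
    receiver: str
    matchers: dict[str, str] = field(default_factory=dict)
    continue_: bool = False
    routes: list["Route"] = field(default_factory=list)

    def matches(self, labels: dict[str, str]) -> bool:
        return all(labels.get(k) == v for k, v in self.matchers.items())


def resolve(route: Route, labels: dict[str, str]) -> list[str]:
    """Receivers an alert reaches, walking the tree depth-first."""
    if not route.matches(labels):
        return []
    receivers: list[str] = []
    for child in route.routes:
        found = resolve(child, labels)
        receivers.extend(found)
        if found and not child.continue_:
            break  # first matching child wins unless it sets continue
    # If no child route matched, the current node handles the alert itself.
    return receivers or [route.receiver]


# The tree from this section, transcribed into the model.
TREE = Route(
    receiver="default",
    routes=[
        Route(
            receiver="pagerduty-backend",
            matchers={"severity": "critical"},
            continue_=True,
            routes=[
                Route("pagerduty-backend", {"severity": "critical", "team": "backend"}),
                Route("pagerduty-platform", {"severity": "critical", "team": "platform"}),
            ],
        ),
        Route("slack-warnings", {"severity": "warning"}),
    ],
)

if __name__ == "__main__":
    print(resolve(TREE, {"severity": "critical", "team": "platform"}))
    print(resolve(TREE, {"severity": "critical", "team": "data"}))
    print(resolve(TREE, {"severity": "info", "team": "backend"}))
```

In this model a critical alert from a team without its own child route falls back to the parent's receiver, pagerduty-backend, and an alert with no matching severity lands on the default Slack channel. Both cases are worth deciding on purpose when adding teams.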
Secrets (SLACK_WEBHOOK, PAGERDUTY_*) are injected with the External Secrets covered in Chapter 23 DB integration. If you set up a myshop/${env}/alerting/* pattern in Secrets Manager, ESO automatically unpacks it into a K8s Secret.
Loki — adding the log stack #
Securing logs together with metrics is standard. We apply the model seen in Chapter 19 Observability §“Loki — the lightweight log stack” directly.
helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki-stack \
  -n monitoring \
  --set promtail.enabled=true \
  --set loki.persistence.enabled=true \
  --set loki.persistence.storageClassName=gp3 \
  --set loki.persistence.size=100Gi

After installation, a Loki data source is added to Grafana automatically, and LogQL queries become possible in Explore.

{namespace="myshop", app="myshop-api"} |= "ERROR"
sum(rate({namespace="myshop", app="myshop-api"} |= "ERROR" [5m]))

To put long-term retention in S3, set Loki’s storage backend to S3; this is a standard operational setup. In Chapter 28 Cost optimization we point out how Loki’s S3 backend changes the cost profile compared to EBS.
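The second LogQL query above is a per-second rate of matching log lines over a 5-minute window. The sketch below reproduces that arithmetic on an in-memory list, as a toy model rather than Loki's implementation: it counts entries containing "ERROR" inside the window and divides by the window length. `error_log_rate` and the sample data are hypothetical.

```python
# Toy model of sum(rate({...} |= "ERROR" [5m])): count log lines containing
# "ERROR" inside the lookback window and divide by the window length.

def error_log_rate(entries: list[tuple[float, str]],
                   now: float,
                   window_s: float = 300.0) -> float:
    """entries are (unix_timestamp, line) pairs; returns matching lines per second."""
    matching = sum(
        1 for ts, line in entries
        if now - window_s < ts <= now and "ERROR" in line
    )
    return matching / window_s


if __name__ == "__main__":
    logs = [
        (-50.0, "ERROR too old to count"),
        (100.0, "ERROR payment gateway timeout"),
        (200.0, "INFO order created"),
        (250.0, "ERROR db connection reset"),
    ]
    print(error_log_rate(logs, now=300.0))
```

Because the result is a rate per second, a handful of errors in five minutes shows up as a small fraction; multiply by 60 for errors per minute when reading it on a dashboard.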
CloudWatch Container Insights — the second axis #
If you install CloudWatch Container Insights together on the same EKS cluster, you can immediately check cluster · node · Pod · container metrics in the AWS console. If the operations team is comfortable with the AWS console, the burden of daily checks decreases.
helm repo add aws-observability https://aws-observability.github.io/helm-charts
helm install amazon-cloudwatch-observability \
  aws-observability/amazon-cloudwatch-observability \
  -n amazon-cloudwatch --create-namespace \
  --set clusterName=myshop-prod \
  --set region=ap-northeast-2

This chart brings up Fluent Bit as a DaemonSet to send each node’s stdout / stderr to CloudWatch Logs, and collects metrics too with the CloudWatch Agent. It’s the shape where the DaemonSet pattern of Chapter 8 StatefulSet · DaemonSet · Job becomes a full log-collection agent.
The DaemonSet’s ServiceAccount receives CloudWatch’s PutLogEvents, PutMetricData permissions via IRSA — the same direction as the EBS CSI IRSA pattern of Chapter 21.
The two exits of Fluent Bit #
It’s standard for Fluent Bit to read the node’s /var/log/containers/ and route it to the following two places.
container logs
  |
  |-> Loki (in-cluster, short-term search)
  `-> CloudWatch Logs (S3 export, long-term retention)

The reason for sending the same logs to two places is that their responsibilities differ: Loki is for daily debugging, CloudWatch for compliance · audit · long-term analysis. On the cost side, using only Loki is lighter, but in regulated environments CloudWatch usually comes along too. CloudWatch’s log retention policy follows from the audit procedure of Chapter 26 Operations checklist.
Grafana dashboard standards #
We organize the standard dashboard set that goes into a production cluster’s Grafana.
| Dashboard | Source |
|---|---|
| Kubernetes / Compute Resources / Cluster | kube-prometheus-stack default (ID 7249) |
| Kubernetes / Compute Resources / Namespace (Workloads) | default (ID 7250) |
| Kubernetes / Compute Resources / Pod | default (ID 7251) |
| Kubernetes / Networking / Cluster | default (ID 7253) |
| Node Exporter / Nodes | default (ID 1860) |
| myshop-api operational dashboard | self-written — golden signals + business metrics |
The 5 default dashboards are registered automatically by kube-prometheus-stack, so they show up right away with no separate work. If you make just one self-written dashboard fitted to your domain, the field of view for daily checks is nearly complete.
The standard panel set of the self-written dashboard #
Row 1: Latency P50 / P95 / P99
Row 2: Request rate (per domain, per status)
Row 3: Error rate (4xx / 5xx)
Row 4: Pod CPU / memory usage
Row 5: HPA current replicas
Row 6: PgBouncer active connections / wait queue
Row 7: Business metrics (orders/min, checkout success rate)
Row 8: Top ERROR logs (Loki)
Row 9: Recent deployments (annotations)

The “recent deployment annotations” of the last panel are ArgoCD or GitHub Actions events injected into Grafana as annotations. The deployment time is shown as a vertical line over the metric graph, so you can see at a glance “which deployment did this latency spike happen right after.” It’s the point where the CI / CD of Chapter 24 and this chapter’s observability meet on one screen.
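The P50 / P95 / P99 values in Row 1, like the P95 alert earlier, are estimates derived from histogram buckets rather than exact measurements. The sketch below follows the documented behavior of PromQL's `histogram_quantile`: find the bucket that contains the target rank and interpolate linearly inside it, treating the lowest bucket as starting at 0. It is a simplified model with made-up bucket counts, not Prometheus's code.

```python
# Simplified model of histogram_quantile over cumulative (le, count) buckets.
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """buckets: (upper_bound_le, cumulative_count), ascending, ending with +Inf."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


if __name__ == "__main__":
    # Hypothetical request-duration buckets in seconds.
    buckets = [(0.1, 50.0), (0.5, 90.0), (1.0, 98.0), (math.inf, 100.0)]
    for q in (0.5, 0.95, 0.99):
        print(q, round(histogram_quantile(q, buckets), 4))
```

The accuracy of the estimate depends on bucket boundaries: with these assumed buckets, anything slower than 1s is reported as exactly 1.0, so a P99 panel can flatten at the top bucket bound unless the histogram has buckets above the latencies you care about.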
The on-call flow — with the runbook #
The alert firing itself is not the end. It has to be clear where the person who received it should look within 5 minutes. The standard operational flow.
1. Check the alert body in PagerDuty (alertname, team, severity)
2. Click the runbook_url in the annotation
3. Follow the Runbook's "first checks" section — it presents related Grafana dashboards / log queries / kubectl commands
4. First response (scale up, restart, block traffic, etc.)
5. Share status in the Slack incident channel

It’s standard to manage the Runbook as markdown in a separate git repo. The alert rule’s runbook_url points to a page in that repo, and when you add a new alert the Runbook comes in together as a PR. It’s the shape where the git single-source model of Chapter 20 GitOps extends from alert definitions to operational-procedure documents. Chapter 27 kubectl debugging patterns comes to play part of this Runbook’s role at the book level.
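Since a new alert and its Runbook are meant to arrive in the same PR, a small lint step can enforce the pairing. The sketch below checks rule groups, represented as plain dicts shaped like a parsed PrometheusRule `spec.groups`, for a `severity` label and a `runbook_url` annotation. It is one possible check with a hypothetical function name; loading the YAML and wiring it into CI are left out.

```python
# Lint sketch: every alerting rule should carry severity and runbook_url.
# Rules are plain dicts here, shaped like a parsed PrometheusRule spec.

def missing_fields(rule_groups: list[dict]) -> list[str]:
    """Return 'alertname: what is missing' for each incomplete alert rule."""
    problems = []
    for group in rule_groups:
        for rule in group.get("rules", []):
            name = rule.get("alert")
            if name is None:
                continue  # recording rules have no alert name
            if "severity" not in rule.get("labels", {}):
                problems.append(f"{name}: missing severity label")
            if "runbook_url" not in rule.get("annotations", {}):
                problems.append(f"{name}: missing runbook_url annotation")
    return problems


if __name__ == "__main__":
    groups = [{
        "name": "myshop-api.golden-signals",
        "rules": [
            {"alert": "MyshopApiHighErrorRate",
             "labels": {"severity": "critical", "team": "backend"},
             "annotations": {"runbook_url": "https://runbooks.myshop.example.com/myshop-api-5xx"}},
            {"alert": "MyshopApiHighLatencyP95",
             "labels": {"severity": "warning", "team": "backend"},
             "annotations": {"summary": "myshop-api P95 latency > 1s"}},
        ],
    }]
    for problem in missing_fields(groups):
        print(problem)
```

Run against the golden-signals rules of this chapter, a check like this would flag the three warning rules, since only MyshopApiHighErrorRate sets `runbook_url`.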
Checks after the first operational cycle #
These are the items to check at the point when you’ve installed the stack and run it for a few days.
# Cardinality: which metrics and labels hold the most series (pass the TSDB directory, not its wal subdirectory)
kubectl exec -n monitoring prometheus-prometheus-kube-prometheus-prometheus-0 -c prometheus -- \
  promtool tsdb analyze /prometheus | head -50

# Alert noise: which alerts are currently firing
kubectl exec -n monitoring alertmanager-prometheus-kube-prometheus-alertmanager-0 -c alertmanager -- \
  amtool alert query --alertmanager.url=http://localhost:9093

# Log disk pressure: how full the Loki volume is
kubectl exec -n monitoring loki-0 -- df -h /data

After a month of operations, cardinality explosion, a drop in alert SNR (signal-to-noise ratio), and log disk pressure usually each show up once. It’s standard to make this a regular check, and the monthly-check section of Chapter 26 Operations checklist includes these three commands directly.
The trap of cardinality explosion #
The most common operational incident is high-cardinality values (user ID, request ID, UUID, etc.) going into metric labels. The time-series count explodes and Prometheus dies of OOM, or the AMP cost suddenly becomes dozens of times higher. The principle is to put only finite enum values (status code, handler name, environment) in labels, and detailed identifiers fall out to logs or traces.
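The arithmetic behind the explosion is simple: the number of possible time series for one metric is bounded by the product of the distinct values of each label. The sketch below shows that bound for a bounded label set and for the same set with a user ID added. The label counts are hypothetical, and `max_series` is an illustrative helper.

```python
# Upper bound on time series for one metric: the product of the number of
# distinct values each label can take. All counts below are assumptions.
from math import prod


def max_series(label_cardinalities: dict[str, int]) -> int:
    """Product of distinct values per label."""
    return prod(label_cardinalities.values())


# Hypothetical label sets for http_requests_total on one Pod.
bounded = {"handler": 20, "method": 4, "status": 5}
with_user_id = {**bounded, "user_id": 100_000}

if __name__ == "__main__":
    print(max_series(bounded))       # 400
    print(max_series(with_user_id))  # 40000000
```

Real series counts are usually lower than the bound because not every combination occurs, but a single unbounded label still multiplies whatever does occur, and the total is multiplied again by the number of Pods exposing the metric.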
Exercises #
- Install this chapter’s kube-prometheus-stack on the dev EKS cluster and apply a bundle of ServiceMonitor + PrometheusRule to myshop-api. Deliberately generate 5xx for a while and observe the `MyshopApiHighErrorRate` alert firing once the `for: 5m` duration has passed. Watch the alert’s state move from `pending` to `firing` on the Alerts page of the Prometheus UI (Alertmanager only receives alerts that are already firing), confirm that it then appears in the Alertmanager UI, and compare in one paragraph how the Argo Rollouts AnalysisTemplate of Chapter 24 uses the same PromQL for analysis.
- Redesign Alertmanager’s routing tree to fit your own organization’s team composition. Draw the matrix of severity (critical / warning / info) × team (backend / platform / data) as a single table, and fill in which channel (PagerDuty / Slack channel name) goes into each cell. Write two or three alert storms that `inhibit_rules` would prevent, using your own scenarios.
- Deliberately write and apply a poorly designed rule that puts a user ID in a metric label, then observe the change in the time-series count with the cardinality check command. Describe in one paragraph how Prometheus’s memory usage changes and how AMP’s remote write cost can explode, from the point of view of Chapter 28 Cost optimization.
In one line: a production cluster’s observability usually spans two layers: in-cluster Prometheus and managed CloudWatch. kube-prometheus-stack bundles Prometheus · Grafana · Alertmanager · kube-state-metrics · node-exporter in one command. With ServiceMonitor + PrometheusRule you write myshop-api’s 4 golden signals alerts, and Alertmanager routes PagerDuty / Slack by severity and team labels. Loki handles logs for daily debugging, and CloudWatch Logs handles long-term retention and audit. Observability is valuable because the same metric can drive both a human alert and a code-driven promote decision, and linking alerts to response procedures with `runbook_url` is standard on-call practice. High label cardinality is one of the most common operational failure modes.
Next chapter #
At this point myshop-api is in a state where the flow from code to deployment · operations · observation is all automated. In the next chapter, also the last chapter of Part 4, we cover the regular operations cycle of running this cluster safely on a monthly, quarterly, and yearly basis.
In Chapter 26 Operations checklist we organize the standard procedures of EKS upgrade, RDS backup · recovery, cost check, and security check, and wrap up with a retrospective on the 6 chapters of Part 4 (EKS in Production). After that we move on to the full operational scope of Part 5 (Operations · Debugging · Cost).