Chapter 31

Deploying a Fullstack App on EKS

The Part 6 capstone, and the book's final chapter. It deploys the React Next.js (App Router + RSC + Server Actions) app and the Modern Python FastAPI (SQLAlchemy 2.x + Pydantic v2) app together on one EKS cluster under the same TODO domain. Across 13 PRs, it walks through cluster setup with Terraform + Karpenter + IRSA + ALB Controller + ExternalDNS + cert-manager, DB integration with RDS + External Secrets + RDS IAM auth, per-environment deployment with Helm + ArgoCD ApplicationSet, observability with Prometheus + Grafana + Loki + OpenTelemetry, autoscaling with HPA + Karpenter, k6 load testing + OpenCost cost estimation, and the operations cycles of Chapters 26 and 30. This capstone shows how the tools from Chapters 1 ~ 30 fit together inside one system.

This is the book’s last chapter. The Part 6 capstone is a comprehensive exercise that shows how all the tools from Chapters 1 ~ 30 fit together inside one system. Rather than an imaginary company, it uses the outputs of two other books in this series as its input — the Next.js TODO app from Part 6 of React and the FastAPI TODO backend from Part 4 of Modern Python, both running under the same domain. In this chapter we deploy them together on one EKS cluster and revisit the full Kubernetes track inside one system.

The goals of this chapter are as follows.

  • Next.js is up at https://todo.example.com and FastAPI at https://api.todo.example.com
  • RDS PostgreSQL is combined with backups · Multi-AZ · External Secrets
  • a GitHub push flows through ECR into an ArgoCD ApplicationSet auto-sync
  • the Prometheus + Grafana + Loki + OpenTelemetry observability stack observes both apps through the same pipeline
  • HPA + Karpenter respond automatically to traffic fluctuations
  • the monthly operational-cost hypothesis (about $248, itemized in PR #12) is verified with OpenCost

It proceeds in 13 PRs. Each PR becomes the input for the next, and the change volume stays deliberately small so every step remains reviewable.

The target architecture #

the todo system in one picture
[Browser]
   |
   | HTTPS (Route 53 + ACM)
   v
[ALB] -- AWS Load Balancer Controller
   |
   |-- todo.example.com     -> [Next.js Pod x N] (SSR + RSC + Server Actions)
   `-- api.todo.example.com -> [FastAPI Pod x M] (REST + Pydantic v2)
                                    |
                                    | PgBouncer
                                    v
                                 [RDS PostgreSQL] (Multi-AZ)
                                    ^
                                    |
                                 [External Secrets] <- [AWS Secrets Manager]
                                    ^
                                    | IRSA
                                 [ServiceAccount]

This picture is the final form reached by the 13 PRs in this chapter. Each arrow in the picture is a problem solved in one or more chapters of this book — this chapter is where those pieces are bound into one system.

PR #1 — Domain and architecture decision #

The first PR is a single ADR (Architecture Decision Record) with no code.

docs/adr/0001-eks-architecture.md
# ADR-0001: The K8s deployment architecture of the fullstack todo system

## Context
The todo system of Next.js (App Router + RSC) + FastAPI + PostgreSQL
must be deployed to a production environment.

## Options
1. ECS Fargate (managed containers)
2. EKS (Kubernetes)
3. Lambda + RDS (serverless)

## Decision
Adopt EKS.

## Rationale
- the two apps (Next.js + FastAPI) have different lifecycles and need isolation
- the autoscaling model of HPA · Karpenter fits the traffic pattern
- GitOps (ArgoCD) matches the operational standard model
- comprehensive validation of the tools from Chapters 1 ~ 30 of this book

## Consequences
- a cost hypothesis of about $248 a month (inside the $200 ~ $300 guide), to be verified with the Chapter 28 model
- application of the regular operations calendar cycle (Chapter 26)
- comparison with the ECS Fargate chapter of the AWS book

Because the same capstone in the AWS book takes the ECS Fargate route, comparing the two books makes the operational difference between “Kubernetes vs managed containers” clear. This chapter picks up the operational cycle from the point where Kubernetes has been chosen.

PR #2 — A fresh EKS cluster setup #

The Terraform configuration from Chapter 21, EKS Cluster Setup is the input. This capstone makes one deliberate change: Karpenter is introduced from the start.

terraform/main.tf
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  # ... as in Chapter 21
}

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.0"

  cluster_name    = "todo-${var.env}"
  cluster_version = "1.32"
  enable_irsa     = true

  cluster_addons = {
    coredns            = { most_recent = true }
    kube-proxy         = { most_recent = true }
    vpc-cni            = { most_recent = true }
    aws-ebs-csi-driver = {
      most_recent              = true
      service_account_role_arn = module.ebs_csi_irsa.iam_role_arn
    }
  }

  # keep only the minimum number of ON_DEMAND nodes; Karpenter handles the rest on demand
  eks_managed_node_groups = {
    system = {
      desired_size   = 2
      min_size       = 2
      max_size       = 3
      instance_types = ["t3.medium"]
      capacity_type  = "ON_DEMAND"
      labels         = { role = "system" }
      taints = [{
        key    = "system"
        value  = "true"
        effect = "NO_SCHEDULE"
      }]
    }
  }
}

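module "karpenter" {
  source  = "terraform-aws-modules/eks/aws//modules/karpenter"
  version = "~> 20.0" # keep in step with the eks module above

  cluster_name           = module.eks.cluster_name
  enable_irsa            = true # in v20 of this module IRSA is opt-in; without it the OIDC ARN below is unused
  irsa_oidc_provider_arn = module.eks.oidc_provider_arn
}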

The system node group hosts only system components such as Karpenter, CoreDNS, and the monitoring stack. Because the group carries the system=true:NoSchedule taint, each of those components needs a matching toleration; application workloads (Next.js / FastAPI) have none and therefore land on the nodes Karpenter brings up. That pattern combines the Karpenter model from Chapter 13, Autoscaling with Chapter 28, Cost Optimization §“Karpenter — the decision tree against Cluster Autoscaler.”
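A toleration of the following shape lets a system component schedule onto the tainted group. This is a minimal sketch of a Pod spec fragment, assuming the role=system label and the system=true taint from the Terraform above; where it is set (for example through a Helm chart's nodeSelector and tolerations values) depends on the component and is not shown in this chapter.

```yaml
# Pod spec fragment for a system component (illustrative sketch)
nodeSelector:
  role: system          # the node group label from terraform/main.tf
tolerations:
  - key: system
    operator: Equal
    value: "true"
    effect: NoSchedule  # the Kubernetes spelling of Terraform's NO_SCHEDULE
```

Without such a toleration a component stays Pending on a fresh cluster, because the only nodes that exist before Karpenter starts are the tainted ones.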

kustomize/karpenter/default-nodepool.yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["t", "m", "c"]
        - key: karpenter.k8s.aws/instance-cpu
          operator: In
          values: ["2", "4", "8"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
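    budgets:
      - nodes: "10%"
        duration: 8h                  # the budget stays active for 8 hours after each scheduled start
        schedule: "0 9 * * mon-fri"   # cron expression, evaluated in UTC

disruption.budgets is the blast-radius control from Chapter 30, Upgrade Strategy: during the weekday window that opens at 09:00 UTC, it lets Karpenter disrupt at most 10 % of nodes at a time.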
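The NodePool's nodeClassRef points at an EC2NodeClass named default that this section does not show. A minimal sketch of what it might look like follows; the IAM role name and the karpenter.sh/discovery tag values are hypothetical and must match whatever the Terraform in your environment actually creates.

```yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default                          # the name referenced by nodeClassRef
spec:
  amiSelectorTerms:
    - alias: al2023@latest               # Amazon Linux 2023; pin a version for prod
  role: KarpenterNodeRole-todo-prod      # hypothetical node IAM role name
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: todo-prod   # hypothetical tag on the private subnets
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: todo-prod   # hypothetical tag on the node security group
```

Since the NodePool allows both amd64 and arm64, the application images would also need to be published for both architectures.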

A bundle of auxiliary components #

ALB Controller + ExternalDNS + cert-manager
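# serviceAccount.create=false assumes the IRSA-annotated ServiceAccount already exists
helm install aws-load-balancer-controller eks/aws-load-balancer-controller \
  -n kube-system --set clusterName=todo-prod \
  --set serviceAccount.create=false \
  --set serviceAccount.name=aws-load-balancer-controller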

helm install external-dns external-dns/external-dns \
  -n external-dns --create-namespace \
  --set provider=aws --set "domainFilters[0]=todo.example.com"

helm install cert-manager jetstack/cert-manager \
  -n cert-manager --create-namespace --set installCRDs=true

This is exactly the setup described in Chapter 22, The App Deployment Skeleton §“cert-manager and external-dns.”

PR #3 — The Namespace / RBAC / NetworkPolicy skeleton #

Before bringing up workloads, we establish the isolation skeleton.

kustomize/namespaces/todo.yaml
---
apiVersion: v1
kind: Namespace
metadata:
  name: todo-frontend
  labels:
    team: web
    env: prod
    role: frontend
---
apiVersion: v1
kind: Namespace
metadata:
  name: todo-backend
  labels:
    team: backend
    env: prod
    role: backend
---
apiVersion: v1
kind: Namespace
metadata:
  name: todo-data
  labels:
    team: backend
    env: prod
    role: data

The split into three namespaces — frontend / backend / data — is the isolation unit of this capstone. The standard labels (team / env / role) from Chapter 7, Namespace and Labels are used as the grouping key in Chapter 25, Monitoring · Alerts and the cost-allocation key in Chapter 28.

NetworkPolicy — full isolation #

netpol — only frontend can call the backend api
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: todo-backend-ingress
  namespace: todo-backend
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: todo-api
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              role: frontend
        - namespaceSelector:
            matchLabels:
              role: backend   # allow other workloads of the same backend too
      ports:
        - port: 8000

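The NetworkPolicy model of Chapter 14, RBAC / NetworkPolicy / ResourceQuota carries over into real isolation: only Pods in the frontend and backend namespaces can reach todo-api on port 8000. Two caveats apply. The policy is enforced only when the CNI supports it (on EKS, the VPC CNI's network policy feature must be enabled), and because the ALB uses target-type: ip, requests for api.todo.example.com arrive from the ALB's subnet addresses rather than from a namespace, so they need their own ipBlock rule. Forcing frontend to go through backend instead of reaching RDS directly is a separate control, since RDS sits outside the cluster: it has to come from an egress policy on todo-frontend or from the RDS security group, not from this ingress rule.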
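With target-type: ip, the ALB sends both requests and health checks straight to the Pod IP from addresses in its own subnets, so a namespaceSelector cannot match that traffic. A hedged sketch of an additional policy that admits it is shown here; the CIDR is a placeholder, not a value from this chapter.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: todo-backend-from-alb     # hypothetical name
  namespace: todo-backend
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: todo-api
  policyTypes:
    - Ingress
  ingress:
    - from:
        - ipBlock:
            cidr: 10.0.0.0/16     # placeholder: narrow this to the ALB subnets' CIDRs
      ports:
        - port: 8000
```

NetworkPolicies are additive, so this rule widens what todo-backend-ingress already allows without replacing it.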

ResourceQuota — a per-team limit #

quota — the backend namespace's limit
apiVersion: v1
kind: ResourceQuota
metadata:
  name: todo-backend-quota
  namespace: todo-backend
spec:
  hard:
    requests.cpu: "10"
    requests.memory: "20Gi"
    limits.cpu: "20"
    limits.memory: "40Gi"
    persistentvolumeclaims: "5"

The ResourceQuota of Chapter 14 is the first line of defense for cost isolation in a multi-team environment.
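Once a quota constrains requests.cpu, requests.memory, limits.cpu, and limits.memory, the API server rejects any Pod in the namespace that leaves one of them unset. A LimitRange that supplies defaults is a common companion; the sketch below is an assumption rather than part of the chapter's repository, and its values simply mirror the todo-api container's own requests and limits.

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: todo-backend-defaults     # hypothetical name
  namespace: todo-backend
spec:
  limits:
    - type: Container
      defaultRequest:             # applied when a container sets no requests
        cpu: 100m
        memory: 128Mi
      default:                    # applied when a container sets no limits
        cpu: 500m
        memory: 256Mi
```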

PR #4 — PostgreSQL RDS + External Secrets #

The Terraform manifest of Chapter 23, DB Integration carries over almost unchanged. The difference is that we keep Aurora Serverless v2 as an option in dev.

terraform/modules/todo-rds/main.tf
module "rds" {
  source  = "terraform-aws-modules/rds/aws"
  version = "~> 6.0"

  identifier = "todo-${var.env}"

  engine               = "postgres"
  engine_version       = "16.3"
  major_engine_version = "16"
  instance_class       = var.env == "prod" ? "db.t4g.small" : "db.t4g.micro"

  allocated_storage             = 20
  manage_master_user_password   = true
  multi_az                      = var.env == "prod"
  backup_retention_period       = var.env == "prod" ? 30 : 7
  performance_insights_enabled  = true
  deletion_protection           = var.env == "prod"
}

To keep the cost hypothesis small, we set the instance class to db.t4g.small — a smaller option than Chapter 23’s db.m6g.large. The todo domain’s load is small, so it’s plenty.

ExternalSecret — todo-api's DB credentials
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: todo-api-db
  namespace: todo-backend
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: ClusterSecretStore
  target:
    name: todo-api-db
    template:
      data:
        DATABASE_URL: "postgresql://{{ .username }}:{{ .password }}@pgbouncer.todo-backend.svc:5432/todo?sslmode=disable"
  data:
    - secretKey: username
      remoteRef:
        key: rds!cluster-todo-prod
        property: username
    - secretKey: password
      remoteRef:
        key: rds!cluster-todo-prod
        property: password

It’s the manifest of Chapter 23 unchanged. The RDS IAM auth of Chapter 29, Secret Operations §“Zero passwords” stays optional in this capstone: todo’s traffic is small, so the PgBouncer + password model is enough.
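The ExternalSecret refers to a ClusterSecretStore named aws-secrets-manager that is defined elsewhere. A minimal sketch of such a store is given below, assuming the External Secrets Operator runs in an external-secrets namespace under an IRSA-annotated ServiceAccount of the same name (both names are hypothetical) and that the secrets live in the same region as the ECR registry used in this chapter.

```yaml
apiVersion: external-secrets.io/v1
kind: ClusterSecretStore
metadata:
  name: aws-secrets-manager
spec:
  provider:
    aws:
      service: SecretsManager
      region: ap-northeast-2            # assumed from the ECR image URLs
      auth:
        jwt:
          serviceAccountRef:
            name: external-secrets      # hypothetical ServiceAccount with an IRSA role
            namespace: external-secrets # hypothetical namespace
```

The IAM role behind that ServiceAccount would need secretsmanager:GetSecretValue on the RDS-managed secret.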

PR #5 — Deploying the FastAPI backend #

The FastAPI todo backend from the Part 4 capstone of Modern Python is the input. (modern-python keeps the “modern” prefix to preserve the meaning of distinguishing it from the older Python course.) Containerization is outside the scope of this book, but we point at the core of the Dockerfile.

Dockerfile — multi-stage
FROM python:3.13-slim AS builder
WORKDIR /app
COPY pyproject.toml uv.lock ./
# src/ is not copied into this stage, so install the dependencies only
RUN pip install uv && uv sync --frozen --no-dev --no-install-project

FROM python:3.13-slim AS runtime
WORKDIR /app
COPY --from=builder /app/.venv /app/.venv
COPY src/ src/
ENV PATH="/app/.venv/bin:$PATH"
EXPOSE 8000
CMD ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8000"]

Deployment #

charts/todo-api/templates/deployment.yaml — a portion
apiVersion: apps/v1
kind: Deployment
metadata:
  name: todo-api
  namespace: todo-backend
  labels:
    app.kubernetes.io/name: todo-api
spec:
  replicas: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: todo-api
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app.kubernetes.io/name: todo-api
    spec:
      serviceAccountName: todo-api
      containers:
        - name: api
          image: 123456789012.dkr.ecr.ap-northeast-2.amazonaws.com/todo-api:1.0.0
          ports:
            - containerPort: 8000
              name: http
          envFrom:
            - secretRef:
                name: todo-api-db
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 256Mi
          readinessProbe:
            httpGet:
              path: /health/ready
              port: http
            initialDelaySeconds: 5
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /health/live
              port: http
            initialDelaySeconds: 30
            periodSeconds: 10
          lifecycle:
            preStop:
              exec:
                command: ["sh", "-c", "sleep 10"]
      terminationGracePeriodSeconds: 60

It is the standard manifest from Chapter 22, The App Deployment Skeleton combined with the graceful shutdown pattern (preStop + terminationGracePeriodSeconds) from Chapter 30.
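Other manifests in this chapter address todo-api on Service port 80 while the container listens on 8000, and the rolling update relies on a disruption budget to keep at least one replica serving. The sketch below shows one way the Service and PodDisruptionBudget could look; the minAvailable value is an assumption, not a figure from the book.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: todo-api
  namespace: todo-backend
  labels:
    app.kubernetes.io/name: todo-api
spec:
  type: ClusterIP
  selector:
    app.kubernetes.io/name: todo-api
  ports:
    - name: http
      port: 80            # the port other manifests call
      targetPort: http    # the named container port (8000)
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: todo-api
  namespace: todo-backend
spec:
  minAvailable: 1         # assumed value; with 2 replicas it allows one voluntary eviction
  selector:
    matchLabels:
      app.kubernetes.io/name: todo-api
```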

ServiceAccount + IRSA #

todo-api's ServiceAccount
apiVersion: v1
kind: ServiceAccount
metadata:
  name: todo-api
  namespace: todo-backend
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/todo-prod-api
automountServiceAccountToken: false   # the security pattern from Chapter 29

The IRSA pattern from Chapter 16, RBAC / ServiceAccount in Depth and the security pattern from Chapter 29, Secret Operations §“automountServiceAccountToken: false” are combined in one manifest. IRSA keeps working with automount disabled, because the EKS Pod identity webhook mounts its own projected token for AWS credentials.

PR #6 — Deploying the Next.js front #

The Next.js TODO app from the Part 6 capstone of React is the input. The App Router + RSC + Server Actions model behaves inside K8s as follows.

Next.js (App Router) inside K8s
[Browser]
   |
   | HTTPS
   v
[ALB]
   |
   v
[Next.js Pod]  -- Node.js server (next start)
   |
   | fetch on RSC rendering
   v
[todo-api Service]  -- ClusterIP, points at FastAPI
   |
   v
[todo-api Pod]

Server Actions run inside the Next.js Pod. When they need the backend, they call the todo-api Service inside the same cluster through TODO_API_URL, without going back out through the ALB.

charts/todo-web/templates/deployment.yaml — a portion
apiVersion: apps/v1
kind: Deployment
metadata:
  name: todo-web
  namespace: todo-frontend
spec:
  replicas: 2
  template:
    spec:
      serviceAccountName: todo-web
      containers:
        - name: web
          image: 123456789012.dkr.ecr.ap-northeast-2.amazonaws.com/todo-web:1.0.0
          ports:
            - containerPort: 3000
              name: http
          env:
            - name: TODO_API_URL
              value: "http://todo-api.todo-backend.svc.cluster.local:80"
            - name: NODE_ENV
              value: "production"
          resources:
            requests:
              cpu: 200m
              memory: 256Mi   # the memory hypothesis of SSR + RSC
            limits:
              cpu: 1
              memory: 512Mi

The memory hypothesis of the Next.js Pod is set by the sizing model of Chapter 11, Resource Requests and Limits — since SSR + RSC’s per-request memory footprint accumulates to a degree, setting requests to 256 Mi is a conservative starting point. We converge on the appropriate value a month later with the VPA recommendation of Chapter 28, Cost Optimization.
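Collecting that recommendation without letting VPA restart Pods can be done in recommendation-only mode. A minimal sketch follows, assuming the Vertical Pod Autoscaler components are already installed in the cluster (the chapter does not show that step).

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: todo-web
  namespace: todo-frontend
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: todo-web
  updatePolicy:
    updateMode: "Off"   # only record recommendations; never evict or resize Pods
```

The recommended values then appear in the object's status, for example via kubectl describe vpa todo-web -n todo-frontend, and can be copied into the chart's values by hand.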

PR #7 — Ingress + ALB #

ingress — one ALB for two hosts
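# An Ingress backend must be a Service in the Ingress's own namespace, so each
# app gets its own Ingress; the shared group.name merges both onto one ALB.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: todo-web
  namespace: todo-frontend
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/group.name: todo
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    # ssl-redirect needs an HTTP listener to redirect from
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTP":80},{"HTTPS":443}]'
    alb.ingress.kubernetes.io/ssl-redirect: '443'
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:...
    external-dns.alpha.kubernetes.io/hostname: todo.example.com
spec:
  rules:
    - host: todo.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: todo-web
                port:
                  number: 80
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: todo-api
  namespace: todo-backend
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/group.name: todo
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTP":80},{"HTTPS":443}]'
    alb.ingress.kubernetes.io/ssl-redirect: '443'
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:...
    external-dns.alpha.kubernetes.io/hostname: api.todo.example.com
spec:
  rules:
    - host: api.todo.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: todo-api
                port:
                  number: 80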

alb.ingress.kubernetes.io/group.name: todo is decisive: every Ingress that carries the same group name is merged onto one ALB, so the two hosts share a single load balancer. The cost-savings pattern pointed at in Chapter 28, Cost Optimization §“The ALB’s LCU” applies directly in this section.

external-dns auto-registers the records of the two hosts in Route 53, and one ACM certificate is enough as long as it covers both todo.example.com and the wildcard *.todo.example.com (the wildcard alone does not match the apex name). It’s the shape of the Ingress manifest of Chapter 22 extended into a multi-host pattern.

PR #8 — Binding with Helm charts #

We bind the manifests written so far into two Helm charts.

the charts/ directory
charts/
├── todo-web/
│   ├── Chart.yaml
│   ├── values.yaml
│   ├── values-dev.yaml
│   ├── values-prod.yaml
│   └── templates/
│       ├── deployment.yaml
│       ├── service.yaml
│       ├── hpa.yaml
│       └── pdb.yaml
├── todo-api/
│   ├── Chart.yaml
│   ├── values.yaml
│   ├── values-dev.yaml
│   ├── values-prod.yaml
│   └── templates/
│       ├── deployment.yaml
│       ├── service.yaml
│       ├── serviceaccount.yaml
│       ├── externalsecret.yaml
│       ├── hpa.yaml
│       ├── pdb.yaml
│       └── servicemonitor.yaml
└── todo-infra/
    ├── Chart.yaml
    └── templates/
        ├── namespaces.yaml
        ├── networkpolicy.yaml
        ├── resourcequota.yaml
        └── ingress.yaml

The split of three charts is the key.

  • todo-infra — namespaces · NetworkPolicy · ResourceQuota · Ingress. The infra the two apps share.
  • todo-api — all of backend’s manifests.
  • todo-web — all of frontend’s manifests.

It’s the real application of how the pattern of Chapter 22, The App Deployment Skeleton §“Binding with Helm charts” splits in a multi-app environment. There’s also the option of binding via Chart.yaml’s dependencies, but for simplicity this capstone keeps a flat structure and integrates with ArgoCD ApplicationSet.

PR #9 — GitOps: ArgoCD ApplicationSet #

The model of Chapter 20, GitOps + Chapter 24, The CI / CD Pipeline is organized into one ApplicationSet manifest.

argocd/applicationset.yaml
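apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: todo
  namespace: argocd
spec:
  goTemplate: true                          # enables the {{ .field }} syntax used below
  goTemplateOptions: ["missingkey=error"]   # fail on a misspelled field instead of rendering it empty
  generators:
    - matrix:
        generators:
          - list:
              elements:
                # namespace is only the fallback for resources that set none themselves
                - app: todo-infra
                  namespace: default        # assumption: the chart's resources name their own namespaces
                - app: todo-api
                  namespace: todo-backend
                - app: todo-web
                  namespace: todo-frontend
          - list:
              elements:
                - env: dev
                  cluster: https://kubernetes.default.svc
                - env: prod
                  cluster: https://kubernetes.default.svc
  template:
    metadata:
      name: '{{ .app }}-{{ .env }}'
    spec:
      project: todo
      source:
        repoURL: https://github.com/myorg/todo-manifests.git
        targetRevision: main
        path: 'charts/{{ .app }}'
        helm:
          ignoreMissingValueFiles: true     # todo-infra ships no values files
          valueFiles:
            - values.yaml
            - 'values-{{ .env }}.yaml'
      destination:
        server: '{{ .cluster }}'
        namespace: '{{ .namespace }}'
      syncPolicy:
        automated:
          prune: true
          selfHeal: true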

The matrix generator auto-generates 3 apps × 2 environments = 6 Applications from one manifest. The operational standard is for dev to be auto-sync and for prod to branch into a separate instance of the ApplicationSet in manual-sync mode, but this capstone keeps both as automated for simplicity.

The GitHub Actions OIDC + ECR push + manifest repo auto-commit cycle of Chapter 24 is the input of this manifest — one code push auto-syncs both the dev / prod environments.

PR #10 — Observability #

The kube-prometheus-stack from Chapter 19, Observability + Chapter 25, Monitoring · Alerts carries over unchanged. The difference is that we add an OpenTelemetry Collector to tie the traces of the two apps together.

otel-collector — the DaemonSet pattern
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel
  namespace: monitoring
spec:
  mode: daemonset
  config:
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    exporters:
      prometheus:
        endpoint: 0.0.0.0:8889
      otlp/tempo:
        endpoint: tempo.monitoring.svc:4317
        tls:
          insecure: true   # plain gRPC inside the cluster; remove if Tempo serves TLS
    service:
      pipelines:
        traces:
          receivers: [otlp]
          exporters: [otlp/tempo]
        metrics:
          receivers: [otlp]
          exporters: [prometheus]

When Next.js’s OpenTelemetry SDK and FastAPI’s OTel instrumentation export to the same collector and propagate the W3C trace context on each fetch, the full path of one request crossing the two apps is visible in Tempo. A single trace then shows which FastAPI handler served a fetch issued during RSC rendering on its way to RDS.
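Both apps can be pointed at the collector with the standard OpenTelemetry environment variables. The fragment below is a sketch for the todo-api container, assuming the operator exposes the collector named otel through its usual otel-collector Service in the monitoring namespace and that the app's SDK includes a gRPC OTLP exporter; todo-web would use the same variables with its own service name.

```yaml
# container env fragment (illustrative sketch)
env:
  - name: OTEL_SERVICE_NAME
    value: todo-api
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: http://otel-collector.monitoring.svc:4317   # assumed Service name: <collector name>-collector
  - name: OTEL_EXPORTER_OTLP_PROTOCOL
    value: grpc
  - name: OTEL_PROPAGATORS
    value: tracecontext,baggage                        # W3C headers, so spans from both apps join one trace
```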

ServiceMonitor + PrometheusRule #

todo-api's 4 golden signals
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: todo-api
  namespace: todo-backend
  labels:
    release: prometheus
spec:
  groups:
    - name: todo-api.golden-signals
      rules:
        - alert: TodoApiHighErrorRate
          expr: |
            sum(rate(http_requests_total{app="todo-api",status=~"5.."}[5m]))
              / sum(rate(http_requests_total{app="todo-api"}[5m])) > 0.05
          for: 5m
          labels:
            severity: critical
        # ... the latency, traffic, and saturation rules follow the same pattern

It’s the manifest of Chapter 25 unchanged, and the same rule applies to todo-web too. The alert severity routing reuses the Alertmanager manifest from Chapter 25 unchanged.
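The PrometheusRule only fires if Prometheus actually scrapes todo-api, which is the job of the ServiceMonitor named in this section's heading. A minimal sketch is shown here, assuming the todo-api Service carries the app.kubernetes.io/name label, names its port http, and that the app serves Prometheus metrics at /metrics (none of which is shown in the chapter).

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: todo-api
  namespace: todo-backend
  labels:
    release: prometheus        # same selector label as the PrometheusRule
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: todo-api
  endpoints:
    - port: http               # assumed Service port name
      path: /metrics           # assumed metrics path
      interval: 30s
```

The labels on the scraped series depend on this setup, so the app="todo-api" matcher in the alert expression should be checked against what Prometheus really records.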

PR #11 — Autoscaling #

The HPA of Chapter 13, Autoscaling and the Karpenter NodePool of Chapter 28 combine to make two stages of automatic response.

todo-api's HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: todo-api
  namespace: todo-backend
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: todo-api
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
    scaleDown:
      stabilizationWindowSeconds: 300

two stages of automatic response
traffic increase
   |
   v
HPA: in 30 seconds, todo-api Pod 2 -> 5 -> 10 -> 20
   |
   | the node runs short on resources
   v
Karpenter: in 30 seconds ~ 1 minute, provisions a new node (spot first)
   |
   v
the Pod that was Pending gets scheduled on the new node

The shape of these two stages running together is the goal of K8s autoscaling. We measure that shape with a load test in the next PR.

PR #12 — Load testing and cost estimation #

k6/script.js — the load scenario
import http from "k6/http";
import { check } from "k6";

export const options = {
  stages: [
    { duration: "2m", target: 50 },
    { duration: "5m", target: 200 },
    { duration: "2m", target: 500 },
    { duration: "5m", target: 500 },
    { duration: "2m", target: 0 },
  ],
};

export default function () {
  const res = http.get("https://todo.example.com/api/todos");
  check(res, {
    "status is 200": (r) => r.status === 200,
    "duration < 500ms": (r) => r.timings.duration < 500,
  });
}

running the k6 load
k6 run k6/script.js

What to measure.

  • HPA scale-up response time — in how many seconds Pods increase between traffic 50 → 200
  • Karpenter’s node-add time — the time a Pod stayed Pending
  • P95 latency — how latency changes at the load peak (500 VUs)
  • 5xx ratio — whether the error rate during load exceeds Chapter 25’s threshold (5 %)

Cost verification #

OpenCost — measuring cost after the load test
helm install opencost opencost/opencost \
  -n opencost --create-namespace

From OpenCost’s output we verify the monthly cost hypothesis.

Item | Estimate (monthly)
EKS control plane | $73
nodes (system t3.medium × 2 ON_DEMAND) | $60
nodes (application spot, average 1.5 units) | $20
RDS db.t4g.small Multi-AZ | $30
ALB (1 unit, LCU) | $20
NAT Gateway + data transfer | $35
ECR / Route 53 / other | $10
total | about $248

We compare each item of Chapter 28 §“A checklist for reviewing the bill” with these actual measurements. Whether prod’s target landed within this book’s standard guide ($200 ~ $300) is the validation metric.

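In a learning environment, the following adjustments lower the bill. The $40 ~ $80 a month range is reachable only if the cluster is also torn down when it is not in use, because the EKS control plane alone accounts for about $73 of a full month in the table above.

  • run RDS in a single AZ, as in dev, instead of prod’s Multi-AZ
  • keep the single shared ALB (already in place)
  • run the system node group on spot as well
  • use a single NAT Gateway instead of one per AZ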

PR #13 — Applying the operations checklist #

The last PR applies the regular calendar of Chapter 26, The Operations Checklist and the upgrade checklist of Chapter 30, Upgrade Strategy to this system.

docs/runbooks/todo-operations.md
# The todo system operations calendar

## Daily
- check the 5 panels of the todo Grafana dashboard
- review the active alerts in Alertmanager

## Weekly
- ECR Trivy scan results (both todo-api, todo-web)
- review new security patches

## Monthly
- the top 1, 2, 3 costs by team / by workload in OpenCost
- the unreflected workloads of VPA recommendation
- check the OutOfSync state of ArgoCD

## Quarterly
- EKS minor upgrade (the 13 steps of Chapter 30)
- right-sizing signals from RDS Performance Insights
- RBAC audit
- recovery drill (PITR simulation)
- kube-bench CIS checkup

## Semi-annually
- external security audit
- DR simulation (Velero restore)

## Annually
- cluster architecture review
- manifest modernization

Committing this runbook to git is the last PR of this capstone. Keeping not only the code but also the operational procedures in git as the single source of truth is the essential goal of GitOps.

Retrospective — how the 30 chapters were bound together #

We organize how this book’s chapters meshed inside one system across the 13 PRs.

Chapter of this book | Role in this capstone
Chapters 1 ~ 3 | the vision to read a manifest line
Chapter 4, Deployment | the RollingUpdate strategy of todo-api / todo-web
Chapter 5, Service | the cluster DNS connection of todo-api ↔ todo-web
Chapter 6, ConfigMap · Secret | the standard for environment-variable injection
Chapter 7, Namespace and Labels | the frontend / backend / data split
Chapter 9, PV / PVC / StorageClass | the EBS CSI Driver (no direct PV is used; RDS holds the data)
Chapter 10, Ingress | one ALB + group.name for two hosts
Chapter 11, Resource Requests and Limits | the starting points of Next.js 256 Mi · FastAPI 128 Mi
Chapter 12, Health Checks | the 3 probes + graceful shutdown
Chapter 13, Autoscaling | the two-stage automatic response of HPA + Karpenter
Chapter 14, RBAC / NetworkPolicy / Quota | namespace isolation + per-team limits
Chapter 15, CNI in Depth | VPC CNI assigns IP directly to Pods (background)
Chapter 16, IRSA | todo-api’s AWS credentials
Chapter 17, Admission Controller | Kyverno policy (optional)
Chapter 18, CRD and Operator | ESO, Karpenter, ALB Controller, OTel
Chapter 19, Observability | the trace of OpenTelemetry + Tempo
Chapter 20, GitOps | one ArgoCD ApplicationSet manifest
Chapter 21, EKS Setup | the starting point of Terraform
Chapter 22, The App Deployment Skeleton | the standard 9-bundle of todo-api / todo-web
Chapter 23, DB Integration | RDS + ESO + PgBouncer
Chapter 24, The CI / CD Pipeline | GitHub Actions OIDC → ECR → ApplicationSet
Chapter 25, Monitoring · Alerts | PrometheusRule + Alertmanager routing
Chapter 26, The Operations Checklist | daily / weekly / monthly / quarterly / semi-annually / annually
Chapter 27, kubectl Debugging | the standard 5-minute flow on an incident
Chapter 28, Cost Optimization | OpenCost + Karpenter spot + ALB sharing
Chapter 29, Secret Operations | ESO + automountServiceAccountToken
Chapter 30, Upgrade Strategy | preStop · PDB · Karpenter disruption budgets

This table is the one-line summary of this capstone — the shape of the 30 chapters each taking a role in one system is the goal of the K8s track.

Comparison with the AWS book #

The Part 6 capstone of the AWS book (forthcoming) takes up the same todo system on the ECS Fargate route. Comparative learning of the two books makes the operational difference of implementing the same domain on two platforms clearly visible.

Aspect | This book (EKS) | AWS (ECS Fargate)
starting cost | $200 ~ $300 a month | $80 ~ $150 a month
operational surface | K8s’s richness + learning curve | AWS console + fewer objects
automation tools | Karpenter, HPA, ArgoCD | Service Auto Scaling, CodePipeline
observability | Prometheus + Grafana | CloudWatch Container Insights
multi-cloud possibility | possible (K8s standard) | AWS-locked
the team’s learning cost | high | low

For a small team working in a single domain, ECS Fargate is more efficient; if you need multi-domain support, GitOps, rich workload patterns, and a multi-cloud option, EKS is suitable. This capstone’s decision (EKS) is the result of learning value plus a comprehensive validation of this book’s 30 chapters.

Cleanup — deleting the cluster #

The cost-side standard is to clean up a learning cluster immediately after the capstone ends.

the order of resource cleanup
# 1. delete the ApplicationSet; its generated Applications and their workloads go with it
kubectl delete applicationset todo -n argocd

# 2. delete the Karpenter NodePool so the nodes it launched are terminated
#    (those EC2 instances are not in the Terraform state)
kubectl delete nodepool default

# 3. confirm the automatic cleanup of ALB / Route 53
# the ALB Controller removes the ALB; external-dns deletes the hostnames' records

# 4. release RDS deletion_protection
# (for prod, deletion_protection is on, so set it to false via a terraform variable then apply)

# 5. terraform destroy
terraform destroy

# 6. delete the ECR repositories (image remnants)
aws ecr delete-repository --repository-name todo-api --force
aws ecr delete-repository --repository-name todo-web --force

This order is the standard for safe cleanup — if you don’t clean up from the Application first, Terraform gets blocked on the ALB dependency and destroy fails.

Exercises #

  1. Actually apply this capstone’s 13 PRs to your own GitHub organization, and organize the last load test’s results together with OpenCost’s cost output into one page. Map where the gap between the expected cost hypothesis (about $248) and the actual measurement arose (especially NAT data transfer · ALB LCU · spot ratio) onto Chapter 28, Cost Optimization §“A checklist for reviewing the bill.”
  2. Branch this capstone’s ApplicationSet manifest and modify it so dev’s and prod’s sync policies behave differently (dev as automated + selfHeal, prod as manual sync). Deliberately apply a broken value to dev’s manifest (e.g., a nonexistent image tag) and, in one paragraph, compare how selfHeal protects and how prod’s manual sync works as a human gate.
  3. After following the same todo system ECS Fargate capstone of the AWS book, compare the operational tradeoffs of the two implementations against your own scenario in one table. Organize into one page the decision tree of which platform is suitable at which point, tailored to your domain (traffic pattern · team size · tolerance for cloud lock-in).

In one line: The Part 6 capstone deploys modern-react’s Next.js and modern-python’s FastAPI together on the same EKS cluster across 13 PRs. It starts with Terraform + Karpenter + IRSA + ALB Controller + ExternalDNS + cert-manager, then adds the namespace split of frontend / backend / data + NetworkPolicy + ResourceQuota, the DB stack of RDS + External Secrets + PgBouncer, three Helm charts (infra + api + web), ArgoCD ApplicationSet generating 6 Applications from one manifest, OpenTelemetry tracing both apps together, HPA + Karpenter handling traffic changes, k6 + OpenCost verifying the monthly cost hypothesis of about $248, and the last PR putting the daily / weekly / monthly / quarterly / semi-annually / annually operations calendar in git. The goal of the K8s track is to make the 30 tools of Chapters 1 ~ 30 each play a clear role in one system. For a small team working in a single domain, AWS’s ECS Fargate may be more efficient. If you need multi-domain support, GitOps, and a multi-cloud option, EKS is suitable.

The end of the book — next steps #

With this capstone, the picture of how this book’s 30 chapters mesh inside one system is complete. But this book is a starting point for K8s, not the destination. The topics below are candidates for the next track.

  • Service Mesh — Istio · Linkerd. mTLS · fine-grained traffic routing · observability mesh.
  • MLOps on K8s — Kubeflow · KServe · Argo Workflows. A dedicated stack for ML model training · deployment · serving.
  • Multi-cluster — patterns that go beyond the limits of a single cluster. Cluster federation · multi-region · the multi-cluster mode of ArgoCD ApplicationSet.
  • eBPF in depth — beyond Cilium. The next generation of security / observability / networking.
  • eks-anywhere / on-prem K8s — the challenge of operating clusters outside a managed offering.

Each of these topics is the domain of a separate book; this book’s 30 chapters bring you to their starting line.

Finally, Appendix A — From docker-compose to k8s closes the book as a migration guide for the entry-level reader. For a reader who has followed this whole book it’s an appendix, but for a reader who came as far as Docker / docker-compose and opened this book for the first time, it’s a starting point.
