Chapter 22

Infra skeleton — deploying FastAPI/Django on ECS Fargate

This chapter walks through pushing a container image to ECR, writing a Task Definition, and bringing it up as an ECS Fargate Service behind an ALB. It is where a small blog API goes into a production environment for the first time.

Part 4 of the book, “from the console to code,” begins here. Across Parts 1 to 3 you picked up each tool one at a time: accounts and IAM, EC2 and VPC, S3 and RDS, ALB and CloudFront, and ECS / Lambda / messaging / Secrets. From here on, we tie those tools into one system. These hands-on chapters move scattered console work into code and raise a small backend to an operable state.

The application this chapter assumes is a blog API (Post + Comment + User) built with FastAPI or Django REST Framework (DRF). This chapter sets up the infra skeleton that first puts that container onto ECS Fargate. Requests for a domain resolved by Route 53 pass through the ALB to Fargate tasks in two AZs, with RDS Postgres sitting behind them. RDS integration is large enough that it’s covered separately in Chapter 23 RDS integration and migration operations, so this chapter sets up every component except the DB in one pass.

The big picture #

The infra we’ll build in this chapter is as follows.

The blog API structure
                      Internet
                  ┌──────────────┐
                  │  Route 53    │   blog.example.com
                  └──────┬───────┘
                ┌────────────────┐
                │      ALB       │   :443 → :8000
                │   (HTTPS, ACM) │
                └────────┬───────┘
              ┌──────────┴──────────┐
              ▼                     ▼
        ┌───────────┐         ┌───────────┐
        │  AZ-a     │         │   AZ-c    │
        │ Fargate   │         │  Fargate  │
        │  Task #1  │         │  Task #2  │
        │  (Blog)   │         │  (Blog)   │
        └─────┬─────┘         └─────┬─────┘
              │                     │
              └──────────┬──────────┘
                  ┌──────────────┐
                  │  RDS Postgres│   (Multi-AZ, Private)
                  └──────────────┘

Organized by component:

| Component | Role | Source |
| --- | --- | --- |
| Route 53 | domain → ALB | Chapter 12 Route 53 |
| ALB | TLS termination, routing, health checks | Chapter 13 ALB / NLB and ACM |
| ACM | TLS certificate issuance / renewal | Chapter 13 ALB / NLB and ACM |
| ECR | image storage | Chapter 16 ECR |
| ECS Fargate | container execution (serverless) | Chapter 15 ECS Fargate |
| RDS | DB | Chapter 11 RDS, Chapter 23 |
| VPC + Subnet | network separation | Chapter 8 EC2 and VPC |
| Secrets Manager | DB password | Chapter 20 Secrets / Parameter Store, Chapter 23 |

The blog API container — a one-line promise #

The container this book assumes is an artifact of the following shape.

Dockerfile (FastAPI version)
FROM python:3.14-slim AS base
WORKDIR /app

ENV PYTHONUNBUFFERED=1 \
    PIP_NO_CACHE_DIR=1

RUN apt-get update && apt-get install -y --no-install-recommends \
    libpq5 curl && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install -r requirements.txt

COPY app /app/app

EXPOSE 8000
HEALTHCHECK --interval=10s --timeout=3s --retries=3 \
    CMD curl -fsS http://127.0.0.1:8000/health || exit 1

CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]

There are three core promises.

  1. It listens on port 8000.
  2. /health returns 200 (a lightweight check with no DB dependency).
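  3. /ready returns 200 if the DB connection is healthy, otherwise 503. The health checks configured later in this chapter (the ALB Target Group and the ECS container check) both probe /health, so routing decisions here rest on /health; /ready is for anything that needs to know whether the DB is reachable.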

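For Django, make the same promise with gunicorn -w 4 --bind 0.0.0.0:8000 myproject.wsgi. The explicit bind matters: gunicorn listens on 127.0.0.1:8000 by default, which the ALB cannot reach from outside the container.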
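As a framework-neutral illustration of the three promises, here is a minimal sketch that uses only the Python standard library. It is not the book's FastAPI or Django code: `respond`, `ContractHandler`, `serve`, and the `db_ok` flag are hypothetical names, and the DB probe is stubbed out because the database only arrives in Chapter 23.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def respond(path: str, db_ok: bool) -> tuple[int, dict]:
    """Map a request path to (status code, JSON body) per the three promises."""
    if path == "/health":
        # Promise 2: never touches the DB, so it stays cheap and stable.
        return 200, {"status": "ok"}
    if path == "/ready":
        # Promise 3: 200 only while the DB connection is healthy.
        if db_ok:
            return 200, {"status": "ready"}
        return 503, {"status": "db unavailable"}
    return 404, {"detail": "not found"}


class ContractHandler(BaseHTTPRequestHandler):
    # Assumption: a real app would set this from an actual DB probe.
    db_ok = False

    def do_GET(self):
        status, body = respond(self.path, self.db_ok)
        payload = json.dumps(body).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8000) -> None:
    """Promise 1: listen on port 8000, on all interfaces."""
    HTTPServer(("0.0.0.0", port), ContractHandler).serve_forever()
```

Keeping the routing decision in a plain function makes the contract easy to unit-test without starting a server; the same split works inside a FastAPI or Django view.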

1) VPC and subnets — the network skeleton #

ECS, RDS, and ALB all live inside a VPC. Without one, none of them can be launched. Fortunately, a new account has a default VPC in each region, so you can use it to get started quickly. For production, build your own VPC.

Recommended structure #

VPC 10.0.0.0/16
Public Subnet  (10.0.0.0/24,   AZ-a)  ← ALB, NAT GW
Public Subnet  (10.0.1.0/24,   AZ-c)  ← ALB, NAT GW
Private Subnet (10.0.10.0/24,  AZ-a)  ← Fargate Task
Private Subnet (10.0.11.0/24,  AZ-c)  ← Fargate Task
DB Subnet      (10.0.20.0/24,  AZ-a)  ← RDS
DB Subnet      (10.0.21.0/24,  AZ-c)  ← RDS

There are three kinds of subnets.

| Subnet | Traffic direction | Who lives here |
| --- | --- | --- |
| Public | internet ↔ (inbound and outbound) | ALB, NAT Gateway |
| Private | no internet (outbound only via NAT) | Fargate, EC2 |
| DB | no internet, Fargate-only access | RDS |

In this chapter we bring it up quickly using only the default VPC’s public subnets (assigning a public IP to the Fargate task). We rebuild the production shape as code in Chapter 25 Terraform intro.

Two Security Groups #

The two SG roles
sg-alb       80, 443 ← 0.0.0.0/0
             (internet to the ALB)

sg-fargate   8000   ← sg-alb
             (only the ALB to Fargate)

There’s one important pattern here. An SG rule can name another SG as its source. That means “only resources that have the source SG attached,” not an IP range. Even if the ALB’s IP changes, the rule follows automatically.
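To make the SG-as-source idea concrete, here is a toy model in Python. It is only an illustration of the matching logic, not how AWS implements security groups; `Interface`, `IngressRule`, `allows`, and the IP addresses are all made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Interface:
    """A network interface: one private IP plus the SGs attached to it."""
    ip: str
    security_groups: set[str] = field(default_factory=set)


@dataclass(frozen=True)
class IngressRule:
    port: int
    source_sg: str  # an SG id as the source, instead of a CIDR block


def allows(rule: IngressRule, source: Interface, port: int) -> bool:
    """Allowed when the port matches and the sender carries the source SG."""
    return port == rule.port and rule.source_sg in source.security_groups


# sg-fargate: 8000 <- sg-alb
fargate_rule = IngressRule(port=8000, source_sg="sg-alb")

alb_node = Interface(ip="10.0.0.15", security_groups={"sg-alb"})
stranger = Interface(ip="10.0.0.99", security_groups={"sg-other"})
```

Note that `allows` never looks at `source.ip`: changing `alb_node.ip` leaves the result unchanged, which is the property the paragraph above describes.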

2) Push the image to ECR #

We already covered this in Chapter 16 ECR, but here’s a quick recap.

Create the ECR repository
aws ecr create-repository \
  --repository-name blog-api \
  --image-scanning-configuration scanOnPush=true \
  --region ap-northeast-2

scanOnPush=true automatically scans for vulnerabilities on image push (we check the results in Chapter 26 monitoring).

Build and push #

build → tag → push
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
REGION=ap-northeast-2
REGISTRY=$ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com
REPO=$REGISTRY/blog-api

# 1) login (against the registry host, not the repository path)
aws ecr get-login-password --region $REGION | \
  docker login --username AWS --password-stdin $REGISTRY

# 2) build (linux/amd64, the default Fargate architecture)
docker build --platform=linux/amd64 -t blog-api:v1 .

# 3) tag
docker tag blog-api:v1 $REPO:v1
docker tag blog-api:v1 $REPO:latest

# 4) push
docker push $REPO:v1
docker push $REPO:latest

On an Apple Silicon (M1/M2/M3) Mac, a plain docker build produces an arm64 image, which won’t run on a Fargate task that uses the default x86_64 architecture. Always specify --platform=linux/amd64. Fargate supports ARM too, but the Task Definition then has to request it through runtimePlatform (cpuArchitecture set to ARM64).

Image tag strategy #

| Tag | Meaning |
| --- | --- |
| latest | newest; don’t use it in production (no rollback possible) |
| v1, v2, ... | human-readable version |
| <git-sha> | traceable; auto-published by CI (Chapter 24) |
| <git-sha>-prod | per-environment alias |

latest is for developer convenience. Always pin the production Task Definition to a git SHA or semver. That way there is never any doubt about which code is running.
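The pinning rule can be enforced mechanically before a Task Definition is registered. The following is a minimal sketch under assumptions: `image_uri` and `is_pinned` are hypothetical helpers, and the tag patterns (a 7 to 40 character hex git SHA, or a `v1` / `v1.2.3` style version) are one possible convention rather than anything ECR requires.

```python
import re

MUTABLE_TAGS = {"latest"}
GIT_SHA = re.compile(r"^[0-9a-f]{7,40}$")
VERSION = re.compile(r"^v\d+(\.\d+){0,2}$")


def image_uri(account_id: str, region: str, repo: str, tag: str) -> str:
    """Build an ECR image URI in the shape used by the Task Definition's image field."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repo}:{tag}"


def is_pinned(tag: str) -> bool:
    """True for a git SHA or a version tag, False for latest and anything else."""
    if tag in MUTABLE_TAGS:
        return False
    return bool(GIT_SHA.match(tag) or VERSION.match(tag))
```

A deploy script could call `is_pinned(tag)` and refuse to continue when it returns False. The check is deliberately strict, so an alias such as `<git-sha>-prod` would need its own pattern.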

3) Task Definition — the container’s personal record #

The Task Definition is the most important ECS component: it bundles the image, CPU/memory, environment variables, ports, and log settings into one JSON document.

task-definition.json
{
  "family": "blog-api",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
  "taskRoleArn": "arn:aws:iam::123456789012:role/blog-api-task-role",
  "containerDefinitions": [
    {
      "name": "api",
      "image": "123456789012.dkr.ecr.ap-northeast-2.amazonaws.com/blog-api:v1",
      "portMappings": [
        { "containerPort": 8000, "protocol": "tcp" }
      ],
      "essential": true,
      "environment": [
        { "name": "ENVIRONMENT", "value": "production" },
        { "name": "LOG_LEVEL", "value": "info" }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/blog-api",
          "awslogs-region": "ap-northeast-2",
          "awslogs-stream-prefix": "api",
          "awslogs-create-group": "true"
        }
      },
      "healthCheck": {
        "command": ["CMD-SHELL", "curl -fsS http://127.0.0.1:8000/health || exit 1"],
        "interval": 10,
        "timeout": 3,
        "retries": 3,
        "startPeriod": 30
      }
    }
  ]
}

The key items are as follows.

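| Key | Meaning |
| --- | --- |
| cpu / memory | Fargate allows only fixed combinations (e.g., 256/512, 512/1024, 1024/2048) |
| executionRoleArn | the role the ECS agent uses for ECR pull / Logs / Secrets access |
| taskRoleArn | the IAM role the container code uses; boto3 signs requests with its credentials |
| awslogs | logs go to CloudWatch automatically (Chapter 26); awslogs-create-group additionally needs logs:CreateLogGroup on the execution role |
| healthCheck | the container’s own health check, evaluated by ECS (ECS does not act on the Dockerfile HEALTHCHECK, so it is declared again here) |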
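The "fixed combinations" rule for cpu / memory is easy to check before registering. The sketch below encodes the Linux Fargate size table as I understand it from the AWS documentation at the time of writing; treat the table as an assumption to verify against the current docs. `FARGATE_SIZES` and `is_valid_size` are hypothetical names.

```python
# CPU units -> allowed memory values in MiB (Linux tasks on Fargate).
# Assumption: copied from the AWS docs; re-check before relying on it.
FARGATE_SIZES = {
    256: [512, 1024, 2048],
    512: list(range(1024, 4096 + 1, 1024)),
    1024: list(range(2048, 8192 + 1, 1024)),
    2048: list(range(4096, 16384 + 1, 1024)),
    4096: list(range(8192, 30720 + 1, 1024)),
    8192: list(range(16384, 61440 + 1, 4096)),
    16384: list(range(32768, 122880 + 1, 8192)),
}


def is_valid_size(cpu: str, memory: str) -> bool:
    """Check a Task Definition's cpu/memory strings against the table."""
    return int(memory) in FARGATE_SIZES.get(int(cpu), [])
```

The arguments are strings because the Task Definition stores them that way ("512", "1024"); the chapter's own 512/1024 pair passes this check.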

The difference between the two IAM roles #

These two roles are often confused.

|  | executionRoleArn | taskRoleArn |
| --- | --- | --- |
| Who uses it | ECS agent (startup phase) | code inside the container (while running) |
| Permissions | ECR pull, CloudWatch write, Secrets read | S3 access, RDS, SQS, etc. (app logic) |

Omitting executionRoleArn makes the image pull from ECR fail at startup. Omitting taskRoleArn lets the task start, but boto3 raises NoCredentialsError on the first AWS call.
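A small pre-flight check can surface both omissions before registration. This is only a sketch: `missing_role_symptoms` is a hypothetical helper, and the messages simply restate the symptoms described above.

```python
def missing_role_symptoms(task_def: dict) -> list[str]:
    """Return the symptom the chapter describes for each role that is absent."""
    symptoms = []
    if not task_def.get("executionRoleArn"):
        symptoms.append(
            "executionRoleArn missing: image pull from ECR fails at startup"
        )
    if not task_def.get("taskRoleArn"):
        symptoms.append(
            "taskRoleArn missing: boto3 raises NoCredentialsError at runtime"
        )
    return symptoms
```

To use it, load task-definition.json with `json.load` and pass the resulting dict; an empty list means both roles are present.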

Registration #

Register the Task Definition
aws ecs register-task-definition \
  --cli-input-json file://task-definition.json \
  --region ap-northeast-2

The revision number (blog-api:1, blog-api:2, …) goes up with each registration. Rollback is done with a previous revision number, covered in Chapter 24 CI/CD.

4) ALB + Target Group — the side that receives traffic #

This is exactly the ALB pattern built in Chapter 13 ALB / NLB and ACM. The key settings are as follows.

ALB → Target Group → Fargate
ALB:443  (HTTPS, ACM certificate)
Listener: 443 → forward → tg-blog-api
Target Group: tg-blog-api
  - Protocol: HTTP / 8000
  - Target type: ip   ← Fargate is always ip
  - Health check: GET /health
  - Healthy threshold: 2
  - Interval: 15s

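Target type must be ip. Fargate tasks run in awsvpc network mode: each task gets its own ENI and private IP, and there is no EC2 instance ID to register, so instance mode doesn’t apply. ECS registers and deregisters the task IPs in the Target Group as tasks come and go.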

Create the Target Group
aws elbv2 create-target-group \
  --name tg-blog-api \
  --protocol HTTP --port 8000 \
  --vpc-id $VPC_ID \
  --target-type ip \
  --health-check-path /health \
  --healthy-threshold-count 2 \
  --health-check-interval-seconds 15

Refer to Chapter 13 for the ALB Listener rules. The setup is HTTPS 443 → forward → tg-blog-api, HTTP 80 → 443 redirect.

5) ECS Service — the manager that keeps the containers alive #

If a Task Definition is an employee’s job description, the Service is the manager that keeps that employee working at all times. It always runs as many tasks as the desired count, recreates them when they die, and replaces them gradually during deployment.

Create the ECS Cluster (once)
aws ecs create-cluster --cluster-name blog-cluster

Create the ECS Service
aws ecs create-service \
  --cluster blog-cluster \
  --service-name blog-api \
  --task-definition blog-api:1 \
  --desired-count 2 \
  --launch-type FARGATE \
  --network-configuration "awsvpcConfiguration={
      subnets=[subnet-aaa, subnet-bbb],
      securityGroups=[sg-fargate],
      assignPublicIp=ENABLED
    }" \
  --load-balancers "targetGroupArn=$TG_ARN,containerName=api,containerPort=8000" \
  --health-check-grace-period-seconds 60 \
  --deployment-configuration "deploymentCircuitBreaker={enable=true,rollback=true},maximumPercent=200,minimumHealthyPercent=100"

The key options are as follows.

| Option | Meaning |
| --- | --- |
| desired-count 2 | at least 2; spread across two AZs, the service withstands a single-AZ failure |
| assignPublicIp=ENABLED | used when there’s no private subnet + NAT (simple setup); NAT recommended for production |
| health-check-grace-period | how long the Service ignores failing ALB health checks right after launching a task (covers app boot time) |
| deploymentCircuitBreaker | auto-rollback when the new deployment’s tasks keep failing to become healthy; the failure threshold scales with the desired count (detailed in Chapter 24) |
| maximumPercent=200 | max task count during deployment (200% = old + new at once) |
| minimumHealthyPercent=100 | min healthy ratio during deployment (100% = zero downtime) |

These two % values determine the shape of the rolling update.
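A short worked calculation shows how the two percentages bound the task count during a rolling update. This is a sketch based on my reading of the ECS documentation (the lower bound rounds up, the upper bound rounds down); `rolling_update_bounds` is a hypothetical helper.

```python
def rolling_update_bounds(desired: int, minimum_healthy_percent: int,
                          maximum_percent: int) -> tuple[int, int]:
    """Return (min healthy tasks, max running tasks) during a deployment."""
    # Ceiling division for the lower bound, floor division for the upper.
    lower = -(-desired * minimum_healthy_percent // 100)
    upper = desired * maximum_percent // 100
    return lower, upper
```

With this chapter's settings, `rolling_update_bounds(2, 100, 200)` gives (2, 4): ECS may start two new tasks next to the two old ones and never drops below two healthy tasks. With 50 / 100 the result is (1, 2), meaning an old task has to stop before a new one can start.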

Auto scaling #

Just because the Service is up doesn’t mean auto scaling turns on. Configure it separately.

Register the Auto Scaling Target
aws application-autoscaling register-scalable-target \
  --service-namespace ecs \
  --resource-id service/blog-cluster/blog-api \
  --scalable-dimension ecs:service:DesiredCount \
  --min-capacity 2 --max-capacity 10

CPU-based policy
aws application-autoscaling put-scaling-policy \
  --service-namespace ecs \
  --resource-id service/blog-cluster/blog-api \
  --scalable-dimension ecs:service:DesiredCount \
  --policy-name cpu-target \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration '{
      "TargetValue": 60.0,
      "PredefinedMetricSpecification": {"PredefinedMetricType": "ECSServiceAverageCPUUtilization"},
      "ScaleOutCooldown": 30,
      "ScaleInCooldown": 120
    }'

This policy scales the service out and in to hold average CPU near 60%. In production, start with a conservative target (40 to 60%) and adjust while watching the traffic pattern.

Terraform companion — the same skeleton as code #

Above, we built everything with the console and CLI to understand the flow. But Part 4’s promise is to keep all infrastructure as code (Chapter 25 Terraform intro). Moving the same SG, Target Group, Task Definition, and Service to Terraform looks like this. (We assume the VPC, ALB, and ACM reuse the modules from Chapter 25 and Chapter 13.)

ecs.tf — security groups and Target Group
resource "aws_security_group" "alb" {
  name_prefix = "blog-alb-"
  vpc_id      = var.vpc_id

  # add a matching ingress for 80 if the HTTP → HTTPS redirect listener is used
  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_security_group" "fargate" {
  name_prefix = "blog-fargate-"
  vpc_id      = var.vpc_id

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

# allow only 8000 coming from the ALB SG (an SG reference, not an IP)
resource "aws_security_group_rule" "fargate_from_alb" {
  type                     = "ingress"
  security_group_id        = aws_security_group.fargate.id
  source_security_group_id = aws_security_group.alb.id
  from_port                = 8000
  to_port                  = 8000
  protocol                 = "tcp"
}

resource "aws_lb_target_group" "api" {
  name        = "tg-blog-api"
  port        = 8000
  protocol    = "HTTP"
  vpc_id      = var.vpc_id
  target_type = "ip" # Fargate must be ip

  health_check {
    path              = "/health"
    healthy_threshold = 2
    interval          = 15
  }
}
ecs.tf — Task Definition and Service
resource "aws_ecs_task_definition" "api" {
  family                   = "blog-api"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = "512"
  memory                   = "1024"
  execution_role_arn       = aws_iam_role.ecs_exec.arn   # ECR pull · Logs · Secrets
  task_role_arn            = aws_iam_role.app.arn        # for the container code

  container_definitions = jsonencode([{
    name         = "api"
    image        = "${aws_ecr_repository.api.repository_url}:${var.image_tag}"
    portMappings = [{ containerPort = 8000 }]
    essential    = true
    logConfiguration = {
      logDriver = "awslogs"
      options = {
        "awslogs-group"         = "/ecs/blog-api"
        "awslogs-region"        = "ap-northeast-2"
        "awslogs-stream-prefix" = "api"
      }
    }
  }])
}

resource "aws_ecs_service" "api" {
  name                              = "blog-api"
  cluster                           = aws_ecs_cluster.main.id
  task_definition                   = aws_ecs_task_definition.api.arn
  desired_count                     = 2
  launch_type                       = "FARGATE"
  health_check_grace_period_seconds = 60

  network_configuration {
    subnets         = var.private_subnet_ids
    security_groups = [aws_security_group.fargate.id]
  }
  load_balancer {
    target_group_arn = aws_lb_target_group.api.arn
    container_name   = "api"
    container_port   = 8000
  }
  deployment_circuit_breaker {
    enable   = true
    rollback = true
  }
}

Pulling image_tag out as a variable lets you inject a git SHA to deploy in Chapter 24 CI/CD. This code carries straight over into the ecs-api.tf of the Part 6 capstone. The CLI commands in this chapter are for seeing “what gets created” with your own eyes; in production, keep this Terraform as the source of truth.

6) Verifying the first deployment #

Wait until the Service reaches a stable state #

wait for stable (5~10 min)
aws ecs wait services-stable \
  --cluster blog-cluster \
  --services blog-api
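The waiter polls describe-services until every listed service has exactly one deployment and its running count equals its desired count; as far as I know it checks every 15 seconds and gives up after 40 attempts (about 10 minutes). The same condition can be sketched in Python against the JSON that `aws ecs describe-services --output json` prints. `is_stable` and `stable_from_cli_output` are hypothetical helper names.

```python
import json


def is_stable(service: dict) -> bool:
    """One deployment left, and every desired task is running."""
    return (
        len(service.get("deployments", [])) == 1
        and service.get("runningCount") == service.get("desiredCount")
    )


def stable_from_cli_output(raw: str) -> bool:
    """raw is the JSON text printed by `aws ecs describe-services`."""
    services = json.loads(raw).get("services", [])
    # An empty list means the service was not found, which is not "stable".
    return bool(services) and all(is_stable(s) for s in services)
```

This is handy when the waiter times out: running the check by hand shows whether an old deployment is still draining or the running count never reached the desired count.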

Check the health check directly #

Call the ALB DNS directly
ALB_DNS=$(aws elbv2 describe-load-balancers \
  --names blog-alb \
  --query 'LoadBalancers[0].DNSName' --output text)

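# -k skips certificate verification: the ACM certificate is issued for
# blog.example.com, so it does not match the ALB's own DNS name.
# Once the Route 53 record exists, prefer: curl -i https://blog.example.com/health
curl -ik https://$ALB_DNS/health
# HTTP/2 200
# {"status": "ok"}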

Check the logs #

CloudWatch Logs tail
aws logs tail /ecs/blog-api --follow --since 5m

Send one request. Once its access log line shows up, you’ve reached the first milestone of Part 4 of this book.

Pitfalls — 5 reasons the first deployment won’t come up #

1) Endlessly restarting in STOPPED state #

In the ECS console’s Tasks tab, click a STOPPED row to check the “Stopped reason.” Common causes are as follows.

| Message | Cause |
| --- | --- |
| CannotPullContainerError | missing ECR permission → executionRole |
| ResourceInitializationError: ... secret manager | Secrets ARN typo / permission |
| Essential container ... exited | the container itself died → CloudWatch logs |
| Task failed ELB health checks | ALB can’t judge it healthy → next item |
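The table above can be turned into a small lookup for triaging stopped tasks. This is a sketch: `HINTS` and `diagnose` are hypothetical names, the substrings come from the table rather than from an exhaustive list of ECS messages, and real stopped reasons carry more detail than these fragments.

```python
# (substrings that must all appear, likely cause), taken from the table above
HINTS = [
    (("CannotPullContainerError",),
     "missing ECR permission: check executionRole"),
    (("ResourceInitializationError", "secret"),
     "Secrets ARN typo or missing permission"),
    (("Essential container", "exited"),
     "the container itself died: read the CloudWatch logs"),
    (("failed ELB health checks",),
     "ALB cannot judge the task healthy: check the health check setup"),
]


def diagnose(stopped_reason: str) -> str:
    """Return the first matching cause for a task's stopped reason."""
    reason = stopped_reason.lower()
    for needles, cause in HINTS:
        if all(needle.lower() in reason for needle in needles):
            return cause
    return "no match: read the full stopped reason and the container logs"
```

The fallback matters: an unmatched reason should send you to the raw message and logs, not to a guess.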

2) ALB health check failure #

The most common one. The checkpoints are as follows.

  • Does the container port (8000) match the Target Group port (8000)?
  • Does the /health endpoint really return 200 (no DB dependency)?
  • Is health-check-grace-period longer than the app boot time (FastAPI 5s, Django 20 ~ 40s)?
  • Does the Fargate Security Group allow inbound 8000 from the ALB SG?
  • Can the ALB route all the way to that task’s subnet (same VPC)?
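The grace-period checkpoint can be sanity-checked with simple arithmetic. The rule of thumb below is a conservative estimate of my own, not an AWS formula: a new target needs the app to boot and then `healthy_threshold` consecutive passing checks, one per interval. The defaults are this chapter's Target Group values (15s interval, threshold 2); the function names are hypothetical.

```python
def seconds_until_healthy(boot_seconds: int, interval: int = 15,
                          healthy_threshold: int = 2) -> int:
    """Rough upper estimate: boot, then N consecutive successful checks."""
    return boot_seconds + interval * healthy_threshold


def grace_period_ok(grace_seconds: int, boot_seconds: int,
                    interval: int = 15, healthy_threshold: int = 2) -> bool:
    """True when the grace period covers boot plus the passing checks."""
    return grace_seconds >= seconds_until_healthy(
        boot_seconds, interval, healthy_threshold
    )
```

With the 60s grace period used earlier, a 5s FastAPI boot fits comfortably (35s), while a 40s Django boot comes out at 70s, which suggests raising the grace period for slower-starting apps.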

3) The ENI limit of the awsvpc networkMode #

Each Fargate task takes one ENI (Elastic Network Interface), and with it one private IP from its subnet. If the subnet runs out of free IPs, new tasks can’t launch. Don’t size the CIDR too tight: the /24 in the example above holds 256 addresses, of which AWS reserves 5, leaving 251 usable.
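The subnet arithmetic can be checked with Python's `ipaddress` module. AWS reserves five addresses in every subnet, so the usable count is the block size minus five. `usable_ips` and `max_tasks` are hypothetical helpers, and `other_enis` is an assumed count of addresses already taken by other resources in the same subnet.

```python
import ipaddress

# AWS reserves the first four addresses and the last one in each subnet.
AWS_RESERVED_PER_SUBNET = 5


def usable_ips(cidr: str) -> int:
    """Addresses in the subnet that resources can actually use."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET


def max_tasks(cidr: str, other_enis: int = 0) -> int:
    """Upper bound on Fargate tasks, at one ENI (one IP) per task."""
    return max(usable_ips(cidr) - other_enis, 0)
```

For the private subnet in the recommended structure, `usable_ips("10.0.10.0/24")` is 251. Remember that a rolling update with maximumPercent=200 temporarily doubles the number of task ENIs.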

4) ECR pull failure without a public IP #

If you launch a task in a private subnet but have neither a NAT Gateway nor a VPC Endpoint, traffic to ECR / Secrets Manager / CloudWatch is blocked and startup fails. There are three fixes.

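  1. Add a NAT Gateway (from about $0.045/hour depending on region, plus per-GB data processing)
  2. Add VPC Endpoints: Interface endpoints for ECR (ecr.api and ecr.dkr), CloudWatch Logs, and Secrets Manager, plus a Gateway endpoint for S3, where ECR stores image layers. Each interface endpoint is billed hourly per AZ, so compare the total against NAT before assuming it is cheaper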
  3. Public subnet + assignPublicIp=ENABLED (for learning)

5) Stuck deployment — new tasks never become healthy #

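With deploymentCircuitBreaker on, ECS marks the deployment as failed and rolls back once enough new tasks have failed to start or to pass health checks. With it off, ECS keeps retrying and the deployment stays IN_PROGRESS indefinitely. Check the deployments array with aws ecs describe-services.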
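Reading the deployments array is easier with a one-line summary per deployment. The sketch below works on one element of the `services` list from `aws ecs describe-services --output json`; `summarize_deployments` and `looks_stuck` are hypothetical helpers, and the "stuck" rule is a heuristic of my own, not an ECS definition.

```python
def summarize_deployments(service: dict) -> list[str]:
    """One line per entry in the service's deployments array."""
    lines = []
    for dep in service.get("deployments", []):
        lines.append(
            "{status} rollout={rollout} running={running}/{desired} "
            "failed={failed}".format(
                status=dep.get("status"),
                rollout=dep.get("rolloutState", "n/a"),
                running=dep.get("runningCount", 0),
                desired=dep.get("desiredCount", 0),
                failed=dep.get("failedTasks", 0),
            )
        )
    return lines


def looks_stuck(service: dict) -> bool:
    """Heuristic: PRIMARY deployment still IN_PROGRESS with failed tasks."""
    for dep in service.get("deployments", []):
        if dep.get("status") == "PRIMARY":
            return (dep.get("rolloutState") == "IN_PROGRESS"
                    and dep.get("failedTasks", 0) > 0)
    return False
```

A PRIMARY deployment that shows running=0/2 with a growing failed count is the typical picture of new tasks that never become healthy.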

Exercises #

  1. Using this chapter’s Task Definition, write one sentence each on who uses executionRoleArn and who uses taskRoleArn, then match each role to the symptom that appears when it is missing, based on §“The difference between the two IAM roles.” As preparation, think about why Chapter 24 CI/CD needs the iam:PassRole permission.
  2. Explain in one paragraph why a Fargate Target Group must be target-type ip. Also write out from memory the 5 checkpoints to go through when the ALB health check fails (§“ALB health check failure”).
  3. In this chapter we brought the service up quickly with the default VPC’s public subnets. Describe how this differs from the recommended production structure (the three subnet kinds: public / private / DB) in §“Recommended structure,” and note what would have to change when that production structure moves to code in Chapter 25 Terraform intro.

In short: An ECS Fargate first deployment means setting up the network with VPC subnets and two SGs, pushing the image to ECR and bundling it into a Task Definition, receiving traffic with an ip-type ALB Target Group, and letting the Service maintain the desired count. executionRole and taskRole serve different purposes, and most first-deployment failures come down to ALB health check failures or missing IAM permissions.

Next chapter #

Traffic has started coming in behind the ALB, but our API still has no DB and lives only in memory. In the next Chapter 23 RDS integration and migration operations, we’ll bring up RDS Postgres Multi-AZ inside the VPC, inject the password with Secrets Manager, and organize the operational patterns of Alembic / Django migrations along with the blue/green migration pattern that doesn’t kill production traffic.
