Chapter 22

Infra skeleton — deploying FastAPI/Django on ECS Fargate

This chapter walks through pushing a container image to ECR, writing a Task Definition, and running it as an ECS Fargate Service behind an ALB. It is where you put a small blog API into a production environment for the first time.

This is where Part 4 of the book, “from the console to code,” begins. Across Parts 1 to 3 you picked up the tools one at a time: accounts and IAM, EC2 and VPC, S3 and RDS, ALB and CloudFront, and ECS / Lambda / messaging / Secrets. From here on, we tie those tools into one system. These hands-on chapters move scattered console work into code and bring a small backend to an operable state.

The application this chapter assumes is a blog API (Post + Comment + User) built with FastAPI or Django DRF. This chapter sets up the infra skeleton that first puts that container onto ECS Fargate. Requests for a domain registered in Route 53 reach the ALB, which forwards them to Fargate Tasks in two AZs, with RDS Postgres sitting behind them. RDS integration is large enough to get its own chapter (Chapter 23 RDS integration and migration operations), so this chapter sets up every component except the DB in one pass.

The big picture #

The infra we’ll build in this chapter is as follows.

The blog API structure

                      Internet
                          │
                          ▼
                  ┌──────────────┐
                  │  Route 53    │   blog.example.com
                  └──────┬───────┘
                         │
                         ▼
                ┌────────────────┐
                │      ALB       │   :443 → :8000
                │   (HTTPS, ACM) │
                └────────┬───────┘
                         │
              ┌──────────┴──────────┐
              ▼                     ▼
        ┌───────────┐         ┌───────────┐
        │  AZ-a     │         │   AZ-c    │
        │ Fargate   │         │  Fargate  │
        │  Task #1  │         │  Task #2  │
        │  (Blog)   │         │  (Blog)   │
        └─────┬─────┘         └─────┬─────┘
              │                     │
              └──────────┬──────────┘
                         ▼
                  ┌──────────────┐
                  │  RDS Postgres│   (Multi-AZ, Private)
                  └──────────────┘

Organized by component:

Component	Role	Source
Route 53	domain → ALB	Chapter 12 Route 53
ALB	TLS termination, routing, health checks	Chapter 13 ALB / NLB and ACM
ACM	TLS certificate issuance / renewal	Chapter 13 ALB / NLB and ACM
ECR	image storage	Chapter 16 ECR
ECS Fargate	container execution (serverless)	Chapter 15 ECS Fargate
RDS	DB	Chapter 11 RDS, Chapter 23
VPC + Subnet	network separation	Chapter 8 EC2 and VPC
Secrets Manager	DB password	Chapter 20 Secrets / Parameter Store, Chapter 23

The blog API container — a one-line promise #

The container this book assumes is an artifact of the following shape.

Dockerfile (FastAPI version)

FROM python:3.14-slim AS base
WORKDIR /app

ENV PYTHONUNBUFFERED=1 \
    PIP_NO_CACHE_DIR=1

RUN apt-get update && apt-get install -y --no-install-recommends \
    libpq5 curl && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install -r requirements.txt

COPY app /app/app

EXPOSE 8000
HEALTHCHECK --interval=10s --timeout=3s --retries=3 \
    CMD curl -fsS http://127.0.0.1:8000/health || exit 1

CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]

There are three core promises.

It listens on port 8000.
/health returns 200 (a lightweight check with no DB dependency).
/ready returns 200 if the DB connection is healthy, otherwise 503. The ALB and ECS judge a task from these responses; in this chapter both the Target Group health check and the container health check point at /health.

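For Django, make the same promise with gunicorn -w 4 -b 0.0.0.0:8000 myproject.wsgi. The -b flag matters: gunicorn binds to 127.0.0.1:8000 by default, which the ALB cannot reach from outside the container.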
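To make the /health and /ready contract concrete, here is a minimal framework-free sketch written as a plain WSGI application using only the standard library. It is not the book's FastAPI or Django code. The names make_app and db_check are hypothetical, and db_check stands in for whatever real DB probe (for example a SELECT 1) the application would run.

```python
import json


def make_app(db_check):
    """Build a WSGI app; db_check is a hypothetical callable returning True when the DB answers."""

    def app(environ, start_response):
        path = environ.get("PATH_INFO", "/")
        if path == "/health":
            # Liveness only: never touches the DB, so it stays 200 during a DB outage.
            status, body = "200 OK", {"status": "ok"}
        elif path == "/ready":
            # Readiness: 200 only when the DB check passes, otherwise 503.
            if db_check():
                status, body = "200 OK", {"status": "ready"}
            else:
                status, body = "503 Service Unavailable", {"status": "db unavailable"}
        else:
            status, body = "404 Not Found", {"detail": "not found"}

        payload = json.dumps(body).encode("utf-8")
        start_response(status, [
            ("Content-Type", "application/json"),
            ("Content-Length", str(len(payload))),
        ])
        return [payload]

    return app


# Hypothetical wiring: a DB probe that always succeeds.
application = make_app(lambda: True)
```

Because application is a standard WSGI callable, it could be served locally with wsgiref.simple_server from the standard library or by gunicorn bound to 0.0.0.0:8000. The point of the sketch is only the split between the two endpoints: /health must stay green when the DB is down, while /ready must not.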

1) VPC and subnets — the network skeleton #

ECS, RDS, and ALB all live inside a VPC. Without one, none of them can be launched. Fortunately, a new account has a default VPC in each region, so you can use it to get started quickly. For production, build your own VPC.

Recommended structure #

VPC 10.0.0.0/16

Public Subnet  (10.0.0.0/24,   AZ-a)  ← ALB, NAT GW
Public Subnet  (10.0.1.0/24,   AZ-c)  ← ALB, NAT GW
Private Subnet (10.0.10.0/24,  AZ-a)  ← Fargate Task
Private Subnet (10.0.11.0/24,  AZ-c)  ← Fargate Task
DB Subnet      (10.0.20.0/24,  AZ-a)  ← RDS
DB Subnet      (10.0.21.0/24,  AZ-c)  ← RDS
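Before creating anything, it can help to sanity-check a CIDR plan like the one above. The sketch below uses Python's standard ipaddress module to confirm that every subnet sits inside the VPC range and that no two subnets overlap. The subnet names (public-a and so on) and the check_layout helper are labels invented here for illustration.

```python
import ipaddress

VPC = ipaddress.ip_network("10.0.0.0/16")

# (hypothetical name, CIDR) pairs copied from the recommended structure
SUBNETS = [
    ("public-a", "10.0.0.0/24"),
    ("public-c", "10.0.1.0/24"),
    ("private-a", "10.0.10.0/24"),
    ("private-c", "10.0.11.0/24"),
    ("db-a", "10.0.20.0/24"),
    ("db-c", "10.0.21.0/24"),
]


def check_layout(vpc, subnets):
    """Return a list of problems; an empty list means the plan is consistent."""
    problems = []
    nets = [(name, ipaddress.ip_network(cidr)) for name, cidr in subnets]
    for name, net in nets:
        if not net.subnet_of(vpc):
            problems.append(f"{name} {net} is outside {vpc}")
    for i, (name_a, net_a) in enumerate(nets):
        for name_b, net_b in nets[i + 1:]:
            if net_a.overlaps(net_b):
                problems.append(f"{name_a} overlaps {name_b}")
    return problems


print(check_layout(VPC, SUBNETS))  # prints [] for the plan above
```

The gaps between 10.0.1.0, 10.0.10.0, and 10.0.20.0 are deliberate in a plan like this: they leave room to add more subnets of each kind later without renumbering.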

There are three kinds of subnets.

Subnet	Traffic direction	Who lives here
Public	internet ↔	ALB, NAT Gateway
Private	no inbound from the internet (outbound only via NAT)	Fargate, EC2
DB	no internet, Fargate-only access	RDS

In this chapter we bring it up quickly using only the default VPC’s public subnets (assigning a public IP to the Fargate task). We rebuild the production shape as code in Chapter 25 Terraform intro.

Two Security Groups #

The two SG roles

sg-alb       80, 443 ← 0.0.0.0/0
             (internet to the ALB)

sg-fargate   8000   ← sg-alb
             (only the ALB to Fargate)

There’s one important pattern here. An SG can take another SG as its source. It means “only resources that have this SG attached,” not an IP range. Even if the ALB’s IP changes, the rule follows automatically.
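The SG-as-source idea can be illustrated with a toy model. This is not how AWS evaluates rules internally, only a sketch of the matching logic described above: a rule names a source security group, and a sender is allowed when it carries that group, whatever its IP happens to be. The IngressRule and Eni classes and the sample IPs are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IngressRule:
    port: int
    source_sg: str  # another security group's ID, not a CIDR


@dataclass
class Eni:
    private_ip: str
    security_groups: frozenset


def is_allowed(source, port, rules):
    """Allow when some rule matches the port and the sender carries the rule's source SG."""
    return any(r.port == port and r.source_sg in source.security_groups for r in rules)


# sg-fargate: 8000 <- sg-alb
sg_fargate_rules = [IngressRule(port=8000, source_sg="sg-alb")]

alb_node = Eni("10.0.0.15", frozenset({"sg-alb"}))
print(is_allowed(alb_node, 8000, sg_fargate_rules))  # True

alb_node.private_ip = "10.0.1.77"  # the ALB's IP changes; the rule is untouched
print(is_allowed(alb_node, 8000, sg_fargate_rules))  # True

stranger = Eni("10.0.0.99", frozenset({"sg-other"}))
print(is_allowed(stranger, 8000, sg_fargate_rules))  # False
```

Note that private_ip never appears in is_allowed. That absence is the whole pattern: membership in the source SG decides access, not the address.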

2) Push the image to ECR #

We already covered this in Chapter 16 ECR, but here’s a quick recap.

Create the ECR repository

aws ecr create-repository \
  --repository-name blog-api \
  --image-scanning-configuration scanOnPush=true \
  --region ap-northeast-2

scanOnPush=true automatically scans for vulnerabilities on image push (we check the results in Chapter 26 monitoring).

Build and push #

build → tag → push

ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
REGION=ap-northeast-2
REPO=$ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/blog-api

# 1) login
aws ecr get-login-password --region $REGION | \
  docker login --username AWS --password-stdin $REPO

# 2) build (linux/amd64 — Fargate standard architecture)
docker build --platform=linux/amd64 -t blog-api:v1 .

# 3) tag
docker tag blog-api:v1 $REPO:v1
docker tag blog-api:v1 $REPO:latest

# 4) push
docker push $REPO:v1
docker push $REPO:latest

On an Apple Silicon (M1/M2/M3) Mac, a plain docker build produces an arm64 image, which won’t run on Fargate’s default x86_64 architecture. Always specify --platform=linux/amd64. Fargate supports ARM too, but it needs separate configuration (runtimePlatform with cpuArchitecture set to ARM64 in the Task Definition).

Image tag strategy #

Tag	Meaning
`latest`	newest — don’t use it in production (no rollback possible)
`v1, v2, ...`	human-readable version
`<git-sha>`	traceable — auto-published by CI (Chapter 24)
`<git-sha>-prod`	per-environment alias

latest is for developer convenience. Always pin the production Task Definition to a git SHA or a version tag. That way there is never any doubt about which code is running.
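The pinning rule is easy to enforce in a deploy script. The sketch below builds the ECR image URI in the same shape as the REPO variable used earlier and refuses mutable tags such as latest. The function names and the two regular expressions are assumptions made for this example. They accept a 7 to 40 character lowercase hex git SHA or a version like v1 or 1.4.2, and deliberately do not handle per-environment aliases such as a -prod suffix.

```python
import re

SHA_RE = re.compile(r"^[0-9a-f]{7,40}$")          # short or full git SHA
VERSION_RE = re.compile(r"^v?\d+(\.\d+){0,2}$")   # v1, v2, 1.4, 1.4.2


def image_uri(account_id, region, repo, tag):
    """ECR image URI: <account>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repo}:{tag}"


def is_pinned(tag):
    """True for a git SHA or version tag; False for mutable tags such as latest."""
    return bool(SHA_RE.match(tag) or VERSION_RE.match(tag))


def production_image(account_id, region, repo, tag):
    """Return the image URI, refusing tags that do not identify one exact build."""
    if not is_pinned(tag):
        raise ValueError(f"refusing to deploy mutable tag {tag!r} to production")
    return image_uri(account_id, region, repo, tag)


print(production_image("123456789012", "ap-northeast-2", "blog-api", "v1"))
```

Calling production_image with the tag latest raises ValueError, which is the behavior you want from a CI step: the deploy fails loudly instead of shipping an image nobody can identify later.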

3) Task Definition — the container’s personal record #

ECS’s most important component. Image + CPU/memory + environment variables + ports + log settings are bundled into one JSON.

task-definition.json

{
  "family": "blog-api",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
  "taskRoleArn": "arn:aws:iam::123456789012:role/blog-api-task-role",
  "containerDefinitions": [
    {
      "name": "api",
      "image": "123456789012.dkr.ecr.ap-northeast-2.amazonaws.com/blog-api:v1",
      "portMappings": [
        { "containerPort": 8000, "protocol": "tcp" }
      ],
      "essential": true,
      "environment": [
        { "name": "ENVIRONMENT", "value": "production" },
        { "name": "LOG_LEVEL", "value": "info" }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/blog-api",
          "awslogs-region": "ap-northeast-2",
          "awslogs-stream-prefix": "api",
          "awslogs-create-group": "true"
        }
      },
      "healthCheck": {
        "command": ["CMD-SHELL", "curl -fsS http://127.0.0.1:8000/health || exit 1"],
        "interval": 10,
        "timeout": 3,
        "retries": 3,
        "startPeriod": 30
      }
    }
  ]
}

The key items are as follows.

Key	Meaning
`cpu / memory`	Fargate allows only fixed combinations (e.g., 256/512, 512/1024, 1024/2048)
`executionRoleArn`	the role the ECS agent uses for ECR pull / Logs / Secrets access
`taskRoleArn`	the IAM role the container code uses — boto3 signs with this
`awslogs`	logs go to CloudWatch automatically (Chapter 26); with awslogs-create-group, the execution role also needs logs:CreateLogGroup
`healthCheck`	the container’s own health check (separate from the Dockerfile)
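Because Fargate rejects cpu / memory pairs outside its fixed table, it can save a failed registration to check the pair before calling register-task-definition. The sketch below encodes the combinations for 256 through 4096 CPU units as we understand them from the AWS documentation. Treat the table as an assumption to verify against the current docs, and note that the larger 8192 and 16384 sizes are left out. The function names are hypothetical.

```python
def fargate_memory_options(cpu):
    """Memory values (MiB) accepted for a given Fargate CPU value (CPU units)."""
    table = {
        256: [512, 1024, 2048],
        512: [1024, 2048, 3072, 4096],
        1024: list(range(2048, 8192 + 1, 1024)),
        2048: list(range(4096, 16384 + 1, 1024)),
        4096: list(range(8192, 30720 + 1, 1024)),
    }
    return table.get(cpu, [])


def is_valid_fargate_size(cpu, memory):
    """Accepts the string values used in task-definition.json, e.g. "512" and "1024"."""
    return int(memory) in fargate_memory_options(int(cpu))


print(is_valid_fargate_size("512", "1024"))   # True: the pair used in this chapter
print(is_valid_fargate_size("256", "4096"))   # False: too much memory for 256 CPU units
```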

The difference between the two IAM roles #

These two roles are often confused.

	`executionRoleArn`	`taskRoleArn`
Who uses it	ECS agent (startup phase)	code inside the container (while running)
Permissions	ECR pull, CloudWatch write, Secrets read	S3 access, RDS, SQS, etc. — app logic

Omitting executionRoleArn makes the image pull from ECR fail, so the task never starts. Omitting taskRoleArn lets the container start, but boto3 raises NoCredentialsError on its first AWS call.

Registration #

aws ecs register-task-definition \
  --cli-input-json file://task-definition.json \
  --region ap-northeast-2

The revision number (blog-api:1, blog-api:2, …) goes up with each registration. Rollback is done with a previous revision number, covered in Chapter 24 CI/CD.

4) ALB + Target Group — the side that receives traffic #

Exactly the ALB pattern built in Chapter 13 ALB / NLB and ACM. The key is as follows.

ALB → Target Group → Fargate

ALB:443  (HTTPS, ACM certificate)
   │
   ▼
Listener: 443 → forward → tg-blog-api
   │
   ▼
Target Group: tg-blog-api
  - Protocol: HTTP / 8000
  - Target type: ip   ← Fargate is always ip
  - Health check: GET /health
  - Healthy threshold: 2
  - Interval: 15s

Target type must be ip. Fargate tasks run in awsvpc mode, so each task gets its own ENI and private IP and there is no EC2 instance ID to register; instance mode doesn’t work.

Create the Target Group

aws elbv2 create-target-group \
  --name tg-blog-api \
  --protocol HTTP --port 8000 \
  --vpc-id $VPC_ID \
  --target-type ip \
  --health-check-path /health \
  --healthy-threshold-count 2 \
  --health-check-interval-seconds 15

Refer to Chapter 13 for the ALB Listener rules. The setup is HTTPS 443 → forward → tg-blog-api, HTTP 80 → 443 redirect.

5) ECS Service — the manager that keeps the containers alive #

If a Task Definition is an employee’s job description, the Service is the manager that keeps that employee working at all times. It always runs as many tasks as the desired count, recreates them when they die, and replaces them gradually during deployment.

Create the ECS Cluster (once)

aws ecs create-cluster --cluster-name blog-cluster

Create the ECS Service

aws ecs create-service \
  --cluster blog-cluster \
  --service-name blog-api \
  --task-definition blog-api:1 \
  --desired-count 2 \
  --launch-type FARGATE \
  --network-configuration "awsvpcConfiguration={subnets=[subnet-aaa,subnet-bbb],securityGroups=[sg-fargate],assignPublicIp=ENABLED}" \
  --load-balancers "targetGroupArn=$TG_ARN,containerName=api,containerPort=8000" \
  --health-check-grace-period-seconds 60 \
  --deployment-configuration "deploymentCircuitBreaker={enable=true,rollback=true},maximumPercent=200,minimumHealthyPercent=100"

The key options are as follows.

Option	Meaning
`desired-count 2`	at least 2 — a Multi-AZ deployment withstands a single-AZ failure
`assignPublicIp=ENABLED`	used when there’s no private subnet + NAT (simple setup). NAT recommended for production
`health-check-grace-period`	grace period the Service waits for the ALB health check right after launching a task (app boot time)
`deploymentCircuitBreaker`	auto-rollback if a new deployment fails N times in a row (detailed in Chapter 24)
`maximumPercent=200`	max task count during deployment (200% = old + new at once)
`minimumHealthyPercent=100`	min healthy ratio during deployment (100% = zero downtime)

These two % values determine the shape of the rolling update.
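The arithmetic behind those two percentages is short enough to write down. As we read the ECS documentation, the lower bound is desiredCount × minimumHealthyPercent rounded up, and the upper bound is desiredCount × maximumPercent rounded down. The helper name below is hypothetical, and the rounding directions are stated as our understanding rather than taken from this chapter.

```python
def rolling_update_bounds(desired, minimum_healthy_percent, maximum_percent):
    """Return (min tasks that must stay healthy, max tasks allowed to exist) during a deployment."""
    # Integer arithmetic avoids floating-point surprises.
    min_healthy = (desired * minimum_healthy_percent + 99) // 100  # round up
    max_total = (desired * maximum_percent) // 100                 # round down
    return min_healthy, max_total


# This chapter's settings: 2 tasks, 100% / 200%
print(rolling_update_bounds(2, 100, 200))  # (2, 4)

# A tighter, hypothetical setting: 3 tasks, 50% / 150%
print(rolling_update_bounds(3, 50, 150))   # (2, 4)
```

With the chapter's values the Service may run up to 4 tasks and never drops below 2 healthy ones, so both new tasks can start and pass health checks before either old task is stopped.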

Auto scaling #

Just because the Service is up doesn’t mean auto scaling turns on. Configure it separately.

aws application-autoscaling register-scalable-target \
  --service-namespace ecs \
  --resource-id service/blog-cluster/blog-api \
  --scalable-dimension ecs:service:DesiredCount \
  --min-capacity 2 --max-capacity 10

CPU-based policy

aws application-autoscaling put-scaling-policy \
  --service-namespace ecs \
  --resource-id service/blog-cluster/blog-api \
  --scalable-dimension ecs:service:DesiredCount \
  --policy-name cpu-target \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration '{
      "TargetValue": 60.0,
      "PredefinedMetricSpecification": {"PredefinedMetricType": "ECSServiceAverageCPUUtilization"},
      "ScaleOutCooldown": 30,
      "ScaleInCooldown": 120
    }'

The policy scales out and in to keep average CPU near 60%. In production, start with a conservative target (40 ~ 60%) and adjust while watching the traffic pattern.
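A rough mental model for target tracking is proportional scaling: if average CPU is 1.5 times the target, you need about 1.5 times the tasks. The sketch below is only that mental model, not the algorithm Application Auto Scaling actually runs (which also involves CloudWatch alarms and the cooldowns above). The function name is hypothetical, and the defaults mirror this chapter's target of 60 and capacity range of 2 to 10.

```python
import math


def estimate_desired_count(current_tasks, avg_cpu_percent,
                           target_percent=60.0, min_capacity=2, max_capacity=10):
    """Proportional estimate of the task count that would bring average CPU back to target."""
    raw = math.ceil(current_tasks * avg_cpu_percent / target_percent)
    # Clamp to the range registered with register-scalable-target.
    return max(min_capacity, min(max_capacity, raw))


print(estimate_desired_count(2, 90.0))    # 3: CPU is 1.5x the target
print(estimate_desired_count(4, 30.0))    # 2: half the target, scale in to the floor
print(estimate_desired_count(8, 120.0))   # 10: capped at max capacity
```

The third case shows why max-capacity deserves thought: the estimate asks for 16 tasks, but the Service stops at 10 and CPU stays above target.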

Terraform companion — the same skeleton as code #

Above, we built it with the console and CLI to understand the flow. But Part 4’s promise is to keep all infrastructure as code (Chapter 25 Terraform intro). Moving the same SG · Target Group · Task Definition · Service to Terraform looks like this. (We assume VPC · ALB · ACM reuse the modules from Chapter 25 · Chapter 13.)

ecs.tf — security groups and Target Group

resource "aws_security_group" "alb" {
  name_prefix = "blog-alb-"
  vpc_id      = var.vpc_id
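  ingress {
    from_port   = 80 # HTTP, redirected to 443 by the listener
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }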
}

resource "aws_security_group" "fargate" {
  name_prefix = "blog-fargate-"
  vpc_id      = var.vpc_id
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

# allow only 8000 coming from the ALB SG — an SG reference, not an IP
resource "aws_security_group_rule" "fargate_from_alb" {
  type                     = "ingress"
  security_group_id        = aws_security_group.fargate.id
  source_security_group_id = aws_security_group.alb.id
  from_port                = 8000
  to_port                  = 8000
  protocol                 = "tcp"
}

resource "aws_lb_target_group" "api" {
  name        = "tg-blog-api"
  port        = 8000
  protocol    = "HTTP"
  vpc_id      = var.vpc_id
  target_type = "ip"                 # Fargate must be ip
  health_check {
    path              = "/health"
    healthy_threshold = 2
    interval          = 15
  }
}

ecs.tf — Task Definition and Service

resource "aws_ecs_task_definition" "api" {
  family                   = "blog-api"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = "512"
  memory                   = "1024"
  execution_role_arn       = aws_iam_role.ecs_exec.arn   # ECR pull · Logs · Secrets
  task_role_arn            = aws_iam_role.app.arn        # for the container code

  container_definitions = jsonencode([{
    name         = "api"
    image        = "${aws_ecr_repository.api.repository_url}:${var.image_tag}"
    portMappings = [{ containerPort = 8000 }]
    essential    = true
    logConfiguration = {
      logDriver = "awslogs"
      options = {
        "awslogs-group"         = "/ecs/blog-api"
        "awslogs-region"        = "ap-northeast-2"
        "awslogs-stream-prefix" = "api"
      }
    }
  }])
}

resource "aws_ecs_service" "api" {
  name                              = "blog-api"
  cluster                           = aws_ecs_cluster.main.id
  task_definition                   = aws_ecs_task_definition.api.arn
  desired_count                     = 2
  launch_type                       = "FARGATE"
  health_check_grace_period_seconds = 60

  network_configuration {
    subnets         = var.private_subnet_ids
    security_groups = [aws_security_group.fargate.id]
  }
  load_balancer {
    target_group_arn = aws_lb_target_group.api.arn
    container_name   = "api"
    container_port   = 8000
  }
  deployment_circuit_breaker {
    enable   = true
    rollback = true
  }
}

Pulling image_tag out as a variable lets you inject a git SHA to deploy in Chapter 24 CI/CD. This code carries straight over into the ecs-api.tf of the Part 6 capstone. The CLI commands in this chapter are for seeing “what gets created” with your own eyes; in production, keep this Terraform as the source of truth.

6) Verifying the first deployment #

Wait until the Service reaches a stable state #

wait for stable (5~10 min)

aws ecs wait services-stable \
  --cluster blog-cluster \
  --services blog-api

Check the health check directly #

Call the ALB DNS directly

ALB_DNS=$(aws elbv2 describe-load-balancers \
  --names blog-alb \
  --query 'LoadBalancers[0].DNSName' --output text)

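# -k skips certificate name verification: the ACM certificate covers blog.example.com, not the ALB's own DNS name
curl -ik https://$ALB_DNS/health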
# HTTP/2 200
# {"status": "ok"}

Check the logs #

CloudWatch Logs tail

aws logs tail /ecs/blog-api --follow --since 5m

Send one request, and when an access log shows up in the logs, you’ve reached the first destination of Part 4 of this book.

Pitfalls — 5 reasons the first deployment won’t come up #

1) Tasks keep stopping and restarting #

In the ECS console’s Tasks tab, click a STOPPED row to check the “Stopped reason.” Common causes are as follows.

Message	Cause
`CannotPullContainerError`	missing ECR permission → executionRole
`ResourceInitializationError: ... secret manager`	Secrets ARN typo / permission
`Essential container ... exited`	the container itself died → CloudWatch logs
`Task failed ELB health checks`	ALB can’t judge it healthy → next item

2) ALB health check failure #

The most common one. The checkpoints are as follows.

Does the container port (8000) match the Target Group port (8000)?
Does the /health endpoint really return 200 (no DB dependency)?
Is health-check-grace-period longer than the app boot time (FastAPI 5s, Django 20 ~ 40s)?
Does the Fargate Security Group’s inbound rule allow port 8000 from the ALB SG?
Can the ALB route all the way to that task’s subnet (same VPC)?
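Most of these checkpoints can be checked mechanically from values you already know. The sketch below is a hypothetical pre-flight helper: the DeploymentSettings fields are names invented for this example, and the inputs are whatever you read from your own configuration. Whether /health really returns 200 cannot be checked statically, so that checkpoint is left to a curl against the running container.

```python
from dataclasses import dataclass


@dataclass
class DeploymentSettings:
    container_port: int
    target_group_port: int
    app_boot_seconds: int           # measured locally
    grace_period_seconds: int       # --health-check-grace-period-seconds
    fargate_ingress_source_sgs: set  # source SGs allowed on the container port
    alb_sg: str
    alb_vpc: str
    task_vpc: str


def health_check_findings(s):
    """Return the checkpoints that fail; an empty list means none of the static checks fail."""
    findings = []
    if s.container_port != s.target_group_port:
        findings.append("container port and Target Group port differ")
    if s.grace_period_seconds <= s.app_boot_seconds:
        findings.append("grace period is not longer than app boot time")
    if s.alb_sg not in s.fargate_ingress_source_sgs:
        findings.append("Fargate SG does not allow the ALB SG")
    if s.alb_vpc != s.task_vpc:
        findings.append("ALB and task are in different VPCs")
    return findings


# Values from this chapter, with a hypothetical 5 s boot time and VPC ID.
chapter = DeploymentSettings(8000, 8000, 5, 60, {"sg-alb"}, "sg-alb", "vpc-1", "vpc-1")
print(health_check_findings(chapter))  # []
```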

3) The ENI limit of the `awsvpc` networkMode #

Each Fargate task takes one ENI (Elastic Network Interface), and with it one private IP from its subnet. If the subnet runs out of IPs, new tasks can’t launch. Don’t size the CIDR too tight: the /24 in the example above has 256 addresses, of which AWS reserves 5, leaving 251 usable.
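The headroom is quick to compute. AWS reserves 5 addresses in every subnet (the first four and the last one), so the usable count is the block size minus 5. A minimal sketch with the standard ipaddress module, where usable_ips is a hypothetical helper name:

```python
import ipaddress

AWS_RESERVED_PER_SUBNET = 5  # first four addresses plus the last one


def usable_ips(cidr):
    """Addresses in a subnet that tasks, ALB nodes, and other ENIs can actually use."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET


print(usable_ips("10.0.10.0/24"))  # 251
print(usable_ips("10.0.10.0/28"))  # 11
```

Remember that a rolling update with maximumPercent=200 briefly doubles the task count, so size the subnet for the deployment peak rather than the steady state.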

4) ECR pull failure without a public IP #

If you launch a task in a private subnet but have neither a NAT Gateway nor a VPC Endpoint, traffic to ECR / Secrets Manager / CloudWatch is blocked and startup fails. There are three fixes.

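Add a NAT Gateway (about $0.045/hour in us-east-1, varying by region, plus data processing charges)
Add Interface VPC Endpoints for ECR (ecr.api and ecr.dkr) / Logs / Secrets Manager, plus the S3 Gateway Endpoint that ECR image layers are pulled through (each interface endpoint is billed hourly per AZ, so compare the total against NAT)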
Public subnet + assignPublicIp=ENABLED (for learning)

5) Stuck deployment — new tasks never become healthy #

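With deploymentCircuitBreaker on, ECS marks the deployment as failed and rolls back automatically once enough new tasks have failed to become healthy. With it off, the Service keeps retrying and the deployment stays IN_PROGRESS indefinitely. Check the deployments array with aws ecs describe-services.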
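Reading the deployments array by eye gets tedious, so here is a small sketch that scans saved describe-services output for deployments that are still rolling out and short of their desired count. It relies on the rolloutState, desiredCount, runningCount, failedTasks, and taskDefinition fields of each deployment entry. The function name is hypothetical, and the sample dictionary is invented for illustration rather than captured from a real cluster.

```python
def stuck_deployments(describe_services_output):
    """List deployments that are IN_PROGRESS and running fewer tasks than desired."""
    stuck = []
    for service in describe_services_output.get("services", []):
        for dep in service.get("deployments", []):
            in_progress = dep.get("rolloutState") == "IN_PROGRESS"
            short = dep.get("runningCount", 0) < dep.get("desiredCount", 0)
            if in_progress and short:
                stuck.append({
                    "service": service.get("serviceName"),
                    "taskDefinition": dep.get("taskDefinition"),
                    "running": dep.get("runningCount", 0),
                    "desired": dep.get("desiredCount", 0),
                    "failedTasks": dep.get("failedTasks", 0),
                })
    return stuck


# Hypothetical sample shaped like describe-services output.
sample = {"services": [{"serviceName": "blog-api", "deployments": [
    {"status": "PRIMARY", "rolloutState": "IN_PROGRESS", "taskDefinition": "blog-api:2",
     "desiredCount": 2, "runningCount": 0, "failedTasks": 4},
    {"status": "ACTIVE", "rolloutState": "COMPLETED", "taskDefinition": "blog-api:1",
     "desiredCount": 2, "runningCount": 2, "failedTasks": 0},
]}]}

for item in stuck_deployments(sample):
    print(item)
```

To use it on a real Service, save the CLI output with --output json and pass the result of json.load to stuck_deployments. A healthy deployment also matches for a short while right after it starts, so a growing failedTasks count across repeated runs is the signal to look for.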

Exercises #

For this chapter’s Task Definition, write one sentence each on who uses executionRoleArn and who uses taskRoleArn, then match each role to the symptom that appears when it is missing, based on §“The difference between the two IAM roles.” As preparation, recall why Chapter 24 CI/CD needs the iam:PassRole permission.
Explain in one paragraph why a Fargate Target Group must be target-type ip. Also write out, without looking, the 5 checkpoints to check when the ALB health check fails (§“ALB health check failure”).
In this chapter we brought it up quickly with the default VPC’s public subnets. Organize how this differs from the recommended production structure (the three subnet kinds: public / private / DB) grounded in §“Recommended structure,” and note what changes would be needed when moving that production structure to code in Chapter 25 Terraform intro.

In short: An ECS Fargate first deployment is the flow of setting up the network with VPC subnets and two SGs, pushing the image to ECR and bundling it into a Task Definition, receiving it with an ip-type ALB Target Group, and having the Service maintain the desired count. executionRole and taskRole have different roles, and most first-deployment failures are ALB health check failures and missing IAM permissions.

Next chapter #

Traffic has started coming in behind the ALB, but our API still has no DB and lives only in memory. In the next Chapter 23 RDS integration and migration operations, we’ll bring up RDS Postgres Multi-AZ inside the VPC, inject the password with Secrets Manager, and organize the operational patterns of Alembic / Django migrations along with the blue/green migration pattern that doesn’t kill production traffic.