Infra skeleton — deploying FastAPI/Django on ECS Fargate
This chapter walks through pushing a container image to ECR, writing a Task Definition, and bringing it up as an ECS Fargate Service behind an ALB. It is the chapter where you put a small blog API into a production environment for the first time.
This is where Part 4 of the book, “from the console to code,” begins. Across Parts 1 ~ 3 you picked up each tool one at a time: accounts and IAM, EC2 and VPC, S3 and RDS, ALB and CloudFront, and ECS / Lambda / messaging / Secrets. From here on, we tie those tools into one system. These are the hands-on chapters that move scattered console work into code and bring a small backend to an operable state.
The application this chapter assumes is a blog API (Post + Comment + User) built with FastAPI or Django DRF. This chapter sets up the infra skeleton that first puts that container onto ECS Fargate. A request for a domain registered in Route 53 flows through the ALB to Fargate Tasks in two AZs, with RDS Postgres sitting behind them. RDS integration is large enough to get its own chapter (Chapter 23 RDS integration and migration operations), so this chapter sets up every component except the DB in one pass.
The big picture #
The infra we’ll build in this chapter is as follows.
                 Internet
                     │
                     ▼
              ┌──────────────┐
              │   Route 53   │  blog.example.com
              └──────┬───────┘
                     │
                     ▼
            ┌────────────────┐
            │      ALB       │  :443 → :8000
            │  (HTTPS, ACM)  │
            └────────┬───────┘
                     │
          ┌──────────┴──────────┐
          ▼                     ▼
    ┌───────────┐         ┌───────────┐
    │   AZ-a    │         │   AZ-c    │
    │  Fargate  │         │  Fargate  │
    │  Task #1  │         │  Task #2  │
    │  (Blog)   │         │  (Blog)   │
    └─────┬─────┘         └─────┬─────┘
          │                     │
          └──────────┬──────────┘
                     ▼
              ┌──────────────┐
              │ RDS Postgres │  (Multi-AZ, Private)
              └──────────────┘

Organized by component:
| Component | Role | Source |
|---|---|---|
| Route 53 | domain → ALB | Chapter 12 Route 53 |
| ALB | TLS termination, routing, health checks | Chapter 13 ALB / NLB and ACM |
| ACM | TLS certificate issuance / renewal | Chapter 13 ALB / NLB and ACM |
| ECR | image storage | Chapter 16 ECR |
| ECS Fargate | container execution (serverless) | Chapter 15 ECS Fargate |
| RDS | DB | Chapter 11 RDS, Chapter 23 |
| VPC + Subnet | network separation | Chapter 8 EC2 and VPC |
| Secrets Manager | DB password | Chapter 20 Secrets / Parameter Store, Chapter 23 |
The blog API container — a one-line promise #
The container this book assumes is an artifact of the following shape.
FROM python:3.14-slim AS base
WORKDIR /app
ENV PYTHONUNBUFFERED=1 \
PIP_NO_CACHE_DIR=1
RUN apt-get update && apt-get install -y --no-install-recommends \
libpq5 curl && rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY app /app/app
EXPOSE 8000
HEALTHCHECK --interval=10s --timeout=3s --retries=3 \
CMD curl -fsS http://127.0.0.1:8000/health || exit 1
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]

There are three core promises.
- It listens on port 8000.
- /health returns 200 (a lightweight check with no DB dependency).
- /ready returns 200 if the DB connection is healthy, otherwise 503.

The ALB and ECS decide whether to route traffic to a task from these responses.
For Django, make the same promise with gunicorn -w 4 -b 0.0.0.0:8000 myproject.wsgi. The explicit bind matters because gunicorn listens on 127.0.0.1:8000 by default, which the ALB cannot reach from outside the container.
1) VPC and subnets — the network skeleton #
ECS, RDS, and ALB all live inside a VPC. Without a VPC, you can’t launch any of them. Fortunately, a new account has a default VPC in each region, so you can use it to get started quickly. For production, build your own VPC.
Recommended structure #
Public Subnet (10.0.0.0/24, AZ-a) ← ALB, NAT GW
Public Subnet (10.0.1.0/24, AZ-c) ← ALB, NAT GW
Private Subnet (10.0.10.0/24, AZ-a) ← Fargate Task
Private Subnet (10.0.11.0/24, AZ-c) ← Fargate Task
DB Subnet (10.0.20.0/24, AZ-a) ← RDS
DB Subnet (10.0.21.0/24, AZ-c) ← RDS

There are three kinds of subnets.
| Subnet | Traffic direction | Who lives here |
|---|---|---|
| Public | internet ↔ | ALB, NAT Gateway |
| Private | no internet (outbound only via NAT) | Fargate, EC2 |
| DB | no internet, Fargate-only access | RDS |
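A subnet plan like the recommended structure can be sanity-checked before anything is created. The sketch below uses Python’s standard ipaddress module; the 10.0.0.0/16 VPC range and the subnet names are assumptions made for illustration, since the chapter lists only the subnet CIDRs.

```python
import ipaddress

VPC = ipaddress.ip_network("10.0.0.0/16")  # assumed VPC range, not stated in the chapter

PLAN = {
    "public-a": "10.0.0.0/24",
    "public-c": "10.0.1.0/24",
    "private-a": "10.0.10.0/24",
    "private-c": "10.0.11.0/24",
    "db-a": "10.0.20.0/24",
    "db-c": "10.0.21.0/24",
}


def validate(plan: dict, vpc: ipaddress.IPv4Network) -> list:
    """Return a list of problems: subnets outside the VPC or overlapping each other."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in plan.items()}
    problems = []
    for name, net in nets.items():
        if not net.subnet_of(vpc):
            problems.append(f"{name} is outside {vpc}")
    names = sorted(nets)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if nets[a].overlaps(nets[b]):
                problems.append(f"{a} overlaps {b}")
    return problems


if __name__ == "__main__":
    print(validate(PLAN, VPC) or "plan looks consistent")
```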
In this chapter we bring it up quickly using only the default VPC’s public subnets (assigning a public IP to the Fargate task). We rebuild the production shape as code in Chapter 25 Terraform intro.
Two Security Groups #
sg-alb 80, 443 ← 0.0.0.0/0
(internet to the ALB)
sg-fargate 8000 ← sg-alb
(only the ALB to Fargate)

There’s one important pattern here. An SG rule can name another SG as its source. That means “only resources that have this SG attached,” not an IP range, so the rule keeps working even when the ALB’s IPs change.
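The difference between a CIDR source and an SG source can be made concrete with a toy model. This is only an illustration of the matching idea, not how AWS evaluates rules internally; IngressRule and allows are hypothetical names.

```python
import ipaddress
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class IngressRule:
    port: int
    cidr: Optional[str] = None       # e.g. "0.0.0.0/0"
    source_sg: Optional[str] = None  # e.g. "sg-alb"


def allows(rule: IngressRule, port: int, src_ip: str, src_sgs: FrozenSet[str]) -> bool:
    """True if traffic to `port` from a sender with `src_ip` and `src_sgs` matches the rule."""
    if port != rule.port:
        return False
    if rule.source_sg is not None:
        # SG reference: match on group membership, whatever the sender's IP is.
        return rule.source_sg in src_sgs
    if rule.cidr is not None:
        return ipaddress.ip_address(src_ip) in ipaddress.ip_network(rule.cidr)
    return False


# sg-fargate: 8000 from sg-alb
FARGATE_RULE = IngressRule(port=8000, source_sg="sg-alb")

if __name__ == "__main__":
    # The ALB's IP changed, but the rule still matches because the SG did not.
    print(allows(FARGATE_RULE, 8000, "10.0.0.17", frozenset({"sg-alb"})))
    print(allows(FARGATE_RULE, 8000, "10.0.1.99", frozenset({"sg-alb"})))
```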
2) Push the image to ECR #
We already covered this in Chapter 16 ECR, but here’s a quick recap.
aws ecr create-repository \
--repository-name blog-api \
--image-scanning-configuration scanOnPush=true \
--region ap-northeast-2

scanOnPush=true automatically scans for vulnerabilities on image push (we check the results in Chapter 26 monitoring).
Build and push #
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
REGION=ap-northeast-2
REPO=$ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/blog-api
# 1) login
aws ecr get-login-password --region $REGION | \
docker login --username AWS --password-stdin $REPO
# 2) build (linux/amd64 — Fargate standard architecture)
docker build --platform=linux/amd64 -t blog-api:v1 .
# 3) tag
docker tag blog-api:v1 $REPO:v1
docker tag blog-api:v1 $REPO:latest
# 4) push
docker push $REPO:v1
docker push $REPO:latest

On an Apple Silicon (M1/M2/M3) Mac, a plain docker build produces an arm64 image, which won’t run on Fargate (x86_64 standard). Always specify --platform=linux/amd64. Fargate supports ARM too, but it needs separate configuration.
Image tag strategy #
| Tag | Meaning |
|---|---|
| latest | newest; don’t use it in production (no rollback possible) |
| v1, v2, ... | human-readable version |
| <git-sha> | traceable; auto-published by CI (Chapter 24) |
| <git-sha>-prod | per-environment alias |
latest is for developer convenience. Always pin the production Task Definition to a git SHA or semver. That way there is never any doubt about which code is running.
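The pinning rule is easy to enforce in a deploy script. Here is a minimal sketch, assuming tags are either a short or full git SHA or a v-style version; image_uri and both regular expressions are hypothetical helpers, not part of any AWS tooling.

```python
import re

SHA_RE = re.compile(r"^[0-9a-f]{7,40}$")         # short or full git SHA
VERSION_RE = re.compile(r"^v?\d+(\.\d+){0,2}$")  # v1, 1.2, v1.2.3


def image_uri(account_id: str, region: str, repo: str, tag: str) -> str:
    """Build an ECR image URI, refusing mutable tags such as `latest`."""
    if not (SHA_RE.match(tag) or VERSION_RE.match(tag)):
        raise ValueError(f"tag {tag!r} is not a git SHA or version; refusing to deploy it")
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repo}:{tag}"


if __name__ == "__main__":
    print(image_uri("123456789012", "ap-northeast-2", "blog-api", "v1"))
```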
3) Task Definition — the container’s personal record #
The Task Definition is ECS’s most important component. It bundles the image, CPU/memory, environment variables, ports, and log settings into one JSON document.
{
"family": "blog-api",
"networkMode": "awsvpc",
"requiresCompatibilities": ["FARGATE"],
"cpu": "512",
"memory": "1024",
"executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
"taskRoleArn": "arn:aws:iam::123456789012:role/blog-api-task-role",
"containerDefinitions": [
{
"name": "api",
"image": "123456789012.dkr.ecr.ap-northeast-2.amazonaws.com/blog-api:v1",
"portMappings": [
{ "containerPort": 8000, "protocol": "tcp" }
],
"essential": true,
"environment": [
{ "name": "ENVIRONMENT", "value": "production" },
{ "name": "LOG_LEVEL", "value": "info" }
],
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "/ecs/blog-api",
"awslogs-region": "ap-northeast-2",
"awslogs-stream-prefix": "api",
"awslogs-create-group": "true"
}
},
"healthCheck": {
"command": ["CMD-SHELL", "curl -fsS http://127.0.0.1:8000/health || exit 1"],
"interval": 10,
"timeout": 3,
"retries": 3,
"startPeriod": 30
}
}
]
}

The key items are as follows.
| Key | Meaning |
|---|---|
| cpu / memory | Fargate allows only fixed combinations (e.g., 256/512, 512/1024, 1024/2048) |
| executionRoleArn | the role the ECS agent uses for ECR pull / Logs / Secrets access |
| taskRoleArn | the IAM role the container code uses; boto3 signs requests with its credentials |
| awslogs | logs go to CloudWatch automatically (Chapter 26) |
| healthCheck | the container’s own health check (ECS ignores the Dockerfile’s HEALTHCHECK, so it is declared here separately) |
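Because Fargate rejects cpu/memory pairs outside its fixed combinations, a quick check before registering the Task Definition saves a failed API call. The table in this sketch reflects the combinations AWS documents as far as I am aware; treat it as an assumption and confirm against the current Fargate documentation. The function names are hypothetical.

```python
GIB = 1024  # MiB per GiB


def valid_memory_values(cpu: int) -> list:
    """Memory values (MiB) assumed valid for a given task-level cpu value."""
    table = {
        256: [512, 1024, 2048],
        512: list(range(1 * GIB, 4 * GIB + 1, GIB)),
        1024: list(range(2 * GIB, 8 * GIB + 1, GIB)),
        2048: list(range(4 * GIB, 16 * GIB + 1, GIB)),
        4096: list(range(8 * GIB, 30 * GIB + 1, GIB)),
        8192: list(range(16 * GIB, 60 * GIB + 1, 4 * GIB)),
        16384: list(range(32 * GIB, 120 * GIB + 1, 8 * GIB)),
    }
    return table.get(cpu, [])


def is_valid_fargate_size(cpu: str, memory: str) -> bool:
    """Task definitions carry cpu/memory as strings, e.g. "512" / "1024"."""
    return int(memory) in valid_memory_values(int(cpu))


if __name__ == "__main__":
    print(is_valid_fargate_size("512", "1024"))  # the pair used in this chapter
```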
The difference between the two IAM roles #
These two roles are often confused.
| | executionRoleArn | taskRoleArn |
|---|---|---|
| Who uses it | ECS agent (startup phase) | code inside the container (while running) |
| Permissions | ECR pull, CloudWatch write, Secrets read | S3 access, RDS, SQS, etc. — app logic |
Omitting executionRoleArn makes the image pull fail. Omitting taskRoleArn makes boto3 raise NoCredentialsError.
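Both roles are assumed by the same service principal, ecs-tasks.amazonaws.com; what differs is the permissions attached to each. The sketch below builds that trust policy as a Python dict and lists an illustrative split of permissions. The action lists are an assumption chosen to mirror the table above, not a complete or recommended policy.

```python
import json


def ecs_tasks_trust_policy() -> dict:
    """Trust policy that lets ECS tasks assume the role (same for both roles)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "ecs-tasks.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


# Illustrative split only: startup-time permissions vs. application permissions.
ROLE_PERMISSIONS = {
    "ecsTaskExecutionRole": [
        "ecr:GetAuthorizationToken",
        "ecr:BatchGetImage",
        "ecr:GetDownloadUrlForLayer",
        "logs:CreateLogStream",
        "logs:PutLogEvents",
        "secretsmanager:GetSecretValue",
    ],
    "blog-api-task-role": ["s3:GetObject", "sqs:SendMessage"],
}

if __name__ == "__main__":
    print(json.dumps(ecs_tasks_trust_policy(), indent=2))
```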
Registration #
aws ecs register-task-definition \
--cli-input-json file://task-definition.json \
--region ap-northeast-2

The revision number (blog-api:1, blog-api:2, …) goes up with each registration. Rollback is done with a previous revision number, covered in Chapter 24 CI/CD.
4) ALB + Target Group — the side that receives traffic #
This is exactly the ALB pattern built in Chapter 13 ALB / NLB and ACM. The key points are as follows.
ALB:443 (HTTPS, ACM certificate)
    │
    ▼
Listener: 443 → forward → tg-blog-api
    │
    ▼
Target Group: tg-blog-api
  - Protocol: HTTP / 8000
  - Target type: ip   ← Fargate is always ip
  - Health check: GET /health
  - Healthy threshold: 2
  - Interval: 15s

Target type must be ip. In awsvpc mode each Fargate task gets its own ENI and private IP, which changes with every new task, and there is no EC2 instance to register, so instance mode doesn’t work.
aws elbv2 create-target-group \
--name tg-blog-api \
--protocol HTTP --port 8000 \
--vpc-id $VPC_ID \
--target-type ip \
--health-check-path /health \
--healthy-threshold-count 2 \
--health-check-interval-seconds 15

Refer to Chapter 13 for the ALB Listener rules. The setup is HTTPS 443 → forward → tg-blog-api, HTTP 80 → 443 redirect.
5) ECS Service — the manager that keeps the containers alive #
If a Task Definition is an employee’s job description, the Service is the manager that keeps that employee working at all times. It always runs as many tasks as the desired count, recreates them when they die, and replaces them gradually during deployment.
aws ecs create-cluster --cluster-name blog-cluster

aws ecs create-service \
--cluster blog-cluster \
--service-name blog-api \
--task-definition blog-api:1 \
--desired-count 2 \
--launch-type FARGATE \
--network-configuration "awsvpcConfiguration={
subnets=[subnet-aaa, subnet-bbb],
securityGroups=[sg-fargate],
assignPublicIp=ENABLED
}" \
--load-balancers "targetGroupArn=$TG_ARN,containerName=api,containerPort=8000" \
--health-check-grace-period-seconds 60 \
--deployment-configuration "deploymentCircuitBreaker={enable=true,rollback=true},maximumPercent=200,minimumHealthyPercent=100"

The key options are as follows.
| Option | Meaning |
|---|---|
| desired-count 2 | at least 2; spread across two AZs, the service withstands a single-AZ failure |
| assignPublicIp=ENABLED | used when there’s no private subnet + NAT (simple setup). NAT recommended for production |
| health-check-grace-period | period after task launch during which the Service ignores failing ALB health checks (covers app boot time) |
| deploymentCircuitBreaker | auto-rollback if a new deployment fails N times in a row (detailed in Chapter 24) |
| maximumPercent=200 | max task count during deployment (200% = old + new at once) |
| minimumHealthyPercent=100 | min healthy ratio during deployment (100% = zero downtime) |
These two % values determine the shape of the rolling update.
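The arithmetic behind the two percentages is short enough to write down. This sketch assumes the rounding the ECS documentation describes (the lower bound rounds up, the upper bound rounds down); rolling_update_bounds is a hypothetical helper name.

```python
def rolling_update_bounds(desired: int, minimum_healthy_percent: int, maximum_percent: int):
    """Return (min_running, max_running) task counts allowed during a deployment."""
    # Lower bound rounds up, upper bound rounds down; integer math avoids float error.
    min_running = (desired * minimum_healthy_percent + 99) // 100
    max_running = desired * maximum_percent // 100
    return min_running, max_running


if __name__ == "__main__":
    # This chapter's settings: desired 2, 100% / 200%.
    print(rolling_update_bounds(2, 100, 200))
```

With this chapter’s settings the Service never drops below 2 running tasks and may run up to 4, which is what lets old and new tasks overlap during a deployment.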
Auto scaling #
Just because the Service is up doesn’t mean auto scaling turns on. Configure it separately.
aws application-autoscaling register-scalable-target \
--service-namespace ecs \
--resource-id service/blog-cluster/blog-api \
--scalable-dimension ecs:service:DesiredCount \
--min-capacity 2 --max-capacity 10

aws application-autoscaling put-scaling-policy \
--service-namespace ecs \
--resource-id service/blog-cluster/blog-api \
--scalable-dimension ecs:service:DesiredCount \
--policy-name cpu-target \
--policy-type TargetTrackingScaling \
--target-tracking-scaling-policy-configuration '{
"TargetValue": 60.0,
"PredefinedMetricSpecification": {"PredefinedMetricType": "ECSServiceAverageCPUUtilization"},
"ScaleOutCooldown": 30,
"ScaleInCooldown": 120
}'

This scales out and in to hold average CPU around 60%. In production, start conservative (40 ~ 60%) and adjust while watching the traffic pattern.
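To build intuition for what a 60% target means, a proportional estimate is a reasonable first approximation: scale the task count by the ratio of observed to target CPU, then clamp to the registered min and max. This is a simplified mental model, not the actual algorithm; the real service works through CloudWatch alarms and cooldowns and scales in more cautiously. estimate_desired_count is a hypothetical name.

```python
import math


def estimate_desired_count(
    current: int,
    cpu_percent: float,
    target: float = 60.0,
    min_capacity: int = 2,
    max_capacity: int = 10,
) -> int:
    """Rough proportional estimate of the task count that brings CPU back to target."""
    proposed = math.ceil(current * cpu_percent / target)
    return max(min_capacity, min(max_capacity, proposed))


if __name__ == "__main__":
    for tasks, cpu in [(2, 90.0), (4, 30.0), (8, 120.0)]:
        print(tasks, cpu, "->", estimate_desired_count(tasks, cpu))
```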
Terraform companion — the same skeleton as code #
Above, we built everything with the console and CLI to understand the flow. But Part 4’s promise is to keep all infrastructure as code (Chapter 25 Terraform intro). Moving the same SG, Target Group, Task Definition, and Service to Terraform looks like this. (We assume the VPC, ALB, and ACM reuse the modules from Chapter 25 and Chapter 13.)
resource "aws_security_group" "alb" {
  name_prefix = "blog-alb-"
  vpc_id      = var.vpc_id
  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
resource "aws_security_group" "fargate" {
  name_prefix = "blog-fargate-"
  vpc_id      = var.vpc_id
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
# allow only 8000 coming from the ALB SG (an SG reference, not an IP)
resource "aws_security_group_rule" "fargate_from_alb" {
  type                     = "ingress"
  security_group_id        = aws_security_group.fargate.id
  source_security_group_id = aws_security_group.alb.id
  from_port                = 8000
  to_port                  = 8000
  protocol                 = "tcp"
}
resource "aws_lb_target_group" "api" {
  name        = "tg-blog-api"
  port        = 8000
  protocol    = "HTTP"
  vpc_id      = var.vpc_id
  target_type = "ip" # Fargate must be ip
  health_check {
    path              = "/health"
    healthy_threshold = 2
    interval          = 15
  }
}
resource "aws_ecs_task_definition" "api" {
  family                   = "blog-api"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = "512"
  memory                   = "1024"
  execution_role_arn       = aws_iam_role.ecs_exec.arn # ECR pull, Logs, Secrets
  task_role_arn            = aws_iam_role.app.arn      # for the container code
  container_definitions = jsonencode([{
    name         = "api"
    image        = "${aws_ecr_repository.api.repository_url}:${var.image_tag}"
    portMappings = [{ containerPort = 8000 }]
    essential    = true
    logConfiguration = {
      logDriver = "awslogs"
      options = {
        "awslogs-group"         = "/ecs/blog-api"
        "awslogs-region"        = "ap-northeast-2"
        "awslogs-stream-prefix" = "api"
      }
    }
  }])
}
resource "aws_ecs_service" "api" {
  name                              = "blog-api"
  cluster                           = aws_ecs_cluster.main.id
  task_definition                   = aws_ecs_task_definition.api.arn
  desired_count                     = 2
  launch_type                       = "FARGATE"
  health_check_grace_period_seconds = 60
  network_configuration {
    subnets         = var.private_subnet_ids
    security_groups = [aws_security_group.fargate.id]
  }
  load_balancer {
    target_group_arn = aws_lb_target_group.api.arn
    container_name   = "api"
    container_port   = 8000
  }
  # HCL allows only one argument in a single-line block, so this block is written out
  deployment_circuit_breaker {
    enable   = true
    rollback = true
  }
}

Pulling image_tag out as a variable lets you inject a git SHA to deploy in Chapter 24 CI/CD. This code carries straight over into the ecs-api.tf of the Part 6 capstone. The CLI commands in this chapter are for seeing “what gets created” with your own eyes; in production, keep this Terraform as the source of truth.
6) Verifying the first deployment #
Wait until the Service reaches a stable state #
aws ecs wait services-stable \
--cluster blog-cluster \
--services blog-api

Check the health check directly #
ALB_DNS=$(aws elbv2 describe-load-balancers \
--names blog-alb \
--query 'LoadBalancers[0].DNSName' --output text)
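# -k skips certificate verification: the ACM certificate covers blog.example.com, not the ALB's DNS name
curl -ik https://$ALB_DNS/health
# HTTP/2 200
# {"status": "ok"}
# once the Route 53 record exists: curl -i https://blog.example.com/health

Check the logs #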
aws logs tail /ecs/blog-api --follow --since 5m

Send one request, and when an access log shows up in the logs, you’ve reached the first destination of Part 4 of this book.
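When this verification runs from a script rather than by hand, a small polling loop is usually enough. Here is a minimal sketch using only the standard library; wait_until_healthy and http_status are hypothetical helpers, and the attempt count and delay are arbitrary defaults.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str) -> int:
    """Return the HTTP status code for a GET request, including 4xx/5xx codes."""
    try:
        with urllib.request.urlopen(url, timeout=3) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code


def wait_until_healthy(
    fetch_status: Callable[[], int],
    attempts: int = 10,
    delay: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until fetch_status() returns 200 or the attempts run out."""
    for attempt in range(1, attempts + 1):
        try:
            if fetch_status() == 200:
                return True
        except OSError:
            pass  # connection refused or DNS not ready yet: retry
        if attempt < attempts:
            sleep(delay)
    return False


if __name__ == "__main__":
    url = "https://blog.example.com/health"  # the chapter's example domain
    print(wait_until_healthy(lambda: http_status(url)))
```

Passing the fetch function in as an argument keeps the loop testable without a network.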
Pitfalls — 5 reasons the first deployment won’t come up #
1) Endlessly restarting in STOPPED state #
In the ECS console’s Tasks tab, click a STOPPED row to check the “Stopped reason.” Common causes are as follows.
| Message | Cause |
|---|---|
| CannotPullContainerError | missing ECR permission → executionRole |
| ResourceInitializationError: ... secret manager | Secrets ARN typo / permission |
| Essential container ... exited | the container itself died → CloudWatch logs |
| Task failed ELB health checks | ALB can’t judge it healthy → next item |
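The table above translates directly into a small triage helper for scripts that read stopped reasons. This is a sketch under stated assumptions: the substrings come from the table, the hints paraphrase its right-hand column, and the sample messages in the usage lines are made up for illustration rather than copied from real ECS output.

```python
HINTS = [
    ("CannotPullContainerError", "check ECR permissions on the execution role"),
    ("ResourceInitializationError", "check the Secrets ARN and the permission to read it"),
    ("Essential container", "the container itself exited; read its CloudWatch logs"),
    ("failed ELB health checks", "the ALB never judged the task healthy; walk the health check checklist"),
]


def triage(stopped_reason: str) -> str:
    """Map a task's stopped reason to the first matching hint."""
    lowered = stopped_reason.lower()
    for needle, hint in HINTS:
        if needle.lower() in lowered:
            return hint
    return "unrecognized; read the full stopped reason and the container logs"


if __name__ == "__main__":
    # Hypothetical sample messages.
    print(triage("CannotPullContainerError: pull access denied"))
    print(triage("Task failed ELB health checks in (target-group ...)"))
```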
2) ALB health check failure #
This is the most common failure. The checkpoints are as follows.
- Does the container port (8000) match the Target Group port (8000)?
- Does the /health endpoint really return 200 (no DB dependency)?
- Is health-check-grace-period longer than the app boot time (FastAPI 5s, Django 20 ~ 40s)?
- Does the Fargate Security Group’s inbound rule allow port 8000 from the ALB SG?
- Can the ALB route all the way to that task’s subnet (same VPC)?
3) The ENI limit of the awsvpc networkMode #
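Each Fargate task takes one ENI (Elastic Network Interface), and with it one private IP from the subnet. If the subnet runs out of free IPs, you can’t launch new tasks. Don’t size the CIDR too tight: the /24 in the example above holds 256 addresses, of which 251 are usable because AWS reserves 5 in every subnet.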
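The headroom a subnet gives you is simple arithmetic. This sketch computes usable addresses with the standard ipaddress module, assuming the 5 addresses AWS reserves in every subnet; max_tasks and its other_enis parameter (ENIs already taken by the ALB, endpoints, and so on) are hypothetical.

```python
import ipaddress

AWS_RESERVED_PER_SUBNET = 5  # first four addresses and the last one


def usable_ips(cidr: str) -> int:
    """Addresses in the subnet that can actually be assigned to ENIs."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET


def max_tasks(cidr: str, other_enis: int = 0) -> int:
    """Upper bound on Fargate tasks in the subnet, one ENI (one IP) per task."""
    return max(0, usable_ips(cidr) - other_enis)


if __name__ == "__main__":
    for cidr in ("10.0.10.0/24", "10.0.10.0/28"):
        print(cidr, usable_ips(cidr))
```

Rolling deployments need extra room too: with maximumPercent=200, old and new tasks hold IPs at the same time.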
4) ECR pull failure without a public IP #
If you launch a task in a private subnet but have neither a NAT Gateway nor VPC Endpoints, traffic to ECR / Secrets Manager / CloudWatch is blocked and startup fails. There are three fixes.
- Add a NAT Gateway (roughly $0.045/hour plus data processing charges; the rate varies by region)
- Add Interface VPC Endpoints for ECR / Logs / Secrets, plus an S3 Gateway Endpoint because ECR serves image layers from S3 (can be cheaper than NAT, depending on endpoint count and traffic)
- Public subnet + assignPublicIp=ENABLED (for learning)
5) Stuck deployment — new tasks never become healthy #
With deploymentCircuitBreaker on, ECS rolls back automatically once enough new tasks have failed to start or to pass health checks. With it off, the deployment stays IN_PROGRESS indefinitely while ECS keeps retrying. Check the deployments array with aws ecs describe-services.
Exercises #
- For this chapter’s Task Definition, write one sentence each on who uses executionRoleArn and who uses taskRoleArn, and match each to the symptom that appears when it is missing, grounded in §“The difference between the two IAM roles.” It also helps to think ahead about why Chapter 24 CI/CD needs the iam:PassRole permission.
- Explain in one paragraph why a Fargate Target Group must be target-type ip. Also write out, without looking, the 5 checkpoints to check when the ALB health check fails (§“ALB health check failure”).
- In this chapter we brought it up quickly with the default VPC’s public subnets. Organize how this differs from the recommended production structure (the three subnet kinds: public / private / DB) grounded in §“Recommended structure,” and note what changes would be needed when moving that production structure to code in Chapter 25 Terraform intro.
In short: An ECS Fargate first deployment is the flow of setting up the network with VPC subnets and two SGs, pushing the image to ECR and bundling it into a Task Definition, receiving traffic with an ip-type ALB Target Group, and having the Service maintain the desired count. executionRole and taskRole serve different purposes, and most first-deployment failures come down to ALB health check failures and missing IAM permissions.
Next chapter #
Traffic has started coming in behind the ALB, but our API still has no DB and lives only in memory. In the next Chapter 23 RDS integration and migration operations, we’ll bring up RDS Postgres Multi-AZ inside the VPC, inject the password with Secrets Manager, and organize the operational patterns of Alembic / Django migrations along with the blue/green migration pattern that doesn’t kill production traffic.