AWS in Practice #1: Deploying FastAPI/Django to ECS Fargate

Saturday, May 2, 2026

In Basics (7 posts) we covered account / region / IAM / cost / CLI / security / logs; in Intermediate (7 posts), EC2 / VPC / S3 / RDS / DNS / ALB / CloudFront; in Advanced (7 posts), ECS / Lambda / messaging / Secrets / Step Functions, one piece at a time. With a 21-post toolbox assembled, it’s time to put a real backend together as a single end-to-end project.

This series takes the blog API (Post + Comment + User) built in FastAPI in Practice or the Django DRF series as its domain, and brings it to a production-ready shape over 6 posts.

The big picture #

The infrastructure we’ll build in this post:

The blog API's place

                      Internet
                          │
                          ▼
                  ┌──────────────┐
                  │  Route 53    │   blog.example.com
                  └──────┬───────┘
                         │
                         ▼
                ┌────────────────┐
                │      ALB       │   :443 → :8000
                │   (HTTPS, ACM) │
                └────────┬───────┘
                         │
              ┌──────────┴──────────┐
              ▼                     ▼
        ┌───────────┐         ┌───────────┐
        │  AZ-a     │         │   AZ-c    │
        │ Fargate   │         │  Fargate  │
        │  Task #1  │         │  Task #2  │
        │  (Blog)   │         │  (Blog)   │
        └─────┬─────┘         └─────┬─────┘
              │                     │
              └──────────┬──────────┘
                         ▼
                  ┌──────────────┐
                  │  RDS Postgres│   (Multi-AZ, Private)
                  └──────────────┘

Component by component:

Component	Role	Source
Route 53	Domain → ALB	Intermediate #5
ALB	TLS termination, routing, health checks	Intermediate #6
ACM	TLS certificate issuance/renewal	Intermediate #6
ECR	Image storage	Advanced #2
ECS Fargate	Container execution (serverless)	Advanced #1
RDS	DB	Intermediate #4, this series #2
VPC + Subnet	Network separation	Intermediate #1
Secrets Manager	DB password	Advanced #6, this series #2

This post sets up everything except the DB at once. RDS gets its own treatment in #2.

The domain — a one-line summary of the blog API container #

The container this series assumes is the artifact from FastAPI in Practice #6 or DRF #6 — shaped like this:

Dockerfile (FastAPI)

FROM python:3.14-slim AS base
WORKDIR /app

ENV PYTHONUNBUFFERED=1 \
    PIP_NO_CACHE_DIR=1

RUN apt-get update && apt-get install -y --no-install-recommends \
    libpq5 curl && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install -r requirements.txt

COPY app /app/app

EXPOSE 8000
HEALTHCHECK --interval=10s --timeout=3s --retries=3 \
    CMD curl -fsS http://127.0.0.1:8000/health || exit 1

CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]

Three core promises:

Listens on port 8000
/health returns 200 (a lightweight check with no DB dependency)
/ready returns 200 if the DB connection is OK, 503 otherwise; it is the readiness signal for traffic-routing decisions (the ALB and ECS health checks configured in this post use /health)

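For Django, deliver the same promises with gunicorn -w 4 -b 0.0.0.0:8000 myproject.wsgi. The explicit bind matters: gunicorn listens on 127.0.0.1 by default, which the ALB cannot reach from outside the container.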
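The split between /health and /ready can be sketched without committing to a framework. This is a minimal illustration, not the handlers from the earlier series: `health_response`, `ready_response`, and the `db_ping` callable are hypothetical names, and in FastAPI or Django they would simply back the two routes.

```python
from typing import Callable, Tuple


def health_response() -> Tuple[int, dict]:
    """Liveness: never touches the DB, so it stays cheap and always answers."""
    return 200, {"status": "ok"}


def ready_response(db_ping: Callable[[], bool]) -> Tuple[int, dict]:
    """Readiness: 200 only when the (hypothetical) db_ping check succeeds."""
    try:
        ok = db_ping()
    except Exception:
        # A raised error (timeout, refused connection) counts as not ready.
        ok = False
    if ok:
        return 200, {"status": "ready"}
    return 503, {"status": "unavailable"}
```

Keeping the DB out of `health_response` is the point: a database outage should not make the platform kill otherwise healthy containers.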

1) VPC and subnets — the network skeleton #

ECS / RDS / ALB all live inside a VPC. Without one, you can’t bring up a single resource. Fortunately, new accounts come with a default VPC in every region that you can use to get started quickly. For production, building a dedicated VPC is recommended.

The recommended shape #

VPC 10.0.0.0/16

Public Subnet  (10.0.0.0/24,   AZ-a)  ← ALB, NAT GW
Public Subnet  (10.0.1.0/24,   AZ-c)  ← ALB, NAT GW
Private Subnet (10.0.10.0/24,  AZ-a)  ← Fargate Task
Private Subnet (10.0.11.0/24,  AZ-c)  ← Fargate Task
DB Subnet      (10.0.20.0/24,  AZ-a)  ← RDS
DB Subnet      (10.0.21.0/24,  AZ-c)  ← RDS

Three roles:

Subnet	Traffic direction	Who lives there
Public	Internet ↔	ALB, NAT Gateway
Private	No internet (only outbound via NAT)	Fargate, EC2
DB	No internet, only Fargate has access	RDS
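To double-check a layout like this before creating anything, the standard library's `ipaddress` module can carve the VPC range into /24 blocks. A small sketch; the names `public-a`, `private-c`, and so on are illustrative labels, not AWS resource names.

```python
import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")
blocks = list(vpc.subnets(new_prefix=24))  # 256 consecutive /24 blocks

# The index equals the third octet: 0-1 public, 10-11 private, 20-21 DB.
layout = {
    "public-a": blocks[0],
    "public-c": blocks[1],
    "private-a": blocks[10],
    "private-c": blocks[11],
    "db-a": blocks[20],
    "db-c": blocks[21],
}

for name, net in layout.items():
    print(f"{name:10} {net}")
```

Leaving gaps between the tiers (2 to 9, 12 to 19) keeps room to add subnets to a tier later without renumbering.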

In this post we’ll use just the public subnets of the default VPC to spin things up fast (assigning public IPs to Fargate tasks). The production shape comes as code in #4 Terraform.

Two security groups #

Two SG roles

sg-alb       80, 443 ← 0.0.0.0/0
             (Internet to ALB)

sg-fargate   8000   ← sg-alb
             (Only ALB to Fargate)

Important pattern: an SG can reference another SG as its source. This is not an IP range — it means “only resources that have this SG attached.” If the ALB’s IP changes, the rule follows automatically.
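In API terms, an SG-to-SG rule is an ingress permission that carries a group reference instead of a CIDR. The sketch below builds that structure in the shape boto3's `authorize_security_group_ingress` expects; the helper name `alb_only_ingress` and the `ALB_SG_ID` / `FARGATE_SG_ID` placeholders are assumptions for illustration.

```python
def alb_only_ingress(alb_sg_id: str, port: int = 8000) -> list:
    """Build the IpPermissions list that admits TCP `port` only from the ALB's SG."""
    return [
        {
            "IpProtocol": "tcp",
            "FromPort": port,
            "ToPort": port,
            # A group reference instead of IpRanges: no CIDR appears anywhere.
            "UserIdGroupPairs": [{"GroupId": alb_sg_id}],
        }
    ]


# With boto3 installed and credentials configured, it could be applied like this
# (ALB_SG_ID and FARGATE_SG_ID are placeholders for real sg-... IDs):
#   import boto3
#   ec2 = boto3.client("ec2", region_name="ap-northeast-2")
#   ec2.authorize_security_group_ingress(
#       GroupId=FARGATE_SG_ID,
#       IpPermissions=alb_only_ingress(ALB_SG_ID),
#   )
```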

2) Push the image to ECR #

We covered this in Advanced #2 ECR; here is a quick recap.

Create the ECR repository

aws ecr create-repository \
  --repository-name blog-api \
  --image-scanning-configuration scanOnPush=true \
  --region ap-northeast-2

scanOnPush=true — automatic vulnerability scan on image push (we’ll see results in #5 Monitoring).

Build and push #

Build → tag → push

ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
REGION=ap-northeast-2
REPO=$ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/blog-api

# 1) Login
aws ecr get-login-password --region $REGION | \
  docker login --username AWS --password-stdin $REPO

# 2) Build (linux/amd64 — Fargate's standard architecture)
docker build --platform=linux/amd64 -t blog-api:v1 .

# 3) Tag
docker tag blog-api:v1 $REPO:v1
docker tag blog-api:v1 $REPO:latest

# 4) Push
docker push $REPO:v1
docker push $REPO:latest

If you build on an Apple Silicon (M1/M2/M3) Mac with plain docker build, you get an arm64 image, which won’t run on Fargate’s default x86_64 platform. Always specify --platform=linux/amd64. Fargate also supports ARM, but the Task Definition then has to declare it (runtimePlatform with cpuArchitecture set to ARM64).

Image tag strategy #

Tag	Meaning
`latest`	Latest — don’t use in production (no rollback)
`v1, v2, ...`	Human-readable version
`<git-sha>`	Traceable — auto-issued by CI (#3)
`<git-sha>-prod`	Per-environment alias

latest is for developer convenience. Production Task Definitions always pin to a git SHA or semver — that way “what code is running” can be answered without any doubt.
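One way to enforce that rule is to refuse mutable tags at the point where the image URI is assembled. A minimal sketch: the function names and the tag pattern (a `v`-prefixed version or a 7 to 40 character hex SHA) are assumptions, and the pattern would need extending for aliases such as `<git-sha>-prod`.

```python
import re

# Accepts v1, v1.2, v1.2.3 or a short/full git SHA; rejects mutable tags like "latest".
PINNED_TAG = re.compile(r"^(v\d+(\.\d+){0,2}|[0-9a-f]{7,40})$")


def image_uri(account_id: str, region: str, repo: str, tag: str) -> str:
    """Assemble the ECR image URI in the same shape as $REPO:tag above."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repo}:{tag}"


def production_image_uri(account_id: str, region: str, repo: str, tag: str) -> str:
    """Same as image_uri, but only for tags that identify one exact build."""
    if not PINNED_TAG.match(tag):
        raise ValueError(f"tag {tag!r} is not pinned; use a version or git SHA")
    return image_uri(account_id, region, repo, tag)
```

A check like this fits naturally into the CI step that renders the Task Definition.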

3) Task Definition — the container’s “ID card” #

The most important piece of ECS. Image + CPU/memory + environment variables + ports + log configuration are all bundled into a single JSON document.

task-definition.json

{
  "family": "blog-api",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
  "taskRoleArn": "arn:aws:iam::123456789012:role/blog-api-task-role",
  "containerDefinitions": [
    {
      "name": "api",
      "image": "123456789012.dkr.ecr.ap-northeast-2.amazonaws.com/blog-api:v1",
      "portMappings": [
        { "containerPort": 8000, "protocol": "tcp" }
      ],
      "essential": true,
      "environment": [
        { "name": "ENVIRONMENT", "value": "production" },
        { "name": "LOG_LEVEL", "value": "info" }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/blog-api",
          "awslogs-region": "ap-northeast-2",
          "awslogs-stream-prefix": "api",
          "awslogs-create-group": "true"
        }
      },
      "healthCheck": {
        "command": ["CMD-SHELL", "curl -fsS http://127.0.0.1:8000/health || exit 1"],
        "interval": 10,
        "timeout": 3,
        "retries": 3,
        "startPeriod": 30
      }
    }
  ]
}

Key fields:

Key	Meaning
`cpu / memory`	Fargate allows only fixed combinations (e.g. 256/512, 512/1024, 1024/2048)
`executionRoleArn`	Role used by the ECS agent for ECR pull / Logs / Secrets access
`taskRoleArn`	IAM role used by the container code — boto3 signs with this
`awslogs`	Logs go automatically to CloudWatch (#5)
`healthCheck`	The container health check ECS actually evaluates; ECS does not monitor a HEALTHCHECK baked into the image, so it is declared again here
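Because an invalid cpu/memory pair is rejected at registration time, it can be handy to validate it beforehand. The table below reflects the Fargate task sizes as I understand them at the time of writing (the two largest CPU values are only available on newer Linux platform versions); treat it as a sketch and confirm against the current AWS documentation. The function names are hypothetical.

```python
GIB = 1024  # memory is expressed in MiB in the Task Definition


def valid_memory_values(cpu: int) -> list:
    """Return the memory sizes (MiB) Fargate accepts for a given CPU value."""
    table = {
        256: [512, 1024, 2048],
        512: list(range(1 * GIB, 4 * GIB + 1, GIB)),
        1024: list(range(2 * GIB, 8 * GIB + 1, GIB)),
        2048: list(range(4 * GIB, 16 * GIB + 1, GIB)),
        4096: list(range(8 * GIB, 30 * GIB + 1, GIB)),
        8192: list(range(16 * GIB, 60 * GIB + 1, 4 * GIB)),
        16384: list(range(32 * GIB, 120 * GIB + 1, 8 * GIB)),
    }
    return table.get(cpu, [])


def is_valid_fargate_size(cpu: int, memory: int) -> bool:
    """True when the cpu/memory pair is one of the fixed combinations."""
    return memory in valid_memory_values(cpu)
```

The task definition in this post uses 512 / 1024, the smallest memory size allowed for 0.5 vCPU.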

The two IAM roles get confused often #

	`executionRoleArn`	`taskRoleArn`
Who uses it	ECS agent (start phase)	Code inside the container (runtime)
Permissions	ECR pull, write to CloudWatch, read Secrets	S3 access, RDS, SQS — app logic

Missing executionRoleArn → image pull fails. Missing taskRoleArn → boto3 throws NoCredentialsError.
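Both roles are assumed by the same service principal; they differ only in the permissions attached afterwards. The sketch below spells that out as Python dicts. The trust policy and the managed policy ARN are standard AWS values; the S3 statement and the bucket name `example-blog-media` are hypothetical, standing in for whatever the app actually needs.

```python
import json

# Trust policy shared by executionRole and taskRole: ECS tasks may assume the role.
ECS_TASKS_TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "ecs-tasks.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# AWS-managed policy commonly attached to the execution role (ECR pull, log writes).
EXECUTION_ROLE_MANAGED_POLICY = (
    "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
)

# Hypothetical app permission for the task role: read objects from one bucket.
TASK_ROLE_EXAMPLE_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::example-blog-media/*",
        }
    ],
}

print(json.dumps(ECS_TASKS_TRUST_POLICY, indent=2))
```

Note that the managed policy does not cover reading secrets; if the execution role has to fetch values from Secrets Manager, that permission is added separately.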

Register #

aws ecs register-task-definition \
  --cli-input-json file://task-definition.json \
  --region ap-northeast-2

Each registration bumps the revision number (blog-api:1, blog-api:2, …). Rollback is to a previous revision, covered in #3.

4) ALB + Target Group — the entry point for traffic #

Same ALB pattern from Intermediate #6. The core:

ALB → Target Group → Fargate

ALB:443  (HTTPS, ACM certificate)
   │
   ▼
Listener: 443 → forward → tg-blog-api
   │
   ▼
Target Group: tg-blog-api
  - Protocol: HTTP / 8000
  - Target type: ip   ← Fargate is always ip
  - Health check: GET /health
  - Healthy threshold: 2
  - Interval: 15s

Target type must be ip — Fargate tasks run in awsvpc mode, each with its own ENI and private IP, and there is no EC2 instance to register, so instance mode doesn’t apply.

Create the Target Group

aws elbv2 create-target-group \
  --name tg-blog-api \
  --protocol HTTP --port 8000 \
  --vpc-id $VPC_ID \
  --target-type ip \
  --health-check-path /health \
  --healthy-threshold-count 2 \
  --health-check-interval-seconds 15

ALB Listener rules — see Intermediate #6. HTTPS 443 → forward → tg-blog-api, HTTP 80 → 443 redirect.

5) ECS Service — the container’s “company” #

If a Task Definition is the job description, the Service is the employer — it maintains the desired count of running tasks, replaces failed ones, and performs rolling deployments.

Create the ECS Cluster (one-time)

aws ecs create-cluster --cluster-name blog-cluster

Create the ECS Service

aws ecs create-service \
  --cluster blog-cluster \
  --service-name blog-api \
  --task-definition blog-api:1 \
  --desired-count 2 \
  --launch-type FARGATE \
  --network-configuration "awsvpcConfiguration={subnets=[subnet-aaa,subnet-bbb],securityGroups=[sg-fargate],assignPublicIp=ENABLED}" \
  --load-balancers "targetGroupArn=$TG_ARN,containerName=api,containerPort=8000" \
  --health-check-grace-period-seconds 60 \
  --deployment-configuration "deploymentCircuitBreaker={enable=true,rollback=true},maximumPercent=200,minimumHealthyPercent=100"

Key options:

Option	Meaning
`desired-count 2`	At least 2 — Multi-AZ deployment to survive one-AZ failure
`assignPublicIp=ENABLED`	When private subnet + NAT isn’t available (simple setup). Production should use NAT
`health-check-grace-period`	Seconds after a task starts during which ECS ignores failing ALB health checks, so the app has time to boot before it can be killed
`deploymentCircuitBreaker`	Auto-rollback if a new deployment fails N times in a row (covered in detail in #3)
`maximumPercent=200`	Max number of tasks during deployment (200% = old + new together)
`minimumHealthyPercent=100`	Min healthy ratio during deployment (100% = zero downtime)

These two percentages decide the rolling update shape.
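The arithmetic behind those two percentages is short enough to write down. A sketch under the assumption, taken from the ECS documentation as I recall it, that the lower bound is rounded up and the upper bound rounded down; `rolling_update_bounds` is a hypothetical helper name.

```python
def rolling_update_bounds(desired: int, min_healthy_pct: int, max_pct: int) -> tuple:
    """Return (min_healthy_tasks, max_total_tasks) during a rolling deployment."""
    # Lower bound is rounded up, upper bound is rounded down.
    min_healthy = -(-desired * min_healthy_pct // 100)  # integer ceiling
    max_total = desired * max_pct // 100                # integer floor
    return min_healthy, max_total


# Settings from the create-service call above: desired 2, 100% / 200%.
print(rolling_update_bounds(2, 100, 200))
```

With this post's settings the result is (2, 4): ECS may start two new tasks alongside the two old ones, and only drains the old pair once the new pair is healthy. Lowering minimumHealthyPercent trades that headroom for a cheaper, in-place style rollout.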

Auto scaling #

Auto scaling isn’t enabled just because the Service is running. First register the service as a scalable target:

aws application-autoscaling register-scalable-target \
  --service-namespace ecs \
  --resource-id service/blog-cluster/blog-api \
  --scalable-dimension ecs:service:DesiredCount \
  --min-capacity 2 --max-capacity 10

CPU-based policy

aws application-autoscaling put-scaling-policy \
  --service-namespace ecs \
  --resource-id service/blog-cluster/blog-api \
  --scalable-dimension ecs:service:DesiredCount \
  --policy-name cpu-target \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration '{
      "TargetValue": 60.0,
      "PredefinedMetricSpecification": {"PredefinedMetricType": "ECSServiceAverageCPUUtilization"},
      "ScaleOutCooldown": 30,
      "ScaleInCooldown": 120
    }'

This scales out and in to keep the average CPU around 60%. Start conservative in production (40–60%) and tune the target as you observe real traffic patterns.
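As a mental model, target tracking behaves roughly proportionally: if CPU is 1.5 times the target, expect about 1.5 times the tasks. The sketch below captures only that intuition. It is a simplification, not the algorithm Application Auto Scaling runs (which works through CloudWatch alarms and the cooldowns configured above), and `estimate_desired_count` is a hypothetical name.

```python
import math


def estimate_desired_count(current: int, cpu_now: float, target: float = 60.0,
                           min_cap: int = 2, max_cap: int = 10) -> int:
    """Proportional estimate of the task count that brings CPU back to target."""
    raw = math.ceil(current * cpu_now / target)
    # Clamp to the min/max capacity registered on the scalable target.
    return max(min_cap, min(max_cap, raw))
```

For example, 2 tasks averaging 90% CPU against a 60% target suggest 3 tasks, while a quiet service never drops below the registered minimum of 2.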

6) Verify the first deployment #

Wait for the Service to reach a stable state #

Wait for stability (5–10 min)

aws ecs wait services-stable \
  --cluster blog-cluster \
  --services blog-api

Check the health endpoint directly #

Hit ALB DNS directly

ALB_DNS=$(aws elbv2 describe-load-balancers \
  --names blog-alb \
  --query 'LoadBalancers[0].DNSName' --output text)

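# -k skips certificate verification: the ACM certificate covers blog.example.com, not the ALB's own DNS name
curl -ik https://$ALB_DNS/health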
# HTTP/2 200
# {"status": "ok"}
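The same check can be scripted as a polling loop, which is handy in a deploy script. A minimal sketch: `wait_until_healthy` and its `fetch_status` parameter are hypothetical, and the defaults (40 attempts, 15 seconds apart) are chosen to resemble how `aws ecs wait services-stable` polls, as far as I know.

```python
import time
from typing import Callable


def wait_until_healthy(fetch_status: Callable[[], int], attempts: int = 40,
                       delay: float = 15.0,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until fetch_status() returns 200, or give up after `attempts` tries."""
    for attempt in range(attempts):
        try:
            if fetch_status() == 200:
                return True
        except OSError:
            pass  # connection refused, DNS not ready, or an HTTP error from urllib
        if attempt < attempts - 1:
            sleep(delay)
    return False


# Possible usage with the standard library (the URL is the example domain from the diagram):
#   from urllib.request import urlopen
#   wait_until_healthy(lambda: urlopen("https://blog.example.com/health", timeout=3).status)
```

Passing `sleep` in as a parameter keeps the loop testable without waiting for real time to pass.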

Tail the logs #

CloudWatch Logs tail

aws logs tail /ecs/blog-api --follow --since 5m

Send a request; once the access log appears, you’ve reached the first checkpoint of this series.

Pitfalls — 5 reasons the first deployment fails #

1) Endlessly restarting in `STOPPED` state #

In the ECS console Tasks tab, click a STOPPED row → check “Stopped reason.” Common causes:

Message	Cause
`CannotPullContainerError`	Missing ECR permission → executionRole
`ResourceInitializationError: ... secret manager`	Wrong Secrets ARN / permissions
`Essential container ... exited`	Container itself died → CloudWatch logs
`Task failed ELB health checks`	ALB can’t mark it healthy → next item
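The table above translates directly into a lookup you can run against the stopped reason of a task. A small sketch; `diagnose` is a hypothetical helper and the hints simply restate the table.

```python
# (substring to look for, where to look next), in the order of the table above.
HINTS = [
    ("CannotPullContainerError", "Check ECR permissions on the executionRole."),
    ("ResourceInitializationError", "Check the Secrets ARN and its permissions."),
    ("Essential container", "The container itself died; read its CloudWatch logs."),
    ("Task failed ELB health checks", "The ALB never marked it healthy; check the health check setup."),
]


def diagnose(stopped_reason: str) -> str:
    """Map a task's stopped reason to the first matching hint."""
    for needle, hint in HINTS:
        if needle in stopped_reason:
            return hint
    return "No known pattern; start with the task's CloudWatch logs."
```

The reason text itself is available from the console, or from the `stoppedReason` field returned by `aws ecs describe-tasks`.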

2) ALB health check failing #

The most common one. Check points:

Does the container port (8000) match the Target Group port (8000)?
Does /health actually return 200 (no DB dependency)?
Is health-check-grace-period longer than the app’s boot time (FastAPI 5s, Django 20–40s)?
Does the Fargate Security Group’s inbound rule allow port 8000 from the ALB SG?
Can the ALB route to the task’s subnet (same VPC)?
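For the grace-period question in particular, a conservative rule of thumb is: boot time plus the time the ALB needs to collect its consecutive successful checks. That rule is my own heuristic rather than an AWS formula, and `grace_period_ok` is a hypothetical helper; the defaults are the values used in this post.

```python
def grace_period_ok(boot_seconds: int, interval: int = 15,
                    healthy_threshold: int = 2, grace: int = 60) -> bool:
    """Heuristic: grace period should cover boot time plus the healthy-check window."""
    needed = boot_seconds + interval * healthy_threshold
    return grace >= needed


print(grace_period_ok(5))    # FastAPI-like boot time
print(grace_period_ok(40))   # slow Django boot
```

Under this heuristic a 5-second boot fits comfortably in 60 seconds, while a 40-second boot needs about 70, so a slow-starting Django app would call for a longer grace period.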

3) `awsvpc` networkMode ENI limits #

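Fargate tasks consume one ENI (Elastic Network Interface) each, and every ENI takes a private IP from the task’s subnet. If the subnet runs out of IPs, new tasks can’t start. Don’t size the CIDR too tightly: the /24 in the example above has 256 addresses, of which 251 are usable because AWS reserves 5 per subnet.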
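The headroom is easy to compute up front. AWS reserves 5 addresses in every subnet, so the number of tasks (plus anything else with an ENI, such as ALB nodes or endpoints) that fit is a little below the raw CIDR size. A short sketch with a hypothetical `usable_ips` helper:

```python
import ipaddress

AWS_RESERVED_PER_SUBNET = 5  # first four addresses and the last one in each subnet


def usable_ips(cidr: str) -> int:
    """Addresses left for ENIs in a subnet after AWS's reserved ones."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET


for cidr in ("10.0.10.0/24", "10.0.10.0/26", "10.0.10.0/28"):
    print(cidr, usable_ips(cidr))
```

A /28 leaves only 11 addresses, which a rolling deployment at maximumPercent=200 plus a scale-out event can exhaust quickly.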

4) ECR pull fails without a public IP #

If a task starts in a private subnet without a NAT Gateway or a VPC Endpoint, traffic to ECR / Secrets Manager / CloudWatch is blocked, and startup fails.

Three fixes:

Add a NAT Gateway (~$0.045/hr + data transfer)
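Add Interface VPC Endpoints for ECR (ecr.api and ecr.dkr) / Logs / Secrets, plus an S3 gateway endpoint for the image layers (can be cheaper than NAT, depending on endpoint count and traffic)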
Public subnet + assignPublicIp=ENABLED (for learning)

5) Stuck deployment — new tasks never become healthy #

If deploymentCircuitBreaker is on, the deployment rolls back automatically once enough new tasks in a row fail to start or to become healthy. If off, the Service stays IN_PROGRESS forever. Use aws ecs describe-services to inspect the deployments array.
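The deployments array is easier to read once condensed to one line per deployment. The sketch below assumes the service object has been loaded as a dict (for example from `aws ecs describe-services --query 'services[0]' --output json`); `summarize_deployments` is a hypothetical helper and `sample_service` is made-up data in the response's shape.

```python
def summarize_deployments(service: dict) -> list:
    """One line per deployment: status, rollout state, task definition, running/desired."""
    lines = []
    for d in service.get("deployments", []):
        task_def = d.get("taskDefinition", "").rsplit("/", 1)[-1]
        lines.append(
            f"{d.get('status')} {d.get('rolloutState', 'n/a')} "
            f"{task_def} running {d.get('runningCount', 0)}/{d.get('desiredCount', 0)}"
        )
    return lines


# Made-up example of a stuck rollout: revision 2 never gets a running task.
sample_service = {
    "deployments": [
        {"status": "PRIMARY", "rolloutState": "IN_PROGRESS", "desiredCount": 2, "runningCount": 0,
         "taskDefinition": "arn:aws:ecs:ap-northeast-2:123456789012:task-definition/blog-api:2"},
        {"status": "ACTIVE", "rolloutState": "COMPLETED", "desiredCount": 2, "runningCount": 2,
         "taskDefinition": "arn:aws:ecs:ap-northeast-2:123456789012:task-definition/blog-api:1"},
    ]
}

for line in summarize_deployments(sample_service):
    print(line)
```

A PRIMARY deployment that sits at IN_PROGRESS with a running count of 0 while the ACTIVE one still carries all the tasks is the signature of the stuck case described here.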

Wrapping up #

What we covered in this post:

The big picture — Route 53 → ALB → Fargate (× 2 AZ) → RDS, the standard Multi-AZ production shape
VPC skeleton — the roles of public / private / db subnets, two SGs for ALB ↔ Fargate
ECR — --platform=linux/amd64 on build, tags by git SHA or semver, no latest in production
Task Definition — Fargate CPU/memory combos, splitting executionRole vs taskRole, automatic logging via awslogs
ALB Target Group — Fargate is target-type ip, health check on /health
ECS Service — desired count, deployment circuit breaker, maximum/minimum % shape the rolling update
Auto Scaling — application-autoscaling with CPU-based target tracking
Verification — services-stable wait, ALB DNS curl, CloudWatch Logs tail
Pitfalls — STOPPED root cause analysis / 5 reasons ALB health check fails / ENI IP shortage / NAT/Endpoint missing / stuck deployment

Next — RDS #

Traffic is now flowing through the ALB, but our API is still without a database — relying entirely on in-memory state.

In #2 RDS integration and migration operations we’ll bring up RDS Postgres Multi-AZ inside the VPC, inject the password through Secrets Manager, place Alembic / Django migrations into operations, and lay out a blue/green-compatible migration pattern that doesn’t kill production traffic.
