AWS in Practice #1: Deploying FastAPI/Django to ECS Fargate


In Basics (7 posts) we lined up account / region / IAM / cost / CLI / security / logs; in Intermediate (7 posts) we covered EC2 / VPC / S3 / RDS / DNS / ALB / CloudFront; in Advanced (7 posts) we placed ECS / Lambda / messaging / Secrets / Step Functions one slot at a time. With a 21-post toolbox assembled, it’s time to put a real backend together as a single end-to-end project.

This series takes the blog API (Post + Comment + User) built in FastAPI in Practice or the Django DRF series as the domain, and brings it to a production-ready shape over 6 posts.

The big picture

The infrastructure we’ll build in this post:

The blog API's place
                      Internet
                  ┌──────────────┐
                  │  Route 53    │   blog.example.com
                  └──────┬───────┘
                ┌────────────────┐
                │      ALB       │   :443 → :8000
                │   (HTTPS, ACM) │
                └────────┬───────┘
              ┌──────────┴──────────┐
              ▼                     ▼
        ┌───────────┐         ┌───────────┐
        │  AZ-a     │         │   AZ-c    │
        │ Fargate   │         │  Fargate  │
        │  Task #1  │         │  Task #2  │
        │  (Blog)   │         │  (Blog)   │
        └─────┬─────┘         └─────┬─────┘
              │                     │
              └──────────┬──────────┘
                  ┌──────────────┐
                  │  RDS Postgres│   (Multi-AZ, Private)
                  └──────────────┘

Component by component:

| Component | Role | Source |
|---|---|---|
| Route 53 | Domain → ALB | Intermediate #5 |
| ALB | TLS termination, routing, health checks | Intermediate #6 |
| ACM | TLS certificate issuance/renewal | Intermediate #6 |
| ECR | Image storage | Advanced #2 |
| ECS Fargate | Container execution (serverless) | Advanced #1 |
| RDS | DB | Intermediate #4, #2 |
| VPC + Subnet | Network separation | Intermediate #1 |
| Secrets Manager | DB password | Advanced #6, #2 |

This post sets up everything except the DB at once. RDS gets its own treatment in #2.

The domain: a one-line summary of the blog API container

The container this series assumes is the artifact from FastAPI in Practice #6 or DRF #6 — shaped like this:

Dockerfile (FastAPI)
FROM python:3.14-slim AS base
WORKDIR /app

ENV PYTHONUNBUFFERED=1 \
    PIP_NO_CACHE_DIR=1

RUN apt-get update && apt-get install -y --no-install-recommends \
    libpq5 curl && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install -r requirements.txt

COPY app /app/app

EXPOSE 8000
HEALTHCHECK --interval=10s --timeout=3s --retries=3 \
    CMD curl -fsS http://127.0.0.1:8000/health || exit 1

CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]

Three core promises:

  1. Listens on port 8000
  2. /health returns 200 (a lightweight check with no DB dependency)
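  3. /ready returns 200 if the DB connection is OK, 503 otherwise. It is the readiness signal for traffic-routing decisions once the database exists; in this post the ALB health check targets /health, because the DB only arrives in #2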

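For Django, deliver the same promises with gunicorn -w 4 -b 0.0.0.0:8000 myproject.wsgi. The explicit bind matters: gunicorn listens on 127.0.0.1:8000 by default, which the ALB cannot reach from outside the container.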
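The contract behind promises 2 and 3 can be sketched without any framework. The function names and the ping_db callable below are hypothetical stand-ins, not code from the FastAPI or DRF series; in a real app they would sit behind the /health and /ready routes.

```python
from typing import Callable, Tuple


def health() -> Tuple[int, dict]:
    """Liveness: answers 200 without touching the database."""
    return 200, {"status": "ok"}


def ready(ping_db: Callable[[], None]) -> Tuple[int, dict]:
    """Readiness: 200 only if the DB ping succeeds, 503 otherwise.

    ping_db is an assumed callable that raises on failure
    (for example, one that runs SELECT 1 through the app's DB driver).
    """
    try:
        ping_db()
    except Exception as exc:
        return 503, {"status": "unavailable", "detail": type(exc).__name__}
    return 200, {"status": "ready"}
```

Keeping the two apart means a database outage takes tasks out of rotation through /ready without making the lightweight /health check fail as well.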

1) VPC and subnets: the network skeleton

ECS / RDS / ALB all live inside a VPC. Without one, you can’t bring up a single resource. Fortunately, new accounts come with a default VPC in every region that you can use to get started quickly. For production, building a dedicated VPC is recommended.

The recommended shape

VPC 10.0.0.0/16
Public Subnet  (10.0.0.0/24,   AZ-a)  ← ALB, NAT GW
Public Subnet  (10.0.1.0/24,   AZ-c)  ← ALB, NAT GW
Private Subnet (10.0.10.0/24,  AZ-a)  ← Fargate Task
Private Subnet (10.0.11.0/24,  AZ-c)  ← Fargate Task
DB Subnet      (10.0.20.0/24,  AZ-a)  ← RDS
DB Subnet      (10.0.21.0/24,  AZ-c)  ← RDS
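A layout like this is easy to sanity-check before creating anything. The sketch below uses only Python’s standard ipaddress module; the subnet labels and the check_plan helper are illustrative names, not AWS identifiers.

```python
import ipaddress

VPC = ipaddress.ip_network("10.0.0.0/16")
SUBNETS = {
    "public-a": "10.0.0.0/24",
    "public-c": "10.0.1.0/24",
    "private-a": "10.0.10.0/24",
    "private-c": "10.0.11.0/24",
    "db-a": "10.0.20.0/24",
    "db-c": "10.0.21.0/24",
}


def check_plan(vpc, subnets):
    """Return a list of problems: subnets outside the VPC or overlapping each other."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in subnets.items()}
    problems = []
    for name, net in nets.items():
        if not net.subnet_of(vpc):
            problems.append(f"{name} ({net}) is outside {vpc}")
    names = sorted(nets)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if nets[first].overlaps(nets[second]):
                problems.append(f"{first} overlaps {second}")
    return problems


print(check_plan(VPC, SUBNETS))  # [] means the plan is consistent
```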

Three roles:

| Subnet | Traffic direction | Who lives there |
|---|---|---|
| Public | Internet ↔ | ALB, NAT Gateway |
| Private | No internet (only outbound via NAT) | Fargate, EC2 |
| DB | No internet, only Fargate has access | RDS |

In this post we’ll use just the public subnets of the default VPC to spin things up fast (assigning public IPs to Fargate tasks). The production shape comes as code in #4 Terraform.

Two security groups

Two SG roles
sg-alb       80, 443 ← 0.0.0.0/0
             (Internet to ALB)

sg-fargate   8000   ← sg-alb
             (Only ALB to Fargate)

Important pattern: an SG can reference another SG as its source. This is not an IP range — it means “only resources that have this SG attached.” If the ALB’s IP changes, the rule follows automatically.
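As a sketch of what an SG-to-SG rule looks like in code: the permission carries UserIdGroupPairs instead of IpRanges. The helper names are hypothetical, and ec2_client is assumed to be a boto3 EC2 client (boto3.client("ec2")); nothing here calls AWS until allow_alb is invoked with a real client.

```python
def alb_to_fargate_rule(alb_sg_id: str, port: int = 8000) -> dict:
    """Ingress permission whose source is a security group, not a CIDR range."""
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        # A group reference instead of "IpRanges": any resource that has
        # alb_sg_id attached is allowed, whatever its current IP is.
        "UserIdGroupPairs": [{"GroupId": alb_sg_id, "Description": "ALB to Fargate"}],
    }


def allow_alb(ec2_client, fargate_sg_id: str, alb_sg_id: str) -> None:
    """Attach the rule to the Fargate security group."""
    ec2_client.authorize_security_group_ingress(
        GroupId=fargate_sg_id,
        IpPermissions=[alb_to_fargate_rule(alb_sg_id)],
    )
```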

2) Push the image to ECR

We covered this in Advanced #2 ECR, but quickly again.

Create the ECR repository
aws ecr create-repository \
  --repository-name blog-api \
  --image-scanning-configuration scanOnPush=true \
  --region ap-northeast-2

scanOnPush=true — automatic vulnerability scan on image push (we’ll see results in #5 Monitoring).

Build and push

Build → tag → push
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
REGION=ap-northeast-2
REGISTRY="$ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com"
REPO="$REGISTRY/blog-api"

# 1) Login (authenticate against the registry host)
aws ecr get-login-password --region "$REGION" | \
  docker login --username AWS --password-stdin "$REGISTRY"

# 2) Build (linux/amd64 is Fargate's default architecture)
docker build --platform=linux/amd64 -t blog-api:v1 .

# 3) Tag
docker tag blog-api:v1 "$REPO:v1"
docker tag blog-api:v1 "$REPO:latest"

# 4) Push
docker push "$REPO:v1"
docker push "$REPO:latest"

If you build on an Apple Silicon (M1/M2/M3) Mac with plain docker build, you get an arm64 image, which won’t run on Fargate’s default x86_64 platform. Always specify --platform=linux/amd64. Fargate also supports ARM (Graviton), but the Task Definition must then declare runtimePlatform with cpuArchitecture set to ARM64.

Image tag strategy

| Tag | Meaning |
|---|---|
| latest | Latest; don’t use in production (no rollback) |
| v1, v2, ... | Human-readable version |
| <git-sha> | Traceable; auto-issued by CI (#3) |
| <git-sha>-prod | Per-environment alias |

latest is for developer convenience. Production Task Definitions always pin to a git SHA or semver — that way “what code is running” can be answered without any doubt.
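One way to enforce this rule is a small guard in the deploy script. This is a minimal sketch: the image_uri helper and its tag patterns (a 7 to 40 character hex SHA with an optional environment suffix, or a v-prefixed version) are assumptions for illustration, not a convention defined by ECR.

```python
import re

_GIT_SHA = re.compile(r"^[0-9a-f]{7,40}(-[a-z0-9]+)?$")  # e.g. 3f2a9c1 or 3f2a9c1-prod
_VERSION = re.compile(r"^v\d+(\.\d+){0,2}$")              # e.g. v1 or v1.4.2


def image_uri(registry: str, repo: str, tag: str, production: bool = True) -> str:
    """Build the full image URI, refusing mutable tags for production."""
    if production and not (_GIT_SHA.match(tag) or _VERSION.match(tag)):
        raise ValueError(f"tag {tag!r} is not pinned; use a git SHA or a version")
    return f"{registry}/{repo}:{tag}"


registry = "123456789012.dkr.ecr.ap-northeast-2.amazonaws.com"
print(image_uri(registry, "blog-api", "v1"))
```

Calling image_uri(registry, "blog-api", "latest") raises ValueError, while passing production=False lets latest through for local work.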

3) Task Definition: the container’s “ID card”

The most important piece of ECS: image, CPU/memory, environment variables, ports, and log configuration are all bundled into a single JSON document.

task-definition.json
{
  "family": "blog-api",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
  "taskRoleArn": "arn:aws:iam::123456789012:role/blog-api-task-role",
  "containerDefinitions": [
    {
      "name": "api",
      "image": "123456789012.dkr.ecr.ap-northeast-2.amazonaws.com/blog-api:v1",
      "portMappings": [
        { "containerPort": 8000, "protocol": "tcp" }
      ],
      "essential": true,
      "environment": [
        { "name": "ENVIRONMENT", "value": "production" },
        { "name": "LOG_LEVEL", "value": "info" }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/blog-api",
          "awslogs-region": "ap-northeast-2",
          "awslogs-stream-prefix": "api",
          "awslogs-create-group": "true"
        }
      },
      "healthCheck": {
        "command": ["CMD-SHELL", "curl -fsS http://127.0.0.1:8000/health || exit 1"],
        "interval": 10,
        "timeout": 3,
        "retries": 3,
        "startPeriod": 30
      }
    }
  ]
}

Key fields:

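| Key | Meaning |
|---|---|
| cpu / memory | Fargate allows only fixed combinations (e.g. 256/512, 512/1024, 1024/2048) |
| executionRoleArn | Role used by the ECS agent for ECR pull / Logs / Secrets access |
| taskRoleArn | IAM role used by the container code; boto3 signs requests with it |
| awslogs | Logs go automatically to CloudWatch (#5). awslogs-create-group=true additionally requires logs:CreateLogGroup on the execution role |
| healthCheck | Container-level health check evaluated by ECS. ECS does not act on the Dockerfile HEALTHCHECK, so it has to be declared here as well |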
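The cpu/memory constraint can be checked before registering. The table below is a sketch of the commonly documented Fargate sizes up to 4 vCPU; larger sizes (8 and 16 vCPU) exist and are left out here, and the helper names are made up for this example.

```python
def _gb_range(start_gb: int, end_gb: int) -> set:
    """All whole-GB memory sizes from start_gb to end_gb, expressed in MiB."""
    return {gb * 1024 for gb in range(start_gb, end_gb + 1)}


# CPU units -> allowed memory values in MiB (sizes above 4096 CPU units omitted)
FARGATE_MEMORY_MIB = {
    256: {512, 1024, 2048},
    512: _gb_range(1, 4),
    1024: _gb_range(2, 8),
    2048: _gb_range(4, 16),
    4096: _gb_range(8, 30),
}


def is_valid_fargate_size(cpu: str, memory: str) -> bool:
    """cpu and memory are strings, exactly as written in the task definition JSON."""
    return int(memory) in FARGATE_MEMORY_MIB.get(int(cpu), set())


print(is_valid_fargate_size("512", "1024"))  # True: the combination used above
```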

The two IAM roles get confused often

| | executionRoleArn | taskRoleArn |
|---|---|---|
| Who uses it | ECS agent (start phase) | Code inside the container (runtime) |
| Permissions | ECR pull, write to CloudWatch, read Secrets | S3 access, RDS, SQS (app logic) |

Missing executionRoleArn → image pull fails. Missing taskRoleArn → boto3 throws NoCredentialsError.

Register

Register the Task Definition
aws ecs register-task-definition \
  --cli-input-json file://task-definition.json \
  --region ap-northeast-2

Each registration bumps the revision number (blog-api:1, blog-api:2, …). Rollback is to a previous revision, covered in #3.

4) ALB + Target Group: the entry point for traffic

Same ALB pattern from Intermediate #6. The core:

ALB → Target Group → Fargate
ALB:443  (HTTPS, ACM certificate)
Listener: 443 → forward → tg-blog-api
Target Group: tg-blog-api
  - Protocol: HTTP / 8000
  - Target type: ip   ← Fargate is always ip
  - Health check: GET /health
  - Healthy threshold: 2
  - Interval: 15s

Target type must be ip. Fargate tasks run in awsvpc network mode, so each task gets its own ENI and private IP and there is no EC2 instance to register, which rules out instance mode.

Create the Target Group
aws elbv2 create-target-group \
  --name tg-blog-api \
  --protocol HTTP --port 8000 \
  --vpc-id $VPC_ID \
  --target-type ip \
  --health-check-path /health \
  --healthy-threshold-count 2 \
  --health-check-interval-seconds 15

ALB Listener rules — see Intermediate #6. HTTPS 443 → forward → tg-blog-api, HTTP 80 → 443 redirect.

5) ECS Service: the container’s “company”

If a Task Definition is the job description, the Service is the employer — it maintains the desired count of running tasks, replaces failed ones, and performs rolling deployments.

Create the ECS Cluster (one-time)
aws ecs create-cluster --cluster-name blog-cluster

Create the ECS Service
aws ecs create-service \
  --cluster blog-cluster \
  --service-name blog-api \
  --task-definition blog-api:1 \
  --desired-count 2 \
  --launch-type FARGATE \
  --network-configuration "awsvpcConfiguration={subnets=[subnet-aaa,subnet-bbb],securityGroups=[sg-fargate],assignPublicIp=ENABLED}" \
  --load-balancers "targetGroupArn=$TG_ARN,containerName=api,containerPort=8000" \
  --health-check-grace-period-seconds 60 \
  --deployment-configuration "deploymentCircuitBreaker={enable=true,rollback=true},maximumPercent=200,minimumHealthyPercent=100"

Key options:

| Option | Meaning |
|---|---|
| desired-count 2 | At least 2: tasks spread across AZs survive a single-AZ failure |
| assignPublicIp=ENABLED | For when private subnet + NAT isn’t available (simple setup). Production should use NAT |
| health-check-grace-period | Window after a task starts during which the Service ignores failing ALB health checks, giving the app time to boot |
| deploymentCircuitBreaker | Auto-rollback if a new deployment fails N times in a row (covered in detail in #3) |
| maximumPercent=200 | Max number of tasks during deployment (200% = old + new together) |
| minimumHealthyPercent=100 | Min healthy ratio during deployment (100% = zero downtime) |

These two percentages decide the rolling update shape.
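The arithmetic behind those two percentages is worth seeing once. This sketch assumes the rounding ECS documents for rolling updates (the lower bound rounds up, the upper bound rounds down); the function name is made up for the example.

```python
def rolling_update_bounds(desired: int, minimum_healthy_percent: int,
                          maximum_percent: int) -> tuple:
    """Return (min_running, max_running) task counts during a rolling update."""
    # Integer arithmetic avoids float rounding surprises.
    min_running = -(-desired * minimum_healthy_percent // 100)  # ceiling division
    max_running = desired * maximum_percent // 100              # floor division
    return min_running, max_running


print(rolling_update_bounds(2, 100, 200))  # (2, 4)
```

With the values from this post, ECS can start two new tasks next to the two old ones and only drains the old ones once the new ones are healthy.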

Auto scaling

Creating the Service does not turn auto scaling on. It has to be registered separately:

Register Auto Scaling target
aws application-autoscaling register-scalable-target \
  --service-namespace ecs \
  --resource-id service/blog-cluster/blog-api \
  --scalable-dimension ecs:service:DesiredCount \
  --min-capacity 2 --max-capacity 10

CPU-based policy
aws application-autoscaling put-scaling-policy \
  --service-namespace ecs \
  --resource-id service/blog-cluster/blog-api \
  --scalable-dimension ecs:service:DesiredCount \
  --policy-name cpu-target \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration '{
      "TargetValue": 60.0,
      "PredefinedMetricSpecification": {"PredefinedMetricType": "ECSServiceAverageCPUUtilization"},
      "ScaleOutCooldown": 30,
      "ScaleInCooldown": 120
    }'

This scales out and in to keep the average CPU around 60%. Start conservative in production (40–60%) and tune the target as you observe real traffic patterns.
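To build intuition for what a 60% target means, here is a simplified proportional model. It is not AWS’s actual algorithm, which is driven by CloudWatch alarms and cooldowns and is not published in this form; the function and its defaults are assumptions taken from the numbers in this post.

```python
import math


def estimate_desired(current_tasks: int, avg_cpu: float, target: float = 60.0,
                     min_cap: int = 2, max_cap: int = 10) -> int:
    """Rough estimate: scale the task count so average CPU lands near the target."""
    raw = math.ceil(current_tasks * avg_cpu / target)
    # Clamp to the min/max capacity registered on the scalable target.
    return max(min_cap, min(max_cap, raw))


print(estimate_desired(2, 90.0))  # 3
```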

6) Verify the first deployment

Wait for the Service to reach a stable state

Wait for stability (5–10 min)
aws ecs wait services-stable \
  --cluster blog-cluster \
  --services blog-api

Check the health endpoint directly

Hit ALB DNS directly
ALB_DNS=$(aws elbv2 describe-load-balancers \
  --names blog-alb \
  --query 'LoadBalancers[0].DNSName' --output text)

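# -k skips certificate hostname verification: the ACM certificate is issued
# for blog.example.com, not for the ALB's own *.elb.amazonaws.com name
curl -ik "https://$ALB_DNS/health"
# HTTP/2 200
# {"status": "ok"}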
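For a scripted check, the same probe can be wrapped in a retry loop. This sketch keeps the HTTP call injectable so it stays testable; wait_until_healthy and fetch_status are hypothetical names, not part of any AWS tooling.

```python
import time
from typing import Callable


def wait_until_healthy(fetch_status: Callable[[], int], attempts: int = 20,
                       delay: float = 3.0,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until fetch_status() returns 200, or give up after `attempts` tries."""
    for attempt in range(attempts):
        try:
            if fetch_status() == 200:
                return True
        except OSError:
            pass  # connection refused, DNS not ready, HTTP error raised by urllib
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

A real fetch_status could be lambda: urllib.request.urlopen("https://blog.example.com/health", timeout=3).status, with blog.example.com standing in for your own domain; urllib raises HTTPError (an OSError subclass) on 5xx responses, which the loop treats as "not yet".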

Tail the logs

CloudWatch Logs tail
aws logs tail /ecs/blog-api --follow --since 5m

Send a request and once the access log appears, you’ve reached the first checkpoint of this series.

Pitfalls: 5 reasons the first deployment fails

1) Tasks keep stopping and restarting (STOPPED loop)

In the ECS console Tasks tab, click a STOPPED row → check “Stopped reason.” Common causes:

| Message | Cause |
|---|---|
| CannotPullContainerError | Missing ECR permission → executionRole |
| ResourceInitializationError: ... secret manager | Wrong Secrets ARN / permissions |
| Essential container ... exited | Container itself died → CloudWatch logs |
| Task failed ELB health checks | ALB can’t mark it healthy → next item |
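The same lookup can be scripted against the stoppedReason field that aws ecs describe-tasks returns. The substrings below are taken from the table above and real messages vary in wording, so treat this as a first-pass triage sketch with made-up helper names.

```python
# (substring to look for, hint) pairs, mirroring the table above
HINTS = [
    ("cannotpullcontainererror", "missing ECR permission: check the executionRole"),
    ("resourceinitializationerror", "wrong Secrets ARN or missing permission on the executionRole"),
    ("essential container", "the container itself exited: read its CloudWatch logs"),
    ("elb health checks", "the ALB never marked the task healthy: check the target group"),
]


def diagnose(stopped_reason: str) -> str:
    """Map a task's stoppedReason to the most likely cause."""
    text = stopped_reason.lower()
    for needle, hint in HINTS:
        if needle in text:
            return hint
    return "unrecognized: read the full stoppedReason and the CloudWatch logs"


print(diagnose("Task failed ELB health checks in (target-group ...)"))
```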

2) ALB health check failing

The most common one. Check points:

  • Does the container port (8000) match the Target Group port (8000)?
  • Does /health actually return 200 (no DB dependency)?
  • Is health-check-grace-period longer than the app’s boot time (FastAPI 5s, Django 20–40s)?
  • Does the Fargate Security Group allow inbound traffic on port 8000 from the ALB SG (and from nothing else)?
  • Can the ALB route to the task’s subnet (same VPC)?
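The grace-period check can be turned into a rough calculation. The rule of thumb here is an assumption, not an AWS formula: the app must finish booting, then pass healthy_threshold consecutive checks spaced interval seconds apart, so the grace period should cover both.

```python
def grace_period_ok(grace_s: int, boot_s: int, interval_s: int = 15,
                    healthy_threshold: int = 2) -> tuple:
    """Return (is_enough, seconds_needed) for a health-check grace period."""
    needed = boot_s + interval_s * healthy_threshold
    return grace_s >= needed, needed


print(grace_period_ok(60, 5))   # (True, 35): a fast-booting app fits in 60s
print(grace_period_ok(60, 40))  # (False, 70): a 40s boot needs a longer grace period
```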

3) awsvpc networkMode ENI limits

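Each Fargate task consumes one ENI (Elastic Network Interface), and with it one private IP from its subnet. If the subnet runs out of free IPs, new tasks can’t start. Don’t size CIDRs too tightly: the /24 in the example above has 256 addresses, of which AWS reserves 5, leaving 251 usable.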
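The capacity math can be sketched with the standard ipaddress module. AWS reserves 5 addresses in every subnet; the helper names and the enis_in_use figure are illustrative.

```python
import ipaddress

AWS_RESERVED_PER_SUBNET = 5  # network, router, DNS, future use, broadcast


def usable_ips(cidr: str) -> int:
    """Addresses a subnet can actually hand out to ENIs."""
    total = ipaddress.ip_network(cidr).num_addresses
    return max(total - AWS_RESERVED_PER_SUBNET, 0)


def task_headroom(cidr: str, enis_in_use: int) -> int:
    """How many more one-ENI tasks fit, given ENIs already used by other resources."""
    return max(usable_ips(cidr) - enis_in_use, 0)


print(usable_ips("10.0.10.0/24"))  # 251
print(usable_ips("10.0.10.0/28"))  # 11
```

Remember that a rolling update with maximumPercent=200 briefly doubles the number of task ENIs, so headroom should be judged against the deployment peak, not the steady state.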

4) ECR pull fails without a public IP

If a task starts in a private subnet without a NAT Gateway or a VPC Endpoint, traffic to ECR / Secrets Manager / CloudWatch is blocked, and startup fails.

Three fixes:

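  1. Add a NAT Gateway (about $0.045/hr in us-east-1, varying by region, plus per-GB data processing)
  2. Add VPC Endpoints: Interface endpoints for ECR (ecr.api and ecr.dkr) / Logs / Secrets, plus an S3 Gateway endpoint for the image layers. Each interface endpoint is billed hourly per AZ, so compare the total against a NAT Gateway before assuming it is cheaper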
  3. Public subnet + assignPublicIp=ENABLED (for learning)

5) Stuck deployment: new tasks never become healthy

If deploymentCircuitBreaker is on, the deployment is rolled back automatically once enough new tasks have failed to start or stay healthy. If it is off, the Service keeps retrying and stays IN_PROGRESS indefinitely. Use aws ecs describe-services to inspect the deployments array.
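Reading the deployments array is easier with a small summary. This sketch works on the JSON that aws ecs describe-services prints (or the equivalent boto3 response); the looks_stuck heuristic is an assumption for illustration, not an official ECS signal.

```python
def summarize_deployments(describe_services_output: dict) -> list:
    """Flatten each deployment into the few fields that matter for triage."""
    rows = []
    for service in describe_services_output.get("services", []):
        for dep in service.get("deployments", []):
            rows.append({
                "status": dep.get("status"),         # PRIMARY / ACTIVE / INACTIVE
                "rollout": dep.get("rolloutState"),  # IN_PROGRESS / COMPLETED / FAILED
                "running": dep.get("runningCount", 0),
                "desired": dep.get("desiredCount", 0),
                "failed": dep.get("failedTasks", 0),
            })
    return rows


def looks_stuck(rows: list) -> bool:
    """Heuristic: the new (PRIMARY) deployment is still rolling out and has failed tasks."""
    return any(
        r["status"] == "PRIMARY" and r["rollout"] == "IN_PROGRESS" and r["failed"] > 0
        for r in rows
    )
```

Feeding it json.loads of the CLI output gives one row per deployment; an old ACTIVE deployment that never drains next to a PRIMARY one with a growing failed count is the typical stuck picture.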

Wrapping up

What we covered in this post:

  • The big picture: Route 53 → ALB → Fargate (× 2 AZ) → RDS, the standard Multi-AZ production shape
  • VPC skeleton: the roles of public / private / db subnets, two SGs for ALB ↔ Fargate
  • ECR: --platform=linux/amd64 on build, tags by git SHA or semver, no latest in production
  • Task Definition: Fargate CPU/memory combos, splitting executionRole vs taskRole, automatic logging via awslogs
  • ALB Target Group: Fargate is target-type ip, health check on /health
  • ECS Service: desired count, deployment circuit breaker, maximum/minimum % shape the rolling update
  • Auto Scaling: application-autoscaling for CPU-based target tracking
  • Verification: services-stable wait, ALB DNS curl, CloudWatch Logs tail
  • Pitfalls: STOPPED root cause analysis / 5 ALB health check points / ENI IP shortage / NAT/Endpoint missing / stuck deployment

Next: RDS

Traffic is now flowing through the ALB, but our API is still without a database — relying entirely on in-memory state.

In #2 RDS integration and migration operations we’ll bring up RDS Postgres Multi-AZ inside the VPC, inject the password through Secrets Manager, place Alembic / Django migrations into operations, and lay out a blue/green-compatible migration pattern that doesn’t kill production traffic.
