AWS in Practice #1: Deploying FastAPI/Django to ECS Fargate
In Basics (7 posts) we lined up account / region / IAM / cost / CLI / security / logs; in Intermediate (7 posts) we covered EC2 / VPC / S3 / RDS / DNS / ALB / CloudFront; in Advanced (7 posts) we placed ECS / Lambda / messaging / Secrets / Step Functions one slot at a time. With a toolbox of 21 posts assembled, it’s time to put a real backend together as a single end-to-end project.
This series takes the blog API (Post + Comment + User) built in FastAPI in Practice or the Django DRF series as its domain and brings it to a production-ready shape over 6 posts.
The big picture #
The infrastructure we’ll build in this post:
               Internet
                   │
                   ▼
            ┌──────────────┐
            │   Route 53   │  blog.example.com
            └──────┬───────┘
                   │
                   ▼
          ┌────────────────┐
          │      ALB       │  :443 → :8000
          │  (HTTPS, ACM)  │
          └────────┬───────┘
                   │
        ┌──────────┴──────────┐
        ▼                     ▼
  ┌───────────┐         ┌───────────┐
  │   AZ-a    │         │   AZ-c    │
  │  Fargate  │         │  Fargate  │
  │  Task #1  │         │  Task #2  │
  │  (Blog)   │         │  (Blog)   │
  └─────┬─────┘         └─────┬─────┘
        │                     │
        └──────────┬──────────┘
                   ▼
            ┌──────────────┐
            │ RDS Postgres │  (Multi-AZ, Private)
            └──────────────┘

Component by component:
| Component | Role | Source |
|---|---|---|
| Route 53 | Domain → ALB | Intermediate #5 |
| ALB | TLS termination, routing, health checks | Intermediate #6 |
| ACM | TLS certificate issuance/renewal | Intermediate #6 |
| ECR | Image storage | Advanced #2 |
| ECS Fargate | Container execution (serverless) | Advanced #1 |
| RDS | DB | Intermediate #4, #2 |
| VPC + Subnet | Network separation | Intermediate #1 |
| Secrets Manager | DB password | Advanced #6, #2 |
This post sets up everything except the DB at once. RDS gets its own treatment in #2.
The domain — a one-line summary of the blog API container #
The container this series assumes is the artifact from FastAPI in Practice #6 or DRF #6 — shaped like this:
FROM python:3.14-slim AS base
WORKDIR /app
ENV PYTHONUNBUFFERED=1 \
PIP_NO_CACHE_DIR=1
RUN apt-get update && apt-get install -y --no-install-recommends \
libpq5 curl && rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY app /app/app
EXPOSE 8000
HEALTHCHECK --interval=10s --timeout=3s --retries=3 \
CMD curl -fsS http://127.0.0.1:8000/health || exit 1
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]

Three core promises:
- Listens on port 8000
- /health returns 200 (a lightweight check with no DB dependency)
- /ready returns 200 if the DB connection is OK, 503 otherwise; the ALB / ECS uses this to decide traffic routing
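For Django, just deliver the same promises with gunicorn -w 4 -b 0.0.0.0:8000 myproject.wsgi. The explicit bind matters: gunicorn listens on 127.0.0.1:8000 by default, which the ALB cannot reach from outside the container.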
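The /health and /ready promises can be sketched independently of the framework. This is a minimal illustration, not the series' actual code: `health_response`, `ready_response`, and the `db_ping` callable are hypothetical names, and in FastAPI or Django the same logic would sit inside a route or view.

```python
import json
from typing import Callable, Tuple

Response = Tuple[int, str]  # (HTTP status code, JSON body)


def health_response() -> Response:
    """Liveness: answers 200 without touching the database."""
    return 200, json.dumps({"status": "ok"})


def ready_response(db_ping: Callable[[], None]) -> Response:
    """Readiness: 200 if the DB answers, 503 otherwise.

    db_ping is a hypothetical callable that raises on failure,
    for example a function that runs SELECT 1 on a pooled connection.
    """
    try:
        db_ping()
    except Exception as exc:  # any DB error means "not ready"
        body = {"status": "unavailable", "reason": type(exc).__name__}
        return 503, json.dumps(body)
    return 200, json.dumps({"status": "ready"})
```

Keeping the liveness check free of dependencies means a database outage does not cause ECS to restart otherwise healthy containers.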
1) VPC and subnets — the network skeleton #
ECS / RDS / ALB all live inside a VPC. Without one, you can’t bring up a single resource. Fortunately, new accounts come with a default VPC in every region that you can use to get started quickly. For production, building a dedicated VPC is recommended.
The recommended shape #
Public Subnet (10.0.0.0/24, AZ-a) ← ALB, NAT GW
Public Subnet (10.0.1.0/24, AZ-c) ← ALB, NAT GW
Private Subnet (10.0.10.0/24, AZ-a) ← Fargate Task
Private Subnet (10.0.11.0/24, AZ-c) ← Fargate Task
DB Subnet (10.0.20.0/24, AZ-a) ← RDS
DB Subnet (10.0.21.0/24, AZ-c) ← RDS

Three roles:
| Subnet | Traffic direction | Who lives there |
|---|---|---|
| Public | Internet ↔ | ALB, NAT Gateway |
| Private | No internet (only outbound via NAT) | Fargate, EC2 |
| DB | No internet, only Fargate has access | RDS |
In this post we’ll use just the public subnets of the default VPC to spin things up fast (assigning public IPs to Fargate tasks). The production shape comes as code in #4 Terraform.
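Before creating the recommended layout, it can help to check that the CIDR blocks sit inside the VPC range and do not overlap. The sketch below uses Python's standard `ipaddress` module; the 10.0.0.0/16 VPC range and the subnet names are assumptions, since the post only lists the subnet CIDRs.

```python
import ipaddress

# Subnet CIDRs from the recommended shape; the /16 VPC range is an assumption.
VPC_CIDR = ipaddress.ip_network("10.0.0.0/16")
SUBNETS = {
    "public-a": "10.0.0.0/24",
    "public-c": "10.0.1.0/24",
    "private-a": "10.0.10.0/24",
    "private-c": "10.0.11.0/24",
    "db-a": "10.0.20.0/24",
    "db-c": "10.0.21.0/24",
}


def validate_layout(vpc, subnets):
    """Return a list of problems: subnets outside the VPC or overlapping each other."""
    problems = []
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in subnets.items()}
    for name, net in nets.items():
        if not net.subnet_of(vpc):
            problems.append(f"{name} ({net}) is outside {vpc}")
    names = sorted(nets)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if nets[first].overlaps(nets[second]):
                problems.append(f"{first} overlaps {second}")
    return problems


print(validate_layout(VPC_CIDR, SUBNETS))
```

An empty list means the layout is consistent; the same check is cheap to run in CI before the Terraform version in #4 is applied.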
Two security groups #
sg-alb 80, 443 ← 0.0.0.0/0
(Internet to ALB)
sg-fargate 8000 ← sg-alb
(Only ALB to Fargate)

Important pattern: an SG can reference another SG as its source. This is not an IP range; it means “only resources that have this SG attached.” If the ALB’s IP changes, the rule follows automatically.
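In API terms, an SG-to-SG rule is an ingress permission whose source is a group reference rather than a CIDR block. The sketch below only builds that payload; `ingress_from_security_group` is a hypothetical helper, and the IDs are the placeholders used in this post rather than real security group IDs.

```python
def ingress_from_security_group(port: int, source_sg_id: str) -> dict:
    """Build one IpPermissions entry allowing TCP `port` from members of `source_sg_id`.

    The source is expressed with UserIdGroupPairs (a security group reference)
    instead of IpRanges (a CIDR block).
    """
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        "UserIdGroupPairs": [{"GroupId": source_sg_id}],
    }


# Placeholder IDs for illustration; real IDs look like sg-0123456789abcdef0.
SG_ALB = "sg-alb"
SG_FARGATE = "sg-fargate"

rule = ingress_from_security_group(8000, SG_ALB)
# With boto3 installed and credentials configured, the rule could be applied as:
#   boto3.client("ec2").authorize_security_group_ingress(
#       GroupId=SG_FARGATE, IpPermissions=[rule])
print(rule)
```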
2) Push the image to ECR #
We covered this in Advanced #2 ECR, but quickly again.
aws ecr create-repository \
--repository-name blog-api \
--image-scanning-configuration scanOnPush=true \
--region ap-northeast-2

scanOnPush=true enables an automatic vulnerability scan on image push (we’ll see results in #5 Monitoring).
Build and push #
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
REGION=ap-northeast-2
REPO=$ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/blog-api
# 1) Login
aws ecr get-login-password --region $REGION | \
docker login --username AWS --password-stdin $REPO
# 2) Build (linux/amd64 — Fargate's standard architecture)
docker build --platform=linux/amd64 -t blog-api:v1 .
# 3) Tag
docker tag blog-api:v1 $REPO:v1
docker tag blog-api:v1 $REPO:latest
# 4) Push
docker push $REPO:v1
docker push $REPO:latest

If you build on an Apple Silicon (M1/M2/M3) Mac with plain docker build, you get an arm64 image, which won’t run on Fargate (x86_64 by default). Always specify --platform=linux/amd64. Fargate also supports ARM, but the Task Definition must then declare it through runtimePlatform (cpuArchitecture: ARM64).
Image tag strategy #
| Tag | Meaning |
|---|---|
| latest | Latest; don’t use in production (no rollback) |
| v1, v2, ... | Human-readable version |
| <git-sha> | Traceable; auto-issued by CI (#3) |
| <git-sha>-prod | Per-environment alias |
latest is for developer convenience. Production Task Definitions always pin to a git SHA or semver — that way “what code is running” can be answered without any doubt.
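The tag rule can be enforced mechanically before a Task Definition is registered. The following is a sketch under assumptions: the helper names are hypothetical, and the regular expressions are one reasonable reading of "git SHA or semver" (7 to 40 hex characters, or a version such as v1 or 1.4.2), not a rule defined by ECR.

```python
import re

_GIT_SHA = re.compile(r"^[0-9a-f]{7,40}$")
_VERSION = re.compile(r"^v?\d+(\.\d+){0,2}$")


def image_uri(account_id: str, region: str, repo: str, tag: str) -> str:
    """Compose the ECR image URI used in a Task Definition."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repo}:{tag}"


def is_pinned_tag(tag: str) -> bool:
    """True for a git SHA or a version tag; False for latest and other aliases."""
    return bool(_GIT_SHA.match(tag) or _VERSION.match(tag))


uri = image_uri("123456789012", "ap-northeast-2", "blog-api", "v1")
print(uri, is_pinned_tag("v1"), is_pinned_tag("latest"))
```

A per-environment alias such as `<git-sha>-prod` is rejected on purpose here, because an alias can be moved to a different image later.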
3) Task Definition — the container’s “ID card” #
This is the most important part of ECS. Image, CPU/memory, environment variables, ports, and log configuration are all bundled into a single JSON document.
{
"family": "blog-api",
"networkMode": "awsvpc",
"requiresCompatibilities": ["FARGATE"],
"cpu": "512",
"memory": "1024",
"executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
"taskRoleArn": "arn:aws:iam::123456789012:role/blog-api-task-role",
"containerDefinitions": [
{
"name": "api",
"image": "123456789012.dkr.ecr.ap-northeast-2.amazonaws.com/blog-api:v1",
"portMappings": [
{ "containerPort": 8000, "protocol": "tcp" }
],
"essential": true,
"environment": [
{ "name": "ENVIRONMENT", "value": "production" },
{ "name": "LOG_LEVEL", "value": "info" }
],
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "/ecs/blog-api",
"awslogs-region": "ap-northeast-2",
"awslogs-stream-prefix": "api",
"awslogs-create-group": "true"
}
},
"healthCheck": {
"command": ["CMD-SHELL", "curl -fsS http://127.0.0.1:8000/health || exit 1"],
"interval": 10,
"timeout": 3,
"retries": 3,
"startPeriod": 30
}
}
]
}

Key fields:
| Key | Meaning |
|---|---|
| cpu / memory | Fargate allows only fixed combinations (e.g. 256/512, 512/1024, 1024/2048) |
| executionRoleArn | Role used by the ECS agent for ECR pull / Logs / Secrets access |
| taskRoleArn | IAM role used by the container code; boto3 signs with this |
| awslogs | Logs go automatically to CloudWatch (#5). awslogs-create-group additionally needs logs:CreateLogGroup on the execution role, which the managed AmazonECSTaskExecutionRolePolicy does not grant |
| healthCheck | Container’s own health check (separate from the one in the Dockerfile, which ECS does not evaluate) |
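The "fixed combinations" rule for cpu / memory can be checked before registering a Task Definition. The table in this sketch reflects the Fargate task sizes as documented at the time of writing; treat it as an assumption and confirm against the current ECS documentation. The function names are hypothetical.

```python
def fargate_memory_options(cpu: int) -> list:
    """Return the memory values (MiB) Fargate accepts for a CPU value (CPU units)."""
    table = {
        256: [512, 1024, 2048],
        512: list(range(1024, 4096 + 1, 1024)),
        1024: list(range(2048, 8192 + 1, 1024)),
        2048: list(range(4096, 16384 + 1, 1024)),
        4096: list(range(8192, 30720 + 1, 1024)),
        8192: list(range(16384, 61440 + 1, 4096)),
        16384: list(range(32768, 122880 + 1, 8192)),
    }
    if cpu not in table:
        raise ValueError(f"unsupported Fargate cpu value: {cpu}")
    return table[cpu]


def is_valid_fargate_size(cpu: str, memory: str) -> bool:
    """Task Definitions carry cpu/memory as strings, so accept strings here."""
    try:
        return int(memory) in fargate_memory_options(int(cpu))
    except ValueError:
        return False


print(is_valid_fargate_size("512", "1024"))
```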
The two IAM roles get confused often #
| | executionRoleArn | taskRoleArn |
|---|---|---|
| Who uses it | ECS agent (start phase) | Code inside the container (runtime) |
| Permissions | ECR pull, write to CloudWatch, read Secrets | S3 access, RDS, SQS — app logic |
Missing executionRoleArn → image pull fails. Missing taskRoleArn → boto3 throws NoCredentialsError.
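Because both failures only show up at run time, a small pre-registration check on the Task Definition JSON can catch them earlier. This is a sketch with a hypothetical function name, and the warning texts paraphrase the two failure modes described above.

```python
def lint_task_roles(task_def: dict) -> list:
    """Return warnings about the IAM role fields of a Task Definition dict."""
    warnings = []
    execution = task_def.get("executionRoleArn")
    task = task_def.get("taskRoleArn")
    if not execution:
        warnings.append("executionRoleArn missing: image pull and awslogs fail at start")
    if not task:
        warnings.append("taskRoleArn missing: AWS SDK calls in the app get no credentials")
    if execution and execution == task:
        warnings.append("both roles share one ARN: start-up and runtime permissions are mixed")
    return warnings


example = {
    "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
    "taskRoleArn": "arn:aws:iam::123456789012:role/blog-api-task-role",
}
print(lint_task_roles(example))
```

The dict could be loaded from task-definition.json with `json.load` before the register step.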
Register #
aws ecs register-task-definition \
--cli-input-json file://task-definition.json \
--region ap-northeast-2

Each registration bumps the revision number (blog-api:1, blog-api:2, …). Rollback is to a previous revision, covered in #3.
4) ALB + Target Group — the entry point for traffic #
Same ALB pattern from Intermediate #6. The core:
ALB:443 (HTTPS, ACM certificate)
│
▼
Listener: 443 → forward → tg-blog-api
│
▼
Target Group: tg-blog-api
- Protocol: HTTP / 8000
- Target type: ip ← Fargate is always ip
- Health check: GET /health
- Healthy threshold: 2
- Interval: 15s

Target type must be ip: Fargate tasks get a different IP each time, so instance mode doesn’t work.
aws elbv2 create-target-group \
--name tg-blog-api \
--protocol HTTP --port 8000 \
--vpc-id $VPC_ID \
--target-type ip \
--health-check-path /health \
--healthy-threshold-count 2 \
--health-check-interval-seconds 15

ALB Listener rules: see Intermediate #6. HTTPS 443 → forward → tg-blog-api, HTTP 80 → 443 redirect.
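The healthy threshold and interval together determine roughly how long a new task waits before it receives traffic. The sketch below is a simplified estimate, not the ALB's exact behavior: it ignores target registration delay, and `seconds_until_in_service` is a hypothetical name.

```python
def seconds_until_in_service(boot_seconds: float,
                             healthy_threshold: int = 2,
                             interval_seconds: int = 15) -> float:
    """Rough estimate: app boot time plus the consecutive passing checks the ALB requires."""
    return boot_seconds + healthy_threshold * interval_seconds


print(seconds_until_in_service(5))   # fast-booting app
print(seconds_until_in_service(30))  # slow-booting app
```

With the target group settings used here, the check phase alone adds about 2 × 15 = 30 seconds on top of boot time.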
5) ECS Service — the container’s “company” #
If a Task Definition is the job description, the Service is the employer — it maintains the desired count of running tasks, replaces failed ones, and performs rolling deployments.
aws ecs create-cluster --cluster-name blog-cluster

# Shorthand values are kept on one line without spaces so the CLI parses them reliably
aws ecs create-service \
--cluster blog-cluster \
--service-name blog-api \
--task-definition blog-api:1 \
--desired-count 2 \
--launch-type FARGATE \
--network-configuration "awsvpcConfiguration={subnets=[subnet-aaa,subnet-bbb],securityGroups=[sg-fargate],assignPublicIp=ENABLED}" \
--load-balancers "targetGroupArn=$TG_ARN,containerName=api,containerPort=8000" \
--health-check-grace-period-seconds 60 \
--deployment-configuration "deploymentCircuitBreaker={enable=true,rollback=true},maximumPercent=200,minimumHealthyPercent=100"

Key options:
| Option | Meaning |
|---|---|
| desired-count 2 | At least 2: Multi-AZ deployment to survive one-AZ failure |
| assignPublicIp=ENABLED | When private subnet + NAT isn’t available (simple setup). Production should use NAT |
| health-check-grace-period | Window after a task starts during which ECS ignores failing ALB health checks (covers app boot time) |
| deploymentCircuitBreaker | Auto-rollback if a new deployment fails N times in a row (covered in detail in #3) |
| maximumPercent=200 | Max number of tasks during deployment (200% = old + new together) |
| minimumHealthyPercent=100 | Min healthy ratio during deployment (100% = zero downtime) |
These two percentages decide the rolling update shape.
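How the two percentages translate into task counts can be worked out with integer arithmetic. This sketch assumes the rounding described in the ECS documentation (the lower bound rounds up, the upper bound rounds down); `rolling_update_bounds` is a hypothetical helper.

```python
def rolling_update_bounds(desired_count: int,
                          minimum_healthy_percent: int,
                          maximum_percent: int) -> tuple:
    """Return (min_running, max_running) task counts during a rolling deployment."""
    # Ceiling division for the lower bound, floor division for the upper bound.
    min_running = (desired_count * minimum_healthy_percent + 99) // 100
    max_running = desired_count * maximum_percent // 100
    return min_running, max_running


print(rolling_update_bounds(2, 100, 200))
print(rolling_update_bounds(2, 50, 100))
```

With the values from this post (2 tasks, 100 / 200), ECS can start two new tasks next to the two old ones and only then stop the old ones. With 50 / 100 it would have to stop an old task first, which halves capacity during the rollout.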
Auto scaling #
Auto scaling isn’t on just because the Service is up. It has to be registered separately:
aws application-autoscaling register-scalable-target \
--service-namespace ecs \
--resource-id service/blog-cluster/blog-api \
--scalable-dimension ecs:service:DesiredCount \
--min-capacity 2 --max-capacity 10

aws application-autoscaling put-scaling-policy \
--service-namespace ecs \
--resource-id service/blog-cluster/blog-api \
--scalable-dimension ecs:service:DesiredCount \
--policy-name cpu-target \
--policy-type TargetTrackingScaling \
--target-tracking-scaling-policy-configuration '{
"TargetValue": 60.0,
"PredefinedMetricSpecification": {"PredefinedMetricType": "ECSServiceAverageCPUUtilization"},
"ScaleOutCooldown": 30,
"ScaleInCooldown": 120
}'

This scales out and in to keep the average CPU around 60%. Start conservative in production (40–60%) and tune the target as you observe real traffic patterns.
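A useful mental model for target tracking is proportional scaling: if CPU is 1.5 times the target, roughly 1.5 times as many tasks are needed. The sketch below is only that mental model, not the service's actual algorithm, which works through CloudWatch alarms and cooldowns and scales in more cautiously than it scales out. The function name is hypothetical; the defaults mirror the policy in this post.

```python
import math


def estimate_desired_count(current_tasks: int,
                           current_cpu: float,
                           target_cpu: float = 60.0,
                           min_capacity: int = 2,
                           max_capacity: int = 10) -> int:
    """Proportional estimate of the task count that brings average CPU back to target."""
    wanted = math.ceil(current_tasks * current_cpu / target_cpu)
    # Clamp to the registered scalable target range.
    return max(min_capacity, min(max_capacity, wanted))


print(estimate_desired_count(2, 90.0))
print(estimate_desired_count(8, 95.0))
```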
6) Verify the first deployment #
Wait for the Service to reach a stable state #
aws ecs wait services-stable \
--cluster blog-cluster \
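--services blog-api

Check the health endpoint directly #
ALB_DNS=$(aws elbv2 describe-load-balancers \
--names blog-alb \
--query 'LoadBalancers[0].DNSName' --output text)
# The ACM certificate is issued for blog.example.com, not for the ALB's own DNS name,
# so -k skips certificate verification here. Once Route 53 points at the ALB,
# use https://blog.example.com/health without -k.
curl -ik https://$ALB_DNS/health
# HTTP/2 200
# {"status": "ok"}

Tail the logs #
aws logs tail /ecs/blog-api --follow --since 5m

Send a request and once the access log appears, you’ve reached the first checkpoint of this series.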
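The manual curl check can also be scripted as a small poller, for example as a post-deploy step. This is a sketch using only the standard library; the function names are hypothetical, and the URL in the usage comment assumes the blog.example.com record from the diagram already points at the ALB.

```python
import time
import urllib.error
import urllib.request


def http_status(url: str, timeout: float = 3.0) -> int:
    """Return the HTTP status of a GET request, or 0 if the request could not complete."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # the server answered, just not with 2xx
    except (urllib.error.URLError, OSError):
        return 0  # DNS failure, refused connection, TLS error, timeout


def wait_until_healthy(url, attempts=10, delay=3.0, fetch=http_status, sleep=time.sleep):
    """Poll `url` until it returns 200; return the attempt number, or None on give-up."""
    for attempt in range(1, attempts + 1):
        if fetch(url) == 200:
            return attempt
        if attempt < attempts:
            sleep(delay)
    return None


# Usage (needs network access):
#   wait_until_healthy("https://blog.example.com/health")
```

The `fetch` and `sleep` parameters exist so the loop can be exercised in tests without real network calls.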
Pitfalls — 5 reasons the first deployment fails #
1) Endlessly restarting in STOPPED state #
In the ECS console Tasks tab, click a STOPPED row → check “Stopped reason.” Common causes:
| Message | Cause |
|---|---|
| CannotPullContainerError | Missing ECR permission → executionRole |
| ResourceInitializationError: ... secret manager | Wrong Secrets ARN / permissions |
| Essential container ... exited | Container itself died → CloudWatch logs |
| Task failed ELB health checks | ALB can’t mark it healthy → next item |
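The same lookup can be automated against the stoppedReason string that `aws ecs describe-tasks` returns for a stopped task. This is a sketch: the substrings come from the table above, the hint wording is paraphrased, and `explain_stopped_reason` is a hypothetical name.

```python
# Substring to look for, paired with the likely cause from the table above.
STOPPED_REASON_HINTS = [
    ("CannotPullContainerError", "check ECR permissions on the execution role"),
    ("ResourceInitializationError", "check the Secrets ARN and the execution role's permissions"),
    ("Essential container", "the container exited; read its CloudWatch logs"),
    ("failed ELB health checks", "the ALB could not mark the task healthy"),
]


def explain_stopped_reason(stopped_reason: str) -> str:
    """Map an ECS task's stoppedReason string to the first matching hint."""
    lowered = stopped_reason.lower()
    for needle, hint in STOPPED_REASON_HINTS:
        if needle.lower() in lowered:
            return hint
    return "no known pattern; inspect the task details and logs"


print(explain_stopped_reason("Essential container in task exited"))
```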
2) ALB health check failing #
The most common one. Check points:
- Does the container port (8000) match the Target Group port (8000)?
- Does /health actually return 200 (no DB dependency)?
- Is health-check-grace-period longer than the app’s boot time (FastAPI 5s, Django 20–40s)?
- Does the Fargate Security Group’s inbound only allow the ALB SG?
- Can the ALB route to the task’s subnet (same VPC)?
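3) awsvpc networkMode ENI limits #
Fargate tasks consume one ENI (Elastic Network Interface) each, and every ENI takes a private IP from the task’s subnet. If the subnet runs out of IPs, new tasks can’t start. Don’t size CIDR too tightly: a /24 like the example above has 256 addresses, of which AWS reserves 5, leaving 251 usable.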
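The headroom of a subnet is easy to compute with the standard `ipaddress` module. This sketch assumes the 5 addresses AWS reserves in every subnet; `other_enis` is a hypothetical parameter for addresses already taken by ALB nodes, NAT gateways, or VPC endpoints.

```python
import ipaddress

AWS_RESERVED_PER_SUBNET = 5  # network, VPC router, DNS, future use, broadcast


def usable_ips(cidr: str) -> int:
    """Addresses in a subnet that can actually be assigned to ENIs."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET


def max_fargate_tasks(cidr: str, other_enis: int = 0) -> int:
    """Upper bound on tasks in one subnet: one ENI, and so one IP, per task."""
    return max(0, usable_ips(cidr) - other_enis)


print(usable_ips("10.0.10.0/24"))
print(max_fargate_tasks("10.0.10.0/28", other_enis=3))
```

Keep in mind that a rolling deployment with maximumPercent=200 temporarily needs twice the steady-state number of IPs.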
4) ECR pull fails without a public IP #
If a task starts in a private subnet without a NAT Gateway or a VPC Endpoint, traffic to ECR / Secrets Manager / CloudWatch is blocked, and startup fails.
Three fixes:
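- Add a NAT Gateway (about $0.045/hr in us-east-1, more in some regions, plus per-GB data processing)
- Add VPC Endpoints: Interface endpoints for ECR (ecr.api, ecr.dkr) / Logs / Secrets, plus an S3 Gateway endpoint for image layers. Each Interface endpoint is billed per AZ-hour, so compare the total against NAT before assuming it is cheaper
- Public subnet + assignPublicIp=ENABLED (for learning)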
5) Stuck deployment — new tasks never become healthy #
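If deploymentCircuitBreaker is on, ECS rolls back automatically once enough new tasks have failed to start or to pass health checks. If it is off, the Service keeps retrying and the deployment stays IN_PROGRESS indefinitely. Use aws ecs describe-services to inspect the deployments array.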
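The deployments array is verbose, so a small summarizer helps when watching a rollout. This sketch reads the JSON structure returned by describe-services (via `aws ecs describe-services --output json` or boto3's `describe_services`); the sample values are made up for illustration, and `summarize_deployments` is a hypothetical name.

```python
def summarize_deployments(response: dict) -> list:
    """Condense the deployments array of the first service into one line each."""
    service = response["services"][0]
    lines = []
    for dep in service.get("deployments", []):
        task_def = dep.get("taskDefinition", "?").rsplit("/", 1)[-1]
        lines.append(
            f"{dep.get('status', '?')} {task_def} "
            f"rollout={dep.get('rolloutState', '?')} "
            f"running={dep.get('runningCount', 0)}/{dep.get('desiredCount', 0)} "
            f"failed={dep.get('failedTasks', 0)}"
        )
    return lines


# Illustrative response with invented values, trimmed to the fields used above.
sample = {"services": [{"deployments": [
    {"status": "PRIMARY", "rolloutState": "IN_PROGRESS",
     "taskDefinition": "arn:aws:ecs:ap-northeast-2:123456789012:task-definition/blog-api:2",
     "runningCount": 0, "desiredCount": 2, "failedTasks": 3},
    {"status": "ACTIVE", "rolloutState": "COMPLETED",
     "taskDefinition": "arn:aws:ecs:ap-northeast-2:123456789012:task-definition/blog-api:1",
     "runningCount": 2, "desiredCount": 2, "failedTasks": 0},
]}]}

for line in summarize_deployments(sample):
    print(line)
```

A PRIMARY deployment with a rising failed count and zero running tasks is the typical signature of a stuck rollout.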
Wrapping up #
What we covered in this post:
- The big picture: Route 53 → ALB → Fargate (× 2 AZ) → RDS, the standard Multi-AZ production shape
- VPC skeleton: the roles of public / private / db subnets, two SGs for ALB ↔ Fargate
- ECR: --platform=linux/amd64 on build, tags by git SHA or semver, no latest in production
- Task Definition: Fargate CPU/memory combos, splitting executionRole vs taskRole, automatic logging via awslogs
- ALB Target Group: Fargate is target-type ip, health check on /health
- ECS Service: desired count, deployment circuit breaker, maximum/minimum % shape the rolling update
- Auto Scaling: application-autoscaling for CPU/request-based target tracking
- Verification: services-stable wait, ALB DNS curl, CloudWatch Logs tail
- Pitfalls: STOPPED root cause analysis / 5 reasons ALB health check fails / ENI IP shortage / NAT/Endpoint missing / stuck deployment
Next — RDS #
Traffic is now flowing through the ALB, but our API is still without a database — relying entirely on in-memory state.
In #2 RDS integration and migration operations we’ll bring up RDS Postgres Multi-AZ inside the VPC, inject the password through Secrets Manager, place Alembic / Django migrations into operations, and lay out a blue/green-compatible migration pattern that doesn’t kill production traffic.