AWS Advanced #1: ECS and Fargate — Container Deployment
If the AWS Basics 7 posts gave you the foundation of accounts / IAM / security / CloudWatch, and the AWS Intermediate 7 posts made you comfortable with EC2 / VPC / S3 / RDS / Route 53 / ALB / CloudFront, now we step up — to containers.
The seven AWS Advanced posts move you away from deploying straight onto a single EC2 box and into the toolbox you meet when operating at scale: containers, serverless, messaging, secrets, workflows.
- #1 ECS and Fargate — Container Deployment ← this post
- #2 ECR — Image Registry
- #3 Lambda Basics
- #4 API Gateway + Lambda
- #5 EventBridge / SQS / SNS
- #6 Secrets Manager / Parameter Store
- #7 Step Functions
This post is the first of those — ECS and Fargate. We’ll lay down the standard pattern for taking an image you built with Docker and running it on AWS.
The limits of putting things directly on one EC2 #
The flow from Intermediate #2 EC2 operations — spin up an EC2, SSH in, install nginx / docker / your code by hand, run it under systemd — is fine for simple cases. But you start running into pain in these places.
| Pain point | Direct EC2 ops |
|---|---|
| Reproducible environments | OS patches and dependency drift make it different every time |
| Scaling out | Build an AMI → ASG → deploy — minutes, not seconds |
| Zero-downtime deploys | Complicated shell scripts or a separate tool |
| Rollback | Snapshot → boot → shift traffic |
| Health checks / auto-recovery | systemd only goes so far |
Containers solve all of these in one motion — that’s the modern infrastructure flow. On AWS, the door into that is ECS.
Where ECS fits #
Amazon ECS (Elastic Container Service) is AWS’s managed container orchestrator. Hand it a Docker image, tell it what machine, how many copies, and how traffic should flow, and ECS runs the rest.
ECS vs EKS — one-liner #
| | ECS | EKS |
|---|---|---|
| What it is | AWS’s own orchestrator | Kubernetes managed by AWS |
| Learning curve | Gentle (sits naturally inside AWS) | Steep (you have to learn k8s itself) |
| Portability | Low (AWS-only) | High (k8s standard) |
| Ecosystem | AWS tools + some community | Whole k8s ecosystem (Helm, ArgoCD, etc.) |
| Operational burden | Low | High (Control Plane cost + ops knowledge) |
| Where it shines | Small / mid scale, AWS lock-in is fine | Large scale, multi-cloud, k8s standard required |
Starting containers for the first time? ECS first. EKS comes later, after the foundations from Intermediate #1 EC2/VPC plus k8s itself.
ECS has another cousin called App Runner — even simpler than ECS (image → URL in one step). But it’s narrow on options, so ECS / Fargate is the production-grade choice today.
The four ECS pieces #
Four pieces is all you need to memorize.
┌──────────────────────────────────────┐
│ Cluster — the grouping unit │
│ ┌────────────────────────────────┐ │
│ │ Service — keep N running │ │
│ │ ┌────────────┐ ┌────────────┐ │ │
│ │ │ Task #1 │ │ Task #2 │ │ │
│ │ │ (container)│ │ (container)│ │ │
│ │ └────────────┘ └────────────┘ │ │
│ │ ↑ Task Definition (the blueprint) │
│ └────────────────────────────────┘ │
└──────────────────────────────────────┘
1) Task Definition — the blueprint of your container #
A single piece of JSON. It says what to run and how.
- Which image (123456789012.dkr.ecr.ap-northeast-2.amazonaws.com/myapp:v1)
- CPU / memory (512 CPU units, i.e. 0.5 vCPU / 1024 MB)
- Environment variables / Secrets
- Port mappings
- Log driver (typically CloudWatch Logs)
- IAM roles (Task Role + Execution Role, more on this below)
- Health check
{
"family": "myapp",
"networkMode": "awsvpc",
"requiresCompatibilities": ["FARGATE"],
"cpu": "512",
"memory": "1024",
"executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
"taskRoleArn": "arn:aws:iam::123456789012:role/myapp-task-role",
"containerDefinitions": [
{
"name": "web",
"image": "123456789012.dkr.ecr.ap-northeast-2.amazonaws.com/myapp:v1",
"essential": true,
"portMappings": [{ "containerPort": 8000, "protocol": "tcp" }],
"environment": [
{ "name": "ENV", "value": "production" }
],
"secrets": [
{
"name": "DATABASE_URL",
"valueFrom": "arn:aws:secretsmanager:ap-northeast-2:123456789012:secret:myapp/db-AbCdEf"
}
],
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "/ecs/myapp",
"awslogs-region": "ap-northeast-2",
"awslogs-stream-prefix": "web"
}
}
}
]
}
Task Definitions accumulate as revisions (myapp:7, etc.). To deploy a new image, register a new revision and have the Service point at it.
2) Task — a running instance #
A Task is a Task Definition that has actually been started: the container (or set of containers) is running. Think of it as the container-world counterpart of a running EC2 instance.
- One Task = one running revision of a Task Definition
- A Task can have multiple containers (sidecar pattern — main app + log shipper, etc.)
- A Task gets its own ENI (network interface) + IP (awsvpc mode)
3) Service — keep N alive #
Just running a Task once means it’s gone if it crashes. Service is the next layer:
- “Keep N copies of this Task Definition running.”
- Auto-restart on death
- Wires up to ALB / NLB to receive traffic (Intermediate #6)
- Deployment strategies (rolling, blue/green)
- Auto Scaling (CPU / memory / request count based)
Almost all production workloads (web servers, APIs) run as a Service. One-shot batch jobs run a Task directly without a Service (RunTask).
4) Cluster — the grouping #
The logical grouping that Services / Tasks live in. Usually split per environment:
- prod-cluster
- staging-cluster
- dev-cluster
Clusters are free (no charge for the Cluster itself). What you pay for is the resources inside running Tasks. So split per environment freely.
Launch Type — EC2 vs Fargate #
Where ECS actually puts your Tasks. Two modes.
EC2 Launch Type #
You run a fleet of EC2 instances (ASG); ECS schedules containers onto them.
ECS Service
│ (schedule)
▼
EC2 #1 EC2 #2 EC2 #3 ← you run these (ASG, AMI, patches, security)
▲ ▲ ▲
container container container
Pros:
- Instance pricing = EC2 pricing (long-term savings / Reserved / Spot)
- Free choice of GPU / large memory / specialty instances
Cons:
- You operate the EC2 — keep AMIs current, patch the OS, update the ECS agent
- You have to think about packing (binpacking)
- An empty instance still costs you while it’s idle
Fargate Launch Type #
EC2 disappears. You declare the Task’s CPU / memory and AWS finds where to run your container.
ECS Service
│ (schedule)
▼
[AWS-managed plane — invisible]
│
▼
container (Task)
Pros:
- Zero EC2 ops — OS patches, ASG, AMI all handled by AWS
- Per-Task billing (vCPU + memory, metered per second with a one-minute minimum)
- No idle instance waste
Cons:
- Higher unit price than EC2 (managed cost is included)
- No GPU / specialty instances / some networking options
- Per Task (not per container): 0.25–16 vCPU and 0.5–120 GB memory, and only in fixed CPU / memory combinations
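Fargate rejects a Task Definition whose cpu / memory pair is not one of the supported combinations. The sketch below checks a pair before you register it. The table reflects the combinations AWS documents for Linux tasks as best I know them, so treat it as an assumption and verify against the current documentation. The function name is hypothetical, and it expects the plain numeric strings used in the Task Definition above ("512", "1024").

```python
# Valid Fargate task sizes: CPU units -> allowed memory values in MiB.
# Assumption: these mirror the documented Linux combinations; re-check
# the AWS docs before relying on them.
FARGATE_SIZES = {
    256: [512, 1024, 2048],
    512: list(range(1024, 4096 + 1, 1024)),
    1024: list(range(2048, 8192 + 1, 1024)),
    2048: list(range(4096, 16384 + 1, 1024)),
    4096: list(range(8192, 30720 + 1, 1024)),
    8192: list(range(16384, 61440 + 1, 4096)),
    16384: list(range(32768, 122880 + 1, 8192)),
}


def is_valid_fargate_size(cpu: str, memory: str) -> bool:
    """Return True if the cpu/memory strings form a supported pair."""
    return int(memory) in FARGATE_SIZES.get(int(cpu), [])
```

With this table, the 512 / 1024 pair from the example Task Definition passes, while 256 CPU units with 4096 MiB does not.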
Which one? #
| Situation | Pick |
|---|---|
| Small / medium traffic | Fargate — zero ops |
| High-volume, cost-focused | EC2 + Reserved / Spot |
| GPU / specialty workloads | EC2 |
| Bursty traffic / batch | Fargate Spot (up to 70% off) |
| You know k8s but only have ECS | EC2 + freedom |
This series and the six practice posts all assume Fargate. It cuts ops down sharply and the learning curve is gentle.
Two IAM roles — Execution Role vs Task Role #
The most commonly confused thing in ECS ops.
Execution Role #
The permissions the ECS agent needs to launch your Task. Used by AWS right before the Task starts.
- Pull images from ECR
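- Create CloudWatch Logs streams and write log events (the log group itself must already exist unless you also grant logs:CreateLogGroup)
- Fetch secrets from Secrets Manager / Parameter Store (injected at start time; needs read permission on those ARNs in addition to the managed policy)
For most accounts a single ecsTaskExecutionRole is enough (attach the AWS-managed AmazonECSTaskExecutionRolePolicy, which covers ECR pulls and log writes).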
Task Role #
The permissions your code inside the container uses to call AWS APIs. Used at runtime.
- boto3.client("s3").get_object(...) from your code → S3 access
- dynamodb.get_item(...) from your code → DynamoDB access
Make one least-privilege Task Role per app, following the principle from Basics #6 Security fundamentals.
Execution Role → Used by ECS (image pull, log creation, secret injection)
Task Role → Used by your code (S3, DynamoDB, SQS calls, etc.)
Mash these together into one role and you’ve made a security hole.
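To make the Task Role side concrete, here is a minimal sketch of the two policy documents such a role usually needs: a trust policy that lets ECS tasks assume it, and a narrow permissions policy. The bucket name and both function names are hypothetical; only the ecs-tasks.amazonaws.com principal and the IAM document shape are standard.

```python
import json


def task_role_trust_policy() -> dict:
    """Trust policy: allows ECS tasks to assume this role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "ecs-tasks.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }


def read_only_bucket_policy(bucket: str) -> dict:
    """Least-privilege example: read objects from one bucket only."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": f"arn:aws:s3:::{bucket}/*",
        }],
    }


# "myapp-uploads" is a placeholder bucket name.
policy_json = json.dumps(read_only_bucket_policy("myapp-uploads"), indent=2)
```

The Execution Role uses the same trust policy; what differs is the permissions attached to it.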
First deploy — Hello, ECS #
A walkthrough of the full flow. This assumes you already have a Docker image.
1) Push to ECR #
We cover this in detail in #2 ECR, but the flow up front:
# Login
aws ecr get-login-password --region ap-northeast-2 \
| docker login --username AWS --password-stdin \
123456789012.dkr.ecr.ap-northeast-2.amazonaws.com
# Build + tag + push
docker build -t myapp .
docker tag myapp:latest \
123456789012.dkr.ecr.ap-northeast-2.amazonaws.com/myapp:v1
docker push \
123456789012.dkr.ecr.ap-northeast-2.amazonaws.com/myapp:v1
2) Create the Cluster #
aws ecs create-cluster --cluster-name prod-cluster
One click in the console. Free, again.
3) Register the Task Definition #
Save the JSON above as task-definition.json:
aws ecs register-task-definition \
--cli-input-json file://task-definition.json
On success you get revision myapp:1.
4) Create the Service (with ALB) #
With the ALB Target Group (Intermediate #6) already created:
aws ecs create-service \
--cluster prod-cluster \
--service-name myapp \
--task-definition myapp:1 \
--desired-count 2 \
--launch-type FARGATE \
--network-configuration "awsvpcConfiguration={subnets=[subnet-aaa,subnet-bbb],securityGroups=[sg-xxx],assignPublicIp=DISABLED}" \
--load-balancers "targetGroupArn=arn:aws:elasticloadbalancing:...,containerName=web,containerPort=8000"
As soon as you run this, ECS will:
- Bring up 2 Tasks on Fargate
- Register each Task’s ENI IP with the Target Group (the Target Group must use target type ip)
- Have the ALB route traffic once health checks pass
Hit the ALB DNS (or your Route 53 (Intermediate #5) domain) and you’re live.
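If you drive deployments from Python rather than the CLI, the same Service can be described as keyword arguments for boto3's ECS create_service call. This is a sketch under assumptions: the helper name is hypothetical, the subnet and security group IDs are the placeholders from the command above, and the Target Group ARN is passed in because the post elides it.

```python
def build_create_service_args(target_group_arn: str) -> dict:
    """Keyword arguments mirroring the `aws ecs create-service` call."""
    return {
        "cluster": "prod-cluster",
        "serviceName": "myapp",
        "taskDefinition": "myapp:1",
        "desiredCount": 2,
        "launchType": "FARGATE",
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "subnets": ["subnet-aaa", "subnet-bbb"],  # placeholders
                "securityGroups": ["sg-xxx"],             # placeholder
                "assignPublicIp": "DISABLED",
            }
        },
        "loadBalancers": [{
            "targetGroupArn": target_group_arn,
            "containerName": "web",   # must match the container name
            "containerPort": 8000,    # must match a port mapping
        }],
    }


# Usage (needs boto3 and AWS credentials, so it is left commented out):
# import boto3
# args = build_create_service_args("<your target group ARN>")
# boto3.client("ecs").create_service(**args)
```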
5) Deploy a new version #
# Push the new image (myapp:v2)
docker tag myapp:v2 ...; docker push ...
# Register a new Task Definition revision (just swap the image tag)
aws ecs register-task-definition --cli-input-json file://task-definition-v2.json
# → myapp:2
# Update the Service to use the new revision
aws ecs update-service \
--cluster prod-cluster \
--service myapp \
--task-definition myapp:2
ECS handles the rolling update for you: with the default settings it brings up 2 new Tasks, waits for them to pass health checks, then drains the old 2. No downtime.
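The "just swap the image tag" step is easy to script instead of hand-editing a second JSON file. A minimal sketch, assuming the Task Definition has been loaded as a dict; the function name is hypothetical:

```python
import copy


def with_new_image(task_def: dict, container_name: str, image: str) -> dict:
    """Return a copy of the Task Definition with one container's image replaced."""
    new_def = copy.deepcopy(task_def)  # leave the caller's dict untouched
    for container in new_def["containerDefinitions"]:
        if container["name"] == container_name:
            container["image"] = image
            return new_def
    raise ValueError(f"no container named {container_name!r}")
```

Load task-definition.json with json.load, call with_new_image(task_def, "web", "<new image URI>"), and write the result out as the file you pass to register-task-definition.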
Service deployment options #
The default is rolling update; two more options exist.
Rolling Update (default) #
Two knobs: minimumHealthyPercent (default 100) and maximumPercent (default 200).
- minHealthy=100, maxPercent=200 → with desired=2, briefly 4 (new 2 + old 2), then drop the old. Zero downtime.
- minHealthy=50, maxPercent=100 → drop 1 old → start 1 new → drop 1 old → start 1 new. Cheaper, but capacity dips to half during the deploy.
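The two percentages translate into a floor and a ceiling on running Tasks during a deploy. The sketch below computes them, assuming the documented rounding (minimum healthy rounds up, maximum rounds down); the function name is hypothetical.

```python
def rolling_bounds(desired: int, min_healthy_percent: int = 100,
                   max_percent: int = 200) -> tuple[int, int]:
    """Return (minimum running Tasks, maximum total Tasks) during a deploy."""
    # Integer arithmetic avoids float rounding surprises.
    min_running = -(-desired * min_healthy_percent // 100)  # ceiling division
    max_total = desired * max_percent // 100                # floor division
    return min_running, max_total


print(rolling_bounds(2))           # defaults: 100 / 200
print(rolling_bounds(2, 50, 100))  # the cheaper setting
```

For desired=2 the defaults give a floor of 2 and a ceiling of 4, and the 50 / 100 setting gives a floor of 1 and a ceiling of 2, matching the two cases above.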
Blue / Green (CodeDeploy) #
Stand up an entirely new (green) set, then swap the ALB listener at once. Instant rollback.
External (Spinnaker / your own controller) #
Hand “how to deploy” off to an external tool. Only large orgs.
Auto Scaling — grow with traffic #
Sit Application Auto Scaling on top of a Service to adjust desired count automatically.
aws application-autoscaling register-scalable-target \
--service-namespace ecs \
--resource-id service/prod-cluster/myapp \
--scalable-dimension ecs:service:DesiredCount \
--min-capacity 2 --max-capacity 10
aws application-autoscaling put-scaling-policy \
--service-namespace ecs \
--resource-id service/prod-cluster/myapp \
--scalable-dimension ecs:service:DesiredCount \
--policy-name cpu60 \
--policy-type TargetTrackingScaling \
--target-tracking-scaling-policy-configuration file://cpu-60.json
cpu-60.json sets PredefinedMetricSpecification.PredefinedMetricType to ECSServiceAverageCPUUtilization and TargetValue to 60.0.
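For reference, a small sketch that generates the contents of cpu-60.json. The metric type and target come from the text; the cooldown values are illustrative assumptions, not recommendations, and the function name is hypothetical.

```python
import json


def cpu_target_tracking_config(target_percent: float = 60.0,
                               scale_in_cooldown: int = 300,
                               scale_out_cooldown: int = 60) -> dict:
    """Target tracking configuration for average ECS Service CPU."""
    return {
        "TargetValue": target_percent,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        },
        # Cooldowns (seconds) are assumed values; tune for your workload.
        "ScaleInCooldown": scale_in_cooldown,
        "ScaleOutCooldown": scale_out_cooldown,
    }


cpu_60_json = json.dumps(cpu_target_tracking_config(), indent=2)
```

Writing cpu_60_json to cpu-60.json gives the file the put-scaling-policy command reads.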
Common scaling triggers:
- ECS Service average CPU
- ECS Service average memory
- ALB RequestCountPerTarget (request count based)
Service Connect — service-to-service #
Multiple microservices on ECS calling each other. Two options.
1) Through ALB / NLB #
Each service has its own ALB. Service A → https://service-b.internal/ (Route 53 private hosted zone) → ALB → Service B.
Pros: standard HTTP, consistent with external. Cons: ALB cost, an extra hop.
2) Service Connect (built into ECS) #
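ECS automatically attaches a proxy sidecar (Envoy-based) next to your container, behaving like a mesh. Each service is registered under a short name in the Service Connect namespace, so with the configuration below other services in the namespace reach it at http://web:8000. The portName must match a named port mapping in the Task Definition (a "name": "web" entry in portMappings).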
{
"serviceConnectConfiguration": {
"enabled": true,
"namespace": "myapp",
"services": [
{
"portName": "web",
"discoveryName": "web",
"clientAliases": [{ "port": 8000, "dnsName": "web" }]
}
]
}
}
For small systems an ALB hop is fine. Look at Service Connect once you have multiple microservices.
Cost — where it comes from #
Fargate basis:
cost = (vCPU-hours) × $0.0506
+ (memory-GB-hours) × $0.0055
+ (Data Transfer)
Example: 0.5 vCPU + 1 GB Fargate, 1 task, one month (730h)
= 0.5 × 0.0506 × 730 + 1 × 0.0055 × 730
= $18.5 + $4.0
= ~$22.5 / month (rough Seoul region pricing; check the current Fargate pricing page for exact rates)
Plus:
- ALB: hourly + LCU
- NAT Gateway (when private subnets reach the internet): hourly + GB
- CloudWatch Logs: ingest GB + storage GB
NAT Gateway is sneakily expensive. The hourly charge alone can easily run ~$30/month before any per-GB processing fees, so for a small service NAT can cost more than Fargate itself.
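The Fargate part of the estimate is simple enough to keep as a function. A sketch using the rough rates quoted above as defaults; they are the post's approximate figures, not authoritative prices, and the function name is hypothetical.

```python
HOURS_PER_MONTH = 730


def fargate_monthly_cost(vcpu: float, memory_gb: float, tasks: int = 1,
                         vcpu_rate: float = 0.0506,
                         gb_rate: float = 0.0055,
                         hours: int = HOURS_PER_MONTH) -> float:
    """Compute-only monthly cost in USD (no ALB, NAT, logs, or data transfer)."""
    return tasks * hours * (vcpu * vcpu_rate + memory_gb * gb_rate)


print(round(fargate_monthly_cost(0.5, 1), 2))           # one Task
print(round(fargate_monthly_cost(0.5, 1, tasks=2), 2))  # desired count of 2
```

One 0.5 vCPU / 1 GB Task comes to about $22.48, which is the ~$22.5 in the example; a Service with a desired count of 2 doubles it.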
Cost-saving levers #
- Fargate Spot: up to 70% off for bursty / batch workloads. Tasks can be interrupted with a two-minute warning, so only stateless or retry-safe work fits
- Compute Savings Plans: 1- or 3-year commitment, up to 50% off
- Right-sizing: use CloudWatch Container Insights to see actual usage, then drop vCPU / memory — usually the biggest win
Common pitfalls #
1) Tasks keep dying and restarting #
The Service auto-restarts so it looks fine on the surface — but the container is actually exiting right after it starts. Causes:
- Health check failures (app boots slowly, ALB marks unhealthy)
- Errors at startup → immediate exit
- OOM killed (memory too small)
Look at CloudWatch Logs (Basics #7) and the stopped reason:
aws ecs describe-tasks --cluster prod-cluster \
--tasks <task-id> --query 'tasks[0].stoppedReason'
2) Image pull permission missing #
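“CannotPullContainerError” right after Task start → usually the Execution Role is missing ECR permissions (confirm AmazonECSTaskExecutionRolePolicy is attached), or the Task has no network path to ECR (a private subnet with neither a NAT Gateway nor ECR VPC endpoints).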
3) Secrets aren’t injected #
Tasks that reference secrets in the Task Definition fail to start (typically with a ResourceInitializationError) → the Execution Role lacks secretsmanager:GetSecretValue / ssm:GetParameters on those ARNs. Details in #6.
4) ALB Target unhealthy #
Deploys succeed but the ALB health check fails. Usual causes:
- Health check path doesn’t exist on the app (forgot the /health endpoint)
- Security Group blocks ALB → Task traffic
- App is bound to 127.0.0.1 instead of 0.0.0.0 (unreachable from outside the container)
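The first and third causes are both app-side. As an illustration only, here is a standard-library server that exposes /health and binds to 0.0.0.0 on port 8000, matching the containerPort used in this post. The handler and factory names are hypothetical; a real app would add the same route in its own framework.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            body = b"ok"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep health-check noise out of the logs


def make_server(host: str = "0.0.0.0", port: int = 8000) -> HTTPServer:
    """Bind to all interfaces so the ALB can reach the Task's ENI IP."""
    return HTTPServer((host, port), HealthHandler)


# To run: make_server().serve_forever()
```

Binding to 127.0.0.1 instead would make the same endpoint answer only from inside the container, which is exactly the failure described above.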
5) Task Definition revisions explode #
v1 → v2 → … → v847, on and on. Without cleanup the console gets sluggish. Operational policy: deregister revisions older than 30 days on a schedule, or have your IaC clean up.
6) NAT Gateway cost blow-up #
Tasks in private subnets that hit external APIs frequently → NAT Gateway data processing fees can exceed your Fargate compute bill. Mitigations:
- VPC Endpoints for AWS services you use a lot (S3, ECR, Secrets Manager) — that traffic skips NAT
- For external API calls, keep tasks in the same AZ as the NAT to avoid cross-AZ data charges
Wrap-up #
Here is what this post covered:
- The limits of bare EC2 ops — environment reproducibility, scaling, zero-downtime deploys, rollbacks, health checks all flow naturally with containers
- Where ECS sits — AWS’s managed container orchestrator. EKS comes when you need k8s standardization
- The four pieces — Cluster (grouping) / Service (keep N) / Task (running container) / Task Definition (blueprint)
- Launch Type — EC2 (you operate, cost-optimal) vs Fargate (zero ops, higher unit price). The series goes Fargate
- Two IAM roles — Execution Role (ECS launching the Task) vs Task Role (your code calling AWS APIs). Never blur them
- First-deploy flow — ECR push → Cluster → Task Definition → Service (with ALB)
- Deploy strategies — rolling (default) / blue-green (CodeDeploy) / external
- Auto Scaling — Application Auto Scaling on CPU / memory / request count
- Service Connect — service-to-service via mesh, no ALB hop
- Cost — vCPU + memory + ALB + NAT. NAT is bigger than you think. Spot, Savings Plans, right-sizing
- Pitfalls — restart loops (health / OOM), image pull permission, secret permissions, ALB unhealthy, revision sprawl, NAT cost
Up next — ECR #
Where do those images ECS pulls actually live? In the next post we go into Amazon ECR (Elastic Container Registry) in detail.
In #2 ECR — Image Registry we cover creating private repos, authentication, push / pull, image scanning, lifecycle policies, and multi-architecture images — the natural companion to ECS, all in one piece.