AWS Advanced #1: ECS and Fargate — Container Deployment

Infrastructure AWS Container ECS Fargate

Saturday, April 25, 2026

If the seven AWS Basics posts gave you the foundation (accounts, IAM, security, CloudWatch) and the seven AWS Intermediate posts made you comfortable with EC2, VPC, S3, RDS, Route 53, ALB, and CloudFront, the next step up is containers.

The seven AWS Advanced posts move you away from running everything directly on a single EC2 box and into the toolbox you meet at operating scale: containers, serverless, messaging, secrets, and workflows.

#1 ECS and Fargate — Container Deployment ← this post
#2 ECR — Image Registry
#3 Lambda Basics
#4 API Gateway + Lambda
#5 EventBridge / SQS / SNS
#6 Secrets Manager / Parameter Store
#7 Step Functions

This post is the first of those — ECS and Fargate. We’ll lay down the standard pattern for taking an image you built with Docker and running it on AWS.

The limits of putting things directly on one EC2 #

The flow from Intermediate #2 EC2 operations — spin up an EC2, SSH in, install nginx / docker / your code by hand, run it under systemd — is fine for simple cases. But you start running into pain in these places.

Pain point	Direct EC2 ops
Reproducible environments	OS patches and dependency drift make it different every time
Scaling out	Build an AMI → ASG → deploy — minutes, not seconds
Zero-downtime deploys	Complicated shell scripts or a separate tool
Rollback	Snapshot → boot → shift traffic
Health checks / auto-recovery	systemd only goes so far

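Containers address all of these at once: the image pins the environment, and the orchestrator handles scaling, rolling deploys, rollback to a previous image, and health-based restarts. On AWS, the door into that is ECS.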

Where ECS fits #

Amazon ECS (Elastic Container Service) is AWS’s managed container orchestrator. Hand it a Docker image, tell it how much CPU and memory each copy needs, how many copies to run, and how traffic should reach them, and ECS runs the rest.

ECS vs EKS — one-liner #

	ECS	EKS
What it is	AWS’s own orchestrator	Kubernetes managed by AWS
Learning curve	Gentle (sits naturally inside AWS)	Steep (you have to learn k8s itself)
Portability	Low (AWS-only)	High (k8s standard)
Ecosystem	AWS tools + some community	Whole k8s ecosystem (Helm, ArgoCD, etc.)
Operational burden	Low	High (Control Plane cost + ops knowledge)
Where it shines	Small / mid scale, AWS lock-in is fine	Large scale, multi-cloud, k8s standard required

Starting containers for the first time? ECS first. EKS comes later, after the foundations from Intermediate #1 EC2/VPC plus k8s itself.

ECS has another cousin called App Runner, which is even simpler than ECS (image → URL in one step). It exposes fewer configuration options, though, so this series treats ECS on Fargate as the production default.

The four ECS pieces #

Four pieces are all you need to memorize.

ECS — top to bottom

┌──────────────────────────────────────┐
│  Cluster: the grouping unit          │
│  ┌────────────────────────────────┐  │
│  │ Service: keep N running        │  │
│  │  ┌────────────┐ ┌────────────┐ │  │
│  │  │  Task #1   │ │  Task #2   │ │  │
│  │  │ (container)│ │ (container)│ │  │
│  │  └────────────┘ └────────────┘ │  │
│  │  ↑ Task Definition (blueprint) │  │
│  └────────────────────────────────┘  │
└──────────────────────────────────────┘

1) Task Definition — the blueprint of your container #

A single piece of JSON. It says what to run and how.

Which image (123456789012.dkr.ecr.ap-northeast-2.amazonaws.com/myapp:v1)
CPU / memory (512 CPU units = 0.5 vCPU / 1024 MiB)
Environment variables / Secrets
Port mappings
Log driver (typically CloudWatch Logs)
IAM roles (Task Role + Execution Role — more on this below)
Health check

task-definition.json (Fargate)

{
  "family": "myapp",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
  "taskRoleArn": "arn:aws:iam::123456789012:role/myapp-task-role",
  "containerDefinitions": [
    {
      "name": "web",
      "image": "123456789012.dkr.ecr.ap-northeast-2.amazonaws.com/myapp:v1",
      "essential": true,
      "portMappings": [{ "containerPort": 8000, "protocol": "tcp" }],
      "environment": [
        { "name": "ENV", "value": "production" }
      ],
      "secrets": [
        {
          "name": "DATABASE_URL",
          "valueFrom": "arn:aws:secretsmanager:ap-northeast-2:123456789012:secret:myapp/db-AbCdEf"
        }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/myapp",
          "awslogs-region": "ap-northeast-2",
          "awslogs-stream-prefix": "web"
        }
      }
    }
  ]
}

Task Definitions accumulate as revisions (myapp:7, etc.). To deploy a new image, register a new revision and have the Service point at it.

2) Task — a running instance #

A Task Definition that has actually been started: the container (or set of containers) is running. It is roughly the container-world counterpart of a running EC2 instance.

One Task = one running copy of one Task Definition revision
A Task can have multiple containers (sidecar pattern — main app + log shipper, etc.)
A Task gets its own ENI (network interface) + IP (awsvpc mode)

3) Service — keep N alive #

Just running a Task once means it’s gone if it crashes. Service is the next layer:

“Keep N copies of this Task Definition running.”
Auto-restart on death
Wires up to ALB / NLB to receive traffic (Intermediate #6)
Deployment strategies (rolling, blue/green)
Auto Scaling (CPU / memory / request count based)

Almost all production workloads (web servers, APIs) run as a Service. One-shot batch jobs run a Task directly without a Service (RunTask).
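The "keep N alive" behaviour can be pictured as a reconciliation loop. The toy model below is only a sketch of that idea, not how ECS is implemented; `ToyService`, `reconcile`, and `crash` are hypothetical names invented for illustration.

```python
import itertools
from dataclasses import dataclass, field


@dataclass
class ToyService:
    """Toy model of a Service: converge the running set toward desired_count."""

    desired_count: int
    running: set = field(default_factory=set)
    _ids: itertools.count = field(default_factory=lambda: itertools.count(1))

    def crash(self, task_id: str) -> None:
        # Simulate a Task dying unexpectedly.
        self.running.discard(task_id)

    def reconcile(self) -> list[str]:
        """Start or stop Tasks until len(running) == desired_count."""
        actions = []
        while len(self.running) < self.desired_count:
            task_id = f"task-{next(self._ids)}"
            self.running.add(task_id)
            actions.append(f"start {task_id}")
        while len(self.running) > self.desired_count:
            task_id = sorted(self.running)[-1]
            self.running.remove(task_id)
            actions.append(f"stop {task_id}")
        return actions
```

Calling `reconcile()` on a fresh `ToyService(desired_count=2)` starts two Tasks; after `crash("task-1")`, the next `reconcile()` starts one replacement. A one-shot RunTask has no such loop behind it.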

4) Cluster — the grouping #

The logical grouping that Services / Tasks live in. Usually split per environment:

prod-cluster
staging-cluster
dev-cluster

Clusters are free (no charge for the Cluster itself). What you pay for is the resources inside running Tasks. So split per environment freely.

Launch Type — EC2 vs Fargate #

Where ECS actually puts your Tasks. Two modes.

EC2 Launch Type #

You run a fleet of EC2 instances (ASG); ECS schedules containers onto them.

EC2 Launch Type

ECS Service
   │ (schedule)
   ▼
EC2 #1     EC2 #2     EC2 #3   ← you run these (ASG, AMI, patches, security)
 ▲          ▲          ▲
container  container  container

Pros:

Instance pricing = EC2 pricing (long-term savings / Reserved / Spot)
Free choice of GPU / large memory / specialty instances

Cons:

You operate the EC2 — keep AMIs current, patch the OS, update the ECS agent
You have to think about packing (binpacking)
An empty instance still costs you while it’s idle
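To make the packing problem concrete, here is a small sketch using the classic first-fit-decreasing heuristic on memory alone. It is an illustration of why packing matters, not the ECS scheduler; the function name and the sample sizes are made up.

```python
def first_fit_decreasing(task_memory_mib: list[int], instance_memory_mib: int) -> list[list[int]]:
    """Place tasks (by memory) onto as few equal-sized instances as the heuristic finds.

    Returns one list of task sizes per instance.
    """
    instances: list[list[int]] = []
    for size in sorted(task_memory_mib, reverse=True):
        if size > instance_memory_mib:
            raise ValueError(f"task of {size} MiB does not fit on any instance")
        for placed in instances:
            # Put the task on the first instance that still has room.
            if sum(placed) + size <= instance_memory_mib:
                placed.append(size)
                break
        else:
            # No existing instance had room, so a new one is needed.
            instances.append([size])
    return instances
```

With tasks of 2048, 1024, 1024, 512, and 3072 MiB on 4096 MiB instances, this packs everything onto two instances. A real placement also has to respect CPU, ports, and AZ spread, which is exactly the work Fargate takes off your hands.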

Fargate Launch Type #

EC2 disappears. You declare the Task’s CPU / memory and AWS finds where to run your container.

Fargate Launch Type

ECS Service
   │ (schedule)
   ▼
[AWS-managed plane — invisible]
   │
   ▼
container (Task)

Pros:

Zero EC2 ops — OS patches, ASG, AMI all handled by AWS
Per-Task billing (per second with a one-minute minimum, vCPU + memory)
No idle instance waste

Cons:

Higher unit price than EC2 (managed cost is included)
No GPU / specialty instances / some networking options
Per Task: 0.25–16 vCPU, 0.5–120 GB memory ceiling
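Fargate only accepts certain CPU / memory pairs, so it can help to validate a Task size before registering it. The table below reflects the documented Linux combinations as I understand them; treat it as an assumption and check the current AWS documentation. `is_valid_fargate_size` is a hypothetical helper.

```python
# CPU units (1024 = 1 vCPU) -> allowed memory values in MiB.
# Assumption: mirrors the documented Fargate (Linux) task-size table.
FARGATE_MEMORY_MIB = {
    256: [512, 1024, 2048],
    512: list(range(1024, 4096 + 1, 1024)),
    1024: list(range(2048, 8192 + 1, 1024)),
    2048: list(range(4096, 16384 + 1, 1024)),
    4096: list(range(8192, 30720 + 1, 1024)),
    8192: list(range(16384, 61440 + 1, 4096)),
    16384: list(range(32768, 122880 + 1, 8192)),
}


def is_valid_fargate_size(cpu_units: int, memory_mib: int) -> bool:
    """Return True if the cpu/memory pair appears in the table above."""
    return memory_mib in FARGATE_MEMORY_MIB.get(cpu_units, [])
```

The task definition in this post uses `"cpu": "512"` and `"memory": "1024"`, which passes this check.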

Which one? #

Situation	Pick
Small / medium traffic	Fargate — zero ops
High-volume, cost-focused	EC2 + Reserved / Spot
GPU / specialty workloads	EC2
Bursty traffic / batch	Fargate Spot (up to 70% off)
You know k8s but are on ECS and want host-level control	EC2 (more freedom over the hosts)

This series and the six practice posts all assume Fargate. It cuts ops down sharply and the learning curve is gentle.

Two IAM roles — Execution Role vs Task Role #

The most commonly confused thing in ECS ops.

Execution Role #

The permissions the ECS agent needs to launch your Task. Used by AWS right before the Task starts.

Pull images from ECR
Create CloudWatch Logs streams and write log events (the log group itself must already exist unless you grant extra permissions)
Fetch secrets from Secrets Manager / Parameter Store (injected at start time)

For most accounts a single ecsTaskExecutionRole is enough (attach the AWS-managed AmazonECSTaskExecutionRolePolicy).

Task Role #

The permissions your code inside the container uses to call AWS APIs. Used at runtime.

boto3.client("s3").get_object(...) from your code → S3 access
dynamodb.get_item(...) from your code → DynamoDB access

Create a least-privilege Task Role per app, following the principle from Basics #6 Security fundamentals.

Role separation

Execution Role  →  Used by ECS (image pull, log creation, secret injection)
Task Role       →  Used by your code (S3, DynamoDB, SQS calls, etc.)

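Mash these together into one role and you have made a security hole: your application code inherits permissions that only ECS needs at launch, such as pulling images and reading every injected secret.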
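As a sketch of what the separation can look like, the snippet below builds the policy documents as plain Python dicts. Both roles trust the `ecs-tasks.amazonaws.com` service principal; what differs is the permissions attached. The bucket name and helper function names are hypothetical, and a real app would need its own action list.

```python
import json

# Trust policy shared by both roles: ECS tasks are allowed to assume them.
TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "ecs-tasks.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}


def task_role_policy(bucket: str) -> dict:
    """Task Role: only what the application code calls at runtime."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            }
        ],
    }


def execution_role_secret_policy(secret_arn: str) -> dict:
    """Execution Role add-on: lets ECS read one secret at Task start."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["secretsmanager:GetSecretValue"],
                "Resource": secret_arn,
            }
        ],
    }


def to_json(policy: dict) -> str:
    """Serialize a policy document for use with the CLI or console."""
    return json.dumps(policy, indent=2)
```

The point of the split is visible in the `Resource` fields: the Task Role never mentions the secret ARN, and the Execution Role never mentions the application's bucket.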

First deploy — Hello, ECS #

A walkthrough of the full flow. This assumes you already have a Docker image.

1) Push to ECR #

We cover this in detail in #2 ECR, but the flow up front:

ECR push

# Login
aws ecr get-login-password --region ap-northeast-2 \
  | docker login --username AWS --password-stdin \
    123456789012.dkr.ecr.ap-northeast-2.amazonaws.com

# Build + tag + push
docker build -t myapp .
docker tag myapp:latest \
  123456789012.dkr.ecr.ap-northeast-2.amazonaws.com/myapp:v1
docker push \
  123456789012.dkr.ecr.ap-northeast-2.amazonaws.com/myapp:v1

2) Create the Cluster #

Cluster

aws ecs create-cluster --cluster-name prod-cluster

One click in the console. Free, again.

3) Register the Task Definition #

Save the JSON above as task-definition.json:

aws ecs register-task-definition \
  --cli-input-json file://task-definition.json

On success you get revision myapp:1.

4) Create the Service (with ALB) #

With the ALB Target Group (Intermediate #6) already created, using target type ip (required for awsvpc Tasks):

Service

aws ecs create-service \
  --cluster prod-cluster \
  --service-name myapp \
  --task-definition myapp:1 \
  --desired-count 2 \
  --launch-type FARGATE \
  --network-configuration "awsvpcConfiguration={subnets=[subnet-aaa,subnet-bbb],securityGroups=[sg-xxx],assignPublicIp=DISABLED}" \
  --load-balancers "targetGroupArn=arn:aws:elasticloadbalancing:...,containerName=web,containerPort=8000"

The instant you run this, ECS will:

Bring up 2 Tasks on Fargate
Register each Task’s private IP (from its ENI) with the Target Group
Have the ALB route traffic once health checks pass

Hit the ALB DNS (or your Route 53 (Intermediate #5) domain) and you’re live.

5) Deploy a new version #

new version

# Push the new image (myapp:v2)
docker tag myapp:v2 ...; docker push ...

# Register a new Task Definition revision (just swap the image tag)
aws ecs register-task-definition --cli-input-json file://task-definition-v2.json
# → myapp:2

# Update the Service to use the new revision
aws ecs update-service \
  --cluster prod-cluster \
  --service myapp \
  --task-definition myapp:2

ECS handles the rolling update for you — bring up 2 new Tasks, wait for health, drain the old 2. No downtime.
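The "just swap the image tag" step is easy to script. The sketch below copies a task definition dict and replaces one container's image; `with_new_image` is a hypothetical helper, and writing the result to task-definition-v2.json is left to the caller.

```python
import copy


def with_new_image(task_def: dict, container_name: str, image: str) -> dict:
    """Return a copy of task_def with one container's image replaced.

    The input dict is left untouched so the previous revision's file stays valid.
    """
    new_def = copy.deepcopy(task_def)
    for container in new_def["containerDefinitions"]:
        if container["name"] == container_name:
            container["image"] = image
            return new_def
    raise KeyError(f"no container named {container_name!r} in task definition")
```

For example, load task-definition.json with `json.load`, call `with_new_image(task_def, "web", ".../myapp:v2")`, and dump the result with `json.dump` before running `register-task-definition`.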

Service deployment options #

The default is rolling update; two more options exist.

Rolling Update (default) #

Two knobs: minimumHealthyPercent (default 100) and maximumPercent (default 200).

minHealthy=100, maxPercent=200 → with desired=2, briefly 4 (new 2 + old 2), then drop the old. Zero downtime.
minHealthy=50, maxPercent=100 → drop 1 old → start 1 new → drop 1 old → start 1 new. Cheaper.
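The two knobs translate into a floor and a ceiling on the number of Tasks during a deploy. The sketch below computes both, assuming (as I read the ECS documentation) that the minimum is rounded up and the maximum is rounded down; `rolling_bounds` is a hypothetical helper.

```python
def rolling_bounds(
    desired: int,
    minimum_healthy_percent: int = 100,
    maximum_percent: int = 200,
) -> tuple[int, int]:
    """Return (min_running, max_total) Task counts allowed during a rolling update."""
    # Ceiling division for the lower bound (assumption: ECS rounds this up).
    min_running = -(-desired * minimum_healthy_percent // 100)
    # Floor division for the upper bound (assumption: ECS rounds this down).
    max_total = desired * maximum_percent // 100
    return min_running, max_total
```

`rolling_bounds(2)` gives `(2, 4)`, the "briefly 4" case above, and `rolling_bounds(2, 50, 100)` gives `(1, 2)`, the one-at-a-time case.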

Blue / Green (CodeDeploy) #

Stand up an entirely new (green) set, then swap the ALB listener at once. Instant rollback.

External (Spinnaker / your own controller) #

Hand “how to deploy” off to an external tool. Mostly relevant to large organizations.

Auto Scaling — grow with traffic #

Sit Application Auto Scaling on top of a Service to adjust desired count automatically.

hold average CPU at 60%

aws application-autoscaling register-scalable-target \
  --service-namespace ecs \
  --resource-id service/prod-cluster/myapp \
  --scalable-dimension ecs:service:DesiredCount \
  --min-capacity 2 --max-capacity 10

aws application-autoscaling put-scaling-policy \
  --service-namespace ecs \
  --resource-id service/prod-cluster/myapp \
  --scalable-dimension ecs:service:DesiredCount \
  --policy-name cpu60 \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration file://cpu-60.json

cpu-60.json sets TargetValue to 60.0 and, inside PredefinedMetricSpecification, sets PredefinedMetricType to ECSServiceAverageCPUUtilization.
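For reference, a file like cpu-60.json could be generated as follows. The key names follow the Application Auto Scaling target tracking configuration; the cooldown values are illustrative assumptions, not recommendations from this post, and `target_tracking_config` is a hypothetical helper.

```python
import json


def target_tracking_config(
    target_value: float,
    metric_type: str = "ECSServiceAverageCPUUtilization",
    scale_out_cooldown: int = 60,   # seconds; illustrative assumption
    scale_in_cooldown: int = 300,   # seconds; illustrative assumption
) -> dict:
    """Build the body passed via --target-tracking-scaling-policy-configuration."""
    return {
        "TargetValue": float(target_value),
        "PredefinedMetricSpecification": {"PredefinedMetricType": metric_type},
        "ScaleOutCooldown": scale_out_cooldown,
        "ScaleInCooldown": scale_in_cooldown,
    }


def render(config: dict) -> str:
    """Serialize the configuration as the JSON text to save in cpu-60.json."""
    return json.dumps(config, indent=2)
```

`render(target_tracking_config(60))` produces the JSON text to save as cpu-60.json.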

Common scaling triggers:

ECS Service average CPU
ECS Service average memory
ALB request count per target (predefined metric ALBRequestCountPerTarget)
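A rough way to reason about what target tracking will do is to scale the Task count in proportion to how far the metric is from its target. This is a simplification for intuition only: the real service works through CloudWatch alarms and cooldowns. `estimate_desired_count` is a hypothetical helper.

```python
import math


def estimate_desired_count(
    current_count: int,
    metric_value: float,
    target_value: float,
    min_capacity: int,
    max_capacity: int,
) -> int:
    """Proportional estimate of the Task count that brings the metric back to target."""
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    wanted = math.ceil(current_count * metric_value / target_value)
    # Clamp to the range registered with register-scalable-target.
    return max(min_capacity, min(max_capacity, wanted))
```

With 2 Tasks at 90% average CPU and a 60% target, the estimate is 3 Tasks; with 8 Tasks at 100% it would want 14 but is capped at the max capacity of 10.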

Service Connect — service-to-service #

Multiple microservices on ECS calling each other. Two options.

1) Through ALB / NLB #

Each service has its own ALB. Service A → https://service-b.internal/ (Route 53 private hosted zone) → ALB → Service B.

Pros: standard HTTP, consistent with external. Cons: ALB cost, an extra hop.

2) Service Connect (built into ECS) #

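ECS automatically attaches a proxy sidecar (Envoy-based) next to your container, so the Service behaves as if it were part of a mesh. Each endpoint is registered in an AWS Cloud Map namespace, and other Service Connect-enabled Services in that namespace reach it by the configured name (web in the excerpt below, or a fully qualified name such as web.myapp.local when the namespace is myapp.local).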

Service Connect (excerpt)

{
  "serviceConnectConfiguration": {
    "enabled": true,
    "namespace": "myapp",
    "services": [
      {
        "portName": "web",
        "discoveryName": "web",
        "clientAliases": [{ "port": 8000, "dnsName": "web" }]
      }
    ]
  }
}

For small systems an ALB hop is fine. Look at Service Connect once you have multiple microservices.
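One detail that is easy to miss: each `portName` in the Service Connect configuration has to match a `name` on a port mapping in the Task Definition. The sketch below checks that before you deploy; `unmatched_port_names` is a hypothetical helper operating on the two JSON documents loaded as dicts.

```python
def unmatched_port_names(task_def: dict, service_connect: dict) -> list[str]:
    """Return Service Connect portNames that no container port mapping declares."""
    named_ports = {
        mapping["name"]
        for container in task_def.get("containerDefinitions", [])
        for mapping in container.get("portMappings", [])
        if "name" in mapping
    }
    return [
        service["portName"]
        for service in service_connect.get("services", [])
        if service["portName"] not in named_ports
    ]
```

Run against the task definition earlier in this post, the check reports `web` as unmatched, because that port mapping has no `name` field; adding `"name": "web"` to the mapping resolves it.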

Cost — where it comes from #

Fargate basis:

cost = vCPU + memory + network

cost = (vCPU-hours)      × $0.0506
     + (memory-GB-hours) × $0.0055
     + (Data Transfer)

Example: 0.5 vCPU + 1GB Fargate, 1 task, one month (730h)
   = 0.5 × 0.0506 × 730 + 1 × 0.0055 × 730
   = $18.5  +  $4.0
   = ~$22.5 / month  (rough Seoul region pricing)
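The same arithmetic as a small function, so you can plug in your own Task size and count. The default rates are the rough figures quoted above, not an authoritative price list, and data transfer is left out; `fargate_monthly_cost` is a hypothetical helper.

```python
HOURS_PER_MONTH = 730


def fargate_monthly_cost(
    vcpu: float,
    memory_gb: float,
    tasks: int = 1,
    vcpu_hour_rate: float = 0.0506,   # rough rate quoted in this post
    gb_hour_rate: float = 0.0055,     # rough rate quoted in this post
    hours: float = HOURS_PER_MONTH,
) -> float:
    """Compute-only Fargate cost in USD for always-on Tasks (no data transfer)."""
    per_task = vcpu * vcpu_hour_rate * hours + memory_gb * gb_hour_rate * hours
    return per_task * tasks
```

`fargate_monthly_cost(0.5, 1)` comes to about 22.48, matching the ~$22.5 above; the two-Task Service from the first deploy doubles that.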

Plus:

ALB: hourly + LCU
NAT Gateway (when private subnets reach the internet): hourly + GB
CloudWatch Logs: ingest GB + storage GB

NAT Gateway is sneakily expensive. The hourly charge alone can easily run ~$30/month before any data processing fees, so for a small service NAT can cost more than Fargate itself.

Cost-saving levers #

Fargate Spot: up to 70% off for bursty / batch workloads. Tasks can be interrupted with a two-minute warning, so only stateless, interruption-tolerant work fits
Compute Savings Plans: 1- or 3-year commitment, up to 50% off
Right-sizing: use CloudWatch Container Insights to see actual usage, then drop vCPU / memory — usually the biggest win

Common pitfalls #

1) Tasks keep dying and restarting #

The Service auto-restarts so it looks fine on the surface — but the container is actually exiting right after it starts. Causes:

Health check failures (app boots slowly, ALB marks unhealthy)
Errors at startup → immediate exit
OOM killed (memory too small)

Look at CloudWatch Logs (Basics #7) and the stopped reason:

aws ecs describe-tasks --cluster prod-cluster \
  --tasks <task-id> --query 'tasks[0].stoppedReason'
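If you save the full `describe-tasks` output as JSON, a few lines of Python can pull out the fields that usually explain a restart loop: the Task-level `stoppedReason` plus each container's `exitCode` and `reason`. This is a sketch; `summarize_stopped_task` is a hypothetical helper, and the hint for exit code 137 (128 + SIGKILL) is a common heuristic rather than a guarantee.

```python
def summarize_stopped_task(describe_tasks_response: dict) -> list[str]:
    """Turn a describe-tasks response (parsed JSON) into readable lines."""
    lines = []
    for task in describe_tasks_response.get("tasks", []):
        lines.append(f"task stopped: {task.get('stoppedReason', 'unknown')}")
        for container in task.get("containers", []):
            exit_code = container.get("exitCode")
            line = f"  {container.get('name', '?')}: exitCode={exit_code}"
            if container.get("reason"):
                line += f", reason={container['reason']}"
            if exit_code == 137:
                # 137 = 128 + 9: the process received SIGKILL.
                line += " (SIGKILL, often an out-of-memory kill)"
            lines.append(line)
    return lines
```

Feed it the result of `json.load` on the saved CLI output and print each line.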

2) Image pull permission missing #

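“CannotPullContainerError” right after Task start usually means one of two things: the Execution Role is missing ECR permissions (confirm AmazonECSTaskExecutionRolePolicy is attached), or the Task has no network path to ECR (a private subnet with assignPublicIp=DISABLED needs a NAT Gateway or VPC Endpoints for ECR and S3).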

3) Secrets aren’t injected #

A Task that references secrets in its Task Definition fails to start (typically with a ResourceInitializationError in the stopped reason) when the Execution Role lacks secretsmanager:GetSecretValue / ssm:GetParameters on those ARNs. Details in #6.

4) ALB Target unhealthy #

Deploys succeed but the ALB health check fails. Usual causes:

Health check path doesn’t exist on the app (forgot the /health endpoint)
Security Group blocks ALB → Task traffic
App is bound to 127.0.0.1 instead of 0.0.0.0 (unreachable from outside the container)
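A minimal app that avoids the first and third causes might look like the sketch below: it serves `/health` and binds to `0.0.0.0` on port 8000, the container port used in this post. It uses only the standard library and is meant as an illustration, not a production server; `HealthHandler` and `make_server` are hypothetical names.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Answer the ALB health check path with 200; everything else is 404 here.
        if self.path == "/health":
            status, body = 200, b"ok"
        else:
            status, body = 404, b"not found"
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Silence per-request logging so health checks do not flood the logs.
        pass


def make_server(host: str = "0.0.0.0", port: int = 8000) -> ThreadingHTTPServer:
    """Bind to all interfaces so the ALB can reach the container."""
    return ThreadingHTTPServer((host, port), HealthHandler)


# To run inside the container: make_server().serve_forever()
```

Binding to `127.0.0.1` instead would make the same code pass local tests inside the container and fail every ALB health check.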

5) Task Definition revisions explode #

myapp:1 → myapp:2 → … → myapp:847, on and on. Without cleanup the console gets sluggish. Operational policy: deregister revisions older than 30 days on a schedule (aws ecs deregister-task-definition), or have your IaC clean up.
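The selection logic for such a cleanup can be sketched in a few lines: always keep the newest few revisions, and among the rest pick those older than the cutoff. The inputs are assumed to be `(revision, registered_at)` pairs you have already listed; `revisions_to_deregister` and the `keep_latest` safety margin are hypothetical additions, not something the post prescribes.

```python
from datetime import datetime, timedelta, timezone


def revisions_to_deregister(
    revisions: list[tuple[int, datetime]],
    now: datetime,
    keep_latest: int = 5,
    max_age_days: int = 30,
) -> list[int]:
    """Return revision numbers that are both old and outside the newest keep_latest."""
    newest_first = sorted(revisions, key=lambda item: item[0], reverse=True)
    cutoff = now - timedelta(days=max_age_days)
    return sorted(
        revision
        for revision, registered_at in newest_first[keep_latest:]
        if registered_at < cutoff
    )
```

Each returned number can then be passed to `aws ecs deregister-task-definition --task-definition myapp:<revision>`. Keeping the newest few regardless of age leaves room for a quick rollback.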

6) NAT Gateway cost blow-up #

Tasks in private subnets that hit external APIs frequently → NAT Gateway data processing fees can exceed your Fargate bill. Mitigations:

VPC Endpoints for AWS services you use a lot (S3, ECR, Secrets Manager) — that traffic skips NAT
For external API calls, keep tasks in the same AZ as the NAT to avoid cross-AZ data charges
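To see how quickly this adds up, a NAT Gateway bill can be approximated as an hourly charge plus a per-GB processing charge. The sketch below takes both rates as parameters because they vary by region; the rates in the example afterwards are illustrative placeholders, not prices quoted from this post. `nat_monthly_cost` is a hypothetical helper.

```python
def nat_monthly_cost(
    gb_processed: float,
    hourly_rate: float,
    per_gb_rate: float,
    hours: float = 730,
) -> float:
    """Approximate monthly NAT Gateway cost in USD: fixed hourly part + data processing."""
    return hourly_rate * hours + gb_processed * per_gb_rate
```

With placeholder rates of $0.045 per hour and $0.045 per GB, an idle gateway is about $32.85 a month and 500 GB of traffic brings it to about $55.35, which is why moving S3 and ECR traffic onto VPC Endpoints tends to pay off.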

Wrap-up #

Here is what this post covered:

The limits of bare EC2 ops — environment reproducibility, scaling, zero-downtime deploys, rollbacks, health checks all flow naturally with containers
Where ECS sits — AWS’s managed container orchestrator. EKS comes when you need k8s standardization
The four pieces — Cluster (grouping) / Service (keep N) / Task (running container) / Task Definition (blueprint)
Launch Type — EC2 (you operate, cost-optimal) vs Fargate (zero ops, higher unit price). The series goes Fargate
Two IAM roles — Execution Role (ECS launching the Task) vs Task Role (your code calling AWS APIs). Never blur them
First-deploy flow — ECR push → Cluster → Task Definition → Service (with ALB)
Deploy strategies — rolling (default) / blue-green (CodeDeploy) / external
Auto Scaling — Application Auto Scaling on CPU / memory / request count
Service Connect — service-to-service via mesh, no ALB hop
Cost — vCPU + memory + ALB + NAT. NAT is bigger than you think. Spot, Savings Plans, right-sizing
Pitfalls — restart loops (health / OOM), image pull permission, secret permissions, ALB unhealthy, revision sprawl, NAT cost

Up next — ECR #

Where do those images ECS pulls actually live? In the next post we go into Amazon ECR (Elastic Container Registry) in detail.

In #2 ECR — Image Registry we cover creating private repos, authentication, push / pull, image scanning, lifecycle policies, and multi-architecture images — the natural companion to ECS, all in one piece.