AWS Advanced #1: ECS and Fargate — Container Deployment

If the seven AWS Basics posts gave you the foundation (accounts, IAM, security, CloudWatch) and the seven AWS Intermediate posts made you comfortable with EC2, VPC, S3, RDS, Route 53, ALB, and CloudFront, it is time to step up to containers.

The seven AWS Advanced posts move you away from installing everything by hand on a single EC2 box and into the toolbox you meet at operating scale: containers, serverless, messaging, secrets, workflows.

This post is the first of those — ECS and Fargate. We’ll lay down the standard pattern for taking an image you built with Docker and running it on AWS.

The limits of putting things directly on one EC2 #

The flow from Intermediate #2 EC2 operations — spin up an EC2, SSH in, install nginx / docker / your code by hand, run it under systemd — is fine for simple cases. But you start running into pain in these places.

| Pain point | Direct EC2 ops |
| --- | --- |
| Reproducible environments | OS patches and dependency drift make it different every time |
| Scaling out | Build an AMI → ASG → deploy (minutes, not seconds) |
| Zero-downtime deploys | Complicated shell scripts or a separate tool |
| Rollback | Snapshot → boot → shift traffic |
| Health checks / auto-recovery | systemd only goes so far |

Containers solve all of these in one motion — that’s the modern infrastructure flow. On AWS, the door into that is ECS.

Where ECS fits #

Amazon ECS (Elastic Container Service) is AWS’s managed container orchestrator. Hand it a Docker image, tell it what machine, how many copies, and how traffic should flow, and ECS runs the rest.

ECS vs EKS — one-liner #

| | ECS | EKS |
| --- | --- | --- |
| What it is | AWS’s own orchestrator | Kubernetes managed by AWS |
| Learning curve | Gentle (sits naturally inside AWS) | Steep (you have to learn k8s itself) |
| Portability | Low (AWS-only) | High (k8s standard) |
| Ecosystem | AWS tools + some community | Whole k8s ecosystem (Helm, ArgoCD, etc.) |
| Operational burden | Low | High (Control Plane cost + ops knowledge) |
| Where it shines | Small / mid scale, AWS lock-in is fine | Large scale, multi-cloud, k8s standard required |

Starting containers for the first time? ECS first. EKS comes later, after the foundations from Intermediate #1 EC2/VPC plus k8s itself.

ECS has another cousin called App Runner — even simpler than ECS (image → URL in one step). But it’s narrow on options, so ECS / Fargate is the production-grade choice today.

The four ECS pieces #

Four pieces is all you need to memorize.

ECS — top to bottom
┌──────────────────────────────────────┐
│  Cluster — the grouping unit         │
│  ┌────────────────────────────────┐  │
│  │ Service — keep N running        │  │
│  │  ┌────────────┐ ┌────────────┐ │  │
│  │  │  Task #1   │ │  Task #2   │ │  │
│  │  │ (container)│ │ (container)│ │  │
│  │  └────────────┘ └────────────┘ │  │
│  │  ↑ Task Definition (the blueprint) │
│  └────────────────────────────────┘  │
└──────────────────────────────────────┘

1) Task Definition — the blueprint of your container #

A single piece of JSON. It says what to run and how.

  • Which image (123456789012.dkr.ecr.ap-northeast-2.amazonaws.com/myapp:v1)
  • CPU / memory (512 / 1024 MB)
  • Environment variables / Secrets
  • Port mappings
  • Log driver (typically CloudWatch Logs)
  • IAM roles (Task Role + Execution Role — more on this below)
  • Health check

task-definition.json (Fargate)
{
  "family": "myapp",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
  "taskRoleArn": "arn:aws:iam::123456789012:role/myapp-task-role",
  "containerDefinitions": [
    {
      "name": "web",
      "image": "123456789012.dkr.ecr.ap-northeast-2.amazonaws.com/myapp:v1",
      "essential": true,
      "portMappings": [{ "containerPort": 8000, "protocol": "tcp" }],
      "environment": [
        { "name": "ENV", "value": "production" }
      ],
      "secrets": [
        {
          "name": "DATABASE_URL",
          "valueFrom": "arn:aws:secretsmanager:ap-northeast-2:123456789012:secret:myapp/db-AbCdEf"
        }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/myapp",
          "awslogs-region": "ap-northeast-2",
          "awslogs-stream-prefix": "web"
        }
      }
    }
  ]
}

Task Definitions accumulate as revisions (myapp:7, etc.). To deploy a new image, register a new revision and have the Service point at it.
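
Fargate only accepts certain cpu / memory pairings, so it can help to check the pair before registering a revision. A minimal Python sketch, assuming the pairings below still match the Fargate documentation (verify against the current docs before relying on it); is_valid_fargate_size is a hypothetical helper, not part of any AWS SDK:

```python
# Allowed Fargate task sizes: cpu units -> allowed memory values in MiB.
# Assumption: this table mirrors the documented Fargate combinations.
FARGATE_MEMORY_MIB = {
    256: (512, 1024, 2048),
    512: range(1024, 4096 + 1, 1024),
    1024: range(2048, 8192 + 1, 1024),
    2048: range(4096, 16384 + 1, 1024),
    4096: range(8192, 30720 + 1, 1024),
    8192: range(16384, 61440 + 1, 4096),
    16384: range(32768, 122880 + 1, 8192),
}


def is_valid_fargate_size(cpu: str, memory: str) -> bool:
    """Return True if the task-level cpu / memory strings form a valid pair."""
    allowed = FARGATE_MEMORY_MIB.get(int(cpu))
    return allowed is not None and int(memory) in allowed


# The task definition above uses "512" / "1024".
print(is_valid_fargate_size("512", "1024"))
```

Task Definitions carry cpu and memory as strings, which is why the helper takes strings and converts them.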

2) Task — a running instance #

A Task Definition that has actually been started: the container (or set of containers) is running. It plays roughly the role a running EC2 instance played before.

  • One Task = one running revision of a Task Definition
  • A Task can have multiple containers (sidecar pattern — main app + log shipper, etc.)
  • A Task gets its own ENI (network interface) + IP (awsvpc mode)

3) Service — keep N alive #

Just running a Task once means it’s gone if it crashes. Service is the next layer:

  • “Keep N copies of this Task Definition running.”
  • Auto-restart on death
  • Wires up to ALB / NLB to receive traffic (Intermediate #6)
  • Deployment strategies (rolling, blue/green)
  • Auto Scaling (CPU / memory / request count based)

Almost all production workloads (web servers, APIs) run as a Service. One-shot batch jobs run a Task directly without a Service (RunTask).

4) Cluster — the grouping #

The logical grouping that Services / Tasks live in. Usually split per environment:

  • prod-cluster
  • staging-cluster
  • dev-cluster

Clusters are free: there is no charge for the Cluster itself. What you pay for is the compute that running Tasks consume, so split per environment freely.

Launch Type — EC2 vs Fargate #

Where ECS actually puts your Tasks. Two modes.

EC2 Launch Type #

You run a fleet of EC2 instances (ASG); ECS schedules containers onto them.

EC2 Launch Type
ECS Service
   │ (schedule)
EC2 #1     EC2 #2     EC2 #3   ← you run these (ASG, AMI, patches, security)
 ▲          ▲          ▲
container  container  container

Pros:

  • Instance pricing = EC2 pricing (long-term savings / Reserved / Spot)
  • Free choice of GPU / large memory / specialty instances

Cons:

  • You operate the EC2 — keep AMIs current, patch the OS, update the ECS agent
  • You have to think about packing (binpacking)
  • An empty instance still costs you while it’s idle

Fargate Launch Type #

EC2 disappears. You declare the Task’s CPU / memory and AWS finds where to run your container.

Fargate Launch Type
ECS Service
   │ (schedule)
[AWS-managed plane — invisible]
container (Task)

Pros:

  • Zero EC2 ops — OS patches, ASG, AMI all handled by AWS
  • Per-Task billing (per second with a one-minute minimum, vCPU + memory)
  • No idle instance waste

Cons:

  • Higher unit price than EC2 (managed cost is included)
  • No GPU / specialty instances / some networking options
  • Size ceiling per Task: 0.25–16 vCPU, 0.5–120 GB memory

Which one? #

| Situation | Pick |
| --- | --- |
| Small / medium traffic | Fargate (zero ops) |
| High-volume, cost-focused | EC2 + Reserved / Spot |
| GPU / specialty workloads | EC2 |
| Bursty traffic / batch | Fargate Spot (up to 70% off) |
| You know k8s but only have ECS | EC2 + freedom |

This series and the six practice posts all assume Fargate. It cuts ops down sharply and the learning curve is gentle.

Two IAM roles — Execution Role vs Task Role #

The most commonly confused thing in ECS ops.

Execution Role #

The permissions the ECS agent needs to launch your Task. Used by AWS right before the Task starts.

  • Pull images from ECR
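  • Create CloudWatch Logs streams and write log events (the managed policy does not include logs:CreateLogGroup, so create the log group up front or grant that permission separately)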
  • Fetch secrets from Secrets Manager / Parameter Store (injected at start time)

For most accounts a single ecsTaskExecutionRole is enough: attach the AWS-managed AmazonECSTaskExecutionRolePolicy, then add read access for the specific secrets or parameters your Task Definitions reference.

Task Role #

The permissions your code inside the container uses to call AWS APIs. Used at runtime.

  • boto3.client("s3").get_object(...) from your code → S3 access
  • dynamodb.get_item(...) from your code → DynamoDB access

You should make a least-privilege Task Role per app. The principle from Basics #6 Security fundamentals.

Role separation
Execution Role  →  Used by ECS (image pull, log creation, secret injection)
Task Role       →  Used by your code (S3, DynamoDB, SQS calls, etc.)

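Mash these together into one role and you’ve made a security hole: your application code ends up holding launch-time permissions (image pulls, secret reads) it never needs at runtime, and the launch path ends up holding your app’s data permissions.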
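
To make the split concrete, here is a rough Python sketch of the two policy documents involved. Both roles trust the same service principal (ecs-tasks.amazonaws.com); what differs is the permissions attached. The bucket name myapp-uploads and the prefix are hypothetical, and the helper names are illustrative only:

```python
import json


def ecs_tasks_trust_policy() -> dict:
    """Trust policy letting ECS tasks assume a role (same for both roles)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "ecs-tasks.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def s3_read_only_policy(bucket: str, prefix: str = "") -> dict:
    """Least-privilege Task Role policy: read objects under one prefix only."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            }
        ],
    }


# Hypothetical bucket and prefix, for illustration only.
print(json.dumps(s3_read_only_policy("myapp-uploads", "public/"), indent=2))
```

The Task Role gets only the narrow application policy; image pull and log permissions stay on the Execution Role.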

First deploy — Hello, ECS #

A walkthrough of the full flow. This assumes you already have a Docker image.

1) Push to ECR #

We cover this in detail in #2 ECR, but the flow up front:

ECR push
# Login
aws ecr get-login-password --region ap-northeast-2 \
  | docker login --username AWS --password-stdin \
    123456789012.dkr.ecr.ap-northeast-2.amazonaws.com

# Build + tag + push
docker build -t myapp .
docker tag myapp:latest \
  123456789012.dkr.ecr.ap-northeast-2.amazonaws.com/myapp:v1
docker push \
  123456789012.dkr.ecr.ap-northeast-2.amazonaws.com/myapp:v1

2) Create the Cluster #

Cluster
aws ecs create-cluster --cluster-name prod-cluster

One click in the console. Free, again.

3) Register the Task Definition #

Save the JSON above as task-definition.json:

register
aws ecs register-task-definition \
  --cli-input-json file://task-definition.json

On success you get revision myapp:1.

4) Create the Service (with ALB) #

With the ALB Target Group (Intermediate #6) already created:

Service
aws ecs create-service \
  --cluster prod-cluster \
  --service-name myapp \
  --task-definition myapp:1 \
  --desired-count 2 \
  --launch-type FARGATE \
  --network-configuration "awsvpcConfiguration={subnets=[subnet-aaa,subnet-bbb],securityGroups=[sg-xxx],assignPublicIp=DISABLED}" \
  --load-balancers "targetGroupArn=arn:aws:elasticloadbalancing:...,containerName=web,containerPort=8000"

The instant you run this, ECS will:

  1. Bring up 2 Tasks on Fargate
  2. Register each Task’s private IP (from its ENI) with the Target Group, which is why the Target Group must use target type ip
  3. Have the ALB route traffic once health checks pass

Hit the ALB DNS (or your Route 53 (Intermediate #5) domain) and you’re live.
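
If you drive ECS from Python instead of the CLI, the same request can be expressed as keyword arguments. The sketch below only builds the arguments and does not call AWS; build_create_service_kwargs is a hypothetical helper, and the subnet, security group, and target group values are the placeholders from the command above:

```python
def build_create_service_kwargs(
    cluster: str,
    service_name: str,
    task_definition: str,
    desired_count: int,
    subnets: list,
    security_groups: list,
    target_group_arn: str,
    container_name: str,
    container_port: int,
) -> dict:
    """Mirror the CLI flags above as a dict of create-service parameters."""
    return {
        "cluster": cluster,
        "serviceName": service_name,
        "taskDefinition": task_definition,
        "desiredCount": desired_count,
        "launchType": "FARGATE",
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "subnets": list(subnets),
                "securityGroups": list(security_groups),
                "assignPublicIp": "DISABLED",
            }
        },
        "loadBalancers": [
            {
                "targetGroupArn": target_group_arn,
                "containerName": container_name,
                "containerPort": container_port,
            }
        ],
    }


kwargs = build_create_service_kwargs(
    cluster="prod-cluster",
    service_name="myapp",
    task_definition="myapp:1",
    desired_count=2,
    subnets=["subnet-aaa", "subnet-bbb"],
    security_groups=["sg-xxx"],
    target_group_arn="arn:aws:elasticloadbalancing:...",  # elided, as in the CLI example
    container_name="web",
    container_port=8000,
)
print(sorted(kwargs))
```

With boto3 installed and credentials configured, the dict would be passed as boto3.client("ecs").create_service(**kwargs).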

5) Deploy a new version #

new version
# Push the new image (myapp:v2)
docker tag myapp:v2 ...; docker push ...

# Register a new Task Definition revision (just swap the image tag)
aws ecs register-task-definition --cli-input-json file://task-definition-v2.json
# → myapp:2

# Update the Service to use the new revision
aws ecs update-service \
  --cluster prod-cluster \
  --service myapp \
  --task-definition myapp:2

ECS handles the rolling update for you — bring up 2 new Tasks, wait for health, drain the old 2. No downtime.
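
Producing the next revision's JSON is mostly a matter of swapping the image tag. A small Python sketch of that step, assuming the file layout of task-definition.json shown earlier; with_new_image_tag is a hypothetical helper and it does not handle digest references (image@sha256:...):

```python
import copy


def with_new_image_tag(task_def: dict, container_name: str, new_tag: str) -> dict:
    """Return a copy of the task definition with one container's tag replaced."""
    updated = copy.deepcopy(task_def)
    for container in updated["containerDefinitions"]:
        if container["name"] == container_name:
            image = container["image"]
            # Only treat ":" as a tag separator if it appears after the last "/".
            last_segment = image.rsplit("/", 1)[-1]
            repo = image.rsplit(":", 1)[0] if ":" in last_segment else image
            container["image"] = f"{repo}:{new_tag}"
            return updated
    raise ValueError(f"no container named {container_name!r}")


v1 = {
    "family": "myapp",
    "containerDefinitions": [
        {
            "name": "web",
            "image": "123456789012.dkr.ecr.ap-northeast-2.amazonaws.com/myapp:v1",
        }
    ],
}
v2 = with_new_image_tag(v1, "web", "v2")
print(v2["containerDefinitions"][0]["image"])
```

In practice you would json.load task-definition.json, apply the helper, and json.dump the result to task-definition-v2.json before running register-task-definition.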

Service deployment options #

The default is rolling update; two more options exist.

Rolling Update (default) #

Two knobs: minimumHealthyPercent (default 100) and maximumPercent (default 200).

  • minHealthy=100, maxPercent=200 → with desired=2, briefly 4 (new 2 + old 2), then drop the old. Zero downtime.
  • minHealthy=50, maxPercent=100 → drop 1 old → start 1 new → drop 1 old → start 1 new. Cheaper.
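
The two knobs translate into a floor and a ceiling on the number of running Tasks during a deployment. A short Python sketch of that arithmetic, assuming the documented rounding (minimum rounded up, maximum rounded down); rolling_update_bounds is a hypothetical name:

```python
def rolling_update_bounds(
    desired: int, min_healthy_percent: int = 100, max_percent: int = 200
) -> tuple:
    """Return (min running Tasks, max running Tasks) during a rolling update."""
    # Ceiling division using integers only: -(-a // b) == ceil(a / b).
    lower = -(-desired * min_healthy_percent // 100)
    # Floor division for the upper bound.
    upper = desired * max_percent // 100
    return lower, upper


print(rolling_update_bounds(2))           # (2, 4)
print(rolling_update_bounds(2, 50, 100))  # (1, 2)
```

The first case matches the "briefly 4" behaviour above; the second shows why the cheaper setting temporarily runs on a single Task.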

Blue / Green (CodeDeploy) #

Stand up an entirely new (green) set, then swap the ALB listener at once. Instant rollback.

External (Spinnaker / your own controller) #

Hand “how to deploy” off to an external tool. Mostly relevant for large orgs with their own deployment platform.

Auto Scaling — grow with traffic #

Sit Application Auto Scaling on top of a Service to adjust desired count automatically.

hold average CPU at 60%
aws application-autoscaling register-scalable-target \
  --service-namespace ecs \
  --resource-id service/prod-cluster/myapp \
  --scalable-dimension ecs:service:DesiredCount \
  --min-capacity 2 --max-capacity 10

aws application-autoscaling put-scaling-policy \
  --service-namespace ecs \
  --resource-id service/prod-cluster/myapp \
  --scalable-dimension ecs:service:DesiredCount \
  --policy-name cpu60 \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration file://cpu-60.json

cpu-60.json contains PredefinedMetricSpecification: ECSServiceAverageCPUUtilization, TargetValue: 60.0.
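
For reference, the file's shape can be generated with a few lines of Python. This is a sketch: the field names follow the target tracking configuration structure, while the cooldown values (60 s out, 300 s in) are illustrative assumptions rather than values from this post:

```python
import json


def target_tracking_config(
    target_value: float,
    metric_type: str = "ECSServiceAverageCPUUtilization",
    scale_out_cooldown: int = 60,   # assumed value, tune for your service
    scale_in_cooldown: int = 300,   # assumed value, tune for your service
) -> dict:
    """Build the JSON body passed via --target-tracking-scaling-policy-configuration."""
    return {
        "TargetValue": float(target_value),
        "PredefinedMetricSpecification": {"PredefinedMetricType": metric_type},
        "ScaleOutCooldown": scale_out_cooldown,
        "ScaleInCooldown": scale_in_cooldown,
    }


config = target_tracking_config(60)
# Redirect this output to cpu-60.json to use it with the command above.
print(json.dumps(config, indent=2))
```

Swapping metric_type for ECSServiceAverageMemoryUtilization gives the memory-based variant.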

Common scaling triggers:

  • ECS Service average CPU
  • ECS Service average memory
  • ALB request count per target (predefined metric ALBRequestCountPerTarget)

Service Connect — service-to-service #

Multiple microservices on ECS calling each other. Two options.

1) Through ALB / NLB #

Each service has its own ALB. Service A → https://service-b.internal/ (Route 53 private hosted zone) → ALB → Service B.

Pros: standard HTTP, consistent with external. Cons: ALB cost, an extra hop.

2) Service Connect (built into ECS) #

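ECS automatically attaches an Envoy-based proxy sidecar next to your container, so Services behave like members of a mesh. Names are registered per namespace rather than per Cluster, and clients in the same namespace call a Service by its alias (for example web.myapp.local, depending on the namespace and clientAliases you configure).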

Service Connect (excerpt)
{
  "serviceConnectConfiguration": {
    "enabled": true,
    "namespace": "myapp",
    "services": [
      {
        "portName": "web",
        "discoveryName": "web",
        "clientAliases": [{ "port": 8000, "dnsName": "web" }]
      }
    ]
  }
}

For small systems an ALB hop is fine. Look at Service Connect once you have multiple microservices.

Cost — where it comes from #

Fargate basis:

cost = vCPU + memory + network
total = (vCPU-hours)      × $0.0506
      + (memory-GB-hours) × $0.0055
      + (Data Transfer)

Example: 0.5 vCPU + 1GB Fargate, 1 task, one month (730h)
   = 0.5 × 0.0506 × 730 + 1 × 0.0055 × 730
   = $18.5  +  $4.0
   = ~$22.5 / month  (rough Seoul region pricing)
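
The same arithmetic as a small Python function, useful for trying other task sizes or counts. The rates are the rough figures quoted above, not authoritative pricing, and data transfer is left out; fargate_monthly_cost is a hypothetical helper:

```python
VCPU_HOUR_USD = 0.0506  # rough rate from the example above
GB_HOUR_USD = 0.0055    # rough rate from the example above


def fargate_monthly_cost(
    vcpu: float, memory_gb: float, tasks: int = 1, hours: float = 730
) -> float:
    """Estimate Fargate compute cost in USD (excludes data transfer)."""
    per_task_hour = vcpu * VCPU_HOUR_USD + memory_gb * GB_HOUR_USD
    return tasks * hours * per_task_hour


# 0.5 vCPU + 1 GB, one Task, one month
print(f"${fargate_monthly_cost(0.5, 1):.2f}")  # $22.48
```

Running it with tasks=2 shows what the desired-count of 2 from the Service example would cost under the same rates.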

Plus:

  • ALB: hourly + LCU
  • NAT Gateway (when private subnets reach the internet): hourly + GB
  • CloudWatch Logs: ingest GB + storage GB

NAT Gateway is sneakily expensive. The hourly charge alone can easily run ~$30/month or more per gateway before any data processing fees, so for a small service NAT can dwarf Fargate itself.

Cost-saving levers #

  • Fargate Spot: up to 70% off for bursty / batch workloads. Tasks can be interrupted on short notice, so only stateless or retry-safe work fits
  • Compute Savings Plans: 1- or 3-year commitment, up to 50% off
  • Right-sizing: use CloudWatch Container Insights to see actual usage, then drop vCPU / memory — usually the biggest win

Common pitfalls #

1) Tasks keep dying and restarting #

The Service auto-restarts so it looks fine on the surface — but the container is actually exiting right after it starts. Causes:

  • Health check failures (app boots slowly, ALB marks unhealthy)
  • Errors at startup → immediate exit
  • OOM killed (memory too small)

Look at CloudWatch Logs (Basics #7) and the stopped reason:

aws ecs describe-tasks --cluster prod-cluster \
  --tasks <task-id> --query 'tasks[0].stoppedReason'
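
When the stopped reason alone is not enough, the per-container exit code and reason in the same describe-tasks output usually narrow it down. A Python sketch that summarizes one task object from that JSON; the sample data is made up for illustration, and treating exit code 137 as a possible OOM kill is a heuristic, not a guarantee:

```python
def summarize_stopped_task(task: dict) -> list:
    """Turn one entry of describe-tasks 'tasks' into readable lines."""
    lines = ["task: " + task.get("stoppedReason", "no stoppedReason recorded")]
    for container in task.get("containers", []):
        exit_code = container.get("exitCode")
        reason = container.get("reason", "")
        hint = ""
        # 137 = killed by SIGKILL, which is what an OOM kill looks like.
        if "OutOfMemory" in reason or exit_code == 137:
            hint = " <- possible OOM kill, consider more task memory"
        lines.append(f"  {container.get('name')}: exit={exit_code} {reason}{hint}")
    return lines


# Illustrative sample, shaped like one element of the "tasks" array.
sample_task = {
    "stoppedReason": "Essential container in task exited",
    "containers": [
        {
            "name": "web",
            "exitCode": 137,
            "reason": "OutOfMemoryError: Container killed due to memory usage",
        }
    ],
}
print("\n".join(summarize_stopped_task(sample_task)))
```

Feed it the parsed output of the describe-tasks command above (with --output json and without the --query filter).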

2) Image pull permission missing #

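“CannotPullContainerError” right after Task start → most often the Execution Role is missing ECR permissions, so first confirm AmazonECSTaskExecutionRolePolicy is attached. If it is, check that the Task has a network path to ECR (NAT Gateway or VPC Endpoints from a private subnet).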

3) Secrets aren’t injected #

secrets from the Task Definition fail to resolve and the Task stops before your container starts → the Execution Role lacks secretsmanager:GetSecretValue / ssm:GetParameters on those ARNs. Details in #6.

4) ALB Target unhealthy #

Deploys succeed but the ALB health check fails. Usual causes:

  • Health check path doesn’t exist on the app (forgot the /health endpoint)
  • Security Group blocks ALB → Task traffic
  • App is bound to 127.0.0.1 instead of 0.0.0.0 (unreachable from outside the container)
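
The first and third causes are easy to see in a toy app. Below is a minimal standard-library Python sketch of a server that answers /health and binds to 0.0.0.0 on port 8000 (the containerPort used throughout); a real service would use its own framework, and the names here are illustrative:

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

BIND_HOST = "0.0.0.0"  # all interfaces; 127.0.0.1 would hide the app from the ALB
BIND_PORT = 8000       # must match containerPort in the Task Definition


def route(path: str) -> tuple:
    """Map a request path to (status code, body)."""
    if path == "/health":
        return 200, b"ok"
    return 404, b"not found"


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = route(self.path)
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(host: str = BIND_HOST, port: int = BIND_PORT) -> ThreadingHTTPServer:
    """Create the server; the container entrypoint calls serve_forever() on it."""
    return ThreadingHTTPServer((host, port), Handler)


# Container entrypoint would be: make_server().serve_forever()
```

Point the Target Group health check path at /health so the ALB probes an endpoint that actually exists.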

5) Task Definition revisions explode #

v1 → v2 → … → v847, on and on. Without cleanup the console gets sluggish. Operational policy: deregister revisions older than 30 days on a schedule, or have your IaC clean up.
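
The selection logic for such a cleanup job can be sketched in Python. Assumptions: you have already collected (revision identifier, registration time) pairs, for example from describe-task-definition, and you want to keep the newest few revisions regardless of age as rollback targets (the keep_latest safeguard is an addition here, not something this post prescribes):

```python
from datetime import datetime, timedelta, timezone


def revisions_to_deregister(
    revisions: list,
    now: datetime,
    max_age_days: int = 30,
    keep_latest: int = 5,
) -> list:
    """Pick revisions older than max_age_days, always sparing the newest ones.

    revisions: list of (identifier, registered_at) tuples.
    """
    newest_first = sorted(revisions, key=lambda item: item[1], reverse=True)
    cutoff = now - timedelta(days=max_age_days)
    return [
        identifier
        for identifier, registered_at in newest_first[keep_latest:]
        if registered_at < cutoff
    ]


now = datetime(2024, 6, 1, tzinfo=timezone.utc)  # fixed date for a repeatable demo
revisions = [
    ("myapp:1", now - timedelta(days=90)),
    ("myapp:2", now - timedelta(days=60)),
    ("myapp:3", now - timedelta(days=10)),
    ("myapp:4", now - timedelta(days=1)),
]
print(revisions_to_deregister(revisions, now, keep_latest=2))
```

Each selected identifier would then go to aws ecs deregister-task-definition --task-definition <identifier>.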

6) NAT Gateway cost blow-up #

Tasks in private subnets that hit external APIs frequently → NAT Gateway data processing fees can exceed your Fargate bill. Mitigations:

  • VPC Endpoints for AWS services you use a lot (S3, ECR, Secrets Manager) — that traffic skips NAT
  • For external API calls, keep tasks in the same AZ as the NAT to avoid cross-AZ data charges

Wrap-up #

Here is what this post covered:

  • The limits of bare EC2 ops — environment reproducibility, scaling, zero-downtime deploys, rollbacks, health checks all flow naturally with containers
  • Where ECS sits — AWS’s managed container orchestrator. EKS comes when you need k8s standardization
  • The four pieces — Cluster (grouping) / Service (keep N) / Task (running container) / Task Definition (blueprint)
  • Launch Type — EC2 (you operate, cost-optimal) vs Fargate (zero ops, higher unit price). The series goes Fargate
  • Two IAM roles — Execution Role (ECS launching the Task) vs Task Role (your code calling AWS APIs). Never blur them
  • First-deploy flow — ECR push → Cluster → Task Definition → Service (with ALB)
  • Deploy strategies — rolling (default) / blue-green (CodeDeploy) / external
  • Auto Scaling — Application Auto Scaling on CPU / memory / request count
  • Service Connect — service-to-service via mesh, no ALB hop
  • Cost — vCPU + memory + ALB + NAT. NAT is bigger than you think. Spot, Savings Plans, right-sizing
  • Pitfalls — restart loops (health / OOM), image pull permission, secret permissions, ALB unhealthy, revision sprawl, NAT cost

Up next — ECR #

Where do those images ECS pulls actually live? In the next post we go into Amazon ECR (Elastic Container Registry) in detail.

In #2 ECR — Image Registry we cover creating private repos, authentication, push / pull, image scanning, lifecycle policies, and multi-architecture images — the natural companion to ECS, all in one piece.
