Chapter 15

ECS and Fargate — Deploying Containers

Putting containers on AWS, all in one place. We cover how ECS works (vs EKS), its four building blocks — Cluster · Service · Task · Task Definition — the difference between the EC2 launch type and Fargate, the split between Execution Role and Task Role, ALB · VPC wiring, and everything from your first deployment to Auto Scaling and cost.

This book’s destination is running a fullstack app on ECS Fargate (Part 6, Deploying a fullstack app to AWS). That’s why this chapter isn’t about one simple service — it’s the chapter that answers why the whole book flows container-first. With the foundation of accounts · IAM · security · CloudWatch laid in Part 1, and the core resources from Chapter 8 EC2/VPC through Chapter 14 CloudFront under your belt in Part 2, it’s now time to step up from putting things directly on a single EC2 box.

If everything through Chapter 14 CloudFront was about “how do I build each resource,” from this chapter on the view shifts to how you put a real application on top of those resources in an operable form. The standard pattern for running an image built with Docker on AWS is ECS and Fargate, and both Part 4’s Chapter 22 ECS Fargate deployment skeleton and the Part 6 capstone are built on the model we set up in this chapter.

This chapter covers, all at once, how ECS works and how it differs from EKS, its four building blocks, the two launch types EC2 and Fargate, the two IAM roles people most often confuse, the full flow of a first deployment, and deployment strategy · Auto Scaling · cost.

The limits of putting things directly on a single EC2 box #

The flow from Chapter 9 EC2 operations — create an EC2 instance, SSH in, install nginx / docker / your code by hand, and bring it up with systemd — is enough for simple cases. But as scale grows, the pain starts.

Pain point	EC2 hand-operation
Reproducing the same environment	Differs every time due to OS patches and dependency drift
Scale-out	Build AMI → ASG → deploy — minutes at a time
Zero-downtime deployment	Complex shell scripts / separate tooling
Rollback	Snapshot → boot → shift traffic
Health checks / auto-recovery	systemd hits its limits

Solving these pain points with containers is where modern infrastructure has been heading. On AWS, the entry point is ECS.

What ECS handles #

Amazon ECS (Elastic Container Service) is AWS’s managed container orchestrator. Once you’ve defined where your Docker image runs, how many copies to run, and how traffic is routed to them, ECS operates it for you.

ECS vs EKS #

	ECS	EKS
Identity	AWS’s own orchestrator	AWS-managed Kubernetes
Learning curve	Shallow (blends well into AWS)	Steep (requires learning k8s itself)
Portability to other clouds	Low (AWS-only)	High (k8s standard)
Ecosystem	AWS tools + some community	The full k8s ecosystem (Helm, ArgoCD, etc.)
Operational burden	Low	High (Control Plane cost + operational know-how)
When it fits	Small / medium scale, AWS lock-in OK	Large scale, multi-cloud, need k8s standard

If you’re starting container operations for the first time, start with ECS. Defer EKS until you’ve finished learning k8s itself, on top of a foundation like Chapter 8 EC2/VPC. This book handles container operations with ECS Fargate; the EKS · Kubernetes route belongs to the Kubernetes book.

Another AWS container service is App Runner. It’s even simpler than ECS (image → URL in one step), but its options are narrow, so ECS on Fargate remains the standard choice for production operations.

The four building blocks of ECS #

To understand ECS, you only have to memorize four building blocks.

The 4 building blocks of ECS — top to bottom

┌──────────────────────────────────────┐
│  Cluster: the grouping unit          │
│  ┌────────────────────────────────┐  │
│  │ Service: always keeps N        │  │
│  │  ┌────────────┐ ┌────────────┐ │  │
│  │  │  Task #1   │ │  Task #2   │ │  │
│  │  │(container) │ │(container) │ │  │
│  │  └────────────┘ └────────────┘ │  │
│  │  ↑ Task Definition (blueprint) │  │
│  └────────────────────────────────┘  │
└──────────────────────────────────────┘

Task Definition — the container’s blueprint #

A single JSON file. It contains everything about what to run and how.

Which image (123456789012.dkr.ecr.ap-northeast-2.amazonaws.com/myapp:v1)
CPU / memory (512 / 1024 MB)
Environment variables / Secrets
Port mappings
Log driver (usually CloudWatch Logs)
IAM roles (Task Role + Execution Role — covered in detail later)
Health checks

task-definition.json (Fargate)

{
  "family": "myapp",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
  "taskRoleArn": "arn:aws:iam::123456789012:role/myapp-task-role",
  "containerDefinitions": [
    {
      "name": "web",
      "image": "123456789012.dkr.ecr.ap-northeast-2.amazonaws.com/myapp:v1",
      "essential": true,
      "portMappings": [{ "containerPort": 8000, "protocol": "tcp" }],
      "environment": [
        { "name": "ENV", "value": "production" }
      ],
      "secrets": [
        {
          "name": "DATABASE_URL",
          "valueFrom": "arn:aws:secretsmanager:ap-northeast-2:123456789012:secret:myapp/db-AbCdEf"
        }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/myapp",
          "awslogs-region": "ap-northeast-2",
          "awslogs-stream-prefix": "web"
        }
      }
    }
  ]
}

A Task Definition accumulates as revisions (numbered, like myapp:7). To deploy a new image, you create a new revision and change the Service to reference it.
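The revision step can be sketched in a few lines of Python. This is a minimal illustration, assuming the Task Definition is already loaded as a dict and that the image reference ends in a tag; the helper name `next_revision` is hypothetical and not part of any AWS SDK.

```python
import copy


def next_revision(task_def: dict, container_name: str, new_tag: str) -> dict:
    """Return a copy of task_def whose named container points at new_tag."""
    updated = copy.deepcopy(task_def)  # leave the original revision untouched
    for container in updated["containerDefinitions"]:
        if container["name"] == container_name:
            # Split "repo:tag" at the last colon and swap only the tag.
            repo, _, _old_tag = container["image"].rpartition(":")
            container["image"] = f"{repo}:{new_tag}"
            return updated
    raise ValueError(f"no container named {container_name!r}")


task_def = {
    "family": "myapp",
    "containerDefinitions": [
        {
            "name": "web",
            "image": "123456789012.dkr.ecr.ap-northeast-2.amazonaws.com/myapp:v1",
        }
    ],
}

v2 = next_revision(task_def, "web", "v2")
print(v2["containerDefinitions"][0]["image"])
```

Registering the returned dict is what produces the next revision number; the old revision stays available for rollback.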

Task — a running instance #

A Task is a Task Definition actually brought up as one container (or a group of containers). If the Task Definition plays the role of an AMI or launch template, the Task corresponds to the running EC2 instance.

One Task = one execution of a Task Definition revision.
A Task can hold several containers (the sidecar pattern — main app + log collector, etc.).
A Task has its own ENI (network interface) and IP (awsvpc mode).

Service — always keep N running #

If you only run a Task once, nothing brings it back when it dies. A Service is the layer on top that keeps Tasks running.

“Always keep N Tasks of this Task Definition running”
Auto-restart when one dies
Connect to an ALB / NLB to receive traffic (Chapter 13 ALB / NLB and ACM)
Deployment strategy (rolling, blue/green)
Auto Scaling (based on CPU / memory / request count)

Operational workloads (web servers, APIs, etc.) are almost always brought up as a Service. Only one-off batch jobs run a Task directly without a Service (RunTask).
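The "always keep N" idea is a reconciliation loop: compare the desired count with what is actually running and act on the difference. The sketch below is a conceptual model only, not ECS's actual scheduler; the function name and the task dict shape are assumptions made for illustration.

```python
def reconcile(desired_count: int, tasks: list) -> tuple:
    """Decide what a Service-like controller would do next.

    tasks is a list of dicts with a "status" key; only RUNNING tasks count.
    """
    running = [t for t in tasks if t["status"] == "RUNNING"]
    diff = desired_count - len(running)
    if diff > 0:
        return ("start", diff)   # a Task died: bring up replacements
    if diff < 0:
        return ("stop", -diff)   # desired count was lowered: stop the surplus
    return ("noop", 0)


tasks = [{"id": "a", "status": "RUNNING"}, {"id": "b", "status": "STOPPED"}]
print(reconcile(2, tasks))  # ('start', 1)
```

A Task started directly with RunTask has no such loop behind it, which is why it suits one-off batch jobs.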

Cluster — the grouping #

A logical grouping where Services and Tasks live. Usually split by environment.

prod-cluster
staging-cluster
dev-cluster

A Cluster is free (the Cluster itself has no cost). The cost is the resources of the Tasks running inside it. So you can split it freely by environment.

Launch Type — EC2 vs Fargate #

This is how ECS decides where to actually run a Task. There are two modes.

EC2 Launch Type #

You operate a group of EC2 instances (an ASG) and ECS schedules containers on top of them.

EC2 Launch Type

ECS Service
   │ (schedule)
   ▼
EC2 #1     EC2 #2     EC2 #3   ← you operate (ASG, AMI, patching, security)
 ▲          ▲          ▲
 container  container  container

The advantages are as follows.

Instance cost = EC2 pricing (long-term savings / Reserved / Spot)
Freedom for GPU / large memory / special instances

The drawbacks are as follows.

You have to operate EC2 itself — keeping AMIs current, OS security patching, ECS agent updates
You have to mind instance packing (binpacking).
An idle instance left running wastes that much money for that time.

Fargate Launch Type #

EC2 is invisible. You only declare the Task’s CPU / memory, and AWS handles the work for you.

Fargate Launch Type

ECS Service
   │ (schedule)
   ▼
[AWS-managed area — invisible]
   │
   ▼
container (Task)

The advantages are as follows.

EC2 operation is zero — OS patching, ASG, AMI are all done by AWS.
Billing per Task (per second with a one-minute minimum, vCPU + memory)
No waste from idle instances.

The drawbacks are as follows.

The unit price is higher than EC2 (includes the management cost).
GPU / special instances / some network options are unavailable.
Each Task is limited to 0.25 ~ 16 vCPU and 0.5 ~ 120 GB of memory, and only certain CPU / memory combinations are valid.

Which to pick #

Case	Recommendation
Small / medium traffic	Fargate — zero operational burden
Very large scale where cost dominates	EC2 + Reserved / Spot
GPU / special workload	EC2
Variable traffic / batch	Fargate Spot (up to 70% off)
Used to k8s-level control but limited to ECS	EC2 (more control over the hosts)

Parts 3 ~ 4 and the Part 6 capstone of this book all run on Fargate. It cuts the operational burden dramatically and the learning curve is gentle.

The two IAM roles — Execution Role vs Task Role #

This is the point most often confused in ECS operations.

Execution Role #

The permissions the ECS agent needs to bring up a Task. AWS uses it right before the Task starts.

Pull the image from ECR (Chapter 16 ECR)
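Create CloudWatch Logs streams and write log events (the managed policy does not create log groups, so create the group beforehand or grant logs:CreateLogGroup)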
Fetch secrets from Secrets Manager / Parameter Store (injected at Task start)

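By default, one ecsTaskExecutionRole per account is enough (granted the AWS-managed policy AmazonECSTaskExecutionRolePolicy). That managed policy covers the ECR pull and the log writes; reading secrets needs an additional statement scoped to the specific secret or parameter ARNs.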

Task Role #

The permissions the code inside the container uses when it calls the AWS API. Used at runtime.

boto3.client("s3").get_object(...) in the code → S3 access
dynamodb.get_item(...) in the code → DynamoDB access

The principle is to create a separate, least-privilege Task Role for each app (the least-privilege pattern of Chapter 6 Security basics).

Separating permissions

Execution Role  →  used by ECS (pull image, create logs, inject secrets)
Task Role       →  used by my code (call S3, DynamoDB, SQS, etc.)

Piling both sets of permissions into one role gives your application code far more access than it needs, which is how security incidents start.
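To make the split concrete, here is a sketch that builds the policy documents for each role as plain Python dicts. It assumes the app reads one secret at startup and reads objects from one S3 bucket at runtime; the bucket name `myapp-uploads` and both helper names are hypothetical, while the secret ARN is the one from the Task Definition above.

```python
import json

# Both roles are assumed by ECS tasks, so they can share this trust policy.
TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "ecs-tasks.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}


def execution_role_secret_policy(secret_arn: str) -> dict:
    """Extra statement for the Execution Role: ECS reads the secret at Task start."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["secretsmanager:GetSecretValue"],
                "Resource": [secret_arn],
            }
        ],
    }


def task_role_s3_policy(bucket: str) -> dict:
    """Policy for the Task Role: the app code reads objects at runtime."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            }
        ],
    }


exec_policy = execution_role_secret_policy(
    "arn:aws:secretsmanager:ap-northeast-2:123456789012:secret:myapp/db-AbCdEf"
)
task_policy = task_role_s3_policy("myapp-uploads")  # hypothetical bucket name
print(json.dumps(task_policy, indent=2))
```

Note that neither document grants the other role's actions: the Execution Role cannot read the bucket, and the Task Role cannot read the secret.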

First deployment — Hello, ECS #

Let’s walk through the complete flow once. We assume you already have a Docker image.

1) Push the image to ECR #

Chapter 16 ECR covers this in detail, but let’s look at just the flow ahead of time.

ECR push

# authenticate
aws ecr get-login-password --region ap-northeast-2 \
  | docker login --username AWS --password-stdin \
    123456789012.dkr.ecr.ap-northeast-2.amazonaws.com

# build + tag + push
docker build -t myapp .
docker tag myapp:latest \
  123456789012.dkr.ecr.ap-northeast-2.amazonaws.com/myapp:v1
docker push \
  123456789012.dkr.ecr.ap-northeast-2.amazonaws.com/myapp:v1

2) Create the Cluster #

Cluster

aws ecs create-cluster --cluster-name prod-cluster

It’s also one click in the console. Again, it’s free.

3) Register the Task Definition #

Put the JSON above into a file (task-definition.json) and register it.

aws ecs register-task-definition \
  --cli-input-json file://task-definition.json

On success, the myapp:1 revision is created.

4) Create the Service (with an ALB) #

We proceed with the ALB’s Target Group (Chapter 13 ALB / NLB and ACM) already created. Because Fargate Tasks use awsvpc networking, the Target Group must use target type ip.

Service

aws ecs create-service \
  --cluster prod-cluster \
  --service-name myapp \
  --task-definition myapp:1 \
  --desired-count 2 \
  --launch-type FARGATE \
  --network-configuration "awsvpcConfiguration={subnets=[subnet-aaa,subnet-bbb],securityGroups=[sg-xxx],assignPublicIp=DISABLED}" \
  --load-balancers "targetGroupArn=arn:aws:elasticloadbalancing:...,containerName=web,containerPort=8000"

The moment you run this command, ECS does the following.

Bring up 2 Tasks on Fargate
Register each Task’s private IP (from its ENI) in the Target Group
The ALB routes traffic after the health check passes

Access via the ALB’s DNS (or your domain from Chapter 12 Route 53) and you’re done.
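If you drive deployments from Python rather than the CLI, the same request can be expressed as keyword arguments for boto3's `create_service`. The sketch below only builds the request dict so it can be inspected without AWS credentials; the helper name is hypothetical, and the Target Group ARN is passed in as a parameter because it depends on your own ALB setup.

```python
def build_create_service_request(target_group_arn, subnets, security_groups):
    """Mirror the CLI example above as keyword arguments for create_service."""
    return {
        "cluster": "prod-cluster",
        "serviceName": "myapp",
        "taskDefinition": "myapp:1",
        "desiredCount": 2,
        "launchType": "FARGATE",
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "subnets": list(subnets),
                "securityGroups": list(security_groups),
                "assignPublicIp": "DISABLED",  # private subnets, as in the CLI example
            }
        },
        "loadBalancers": [
            {
                "targetGroupArn": target_group_arn,
                "containerName": "web",   # must match the container name in the Task Definition
                "containerPort": 8000,
            }
        ],
    }


request = build_create_service_request(
    "arn:aws:elasticloadbalancing:...",  # placeholder, left elided as in the CLI example
    ["subnet-aaa", "subnet-bbb"],
    ["sg-xxx"],
)
print(request["serviceName"], request["desiredCount"])

# With credentials configured, the call itself would be:
#   import boto3
#   boto3.client("ecs").create_service(**request)
```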

5) Deploy a new version #

new version

# push a new image (myapp:v2)
docker tag myapp:v2 ...; docker push ...

# new Task Definition revision (just change the image tag and register again)
aws ecs register-task-definition --cli-input-json file://task-definition-v2.json
# → myapp:2

# update the Service to use the new revision
aws ecs update-service \
  --cluster prod-cluster \
  --service myapp \
  --task-definition myapp:2

ECS handles it for you with a rolling update — it brings up 2 new Tasks, and once they pass the health check, it terminates the 2 old Tasks. No service interruption.

A Service’s deployment options #

The default is a rolling update, but there are two more.

Rolling Update (default) #

You tune it with two knobs: minimumHealthyPercent (default 100) and maximumPercent (default 200).

minHealthy=100, maxPercent=200 → with desired=2, up to 4 at one moment (2 new + 2 old), then the old ones terminate. Zero downtime.
minHealthy=50, maxPercent=100 → terminate 1 old → 1 new → terminate 1 old → 1 new. Cost savings.
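The two knobs translate into a lower and an upper bound on the number of Tasks during a deployment. A small sketch of that arithmetic, assuming the lower bound rounds up and the upper bound rounds down (the function name is hypothetical):

```python
def rolling_bounds(desired: int, minimum_healthy_percent: int = 100,
                   maximum_percent: int = 200) -> tuple:
    """Return (min running Tasks, max running Tasks) during a rolling update."""
    # Integer arithmetic avoids floating-point surprises.
    lower = -(-desired * minimum_healthy_percent // 100)  # ceiling division
    upper = desired * maximum_percent // 100              # floor division
    return lower, upper


print(rolling_bounds(2))           # (2, 4)
print(rolling_bounds(2, 50, 100))  # (1, 2)
```

With the defaults the Service never drops below 2 Tasks and may briefly run 4; with 50 / 100 it never exceeds 2 but may run on a single Task mid-deploy.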

Blue / Green (CodeDeploy integration) #

You build an entire new environment (green) and switch the ALB’s listener all at once. Rollback is instant.

External (Spinnaker / your own controller) #

You delegate “how to deploy” to an external tool. In practice this is used mostly by large organizations.

Auto Scaling — grow with traffic #

You attach Application Auto Scaling on top of a Service to adjust the desired count automatically.

keep average CPU at 60%

aws application-autoscaling register-scalable-target \
  --service-namespace ecs \
  --resource-id service/prod-cluster/myapp \
  --scalable-dimension ecs:service:DesiredCount \
  --min-capacity 2 --max-capacity 10

aws application-autoscaling put-scaling-policy \
  --service-namespace ecs \
  --resource-id service/prod-cluster/myapp \
  --scalable-dimension ecs:service:DesiredCount \
  --policy-name cpu60 \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration file://cpu-60.json

Inside cpu-60.json you set TargetValue to 60.0 and PredefinedMetricSpecification.PredefinedMetricType to ECSServiceAverageCPUUtilization.
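For reference, the file contents can be generated with a short script. This is a minimal sketch that prints the JSON to stdout; redirect it to cpu-60.json to use it with the command above. Only the two fields named in the text are set.

```python
import json

# Target tracking configuration: keep the Service's average CPU near 60%.
config = {
    "TargetValue": 60.0,
    "PredefinedMetricSpecification": {
        "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
    },
}

text = json.dumps(config, indent=2)
print(text)
```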

The default candidates for a scaling trigger are as follows.

ECS Service average CPU
ECS Service average memory
The ALB’s RequestCountPerTarget (request-count based)
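As a rough mental model, target tracking adjusts the Task count in proportion to how far the metric is from its target. The sketch below is a simplified proportional model under that assumption, not the service's actual algorithm, which also involves CloudWatch alarms and cooldown periods; the function name is hypothetical.

```python
import math


def proposed_desired_count(current: int, metric: float, target: float,
                           min_capacity: int, max_capacity: int) -> int:
    """Scale the Task count in proportion to metric / target, then clamp."""
    raw = math.ceil(current * metric / target)
    return max(min_capacity, min(max_capacity, raw))


# 2 Tasks at 90% CPU with a 60% target: roughly 3 Tasks are needed.
print(proposed_desired_count(2, 90.0, 60.0, 2, 10))   # 3
# 4 Tasks at 30% CPU: 2 would be enough, and 2 is also the minimum.
print(proposed_desired_count(4, 30.0, 60.0, 2, 10))   # 2
```

The clamp is why min-capacity and max-capacity in register-scalable-target matter: they bound what the policy can do in either direction.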

Service Connect — communication between services #

This is how several microservices call each other on ECS. There are two options.

1) Via ALB / NLB #

You put an ALB in front of each service. Service A → https://service-b.internal/ (Route 53 private hosted zone) → ALB → Service B.

The advantage is that it’s standard HTTP, so it’s consistent with the outside; the drawback is the ALB cost and one extra hop.

2) Service Connect (ECS-native) #

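ECS automatically inserts a proxy sidecar (Envoy-based) next to the container, so it behaves like a mesh. Services register in a shared namespace and reach each other by short DNS names inside it (with the config below, http://web:8000). The portName must match a named entry in the Task Definition’s portMappings, so the port mapping needs a "name" field for this to work.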

Service Connect config (summary)

{
  "serviceConnectConfiguration": {
    "enabled": true,
    "namespace": "myapp",
    "services": [
      {
        "portName": "web",
        "discoveryName": "web",
        "clientAliases": [{ "port": 8000, "dnsName": "web" }]
      }
    ]
  }
}

For small systems, just one ALB is enough. Consider Service Connect when you have several microservices.

Cost — where it comes from #

This is for Fargate.

Cost = vCPU + memory + network

cost = (vCPU hours) × $0.0506
     + (memory GB hours) × $0.0055
     + (Data Transfer)

example: one 0.5 vCPU + 1GB Fargate for a month (730h)
   = 0.5 × 0.0506 × 730 + 1 × 0.0055 × 730
   = $18.5  +  $4.0
   = $22.5 / month  (roughly, Seoul region)
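The same arithmetic as a small function, useful for trying other sizes. The two rates are the ones quoted above and are treated here as assumptions; check the current Fargate price list for your region before relying on the result. Data transfer is left out.

```python
VCPU_HOUR_USD = 0.0506   # per vCPU-hour, rate quoted in this chapter
GB_HOUR_USD = 0.0055     # per GB-hour of memory, rate quoted in this chapter


def fargate_monthly_cost(vcpu: float, memory_gb: float,
                         tasks: int = 1, hours: float = 730) -> float:
    """Compute-only monthly cost in USD for `tasks` identical Fargate Tasks."""
    per_task = vcpu * VCPU_HOUR_USD * hours + memory_gb * GB_HOUR_USD * hours
    return per_task * tasks


cost = fargate_monthly_cost(0.5, 1)
print(round(cost, 2))  # about 22.48, matching the $22.5 estimate above
```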

On top of that, the following apply.

ALB: per hour + per LCU
NAT Gateway (when going out to the internet from a private subnet): per hour + per GB
CloudWatch Logs: ingest GB + storage GB

The NAT Gateway is surprisingly large. Its hourly charge alone runs around $30/month or more, before the per-GB processing charge, so for a small service the NAT cost can exceed Fargate itself. Cost optimization is covered in earnest in Chapter 27 Cost optimization.

Cost-saving options #

Fargate Spot: up to 70% off for variable / batch workloads. AWS can reclaim the capacity with only a two-minute warning, so use it only for stateless workloads.
Compute Savings Plans: up to 50% off with a 1- or 3-year commitment
Right-sizing: check actual usage with CloudWatch Container Insights, then cut vCPU / memory — this is the item with the biggest effect.

Pitfalls you’ll often hit #

1) A Task keeps dying and coming back #

Since the Service auto-restarts, on the surface it “looks like it’s working,” but in reality the container is terminating right after start. The causes are as follows.

Health check failure (the app comes up late, so the ALB judges it unhealthy)
Immediate exit on an error inside the container
Out of memory (OOM killed)

Check the Task’s stopped reason with describe-tasks, and read the container’s own output in CloudWatch Logs (Chapter 7 CloudWatch intro).

aws ecs describe-tasks --cluster prod-cluster \
  --tasks <task-id> --query 'tasks[0].stoppedReason'
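When Tasks cycle repeatedly, it helps to sort the stop reasons into the three causes listed above. The sketch below classifies one entry of the describe-tasks JSON output using substring checks; the labels, the function name, and the sample payload are assumptions for illustration, and exit code 137 is only a hint of an OOM kill, not proof.

```python
import json


def classify_stop(task: dict) -> str:
    """Map one describe-tasks entry to a rough cause label."""
    stopped_reason = task.get("stoppedReason", "")
    containers = task.get("containers", [])
    container_reasons = " ".join(c.get("reason", "") for c in containers)
    exit_codes = [c.get("exitCode") for c in containers]

    if "health check" in stopped_reason.lower():
        return "health-check"
    if "OutOfMemory" in container_reasons or 137 in exit_codes:
        return "possible-oom"
    if any(code not in (None, 0) for code in exit_codes):
        return "app-exit"
    return "unknown"


# Hypothetical sample shaped like `aws ecs describe-tasks --output json`.
sample = json.loads("""
{"tasks": [{
  "stoppedReason": "Essential container in task exited",
  "containers": [{"name": "web", "exitCode": 137,
                  "reason": "OutOfMemoryError: Container killed due to memory usage"}]
}]}
""")

print(classify_stop(sample["tasks"][0]))  # possible-oom
```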

2) Insufficient image pull permission #

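If “CannotPullContainerError” comes up right after the Task starts, the usual causes are a missing ECR permission on the Execution Role or no network route to ECR (a private subnet with neither a NAT Gateway nor VPC Endpoints). Check first that you attached the AWS-managed AmazonECSTaskExecutionRolePolicy, then check the subnet’s routing.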

3) Secret injection doesn’t work #

If secret injection fails (on Fargate the Task usually stops at startup with a ResourceInitializationError rather than starting with empty values), the Execution Role most likely lacks secretsmanager:GetSecretValue / ssm:GetParameters permission on the Secrets Manager / Parameter Store ARN. This is covered in detail in Chapter 20 Secrets Manager / Parameter Store.

4) ALB Target is unhealthy #

The deployment went through, but the ALB health check fails. Common causes are as follows.

The health check path doesn’t exist in the app (forgot the /health endpoint)
A Security Group blocks ALB → Task traffic
The app binds to 127.0.0.1 instead of 0.0.0.0 (unreachable from outside the container)
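The first and third causes can both be seen in a minimal app. The sketch below uses only the Python standard library and assumes the health check path is /health and the container port is 8000, as in the Task Definition above; a real app would use its own framework, but the binding rule is the same.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            body = b"ok"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep health-check requests out of the logs


def make_server(port: int = 8000, host: str = "0.0.0.0") -> ThreadingHTTPServer:
    """Bind to 0.0.0.0 so the ALB can reach the port from outside the container."""
    return ThreadingHTTPServer((host, port), HealthHandler)


# In the container's entrypoint you would run:
#   make_server().serve_forever()
```

Binding to 127.0.0.1 instead would make the same server answer only requests from inside the container, and the ALB health check would time out.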

5) Task Definition revisions explode #

They pile up endlessly, like v1 → v2 → … → v847. If you don’t clean them up yourself, the console gets heavy. Set an operational policy to auto-clean revisions unused for over 30 days, or let your IaC clean them up.
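A cleanup policy like this comes down to a selection rule. The sketch below picks revisions to deregister, assuming you have already collected each revision's registration time and the set of revisions that Services currently reference (for example via boto3's list_task_definitions and describe_services). It approximates "unused for over 30 days" by registration age, and the record shape and function name are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def revisions_to_deregister(revisions, in_use, now, max_age_days=30):
    """Select old revisions that no Service references; always keep the newest."""
    if not revisions:
        return []
    cutoff = now - timedelta(days=max_age_days)
    latest = max(revisions, key=lambda r: r["revision"])["id"]
    return [
        r["id"]
        for r in revisions
        if r["id"] not in in_use
        and r["id"] != latest
        and r["registered_at"] < cutoff
    ]


now = datetime(2025, 6, 1, tzinfo=timezone.utc)
revisions = [
    {"id": "myapp:1", "revision": 1, "registered_at": now - timedelta(days=90)},
    {"id": "myapp:2", "revision": 2, "registered_at": now - timedelta(days=45)},
    {"id": "myapp:3", "revision": 3, "registered_at": now - timedelta(days=2)},
]

print(revisions_to_deregister(revisions, in_use={"myapp:2"}, now=now))  # ['myapp:1']
```

Each selected identifier could then be passed to deregister_task_definition; keeping the in-use check separate from the age check avoids removing a revision that a long-lived Service still depends on.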

6) NAT Gateway cost explosion #

If Tasks in a private subnet frequently call external APIs, the NAT Gateway’s Data Processing charge can exceed the Fargate charge itself. The alternatives are as follows.

VPC Endpoint (for frequently used services like S3, ECR, Secrets Manager) — traffic skips the NAT.
If you make many external API calls, run a NAT Gateway in each AZ and route each private subnet through the NAT in its own AZ to avoid cross-AZ traffic costs.

Exercises #

Write in one sentence your service’s traffic pattern (constant, highly variable, or batch-like), and from the table in §“Launch Type — EC2 vs Fargate,” pick which launch type fits and note why. In Chapter 22 ECS Fargate deployment skeleton, you’ll make the same choice again in Terraform.
Without looking, write one sentence each for what the Execution Role and the Task Role are used for. Then, assuming your app uses both S3 and Secrets Manager, connect which permission goes into which role, basing it on §“The two IAM roles” (think of it together with the least privilege of Chapter 6 Security basics).
Following the calculation in §“Cost — where it comes from,” compute roughly the cost of running two 1 vCPU + 2GB Fargate Tasks for a month. If an ALB and a NAT Gateway are added to that, write in one sentence which item you could reduce with the VPC Endpoint from Chapter 16 ECR.

In short: ECS is AWS’s managed container orchestrator, built from four building blocks — Cluster, Service, Task, and Task Definition. The launch type splits into EC2, which carries a heavy operational burden, and Fargate, which removes server management, and this book is Fargate-based. The Execution Role is the permission ECS uses to bring up a Task, and the Task Role is the permission the code uses to call the AWS API — never confuse the two. The first deployment goes ECR push → Cluster → Task Definition → Service (ALB connection), a rolling update is the default, and cost is vCPU, memory, plus ALB and NAT, where the NAT is surprisingly large.

Next chapter #

The next Chapter 16 ECR covers where the images ECS runs come from. From creating a private repo, IAM authentication, push / pull, image scanning, and lifecycle policies to multi-architecture images — it puts the image registry, ECS’s companion, all in one place.