Chapter 27

Cost Optimization and Dashboards

Cost Explorer analysis, Savings Plans / Spot / Graviton, Right Sizing, tag enforcement and cost classification, and the FinOps area. Patterns that actually cut a production system's cost, wrapping up Part 4, 'From the console to code.'

In Chapter 22 ~ Chapter 26 — infrastructure / DB / CI/CD / IaC / monitoring — we’ve assembled an operable system. The last remaining topic is how much it’s costing, and how to cut that cost.

This chapter is the last of Part 4, “From the console to code.” Chapter 3 (cost management) covered the basics of billing alerts and Cost Explorer; this chapter builds on that to actually cut a production system’s cost. At the end of the chapter we organize the layers we’ve stacked so far and lay a bridge into Part 5, operations · security · cost.

Where the bill leaks #

A typical small operation (ECS Fargate + RDS + ALB + CloudFront + Logs) has a monthly bill breakdown like this.

Resource	Share	Meaning
ECS Fargate (vCPU + memory time)	30 ~ 50%	the biggest cost item
RDS (instance + Storage + IO)	20 ~ 30%	2x for Multi-AZ
NAT Gateway / Egress	10 ~ 20%	the often-forgotten cost item
ALB / traffic	5 ~ 10%	hours + LCU
CloudWatch Logs / Metrics	5 ~ 10%	runs away when retention is omitted
S3 / ECR	2 ~ 5%	accumulating images / objects
Other	5%	DNS, KMS, Secrets, …

If you look at this table and think “that resembles my bill,” the patterns below apply.

1) Cost Explorer — start with where the money goes #

Cost Explorer slices and dices the bill. The analyses you’ll use most often in the console are as follows.

Frequently used analyses

1) by service       — Fargate vs RDS vs Logs (the biggest cost items)
2) by tag           — env=prod vs env=dev (per-environment cost)
3) by usage type    — DataTransfer-Out-Bytes vs BoxUsage, etc.
4) by region        — resources sleeping in another region (Chapter 1 pitfall)
5) by time trend    — an item that suddenly rose since yesterday

From the CLI too #

This month's cost by service

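# End is exclusive, so this covers the 1st of the month through yesterday (UTC).
# On the 1st itself Start equals End, and the call is rejected.
aws ce get-cost-and-usage \
  --time-period Start=$(date -u +%Y-%m-01),End=$(date -u +%Y-%m-%d) \
  --granularity DAILY \
  --metrics BlendedCost \
  --group-by Type=DIMENSION,Key=SERVICE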
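The JSON this command returns is nested by day and then by group, so a per-service total for the month still has to be summed. Below is a minimal Python sketch of that step, assuming the response has already been loaded into a dict. `total_by_service` and the `sample` data are illustrative names, not part of the AWS CLI, and the amounts are made up.

```python
from collections import defaultdict

def total_by_service(response: dict, metric: str = "BlendedCost") -> dict:
    """Sum the daily amounts of a get-cost-and-usage response per service."""
    totals = defaultdict(float)
    for day in response["ResultsByTime"]:
        for group in day["Groups"]:
            service = group["Keys"][0]  # single key: we grouped by SERVICE only
            # The API returns amounts as strings, so convert before adding
            totals[service] += float(group["Metrics"][metric]["Amount"])
    # Biggest cost item first
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))

# Hypothetical two-day response: real field names, made-up amounts
sample = {
    "ResultsByTime": [
        {"TimePeriod": {"Start": "2024-05-01", "End": "2024-05-02"},
         "Groups": [
             {"Keys": ["Amazon Elastic Container Service"],
              "Metrics": {"BlendedCost": {"Amount": "4.10", "Unit": "USD"}}},
             {"Keys": ["Amazon Relational Database Service"],
              "Metrics": {"BlendedCost": {"Amount": "2.50", "Unit": "USD"}}},
         ]},
        {"TimePeriod": {"Start": "2024-05-02", "End": "2024-05-03"},
         "Groups": [
             {"Keys": ["Amazon Elastic Container Service"],
              "Metrics": {"BlendedCost": {"Amount": "4.30", "Unit": "USD"}}},
             {"Keys": ["Amazon Relational Database Service"],
              "Metrics": {"BlendedCost": {"Amount": "2.50", "Unit": "USD"}}},
         ]},
    ]
}

for service, amount in total_by_service(sample).items():
    print(f"{service:45s} {amount:8.2f} USD")
```

Sorting by amount puts the biggest cost item at the top, which is usually the first thing you want to see.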

Cost Anomaly Detection #

ML-based anomaly detection. It flags spend that deviates from the usual pattern; to be notified, attach an alert subscription to the monitor.

Anomaly monitor (by service)

aws ce create-anomaly-monitor \
  --anomaly-monitor '{
    "MonitorName": "blog-services",
    "MonitorType": "DIMENSIONAL",
    "MonitorDimension": "SERVICE"
  }'

Where the billing alert from Chapter 3 (cost management) fires “when a threshold is crossed,” Cost Anomaly Detection fires “when it’s different from usual.” It’s the tool for catching subtle leaks.
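The difference between the two kinds of alert can be sketched in a few lines of Python. This is not the model AWS uses. A plain mean and standard deviation rule stands in for it here, and the function names, the 3-sigma cutoff, and the daily figures are all assumptions made for illustration.

```python
from statistics import mean, pstdev

def crosses_threshold(today: float, threshold: float) -> bool:
    """Billing-alert style: fire only when a fixed limit is exceeded."""
    return today > threshold

def deviates_from_usual(history: list, today: float, sigmas: float = 3.0) -> bool:
    """Anomaly style: fire when today is far from the recent baseline."""
    baseline = mean(history)
    spread = pstdev(history)
    if spread == 0:
        return today != baseline
    return abs(today - baseline) > sigmas * spread

# Made-up daily cost in USD for the last week, then a jump
history = [10.2, 9.8, 10.1, 10.0, 9.9, 10.3, 10.0]
today = 13.0

print(crosses_threshold(today, threshold=50.0))  # the fixed limit is still far away
print(deviates_from_usual(history, today))       # but the day is clearly unusual
```

A 30% jump on a small bill stays well under a typical fixed threshold, which is why the two safeguards complement each other.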

2) Compute cost — the three Fargate points #

A) Right Sizing — only as much as you really need #

Look at the average CPU / memory utilization in CloudWatch Container Insights (Chapter 26) and adjust the task size.

A Right Sizing example

current:  cpu=1024, memory=2048
observed: avg CPU 15%, p95 35%, memory avg 30%
adjusted: cpu=512, memory=1024  → 50% cost reduction

A healthy level is average CPU in the 30 ~ 50% range. An average below 20% means the task is oversized, but leave headroom for bursts when you shrink it.
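The example above can be checked with a little arithmetic. The sketch below is one possible rule of thumb, not Compute Optimizer’s algorithm: pick the smallest Fargate CPU size that keeps the observed p95 load at or below a target utilization. `recommend_cpu`, `hourly_cost`, and the 70% target are assumptions, and the two prices are assumed us-east-1 Linux/x86 on-demand rates, so check the current price list before relying on them.

```python
FARGATE_CPU_UNITS = [256, 512, 1024, 2048, 4096]  # common Fargate task sizes

def recommend_cpu(current_cpu: int, p95_pct: int, target_p95_pct: int = 70) -> int:
    """Smallest size whose p95 utilization would stay at or below the target."""
    # Integer ceiling division avoids floating-point rounding surprises
    needed = -(-current_cpu * p95_pct // target_p95_pct)
    for size in FARGATE_CPU_UNITS:
        if size >= needed:
            return size
    return FARGATE_CPU_UNITS[-1]

def hourly_cost(cpu_units: int, memory_mib: int,
                vcpu_price: float = 0.04048, gb_price: float = 0.004445) -> float:
    """Fargate bills vCPU-hours and GB-hours separately (assumed prices)."""
    return cpu_units / 1024 * vcpu_price + memory_mib / 1024 * gb_price

new_cpu = recommend_cpu(current_cpu=1024, p95_pct=35)
saving = 1 - hourly_cost(new_cpu, 1024) / hourly_cost(1024, 2048)
print(new_cpu, f"{saving:.0%}")
```

Because both price components scale linearly, halving CPU and memory together halves the task’s cost regardless of the exact prices. The memory size still has to be one that Fargate allows for the chosen CPU size.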

In small environments, Compute Optimizer gives recommendations automatically. Just turn it on once in the console.

B) Fargate Spot — 70% cheaper #

Run batch-like / restartable tasks on Fargate Spot. The discount is up to about 70% off the on-demand price, in exchange for the risk of interruption.

capacity provider strategy

resource "aws_ecs_service" "this" {
  # name, cluster, task_definition, desired_count, etc. omitted

  capacity_provider_strategy {
    capacity_provider = "FARGATE"
    base              = 2     # always keep 2 tasks on-demand
    weight            = 1
  }

  capacity_provider_strategy {
    capacity_provider = "FARGATE_SPOT"
    weight            = 4     # beyond the base, 4 Spot tasks per on-demand task
  }
}

The pattern above always keeps 2 tasks on-demand; tasks beyond that are split between Spot and on-demand at a 4:1 ratio. When load drops, it clears Spot first.
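To see what base and weight mean for a concrete task count, the following Python sketch approximates the placement: fill the base first, then divide the rest in proportion to the weights. `split_tasks` is a hypothetical helper, and the rounding rule is an assumption; ECS’s exact tie-breaking is not reproduced here.

```python
def split_tasks(desired: int, strategy: list) -> dict:
    """Approximate how many tasks land on each capacity provider."""
    counts = {s["capacity_provider"]: 0 for s in strategy}
    remaining = desired

    # Step 1: the base is filled first
    for s in strategy:
        take = min(s.get("base", 0), remaining)
        counts[s["capacity_provider"]] += take
        remaining -= take

    # Step 2: whatever is left is divided in proportion to the weights
    total_weight = sum(s["weight"] for s in strategy)
    if remaining > 0 and total_weight > 0:
        shares = [(s["capacity_provider"], remaining * s["weight"] / total_weight)
                  for s in strategy]
        for provider, share in shares:
            counts[provider] += int(share)
        leftover = desired - sum(counts.values())
        # Tasks lost to rounding down go to the largest fractional shares
        by_fraction = sorted(shares, key=lambda ps: ps[1] - int(ps[1]), reverse=True)
        for provider, _ in by_fraction[:leftover]:
            counts[provider] += 1
    return counts

strategy = [
    {"capacity_provider": "FARGATE", "base": 2, "weight": 1},
    {"capacity_provider": "FARGATE_SPOT", "weight": 4},
]
print(split_tasks(12, strategy))
print(split_tasks(2, strategy))
```

With 12 desired tasks, 2 go to the base and the remaining 10 split 2 on-demand to 8 Spot. With only 2 desired tasks, everything stays on-demand.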

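Fargate Spot gives a task a two-minute (~120 seconds) warning before reclaiming it. ECS then launches a replacement, but if Spot capacity is short the service can run below its desired count for a while. For production traffic, keep only part of the fleet on Spot; 100% Spot is a big risk.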

C) Graviton (ARM) — 20% cheaper + 20% faster #

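db.t4g.* (RDS), the Fargate ARM option, and EC2 Graviton instances (m7g, c7g) all run on Graviton, AWS’s own ARM chips. The figures in the heading are rough; the actual gain depends on the workload. If your container image can be built for ARM, there’s little reason not to use them.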

Multi-arch build

# build
docker buildx build --platform linux/amd64,linux/arm64 \
  -t $REPO/blog-api:v1 --push .

Fargate ARM64

resource "aws_ecs_task_definition" "this" {
  # family, container_definitions, network_mode, etc. omitted
  cpu    = "512"
  memory = "1024"
  runtime_platform {
    cpu_architecture        = "ARM64"
    operating_system_family = "LINUX"
  }
}

The precondition is that every library you use is ARM-compatible. Most Python / Node / Go packages are fine, but some native bindings need verification.

3) Savings Plans / Reserved Capacity #

Commitment discounts for Fargate / EC2 / Lambda, plus Reserved Instances for RDS.

Kind	Discount	Commitment
Compute Savings Plan	up to 66%	1 year / 3 years, $/h commitment
EC2 Instance SP	up to 72%	commit down to the instance family
RDS Reserved	up to 65%	instance class + region

Compute SP is the most flexible: it applies to all of Fargate, EC2, and Lambda. Review it once you enter stable operation. Never commit at the start, when traffic and architecture are still in flux.

Guide

operation start ~ 3 months : no commitment (fast-changing environment)
3 months ~ 6 months        : usage analysis, start reviewing a 1-year SP
6 months +                 : commit a 1-year for 60~70% of stable usage

Commit to the full 100% and you lose out when traffic drops. Always leave a safety margin.
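Why a partial commitment is safer can be shown with a simplified model: the commitment is paid every hour whether or not it is used, and usage beyond what it covers is billed on-demand. In the Python sketch below, `hourly_bill`, the $10/h baseline, and the 30% discount are assumptions chosen for round numbers; the real discount depends on term, payment option, and service.

```python
def hourly_bill(usage_od: float, coverage: float,
                baseline_od: float = 10.0, discount: float = 0.30) -> float:
    """
    usage_od    : current usage, valued at on-demand prices ($/h)
    coverage    : fraction of the baseline usage that was committed (0..1)
    baseline_od : the stable usage the commitment was sized against ($/h)
    """
    covered_od = coverage * baseline_od         # on-demand value the plan absorbs
    commitment = covered_od * (1 - discount)    # paid every hour, used or not
    overflow = max(0.0, usage_od - covered_od)  # the rest is billed on-demand
    return commitment + overflow

for usage in (10.0, 5.0):
    print(f"usage ${usage:>4.1f}/h:",
          f"no plan {hourly_bill(usage, 0.0):.2f},",
          f"65% {hourly_bill(usage, 0.65):.2f},",
          f"100% {hourly_bill(usage, 1.0):.2f}")
```

At the baseline, the 100% commitment is cheapest. Once usage halves, it costs more than having no plan at all, while the 65% commitment still comes out ahead.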

4) Storage / Logs — the most frequently leaking items #

CloudWatch Logs #

Retention was emphasized in Chapter 26. Apply it to every log group; a group created without a retention setting keeps its logs forever.

30 days for all, with Terraform

resource "aws_cloudwatch_log_group" "ecs" {
  for_each          = toset(["/ecs/blog-api", "/ecs/blog-api-migrate"])
  name              = each.key
  retention_in_days = 30
}

S3 #

Automatically move old objects to a cheaper class.

S3 lifecycle

resource "aws_s3_bucket_lifecycle_configuration" "logs" {
  bucket = aws_s3_bucket.logs.id
  rule {
    id     = "to-ia-then-glacier"
    status = "Enabled"
    filter {} # empty filter: the rule applies to every object in the bucket
    transition {
      days          = 30
      storage_class = "STANDARD_IA"
    }
    transition {
      days          = 90
      storage_class = "GLACIER"
    }
    expiration { days = 365 }
  }
}
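Read as a timeline, the rule above assigns each object a storage class by age. A small Python sketch of that mapping follows; `storage_class_at` is a hypothetical helper, and real transitions are applied by S3 asynchronously, so an object may move somewhat later than the exact day.

```python
from typing import Optional

def storage_class_at(age_days: int) -> Optional[str]:
    """Storage class the lifecycle rule aims for at a given object age."""
    if age_days >= 365:
        return None            # expired: the object is deleted
    if age_days >= 90:
        return "GLACIER"
    if age_days >= 30:
        return "STANDARD_IA"
    return "STANDARD"

for age in (0, 30, 90, 365):
    print(age, storage_class_at(age))
```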

ECR #

Automatically delete old images.

ECR lifecycle

resource "aws_ecr_lifecycle_policy" "blog_api" {
  repository = aws_ecr_repository.blog_api.name
  policy = jsonencode({
    rules = [{
      rulePriority = 1
      description  = "keep only the latest 30"
      selection    = { tagStatus = "any", countType = "imageCountMoreThan", countNumber = 30 }
      action       = { type = "expire" }
    }]
  })
}
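The effect of `imageCountMoreThan = 30` is simply “sort by push time, keep the newest 30, expire the rest.” A Python sketch of that selection, with a hypothetical `images_to_expire` helper and made-up image records:

```python
def images_to_expire(images: list, keep: int = 30) -> list:
    """Everything older than the newest `keep` images is expired."""
    newest_first = sorted(images, key=lambda img: img["pushed_at"], reverse=True)
    return newest_first[keep:]

# 35 made-up images; a larger pushed_at means pushed more recently
images = [{"tag": f"v{i}", "pushed_at": i} for i in range(1, 36)]
expired = images_to_expire(images)
print(sorted(img["tag"] for img in expired))
```

Note that a rollback target older than the newest 30 images would be gone, so size the count to how far back you realistically deploy.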

5) Network — NAT and Egress #

The item that surprises people most when they first see a production bill is NAT Gateway and Egress.

NAT Gateway cost

per hour  $0.045
per GB    $0.045   (processing)
+ Egress $0.09/GB (to the internet)

Even in a small system, a single NAT is ~$32/month + traffic. Savings by pattern are as follows.

Pattern	Effect
VPC Endpoint for S3, DynamoDB	Gateway type, no charge; takes that traffic off the NAT
VPC Endpoint for ECR, Logs, Secrets	Interface type, ~$0.01/hour per AZ + ~$0.01/GB (cheaper than NAT)
Put it behind CloudFront	Origin → CloudFront free, CloudFront → user ~$0.085/GB (by region)
Single NAT (dev environment)	one NAT instead of per-AZ NAT — availability ↓
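The figures above can be turned into a quick monthly estimate. The Python sketch below uses the prices quoted in this section and assumes about 730 hours per month and an Interface Endpoint deployed in 2 AZs. The function names are illustrative, and the break-even only compares NAT processing against one endpoint’s fixed cost.

```python
HOURS_PER_MONTH = 730          # assumption: average month length in hours

NAT_HOURLY = 0.045             # $/hour per NAT Gateway
NAT_PER_GB = 0.045             # $/GB processed by the NAT
ENDPOINT_HOURLY = 0.01         # $/hour per Interface Endpoint per AZ
ENDPOINT_PER_GB = 0.01         # $/GB processed by the endpoint

def nat_monthly(gb_processed: float, gateways: int = 1) -> float:
    """Fixed hourly charge plus per-GB processing for NAT Gateways."""
    return gateways * NAT_HOURLY * HOURS_PER_MONTH + gb_processed * NAT_PER_GB

def endpoint_monthly(gb_processed: float, azs: int = 2) -> float:
    """Fixed hourly charge per AZ plus per-GB processing for one endpoint."""
    return azs * ENDPOINT_HOURLY * HOURS_PER_MONTH + gb_processed * ENDPOINT_PER_GB

def break_even_gb(azs: int = 2) -> float:
    """Monthly GB above which moving traffic from NAT to the endpoint pays off."""
    fixed = azs * ENDPOINT_HOURLY * HOURS_PER_MONTH
    return fixed / (NAT_PER_GB - ENDPOINT_PER_GB)

print(f"idle NAT:          ${nat_monthly(0):.2f}/month")
print(f"NAT with 1 TB:     ${nat_monthly(1000):.2f}/month")
print(f"endpoint with 1 TB: ${endpoint_monthly(1000):.2f}/month")
print(f"break-even:        {break_even_gb():.0f} GB/month")
```

An Interface Endpoint carries its own fixed hourly cost, so it pays off only once enough traffic moves through it. The free Gateway endpoints for S3 and DynamoDB have no such threshold.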

One line of Endpoint #

ECR Interface Endpoint

resource "aws_vpc_endpoint" "ecr_api" {
  vpc_id              = aws_vpc.this.id
  service_name        = "com.amazonaws.${var.region}.ecr.api"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.private[*].id
  security_group_ids  = [aws_security_group.endpoints.id]
  private_dns_enabled = true
}

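When an ECS task pulls an image from ECR, it pulls through the endpoint rather than the NAT, which cuts both NAT traffic and cost. For the pull to bypass the NAT completely you also need the ecr.dkr Interface Endpoint and the S3 Gateway Endpoint, because image layers are stored in S3.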

6) Tagging — making cost classifiable #

Without tags, the bill is one lump. With tags, it’s sliced by environment / team / project.

Default tags

provider "aws" {
  default_tags {
    tags = {
      Environment = var.environment
      Project     = "blog-api"
      ManagedBy   = "terraform"
      CostCenter  = "product-blog"
    }
  }
}

The provider’s default_tags applies automatically to every taggable resource that provider creates. It’s the heart of cost operations.

Activating Cost Allocation Tags #

Even if you add tags, Cost Explorer won’t classify by them unless you activate them in the console’s Billing → Cost Allocation Tags. The order is settings → activate tag → wait ~24h → available.

Tag enforcement (SCP / IAM Condition) #

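Block the creation of untagged resources. Do it with an AWS Organizations SCP or an IAM policy’s Condition. Scope the EC2 statement to the instance resource: ec2:RunInstances is also evaluated against resources such as the subnet and the image, which never carry request tags.

Deny creation if no tag

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Deny",
      "Action": "ec2:RunInstances",
      "Resource": "arn:aws:ec2:*:*:instance/*",
      "Condition": {
        "Null": { "aws:RequestTag/Environment": "true" }
      }
    },
    {
      "Effect": "Deny",
      "Action": "rds:CreateDBInstance",
      "Resource": "*",
      "Condition": {
        "Null": { "aws:RequestTag/Environment": "true" }
      }
    }
  ]
}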
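The `Null` condition reads backwards at first: `"true"` means “the key is absent.” A tiny Python model of the decision this policy makes follows. `is_denied` is a hypothetical helper for reasoning about the rule, not an IAM API, and it ignores everything else IAM evaluates.

```python
GUARDED_ACTIONS = {"ec2:RunInstances", "rds:CreateDBInstance"}

def is_denied(action: str, request_tags: dict, required_key: str = "Environment") -> bool:
    """Deny a guarded create call when the required tag is not in the request."""
    tag_is_null = required_key not in request_tags  # "Null": {...: "true"}
    return action in GUARDED_ACTIONS and tag_is_null

print(is_denied("ec2:RunInstances", {}))                        # no tag at all
print(is_denied("ec2:RunInstances", {"Environment": "dev"}))    # tag supplied
print(is_denied("s3:CreateBucket", {}))                         # action not guarded
```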

SCP and Organizations-level governance are covered in earnest in Chapter 29 security governance.

7) An operational cost dashboard #

Add a row of cost widgets to the CloudWatch Dashboard.

Cost dashboard widgets

[1] month-to-date cost (vs the same point last month)
[2] by service (Fargate / RDS / Logs / NAT / ALB)
[3] by environment (env=prod vs env=staging vs env=dev)
[4] daily trend (90 days)
[5] count of Right Sizing recommendations

Thirty minutes once a week in the on-call meeting is enough to catch worsening indicators early.

Shared responsibility — FinOps #

Large organizations have a dedicated function called FinOps that watches cost. In a small organization, developers themselves must be conscious of their own modules’ cost. That’s why tags are key. You only develop the awareness when you can see your own code’s bill.

Pitfalls you often meet with cost #

1) Resources sleeping in another region #

This is exactly the pitfall from Chapter 1 (AWS intro). In AWS Resource Explorer, or in Cost Explorer’s per-region view, look for regions you don’t use whose resource count or cost isn’t 0.

2) A PoC you didn’t `terraform destroy` #

The mistake of building a stack / environment and forgetting it. Block it with the tag + auto-cleanup Lambda pattern.

Auto cleanup

EventBridge schedule (daily at 9 AM)
    │
    ▼
Lambda
   - select: tag Project=PoC AND CreatedAt older than 7 days
   - delete resources / notify

3) Not knowing Free Tier expired #

Chapter 3 cost management’s billing alert is the first safeguard, and Cost Anomaly Detection is the second.

4) Operating on 100% Spot causes downtime #

If a Spot interruption hits several tasks at once, the service can’t fill the desired count and 5xx runs away. Always keep a base on-demand.

5) Multi-AZ RDS doubles the cost #

For a small system, Multi-AZ is a cost burden, but single-AZ in production means downtime when the AZ or the instance fails. The compromise is dev/staging single-AZ + prod Multi-AZ.

6) Not using VPC Endpoints #

In a simple NAT-only setup, high-traffic resources (Logs, S3) go through the NAT and the cost runs away. Be sure to review this at the point you enter production.

7) Architecture change after a commitment #

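A Compute SP follows you to ARM or Lambda, but an EC2 Instance SP or an RDS Reserved commitment is tied to an instance family or class: change the architecture right after buying a 3-year term and the commitment is still billed. Even a Compute SP is billed in full if usage falls below the commitment. Start with short (1-year) commitments, only in a stable environment.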

The layers stacked in Part 4 #

This is the end of Part 4, “From the console to code.” The tools you picked up in Parts 1 ~ 3 were stacked into one system in Part 4.

The layers stacked in this book

        ┌─────────────────────────────────────┐
        │ FinOps                              │   ← Chapter 27
        │ cost / tagging / commitments        │
        ├─────────────────────────────────────┤
        │ Observability                       │   ← Chapter 26
        │ Logs / Metrics / Traces             │
        ├─────────────────────────────────────┤
        │ Automation                          │   ← Chapters 24, 25
        │ CI/CD / IaC                         │
        ├─────────────────────────────────────┤
        │ Data                                │   ← Chapter 23 + Chapter 11
        │ RDS / Secrets                       │
        ├─────────────────────────────────────┤
        │ Compute                             │   ← Chapter 22 + Chapters 15~21
        │ ECS / Lambda                        │
        ├─────────────────────────────────────┤
        │ Network                             │   ← Chapters 8, 13, 14
        │ VPC / ALB / CloudFront              │
        ├─────────────────────────────────────┤
        │ Control plane                       │   ← Chapters 1~7
        │ account / IAM / cost / security     │
        └─────────────────────────────────────┘

Reading the layers above from the bottom up (control → network → compute → data → automation → observability → FinOps) gives the natural evolutionary order of operations. You’ll go through this flow again every time you build a new system.

If you’ve followed along through Part 4, you’ve completed one full lap: putting a small backend on ECS Fargate in an operable shape, reproducing it as code, deploying it automatically, observing it, and controlling its cost. Part 5 continues into operations · security · cost: what happens when this system grows larger, designing the network in more depth, standing up governance with multiple accounts, and preparing for failures.

Exercises #

1) From the ratio table in §“Where the bill leaks,” pick the two often-forgotten cost items (NAT/Egress, Logs retention), and pair which section of this chapter cuts each and how.
2) Lay out, from §2, the three points for cutting Fargate compute cost (Right Sizing / Spot / Graviton), and write one sentence each on the risk or precondition for each. Explain, in connection with §“Pitfall 4,” why 100% Spot is dangerous.
3) Explain in one paragraph why the provider’s default_tags is key in cost management, and note where, in the module structure of Chapter 25 Terraform intro, this tag goes so that it applies consistently to every environment.

In short: In a small operation, more than half the bill is often Fargate and RDS, and NAT/Egress and log retention are easy to miss. Find leaks with Cost Explorer and Anomaly Detection, cut compute with Right Sizing, Spot, Graviton, and Savings Plans, and cut storage and network with Logs, S3, ECR lifecycle, and VPC Endpoints. Make all resources classifiable with default_tags, and FinOps starts to work because developers become conscious of their own cost.

Next chapter #

Through Part 4, one system has come together in an operable shape. From here it continues into Part 5, operations · security · cost. In the next Chapter 28 VPC in depth we re-design, at operational scale, the network we’ve raced past with the default VPC and quick setups — subnet routing, NAT and VPC Endpoints, security groups and NACLs, and a multi-AZ network structure.