Cost Optimization and Dashboards
Cost Explorer analysis, Savings Plans / Spot / Graviton, Right Sizing, tag enforcement and cost classification, and the FinOps area. Patterns that actually cut a production system's cost, wrapping up Part 4, 'From the console to code.'
Across Chapters 22 to 26 (infrastructure, DB, CI/CD, IaC, monitoring) we assembled an operable system. The last remaining topic is how much it costs and how to cut that cost.
This chapter is the last of Part 4, “From the console to code.” Chapter 3 (cost management) covered the basics of billing alerts and Cost Explorer; this chapter builds on that to actually cut a production system’s cost. At the end of the chapter we review the layers stacked so far and bridge into Part 5, operations · security · cost.
Where the bill leaks #
A typical small operation (ECS Fargate + RDS + ALB + CloudFront + Logs) has a monthly bill breakdown like this.
| Resource | Share | Meaning |
|---|---|---|
| ECS Fargate (vCPU + memory time) | 30 ~ 50% | the biggest cost item |
| RDS (instance + Storage + IO) | 20 ~ 30% | 2x for Multi-AZ |
| NAT Gateway / Egress | 10 ~ 20% | the often-forgotten cost item |
| ALB / traffic | 5 ~ 10% | hours + LCU |
| CloudWatch Logs / Metrics | 5 ~ 10% | runs away when retention is omitted |
| S3 / ECR | 2 ~ 5% | accumulating images / objects |
| Other | 5% | DNS, KMS, Secrets, … |
If you look at this table and think “that resembles my bill,” the patterns below apply.
1) Cost Explorer — start with where the money goes #
Cost Explorer slices and dices the bill. The analyses used most often in the console are:
1) by service — Fargate vs RDS vs Logs (the biggest cost items)
2) by tag — env=prod vs env=dev (per-environment cost)
3) by usage type — DataTransfer-Out-Bytes vs BoxUsage, etc.
4) by region — resources sleeping in another region (Chapter 1 pitfall)
5) by time trend — an item that suddenly rose since yesterday
From the CLI too #
aws ce get-cost-and-usage \
--time-period Start=$(date -u +%Y-%m-01),End=$(date -u +%Y-%m-%d) \
--granularity DAILY \
--metrics BlendedCost \
--group-by Type=DIMENSION,Key=SERVICE
Cost Anomaly Detection #
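ML-based anomaly detection. It flags spend that deviates from the usual pattern; notifications are sent once an anomaly subscription (aws ce create-anomaly-subscription) is attached to the monitor.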
aws ce create-anomaly-monitor \
--anomaly-monitor '{
"MonitorName": "blog-services",
"MonitorType": "DIMENSIONAL",
"MonitorDimension": "SERVICE"
}'
Where the billing alert from Chapter 3 (cost management) fires “when a threshold is crossed,” Cost Anomaly Detection fires “when it’s different from usual.” It’s the tool for catching subtle leaks.
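To make the contrast concrete, here is a minimal Python sketch of the two alert styles. It is not how Cost Anomaly Detection works internally (that model is AWS's own); the trailing-window rule, the function names, and the sample figures are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def threshold_alert(daily_costs, monthly_limit):
    """Billing-alarm style: fires only once the running total crosses a fixed limit."""
    return sum(daily_costs) > monthly_limit


def anomaly_alert(daily_costs, window=7, sigmas=3.0):
    """'Different from usual': compare the latest day with the trailing window."""
    if len(daily_costs) <= window:
        return False  # not enough history to define "usual"
    history = daily_costs[-window - 1:-1]
    latest = daily_costs[-1]
    baseline = mean(history)
    # Use at least 10% of the baseline so a perfectly flat history
    # (standard deviation 0) does not alert on tiny changes.
    tolerance = max(sigmas * stdev(history), 0.10 * baseline)
    return latest - baseline > tolerance


# Hypothetical daily spend in USD: seven quiet days, then a jump.
costs = [10.0, 10.5, 9.8, 10.2, 10.1, 9.9, 10.3, 14.0]
print("threshold:", threshold_alert(costs, monthly_limit=500))
print("anomaly:  ", anomaly_alert(costs))
```

With these numbers the monthly threshold is nowhere near crossed, yet the last day stands out against the previous week, which is exactly the kind of leak a threshold alone misses.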
2) Compute cost — the three Fargate points #
A) Right Sizing — only as much as you really need #
Look at the average CPU / memory utilization in CloudWatch Container Insights (Chapter 26) and adjust the task size.
current: cpu=1024, memory=2048
observed: avg CPU 15%, p95 35%, memory avg 30%
adjusted: cpu=512, memory=1024 → 50% cost reduction
A healthy target is average CPU in the 30 ~ 50% range. An average below 20% means the task is oversized, though you should still leave headroom for bursts.
In small environments, Compute Optimizer gives recommendations automatically. Just turn it on once in the console.
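The 50% reduction in the right-sizing example above can be checked with a short Python sketch. The per-vCPU and per-GB hourly rates are assumed values for Linux/x86 Fargate in us-east-1 and will drift over time, so treat them as placeholders; the function name is hypothetical.

```python
# Assumed on-demand rates (USD); check the current Fargate price list.
VCPU_HOUR_USD = 0.04048
GB_HOUR_USD = 0.004445
HOURS_PER_MONTH = 730


def fargate_monthly_cost(cpu_units, memory_mib, tasks=1):
    """Monthly cost of always-on tasks, given ECS cpu units and memory in MiB."""
    vcpu = cpu_units / 1024
    gb = memory_mib / 1024
    hourly = vcpu * VCPU_HOUR_USD + gb * GB_HOUR_USD
    return hourly * HOURS_PER_MONTH * tasks


before = fargate_monthly_cost(1024, 2048)
after = fargate_monthly_cost(512, 1024)
saving = 1 - after / before
print(f"before ${before:.2f}/month, after ${after:.2f}/month, saving {saving:.0%}")
```

Because Fargate bills linearly on vCPU and memory, halving both halves the bill regardless of the exact rates.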
B) Fargate Spot — up to 70% cheaper #
Run batch-like / restartable tasks on Fargate Spot.
resource "aws_ecs_service" "this" {
  # other service arguments omitted

  capacity_provider_strategy {
    capacity_provider = "FARGATE"
    base              = 2 # the first 2 tasks always run on-demand
    weight            = 1
  }

  capacity_provider_strategy {
    capacity_provider = "FARGATE_SPOT"
    weight            = 4 # beyond the base, 4 Spot tasks per on-demand task
  }
}
The pattern above always keeps 2 tasks on-demand; tasks beyond that are split between Spot and on-demand at a 4:1 ratio. When load drops, it clears Spot first.
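When Spot capacity is reclaimed, the task gets a two-minute warning before it is stopped and ECS launches a replacement, but the service can run below its desired count until the replacement is up. For production traffic, keep only part of the service on Spot; 100% Spot is a big risk.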
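How a base-plus-weights strategy divides tasks can be approximated in a few lines of Python. This is a sketch only: the rounding rule and the function names are assumptions, and ECS makes the real placement decisions.

```python
def split_tasks(desired, base=2, on_demand_weight=1, spot_weight=4):
    """Approximate the FARGATE / FARGATE_SPOT split for a desired task count."""
    on_demand = min(desired, base)           # the base is filled first, on-demand
    remaining = desired - on_demand
    total_weight = on_demand_weight + spot_weight
    # Assumption: round the Spot share down; ECS may round differently.
    spot = remaining * spot_weight // total_weight
    on_demand += remaining - spot
    return {"FARGATE": on_demand, "FARGATE_SPOT": spot}


def worst_case_capacity(desired, **strategy):
    """Fraction of tasks left if every Spot task is interrupted at once."""
    split = split_tasks(desired, **strategy)
    return split["FARGATE"] / desired


for count in (2, 7, 12):
    print(count, split_tasks(count), f"{worst_case_capacity(count):.0%} survives")
```

The second function shows why the base matters: with `base=0` and all weight on Spot, the worst case leaves nothing running.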
C) Graviton (ARM) — up to ~20% cheaper + up to ~20% faster #
db.t4g.* (RDS), the Fargate ARM option, and EC2 Graviton instances (m7g, c7g) all run on Graviton, AWS’s ARM-based chips. If your container image can be built for ARM, there’s little reason not to use them.
# build a multi-arch image (amd64 + arm64) and push it
docker buildx build --platform linux/amd64,linux/arm64 \
  -t $REPO/blog-api:v1 --push .

resource "aws_ecs_task_definition" "this" {
  cpu    = "512"
  memory = "1024"

  runtime_platform {
    cpu_architecture        = "ARM64"
    operating_system_family = "LINUX"
  }
}
The premise is that all libraries you use must be ARM-compatible. Most Python / Node / Go packages are OK. Some native bindings need verification.
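One cheap way to catch an architecture mismatch early is a startup or CI smoke check that reports which architecture the process is actually running on. The sketch below is an assumption about how you might wire that up; the helper name is hypothetical, and the return values mirror the `cpu_architecture` values ECS accepts (`ARM64`, `X86_64`).

```python
import platform

# Common machine strings mapped to ECS cpu_architecture values.
_ECS_ARCH = {
    "aarch64": "ARM64",
    "arm64": "ARM64",
    "x86_64": "X86_64",
    "amd64": "X86_64",
}


def ecs_cpu_architecture(machine=None):
    """Return the ECS-style architecture name for a machine string.

    Defaults to the machine the current process runs on.
    """
    machine = (machine or platform.machine()).lower()
    try:
        return _ECS_ARCH[machine]
    except KeyError:
        raise ValueError(f"unrecognised machine type: {machine!r}") from None


if __name__ == "__main__":
    print("running on", ecs_cpu_architecture())
```

Logging this once at startup makes it obvious in CloudWatch Logs whether a task really moved to ARM after the task definition changed.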
3) Savings Plans / Reserved Capacity #
Commitment discounts for Fargate / EC2 / Lambda.
| Kind | Discount | Commitment |
|---|---|---|
| Compute Savings Plan | up to 66% | 1 year / 3 years, $/h commitment |
| EC2 Instance SP | up to 72% | commit down to the instance family |
| RDS Reserved | up to 65% | instance class + region |
Compute SP is the most flexible (it applies to Fargate, EC2, and Lambda alike). Consider it once you reach stable operation. Never commit at the start, while traffic and architecture are still in flux.
operation start ~ 3 months : no commitment (fast-changing environment)
3 months ~ 6 months : usage analysis, start reviewing a 1-year SP
6 months + : commit to a 1-year SP covering 60~70% of stable usage
Commit to the full 100% and you lose out when traffic drops. Always leave a safety margin.
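The safety-margin argument is easy to see with numbers. The following Python sketch models a Savings Plan as a fixed hourly commitment that absorbs usage at a discounted rate; the 20% discount, the usage figures, and the function names are illustrative assumptions, not AWS quotes.

```python
def hourly_cost(on_demand_usage, commitment, discount):
    """Hourly bill with a Savings Plan.

    on_demand_usage: usage valued at on-demand rates (USD/h)
    commitment:      committed spend (USD/h), billed even when unused
    discount:        fractional discount, e.g. 0.2 for 20%
    """
    covered = commitment / (1 - discount)   # on-demand value the commitment absorbs
    overflow = max(0.0, on_demand_usage - covered)
    return commitment + overflow


def suggested_commitment(stable_usage, discount, coverage=0.65):
    """Commitment that covers `coverage` of stable on-demand usage."""
    return stable_usage * coverage * (1 - discount)


stable = 1.00      # hypothetical steady on-demand spend, USD/h
discount = 0.20    # hypothetical 1-year discount
partial = suggested_commitment(stable, discount)          # covers 65%
full = suggested_commitment(stable, discount, coverage=1.0)

for usage in (1.00, 0.50):
    print(
        f"usage ${usage:.2f}/h:",
        f"no plan ${usage:.2f},",
        f"65% plan ${hourly_cost(usage, partial, discount):.2f},",
        f"100% plan ${hourly_cost(usage, full, discount):.2f}",
    )
```

At steady usage the 100% plan is cheapest, but once usage halves it costs far more than having no plan at all, while the 65% plan stays close to break-even.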
4) Storage / Logs — the most frequently leaking items #
CloudWatch Logs #
This is the retention setting emphasized in Chapter 26. Apply it to every log group.
resource "aws_cloudwatch_log_group" "ecs" {
  for_each          = toset(["/ecs/blog-api", "/ecs/blog-api-migrate"])
  name              = each.key
  retention_in_days = 30
}
S3 #
Automatically move old objects to a cheaper class.
resource "aws_s3_bucket_lifecycle_configuration" "logs" {
  bucket = aws_s3_bucket.logs.id
  rule {
    id     = "to-ia-then-glacier"
    status = "Enabled"
    filter {} # empty filter: the rule applies to every object
    transition {
      days          = 30
      storage_class = "STANDARD_IA"
    }
    transition {
      days          = 90
      storage_class = "GLACIER"
    }
    expiration { days = 365 }
  }
}
ECR #
Automatically delete old images.
resource "aws_ecr_lifecycle_policy" "blog_api" {
  repository = aws_ecr_repository.blog_api.name
  policy = jsonencode({
    rules = [{
      rulePriority = 1
      description  = "keep only the latest 30"
      selection    = { tagStatus = "any", countType = "imageCountMoreThan", countNumber = 30 }
      action       = { type = "expire" }
    }]
  })
}
5) Network — NAT and Egress #
The item that surprises people most when they first see a production bill is NAT Gateway and Egress.
per hour $0.045
per GB $0.045 (processing)
+ Egress $0.09/GB (to the internet)
Even in a small system, a single NAT is ~$32/month + traffic. Savings by pattern are as follows.
| Pattern | Effect |
|---|---|
| VPC Endpoint for S3, DynamoDB | Gateway endpoints are free and take that traffic off the NAT |
| VPC Endpoint for ECR, Logs, Secrets | Interface endpoints: ~$0.01/hour per AZ + ~$0.01/GB (cheaper than NAT) |
| Put it behind CloudFront | Origin → CloudFront free, CloudFront → user ~$0.085/GB (by region) |
| Single NAT (dev environment) | one NAT instead of per-AZ NAT — availability ↓ |
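Whether an interface endpoint pays for itself depends on how much traffic it takes off the NAT. Here is a rough Python sketch of that break-even, using the NAT prices listed above and the approximate endpoint prices from the table; billing the endpoint per AZ, the 730-hour month, and the function names are assumptions of this sketch.

```python
HOURS_PER_MONTH = 730
NAT_HOURLY, NAT_PER_GB = 0.045, 0.045            # USD, from the price list above
ENDPOINT_HOURLY, ENDPOINT_PER_GB = 0.01, 0.01    # USD, approximate, per AZ


def nat_monthly(gb, gateways=1):
    """Monthly NAT Gateway cost: fixed hours plus per-GB processing."""
    return gateways * NAT_HOURLY * HOURS_PER_MONTH + gb * NAT_PER_GB


def endpoint_monthly(gb, azs=2):
    """Monthly cost of one interface endpoint deployed in `azs` AZs."""
    return azs * ENDPOINT_HOURLY * HOURS_PER_MONTH + gb * ENDPOINT_PER_GB


def break_even_gb(azs=2):
    """Monthly GB above which the endpoint beats NAT processing.

    Assumes the NAT Gateway stays in place for other traffic, so only
    the per-GB NAT charge is saved.
    """
    fixed = azs * ENDPOINT_HOURLY * HOURS_PER_MONTH
    return fixed / (NAT_PER_GB - ENDPOINT_PER_GB)


print(f"idle NAT: ${nat_monthly(0):.2f}/month")
print(f"endpoint in 2 AZs pays off above {break_even_gb(2):.0f} GB/month")
```

Below that volume an interface endpoint costs more than it saves, which is why the free gateway endpoints for S3 and DynamoDB come first and interface endpoints are added for the chattiest services.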
An endpoint in one resource #
resource "aws_vpc_endpoint" "ecr_api" {
  vpc_id              = aws_vpc.this.id
  service_name        = "com.amazonaws.${var.region}.ecr.api"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.private[*].id
  security_group_ids  = [aws_security_group.endpoints.id]
  private_dns_enabled = true
}
When an ECS task pulls an image from ECR, it goes through the endpoint rather than the NAT, which cuts both NAT traffic and cost. A fully private pull also needs the ecr.dkr interface endpoint and the S3 gateway endpoint, because image layers are served from S3.
6) Tagging — making cost classifiable #
Without tags, the bill is one lump. With tags, it’s sliced by environment / team / project.
provider "aws" {
  default_tags {
    tags = {
      Environment = var.environment
      Project     = "blog-api"
      ManagedBy   = "terraform"
      CostCenter  = "product-blog"
    }
  }
}
The provider’s default_tags applies automatically to every taggable resource the provider creates. It’s the heart of cost operations.
Activating Cost Allocation Tags #
Even if you add tags, Cost Explorer won’t classify by them unless you activate them in the console’s Billing → Cost Allocation Tags. The order is settings → activate tag → wait ~24h → available.
Tag enforcement (SCP / IAM Condition) #
Block the creation of untagged resources. Do it with an AWS Organizations SCP or an IAM policy’s Condition.
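{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Deny",
      "Action": "ec2:RunInstances",
      "Resource": "arn:aws:ec2:*:*:instance/*",
      "Condition": {
        "Null": { "aws:RequestTag/Environment": "true" }
      }
    },
    {
      "Effect": "Deny",
      "Action": "rds:CreateDBInstance",
      "Resource": "*",
      "Condition": {
        "Null": { "aws:RequestTag/Environment": "true" }
      }
    }
  ]
}
The ec2:RunInstances statement is scoped to instance ARNs because a launch request also touches resources such as subnets and images that carry no request tags; with "Resource": "*" the Deny would match those as well and block every launch.
SCP and Organizations-level governance are covered in earnest in Chapter 29 security governance.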
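The logic of that Deny rule can be mimicked in plain Python to reason about which requests it blocks. This is a simplified model, not the IAM evaluation engine: it looks only at the action and whether the tag key is present in the request, and the function name is hypothetical.

```python
GUARDED_ACTIONS = {"ec2:RunInstances", "rds:CreateDBInstance"}
REQUIRED_TAG = "Environment"


def is_denied(action, request_tags):
    """Mimic the policy: Deny a guarded action when the required tag key is absent."""
    tag_is_null = REQUIRED_TAG not in request_tags
    return action in GUARDED_ACTIONS and tag_is_null


requests = [
    ("ec2:RunInstances", {}),                                   # no tags at all
    ("ec2:RunInstances", {"Project": "blog-api"}),              # wrong tag only
    ("rds:CreateDBInstance", {"Environment": "dev"}),           # tagged correctly
    ("s3:CreateBucket", {}),                                    # not a guarded action
]
for action, tags in requests:
    print(f"{action:24} {tags} -> {'DENY' if is_denied(action, tags) else 'allow'}")
```

Note what the model also makes visible: only the listed actions are guarded, so every other create call still goes through untagged until it is added to the policy.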
7) An operational cost dashboard #
Add a row of cost widgets to the CloudWatch Dashboard.
[1] month-to-date cost (vs the same point last month)
[2] by service (Fargate / RDS / Logs / NAT / ALB)
[3] by environment (env=prod vs env=staging vs env=dev)
[4] daily trend (90 days)
[5] count of Right Sizing recommendations
Thirty minutes once a week in the oncall meeting catches worsening indicators early.
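Widget [1] boils down to a small calculation once you have daily cost figures (for example from a DAILY Cost Explorer query). Below is a minimal Python sketch of it; the input shape (plain lists of daily USD totals) and the function names are assumptions.

```python
def month_to_date(daily_costs, days_elapsed):
    """Sum of the first `days_elapsed` daily totals."""
    return sum(daily_costs[:days_elapsed])


def mtd_change(this_month, last_month):
    """Compare month-to-date spend with the same point last month.

    this_month: daily totals so far this month
    last_month: daily totals for the whole previous month
    Returns (current, previous, relative_change); the change is None
    when there was no spend at that point last month.
    """
    days = len(this_month)
    current = month_to_date(this_month, days)
    previous = month_to_date(last_month, days)
    if previous == 0:
        return current, previous, None
    return current, previous, (current - previous) / previous


# Hypothetical figures: two days into this month vs. last month.
current, previous, change = mtd_change([5.0, 5.0], [4.0, 4.0, 4.0])
print(f"MTD ${current:.2f} vs ${previous:.2f} last month ({change:+.0%})")
```

Comparing against the same day count, rather than last month's full total, is what keeps the widget meaningful in the first week of a month.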
Shared responsibility — FinOps #
Large organizations have a dedicated FinOps team that watches cost. In a small organization, developers themselves must stay conscious of what their own modules cost. That’s why tags are key: you only develop that awareness when you can see the bill for your own code.
Pitfalls — common cost traps #
1) Resources sleeping in another region #
This is exactly the pitfall from Chapter 1 (AWS intro). In AWS Resource Explorer or Cost Explorer’s per-region view, look for any region you don’t intend to use that shows resources or non-zero cost.
2) A PoC you didn’t terraform destroy #
The mistake of building a stack / environment and forgetting it. Block it with the tag + auto-cleanup Lambda pattern.
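The selection rule behind this pattern (tagged Project=PoC and created more than 7 days ago) might be sketched in Python as follows. The `CreatedAt` tag holding an ISO 8601 timestamp is an assumption of this sketch, as is the function name; listing and deleting the actual resources is left out.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(days=7)


def is_expired_poc(tags, now=None):
    """True if the resource is tagged Project=PoC and CreatedAt is over 7 days old."""
    if tags.get("Project") != "PoC":
        return False
    created_raw = tags.get("CreatedAt")
    if created_raw is None:
        return False  # no timestamp: leave the decision to a human
    created = datetime.fromisoformat(created_raw)
    if created.tzinfo is None:
        created = created.replace(tzinfo=timezone.utc)  # assume UTC if unspecified
    now = now or datetime.now(timezone.utc)
    return now - created > MAX_AGE


# Hypothetical resources as tag dictionaries.
now = datetime(2024, 1, 10, tzinfo=timezone.utc)
old_poc = {"Project": "PoC", "CreatedAt": "2024-01-01T00:00:00+00:00"}
new_poc = {"Project": "PoC", "CreatedAt": "2024-01-08T00:00:00+00:00"}
prod = {"Project": "blog-api", "CreatedAt": "2023-06-01T00:00:00+00:00"}

for name, tags in (("old_poc", old_poc), ("new_poc", new_poc), ("prod", prod)):
    print(name, "-> cleanup" if is_expired_poc(tags, now) else "-> keep")
```

Starting the Lambda in notify-only mode and switching to deletion once the matches look right is a safer rollout than deleting from day one.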
EventBridge schedule (daily at 9 AM)
│
▼
Lambda
- tag Project=PoC AND CreatedAt older than 7 days
- delete resources / notify
3) Not knowing Free Tier expired #
Chapter 3 cost management’s billing alert is the first safeguard, and Cost Anomaly Detection is the second.
4) Operating on 100% Spot causes downtime #
If a Spot interruption hits several tasks at once, the service can’t fill the desired count and 5xx errors spike. Always keep an on-demand base.
5) Multi-AZ RDS doubles the instance cost #
For a small system, Multi-AZ is a cost burden, but single-AZ in production is an availability risk. The compromise is single-AZ for dev/staging and Multi-AZ for prod.
6) Not using VPC Endpoints #
In a simple NAT-only setup, high-traffic resources (Logs, S3) go through the NAT and the cost runs away. Be sure to review this at the point you enter production.
7) Architecture change after a commitment #
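Move to ARM or Lambda right after buying a 3-year EC2 Instance SP or Reserved Instance and the commitment is still billed while covering nothing. Even a Compute SP, which follows you across those services, is wasted once total spend falls below the commitment. Start with short (1-year) commitments, and only in a stable environment.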
The layers stacked in Part 4 #
This is the end of Part 4, “From the console to code.” The tools you picked up in Parts 1 ~ 3 were stacked into one system in Part 4.
┌─────────────────────────────────────┐
│ FinOps │ ← Chapter 27
│ cost / tagging / commitments │
├─────────────────────────────────────┤
│ Observability │ ← Chapter 26
│ Logs / Metrics / Traces │
├─────────────────────────────────────┤
│ Automation │ ← Chapters 24, 25
│ CI/CD / IaC │
├─────────────────────────────────────┤
│ Data │ ← Chapter 23 + Chapter 11
│ RDS / Secrets │
├─────────────────────────────────────┤
│ Compute │ ← Chapter 22 + Chapters 15~21
│ ECS / Lambda │
├─────────────────────────────────────┤
│ Network │ ← Chapters 8, 13, 14
│ VPC / ALB / CloudFront │
├─────────────────────────────────────┤
│ Control plane │ ← Chapters 1~7
│ account / IAM / cost / security │
└─────────────────────────────────────┘
Read from the bottom up (control → network → compute → data → automation → observability → FinOps), the layers follow the natural order in which operations evolve. You’ll go through this flow again every time you build a new system.
If you’ve followed along through Part 4, you’ve completed one full lap: putting a small backend on ECS Fargate in an operable shape, reproducing it as code, deploying it automatically, observing it, and controlling its cost. Part 5 continues into operations · security · cost: what happens when this system grows, designing the network in more depth, standing up governance with multiple accounts, and preparing for failures.
Exercises #
- From the ratio table in §“Where the bill leaks,” pick the two often-forgotten cost items (NAT/Egress, Logs retention), and match each to the section of this chapter that cuts it, explaining how.
- Lay out, from §2, the three points for cutting Fargate compute cost (Right Sizing / Spot / Graviton), and write one sentence each on the risk or precondition for each. Explain, in connection with §“Pitfall 4,” why 100% Spot is dangerous.
- Explain in one paragraph why the provider’s default_tags is key in cost management, and note where, in the module structure of Chapter 25 Terraform intro, this tag goes so that it applies consistently to every environment.
In short: In a small operation, more than half the bill is often Fargate and RDS, and NAT/Egress and log retention are easy to miss. Find leaks with Cost Explorer and Anomaly Detection, cut compute with Right Sizing, Spot, Graviton, and Savings Plans, and cut storage and network with Logs, S3, ECR lifecycle, and VPC Endpoints. Make all resources classifiable with default_tags, and FinOps starts to work because developers become conscious of their own cost.
Next chapter #
Through Part 4, one system has come together in an operable shape. From here it continues into Part 5, operations · security · cost. In the next Chapter 28 VPC in depth we re-design, at operational scale, the network we’ve raced past with the default VPC and quick setups — subnet routing, NAT and VPC Endpoints, security groups and NACLs, and a multi-AZ network structure.