AWS in Practice #6: Cost Optimization and Dashboards — Wrapping Up the Track
In #1 through #5, infrastructure, DB, CI/CD, IaC, and monitoring have come together into an operationally-ready system. The last topic remaining: how much it’s costing, and how to reduce that cost.
Half of this post is cost optimization; the other half is a retrospective on the 27-post AWS track.
Where the bill leaks #
In Basics #3 cost management we covered the basics of billing alerts and Cost Explorer. This post builds on that: how to actually reduce the cost of a production system.
Typical monthly bill ratios for a small production system (ECS Fargate + RDS + ALB + CloudFront + Logs):
| Resource | Ratio | Notes |
|---|---|---|
| ECS Fargate (vCPU + memory hours) | 30–50% | The biggest |
| RDS (instance + Storage + IO) | 20–30% | 2x with Multi-AZ |
| NAT Gateway / Egress | 10–20% | Often forgotten |
| ALB / Traffic | 5–10% | Hours + LCU |
| CloudWatch Logs / Metrics | 5–10% | Explodes when retention is missing |
| S3 / ECR | 2–5% | Image / object accumulation |
| Other | 5% | DNS, KMS, Secrets, … |
If this table looks familiar — the patterns below can help.
1) Cost Explorer — start by finding where money goes #
Cost Explorer slices and dices the bill. In the console:
1) By service — Fargate vs RDS vs Logs (the biggest)
2) By tag — env=prod vs env=dev (cost split by environment)
3) By usage type — DataTransfer-Out-Bytes vs BoxUsage etc.
4) By region — sleeping resources in other regions ([#1 pitfalls](/en/posts/aws-basics-1-account-region-az))
5) Time trend — what suddenly went up since yesterday
Or via CLI #
# Month-to-date daily cost per service (End is exclusive)
aws ce get-cost-and-usage \
  --time-period Start=$(date -u +%Y-%m-01),End=$(date -u +%Y-%m-%d) \
  --granularity DAILY \
  --metrics BlendedCost \
  --group-by Type=DIMENSION,Key=SERVICE
Cost Anomaly Detection #
ML-based outlier detection. Auto-alerts when usage diverges from normal patterns.
aws ce create-anomaly-monitor \
  --anomaly-monitor '{
    "MonitorName": "blog-services",
    "MonitorType": "DIMENSIONAL",
    "MonitorDimension": "SERVICE"
  }'
If Basics #3 billing alerts fire “when a threshold is crossed,” Cost Anomaly alerts fire “when usage diverges from normal,” which makes them better at catching subtle leaks.
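A monitor on its own only detects anomalies; notifications come from a subscription attached to it. Since the rest of this series manages infrastructure with Terraform, here is a minimal sketch of both pieces, intended as an alternative to the CLI call rather than an addition to it. It assumes the AWS provider's `aws_ce_anomaly_monitor` and `aws_ce_anomaly_subscription` resources; the email address and the $20 impact threshold are placeholders, not recommendations.

```hcl
# Sketch: service-level anomaly monitor plus a daily email digest.
resource "aws_ce_anomaly_monitor" "services" {
  name              = "blog-services"
  monitor_type      = "DIMENSIONAL"
  monitor_dimension = "SERVICE"
}

resource "aws_ce_anomaly_subscription" "daily" {
  name             = "blog-services-daily"
  frequency        = "DAILY"
  monitor_arn_list = [aws_ce_anomaly_monitor.services.arn]

  subscriber {
    type    = "EMAIL"
    address = "ops@example.com" # placeholder address
  }

  # Only report anomalies whose total impact is at least $20 (assumed threshold).
  threshold_expression {
    dimension {
      key           = "ANOMALY_TOTAL_IMPACT_ABSOLUTE"
      match_options = ["GREATER_THAN_OR_EQUAL"]
      values        = ["20"]
    }
  }
}
```

A low threshold is noisy on a small bill and a high one hides leaks, so it is worth revisiting after the first few weeks of alerts.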
2) Compute cost — three Fargate levers #
A) Right Sizing — only what’s actually needed #
Look at the average CPU / memory in CloudWatch Container Insights (#5) and adjust task size.
Current: cpu=1024, memory=2048
Observed: avg CPU 15%, p95 35%, memory avg 30%
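Adjusted: cpu=512, memory=1024 → 50% cost reduction
A healthy average CPU utilization is 30–50%. Below 20%, the task is oversized (but still leave burst headroom). In the example above, the observed p95 of 35% becomes roughly 70% at half the size, which still leaves room for bursts.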
For small environments, Compute Optimizer auto-recommends — turn it on once in the console.
B) Fargate Spot — up to 70% cheaper #
Batch / restartable tasks are perfect for Fargate Spot:
resource "aws_ecs_service" "this" {
  capacity_provider_strategy {
    capacity_provider = "FARGATE"
    weight            = 1 # on-demand share beyond the base
    base              = 2 # always keep 2 tasks on-demand
  }
  capacity_provider_strategy {
    capacity_provider = "FARGATE_SPOT"
    weight            = 4 # beyond the base, prefer Spot 4:1
  }
}
With this pattern, 2 tasks always run on-demand; beyond that, tasks are placed 4:1 in favor of Spot. As load drops, Spot cleans up first.
When a Spot interruption occurs, tasks receive a two-minute warning and ECS starts replacements, but the service can still run short of capacity for up to ~120s. For production traffic, run only a portion on Spot; 100% Spot is high risk.
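One way to soften an interruption is to let containers use the whole warning window to drain. ECS delivers the interruption to the task as SIGTERM, and the container-level `stopTimeout` controls how long it waits before SIGKILL (30 seconds by default, 120 seconds at most on Fargate). A minimal sketch, assuming the application actually handles SIGTERM; the resource name `api` and the variables `var.image` and `var.execution_role_arn` are hypothetical and not defined in this post.

```hcl
# Sketch: give Spot tasks the full 120 s to finish in-flight work.
resource "aws_ecs_task_definition" "api" {
  family                   = "blog-api"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = "512"
  memory                   = "1024"
  execution_role_arn       = var.execution_role_arn # hypothetical variable

  container_definitions = jsonencode([{
    name        = "api"
    image       = var.image # hypothetical variable
    essential   = true
    stopTimeout = 120 # seconds between SIGTERM and SIGKILL (Fargate maximum)
  }])
}
```

The longer timeout only helps if the process stops accepting new requests on SIGTERM and finishes the ones in flight; otherwise it just delays the kill.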
C) Graviton (ARM) — up to ~20% cheaper, often faster #
db.t4g.* (RDS), Fargate ARM option, EC2 Graviton (m7g, c7g) — AWS’s ARM chips. If your container image can build for ARM, there’s no reason not to use it.
# Build
docker buildx build --platform linux/amd64,linux/arm64 \
  -t $REPO/blog-api:v1 --push .
# Task definition
resource "aws_ecs_task_definition" "this" {
  cpu    = "512"
  memory = "1024"
  runtime_platform {
    cpu_architecture        = "ARM64"
    operating_system_family = "LINUX"
  }
}
Prerequisite: all libraries must be ARM-compatible. Most Python, Node, and Go packages are. Some packages with native bindings will need verification.
3) Savings Plans / Reserved Capacity #
Commitment discounts for Fargate / EC2 / Lambda.
| Type | Discount | Commitment |
|---|---|---|
| Compute Savings Plan | Up to 66% | 1 year / 3 year, $/h commitment |
| EC2 Instance SP | Up to 72% | Commits to instance family |
| RDS Reserved | Up to 65% | Instance class + region |
Compute SP is the most flexible option (covers Fargate, EC2, and Lambda). Consider it once you reach stable production — never commit early when traffic or architecture is still in flux.
Production start ~ 3 months : no commitment (fast-changing phase)
3 months ~ 6 months : usage analysis, start considering 1-yr SP
6 months + : 1-yr commit at 60–70% of stable usage
Committing 100% becomes costly if traffic drops. Always leave a safety margin.
4) Storage / Logs — where leaks happen most #
CloudWatch Logs #
This is the retention setting emphasized in #5. Apply it to every log group:
resource "aws_cloudwatch_log_group" "ecs" {
  for_each          = toset(["/ecs/blog-api", "/ecs/blog-api-migrate"])
  name              = each.key
  retention_in_days = 30
}
S3 #
Auto-tier old objects to cheaper classes:
resource "aws_s3_bucket_lifecycle_configuration" "logs" {
  bucket = aws_s3_bucket.logs.id
  rule {
    id     = "to-ia-then-glacier"
    status = "Enabled"
    filter {} # empty filter: the rule applies to every object
    transition {
      days          = 30
      storage_class = "STANDARD_IA"
    }
    transition {
      days          = 90
      storage_class = "GLACIER"
    }
    expiration {
      days = 365
    }
  }
}
ECR #
Auto-delete old images:
resource "aws_ecr_lifecycle_policy" "blog_api" {
  repository = aws_ecr_repository.blog_api.name
  policy = jsonencode({
    rules = [{
      rulePriority = 1
      description  = "Keep only the latest 30"
      selection    = { tagStatus = "any", countType = "imageCountMoreThan", countNumber = 30 }
      action       = { type = "expire" }
    }]
  })
}
5) Network — NAT and Egress #
What most surprises people seeing their first production bill: NAT Gateway and Egress costs.
Hourly   $0.045
Per GB   $0.045 (processing)
+ Egress $0.09/GB (to internet)
Even in a small system, a single NAT runs ~$32/month ($0.045 × 720 h) before traffic costs. Savings by approach:
| Pattern | Effect |
|---|---|
| VPC Endpoint for S3, DynamoDB | Gateway endpoint: no charge, takes this traffic off the NAT |
| VPC Endpoint for ECR, Logs, Secrets | Interface endpoint: ~$0.01/hour per AZ + ~$0.01/GB (cheaper per GB than NAT) |
| CloudFront in front | Origin → CloudFront free, CloudFront → user GB ~$0.085 (region-dependent) |
| Single NAT (dev environment) | Single NAT instead of per-AZ — availability ↓ |
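The single-NAT row of the table can be expressed as a count switch in Terraform. This is a minimal sketch under a few assumptions: public subnets are declared with `count` as `aws_subnet.public` (a name not defined in this post), `var.environment` is the same variable used for tagging, and the provider is recent enough to accept `domain = "vpc"` on `aws_eip`.

```hcl
locals {
  # One NAT per public subnet (per AZ) in prod; a single shared NAT elsewhere.
  nat_count = var.environment == "prod" ? length(aws_subnet.public) : 1
}

resource "aws_eip" "nat" {
  count  = local.nat_count
  domain = "vpc"
}

resource "aws_nat_gateway" "this" {
  count         = local.nat_count
  allocation_id = aws_eip.nat[count.index].id
  subnet_id     = aws_subnet.public[count.index].id
}
```

In the single-NAT case, every private route table has to point at `aws_nat_gateway.this[0]`, and losing that AZ takes outbound traffic down for the whole environment, which is why this is a dev-only trade.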
One-line endpoint #
resource "aws_vpc_endpoint" "ecr_api" {
  vpc_id              = aws_vpc.this.id
  service_name        = "com.amazonaws.${var.region}.ecr.api"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.private[*].id
  security_group_ids  = [aws_security_group.endpoints.id]
  private_dns_enabled = true
}
When an ECS task pulls images from ECR, the traffic goes through the endpoint instead of the NAT, saving both NAT traffic and cost.
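An image pull touches more than `ecr.api`: the `ecr.dkr` endpoint serves the Docker registry API, and the image layers themselves are downloaded from S3. A sketch of the two companion endpoints, reusing the names from the block above and assuming the private route tables are declared as `aws_route_table.private` (a name not defined in this post):

```hcl
# Gateway endpoint for S3: no hourly or per-GB charge, attached via route tables.
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.this.id
  service_name      = "com.amazonaws.${var.region}.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = aws_route_table.private[*].id # assumed resource name
}

# Interface endpoint for the Docker registry API, alongside ecr.api.
resource "aws_vpc_endpoint" "ecr_dkr" {
  vpc_id              = aws_vpc.this.id
  service_name        = "com.amazonaws.${var.region}.ecr.dkr"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.private[*].id
  security_group_ids  = [aws_security_group.endpoints.id]
  private_dns_enabled = true
}
```

The security group on the interface endpoints needs to allow inbound HTTPS (443) from the task subnets, otherwise pulls time out instead of falling back to the NAT.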
6) Tagging — making cost classifiable #
Without tags, the bill is one undifferentiated lump. With tags, you can slice costs by environment, team, or project.
provider "aws" {
  default_tags {
    tags = {
      Environment = var.environment
      Project     = "blog-api"
      ManagedBy   = "terraform"
      CostCenter  = "product-blog"
    }
  }
}
The provider’s default_tags block is automatically applied to every taggable resource that provider creates; it is the operational core of cost tagging.
Cost Allocation Tag activation #
Even with tags applied, if they aren’t activated under console Billing → Cost Allocation Tags, Cost Explorer won’t classify them. Go to that settings page, activate each tag, wait ~24h, and they become usable.
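The activation step can also be kept in code. A minimal sketch, assuming the AWS provider's `aws_ce_cost_allocation_tag` resource, that it is applied from the account that owns billing (the management account in an Organization), and that the tag keys have already shown up in billing data; a key that has never been seen on a resource is likely to be rejected.

```hcl
# Sketch: activate the default_tags keys as cost allocation tags.
resource "aws_ce_cost_allocation_tag" "this" {
  for_each = toset(["Environment", "Project", "ManagedBy", "CostCenter"])

  tag_key = each.key
  status  = "Active"
}
```

The same delay still applies: costs only start being attributed to a tag after activation, so earlier spend stays unclassified.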
Tag enforcement (SCP / IAM Condition) #
Block resource creation without tags. AWS Organizations SCP or IAM policy Condition:
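{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Deny",
    "Action": ["ec2:RunInstances", "rds:CreateDBInstance"],
    "Resource": ["arn:aws:ec2:*:*:instance/*", "arn:aws:rds:*:*:db:*"],
    "Condition": {
      "Null": { "aws:RequestTag/Environment": "true" }
    }
  }]
}
Resource is scoped to instances and DB instances because RunInstances also authorizes against resources such as subnets and AMIs, which never carry request tags; a wildcard there would deny every launch.
7) Operational cost dashboard #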
Add a cost section to the CloudWatch dashboard:
[1] This month accumulated cost (vs last month at the same point)
[2] By service (Fargate / RDS / Logs / NAT / ALB)
[3] By environment (env=prod vs env=staging vs env=dev)
[4] Daily trend (90 days)
[5] Right Sizing recommendation count
Reviewing it for 30 minutes in the weekly on-call meeting catches worsening areas early.
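As a starting point, widget [1] can be built from the `AWS/Billing` `EstimatedCharges` metric. The sketch below covers only that widget and rests on two assumptions: billing metrics are published only in us-east-1, and "Receive Billing Alerts" is enabled in the account's billing preferences. The dashboard name is hypothetical. The per-tag and right-sizing views are not CloudWatch metrics out of the box and would need Cost Explorer or a custom metric.

```hcl
# Sketch: month-to-date estimated charges as a single dashboard widget.
resource "aws_cloudwatch_dashboard" "cost" {
  dashboard_name = "blog-api-cost" # hypothetical name

  dashboard_body = jsonencode({
    widgets = [{
      type   = "metric"
      x      = 0
      y      = 0
      width  = 12
      height = 6
      properties = {
        title   = "Month-to-date estimated charges (USD)"
        region  = "us-east-1" # billing metrics are published here only
        stat    = "Maximum"
        period  = 21600       # 6 h; the metric updates a few times a day
        metrics = [["AWS/Billing", "EstimatedCharges", "Currency", "USD"]]
      }
    }]
  })
}
```

Because the metric is cumulative and resets each month, the graph is a sawtooth; a steeper slope than last month is the signal to look for.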
Shared responsibility — FinOps #
Large organizations have a dedicated FinOps function that watches costs, but in smaller organizations, developers themselves need to be aware of what their own modules cost. Tags make that possible — being able to see the bill for your own code creates accountability.
Pitfalls — frequent cost traps #
1) Sleeping resources in other regions #
Same pitfall from Basics #1. Use AWS Resource Explorer or Cost Explorer’s per-region view → investigate any regions showing non-zero costs.
2) PoCs without terraform destroy #
Stacks / environments built and forgotten. Tag + auto-cleanup lambda pattern:
EventBridge schedule (daily 9am)
│
▼
Lambda
- find resources tagged Project=PoC whose CreatedAt is more than 7 days old
- delete resources / notify
3) Free Tier expiry unnoticed #
Basics #3 billing alerts are the first line of defense; Cost Anomaly Detection is the second.
4) 100% Spot causes downtime #
Spot interruption hits multiple tasks at once → service can’t fill desired count → 5xx burst. Always have base on-demand.
5) Multi-AZ RDS doubles cost #
Multi-AZ adds cost pressure on small systems, but single-AZ is a reliability risk. The sensible compromise: single-AZ for dev/staging, Multi-AZ for prod.
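That compromise fits in one expression. A minimal sketch, assuming the same `var.environment` used for tagging; the engine, instance class, storage size, and the `var.db_username` / `var.db_password` variables are illustrative placeholders, not the values from the earlier posts.

```hcl
# Sketch: Multi-AZ only where an outage actually costs money.
resource "aws_db_instance" "this" {
  identifier        = "blog-${var.environment}"
  engine            = "postgres"     # assumed engine
  instance_class    = "db.t4g.micro" # assumed size
  allocated_storage = 20
  username          = var.db_username # hypothetical variable
  password          = var.db_password # hypothetical variable

  multi_az = var.environment == "prod"
}
```

Keeping the switch in the module means a new environment gets the cheaper single-AZ default unless it is explicitly named prod.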
6) VPC Endpoint not used #
A simple setup relying solely on NAT means high-traffic resources (Logs, S3) flow through the NAT, exploding cost. Always review this when entering production.
7) Architecture changes after commitment #
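Bought a 3-yr EC2 Instance SP or RDS reservation and then moved to ARM or Lambda: the commitment still bills regardless (only a Compute SP follows you across instance families, Fargate, and Lambda). Start with shorter terms (1-yr), and only commit stable workloads.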
AWS Track 27 Posts Retrospective #
If you were to sum up this track in one line:
“From the 200-service console catalog, picked only the toolbox needed to safely run a small backend.”
Per-series summary #
| Series | Posts | What it covered |
|---|---|---|
| Basics | 7 | Account / region / IAM / cost / CLI / security / logs — the map before entering the console |
| Intermediate | 7 | EC2 / VPC / S3 / RDS / Route 53 / ALB / CloudFront — the operational skeleton |
| Advanced | 7 | ECS / ECR / Lambda / API Gateway / EventBridge / Secrets / Step Functions — the modern backend domain |
| Practice | 6 | All as one system — Fargate / RDS / CI/CD / IaC / Monitoring / Cost |
Each series stands on its own, but when the four series come together as one system, something different emerges — a production-ready backend.
AWS’s essence in one place #
AWS isn’t a service catalog. It’s lego — stacking blocks on blocks.
┌─────────────────────────────────────┐
│ FinOps │ ← #6
│ Cost / Tagging / Commitment │
├─────────────────────────────────────┤
│ Observability │ ← #5
│ Logs / Metrics / Traces │
├─────────────────────────────────────┤
│ Automation │ ← #3, #4
│ CI/CD / IaC │
├─────────────────────────────────────┤
│ Data │ ← #2 + Intermediate #4
│ RDS / Secrets │
├─────────────────────────────────────┤
│ Compute │ ← #1 + Advanced #1~7
│ ECS / Lambda │
├─────────────────────────────────────┤
│ Network │ ← Intermediate #1, 6, 7
│ VPC / ALB / CloudFront │
├─────────────────────────────────────┤
│ Control plane │ ← All of Basics
│ Account / IAM / Cost / Security │
└─────────────────────────────────────┘
Read the layers from bottom to top: control plane → network → compute → data → automation → observability → FinOps. That is the natural order in which operations evolve. With every new system you build, you’ll walk through this progression again.
Areas this track didn’t cover #
Areas that naturally lead to next tracks:
- Container standardization — Docker track is next. Multi-stage builds / slimming / security scans / multi-arch — the depth of the image itself running on Fargate.
- Kubernetes — the next step after ECS. Multi-cluster / GitOps / service mesh — natural evolution as traffic grows.
- Certifications — same domain seen from the exam angle. The roadmap’s Cloud Practitioner / SAA / DVA series.
- DataOps / ML — Glue / SageMaker / Athena. Once data grows, it goes there.
- Multi-cloud / hybrid — integration with Azure / GCP / on-prem. Something you meet in big organizations.
Each area deserves its own track.
Wrapping up the track #
If you’ve followed along to this post, finding what to look for and where in the AWS console should now be muscle memory. That was the real goal of this track. New services and features are added every year, but if you know which layer a new tool belongs to, it quickly finds its place.
What I recommend next #
- Docker track: go deep on the containers this series depended on. Multi-stage / security / multi-arch / compose, 24 posts.
- Certifications: Cloud Practitioner / SAA / DVA, the same tools from the exam angle. They pay off in interviews, job changes, and internal reviews.
- Your own project: the fastest way to turn what you have read into hands-on skill. Spin up a small side project with this track’s patterns. #1’s infrastructure becomes the starting point almost as is.
Thank you for following along this long track. Facing AWS’s 200-service catalog, you now stand at a position of confidence — even unfamiliar tools come with a sense of place: “this belongs in the compute layer, that’s network.” From there, real operations begin, and this track has walked alongside you to that starting line.
Until the next track.