AWS in Practice #6: Cost Optimization and Dashboards — Wrapping Up the Track


In #1 through #5, infrastructure, DB, CI/CD, IaC, and monitoring have come together into an operationally-ready system. The last topic remaining: how much it’s costing, and how to reduce that cost.

Half of this post is cost optimization, half is a 27-post AWS track retrospective.

Where the bill leaks #

In Basics #3 cost management we covered the basics of billing alerts and Cost Explorer. This post builds on that: actually reducing the cost of a production system.

Typical monthly bill breakdown for a small production system (ECS Fargate + RDS + ALB + CloudFront + Logs):

| Resource | Ratio | Notes |
| --- | --- | --- |
| ECS Fargate (vCPU + memory hours) | 30–50% | The biggest |
| RDS (instance + Storage + IO) | 20–30% | 2x with Multi-AZ |
| NAT Gateway / Egress | 10–20% | Often forgotten |
| ALB / Traffic | 5–10% | Hours + LCU |
| CloudWatch Logs / Metrics | 5–10% | Explodes when retention is missing |
| S3 / ECR | 2–5% | Image / object accumulation |
| Other | 5% | DNS, KMS, Secrets, … |

If this table looks familiar — the patterns below can help.

1) Cost Explorer — start by finding where money goes #

Cost Explorer slices and dices the bill. In the console:

Frequent analyses
1) By service       — Fargate vs RDS vs Logs (the biggest)
2) By tag           — env=prod vs env=dev (cost split by environment)
3) By usage type    — DataTransfer-Out-Bytes vs BoxUsage etc.
4) By region        — sleeping resources in other regions ([#1 pitfalls](/en/posts/aws-basics-1-account-region-az))
5) Time trend       — what suddenly went up since yesterday

Or via CLI #

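This month's cost by service
# End is exclusive and must be later than Start, so use tomorrow's date.
# GNU date syntax shown; on macOS/BSD use: date -u -v+1d +%Y-%m-%d
aws ce get-cost-and-usage \
  --time-period Start=$(date -u +%Y-%m-01),End=$(date -u -d tomorrow +%Y-%m-%d) \
  --granularity DAILY \
  --metrics BlendedCost \
  --group-by Type=DIMENSION,Key=SERVICE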
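The CLI call above returns one entry per day, each with a group per service. A minimal sketch of rolling that output up into month-to-date totals, assuming the JSON has been loaded into a dict; the `sample` data and the function name `total_by_service` are made up for illustration:

```python
from collections import defaultdict


def total_by_service(response: dict, metric: str = "BlendedCost") -> dict[str, float]:
    """Sum the daily amounts per service from a get-cost-and-usage response."""
    totals: dict[str, float] = defaultdict(float)
    for day in response["ResultsByTime"]:
        for group in day["Groups"]:
            service = group["Keys"][0]  # one key because we grouped by SERVICE only
            totals[service] += float(group["Metrics"][metric]["Amount"])
    # Largest spender first
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


# Hypothetical two-day sample shaped like the CLI's JSON output
sample = {
    "ResultsByTime": [
        {"Groups": [
            {"Keys": ["Amazon Elastic Container Service"],
             "Metrics": {"BlendedCost": {"Amount": "3.20", "Unit": "USD"}}},
            {"Keys": ["Amazon Relational Database Service"],
             "Metrics": {"BlendedCost": {"Amount": "1.90", "Unit": "USD"}}},
        ]},
        {"Groups": [
            {"Keys": ["Amazon Elastic Container Service"],
             "Metrics": {"BlendedCost": {"Amount": "3.10", "Unit": "USD"}}},
            {"Keys": ["Amazon Relational Database Service"],
             "Metrics": {"BlendedCost": {"Amount": "1.90", "Unit": "USD"}}},
        ]},
    ]
}

for service, usd in total_by_service(sample).items():
    print(f"{service:40s} {usd:8.2f} USD")
```

In practice the dict would come from `json.loads` on the CLI output rather than a literal.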

Cost Anomaly Detection #

ML-based outlier detection. Auto-alerts when usage diverges from normal patterns.

Anomaly monitor (per-service)
aws ce create-anomaly-monitor \
  --anomaly-monitor '{
    "MonitorName": "blog-services",
    "MonitorType": "DIMENSIONAL",
    "MonitorDimension": "SERVICE"
  }'

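Where the Basics #3 billing alerts fire “when a threshold is crossed,” Cost Anomaly alerts fire “when usage diverges from normal,” which makes them better at catching subtle leaks. A monitor on its own only detects anomalies; to get notified, attach an anomaly subscription (aws ce create-anomaly-subscription) with an email or SNS target.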
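The difference between the two alert styles can be shown with a toy example. This is only an illustration using a rolling mean and standard deviation; it is not the model Cost Anomaly Detection actually uses, and the cost figures are invented:

```python
from statistics import mean, stdev


def threshold_alert(daily_costs: list[float], limit: float) -> list[int]:
    """Indices of days whose cost exceeds a fixed limit (billing-alert style)."""
    return [i for i, cost in enumerate(daily_costs) if cost > limit]


def deviation_alert(daily_costs: list[float], window: int = 7, z: float = 3.0) -> list[int]:
    """Indices of days far above the trailing window's normal range."""
    flagged = []
    for i in range(window, len(daily_costs)):
        history = daily_costs[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and (daily_costs[i] - mu) / sigma > z:
            flagged.append(i)
    return flagged


# Hypothetical daily costs in USD: a steady ~10/day, then a jump to 14
costs = [10.0, 10.5, 9.8, 10.2, 10.1, 9.9, 10.3, 14.0]

print(threshold_alert(costs, limit=20.0))  # the jump stays under the fixed limit
print(deviation_alert(costs))              # but it stands out against the baseline
```

A 40% jump never reaches the fixed 20 USD limit, yet it is far outside the previous week's range, which is the kind of leak a baseline-based alert is meant to catch.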

2) Compute cost — three Fargate levers #

A) Right Sizing — only what’s actually needed #

Look at the average CPU / memory in CloudWatch Container Insights (#5) and adjust task size.

Right Sizing
Current: cpu=1024, memory=2048
Observed: avg CPU 15%, p95 35%, memory avg 30%
Adjusted: cpu=512, memory=1024  → 50% cost reduction

A healthy average CPU utilization is 30–50%. A task averaging below 20% is oversized, but leave headroom for bursts when shrinking it.

For small environments, Compute Optimizer auto-recommends — turn it on once in the console.
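The right-sizing example above can be checked with a little arithmetic. A minimal sketch, assuming Fargate bills linearly per vCPU-hour and GB-hour; the two rates below are assumed on-demand prices for one region and should be replaced with the current price list:

```python
# Assumed on-demand rates (USD); verify against the current Fargate price list.
VCPU_HOUR_USD = 0.04048
GB_HOUR_USD = 0.004445
HOURS_PER_MONTH = 730


def task_monthly_cost(cpu_units: int, memory_mib: int) -> float:
    """Monthly cost of one always-on task (1024 cpu units = 1 vCPU)."""
    vcpu = cpu_units / 1024
    gb = memory_mib / 1024
    return (vcpu * VCPU_HOUR_USD + gb * GB_HOUR_USD) * HOURS_PER_MONTH


def projected_utilization(observed_pct: float, old_size: int, new_size: int) -> float:
    """Utilization expected after resizing, assuming the workload is unchanged."""
    return observed_pct * old_size / new_size


current = task_monthly_cost(1024, 2048)
adjusted = task_monthly_cost(512, 1024)
print(f"saving: {1 - adjusted / current:.0%}")
print(f"avg CPU after resize: {projected_utilization(15, 1024, 512):.0f}%")
print(f"p95 CPU after resize: {projected_utilization(35, 1024, 512):.0f}%")
```

Halving both dimensions halves the bill whatever the rates are, and the projected average lands at the bottom of the 30–50% band while p95 still leaves burst headroom.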

B) Fargate Spot — up to 70% cheaper #

Batch / restartable tasks are perfect for Fargate Spot:

capacity provider strategy
resource "aws_ecs_service" "this" {
  capacity_provider_strategy {
    capacity_provider = "FARGATE"
    weight            = 1     # base on-demand
    base              = 2
  }

  capacity_provider_strategy {
    capacity_provider = "FARGATE_SPOT"
    weight            = 4     # additional prefer Spot
  }
}

Above pattern: always 2 on-demand, beyond that 4:1 Spot. As load drops, Spot cleans up first.

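Fargate Spot tasks get a two-minute warning before an interruption, and ECS then tries to launch replacements, but Spot capacity may not be available right away, so the service can run below its desired count for a while. For production traffic, run only a portion on Spot; 100% Spot is high risk.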
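To see what "2 on-demand, then 4:1" means for a given task count, here is a rough sketch of the split. It uses a simple weighted round-robin; the exact order in which ECS places tasks is an assumption here, only the resulting proportions are the point, and `split_tasks` is a made-up helper name:

```python
def split_tasks(desired: int, base: int, weights: dict[str, int],
                base_provider: str) -> dict[str, int]:
    """Approximate how `desired` tasks spread across capacity providers."""
    counts = {name: 0 for name in weights}
    extra = {name: 0 for name in weights}  # tasks placed beyond the base

    # The base is satisfied first, on the provider that declares it.
    on_base = min(desired, base)
    counts[base_provider] = on_base

    # The rest follow the weight ratio (weighted round-robin).
    candidates = [name for name in weights if weights[name] > 0]
    for _ in range(desired - on_base):
        pick = min(candidates, key=lambda n: (extra[n] + 1) / weights[n])
        extra[pick] += 1
        counts[pick] += 1
    return counts


strategy = {"FARGATE": 1, "FARGATE_SPOT": 4}
for n in (2, 7, 12):
    print(n, split_tasks(n, base=2, weights=strategy, base_provider="FARGATE"))
```

With 12 desired tasks, 2 form the base and the remaining 10 split 2:8, so 4 run on-demand and 8 on Spot.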

C) Graviton (ARM) — 20% cheaper + 20% faster #

db.t4g.* (RDS), Fargate ARM option, EC2 Graviton (m7g, c7g) — AWS’s ARM chips. If your container image can build for ARM, there’s no reason not to use it.

Multi-arch build
# Build and push one image tag covering both x86_64 and ARM64
docker buildx build --platform linux/amd64,linux/arm64 \
  -t $REPO/blog-api:v1 --push .

Fargate ARM64
resource "aws_ecs_task_definition" "this" {
  # family, container_definitions, and other required arguments omitted
  cpu    = "512"
  memory = "1024"
  runtime_platform {
    cpu_architecture        = "ARM64"
    operating_system_family = "LINUX"
  }
}

Prerequisite: all libraries must be ARM-compatible. Most Python, Node, and Go packages are. Some packages with native bindings will need verification.

3) Savings Plans / Reserved Capacity #

Commitment discounts for Fargate / EC2 / Lambda.

| Type | Discount | Commitment |
| --- | --- | --- |
| Compute Savings Plan | Up to 66% | 1 year / 3 year, $/h commitment |
| EC2 Instance SP | Up to 72% | Commits to instance family |
| RDS Reserved | Up to 65% | Instance class + region |

Compute SP is the most flexible option (covers Fargate, EC2, and Lambda). Consider it once you reach stable production — never commit early when traffic or architecture is still in flux.

Guideline
Production start ~ 3 months    : no commitment (fast-changing phase)
3 months ~ 6 months            : usage analysis, start considering 1-yr SP
6 months +                     : 1-yr commit at 60–70% of stable usage

Committing 100% becomes costly if traffic drops. Always leave a safety margin.
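The effect of the safety margin is easier to see with numbers. A simplified sketch of how a Savings Plan bills per hour; the 30% discount and the 10 USD/h usage figure are hypothetical, and real plans apply different rates per service:

```python
def hourly_cost(usage_od: float, commit: float, discount: float) -> float:
    """Hourly bill in USD.

    usage_od: actual usage valued at on-demand prices
    commit:   committed spend per hour (at discounted prices), owed regardless
    discount: Savings Plan discount as a fraction, e.g. 0.30
    """
    covered_od = commit / (1 - discount)        # on-demand value the commitment covers
    overflow = max(0.0, usage_od - covered_od)  # the rest is billed on-demand
    return commit + overflow


DISCOUNT = 0.30                       # hypothetical
STABLE_USAGE = 10.0                   # USD/h at on-demand prices, hypothetical
full = STABLE_USAGE * (1 - DISCOUNT)  # commit that covers 100% of stable usage
partial = 0.65 * full                 # commit that covers 65% of it

for usage in (10.0, 4.0):
    print(f"usage {usage:4.1f}: none={hourly_cost(usage, 0.0, DISCOUNT):.2f} "
          f"65%={hourly_cost(usage, partial, DISCOUNT):.2f} "
          f"100%={hourly_cost(usage, full, DISCOUNT):.2f}")
```

At stable usage the full commitment is cheapest, but once usage drops to 4 USD/h it still bills 7 USD/h, more than having no plan at all, while the 65% commitment stays much closer to the on-demand bill.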

4) Storage / Logs — where leaks happen most #

CloudWatch Logs #

Set the retention period emphasized in #5 on every log group:

30 days uniformly with Terraform
resource "aws_cloudwatch_log_group" "ecs" {
  for_each          = toset(["/ecs/blog-api", "/ecs/blog-api-migrate"])
  name              = each.key
  retention_in_days = 30
}
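Why retention matters is plain arithmetic: without it, stored volume grows forever. A back-of-the-envelope sketch, assuming a flat ingest rate and an assumed storage price of 0.03 USD per GB-month (compression and ingestion fees are ignored, and the 2 GB/day figure is hypothetical):

```python
from typing import Optional

STORAGE_USD_PER_GB_MONTH = 0.03  # assumed rate; check current CloudWatch Logs pricing


def stored_gb(daily_gb: float, days_running: int, retention_days: Optional[int]) -> float:
    """Log volume held after `days_running` days; retention None = never expire."""
    if retention_days is None:
        return daily_gb * days_running
    return daily_gb * min(days_running, retention_days)


def monthly_storage_cost(gb: float) -> float:
    return gb * STORAGE_USD_PER_GB_MONTH


# Hypothetical service writing 2 GB of logs per day, one year in
for retention in (30, None):
    gb = stored_gb(2.0, 365, retention)
    print(f"retention={retention}: {gb:.0f} GB stored, "
          f"{monthly_storage_cost(gb):.2f} USD/month")
```

With 30-day retention the stored volume plateaus at 60 GB; with no retention it reaches 730 GB after a year and keeps climbing.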

S3 #

Auto-tier old objects to cheaper classes:

S3 lifecycle
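resource "aws_s3_bucket_lifecycle_configuration" "logs" {
  bucket = aws_s3_bucket.logs.id
  rule {
    id     = "to-ia-then-glacier"
    status = "Enabled"
    filter {} # empty filter: the rule applies to every object in the bucket
    # HCL allows only one argument in a single-line block, so these are multi-line
    transition {
      days          = 30
      storage_class = "STANDARD_IA"
    }
    transition {
      days          = 90
      storage_class = "GLACIER"
    }
    expiration { days = 365 }
  }
}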
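The rule above amounts to a simple mapping from object age to storage class. A small sketch of that mapping, as an approximation only (S3 evaluates lifecycle rules on its own schedule, so real transitions can lag the exact day); `storage_stage` is a made-up name:

```python
def storage_stage(age_days: int) -> str:
    """Where an object sits under the 30 / 90 / 365-day rule, by age in days."""
    if age_days >= 365:
        return "EXPIRED"      # deleted by the expiration action
    if age_days >= 90:
        return "GLACIER"
    if age_days >= 30:
        return "STANDARD_IA"
    return "STANDARD"


for age in (10, 45, 120, 400):
    print(age, storage_stage(age))
```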

ECR #

Auto-delete old images:

ECR lifecycle
resource "aws_ecr_lifecycle_policy" "blog_api" {
  repository = aws_ecr_repository.blog_api.name
  policy = jsonencode({
    rules = [{
      rulePriority = 1
      description  = "Keep only the latest 30"
      selection    = { tagStatus = "any", countType = "imageCountMoreThan", countNumber = 30 }
      action       = { type = "expire" }
    }]
  })
}
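What `imageCountMoreThan = 30` does can be sketched in a few lines: order images by push time and expire everything past the newest 30. The image list below is generated for illustration, and `images_to_expire` is a hypothetical helper, not part of any AWS SDK:

```python
from datetime import datetime, timedelta


def images_to_expire(images: list[tuple[str, datetime]], keep: int = 30) -> list[str]:
    """Tags of images beyond the `keep` most recently pushed ones."""
    newest_first = sorted(images, key=lambda img: img[1], reverse=True)
    return [tag for tag, _ in newest_first[keep:]]


# Hypothetical repository: 35 images, v1 pushed first and v35 most recently
start = datetime(2024, 1, 1)
repo = [(f"v{i}", start + timedelta(days=i)) for i in range(1, 36)]

print(images_to_expire(repo))  # the five oldest tags
```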

5) Network — NAT and Egress #

What most surprises people seeing their first production bill: NAT Gateway and Egress costs.

NAT Gateway cost
Hourly  $0.045
Per GB  $0.045   (processing)
+ Egress $0.09/GB (to internet)

Even in a small system, a single NAT runs ~$32/month before traffic costs. Savings by approach:

| Pattern | Effect |
| --- | --- |
| VPC Endpoint for S3, DynamoDB | Completely free, splits NAT traffic |
| VPC Endpoint for ECR, Logs, Secrets | Hourly ~$0.01 + GB ~$0.01 (cheaper than NAT) |
| CloudFront in front | Origin → CloudFront free, CloudFront → user GB ~$0.085 (region-dependent) |
| Single NAT (dev environment) | Single NAT instead of per-AZ; availability ↓ |
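The prices above can be turned into a quick estimate of when an interface endpoint pays for itself. A sketch under stated assumptions: the rates are the ones quoted in this section, a month is taken as 730 hours, the endpoint's hourly fee is charged per AZ, and the NAT itself stays in place (so only its per-GB processing fee is saved):

```python
NAT_HOURLY_USD = 0.045
NAT_PER_GB_USD = 0.045
ENDPOINT_HOURLY_USD = 0.01   # per interface endpoint, per AZ
ENDPOINT_PER_GB_USD = 0.01
HOURS_PER_MONTH = 730


def nat_monthly(gb: float) -> float:
    """Monthly cost of one NAT Gateway processing `gb` of traffic."""
    return NAT_HOURLY_USD * HOURS_PER_MONTH + NAT_PER_GB_USD * gb


def endpoint_monthly(gb: float, az_count: int) -> float:
    """Monthly cost of one interface endpoint deployed in `az_count` AZs."""
    return ENDPOINT_HOURLY_USD * HOURS_PER_MONTH * az_count + ENDPOINT_PER_GB_USD * gb


def break_even_gb(az_count: int) -> float:
    """Monthly traffic above which the endpoint's fixed fee is recovered."""
    fixed = ENDPOINT_HOURLY_USD * HOURS_PER_MONTH * az_count
    return fixed / (NAT_PER_GB_USD - ENDPOINT_PER_GB_USD)


print(f"idle NAT: {nat_monthly(0):.2f} USD/month")
print(f"endpoint in 2 AZs, 500 GB: {endpoint_monthly(500, 2):.2f} USD/month")
print(f"break-even for 2 AZs: {break_even_gb(2):.0f} GB/month")
```

Under these assumptions an interface endpoint spanning two AZs only wins once a service moves roughly 400 GB a month through it, which is why the free gateway endpoints for S3 and DynamoDB are the first ones to add.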

One-line endpoint #

ECR Interface Endpoint
resource "aws_vpc_endpoint" "ecr_api" {
  vpc_id              = aws_vpc.this.id
  service_name        = "com.amazonaws.${var.region}.ecr.api"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.private[*].id
  security_group_ids  = [aws_security_group.endpoints.id]
  private_dns_enabled = true
}

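When an ECS task pulls images from ECR, the traffic goes through endpoints instead of the NAT, saving both NAT traffic and cost. For that to hold for the whole pull, add the same kind of endpoint for com.amazonaws.${var.region}.ecr.dkr, plus an S3 gateway endpoint, because image layers are served from S3.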

6) Tagging — making cost classifiable #

Without tags, the bill is one undifferentiated lump. With tags, you can slice costs by environment, team, or project.

Default tags
provider "aws" {
  default_tags {
    tags = {
      Environment = var.environment
      Project     = "blog-api"
      ManagedBy   = "terraform"
      CostCenter  = "product-blog"
    }
  }
}

The provider’s default_tags block is applied automatically to every taggable resource that provider manages, which makes it the operational core of cost tagging.

Cost Allocation Tag activation #

Even with tags applied, if they aren’t activated under console Billing → Cost Allocation Tags, Cost Explorer won’t classify them. Go to that settings page, activate each tag, wait ~24h, and they become usable.

Tag enforcement (SCP / IAM Condition) #

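Block resource creation without tags, using an AWS Organizations SCP or an IAM policy Condition. Scope Resource to the taggable resource types: with "*", ec2:RunInstances is also evaluated against the subnet, AMI, and other resources in the request that carry no request tags, so every launch would be denied.

Deny if no tag
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Deny",
    "Action": ["ec2:RunInstances", "rds:CreateDBInstance"],
    "Resource": ["arn:aws:ec2:*:*:instance/*", "arn:aws:rds:*:*:db:*"],
    "Condition": {
      "Null": { "aws:RequestTag/Environment": "true" }
    }
  }]
}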
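The same rule can also be checked before anything reaches AWS, for example in CI against the tags a deployment is about to apply. A minimal sketch; the set of required keys and the function name are assumptions for illustration:

```python
REQUIRED_TAGS = ("Environment", "Project", "CostCenter")  # assumed team convention


def missing_tags(tags: dict[str, str]) -> list[str]:
    """Required tag keys that are absent or empty."""
    return [key for key in REQUIRED_TAGS if not tags.get(key)]


# Hypothetical resources as a pre-deploy check might see them
ok = {"Environment": "prod", "Project": "blog-api", "CostCenter": "product-blog"}
bad = {"Project": "blog-api", "CostCenter": ""}

print(missing_tags(ok))
print(missing_tags(bad))
```

A check like this gives faster feedback than waiting for the Deny at creation time, though it complements the policy rather than replacing it.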

7) Operational cost dashboard #

Add cost widgets to the CloudWatch dashboard:

Cost dashboard widgets
[1] This month accumulated cost (vs last month at the same point)
[2] By service (Fargate / RDS / Logs / NAT / ALB)
[3] By environment (env=prod vs env=staging vs env=dev)
[4] Daily trend (90 days)
[5] Right Sizing recommendation count

Spend 30 minutes on it in the weekly on-call meeting to catch worsening areas early.

Shared responsibility — FinOps #

Large organizations have a dedicated FinOps function that watches costs, but in smaller organizations, developers themselves need to be aware of what their own modules cost. Tags make that possible — being able to see the bill for your own code creates accountability.

Pitfalls — frequent cost traps #

1) Sleeping resources in other regions #

Same pitfall from Basics #1. Use AWS Resource Explorer or Cost Explorer’s per-region view → investigate any regions showing non-zero costs.

2) PoCs without terraform destroy #

Stacks / environments built and forgotten. Tag + auto-cleanup lambda pattern:

Auto-cleanup
EventBridge schedule (daily 9am)
Lambda
   - tag Project=PoC AND CreatedAt < 7 days ago
   - delete resources / notify

3) Free Tier expiry unnoticed #

Basics #3 billing alerts are the first line of defense; Cost Anomaly Detection is the second.

4) 100% Spot causes downtime #

Spot interruption hits multiple tasks at once → service can’t fill desired count → 5xx burst. Always have base on-demand.

5) Multi-AZ RDS doubles cost #

Multi-AZ adds cost pressure on small systems, but single-AZ is a reliability risk. The sensible compromise: single-AZ for dev/staging, Multi-AZ for prod.

6) VPC Endpoint not used #

A simple setup relying solely on NAT means high-traffic resources (Logs, S3) flow through the NAT, exploding cost. Always review this when entering production.

7) Architecture changes after commitment #

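Bought a 3-yr EC2 Instance SP and then moved to another instance family (for example, to Graviton) or to Lambda: the commitment keeps billing even though nothing uses it. A Compute SP follows such moves, but it still bills in full if overall usage shrinks. Start with shorter terms (1-yr), and only commit stable workloads.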


AWS Track 27 Posts Retrospective #

If you were to sum up this track in one line:

“From the 200-service console catalog, picked only the toolbox needed to safely run a small backend.”

Per-series summary #

| Series | Posts | What it gathered |
| --- | --- | --- |
| Basics | 7 | Account / region / IAM / cost / CLI / security / logs — the map before entering the console |
| Intermediate | 7 | EC2 / VPC / S3 / RDS / Route 53 / ALB / CloudFront — the operational skeleton |
| Advanced | 7 | ECS / ECR / Lambda / API Gateway / EventBridge / Secrets / Step Functions — the modern backend domain |
| Practice | 6 | All as one system — Fargate / RDS / CI/CD / IaC / Monitoring / Cost |

Each series stands on its own, but when the four series come together as one system, something different emerges — a production-ready backend.

AWS’s essence in one place #

AWS isn’t a service catalog. It’s lego — stacking blocks on blocks.

Layers stacked in this track
        ┌─────────────────────────────────────┐
        │ FinOps                              │   ← #6
        │ Cost / Tagging / Commitment         │
        ├─────────────────────────────────────┤
        │ Observability                       │   ← #5
        │ Logs / Metrics / Traces             │
        ├─────────────────────────────────────┤
        │ Automation                          │   ← #3, #4
        │ CI/CD / IaC                         │
        ├─────────────────────────────────────┤
        │ Data                                │   ← #2 + Intermediate #4
        │ RDS / Secrets                       │
        ├─────────────────────────────────────┤
        │ Compute                             │   ← #1 + Advanced #1~7
        │ ECS / Lambda                        │
        ├─────────────────────────────────────┤
        │ Network                             │   ← Intermediate #1, 6, 7
        │ VPC / ALB / CloudFront              │
        ├─────────────────────────────────────┤
        │ Control plane                       │   ← All of Basics
        │ Account / IAM / Cost / Security     │
        └─────────────────────────────────────┘

Read the layers from bottom to top — control plane → network → compute → data → automation → observability → FinOps — and that’s the natural order in which operations evolve. Every new system you build, you’ll walk through this progression again.

Areas this track didn’t cover #

Areas that naturally lead to next tracks:

  • Container standardization — the Docker track is next. Multi-stage builds / slimming / security scans / multi-arch — the depth of the image itself running on Fargate.
  • Kubernetes — the next step after ECS. Multi-cluster / GitOps / service mesh — natural evolution as traffic grows.
  • Certifications — same domain seen from the exam angle. The roadmap’s Cloud Practitioner / SAA / DVA series.
  • DataOps / ML — Glue / SageMaker / Athena. Once data grows, it goes there.
  • Multi-cloud / hybrid — integration with Azure / GCP / on-prem. Something you meet in big organizations.

Each area deserves its own track.

Wrapping up the track #

If you’ve followed along to this post, finding what to look for and where in the AWS console should now be muscle memory. That was the real goal of this track. New services and features are added every year, but if you know which layer a new tool belongs to, it quickly finds its place.

What I recommend next #

  1. Docker track — go deep on the container itself this series depended on. Multi-stage / security / multi-arch / compose, 24 posts.
  2. Certifications — Cloud Practitioner / SAA / DVA — same tools from the exam angle. It pays off in interviews / job change / internal review.
  3. Your own project — the fastest way to make read knowledge stick to your hands. Spin up a small side project with this track’s patterns. #1’s infrastructure becomes the starting point almost as is.

Thank you for following along this long track. Facing AWS’s 200-service catalog, you now stand at a position of confidence — even unfamiliar tools come with a sense of place: “this belongs in the compute layer, that’s network.” From there, real operations begin, and this track has walked alongside you to that starting line.

Until the next track.
