AWS in Practice #6: Cost Optimization and Dashboards — Wrapping Up the Track

Infrastructure AWS Cost optimization FinOps

Thursday, May 7, 2026

11 min read

In #1 through #5, infrastructure, DB, CI/CD, IaC, and monitoring have come together into an operationally-ready system. The last topic remaining: how much it’s costing, and how to reduce that cost.

Half of this post is cost optimization, half is a 27-post AWS track retrospective.

Where the bill leaks #

In Basics #3 cost management we covered the basics of billing alerts and Cost Explorer. This post builds on that and focuses on actually reducing the cost of a production system.

Typical monthly bill breakdown for a small production system (ECS Fargate + RDS + ALB + CloudFront + Logs):

Resource	Ratio	Notes
ECS Fargate (vCPU + memory hours)	30–50%	The biggest
RDS (instance + Storage + IO)	20–30%	2x with Multi-AZ
NAT Gateway / Egress	10–20%	Often forgotten
ALB / Traffic	5–10%	Hours + LCU
CloudWatch Logs / Metrics	5–10%	Explodes when retention is missing
S3 / ECR	2–5%	Image / object accumulation
Other	5%	DNS, KMS, Secrets, …

If this table looks familiar — the patterns below can help.

1) Cost Explorer — start by finding where money goes #

Cost Explorer slices and dices the bill. In the console:

Frequent analyses

1) By service       — Fargate vs RDS vs Logs (the biggest)
2) By tag           — env=prod vs env=dev (cost split by environment)
3) By usage type    — DataTransfer-Out-Bytes vs BoxUsage etc.
4) By region        — sleeping resources in other regions ([#1 pitfalls](/en/posts/aws-basics-1-account-region-az))
5) Time trend       — what suddenly went up since yesterday

Or via CLI #

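This month's cost by service

# End is exclusive and Start must be earlier than End,
# so this call fails on the 1st of the month (Start == End).
aws ce get-cost-and-usage \
  --time-period Start=$(date -u +%Y-%m-01),End=$(date -u +%Y-%m-%d) \
  --granularity DAILY \
  --metrics BlendedCost \
  --group-by Type=DIMENSION,Key=SERVICE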
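
The call above returns JSON with one entry per time period, each holding per-service groups. Here is a minimal Python sketch of rolling that up into a per-service ranking, assuming the response has already been parsed into a dict. The `sample` data is invented for illustration; only the `ResultsByTime` / `Groups` / `Metrics` layout follows the Cost Explorer response shape.

```python
from collections import defaultdict

def cost_by_service(response, metric="BlendedCost"):
    """Sum a get-cost-and-usage response per service, largest first."""
    totals = defaultdict(float)
    for period in response.get("ResultsByTime", []):
        for group in period.get("Groups", []):
            service = group["Keys"][0]  # single key: we group by SERVICE only
            totals[service] += float(group["Metrics"][metric]["Amount"])
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical two-day sample, not real billing data.
sample = {
    "ResultsByTime": [
        {
            "TimePeriod": {"Start": "2026-05-01", "End": "2026-05-02"},
            "Groups": [
                {"Keys": ["Amazon Elastic Container Service"],
                 "Metrics": {"BlendedCost": {"Amount": "4.50", "Unit": "USD"}}},
                {"Keys": ["Amazon Relational Database Service"],
                 "Metrics": {"BlendedCost": {"Amount": "3.00", "Unit": "USD"}}},
            ],
        },
        {
            "TimePeriod": {"Start": "2026-05-02", "End": "2026-05-03"},
            "Groups": [
                {"Keys": ["Amazon Elastic Container Service"],
                 "Metrics": {"BlendedCost": {"Amount": "4.50", "Unit": "USD"}}},
                {"Keys": ["Amazon Relational Database Service"],
                 "Metrics": {"BlendedCost": {"Amount": "3.25", "Unit": "USD"}}},
            ],
        },
    ]
}

for service, amount in cost_by_service(sample):
    print(f"{service:40s} ${amount:8.2f}")
```

Summing across periods means the same function works whether the query used DAILY or MONTHLY granularity.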

Cost Anomaly Detection #

ML-based outlier detection. It flags spend that diverges from normal patterns; alerts are delivered once an anomaly subscription (email or SNS) is attached to the monitor.

Anomaly monitor (per-service)

aws ce create-anomaly-monitor \
  --anomaly-monitor '{
    "MonitorName": "blog-services",
    "MonitorType": "DIMENSIONAL",
    "MonitorDimension": "SERVICE"
  }'

If Basics #3 billing alerts fire “when a threshold is crossed,” Cost Anomaly alerts fire “when usage diverges from normal” — better at catching subtle leaks.
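
To make that difference concrete, here is a small sketch comparing a fixed threshold with a deviation-from-baseline check. Cost Anomaly Detection uses its own ML models; the "mean plus k standard deviations" rule below is only a simple stand-in, and the daily figures are invented.

```python
from statistics import mean, stdev

def over_threshold(today, threshold):
    """Billing-alert style check: fires only past a fixed amount."""
    return today > threshold

def deviates_from_baseline(history, today, k=3.0):
    """True when today's cost is more than k standard deviations above the recent mean."""
    baseline = mean(history)
    spread = stdev(history)
    return today > baseline + k * spread

history = [10.2, 9.8, 10.1, 10.4, 9.9, 10.0, 10.3]  # hypothetical daily cost, USD
today = 13.0

print(over_threshold(today, threshold=20.0))    # False: still under the fixed limit
print(deviates_from_baseline(history, today))   # True: well above the usual ~$10/day
```

A 30% jump stays invisible to the $20 threshold, while the baseline check picks it up on the first day.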

2) Compute cost — three Fargate levers #

A) Right Sizing — only what’s actually needed #

Look at the average CPU / memory in CloudWatch Container Insights (#5) and adjust task size.

Right Sizing

Current: cpu=1024, memory=2048
Observed: avg CPU 15%, p95 35%, memory avg 30%
Adjusted: cpu=512, memory=1024  → 50% cost reduction

A healthy average CPU utilization is 30–50%. Below 20%, the task is oversized (but still leave headroom for bursts).

For small environments, Compute Optimizer generates right-sizing recommendations automatically once you opt in from the console.
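
The 50% figure in the right-sizing example can be checked with a few lines of arithmetic. This is a sketch under assumptions: the per-hour rates are assumed on-demand Linux/x86 list prices used only for illustration, and 730 is the usual hours-per-month approximation. Because Fargate bills linearly per vCPU and per GB, the percentage holds whatever the actual rates are.

```python
# Assumed rates (USD); check the pricing page for your region.
VCPU_HOUR = 0.04048
GB_HOUR = 0.004445
HOURS_PER_MONTH = 730

def monthly_task_cost(cpu_units, memory_mib, tasks=1):
    """Monthly cost of always-on Fargate tasks; cpu_units uses the ECS scale (1024 = 1 vCPU)."""
    vcpu = cpu_units / 1024
    gb = memory_mib / 1024
    return tasks * HOURS_PER_MONTH * (vcpu * VCPU_HOUR + gb * GB_HOUR)

before = monthly_task_cost(1024, 2048)
after = monthly_task_cost(512, 1024)
print(f"before ${before:.2f}, after ${after:.2f}, saved {1 - after / before:.0%}")
```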

B) Fargate Spot — 70% cheaper #

Batch / restartable tasks are a good fit for Fargate Spot, which is discounted by up to 70% compared with on-demand Fargate:

capacity provider strategy

resource "aws_ecs_service" "this" {
  # ... other service arguments omitted

  capacity_provider_strategy {
    capacity_provider = "FARGATE"
    base              = 2     # the first 2 tasks always run on-demand
    weight            = 1
  }

  capacity_provider_strategy {
    capacity_provider = "FARGATE_SPOT"
    weight            = 4     # beyond the base, 4 Spot tasks per on-demand task
  }
}

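The pattern above: 2 tasks always run on-demand, and beyond that tasks are split 4:1 in favor of Spot. As load drops, Spot cleans up first.

When a Spot interruption occurs, the task gets a two-minute warning before it is stopped, and ECS launches replacement tasks. Capacity can still dip while replacements start, or if no Spot capacity is available. For production traffic, run only a portion on Spot; 100% Spot is high risk.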
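
To see what the base/weight settings mean for a given desired count, here is a rough sketch of the split. It is an approximation for reasoning about the ratio, not ECS's actual placement logic: the rounding rule (remainders go to on-demand) is an assumption, and `split_tasks` is a hypothetical helper.

```python
def split_tasks(desired, base=2, on_demand_weight=1, spot_weight=4):
    """Approximate split of `desired` tasks into (on_demand, spot)."""
    on_demand = min(desired, base)            # base tasks go to FARGATE first
    remaining = desired - on_demand
    total_weight = on_demand_weight + spot_weight
    # Remaining tasks follow the weight ratio; ceiling division rounds toward on-demand.
    extra_on_demand = (remaining * on_demand_weight + total_weight - 1) // total_weight
    return on_demand + extra_on_demand, remaining - extra_on_demand

for desired in (1, 2, 7, 12):
    on_demand, spot = split_tasks(desired)
    print(f"desired={desired:2d} -> on-demand={on_demand}, spot={spot}")
```

With 12 desired tasks this gives 4 on-demand and 8 Spot, so about two thirds of the fleet can be interrupted at once. That is the number to keep in mind when sizing the base.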

C) Graviton (ARM) — 20% cheaper + 20% faster #

db.t4g.* (RDS), Fargate ARM option, EC2 Graviton (m7g, c7g) — AWS’s ARM chips. If your container image can build for ARM, there’s no reason not to use it.

Multi-arch build

# Build
docker buildx build --platform linux/amd64,linux/arm64 \
  -t $REPO/blog-api:v1 --push .

Fargate ARM64

resource "aws_ecs_task_definition" "this" {
  cpu    = "512"
  memory = "1024"

  runtime_platform {
    cpu_architecture        = "ARM64"
    operating_system_family = "LINUX"
  }
}

Prerequisite: all libraries must be ARM-compatible. Most Python, Node, and Go packages are. Some packages with native bindings will need verification.

3) Savings Plans / Reserved Capacity #

Commitment discounts for Fargate / EC2 / Lambda.

Type	Discount	Commitment
Compute Savings Plan	Up to 66%	1 year / 3 year, $/h commitment
EC2 Instance SP	Up to 72%	Commits to instance family
RDS Reserved	Up to 65%	Instance class + region

Compute SP is the most flexible option (covers Fargate, EC2, and Lambda). Consider it once you reach stable production — never commit early when traffic or architecture is still in flux.

Guideline

Production start ~ 3 months    : no commitment (fast-changing phase)
3 months ~ 6 months            : usage analysis, start considering 1-yr SP
6 months +                     : 1-yr commit at 60–70% of stable usage

Committing 100% becomes costly if traffic drops. Always leave a safety margin.
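
The effect of over-committing can be sketched with a simple model. Assumptions: a hypothetical 30% discount, a stable on-demand spend of $1.00/hour, and a Savings Plan treated as "pay the commitment every hour, then pay on-demand rates for whatever it does not cover". The helper names are illustrative only.

```python
DISCOUNT = 0.30        # hypothetical Savings Plan discount
STABLE_USAGE = 1.00    # hypothetical stable on-demand spend, USD/hour

def hourly_cost(usage, commit, discount=DISCOUNT):
    """Hourly bill: the commitment is always paid, overflow is billed on-demand."""
    covered = commit / (1 - discount)       # on-demand usage the commitment absorbs
    overflow = max(0.0, usage - covered)
    return commit + overflow

def commit_for(coverage):
    """Hourly commitment that covers `coverage` (0..1) of the stable usage."""
    return STABLE_USAGE * coverage * (1 - DISCOUNT)

for coverage in (0.65, 1.00):
    commit = commit_for(coverage)
    normal = hourly_cost(STABLE_USAGE, commit)
    dropped = hourly_cost(STABLE_USAGE * 0.5, commit)   # traffic halves
    print(f"coverage {coverage:.0%}: normal ${normal:.3f}/h, after 50% drop ${dropped:.3f}/h")
```

In this model, a 100% commitment costs $0.70/hour even after usage falls to $0.50/hour of on-demand value, while the 65% commitment still comes in below the on-demand price.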

4) Storage / Logs — where leaks happen most #

CloudWatch Logs #

The retention emphasized in #5. Apply to all groups:

30 days uniformly with Terraform

resource "aws_cloudwatch_log_group" "ecs" {
  for_each          = toset(["/ecs/blog-api", "/ecs/blog-api-migrate"])
  name              = each.key
  retention_in_days = 30
}
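
A quick way to see why retention matters is to compare stored volume with and without a cap. This is a simplified sketch: the storage price is an assumed figure, the 2 GB/day ingest is invented, and compression and ingestion charges are ignored.

```python
STORAGE_PER_GB_MONTH = 0.03   # assumed CloudWatch Logs storage price, USD

def stored_gb(daily_ingest_gb, days_running, retention_days=None):
    """GB held in a log group after `days_running`; None means logs never expire."""
    if retention_days is None:
        kept_days = days_running
    else:
        kept_days = min(days_running, retention_days)
    return daily_ingest_gb * kept_days

for months in (1, 6, 12, 24):
    days = months * 30
    forever = stored_gb(2, days) * STORAGE_PER_GB_MONTH
    capped = stored_gb(2, days, retention_days=30) * STORAGE_PER_GB_MONTH
    print(f"month {months:2d}: no retention ${forever:6.2f}/mo, 30-day retention ${capped:5.2f}/mo")
```

Without retention the storage line keeps growing every month; with a 30-day cap it flattens after the first month.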

S3 #

Auto-tier old objects to cheaper classes:

S3 lifecycle

resource "aws_s3_bucket_lifecycle_configuration" "logs" {
  bucket = aws_s3_bucket.logs.id
  rule {
    id     = "to-ia-then-glacier"
    status = "Enabled"
    filter {} # empty filter: the rule applies to every object
    transition {
      days          = 30
      storage_class = "STANDARD_IA"
    }
    transition {
      days          = 90
      storage_class = "GLACIER"
    }
    expiration { days = 365 }
  }
}

ECR #

Auto-delete old images:

ECR lifecycle

resource "aws_ecr_lifecycle_policy" "blog_api" {
  repository = aws_ecr_repository.blog_api.name
  policy = jsonencode({
    rules = [{
      rulePriority = 1
      description  = "Keep only the latest 30"
      selection    = { tagStatus = "any", countType = "imageCountMoreThan", countNumber = 30 }
      action       = { type = "expire" }
    }]
  })
}

5) Network — NAT and Egress #

What most surprises people seeing their first production bill: NAT Gateway and Egress costs.

NAT Gateway cost

Hourly  $0.045
Per GB  $0.045   (processing)
+ Egress $0.09/GB (to internet)

Even in a small system, a single NAT runs about $33/month (0.045 × 730 hours) before traffic costs. Savings by approach:

Pattern	Effect
Gateway VPC Endpoint for S3, DynamoDB	No charge, and takes that traffic off the NAT
Interface VPC Endpoint for ECR, Logs, Secrets	Hourly ~$0.01 per AZ + GB ~$0.01 (cheaper per GB than NAT)
CloudFront in front	Origin → CloudFront free, CloudFront → user GB ~$0.085 (region-dependent)
Single NAT (dev environment)	Single NAT instead of per-AZ, at the cost of availability
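
The NAT and endpoint numbers above can be turned into a quick break-even check. This is a sketch using the prices quoted in this section; the 2-AZ endpoint deployment and the 730 hours per month are assumptions.

```python
HOURS_PER_MONTH = 730

def nat_monthly(gb, hourly=0.045, per_gb=0.045):
    """One NAT Gateway: fixed hourly charge plus per-GB processing."""
    return hourly * HOURS_PER_MONTH + gb * per_gb

def interface_endpoint_monthly(gb, azs=2, hourly=0.01, per_gb=0.01):
    """One interface endpoint: billed per AZ per hour plus per-GB processing."""
    return hourly * azs * HOURS_PER_MONTH + gb * per_gb

def break_even_gb(azs=2, nat_per_gb=0.045, endpoint_hourly=0.01, endpoint_per_gb=0.01):
    """Monthly GB at which moving traffic from NAT to an interface endpoint pays off."""
    fixed = endpoint_hourly * azs * HOURS_PER_MONTH
    return fixed / (nat_per_gb - endpoint_per_gb)

print(f"idle NAT:            ${nat_monthly(0):.2f}/month")
print(f"idle endpoint (2AZ): ${interface_endpoint_monthly(0):.2f}/month")
print(f"break-even traffic:  {break_even_gb():.0f} GB/month")
```

Under these assumptions, an interface endpoint only pays for itself once roughly 400 GB a month moves off the NAT, so it is worth adding per service based on measured traffic. Gateway endpoints for S3 and DynamoDB have no such threshold because they carry no charge.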

One-line endpoint #

ECR Interface Endpoint

resource "aws_vpc_endpoint" "ecr_api" {
  vpc_id              = aws_vpc.this.id
  service_name        = "com.amazonaws.${var.region}.ecr.api"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.private[*].id
  security_group_ids  = [aws_security_group.endpoints.id]
  private_dns_enabled = true
}

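When an ECS task pulls images from ECR, the traffic goes through the endpoint instead of the NAT, which cuts NAT data-processing cost. A full image pull also needs the ecr.dkr interface endpoint and the S3 gateway endpoint, because image layers are served from S3.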

6) Tagging — making cost classifiable #

Without tags, the bill is one undifferentiated lump. With tags, you can slice costs by environment, team, or project.

Default tags

provider "aws" {
  default_tags {
    tags = {
      Environment = var.environment
      Project     = "blog-api"
      ManagedBy   = "terraform"
      CostCenter  = "product-blog"
    }
  }
}

The provider’s default_tags block is applied automatically to every taggable resource that provider manages, which makes it the operational core of cost tagging.

Cost Allocation Tag activation #

Even with tags applied, if they aren’t activated under console Billing → Cost Allocation Tags, Cost Explorer won’t classify them. Go to that settings page, activate each tag, wait ~24h, and they become usable.

Tag enforcement (SCP / IAM Condition) #

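Block resource creation without tags, using an AWS Organizations SCP or an IAM policy Condition. Scope the statement to the resource types that carry the tag: with "Resource": "*", ec2:RunInstances would also be denied for the subnet, AMI, and other resources in the request that never receive request tags.

Deny if no tag

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Deny",
    "Action": ["ec2:RunInstances", "rds:CreateDBInstance"],
    "Resource": [
      "arn:aws:ec2:*:*:instance/*",
      "arn:aws:rds:*:*:db:*"
    ],
    "Condition": {
      "Null": { "aws:RequestTag/Environment": "true" }
    }
  }]
}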

7) Operational cost dashboard #

Add cost widgets to the CloudWatch Dashboard:

Cost dashboard widgets

[1] This month accumulated cost (vs last month at the same point)
[2] By service (Fargate / RDS / Logs / NAT / ALB)
[3] By environment (env=prod vs env=staging vs env=dev)
[4] Daily trend (90 days)
[5] Right Sizing recommendation count

Spend 30 minutes a week in the on-call meeting reviewing these widgets to catch worsening areas early.
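
Widget [1] boils down to comparing spend so far this month with the same number of days last month. Here is a minimal sketch of that calculation; the daily figures are invented, and in practice they would come from Cost Explorer daily data.

```python
def month_to_date_delta(this_month_daily, last_month_daily):
    """Compare spend so far this month with the same number of days last month."""
    days = len(this_month_daily)
    current = sum(this_month_daily)
    previous = sum(last_month_daily[:days])
    change = (current - previous) / previous if previous else None
    return current, previous, change

# Hypothetical daily costs in USD.
this_month = [10, 11, 12]
last_month = [10, 10, 10, 10, 10, 10]

current, previous, change = month_to_date_delta(this_month, last_month)
print(f"month to date ${current}, last month at this point ${previous}, change {change:+.0%}")
```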

Shared responsibility — FinOps #

Large organizations have a dedicated FinOps function that watches costs, but in smaller organizations, developers themselves need to be aware of what their own modules cost. Tags make that possible — being able to see the bill for your own code creates accountability.

Pitfalls — frequent cost traps #

1) Sleeping resources in other regions #

Same pitfall from Basics #1. Use AWS Resource Explorer or Cost Explorer’s per-region view → investigate any regions showing non-zero costs.

2) PoCs without `terraform destroy` #

Stacks / environments built and forgotten. Tag + auto-cleanup lambda pattern:

Auto-cleanup

EventBridge schedule (daily 9am)
    │
    ▼
Lambda
   - tag Project=PoC AND CreatedAt older than 7 days
   - delete resources / notify
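
The selection step inside that Lambda can be sketched as a pure function. Assumptions: resources arrive as dicts with an `arn` and a `tags` mapping (in a real Lambda the list would come from an AWS API such as the Resource Groups Tagging API), `CreatedAt` holds an ISO 8601 timestamp with a UTC offset, and the ARNs below are made up.

```python
from datetime import datetime, timedelta, timezone

def expired_pocs(resources, now, max_age_days=7):
    """Return ARNs tagged Project=PoC whose CreatedAt tag is older than max_age_days."""
    cutoff = now - timedelta(days=max_age_days)
    expired = []
    for res in resources:
        tags = res.get("tags", {})
        if tags.get("Project") != "PoC":
            continue
        created_raw = tags.get("CreatedAt")
        if created_raw is None:
            continue  # no creation tag: skip rather than guess
        if datetime.fromisoformat(created_raw) < cutoff:
            expired.append(res["arn"])
    return expired

# Hypothetical inventory for illustration.
now = datetime(2026, 5, 7, 9, 0, tzinfo=timezone.utc)
inventory = [
    {"arn": "arn:aws:ecs:us-east-1:123456789012:service/poc/old-demo",
     "tags": {"Project": "PoC", "CreatedAt": "2026-04-20T09:00:00+00:00"}},
    {"arn": "arn:aws:ecs:us-east-1:123456789012:service/poc/new-demo",
     "tags": {"Project": "PoC", "CreatedAt": "2026-05-05T09:00:00+00:00"}},
    {"arn": "arn:aws:ecs:us-east-1:123456789012:service/prod/blog-api",
     "tags": {"Project": "blog-api", "CreatedAt": "2025-01-01T00:00:00+00:00"}},
]

for arn in expired_pocs(inventory, now):
    print("candidate for cleanup:", arn)
```

Starting in notify-only mode and switching to deletion once the tag data is trustworthy keeps a mis-tagged resource from being removed by accident.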

3) Free Tier expiry unnoticed #

Basics #3 billing alerts are the first line of defense; Cost Anomaly Detection is the second.

4) 100% Spot causes downtime #

Spot interruption hits multiple tasks at once → service can’t fill desired count → 5xx burst. Always have base on-demand.

5) Multi-AZ RDS doubles cost #

Multi-AZ adds cost pressure on small systems, but single-AZ is a reliability risk. The sensible compromise: single-AZ for dev/staging, Multi-AZ for prod.

6) VPC Endpoint not used #

A simple setup relying solely on NAT means high-traffic resources (Logs, S3) flow through the NAT, exploding cost. Always review this when entering production.

7) Architecture changes after commitment #

Bought a 3-yr SP and immediately moved to ARM / Lambda — the commitment still bills regardless. Start with shorter terms (1-yr), and only commit stable workloads.

AWS Track 27 Posts Retrospective #

If you were to sum up this track in one line:

“From the 200-service console catalog, we picked only the tools needed to safely run a small backend.”

Per-series summary #

Series	Posts	What it covers
Basics	7	Account / region / IAM / cost / CLI / security / logs — the map before entering the console
Intermediate	7	EC2 / VPC / S3 / RDS / Route 53 / ALB / CloudFront — the operational skeleton
Advanced	7	ECS / ECR / Lambda / API Gateway / EventBridge / Secrets / Step Functions — the modern backend domain
Practice	6	All as one system — Fargate / RDS / CI/CD / IaC / Monitoring / Cost

Each series stands on its own, but when the four series come together as one system, something different emerges — a production-ready backend.

AWS’s essence in one place #

AWS isn’t a service catalog. It’s lego — stacking blocks on blocks.

Layers stacked in this track

        ┌─────────────────────────────────────┐
        │ FinOps                              │   ← #6
        │ Cost / Tagging / Commitment         │
        ├─────────────────────────────────────┤
        │ Observability                       │   ← #5
        │ Logs / Metrics / Traces             │
        ├─────────────────────────────────────┤
        │ Automation                          │   ← #3, #4
        │ CI/CD / IaC                         │
        ├─────────────────────────────────────┤
        │ Data                                │   ← #2 + Intermediate #4
        │ RDS / Secrets                       │
        ├─────────────────────────────────────┤
        │ Compute                             │   ← #1 + Advanced #1~7
        │ ECS / Lambda                        │
        ├─────────────────────────────────────┤
        │ Network                             │   ← Intermediate #1, 6, 7
        │ VPC / ALB / CloudFront              │
        ├─────────────────────────────────────┤
        │ Control plane                       │   ← All of Basics
        │ Account / IAM / Cost / Security     │
        └─────────────────────────────────────┘

Read the layers from bottom to top — control plane → network → compute → data → automation → observability → FinOps — and that’s the natural order in which operations evolve. Every new system you build, you’ll walk through this progression again.

Areas this track didn’t cover #

Areas that naturally lead to next tracks:

Container standardization — Docker track is next. Multi-stage builds / slimming / security scans / multi-arch — the depth of the image itself running on Fargate.
Kubernetes — the next step after ECS. Multi-cluster / GitOps / service mesh — natural evolution as traffic grows.
Certifications — same domain seen from the exam angle. The roadmap’s Cloud Practitioner / SAA / DVA series.
DataOps / ML — Glue / SageMaker / Athena. Once data grows, it goes there.
Multi-cloud / hybrid — integration with Azure / GCP / on-prem. Something you meet in big organizations.

Each area deserves its own track.

Wrapping up the track #

If you’ve followed along to this post, finding what to look for and where in the AWS console should now be muscle memory. That was the real goal of this track. New services and features are added every year, but if you know which layer a new tool belongs to, it quickly finds its place.

What I recommend next #

Docker track: go deep on the container itself, which this series depended on. Multi-stage / security / multi-arch / compose, 24 posts.
Certifications: Cloud Practitioner / SAA / DVA, the same tools from the exam angle. It pays off in interviews, job changes, and internal reviews.
Your own project: the fastest way to make what you have read stick. Spin up a small side project using this track’s patterns. The infrastructure from #1 works as a starting point almost as is.

Thank you for following along this long track. Facing AWS’s 200-service catalog, you now stand at a position of confidence — even unfamiliar tools come with a sense of place: “this belongs in the compute layer, that’s network.” From there, real operations begin, and this track has walked alongside you to that starting line.

Until the next track.
