AWS in Practice #5: Monitoring — CloudWatch Alarms and X-Ray

In #1 through #4, infrastructure became code and deployment became automated. But we still have no single place to see whether this system is running well — whether 5xx errors are climbing, RDS CPU is at 80%, or which request took 5 seconds.

This post builds that visibility.

  • CloudWatch Logs + Logs Insights operational queries
  • CloudWatch Metrics — ECS / RDS / ALB core metrics and alarm thresholds
  • The flow of alarm → SNS → Slack
  • X-Ray — distributed tracing for “where is it slow” in one line
  • Dashboard — system state in one screen

The big picture — the 4 pillars of monitoring #

The pillars of observability
┌──────────────┬──────────────┬──────────────┬──────────────┐
│   Metrics    │     Logs     │    Traces    │    Events    │
├──────────────┼──────────────┼──────────────┼──────────────┤
│ "How much"   │ "What"       │ "Where"      │ "When"       │
│ requests, 5xx│ stacktrace   │ DB took 5s   │ deploy, scale│
│ CPU, memory  │ access log   │ external 1s  │ failover     │
├──────────────┼──────────────┼──────────────┼──────────────┤
│ CloudWatch   │ CloudWatch   │ X-Ray        │ EventBridge  │
│ Metrics      │ Logs         │              │              │
└──────────────┴──────────────┴──────────────┴──────────────┘

This post: Metrics + Logs + Traces — three pillars. Events were already in Advanced #5.

1) CloudWatch Logs — already flowing #

#1’s Task Definition has awslogs baked in, so all container stdout/stderr automatically goes to CloudWatch Logs.

Log hierarchy
Log Group: /ecs/blog-api
   ├── Log Stream: api/<task-id-1>     ← one stream per task
   ├── Log Stream: api/<task-id-2>
   └── Log Stream: api/<task-id-3>

Retention setup — cost discipline #

The default is infinite retention. Even modest traffic can accumulate GBs of logs per month, causing costs to balloon.

30-day retention
aws logs put-retention-policy \
  --log-group-name /ecs/blog-api \
  --retention-in-days 30

Recommendations:

  • Production access logs: 30–90 days
  • Debug / verbose: 7 days
  • Audit logs: 365 days (or export to S3 then delete)
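The recommendations above can be applied in bulk rather than one log group at a time. The following is a minimal sketch, not a definitive tool: the prefix map and the `/debug/` and `/audit/` prefixes are hypothetical naming conventions, and `logs_client` is assumed to be a boto3 CloudWatch Logs client.

```python
# Hypothetical prefix-to-retention map; adjust to your own naming scheme.
RETENTION_BY_PREFIX = {
    "/ecs/": 30,     # production access / app logs
    "/debug/": 7,    # verbose output
    "/audit/": 365,  # audit trail
}
DEFAULT_RETENTION_DAYS = 30


def pick_retention(log_group_name):
    """Return retention days for a log group based on its name prefix."""
    for prefix, days in RETENTION_BY_PREFIX.items():
        if log_group_name.startswith(prefix):
            return days
    return DEFAULT_RETENTION_DAYS


def groups_without_retention(log_groups):
    """Names of log groups that never expire.

    describe_log_groups omits the retentionInDays key when no policy is set.
    """
    return [g["logGroupName"] for g in log_groups if "retentionInDays" not in g]


def apply_retention(logs_client):
    """Set a retention policy on every log group that lacks one."""
    paginator = logs_client.get_paginator("describe_log_groups")
    for page in paginator.paginate():
        for name in groups_without_retention(page["logGroups"]):
            logs_client.put_retention_policy(
                logGroupName=name, retentionInDays=pick_retention(name)
            )
```

Typical use would be `apply_retention(boto3.client("logs"))`. Log groups that already have a policy are left alone, so the script is safe to re-run.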

Structured logs are key #

Plain print() output is hard to search. JSON format lets Logs Insights query on individual keys.

FastAPI — JSON logging
import logging, json, sys

class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "level":   record.levelname,
            "logger":  record.name,
            "message": record.getMessage(),
            "ts":      self.formatTime(record),
            **getattr(record, "extra", {}),
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

Per-request log
import time

# app is the FastAPI instance
@app.middleware("http")
async def access_log(request, call_next):
    start = time.perf_counter()
    response = await call_next(request)
    logging.info("access", extra={"extra": {
        "method": request.method,
        "path": request.url.path,
        "status": response.status_code,
        "duration_ms": int((time.perf_counter() - start) * 1000),
        # request_id is assumed to be set by an earlier middleware; None if absent
        "request_id": getattr(request.state, "request_id", None),
    }})
    return response

Logging in this format lets Logs Insights answer queries like those below precisely.
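To see what a single emitted line looks like, here is a small self-contained sketch that runs the same formatter against an in-memory buffer instead of stdout. The logger name `access-demo` and the sample field values are made up for illustration.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Same idea as the formatter above: one JSON object per log line."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "ts": self.formatTime(record),
            **getattr(record, "extra", {}),
        })


# Log into a buffer so the emitted line can be inspected.
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("access-demo")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.info("access", extra={"extra": {
    "method": "GET", "path": "/posts", "status": 503, "duration_ms": 1240,
}})

line = json.loads(buffer.getvalue())
print(line["status"], line["path"], line["duration_ms"])  # prints: 503 /posts 1240
```

Because `status` and `duration_ms` are top-level JSON keys with numeric values, Logs Insights can filter and aggregate on them directly.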

2) Logs Insights — 7 operational queries #

Frequently used queries — bookmark them.

A) Filter just 5xx #

fields @timestamp, status, path, request_id, message
| filter status >= 500
| sort @timestamp desc
| limit 100

B) Response time distribution (p50/p90/p99) #

fields @timestamp, duration_ms
| filter ispresent(duration_ms)
| stats
    count(*) as requests,
    pct(duration_ms, 50) as p50,
    pct(duration_ms, 90) as p90,
    pct(duration_ms, 99) as p99
  by bin(5m)

C) Slowest paths #

fields path, duration_ms
| filter duration_ms > 1000
| stats count(*) as slow_requests, avg(duration_ms) as avg_ms, max(duration_ms) as max_ms by path
| sort avg_ms desc
| limit 20

D) Trace one request by request_id #

fields @timestamp, level, message, path, status, duration_ms
| filter request_id = "abc-123-xyz"
| sort @timestamp asc

E) Lines with stacktrace #

fields @timestamp, message
| filter @message like /(?i)traceback|exception/
| sort @timestamp desc

F) Login attempts (auth failures) #

fields @timestamp, source_ip, username
| filter event = "auth_fail"
| stats count(*) as failures by source_ip
| sort failures desc

G) Cost — which path gets called the most #

fields path
| stats count(*) as calls by path
| sort calls desc
| limit 30

Saved Queries #

Save frequently used queries in the console so the whole team can share them. Saved queries can also be managed as code through CloudFormation or Terraform (aws_cloudwatch_query_definition).
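The same queries can also be run from a script, for example for a daily report. The sketch below assumes `logs_client` is a boto3 CloudWatch Logs client; the function name and the polling interval are illustrative choices, not part of any AWS tooling.

```python
import time

QUERY_5XX = """
fields @timestamp, status, path, request_id
| filter status >= 500
| sort @timestamp desc
| limit 100
"""


def run_insights_query(logs_client, log_group, query, minutes=60, poll_seconds=1.0):
    """Start a Logs Insights query and wait until it finishes.

    Returns one dict per result row, e.g. {"status": "502", "path": "/posts"}.
    """
    now = int(time.time())
    started = logs_client.start_query(
        logGroupName=log_group,
        startTime=now - minutes * 60,
        endTime=now,
        queryString=query,
    )
    while True:
        response = logs_client.get_query_results(queryId=started["queryId"])
        if response["status"] not in ("Scheduled", "Running"):
            break
        time.sleep(poll_seconds)
    if response["status"] != "Complete":
        raise RuntimeError(f"query ended with status {response['status']}")
    # Each row is a list of {"field": ..., "value": ...} pairs.
    return [{cell["field"]: cell["value"] for cell in row}
            for row in response["results"]]
```

A call would look like `run_insights_query(boto3.client("logs"), "/ecs/blog-api", QUERY_5XX)`. Note that the API returns every value as a string, so numeric fields need converting before any arithmetic.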

3) CloudWatch Metrics — the core #

ECS Container Insights #

Default ECS metrics are sparse. Enable Container Insights to get task- and service-level CPU, memory, network, disk, and running task counts all at once.

Enable Container Insights
aws ecs update-cluster-settings \
  --cluster blog-cluster \
  --settings name=containerInsights,value=enabled

It costs extra (roughly $1–3/month for a small cluster, growing with the number of tasks and services), but production clusters need it.

Monitoring table — what to watch #

| Metric | Resource | Meaning | Alarm threshold (example) |
|---|---|---|---|
| HTTPCode_Target_5XX_Count | ALB | 5xx returned by the backend | 5-min sum ≥ 5 |
| HTTPCode_ELB_5XX_Count | ALB | 5xx generated by the ALB itself (usually 0 healthy hosts) | 5-min sum ≥ 1 |
| TargetResponseTime (p99) | ALB | Response time p99 | 5-min p99 ≥ 1.0 s |
| UnHealthyHostCount | Target Group | Dead task count | 5-min average ≥ 1 |
| CPUUtilization (Service) | ECS | Service average CPU | 5-min average ≥ 80% |
| MemoryUtilization (Service) | ECS | Service average memory | 5-min average ≥ 85% |
| RunningTaskCount | ECS | Running task count | Differs from desired count |
| CPUUtilization | RDS | DB CPU | 5-min average ≥ 80% |
| DatabaseConnections | RDS | Connection count | ≥ 80% of max_connections |
| FreeStorageSpace | RDS | Free disk | < 5 GB |
| ReadLatency / WriteLatency | RDS | Disk latency | > 50 ms |
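Rows of this table translate fairly directly into alarm definitions. The sketch below builds keyword arguments for boto3's `put_metric_alarm` for three of the rows. The alarm names, the `blog-api` service name, and the `blog-db` instance identifier are hypothetical; the cluster name and SNS topic ARN are the ones used elsewhere in this post.

```python
def alarm_kwargs(name, namespace, metric, threshold, dimensions, sns_topic_arn,
                 statistic="Average",
                 operator="GreaterThanOrEqualToThreshold"):
    """Build keyword arguments for cloudwatch.put_metric_alarm."""
    return {
        "AlarmName": name,
        "Namespace": namespace,
        "MetricName": metric,
        "Statistic": statistic,
        "Period": 300,  # one 5-minute data point
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": operator,
        "TreatMissingData": "notBreaching",
        "Dimensions": [{"Name": k, "Value": v} for k, v in dimensions.items()],
        "AlarmActions": [sns_topic_arn],
    }


SNS_TOPIC = "arn:aws:sns:ap-northeast-2:123456789012:ops-alerts"

ALARMS = [
    alarm_kwargs("blog-ecs-cpu-high", "AWS/ECS", "CPUUtilization", 80,
                 {"ClusterName": "blog-cluster", "ServiceName": "blog-api"},
                 SNS_TOPIC),
    alarm_kwargs("blog-rds-cpu-high", "AWS/RDS", "CPUUtilization", 80,
                 {"DBInstanceIdentifier": "blog-db"}, SNS_TOPIC),
    # FreeStorageSpace is reported in bytes, so 5 GB has to be converted.
    alarm_kwargs("blog-rds-storage-low", "AWS/RDS", "FreeStorageSpace",
                 5 * 1024**3, {"DBInstanceIdentifier": "blog-db"}, SNS_TOPIC,
                 operator="LessThanThreshold"),
]
```

Each entry can then be passed along as `boto3.client("cloudwatch").put_metric_alarm(**kwargs)`. Keeping the thresholds in one list makes them easy to review against the table.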

Custom Metrics #

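Custom metrics are emitted by the app itself. With Embedded Metric Format (EMF), the app writes a specially structured JSON log line and CloudWatch extracts the metric from it, with no separate API call.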

Business metrics via EMF
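import json, time

def emit_metric(metric_name, value, unit="Count", **dims):
    payload = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": "BlogApp",
                "Dimensions": [list(dims.keys())],
                "Metrics": [{"Name": metric_name, "Unit": unit}],
            }],
        },
        metric_name: value,
        **dims,
    }
    # EMF needs the JSON object to be the whole log line, so write it to stdout
    # directly; going through JsonFormatter would nest it inside "message"
    print(json.dumps(payload), flush=True)

emit_metric("PostCreated", 1, env="prod")
emit_metric("CommentCreated", 1, env="prod")
# Caution: every distinct dimension value becomes its own metric, so a
# per-IP dimension like this one can get expensive
emit_metric("LoginFailed", 1, source_ip="...")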

CloudWatch parses the log line and automatically creates the PostCreated metric in the BlogApp namespace. No PutMetricData call is needed, which removes the API round trip from the request path. The metric itself is still billed as a custom metric.

4) Alarms — calling people when thresholds break #

ALB 5xx alarm
aws cloudwatch put-metric-alarm \
  --alarm-name "blog-alb-5xx-burst" \
  --metric-name HTTPCode_Target_5XX_Count \
  --namespace AWS/ApplicationELB \
  --statistic Sum \
  --period 60 \
  --evaluation-periods 5 \
  --datapoints-to-alarm 3 \
  --threshold 5 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --treat-missing-data notBreaching \
  --dimensions Name=LoadBalancer,Value=app/blog-alb/abc123 \
  --alarm-actions arn:aws:sns:ap-northeast-2:123456789012:ops-alerts

Key options:

| Option | Meaning |
|---|---|
| period | Data point length in seconds (60 = 1 minute) |
| evaluation-periods | How many recent data points to evaluate |
| datapoints-to-alarm | How many of those must cross the threshold to fire the alarm |
| treat-missing-data | How missing data is treated; notBreaching recommended |
| comparison-operator | >= / > / < / <= |

The 5/3 pattern (“3 out of the last 5 data points cross the threshold”) filters out momentary spikes while still catching real incidents.
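The "3 out of 5" rule is easy to reason about with a toy model. The function below is a simplified sketch of the idea, not CloudWatch's actual evaluation logic: it looks only at the most recent window and treats missing points (`None`) as not breaching.

```python
def alarm_state(datapoints, threshold, evaluation_periods=5, datapoints_to_alarm=3):
    """Simplified M-out-of-N check over the most recent data points.

    datapoints: oldest-to-newest values; None marks a missing data point,
    treated as not breaching (like treat-missing-data notBreaching).
    """
    window = datapoints[-evaluation_periods:]
    breaching = sum(1 for value in window if value is not None and value >= threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"


print(alarm_state([0, 0, 40, 0, 0], threshold=5))  # one spike       -> OK
print(alarm_state([0, 7, 0, 9, 12], threshold=5))  # sustained errors -> ALARM
```

A single minute with 40 errors does not page anyone, while three bad minutes out of five do, even when they are not consecutive.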

Composite Alarm #

Combine multiple alarms. “ALB 5xx alarm AND task running fine” → real backend problem.

Composite Alarm
aws cloudwatch put-composite-alarm \
  --alarm-name "blog-real-incident" \
  --alarm-rule "ALARM('blog-alb-5xx-burst') AND OK('blog-running-tasks-low')"

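OK() is true only while the named alarm is in the OK state. The composite fires when the 5xx alarm is in ALARM while the task-count alarm is OK, which points at the application rather than at missing capacity. If tasks are also low, the composite stays quiet and the task-count alarm covers that case, which reduces duplicate noise.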
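A small truth table makes the rule concrete. This sketch simply mirrors the rule expression in Python; the function name is illustrative.

```python
def composite_state(alb_5xx_state, tasks_low_state):
    """Evaluate ALARM('blog-alb-5xx-burst') AND OK('blog-running-tasks-low')."""
    fires = alb_5xx_state == "ALARM" and tasks_low_state == "OK"
    return "ALARM" if fires else "OK"


for five_xx in ("OK", "ALARM"):
    for tasks in ("OK", "ALARM", "INSUFFICIENT_DATA"):
        print(f"5xx={five_xx:<5} tasks-low={tasks:<17} -> "
              f"{composite_state(five_xx, tasks)}")
```

Only one of the six combinations fires: 5xx in ALARM with the task-count alarm in OK. Note that INSUFFICIENT_DATA does not count as OK, so if the task-count alarm has no data the composite stays quiet; a rule written with NOT ALARM(...) would behave differently there.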

5) SNS → Slack — reaching humans #

Notification flow
CloudWatch Alarm
   ↓
SNS Topic (ops-alerts)
   ├── Email subscription   (ops team)
   ├── SMS subscription      (oncall)
   ├── Lambda subscription   ← convert to Slack webhook
   └── PagerDuty / OpsGenie

SNS → Slack Lambda #

lambda_handler.py
import json, os, urllib.request

WEBHOOK = os.environ["SLACK_WEBHOOK"]

def handler(event, context):
    for record in event["Records"]:
        msg = json.loads(record["Sns"]["Message"])
        text = (
            f":rotating_light: *{msg['AlarmName']}*\n"
            f"Region: {msg['Region']}\n"
            f"State: {msg['NewStateValue']} (was {msg['OldStateValue']})\n"
            f"Reason: {msg['NewStateReason']}\n"
        )
        req = urllib.request.Request(
            WEBHOOK,
            data=json.dumps({"text": text}).encode(),
            headers={"Content-Type": "application/json"},
        )
        # Time out instead of hanging until the Lambda itself times out
        with urllib.request.urlopen(req, timeout=5):
            pass

This is the Lambda pattern from Advanced #3. Subscribe the function to the SNS topic and you’re done.

Alarm message format #

A good alarm message has:

  • What broke (alarm name)
  • How much broke (threshold / actual value)
  • Where (region, service)
  • When (timestamp)
  • Links — direct to console / dashboard / Logs Insights

Links matter most. At 3am, the on-call engineer reading Slack clicks once and lands directly in context.
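As a sketch of such a message, the function below turns the alarm's SNS payload into Slack text that covers all five points. Two things are assumptions to verify in your own account: the console deep-link format, and that the payload carries `AlarmArn`, `StateChangeTime`, and `Trigger` as shown. The region code is taken from the ARN because the payload's `Region` field is a display name rather than a code.

```python
from urllib.parse import quote


def format_alarm(msg):
    """Render a CloudWatch alarm SNS message as Slack text with a console link."""
    # ARN layout: arn:aws:cloudwatch:<region>:<account>:alarm:<name>
    region = msg["AlarmArn"].split(":")[3]
    trigger = msg.get("Trigger", {})
    # Assumed console deep-link format; check it against your console's URL.
    link = (
        f"https://{region}.console.aws.amazon.com/cloudwatch/home"
        f"?region={region}#alarmsV2:alarm/{quote(msg['AlarmName'])}"
    )
    return "\n".join([
        f"*{msg['AlarmName']}* is {msg['NewStateValue']}",
        f"Metric: {trigger.get('Namespace')}/{trigger.get('MetricName')} "
        f"(threshold {trigger.get('Threshold')})",
        f"Where: {region}",
        f"When: {msg['StateChangeTime']}",
        f"Why: {msg['NewStateReason']}",
        f"<{link}|Open alarm in console>",
    ])
```

The last line uses Slack's `<url|label>` link syntax, so the on-call engineer gets a single clickable label instead of a long URL.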

6) X-Ray — distributed tracing #

Metrics tell you “5xx is up.” Logs tell you why. X-Ray answers “where did this request spend 5 seconds?”

X-Ray Trace shape
Request: POST /posts                       4.2s
   ├── ALB                                 0.01s
   └── ECS api                             4.15s
         ├── auth.verify_token             0.05s
         ├── db.posts.insert               3.80s   ← suspect
         │     └── RDS PostgreSQL          3.78s
         │           └── (slow query)
         └── notify.publish (SNS)          0.30s
               └── SNS:Publish             0.28s
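Reading a trace like this comes down to one calculation: each span's own time is its duration minus the time spent in its children. The sketch below encodes the example trace as nested tuples (a made-up representation for illustration, not the X-Ray API) and finds the span with the largest self time.

```python
# The trace above as nested (name, seconds, children) tuples.
TRACE = ("POST /posts", 4.20, [
    ("ALB", 0.01, []),
    ("ECS api", 4.15, [
        ("auth.verify_token", 0.05, []),
        ("db.posts.insert", 3.80, [("RDS PostgreSQL", 3.78, [])]),
        ("notify.publish (SNS)", 0.30, [("SNS:Publish", 0.28, [])]),
    ]),
])


def self_times(node):
    """Yield (name, self_time) for every span: duration minus its children."""
    name, duration, children = node
    yield name, duration - sum(child[1] for child in children)
    for child in children:
        yield from self_times(child)


slowest = max(self_times(TRACE), key=lambda item: item[1])
print(slowest[0])  # prints: RDS PostgreSQL
```

The outer spans look slow only because they wait on their children; the time is actually spent in the database call.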

FastAPI/Django integration #

Install
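pip install aws-xray-sdk

FastAPI
from fastapi import FastAPI
from aws_xray_sdk.core import patch, xray_recorder
from aws_xray_sdk.core.async_context import AsyncContext
from aws_xray_sdk.core.models import http

# AsyncContext keeps concurrent asyncio requests from sharing one segment
xray_recorder.configure(service="blog-api", context=AsyncContext())
patch(("sqlalchemy_core",))  # record SQLAlchemy Core queries as subsegments

app = FastAPI()

# The SDK ships middleware for Django, Flask, and aiohttp but not FastAPI, so this
# hand-rolled sketch opens one segment per request (it ignores X-Amzn-Trace-Id)
@app.middleware("http")
async def xray_segment(request, call_next):
    segment = xray_recorder.begin_segment("blog-api")
    segment.put_http_meta(http.METHOD, request.method)
    segment.put_http_meta(http.URL, str(request.url))
    try:
        response = await call_next(request)
        segment.put_http_meta(http.STATUS, response.status_code)
        return response
    finally:
        xray_recorder.end_segment()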

Sidecar — X-Ray Daemon #

In ECS, the X-Ray Daemon container runs as a sidecar in the same task definition:

task definition with sidecar
{
  "containerDefinitions": [
    { "name": "api", ... },
    {
      "name": "xray-daemon",
      "image": "public.ecr.aws/xray/aws-xray-daemon:latest",
      "portMappings": [{ "containerPort": 2000, "protocol": "udp" }],
      "essential": false
    }
  ]
}

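The app sends segments over UDP to 127.0.0.1:2000, and the daemon batches them to the X-Ray service. The task role needs xray:PutTraceSegments and xray:PutTelemetryRecords; the AWSXRayDaemonWriteAccess managed policy covers both.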

Where it shines most #

| Case | X-Ray value |
|---|---|
| Single container + single DB | Moderate: Logs alone may suffice |
| Multiple microservice calls | Very big: see which step is slow in one line |
| External API dependency | Very big: verify external is actually slow |
| Lambda + DynamoDB | Very big: separate Lambda cold start, external calls |

Sampling #

Tracing every request is expensive. Use sampling rules for 5–10% only:

x-ray.json
{
  "version": 2,
  "rules": [{
    "description": "Default",
    "host": "*",
    "http_method": "*",
    "url_path": "*",
    "fixed_target": 1,
    "rate": 0.05
  }],
  "default": { "fixed_target": 1, "rate": 0.05 }
}

Set health checks like /health to 0% so traces don’t drown in noise.
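The two numbers in a rule work together: `fixed_target` is a per-second reservoir that is always traced, and `rate` is the fraction of the remaining requests that get traced. The class below is a toy model of that behaviour for building intuition, not the SDK's sampler.

```python
import random


class ReservoirSampler:
    """Toy model of fixed_target + rate: trace the first N requests each
    second, then a fixed fraction of the rest."""

    def __init__(self, fixed_target=1, rate=0.05, rng=random.random):
        self.fixed_target = fixed_target
        self.rate = rate
        self.rng = rng
        self.current_second = None
        self.used = 0

    def should_trace(self, now_seconds):
        second = int(now_seconds)
        if second != self.current_second:  # new second: refill the reservoir
            self.current_second = second
            self.used = 0
        if self.used < self.fixed_target:
            self.used += 1
            return True
        return self.rng() < self.rate


# Simulate 100 requests/second for 100 seconds with a seeded generator.
sampler = ReservoirSampler(rng=random.Random(42).random)
traced = sum(sampler.should_trace(t / 100) for t in range(10_000))
print(traced)
```

Out of 10,000 simulated requests, 100 come from the reservoir (one per second) and roughly 5% of the other 9,900 are added on top. The reservoir is what guarantees that a low-traffic service still produces at least some traces.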

7) Dashboard — one screen #

Put operational signals in one CloudWatch Dashboard:

9 recommended widgets
[1] Requests/s (ALB)         [2] 5xx rate (ALB)         [3] p99 latency
[4] ECS CPU (Service)         [5] ECS Memory             [6] Running tasks
[7] RDS CPU                   [8] RDS Connections        [9] RDS FreeStorage

Dashboard IaC (Terraform excerpt)
resource "aws_cloudwatch_dashboard" "blog" {
  dashboard_name = "blog-overview"
  dashboard_body = jsonencode({
    widgets = [
      {
        type   = "metric",
        x      = 0, y = 0, width = 8, height = 6,
        properties = {
          metrics = [["AWS/ApplicationELB", "RequestCount", "LoadBalancer", "app/blog-alb/abc123"]],
          period  = 60, stat = "Sum", region = "ap-northeast-2",
          title   = "Requests/min"
        }
      }
      # ... 8 more
    ]
  })
}

Periodic review #

Once a week, the on-call engineer scans the dashboard to spot gradually worsening areas. Alarms catch only acute incidents; slow degradation is caught faster by human eyes.

Pitfalls — production monitoring #

1) Alert Fatigue — too many alarms #

30 alarms / day → soon everyone ignores alarms. Recommendation:

| Alarm tier | Frequency | Channel |
|---|---|---|
| Critical | 1–2/month | PagerDuty / SMS |
| Warning | 1–2/week | Slack #ops |
| Info | Frequent | Slack #ops-info (quiet channel) |

Keep fewer than 5 alarms that actually wake people up.

2) Logs growing infinitely #

Missing retention setup leads to bill shock 6 months later. Set retention on every log group — easily done all at once with Terraform.

3) Retention too short #

Weeks after an incident someone says “let’s check the logs,” and the 7-day retention has already expired. Trying to export after the fact is too late. Core log groups deserve 30+ days of retention.

4) X-Ray 100% sampling #

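Cost explodes. Sample 5–10% as a baseline and use X-Ray sampling rules to raise the rate on critical paths. The sampling decision is made when the request starts, so a rule cannot pick out errors or slow requests after the fact; rely on logs and metrics for full coverage of those.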

5) Alarms without SLOs #

Where did the alarm threshold come from — “I picked 80%”? Without an SLO (e.g., p99 latency < 500ms, 99% of the time), thresholds are arbitrary. Define your SLO first, then derive the threshold from it.
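One common way to derive a threshold from an SLO is the error-budget and burn-rate arithmetic used in SRE practice. It is borrowed here as an illustration, not something this series defined, and all numbers in the example are hypothetical.

```python
def error_budget(slo_target, total_requests):
    """Number of requests allowed to fail within the SLO window."""
    return (1 - slo_target) * total_requests


def burn_rate(observed_error_ratio, slo_target):
    """How many times faster than 'exactly on budget' errors are being spent."""
    return observed_error_ratio / (1 - slo_target)


def threshold_5xx(slo_target, requests_per_period, max_burn_rate):
    """5xx count per alarm period that corresponds to the given burn rate."""
    return (1 - slo_target) * requests_per_period * max_burn_rate


# Hypothetical service: 99.9% availability SLO, about 3,000 requests per 5 minutes.
# Page when the budget is burning 10x faster than sustainable.
print(round(threshold_5xx(0.999, 3000, 10)))
```

With these inputs the alarm threshold works out to about 30 errors per 5-minute period. The threshold is now a consequence of the SLO and the traffic level, so it can be recomputed when either changes instead of being guessed again.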

6) Dashboard exists, never viewed #

A dashboard that is built and never viewed is the same as no dashboard. Add a 30-minute dashboard review to the weekly on-call meeting.

7) Alarms don’t reach people #

Email only → buried in the inbox. Use SMS / PagerDuty / Slack mentions — channels that actually reach you.

Wrapping up #

What we covered in this post:

  • 4 pillars — Metrics / Logs / Traces / Events
  • CloudWatch Logs — automatic awslogs, retention setup, structured JSON, 7 operational queries
  • CloudWatch Metrics — Container Insights enabled, ECS / RDS / ALB core metrics and thresholds
  • EMF — emit metrics through logs without PutMetricData
  • Alarms — period × evaluation × datapoints pattern, treat-missing-data
  • Composite Alarm — reduce noise
  • SNS → Lambda → Slack — reach humans with alarms
  • X-Ray — distributed tracing, sidecar daemon, sampling for cost control
  • Dashboard — 9-widget single screen, IaC-ified
  • Pitfalls — alert fatigue, retention, sampling, missing SLO, ignored dashboard, alarm channels

Next — Cost and track wrap-up #

Now the system runs well, and alarms fire on incidents. One last topic remains: how much it all costs, plus a retrospective on the track's 27 posts.

In #6 Cost optimization and dashboards — wrapping up the track we’ll cover Cost Explorer analysis, Savings Plans / Spot Fargate, Right Sizing, tag enforcement, cost dashboards, and how the 27 posts of the AWS track come together as one system.
