AWS in Practice #5: Monitoring — CloudWatch Alarms and X-Ray

Infrastructure AWS CloudWatch X-Ray Monitoring

Wednesday, May 6, 2026

In #1 through #4, infrastructure became code and deployment became automated. But we still have no single place to see whether this system is running well — whether 5xx errors are climbing, RDS CPU is at 80%, or which request took 5 seconds.

This post turns on that eye.

CloudWatch Logs + Logs Insights operational queries
CloudWatch Metrics — ECS / RDS / ALB core metrics and alarm thresholds
The flow of alarm → SNS → Slack
X-Ray — distributed tracing for “where is it slow” in one line
Dashboard — system state in one screen

The big picture — the 4 pillars of monitoring #

The pillars of observability

┌──────────────┬──────────────┬──────────────┬────────────────┐
│   Metrics    │     Logs     │    Traces    │     Events     │
├──────────────┼──────────────┼──────────────┼────────────────┤
│ "How much"   │ "What"       │ "Where"      │ "When"         │
│ requests, 5xx│ stacktrace   │ DB took 5s   │ deploys, scale │
│ CPU, memory  │ access log   │ external 1s  │ failover       │
├──────────────┼──────────────┼──────────────┼────────────────┤
│ CloudWatch   │ CloudWatch   │ X-Ray        │ EventBridge    │
│ Metrics      │ Logs         │              │                │
└──────────────┴──────────────┴──────────────┴────────────────┘

This post: Metrics + Logs + Traces — three pillars. Events were already in Advanced #5.

1) CloudWatch Logs — already flowing #

#1’s Task Definition has awslogs baked in, so all container stdout/stderr automatically goes to CloudWatch Logs.

Log hierarchy

Log Group: /ecs/blog-api
   │
   ├── Log Stream: api/<task-id-1>     ← one stream per task
   ├── Log Stream: api/<task-id-2>
   └── Log Stream: api/<task-id-3>

Retention setup — cost discipline #

The default is infinite retention. Even modest traffic can accumulate GBs of logs per month, causing costs to balloon.

30-day retention

aws logs put-retention-policy \
  --log-group-name /ecs/blog-api \
  --retention-in-days 30

Recommendations:

Production access logs: 30–90 days
Debug / verbose: 7 days
Audit logs: 365 days (or export to S3 then delete)
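
The recommendations above could be encoded as a small name-to-retention mapping and applied in bulk. The sketch below is a minimal illustration, not a prescribed setup: the name patterns (`*audit*`, `*debug*`) and the helper `plan_retention` are hypothetical, and only the planning step is shown. Each planned entry would then be applied with `put-retention-policy` as in the command above.

```python
from fnmatch import fnmatch

# Hypothetical naming conventions; the first matching pattern wins.
RETENTION_RULES = [
    ("*audit*", 365),  # audit logs
    ("*debug*", 7),    # debug / verbose logs
]
DEFAULT_DAYS = 30      # production access logs

def plan_retention(log_group_names, rules=RETENTION_RULES, default=DEFAULT_DAYS):
    """Return {log group name: retention days} based on name patterns."""
    plan = {}
    for name in log_group_names:
        days = default
        for pattern, rule_days in rules:
            if fnmatch(name, pattern):
                days = rule_days
                break
        plan[name] = days
    return plan

plan = plan_retention(["/ecs/blog-api", "/ecs/blog-api-debug", "/audit/blog"])
for name, days in plan.items():
    print(f"{name}: {days} days")
```

Keeping the rules in one list makes it easy to review which log groups fall back to the default.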

Structured logs are key #

Plain print() output is hard to search. JSON format lets Logs Insights query on individual keys.

FastAPI — JSON logging

import logging, json, sys

class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "level":   record.levelname,
            "logger":  record.name,
            "message": record.getMessage(),
            "ts":      self.formatTime(record),
            **getattr(record, "extra", {}),
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

Per-request log

import logging, time

@app.middleware("http")
async def access_log(request, call_next):
    start = time.perf_counter()  # monotonic clock, suited to measuring durations
    response = await call_next(request)
    logging.info("access", extra={"extra": {
        "method": request.method,
        "path": str(request.url.path),
        "status": response.status_code,
        "duration_ms": int((time.perf_counter() - start) * 1000),
        # request_id must be set by an earlier middleware; None if it is missing
        "request_id": getattr(request.state, "request_id", None),
    }})
    return response

Logging in this format lets Logs Insights answer queries like those below precisely.
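
To make the resulting log shape concrete, the sketch below formats one record with a formatter equivalent to the one above and parses the line back. The field values are made up for illustration.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Same idea as the formatter above: one JSON object per log line."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "ts": self.formatTime(record),
            **getattr(record, "extra", {}),
        })

# Build a record by hand, the way logging.info("access", extra=...) would.
record = logging.LogRecord(
    name="root", level=logging.INFO, pathname="example.py", lineno=0,
    msg="access", args=(), exc_info=None,
)
record.extra = {"method": "GET", "path": "/posts", "status": 200, "duration_ms": 42}

line = JsonFormatter().format(record)
parsed = json.loads(line)
print(line)
```

Because `status` and `duration_ms` end up as top-level keys, a query can filter and aggregate on them directly.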

2) Logs Insights — 7 operational queries #

Frequently used queries — bookmark them.

A) Filter just 5xx #

fields @timestamp, status, path, request_id, message
| filter status >= 500
| sort @timestamp desc
| limit 100

B) Response time distribution (p50/p90/p99) #

fields @timestamp, duration_ms
| filter ispresent(duration_ms)
| stats
    count(*) as requests,
    pct(duration_ms, 50) as p50,
    pct(duration_ms, 90) as p90,
    pct(duration_ms, 99) as p99
  by bin(5m)

C) Slowest paths #

fields path, duration_ms
| filter duration_ms > 1000
| stats count(*), avg(duration_ms), max(duration_ms) by path
| sort avg(duration_ms) desc
| limit 20

D) Trace one request by request_id #

fields @timestamp, level, message, path, status, duration_ms
| filter request_id = "abc-123-xyz"
| sort @timestamp asc

E) Lines with stacktrace #

fields @timestamp, message
| filter @message like /Traceback|exception/
| sort @timestamp desc

F) Login attempts (auth failures) #

fields @timestamp, source_ip, username
| filter event = "auth_fail"
| stats count(*) by source_ip
| sort count(*) desc

G) Cost — which path gets called the most #

fields path
| stats count(*) by path
| sort count(*) desc
| limit 30

Saved Queries #

Save frequently used queries in the console so the whole team can share them. They can also be managed as code via CloudFormation or Terraform (aws_cloudwatch_query_definition).
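
The same queries can also be run from a script, for example in a scheduled report. The sketch below assumes a client that behaves like `boto3.client("logs")` and uses its `start_query` and `get_query_results` calls; the helper name `run_insights_query` and the polling defaults are hypothetical.

```python
import time

def run_insights_query(logs_client, log_group, query, start_ts, end_ts,
                       poll_seconds=1.0, max_polls=60):
    """Start a Logs Insights query and poll until it finishes.

    start_ts / end_ts are epoch seconds. Returns a list of {field: value} rows.
    """
    query_id = logs_client.start_query(
        logGroupName=log_group,
        startTime=int(start_ts),
        endTime=int(end_ts),
        queryString=query,
    )["queryId"]

    for _ in range(max_polls):
        resp = logs_client.get_query_results(queryId=query_id)
        if resp["status"] == "Complete":
            # Each row is a list of {"field": ..., "value": ...} cells.
            return [{c["field"]: c["value"] for c in row} for row in resp["results"]]
        if resp["status"] in ("Failed", "Cancelled", "Timeout"):
            raise RuntimeError(f"query ended with status {resp['status']}")
        time.sleep(poll_seconds)
    raise TimeoutError("query did not complete in time")

FIVE_XX_QUERY = """
fields @timestamp, status, path, request_id
| filter status >= 500
| sort @timestamp desc
| limit 100
"""

# Usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   rows = run_insights_query(boto3.client("logs"), "/ecs/blog-api",
#                             FIVE_XX_QUERY, time.time() - 3600, time.time())
```

Passing the client in as an argument keeps the helper easy to exercise with a stub in tests.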

3) CloudWatch Metrics — the core #

ECS Container Insights #

Default ECS metrics are sparse. Enable Container Insights to get task- and service-level CPU, memory, network, disk, and running task counts all at once.

Enable Container Insights

aws ecs update-cluster-settings \
  --cluster blog-cluster \
  --settings name=containerInsights,value=enabled

There’s an extra cost (~$1–3/month for small clusters), but it’s essential for production.

Monitoring table — what to watch #

Metric	Resource	Meaning	Alarm threshold (example)
`HTTPCode_Target_5XX_Count`	ALB	Backend 5xx	5-min sum ≥ 5
`HTTPCode_ELB_5XX_Count`	ALB	ALB self 5xx (mostly 0 healthy hosts)	5-min sum ≥ 1
`TargetResponseTime` (p99)	ALB	Response time p99	5-min p99 ≥ 1.0s
`UnHealthyHostCount`	Target Group	Dead task count	5-min average ≥ 1
`CPUUtilization` (Service)	ECS	Service average CPU	5-min average ≥ 80%
`MemoryUtilization` (Service)	ECS	Memory	5-min average ≥ 85%
`RunningTaskCount`	ECS	Running task count	Different from desired
`CPUUtilization`	RDS	DB CPU	5-min average ≥ 80%
`DatabaseConnections`	RDS	Connection count	80% of max_connections
`FreeStorageSpace`	RDS	Free disk	< 5GB
`ReadLatency` / `WriteLatency`	RDS	Disk latency	> 50ms

Custom Metrics #

Custom metrics are emitted by the app itself. With Embedded Metric Format (EMF), the app writes each metric as a structured log line, so no separate API call is needed.

Business metrics via EMF

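import json, sys, time

def emit_metric(metric_name, value, unit="Count", **dims):
    payload = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # epoch milliseconds
            "CloudWatchMetrics": [{
                "Namespace": "BlogApp",
                "Dimensions": [list(dims.keys())],  # one dimension set
                "Metrics": [{"Name": metric_name, "Unit": unit}],
            }],
        },
        metric_name: value,
        **dims,
    }
    # Write the raw JSON line to stdout. Sending it through the JsonFormatter
    # above would nest it inside "message", and EMF needs "_aws" at the root.
    print(json.dumps(payload), file=sys.stdout, flush=True)

emit_metric("PostCreated", 1, env="prod")
emit_metric("CommentCreated", 1, env="prod")
# Caution: every distinct dimension value creates a separate metric,
# so a high-cardinality dimension such as source_ip gets expensive fast.
emit_metric("LoginFailed", 1, source_ip="...")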

CloudWatch parses the logs and automatically creates the BlogApp/PostCreated metric. No PutMetricData API call required — saves both cost and latency.

4) Alarms — calling people when thresholds break #

ALB 5xx alarm

aws cloudwatch put-metric-alarm \
  --alarm-name "blog-alb-5xx-burst" \
  --metric-name HTTPCode_Target_5XX_Count \
  --namespace AWS/ApplicationELB \
  --statistic Sum \
  --period 60 \
  --evaluation-periods 5 \
  --datapoints-to-alarm 3 \
  --threshold 5 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --treat-missing-data notBreaching \
  --dimensions Name=LoadBalancer,Value=app/blog-alb/abc123 \
  --alarm-actions arn:aws:sns:ap-northeast-2:123456789012:ops-alerts

Key options:

Option	Meaning
`period`	Data point unit (60 = 1 minute)
`evaluation-periods`	How many points to evaluate
`datapoints-to-alarm`	How many of those crossing fires the alarm
`treat-missing-data`	When data is missing — `notBreaching` recommended
`comparison-operator`	`>= / > / < / <=`

The 5/3 pattern (“3 out of the last 5 data points cross the threshold”) filters out momentary spikes while still catching real incidents.
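
The "M out of N" rule can be sketched in a few lines. This is a simplified model for intuition only: the function name is hypothetical, and CloudWatch's real evaluation has more detailed handling of missing data than shown here.

```python
def alarm_fires(datapoints, threshold, evaluation_periods=5, datapoints_to_alarm=3):
    """Return True if at least M of the last N datapoints are >= threshold.

    Missing datapoints (None) count as not breaching, mirroring
    --treat-missing-data notBreaching.
    """
    window = datapoints[-evaluation_periods:]
    breaching = sum(1 for v in window if v is not None and v >= threshold)
    return breaching >= datapoints_to_alarm

# Per-minute 5xx sums: one isolated spike does not fire the alarm...
print(alarm_fires([0, 0, 40, 0, 0], threshold=5))        # False
# ...but a sustained problem does (3 of the last 5 minutes breach).
print(alarm_fires([0, 7, 9, 0, 12], threshold=5))        # True
# Missing datapoints are treated as healthy.
print(alarm_fires([6, None, 8, None, 2], threshold=5))   # False
```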

Composite Alarm #

Combine multiple alarms. “ALB 5xx alarm AND task running fine” → real backend problem.

Composite Alarm

aws cloudwatch put-composite-alarm \
  --alarm-name "blog-real-incident" \
  --alarm-rule "ALARM('blog-alb-5xx-burst') AND OK('blog-running-tasks-low')"

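OK() matches the normal (non-alarm) state. The composite fires only when the 5xx alarm is in ALARM while the running-task alarm is OK, meaning the tasks are up but still returning errors. If the task alarm is also firing, the 5xx errors are a symptom of missing tasks, and the composite stays quiet, reducing noise.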
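
To see exactly when the rule above evaluates to true, the sketch below enumerates every state combination of the two child alarms. The function is a hand-written mirror of the rule expression, written for illustration rather than taken from any AWS library.

```python
from itertools import product

def composite_fires(five_xx_state, tasks_low_state):
    """Mirror of: ALARM('blog-alb-5xx-burst') AND OK('blog-running-tasks-low')."""
    return five_xx_state == "ALARM" and tasks_low_state == "OK"

STATES = ("OK", "ALARM", "INSUFFICIENT_DATA")
for five_xx, tasks_low in product(STATES, repeat=2):
    fires = composite_fires(five_xx, tasks_low)
    print(f"5xx={five_xx:<17} tasks_low={tasks_low:<17} -> {fires}")
```

Only one of the nine combinations fires. Note that INSUFFICIENT_DATA on the task alarm also keeps the composite quiet; if that is not the intent, the rule language's NOT operator (NOT ALARM(...)) is an alternative to OK(...).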

5) SNS → Slack — reaching humans #

Notification flow

CloudWatch Alarm
   │
   ▼
SNS Topic (ops-alerts)
   │
   ├── Email subscription   (ops team)
   ├── SMS subscription      (oncall)
   ├── Lambda subscription   ← convert to Slack webhook
   └── PagerDuty / OpsGenie

SNS → Slack Lambda #

lambda_handler.py

import json, os, urllib.request

WEBHOOK = os.environ["SLACK_WEBHOOK"]

def handler(event, context):
    for record in event["Records"]:
        msg = json.loads(record["Sns"]["Message"])
        text = (
            f":rotating_light: *{msg['AlarmName']}*\n"
            f"Region: {msg['Region']}\n"
            f"State: {msg['NewStateValue']} (was {msg['OldStateValue']})\n"
            f"Reason: {msg['NewStateReason']}\n"
        )
        req = urllib.request.Request(
            WEBHOOK,
            data=json.dumps({"text": text}).encode(),
            headers={"Content-Type": "application/json"},
        )
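        # Time out quickly instead of hanging until the Lambda itself times out
        with urllib.request.urlopen(req, timeout=5) as resp:
            resp.read()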

Pattern from Advanced #3 Lambda. Add an SNS subscription that calls Lambda, and you’re done.

Alarm message format #

A good alarm message has:

What broke (alarm name)
How much broke (threshold / actual value)
Where (region, service)
When (timestamp)
Links — direct to console / dashboard / Logs Insights

Links matter most. At 3am, the on-call engineer reading Slack clicks once and lands directly in context.
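
A message with those five parts could be assembled as in the sketch below. It assumes the standard fields of a CloudWatch alarm SNS message (`AlarmName`, `AlarmArn`, `NewStateValue`, `NewStateReason`, `StateChangeTime`, `Trigger`). The console URL patterns and the function name `format_alarm` are assumptions for illustration, so verify the links against your own console before relying on them.

```python
from urllib.parse import quote

def format_alarm(msg, dashboard_name="blog-overview"):
    """Build Slack text with what / how much / where / when / links."""
    # "Region" in the message is a display name, so take the region code
    # from the ARN: arn:aws:cloudwatch:<region>:<account>:alarm:<name>
    region = msg["AlarmArn"].split(":")[3]
    base = f"https://{region}.console.aws.amazon.com/cloudwatch/home?region={region}"
    # Assumed console URL patterns; adjust if your console links differ.
    alarm_url = f"{base}#alarmsV2:alarm/{quote(msg['AlarmName'], safe='')}"
    dashboard_url = f"{base}#dashboards:name={quote(dashboard_name, safe='')}"
    trigger = msg.get("Trigger", {})
    return (
        f":rotating_light: *{msg['AlarmName']}* is {msg['NewStateValue']}\n"
        f"Metric: {trigger.get('MetricName', '?')} "
        f"(threshold {trigger.get('Threshold', '?')})\n"
        f"Where: {region}\n"
        f"When: {msg.get('StateChangeTime', '?')}\n"
        f"Reason: {msg['NewStateReason']}\n"
        f"<{alarm_url}|Open alarm> | <{dashboard_url}|Open dashboard>"
    )

sample = {
    "AlarmName": "blog-alb-5xx-burst",
    "AlarmArn": "arn:aws:cloudwatch:ap-northeast-2:123456789012:alarm:blog-alb-5xx-burst",
    "NewStateValue": "ALARM",
    "NewStateReason": "Threshold Crossed",
    "StateChangeTime": "2026-05-06T03:00:00.000+0000",
    "Trigger": {"MetricName": "HTTPCode_Target_5XX_Count", "Threshold": 5.0},
}
text = format_alarm(sample)
print(text)
```

The `<url|label>` form is Slack's link syntax, so the on-call engineer gets clickable links instead of raw URLs.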

6) X-Ray — distributed tracing #

Metrics tell you that 5xx is up. Logs tell you why. X-Ray answers “Where did this request spend 5 seconds?”

X-Ray Trace shape

Request: POST /posts                       4.2s
   │
   ├── ALB                                 0.01s
   │
   └── ECS api                             4.15s
         │
         ├── auth.verify_token             0.05s
         │
         ├── db.posts.insert               3.80s   ← suspect
         │     └── RDS PostgreSQL          3.78s
         │           └── (slow query)
         │
         └── notify.publish (SNS)          0.30s
               └── SNS:Publish             0.28s
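
Reading a trace like the one above comes down to finding the span with the largest self time (its own duration minus its children's). The sketch below does this on a hand-written copy of the tree. The dictionary layout is a simplification invented for this example, not the X-Ray segment format, and it assumes child spans run sequentially.

```python
def self_times(span, path=""):
    """Yield (path, self_time) per span; self time = duration - children."""
    name = f"{path} > {span['name']}" if path else span["name"]
    children = span.get("children", [])
    own = span["duration"] - sum(c["duration"] for c in children)
    yield name, round(own, 3)
    for child in children:
        yield from self_times(child, name)

trace = {"name": "POST /posts", "duration": 4.2, "children": [
    {"name": "ALB", "duration": 0.01},
    {"name": "ECS api", "duration": 4.15, "children": [
        {"name": "auth.verify_token", "duration": 0.05},
        {"name": "db.posts.insert", "duration": 3.80, "children": [
            {"name": "RDS PostgreSQL", "duration": 3.78},
        ]},
        {"name": "notify.publish", "duration": 0.30, "children": [
            {"name": "SNS:Publish", "duration": 0.28},
        ]},
    ]},
]}

slowest = max(self_times(trace), key=lambda item: item[1])
print(slowest)
```

Here the largest self time belongs to the RDS call, which matches the "suspect" marker in the diagram.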

FastAPI/Django integration #

Install

pip install aws-xray-sdk

FastAPI

from fastapi import FastAPI
from aws_xray_sdk.core import xray_recorder, patch
# NOTE: confirm that your installed aws-xray-sdk provides this module
# before relying on it.
from aws_xray_sdk.ext.fastapi.middleware import XRayMiddleware

xray_recorder.configure(service="blog-api")

app = FastAPI()
app.add_middleware(XRayMiddleware, recorder=xray_recorder)

# SQLAlchemy tracing: patch SQLAlchemy Core before the engine is created
patch(("sqlalchemy_core",))

Sidecar — X-Ray Daemon #

In ECS, the X-Ray Daemon container runs as a sidecar in the same task definition:

task definition with sidecar

{
  "containerDefinitions": [
    { "name": "api", ... },
    {
      "name": "xray-daemon",
      "image": "public.ecr.aws/xray/aws-xray-daemon:latest",
      "portMappings": [{ "containerPort": 2000, "protocol": "udp" }],
      "essential": false
    }
  ]
}

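The app sends trace segments over UDP to 127.0.0.1:2000, and the daemon batches and forwards them to the X-Ray service. The task role needs xray:PutTraceSegments (the AWSXRayDaemonWriteAccess managed policy includes it).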
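
Under the hood, what reaches the daemon is a small UDP datagram: a one-line JSON header followed by the segment document. The SDK normally handles this, so the sketch below is only meant to show the shape of the traffic. The helper names are hypothetical and the segment contains just the minimal required fields.

```python
import json
import os
import socket
import time

def build_segment_datagram(name, start, end):
    """Encode one minimal segment in the daemon's UDP wire format."""
    segment = {
        "name": name,
        "id": os.urandom(8).hex(),  # 16 hex digits
        # 1-<8 hex digits of epoch seconds>-<24 random hex digits>
        "trace_id": f"1-{int(start):08x}-{os.urandom(12).hex()}",
        "start_time": start,
        "end_time": end,
    }
    header = {"format": "json", "version": 1}
    return (json.dumps(header) + "\n" + json.dumps(segment)).encode("utf-8")

def send_to_daemon(datagram, address=("127.0.0.1", 2000)):
    """Fire-and-forget UDP send to the sidecar daemon."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(datagram, address)

now = time.time()
datagram = build_segment_datagram("blog-api", now - 0.25, now)
print(datagram.decode("utf-8"))
# send_to_daemon(datagram)  # uncomment when a daemon is listening on :2000
```

Because the transport is UDP, the app never blocks on the daemon, which is also why lost segments go unnoticed if the sidecar is down.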

Where it shines most #

Case	X-Ray value
Single container + single DB	Moderate — Logs alone may suffice
Multiple microservice calls	Very high — see which step is slow in one line
External API dependency	Very high — verify the external call is actually the slow part
Lambda + DynamoDB	Very high — separate Lambda cold starts from external calls

Sampling #

Tracing every request is expensive. Use sampling rules for 5–10% only:

x-ray.json

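{
  "version": 2,
  "rules": [
    {
      "description": "Skip health checks",
      "host": "*",
      "http_method": "*",
      "url_path": "/health",
      "fixed_target": 0,
      "rate": 0.0
    },
    {
      "description": "Default",
      "host": "*",
      "http_method": "*",
      "url_path": "*",
      "fixed_target": 1,
      "rate": 0.05
    }
  ],
  "default": { "fixed_target": 1, "rate": 0.05 }
}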

Set health checks like /health to 0% so traces don’t drown in noise.
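
The effect of `fixed_target` and `rate` on trace volume can be estimated with simple arithmetic: the first `fixed_target` requests each second are always traced, then `rate` applies to the remainder. The sketch below is a steady-state estimate for a single rule with evenly spread traffic; the function name and the traffic figures are illustrative assumptions.

```python
def traced_per_second(requests_per_second, fixed_target=1, rate=0.05):
    """Expected traces/s: the reservoir first, then a percentage of the rest."""
    reservoir = min(requests_per_second, fixed_target)
    sampled = (requests_per_second - reservoir) * rate
    return reservoir + sampled

for rps in (0.5, 10, 200):
    per_sec = traced_per_second(rps)
    per_month = per_sec * 60 * 60 * 24 * 30
    print(f"{rps:>5} req/s -> {per_sec:.2f} traces/s (~{per_month:,.0f} per 30 days)")
```

At low traffic the reservoir dominates, so a quiet service is traced almost fully even with a 5% rate.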

7) Dashboard — one screen #

Put operational signals in one CloudWatch Dashboard:

9 recommended widgets

[1] Requests/s (ALB)         [2] 5xx rate (ALB)         [3] p99 latency
[4] ECS CPU (Service)         [5] ECS Memory             [6] Running tasks
[7] RDS CPU                   [8] RDS Connections        [9] RDS FreeStorage

Dashboard IaC (Terraform excerpt)

resource "aws_cloudwatch_dashboard" "blog" {
  dashboard_name = "blog-overview"
  dashboard_body = jsonencode({
    widgets = [
      {
        type   = "metric",
        x      = 0, y = 0, width = 8, height = 6,
        properties = {
          metrics = [["AWS/ApplicationELB", "RequestCount", "LoadBalancer", "app/blog-alb/abc123"]],
          period  = 60, stat = "Sum", region = "ap-northeast-2",
          title   = "Requests/min"
        }
      }
      # ... 8 more
    ]
  })
}
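
Writing out x/y coordinates for all nine widgets by hand gets tedious. As a sketch, the positions for a 3×3 grid can be generated and the result fed into the dashboard body. The helper `grid_layout` is hypothetical, and the `metrics` lists are left empty for you to fill in per widget.

```python
import json

WIDGET_TITLES = [
    "Requests/s (ALB)", "5xx rate (ALB)", "p99 latency",
    "ECS CPU (Service)", "ECS Memory", "Running tasks",
    "RDS CPU", "RDS Connections", "RDS FreeStorage",
]

def grid_layout(titles, columns=3, width=8, height=6):
    """Place widgets left-to-right, top-to-bottom on the 24-unit-wide grid."""
    widgets = []
    for i, title in enumerate(titles):
        row, col = divmod(i, columns)
        widgets.append({
            "type": "metric",
            "x": col * width, "y": row * height,
            "width": width, "height": height,
            # "metrics" is intentionally empty here; fill it per widget.
            "properties": {"title": title, "metrics": []},
        })
    return widgets

layout = grid_layout(WIDGET_TITLES)
print(json.dumps({"widgets": layout[:2]}, indent=2))
```

Three widgets of width 8 fill the 24-unit grid exactly, which matches the `width = 8` used in the Terraform excerpt.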

Periodic review #

Once a week, the on-call engineer scans the dashboard to spot gradually worsening areas. Alarms catch only acute incidents; slow degradation is caught faster by human eyes.

Pitfalls — production monitoring #

1) Alert Fatigue — too many alarms #

30 alarms / day → soon everyone ignores alarms. Recommendation:

Alarm tier	Frequency	Channel
Critical	1–2/month	PagerDuty / SMS
Warning	1–2/week	Slack #ops
Info	Frequent	Slack #ops-info (quiet channel)

Keep fewer than 5 alarms that actually wake people up.

2) Logs growing infinitely #

Missing retention setup leads to bill shock 6 months later. Set retention on every log group — easily done all at once with Terraform.

3) Retention too short #

During a post-incident review, “let’s check the logs” often comes too late: with 7-day retention the logs are already gone, and they cannot be exported once they have expired. Core log groups deserve 30+ days of retention.

4) X-Ray 100% sampling #

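Cost explodes. Sample 5–10% as the baseline and raise the rate only where you need more detail. X-Ray sampling rules match on request attributes such as path and method, so to capture more of the errors and slow requests, target the endpoints where they occur.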

5) Alarms without SLOs #

Where did the alarm threshold come from — “I picked 80%”? Without an SLO (e.g., p99 latency < 500ms, 99% of the time), thresholds are arbitrary. Define your SLO first, then derive the threshold from it.
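
One common way to turn an SLO into a threshold is through its error budget. The sketch below is not from this series' setup: the burn-rate framing and both function names are assumptions borrowed from general SRE practice, shown here with the example SLO above.

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes per window in which the objective may be violated."""
    return (1 - slo) * window_days * 24 * 60

def burn_rate(bad_fraction, slo):
    """How many times faster than allowed the budget is being spent."""
    return bad_fraction / (1 - slo)

slo = 0.99  # e.g. "p99 latency < 500ms, 99% of the time"
print(f"budget: {error_budget_minutes(slo):.0f} min per 30 days")
# If 5% of recent minutes violated the objective:
print(f"burn rate: {burn_rate(0.05, slo):.1f}x")
```

An alarm threshold can then be stated as "page when the burn rate stays above some multiple", which ties the number back to the SLO instead of a guess.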

6) Dashboard exists, never viewed #

A dashboard that is built and never viewed is the same as no dashboard. Add a 30-minute dashboard review to the weekly on-call meeting.

7) Alarms don’t reach people #

Email only → buried in the inbox. Use SMS / PagerDuty / Slack mentions — channels that actually reach you.

Wrapping up #

What we covered in this post:

4 pillars — Metrics / Logs / Traces / Events
CloudWatch Logs — automatic awslogs, retention setup, structured JSON, 7 operational queries
CloudWatch Metrics — Container Insights enabled, ECS / RDS / ALB core metrics and thresholds
EMF — emit metrics through logs without PutMetricData
Alarms — period × evaluation × datapoints pattern, treat-missing-data
Composite Alarm — reduce noise
SNS → Lambda → Slack — reach humans with alarms
X-Ray — distributed tracing, sidecar daemon, sampling for cost control
Dashboard — 9-widget single screen, IaC-ified
Pitfalls — alert fatigue, retention, sampling, missing SLO, ignored dashboard, alarm channels

Next — Cost and track wrap-up #

Now the system runs well, and alarms fire on incidents. One last topic remains: how much it all costs, plus a retrospective on the 27 posts of the track.

In #6 Cost optimization and dashboards — wrapping up the track we’ll cover Cost Explorer analysis, Savings Plans / Spot Fargate, Right Sizing, tag enforcement, cost dashboards, and how the 27 posts of the AWS track come together as one system.
