Chapter 26

Monitoring — CloudWatch Alarms and X-Ray

Operational CloudWatch Logs Insights queries, the core metrics and alarm thresholds for ECS / RDS / ALB, SNS → Slack notifications, and capturing a slow request with X-Ray distributed tracing. Turning on the eyes of operations.

In Chapters 22 to 25 the infrastructure became code and deployment became automatic. Yet we still can’t see, on one screen, whether the system is running well: whether 5xx errors are rising, whether RDS CPU is at 80%, or which request took 5 seconds.

This chapter makes that state visible at a glance. It is the fifth chapter of Part 4 and covers the following.

CloudWatch Logs + operational Logs Insights queries
CloudWatch Metrics — the core metrics and alarm thresholds for ECS / RDS / ALB
the alarm → SNS → Slack flow
X-Ray — pinpointing “where is it slow” with distributed tracing
dashboards — system state on one screen

The big picture — the 4 components of monitoring #

The components of observability

┌──────────────┬──────────────┬──────────────┬──────────────┐
│   Metrics    │     Logs     │    Traces    │    Events    │
├──────────────┼──────────────┼──────────────┼──────────────┤
│ "how much"   │ "what"       │ "where"      │ "when"       │
│ requests,5xx │ stacktrace   │ DB 5s        │ deploy,scale │
│ CPU, memory  │ access log   │ ext API 1s   │ failover     │
├──────────────┼──────────────┼──────────────┼──────────────┤
│ CloudWatch   │ CloudWatch   │ X-Ray        │ EventBridge  │
│ Metrics      │ Logs         │              │              │
└──────────────┴──────────────┴──────────────┴──────────────┘

This chapter covers three of these areas: Metrics, Logs, and Traces. Events was covered in Chapter 19 EventBridge / SQS / SNS.

1) CloudWatch Logs — already flowing #

Because Chapter 22’s Task Definition configures the awslogs log driver, all container stdout/stderr goes automatically to CloudWatch Logs.

The hierarchy of logs

Log Group: /ecs/blog-api
   │
   ├── Log Stream: api/<task-id-1>     ← one stream per Task
   ├── Log Stream: api/<task-id-2>
   └── Log Stream: api/<task-id-3>

Retention setting — cost separation #

By default, log events never expire. Even with little traffic, logs accumulate by the gigabyte month after month, and the storage cost grows with them.

30-day retention

aws logs put-retention-policy \
  --log-group-name /ecs/blog-api \
  --retention-in-days 30

Recommended values are as follows.

production access log: 30 ~ 90 days
debug / verbose: 7 days
audit log: 365 days (or export to S3 then delete)
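One way to apply these values across many log groups is to pick the retention from the group name and then call the CloudWatch Logs API. The following is a minimal sketch, assuming a hypothetical naming convention in which debug groups contain `debug` and audit groups contain `audit`. The names `pick_retention_days` and `apply_retention` are illustrative, and the `boto3` call only runs if you invoke `apply_retention` yourself with AWS credentials configured.

```python
# Hypothetical naming convention: the category is inferred from the group name.
RETENTION_DAYS = {"audit": 365, "debug": 7}
DEFAULT_DAYS = 30  # production access logs: lower end of the 30 to 90 day range


def pick_retention_days(log_group_name: str) -> int:
    """Return the retention (in days) for a log group based on its name."""
    lowered = log_group_name.lower()
    for keyword, days in RETENTION_DAYS.items():
        if keyword in lowered:
            return days
    return DEFAULT_DAYS


def apply_retention(log_group_names):
    """Set the retention policy on each group (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the planning logic works without boto3

    logs = boto3.client("logs")
    for name in log_group_names:
        logs.put_retention_policy(
            logGroupName=name,
            retentionInDays=pick_retention_days(name),
        )


for group in ["/ecs/blog-api", "/ecs/blog-api-debug", "/audit/blog"]:
    print(group, pick_retention_days(group))
```

Keeping the decision in a pure function makes the policy easy to review before anything touches the account.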

Structured logs are key #

print() is hard to search. Emit JSON and Logs Insights can query it key by key.

FastAPI — JSON logging

import logging, json, sys

class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "level":   record.levelname,
            "logger":  record.name,
            "message": record.getMessage(),
            "ts":      self.formatTime(record),
            **getattr(record, "extra", {}),
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

Per-request log

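import logging
import time

@app.middleware("http")
async def access_log(request, call_next):
    start = time.perf_counter()  # monotonic clock, suited to measuring durations
    response = await call_next(request)
    logging.info("access", extra={"extra": {
        "method": request.method,
        "path": str(request.url.path),
        "status": response.status_code,
        "duration_ms": int((time.perf_counter() - start) * 1000),
        # Assumes an earlier middleware stored request_id; falls back to None.
        "request_id": getattr(request.state, "request_id", None),
    }})
    return response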
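To make the shape of one emitted line concrete, the sketch below declares the same kind of formatter again so that it runs on its own, logs a single access record to an in-memory stream, and parses the result. The logger name `demo.access` and the helper `render_access_line` are made up for the illustration.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """One JSON object per log line, with the "extra" dict merged in."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "ts": self.formatTime(record),
            **getattr(record, "extra", {}),
        })


def render_access_line(method, path, status, duration_ms, request_id):
    """Log one access record to an in-memory stream and return the parsed JSON."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("demo.access")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.addHandler(handler)
    try:
        logger.info("access", extra={"extra": {
            "method": method,
            "path": path,
            "status": status,
            "duration_ms": duration_ms,
            "request_id": request_id,
        }})
    finally:
        logger.removeHandler(handler)
    return json.loads(stream.getvalue())


line = render_access_line("GET", "/posts", 200, 42, "abc-123-xyz")
print(line["status"], line["duration_ms"], line["request_id"])
```

Because `status` and `duration_ms` land as top-level numeric fields, a filter such as `status >= 500` compares numbers rather than substrings.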

Logged this way, every field is directly queryable in Logs Insights, as the queries in the next section show.

2) Logs Insights — 7 operational queries #

A collection of frequently used queries. Worth bookmarking.

A) Pulling out only 5xx #

fields @timestamp, status, path, request_id, message
| filter status >= 500
| sort @timestamp desc
| limit 100

B) Response-time distribution (p50/p90/p99) #

fields @timestamp, duration_ms
| filter ispresent(duration_ms)
| stats
    count(*) as requests,
    pct(duration_ms, 50) as p50,
    pct(duration_ms, 90) as p90,
    pct(duration_ms, 99) as p99
  by bin(5m)
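To see why this query reports percentiles instead of an average, the sketch below computes the same summary over one bin of sample durations with the nearest-rank method. The sample values are invented for illustration, and Logs Insights may estimate percentiles differently internally.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("percentile of empty data")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Invented sample: 98 fast requests and 2 slow ones in a single 5-minute bin.
durations_ms = [40] * 98 + [5000] * 2

summary = {
    "requests": len(durations_ms),
    "avg": sum(durations_ms) / len(durations_ms),
    "p50": percentile(durations_ms, 50),
    "p90": percentile(durations_ms, 90),
    "p99": percentile(durations_ms, 99),
}
print(summary)
```

In this sample the average stays under 140 ms while p99 is 5000 ms, which is the kind of tail the average hides.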

C) The slowest paths #

fields path, duration_ms
| filter duration_ms > 1000
| stats count(*) as slow_requests, avg(duration_ms) as avg_ms, max(duration_ms) as max_ms by path
| sort avg_ms desc
| limit 20

D) Tracing one request by request_id #

fields @timestamp, level, message, path, status, duration_ms
| filter request_id = "abc-123-xyz"
| sort @timestamp asc

E) Lines with a stacktrace #

fields @timestamp, message
| filter @message like /Traceback|exception/
| sort @timestamp desc

F) Authentication failures by source IP #

fields @timestamp, source_ip, username
| filter event = "auth_fail"
| stats count(*) as failures by source_ip
| sort failures desc

G) Cost — which path is called the most #

fields path
| stats count(*) as requests by path
| sort requests desc
| limit 30

Saved Queries #

Save frequently used queries in the console to share across the whole team. You can codify them as IaC with CloudFormation / Terraform (aws_cloudwatch_query_definition).

3) CloudWatch Metrics — the core indicators #

ECS Container Insights #

Default ECS metrics are sparse. Turn on Container Insights and you see CPU / memory / network / disk / running task count per task / service all at once.

Enabling Container Insights

aws ecs update-cluster-settings \
  --cluster blog-cluster \
  --settings name=containerInsights,value=enabled

There’s an added cost (roughly $1 to $3 per month for a small cluster), but it’s essential in production.

Monitoring table — what to watch #

Metric	Resource	Meaning	Alarm threshold (example)
`HTTPCode_Target_5XX_Count`	ALB	backend 5xx	5-min sum ≥ 5
`HTTPCode_ELB_5XX_Count`	ALB	the ALB’s own 5xx (mostly 0 healthy hosts)	5-min sum ≥ 1
`TargetResponseTime` (p99)	ALB	response time p99	5-min p99 ≥ 1.0s
`UnHealthyHostCount`	Target Group	count of dead tasks	5-min avg ≥ 1
`CPUUtilization` (Service)	ECS	service average CPU	5-min avg ≥ 80%
`MemoryUtilization` (Service)	ECS	memory	5-min avg ≥ 85%
`RunningTaskCount`	ECS	running task count	differs from desired
`CPUUtilization`	RDS	DB CPU	5-min avg ≥ 80%
`DatabaseConnections`	RDS	connection count	80% of max_connections
`FreeStorageSpace`	RDS	remaining disk	< 5GB
`ReadLatency` / `WriteLatency`	RDS	disk latency	> 50ms

Custom Metrics #

Metrics you emit directly from the app. Embed EMF (Embedded Metric Format) into the log and the metric is created with no separate call.

Business metrics via EMF

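import json, time

def emit_metric(metric_name, value, unit="Count", **dims):
    payload = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": "BlogApp",
                "Dimensions": [list(dims.keys())],
                "Metrics": [{"Name": metric_name, "Unit": unit}],
            }],
        },
        metric_name: value,
        **dims,
    }
    # Write the EMF document as the whole log line. Sending it through the
    # JsonFormatter above would nest it inside "message", and the "_aws" key
    # has to sit at the root of the log event.
    print(json.dumps(payload), flush=True)

emit_metric("PostCreated", 1, env="prod")
emit_metric("CommentCreated", 1, env="prod")
# Each distinct dimension value becomes its own metric, so a high-cardinality
# dimension such as source_ip can get expensive.
emit_metric("LoginFailed", 1, source_ip="...")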

CloudWatch parses the log and automatically creates the PostCreated metric in the BlogApp namespace. Because there is no separate PutMetricData API call, this saves both cost and latency.

4) Alarms — call a human when the threshold is crossed #

ALB 5xx alarm

aws cloudwatch put-metric-alarm \
  --alarm-name "blog-alb-5xx-burst" \
  --metric-name HTTPCode_Target_5XX_Count \
  --namespace AWS/ApplicationELB \
  --statistic Sum \
  --period 60 \
  --evaluation-periods 5 \
  --datapoints-to-alarm 3 \
  --threshold 5 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --treat-missing-data notBreaching \
  --dimensions Name=LoadBalancer,Value=app/blog-alb/abc123 \
  --alarm-actions arn:aws:sns:ap-northeast-2:123456789012:ops-alerts

The key options laid out:

Option	Meaning
`period`	data point unit (60 = 1 minute)
`evaluation-periods`	how many points to evaluate
`datapoints-to-alarm`	how many of those crossing the threshold trigger the alarm
`treat-missing-data`	when there’s no data — `notBreaching` recommended
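`comparison-operator`	comparison against the threshold (`GreaterThanOrEqualToThreshold`, `GreaterThanThreshold`, `LessThanThreshold`, `LessThanOrEqualToThreshold`)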

The 5/3 pattern (“alarm if 3 of the last 5 one-minute data points cross the threshold”) is a common default: it filters out transient spikes while still catching real incidents.
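The evaluation logic can be sketched in a few lines. This is a simplified model, not CloudWatch's actual evaluator: it looks only at the most recent window and treats a missing period as not breaching, which mirrors `--treat-missing-data notBreaching`. The function name `alarm_state` and the sample series are invented for the illustration.

```python
def alarm_state(datapoints, threshold, evaluation_periods=5, datapoints_to_alarm=3):
    """Return "ALARM" or "OK" for the most recent evaluation window.

    datapoints: per-period sums, oldest first; None marks a missing period,
    which is counted as not breaching.
    """
    window = datapoints[-evaluation_periods:]
    breaching = sum(1 for value in window if value is not None and value >= threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"


# One-minute 5xx sums; the threshold of 5 matches the alarm above.
single_spike = [0, 0, 40, 0, 0]
sustained = [0, 6, 9, None, 7]

print(alarm_state(single_spike, threshold=5))  # OK: only 1 of 5 points breaches
print(alarm_state(sustained, threshold=5))     # ALARM: 3 of 5 points breach
```

A single large spike never reaches three breaching points, so it stays quiet, while a sustained error rate does.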

Composite Alarm #

A composite alarm combines the states of several alarms. “The ALB 5xx alarm is firing AND the running-task alarm is OK” points to a real backend problem rather than missing capacity.

Composite Alarm

aws cloudwatch put-composite-alarm \
  --alarm-name "blog-real-incident" \
  --alarm-rule 'ALARM("blog-alb-5xx-burst") AND OK("blog-running-tasks-low")'

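OK(...) is true only while the named alarm is in the OK state. Here the composite fires only when 5xx errors rise while the expected number of tasks is running, which rules out errors that are merely a side effect of tasks being down and so reduces noise.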

5) SNS → Slack — the part that reaches a human #

The notification flow

CloudWatch Alarm
   │
   ▼
SNS Topic (ops-alerts)
   │
   ├── Email subscription   (operations team)
   ├── SMS subscription      (oncall)
   ├── Lambda subscription   ← converts to a Slack webhook
   └── PagerDuty / OpsGenie

SNS → Slack Lambda #

lambda_handler.py

import json, os, urllib.request

WEBHOOK = os.environ["SLACK_WEBHOOK"]

def handler(event, context):
    for record in event["Records"]:
        msg = json.loads(record["Sns"]["Message"])
        text = (
            f":rotating_light: *{msg['AlarmName']}*\n"
            f"Region: {msg['Region']}\n"
            f"State: {msg['NewStateValue']} (was {msg['OldStateValue']})\n"
            f"Reason: {msg['NewStateReason']}\n"
        )
        req = urllib.request.Request(
            WEBHOOK,
            data=json.dumps({"text": text}).encode(),
            headers={"Content-Type": "application/json"},
        )
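        # A timeout keeps a slow webhook from holding the function until Lambda kills it.
        with urllib.request.urlopen(req, timeout=5) as resp:
            resp.read()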

It’s the pattern from Chapter 17 Lambda basics. Set up an SNS subscription so SNS invokes the Lambda, and you’re done.

The alarm message format #

A good alarm message contains the following.

what broke (alarm name)
how much it broke (threshold / actual value)
where (region, service)
when (timestamp)
a link — straight to the console / dashboard / Logs Insights

The link matters most. An oncall who sees Slack at 3 AM should be able to get into context with one click.
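As a sketch of such a message, the function below formats an already-parsed CloudWatch alarm notification into Slack text that covers those five points. The console deep-link format is an assumption and should be checked against the URL your own console shows. The helper names and the sample values are made up for the example.

```python
from urllib.parse import quote


def region_from_arn(alarm_arn: str) -> str:
    """Extract the region code (the 4th colon-separated field) from an alarm ARN."""
    return alarm_arn.split(":")[3]


def build_slack_text(msg: dict) -> str:
    """Format a CloudWatch alarm notification (already parsed from JSON)."""
    region = region_from_arn(msg["AlarmArn"])
    trigger = msg.get("Trigger", {})
    # Assumed console deep-link format; verify it against your console URL.
    link = (
        f"https://{region}.console.aws.amazon.com/cloudwatch/home"
        f"?region={region}#alarmsV2:alarm/{quote(msg['AlarmName'])}"
    )
    return "\n".join([
        f"*{msg['AlarmName']}* is {msg['NewStateValue']}",  # what broke
        f"Threshold: {trigger.get('Threshold')} on {trigger.get('MetricName')}",  # how much
        f"Region: {region}",  # where
        f"Time: {msg['StateChangeTime']}",  # when
        f"Reason: {msg['NewStateReason']}",
        f"<{link}|Open alarm in console>",  # one-click link (Slack link syntax)
    ])


# Invented sample notification with the fields the function reads.
sample = {
    "AlarmName": "blog-alb-5xx-burst",
    "AlarmArn": "arn:aws:cloudwatch:ap-northeast-2:123456789012:alarm:blog-alb-5xx-burst",
    "NewStateValue": "ALARM",
    "NewStateReason": "Threshold Crossed: 3 out of the last 5 datapoints were >= 5.0",
    "StateChangeTime": "2024-01-01T03:00:00.000+0000",
    "Trigger": {"MetricName": "HTTPCode_Target_5XX_Count", "Threshold": 5.0},
}
print(build_slack_text(sample))
```

The region code is taken from the ARN because a link needs the code (`ap-northeast-2`) rather than a display name.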

6) X-Ray — distributed tracing #

Metrics tell you up to “5xx is up.” “Why is 5xx up?” is answered by Logs. “Where did this request spend 5 seconds?” is answered by X-Ray.

The shape of an X-Ray Trace

Request: POST /posts                       4.2s
   │
   ├── ALB                                 0.01s
   │
   └── ECS api                             4.15s
         │
         ├── auth.verify_token             0.05s
         │
         ├── db.posts.insert               3.80s   ← the culprit
         │     └── RDS PostgreSQL          3.78s
         │           └── (slow query)
         │
         └── notify.publish (SNS)          0.30s
               └── SNS:Publish             0.28s

FastAPI/Django integration #

Install

pip install aws-xray-sdk

FastAPI

from fastapi import FastAPI
from aws_xray_sdk.core import patch, xray_recorder
# Verify this import against your installed aws-xray-sdk version: the SDK
# documents middleware for Django, Flask, Bottle, and aiohttp.
from aws_xray_sdk.ext.fastapi.middleware import XRayMiddleware

xray_recorder.configure(service="blog-api")

app = FastAPI()
app.add_middleware(XRayMiddleware, recorder=xray_recorder)

# SQLAlchemy tracing: supported libraries are instrumented only after an
# explicit patch() call.
patch(["sqlalchemy_core"])

Sidecar — X-Ray Daemon #

On ECS, place the X-Ray Daemon container as a sidecar inside the same task definition.

Adding the sidecar to the task definition

{
  "containerDefinitions": [
    { "name": "api", ... },
    {
      "name": "xray-daemon",
      "image": "public.ecr.aws/xray/aws-xray-daemon:latest",
      "portMappings": [{ "containerPort": 2000, "protocol": "udp" }],
      "essential": false
    }
  ]
}

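The app sends traces over UDP to 127.0.0.1:2000, and the daemon batches them off to the X-Ray service. The task role needs permission to upload them: xray:PutTraceSegments at minimum, or the AWS managed policy AWSXRayDaemonWriteAccess, which covers what the daemon needs.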

Where the value is greatest #

Situation	X-Ray value
single container + single DB	moderate — Logs alone is enough
multiple microservice calls	very high — which step is slow
dependency on external APIs	very high — verify whether the external one is really slow
Lambda + DynamoDB	very high — separates Lambda cold start from external calls

Sampling #

Tracing every request is costly. Use a sampling rule to trace only 5 ~ 10%.

x-ray.json

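{
  "version": 2,
  "rules": [
    {
      "description": "Skip health checks",
      "host": "*",
      "http_method": "*",
      "url_path": "/health",
      "fixed_target": 0,
      "rate": 0.0
    },
    {
      "description": "Default",
      "host": "*",
      "http_method": "*",
      "url_path": "*",
      "fixed_target": 1,
      "rate": 0.05
    }
  ],
  "default": { "fixed_target": 1, "rate": 0.05 }
}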

You should exclude health checks like /health at 0% so the traces don’t fill up with noise.
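To get a feel for what `fixed_target` and `rate` mean in volume, the sketch below estimates the traces per second a rule produces: up to `fixed_target` requests each second are traced first, then `rate` is applied to the rest. It is a back-of-the-envelope model with an invented function name, and it ignores how the reservoir is shared across multiple SDK instances.

```python
def expected_traces_per_second(requests_per_second, fixed_target=1, rate=0.05):
    """Estimate sampled traces/s: the reservoir first, then a fraction of the rest."""
    reservoir = min(requests_per_second, fixed_target)
    remainder = max(0.0, requests_per_second - fixed_target)
    return reservoir + rate * remainder


for rps in (1, 20, 100):
    per_second = expected_traces_per_second(rps)
    print(f"{rps:>4} req/s -> {per_second:.2f} traces/s, "
          f"{per_second * 86400:,.0f} traces/day")
```

Setting `rate=1.0` in the same function shows how quickly 100% sampling multiplies the daily trace count.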

7) Dashboard — one screen #

Gather the operational signals onto one screen in a CloudWatch Dashboard.

9 recommended widgets

[1] Requests/s (ALB)         [2] 5xx rate (ALB)         [3] p99 latency
[4] ECS CPU (Service)         [5] ECS Memory             [6] Running tasks
[7] RDS CPU                   [8] RDS Connections        [9] RDS FreeStorage

Dashboard IaC (Terraform excerpt)

resource "aws_cloudwatch_dashboard" "blog" {
  dashboard_name = "blog-overview"
  dashboard_body = jsonencode({
    widgets = [
      {
        type   = "metric",
        x      = 0, y = 0, width = 8, height = 6,
        properties = {
          metrics = [["AWS/ApplicationELB", "RequestCount", "LoadBalancer", "app/blog-alb/abc123"]],
          period  = 60, stat = "Sum", region = "ap-northeast-2",
          title   = "Requests/min"
        }
      }
      # ... 8 more
    ]
  })
}
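The positions of the remaining widgets follow mechanically from the widget size. As a rough sketch, assuming the 24-column dashboard grid and the same 8 × 6 widget size as the excerpt, the helper below (an invented name) computes `x` and `y` for all nine titles. It produces positions only, not complete widget definitions.

```python
import json

TITLES = [
    "Requests/s (ALB)", "5xx rate (ALB)", "p99 latency",
    "ECS CPU (Service)", "ECS Memory", "Running tasks",
    "RDS CPU", "RDS Connections", "RDS FreeStorage",
]

GRID_COLUMNS = 24      # CloudWatch dashboards use a 24-column grid
WIDTH, HEIGHT = 8, 6   # same widget size as the Terraform excerpt


def layout(titles, width=WIDTH, height=HEIGHT):
    """Return one position dict per title, filling rows left to right."""
    per_row = GRID_COLUMNS // width
    widgets = []
    for index, title in enumerate(titles):
        row, col = divmod(index, per_row)
        widgets.append({
            "title": title,
            "x": col * width,
            "y": row * height,
            "width": width,
            "height": height,
        })
    return widgets


widgets = layout(TITLES)
print(json.dumps(widgets, indent=2))
```

With a width of 8, three widgets fit per row, which gives the 3 × 3 arrangement listed above.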

Regular review #

Once a week, have the oncall scan the dashboard for gradually worsening indicators. Alarms only catch immediate incidents; for gradual deterioration, the human eye is faster.

Pitfalls of operational monitoring #

1) Alert Fatigue — too many alarms #

If there are 30 alarms a day, soon everyone ignores them. The recommended tiers are as follows.

Alarm tier	Frequency	Channel
Critical	1 ~ 2 times a month	PagerDuty / SMS
Warning	1 ~ 2 times a week	Slack #ops
Info	often	Slack #ops-info (a quiet channel)

Keep the alarms that truly wake a person to fewer than 5.

2) Logs grow infinitely #

Omit the retention setting and a bill shock comes six months later. Apply retention to every log group (all at once with Terraform).

3) Retention too short #

After an incident you go “let’s look at the logs from then,” but with 7-day retention they’re already gone. Exporting after the incident is too late, because there is nothing left to export. Keep key groups at 30 days or more.

4) X-Ray 100% sampling #

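Cost runs away. Keep the default at 5 ~ 10% and raise the rate only for the paths that matter most. Note that X-Ray sampling rules match on request attributes such as service name, HTTP method, and URL path, and the decision is made when the request starts, so a rule cannot select requests by their eventual error status or latency.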

5) Alarms without SLOs #

Without an SLO, the only answer to “where did this threshold come from?” is “someone said 80%.” Unless an SLO is stated (e.g., p99 < 500ms for 99% of the time), the threshold is arbitrary. Derive alarm thresholds from the SLO definition.
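As a small worked example, the sketch below takes the SLO quoted above and derives two numbers from it: the alarm statistic and threshold, and how many minutes per 30-day window the limit may be exceeded before the SLO is missed. The class and function names are illustrative, and the 30-day window is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencySlo:
    percentile: str   # e.g. "p99"
    limit_ms: int     # e.g. 500
    target: float     # fraction of time the limit must hold, e.g. 0.99


def error_budget_minutes(slo: LatencySlo, window_days: int = 30) -> int:
    """Minutes per window in which the limit may be exceeded."""
    total_minutes = window_days * 24 * 60
    return round(total_minutes * (1 - slo.target))


def alarm_settings(slo: LatencySlo) -> dict:
    """Alarm statistic and threshold taken directly from the SLO."""
    return {"statistic": slo.percentile, "threshold_seconds": slo.limit_ms / 1000}


slo = LatencySlo(percentile="p99", limit_ms=500, target=0.99)
print(alarm_settings(slo), error_budget_minutes(slo))
```

Here the threshold is no longer a guess: it is the SLO's own limit, and the budget says how much time above it is tolerable.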

6) Dashboard exists but isn’t looked at #

A dashboard you make and don’t look at is the same as not having one. Put a 30-minute dashboard review into the weekly oncall meeting.

7) Alarms don’t reach a human #

If you rely on email alone, alerts sink deep into the inbox. Use a channel that actively summons someone, such as SMS, PagerDuty, or a Slack mention.

Exercises #

Write out, without looking at §“The big picture,” what question each of monitoring’s 4 components (Metrics / Logs / Traces / Events) answers (“how much / what / where / when”). Also mark which three of those areas this chapter covers.
Explain, on the basis of §“Alarms,” how the three values period, evaluation-periods, and datapoints-to-alarm of the ALB 5xx alarm make the 5/3 pattern, and write in one sentence why this pattern filters out transient spikes.
From the §“Where the value is greatest” table, pick the two situations where X-Ray’s value is highest, and explain, in connection with Chapter 27 cost optimization, why 100% sampling is dangerous.

In short: observability divides into Metrics (how much), Logs (what), Traces (where), and Events (when). Logs collected automatically via awslogs are emitted as structured JSON and queried with Logs Insights; ALB, ECS, and RDS metrics are watched in CloudWatch Metrics, with Container Insights filling in ECS detail. Alarms filter noise with the period × evaluation × datapoints pattern and reach a human via SNS → Lambda → Slack, while X-Ray pinpoints the slow step with distributed tracing and keeps cost in check through sampling.

Next chapter #

Now we’ve reached the structure where the system runs well and an alarm sounds when an incident happens. Finally — how much is it costing, and how do you cut that cost? In the next Chapter 27 cost optimization and dashboards we cover Cost Explorer analysis, Savings Plans / Spot Fargate / Graviton, Right Sizing, tag enforcement, and a cost dashboard, and wrap up Part 4.