Monitoring — CloudWatch Alarms and X-Ray
Operational CloudWatch Logs Insights queries, the core metrics and alarm thresholds for ECS / RDS / ALB, SNS → Slack notifications, and capturing a slow request with X-Ray distributed tracing. Turning on the eyes of operations.
In Chapter 22 ~ Chapter 25 the infrastructure became code and deployment became automatic. Yet we can’t actually see, on one screen, whether this system is running well — whether 5xx is up, whether RDS CPU is at 80%, which request took 5 seconds.
This chapter makes that state visible at a glance. It is the fifth chapter of Part 4 and covers the following.
- CloudWatch Logs + operational Logs Insights queries
- CloudWatch Metrics — the core metrics and alarm thresholds for ECS / RDS / ALB
- the alarm → SNS → Slack flow
- X-Ray — pinpointing “where is it slow” with distributed tracing
- dashboards — system state on one screen
The big picture — the 4 components of monitoring #
┌──────────────┬──────────────┬──────────────┬──────────────┐
│ Metrics │ Logs │ Traces │ Events │
├──────────────┼──────────────┼──────────────┼──────────────┤
│ "how much" │ "what" │ "where" │ "when" │
│ requests,5xx │ stacktrace │ DB 5s │ deploy,scale │
│ CPU, memory │ access log │ ext API 1s │ failover │
├──────────────┼──────────────┼──────────────┼──────────────┤
│ CloudWatch │ CloudWatch │ X-Ray │ EventBridge │
│ Metrics │ Logs │ │ │
└──────────────┴──────────────┴──────────────┴──────────────┘
This chapter covers three of these areas: Metrics, Logs, and Traces. Events was covered in Chapter 19 EventBridge / SQS / SNS.
1) CloudWatch Logs — already flowing #
Because Chapter 22’s Task Definition includes awslogs, all container stdout/stderr goes automatically to CloudWatch Logs.
Log Group: /ecs/blog-api
│
├── Log Stream: api/<task-id-1> ← one stream per Task
├── Log Stream: api/<task-id-2>
└── Log Stream: api/<task-id-3>
Retention setting — cost separation #
By default a log group never expires its logs. Even with light traffic, logs accumulate by the gigabyte each month, and the storage cost grows with them.
aws logs put-retention-policy \
  --log-group-name /ecs/blog-api \
  --retention-in-days 30
Recommended values are as follows.
- production access log: 30 ~ 90 days
- debug / verbose: 7 days
- audit log: 365 days (or export to S3 then delete)
Structured logs are key #
print() is hard to search. Emit JSON and Logs Insights can query it key by key.
import json
import logging
import sys
import time

class JsonFormatter(logging.Formatter):
    def format(self, record):
        # Fields passed as extra={"extra": {...}} are merged into the JSON line.
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "ts": self.formatTime(record),
            **getattr(record, "extra", {}),
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

# `app` is the FastAPI application object.
@app.middleware("http")
async def access_log(request, call_next):
    start = time.time()
    response = await call_next(request)
    logging.info("access", extra={"extra": {
        "method": request.method,
        "path": str(request.url.path),
        "status": response.status_code,
        "duration_ms": int((time.time() - start) * 1000),
        # request_id is assumed to be set by an earlier middleware.
        "request_id": getattr(request.state, "request_id", None),
    }})
    return response
With logs emitted this way, Logs Insights can answer queries like the following precisely.
2) Logs Insights — 7 operational queries #
A collection of frequently used queries. Worth bookmarking.
A) Pulling out only 5xx #
fields @timestamp, status, path, request_id, message
| filter status >= 500
| sort @timestamp desc
| limit 100
B) Response-time distribution (p50/p90/p99) #
fields @timestamp, duration_ms
| filter ispresent(duration_ms)
| stats
    count(*) as requests,
    pct(duration_ms, 50) as p50,
    pct(duration_ms, 90) as p90,
    pct(duration_ms, 99) as p99
  by bin(5m)
C) The slowest paths #
fields path, duration_ms
| filter duration_ms > 1000
| stats count(*), avg(duration_ms), max(duration_ms) by path
| sort avg(duration_ms) desc
| limit 20
D) Tracing one request by request_id #
fields @timestamp, level, message, path, status, duration_ms
| filter request_id = "abc-123-xyz"
| sort @timestamp asc
E) Lines with a stacktrace #
fields @timestamp, message
| filter @message like /Traceback|exception/
| sort @timestamp desc
F) Login attempts (auth failures) #
fields @timestamp, source_ip, username
| filter event = "auth_fail"
| stats count(*) by source_ip
| sort count(*) desc
G) Cost — which path is called the most #
fields path
| stats count(*) by path
| sort count(*) desc
| limit 30
Saved Queries #
Save frequently used queries in the console to share across the whole team. You can codify them as IaC with CloudFormation / Terraform (aws_cloudwatch_query_definition).
3) CloudWatch Metrics — the core indicators #
ECS Container Insights #
Default ECS metrics are sparse. Turn on Container Insights and you see CPU / memory / network / disk / running task count per task / service all at once.
aws ecs update-cluster-settings \
  --cluster blog-cluster \
  --settings name=containerInsights,value=enabled
There is an added cost (roughly $1 to $3 per month for a small cluster), but it is essential in production.
Monitoring table — what to watch #
| Metric | Resource | Meaning | Alarm threshold (example) |
|---|---|---|---|
| HTTPCode_Target_5XX_Count | ALB | backend 5xx | 5-min sum ≥ 5 |
| HTTPCode_ELB_5XX_Count | ALB | the ALB’s own 5xx (usually means 0 healthy hosts) | 5-min sum ≥ 1 |
| TargetResponseTime (p99) | ALB | response time p99 | 5-min p99 ≥ 1.0s |
| UnHealthyHostCount | Target Group | count of dead tasks | 5-min avg ≥ 1 |
| CPUUtilization (Service) | ECS | service average CPU | 5-min avg ≥ 80% |
| MemoryUtilization (Service) | ECS | service average memory | 5-min avg ≥ 85% |
| RunningTaskCount | ECS | running task count | differs from desired count |
| CPUUtilization | RDS | DB CPU | 5-min avg ≥ 80% |
| DatabaseConnections | RDS | connection count | ≥ 80% of max_connections |
| FreeStorageSpace | RDS | remaining disk | < 5GB |
| ReadLatency / WriteLatency | RDS | disk latency | > 50ms |
Custom Metrics #
Metrics you emit directly from the app. Embed EMF (Embedded Metric Format) into the log and the metric is created with no separate call.
import json
import time

def emit_metric(metric_name, value, unit="Count", **dims):
    payload = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # epoch milliseconds
            "CloudWatchMetrics": [{
                "Namespace": "BlogApp",
                "Dimensions": [list(dims.keys())],
                "Metrics": [{"Name": metric_name, "Unit": unit}],
            }],
        },
        metric_name: value,
        **dims,
    }
    # Write the payload as the whole log line: EMF expects "_aws" at the top
    # level of the event, so it must not be wrapped by a JSON log formatter.
    print(json.dumps(payload), flush=True)

emit_metric("PostCreated", 1, env="prod")
emit_metric("CommentCreated", 1, env="prod")
emit_metric("LoginFailed", 1, source_ip="...")
CloudWatch parses the log and automatically creates the PostCreated metric in the BlogApp namespace. Because there is no separate PutMetricData API call, this saves both cost and latency.
4) Alarms — call a human when the threshold is crossed #
aws cloudwatch put-metric-alarm \
  --alarm-name "blog-alb-5xx-burst" \
  --metric-name HTTPCode_Target_5XX_Count \
  --namespace AWS/ApplicationELB \
  --statistic Sum \
  --period 60 \
  --evaluation-periods 5 \
  --datapoints-to-alarm 3 \
  --threshold 5 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --treat-missing-data notBreaching \
  --dimensions Name=LoadBalancer,Value=app/blog-alb/abc123 \
  --alarm-actions arn:aws:sns:ap-northeast-2:123456789012:ops-alerts
The key options laid out:
| Option | Meaning |
|---|---|
| period | length of one data point in seconds (60 = 1 minute) |
| evaluation-periods | how many of the most recent data points to evaluate |
| datapoints-to-alarm | how many of those must cross the threshold to trigger the alarm |
| treat-missing-data | what to do when there is no data; notBreaching recommended |
| comparison-operator | >= / > / < / <= |
The 5/3 pattern (“if 3 of the last 5 minutes cross the threshold”) is the standard that filters out transient spikes while catching real incidents.
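The 5/3 rule can be sketched in a few lines of Python. This is a simplified model, not CloudWatch's actual evaluator: the function name `alarm_state` and the sample data are illustrative, missing data points are represented as `None` and treated as not breaching, and states such as INSUFFICIENT_DATA are ignored.

```python
from typing import Optional, Sequence

def alarm_state(datapoints: Sequence[Optional[float]], threshold: float,
                evaluation_periods: int = 5, datapoints_to_alarm: int = 3) -> str:
    """Return "ALARM" or "OK" for the most recent evaluation window.

    None marks a missing data point and counts as not breaching,
    mirroring --treat-missing-data notBreaching.
    """
    window = datapoints[-evaluation_periods:]
    breaching = sum(1 for v in window if v is not None and v >= threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"

# One-minute 5xx sums, oldest first (made-up values).
spike = [0, 0, 40, 0, 0]        # a single bad minute
incident = [0, 7, 9, None, 6]   # three breaching minutes out of five

print(alarm_state(spike, threshold=5))     # OK
print(alarm_state(incident, threshold=5))  # ALARM
```

A single bad minute, however large, is only one breaching data point, so it never reaches the required three; a sustained problem does.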
Composite Alarm #
Bundles multiple alarms. “ALB 5xx alarm AND task running is normal” means it’s a real backend problem.
aws cloudwatch put-composite-alarm \
  --alarm-name "blog-real-incident" \
  --alarm-rule "ALARM('blog-alb-5xx-burst') AND OK('blog-running-tasks-low')"
OK('blog-running-tasks-low') is true only while that alarm is in the OK state. The composite alarm therefore fires only when the 5xx alarm is in ALARM and the running task count is normal, which reduces noise.
5) SNS → Slack — the part that reaches a human #
CloudWatch Alarm
│
▼
SNS Topic (ops-alerts)
│
├── Email subscription (operations team)
├── SMS subscription (oncall)
├── Lambda subscription ← converts to a Slack webhook
└── PagerDuty / OpsGenie
SNS → Slack Lambda #
import json
import os
import urllib.request

WEBHOOK = os.environ["SLACK_WEBHOOK"]

def handler(event, context):
    for record in event["Records"]:
        # The SNS message body is the alarm state change as a JSON string.
        msg = json.loads(record["Sns"]["Message"])
        text = (
            f":rotating_light: *{msg['AlarmName']}*\n"
            f"Region: {msg['Region']}\n"
            f"State: {msg['NewStateValue']} (was {msg['OldStateValue']})\n"
            f"Reason: {msg['NewStateReason']}\n"
        )
        req = urllib.request.Request(
            WEBHOOK,
            data=json.dumps({"text": text}).encode(),
            headers={"Content-Type": "application/json"},
        )
        # Passing data makes this a POST; the timeout keeps a hung webhook
        # from stalling the Lambda.
        urllib.request.urlopen(req, timeout=5)
This is the pattern from Chapter 17 Lambda basics. Subscribe the Lambda to the SNS topic so that SNS invokes it, and you’re done.
The alarm message format #
A good alarm message contains the following.
- what broke (alarm name)
- how much it broke (threshold / actual value)
- where (region, service)
- when (timestamp)
- a link — straight to the console / dashboard / Logs Insights
The link matters most. An oncall who sees Slack at 3 AM should be able to get into context with one click.
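One way to add that link is to build it from the alarm's ARN, which the SNS alarm message carries in its `AlarmArn` field. The sketch below is illustrative only: the helper names `alarm_console_url` and `slack_text` are hypothetical, and the console URL pattern is an assumption based on how the CloudWatch console currently addresses alarms, so check it against your own console before relying on it.

```python
from urllib.parse import quote

def alarm_console_url(alarm_arn: str) -> str:
    """Build a CloudWatch console link from an alarm ARN.

    ARN shape: arn:aws:cloudwatch:<region>:<account>:alarm:<name>
    The URL pattern below is an assumption, not a documented contract.
    """
    parts = alarm_arn.split(":", 6)
    region, name = parts[3], parts[6]
    return (
        f"https://{region}.console.aws.amazon.com/cloudwatch/home"
        f"?region={region}#alarmsV2:alarm/{quote(name, safe='')}"
    )

def slack_text(msg: dict) -> str:
    """Format what / how much / when / link into one Slack message."""
    return (
        f":rotating_light: *{msg['AlarmName']}*\n"
        f"State: {msg['NewStateValue']} at {msg['StateChangeTime']}\n"
        f"Reason: {msg['NewStateReason']}\n"
        f"<{alarm_console_url(msg['AlarmArn'])}|Open alarm in console>"
    )
```

The `<url|label>` form is Slack's link syntax, so the oncall sees a single clickable label instead of a long URL.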
6) X-Ray — distributed tracing #
Metrics tell you up to “5xx is up.” “Why is 5xx up?” is answered by Logs. “Where did this request spend 5 seconds?” is answered by X-Ray.
Request: POST /posts 4.2s
│
├── ALB 0.01s
│
└── ECS api 4.15s
│
├── auth.verify_token 0.05s
│
├── db.posts.insert 3.80s ← the culprit
│ └── RDS PostgreSQL 3.78s
│ └── (slow query)
│
└── notify.publish (SNS) 0.30s
        └── SNS:Publish 0.28s
FastAPI/Django integration #
pip install aws-xray-sdk
from fastapi import FastAPI
from aws_xray_sdk.core import patch, xray_recorder
# Assumption: a FastAPI middleware at this path. The official SDK ships Flask,
# Django, and aiohttp middleware, so check that your installed version
# provides this module.
from aws_xray_sdk.ext.fastapi.middleware import XRayMiddleware

xray_recorder.configure(service="blog-api")
app = FastAPI()
app.add_middleware(XRayMiddleware, recorder=xray_recorder)

# SQLAlchemy tracing: patch SQLAlchemy Core so queries show up as subsegments.
patch(("sqlalchemy_core",))
Sidecar — X-Ray Daemon #
On ECS, place the X-Ray Daemon container as a sidecar inside the same task definition.
{
  "containerDefinitions": [
    { "name": "api", ... },
    {
      "name": "xray-daemon",
      "image": "public.ecr.aws/xray/aws-xray-daemon:latest",
      "portMappings": [{ "containerPort": 2000, "protocol": "udp" }],
      "essential": false
    }
  ]
}
The app sends trace data to 127.0.0.1:2000 over UDP, and the daemon batches it and forwards it to the X-Ray service. The task role needs the xray:PutTraceSegments IAM action (the AWSXRayDaemonWriteAccess managed policy includes it).
Where the value is greatest #
| Situation | X-Ray value |
|---|---|
| single container + single DB | moderate — Logs alone is enough |
| multiple microservice calls | very high — which step is slow |
| dependency on external APIs | very high — verify whether the external one is really slow |
| Lambda + DynamoDB | very high — separates Lambda cold start from external calls |
Sampling #
Tracing every request is costly. Use a sampling rule to trace only 5 ~ 10%.
{
  "version": 2,
  "rules": [{
    "description": "Default",
    "host": "*",
    "http_method": "*",
    "url_path": "*",
    "fixed_target": 1,
    "rate": 0.05
  }],
  "default": { "fixed_target": 1, "rate": 0.05 }
}
Exclude health checks such as /health by giving them a 0% rate so that traces don’t fill up with noise.
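How `fixed_target` and `rate` combine can be modeled in a few lines of Python. This is a toy model of the sampling decision, not the X-Ray SDK's implementation: the class `FixedTargetSampler`, the `decide` helper, and the per-path rule table are all hypothetical names for illustration. In each one-second window the first `fixed_target` requests are always traced, and every later request is traced with probability `rate`.

```python
import random
from typing import Callable, Optional

class FixedTargetSampler:
    """Toy model of one sampling rule with `fixed_target` and `rate`."""

    def __init__(self, fixed_target: int, rate: float,
                 rand: Callable[[], float] = random.random):
        self.fixed_target = fixed_target
        self.rate = rate
        self.rand = rand
        self._second: Optional[int] = None
        self._used = 0

    def should_trace(self, now: float) -> bool:
        second = int(now)
        if second != self._second:       # new one-second window: reset the quota
            self._second, self._used = second, 0
        if self._used < self.fixed_target:
            self._used += 1              # guaranteed traces for this window
            return True
        return self.rand() < self.rate   # everything else is sampled by rate

# Hypothetical rule table: /health is never traced, the rest use the default.
samplers = {
    "/health": FixedTargetSampler(fixed_target=0, rate=0.0),
    "*": FixedTargetSampler(fixed_target=1, rate=0.05),
}

def decide(path: str, now: float) -> bool:
    return samplers.get(path, samplers["*"]).should_trace(now)
```

With `fixed_target` at 0 and `rate` at 0.0 the health-check rule can never return true, while the default rule still guarantees at least one trace per second even when traffic is low.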
7) Dashboard — one screen #
Gather the operational signals onto one screen in a CloudWatch Dashboard.
[1] Requests/s (ALB) [2] 5xx rate (ALB) [3] p99 latency
[4] ECS CPU (Service) [5] ECS Memory [6] Running tasks
[7] RDS CPU [8] RDS Connections [9] RDS FreeStorage
resource "aws_cloudwatch_dashboard" "blog" {
  dashboard_name = "blog-overview"
  dashboard_body = jsonencode({
    widgets = [
      {
        type = "metric",
        x = 0, y = 0, width = 8, height = 6,
        properties = {
          metrics = [["AWS/ApplicationELB", "RequestCount", "LoadBalancer", "app/blog-alb/abc123"]],
          period = 60, stat = "Sum", region = "ap-northeast-2",
          title = "Requests/min"
        }
      }
      # ... 8 more
    ]
  })
}
Regular review #
Once a week, have the oncall scan the dashboard for gradually worsening indicators. Alarms only catch immediate incidents; for gradual deterioration, the human eye is faster.
Pitfalls of operational monitoring #
1) Alert Fatigue — too many alarms #
If there are 30 alarms a day, soon everyone ignores them. The recommended tiers are as follows.
| Alarm tier | Frequency | Channel |
|---|---|---|
| Critical | 1 ~ 2 times a month | PagerDuty / SMS |
| Warning | 1 ~ 2 times a week | Slack #ops |
| Info | often | Slack #ops-info (a quiet channel) |
Keep the alarms that truly wake a person to fewer than 5.
2) Logs grow infinitely #
Omit the retention setting and a bill shock comes six months later. Apply retention to every log group (all at once with Terraform).
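For log groups that already exist outside Terraform, a one-off script is another option. The sketch below is an assumption-laden example rather than the chapter's method: the function name `apply_default_retention` is hypothetical, and `logs_client` is expected to behave like `boto3.client("logs")`, using its `describe_log_groups` paginator and `put_retention_policy` call.

```python
def apply_default_retention(logs_client, days: int = 30) -> list:
    """Set retention on every log group that currently never expires.

    `logs_client` is expected to behave like boto3.client("logs").
    Returns the names of the log groups that were changed.
    """
    changed = []
    paginator = logs_client.get_paginator("describe_log_groups")
    for page in paginator.paginate():
        for group in page["logGroups"]:
            # Groups without a retention policy have no "retentionInDays" key.
            if "retentionInDays" not in group:
                logs_client.put_retention_policy(
                    logGroupName=group["logGroupName"],
                    retentionInDays=days,
                )
                changed.append(group["logGroupName"])
    return changed
```

Calling `apply_default_retention(boto3.client("logs"))` would leave groups with an explicit policy untouched and give every other group a 30-day policy.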
3) Retention too short #
After an incident you say “let’s look at the logs from then,” but with 7-day retention they are already gone. Starting an export after the incident is too late. Keep key log groups at 30 days or more.
4) X-Ray 100% sampling #
Cost runs away. Keep it at 5 ~ 10% sampling + 100% for errors / slow requests only (possible with X-Ray’s sampling rule).
5) Alarms without SLOs #
Ask where an alarm’s threshold came from, and the answer is often “someone said 80%.” Without a stated SLO (e.g., p99 < 500ms for 99% of the time), the threshold is arbitrary. Derive alarm thresholds from the SLO definition.
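As a rough illustration of deriving numbers from an SLO, take the example above: p99 < 500ms for 99% of the time. The latency bound gives the alarm threshold (0.5s on p99), and the compliance target gives a budget of windows that may breach it. The helper `allowed_bad_windows` is a hypothetical name, and the 5-minute window and 30-day period are assumptions chosen to match the alarm periods used in this chapter.

```python
from fractions import Fraction

def allowed_bad_windows(compliance: str, window_minutes: int = 5, days: int = 30) -> int:
    """How many evaluation windows may violate the SLO within the period.

    `compliance` is the SLO target as a decimal string, e.g. "0.99".
    Fraction avoids floating-point rounding in the budget arithmetic.
    """
    total_windows = days * 24 * 60 // window_minutes
    return int((1 - Fraction(compliance)) * total_windows)

print(allowed_bad_windows("0.99"))   # 86 of 8640 five-minute windows
```

If the p99 alarm fires far more often than that budget allows, the SLO is being missed; if it almost never fires, the threshold or the SLO may be looser than it needs to be.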
6) Dashboard exists but isn’t looked at #
A dashboard you make and don’t look at is the same as not having one. Put a 30-minute dashboard review into the weekly oncall meeting.
7) Alarms don’t reach a human #
With email alone, alerts get buried deep in the inbox. Use a channel that actively summons someone, such as SMS, PagerDuty, or a Slack mention.
Exercises #
- Write out, without looking at §“The big picture,” what question each of monitoring’s 4 components (Metrics / Logs / Traces / Events) answers (“how much / what / where / when”). Also mark which three of those areas this chapter covers.
- Explain, on the basis of §“Alarms,” how the three values period, evaluation-periods, and datapoints-to-alarm of the ALB 5xx alarm make the 5/3 pattern, and write in one sentence why this pattern filters out transient spikes.
- From the §“Where the value is greatest” table, pick the two situations where X-Ray’s value is highest, and explain, in connection with Chapter 27 cost optimization, why 100% sampling is dangerous.
In short: observability divides into Metrics (how much), Logs (what), Traces (where), and Events (when). Logs auto-collected via awslogs are emitted as structured JSON and queried with Logs Insights, and ECS/RDS/ALB metrics are seen with Container Insights. Alarms filter noise with the period × evaluation × datapoints pattern and reach a human via SNS → Lambda → Slack, while X-Ray pinpoints the slow step with distributed tracing but controls cost through sampling.
Next chapter #
Now we’ve reached the structure where the system runs well and an alarm sounds when an incident happens. Finally — how much is it costing, and how do you cut that cost? In the next Chapter 27 cost optimization and dashboards we cover Cost Explorer analysis, Savings Plans / Spot Fargate / Graviton, Right Sizing, tag enforcement, and a cost dashboard, and wrap up Part 4.