AWS in Practice #5: Monitoring — CloudWatch Alarms and X-Ray
In #1 through #4, infrastructure became code and deployment became automated. But we still have no single place to see whether this system is running well — whether 5xx errors are climbing, RDS CPU is at 80%, or which request took 5 seconds.
This post builds that single view.
- CloudWatch Logs + Logs Insights operational queries
- CloudWatch Metrics — ECS / RDS / ALB core metrics and alarm thresholds
- The flow of alarm → SNS → Slack
- X-Ray — distributed tracing for “where is it slow” in one line
- Dashboard — system state in one screen
The big picture — the 4 pillars of monitoring #
┌────────────────┬────────────────┬────────────────┬────────────────┐
│ Metrics        │ Logs           │ Traces         │ Events         │
├────────────────┼────────────────┼────────────────┼────────────────┤
│ "How much"     │ "What"         │ "Where"        │ "When"         │
│ requests, 5xx  │ stacktrace     │ DB took 5s     │ deploys, scale │
│ CPU, memory    │ access log     │ external 1s    │ failover       │
├────────────────┼────────────────┼────────────────┼────────────────┤
│ CloudWatch     │ CloudWatch     │ X-Ray          │ EventBridge    │
│ Metrics        │ Logs           │                │                │
└────────────────┴────────────────┴────────────────┴────────────────┘
This post covers three of the pillars: Metrics, Logs, and Traces. Events were already covered in Advanced #5.
1) CloudWatch Logs — already flowing #
#1’s Task Definition has awslogs baked in, so all container stdout/stderr automatically goes to CloudWatch Logs.
Log Group: /ecs/blog-api
│
├── Log Stream: api/<task-id-1> ← one stream per task
├── Log Stream: api/<task-id-2>
└── Log Stream: api/<task-id-3>
Retention setup — cost discipline #
The default is infinite retention. Even modest traffic can accumulate GBs of logs per month, causing costs to balloon.
aws logs put-retention-policy \
  --log-group-name /ecs/blog-api \
  --retention-in-days 30

Recommendations:
- Production access logs: 30–90 days
- Debug / verbose: 7 days
- Audit logs: 365 days (or export to S3 then delete)
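As a rough sketch, the recommendations above could be applied to every log group in one pass. This assumes a boto3 CloudWatch Logs client is passed in; the glob patterns (`*audit*`, `*debug*`) and the helper names are hypothetical and would need to match your own naming scheme.

```python
import fnmatch

# Retention tiers from the recommendations above. The glob patterns are
# hypothetical naming conventions, not anything AWS defines.
RETENTION_RULES = [
    ("*audit*", 365),
    ("*debug*", 7),
    ("*", 30),  # fallback: production access logs
]


def retention_for(log_group_name: str) -> int:
    """Return the retention in days for the first matching pattern."""
    for pattern, days in RETENTION_RULES:
        if fnmatch.fnmatchcase(log_group_name, pattern):
            return days
    raise ValueError(f"no rule matches {log_group_name}")


def apply_retention(logs_client) -> None:
    """Set retention on every log group that currently never expires."""
    paginator = logs_client.get_paginator("describe_log_groups")
    for page in paginator.paginate():
        for group in page["logGroups"]:
            # Groups without a policy have no retentionInDays key.
            if "retentionInDays" not in group:
                logs_client.put_retention_policy(
                    logGroupName=group["logGroupName"],
                    retentionInDays=retention_for(group["logGroupName"]),
                )
```

With boto3 installed and credentials configured, the call would be `apply_retention(boto3.client("logs"))`. Groups that already have a policy are left alone, so rerunning it is safe.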
Structured logs are key #
Plain print() output is hard to search. JSON format lets Logs Insights query on individual keys.
import json
import logging
import sys
import time

class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "ts": self.formatTime(record),
            # per-call fields passed as extra={"extra": {...}}
            **getattr(record, "extra", {}),
        })

# Send every log record to stdout as one JSON object per line
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

# `app` is the FastAPI application instance
@app.middleware("http")
async def access_log(request, call_next):
    start = time.time()
    response = await call_next(request)
    logging.info("access", extra={"extra": {
        "method": request.method,
        "path": str(request.url.path),
        "status": response.status_code,
        "duration_ms": int((time.time() - start) * 1000),
        "request_id": request.state.request_id,
    }})
    return response

Logging in this format lets Logs Insights answer queries like those below precisely.
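The middleware reads `request.state.request_id`, so something earlier in the request has to set it. One possible approach, sketched under assumptions: reuse the `X-Amzn-Trace-Id` header that an ALB adds to forwarded requests, fall back to a client-supplied `X-Request-ID`, and otherwise mint a UUID. The helper name and the order of precedence are illustrative, not part of the original setup.

```python
import uuid
from typing import Mapping


def resolve_request_id(headers: Mapping[str, str]) -> str:
    """Pick a request ID from incoming headers, or mint a new one."""
    # Header names are case-insensitive, so normalise before lookup.
    lowered = {k.lower(): v for k, v in headers.items()}
    # An ALB adds X-Amzn-Trace-Id, with fields separated by ";" and the
    # trace root given as "Root=1-<time>-<id>".
    trace = lowered.get("x-amzn-trace-id", "")
    for part in trace.split(";"):
        if part.startswith("Root="):
            return part[len("Root="):]
    return lowered.get("x-request-id") or str(uuid.uuid4())
```

One option is to call it at the top of `access_log`, before `call_next`: `request.state.request_id = resolve_request_id(request.headers)`.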
2) Logs Insights — 7 operational queries #
Frequently used queries — bookmark them.
A) Filter just 5xx #
fields @timestamp, status, path, request_id, message
| filter status >= 500
| sort @timestamp desc
| limit 100
B) Response time distribution (p50/p90/p99) #
fields @timestamp, duration_ms
| filter ispresent(duration_ms)
| stats
    count(*) as requests,
    pct(duration_ms, 50) as p50,
    pct(duration_ms, 90) as p90,
    pct(duration_ms, 99) as p99
    by bin(5m)
C) Slowest paths #
fields path, duration_ms
| filter duration_ms > 1000
| stats count(*) as hits, avg(duration_ms) as avg_ms, max(duration_ms) as max_ms by path
| sort avg_ms desc
| limit 20
D) Trace one request by request_id #
fields @timestamp, level, message, path, status, duration_ms
| filter request_id = "abc-123-xyz"
| sort @timestamp asc
E) Lines with stacktrace #
fields @timestamp, message
| filter @message like /Traceback|exception/
| sort @timestamp desc
F) Login attempts (auth failures) #
fields @timestamp, source_ip, username
| filter event = "auth_fail"
| stats count(*) as failures by source_ip
| sort failures desc
G) Cost — which path gets called the most #
fields path
| stats count(*) as hits by path
| sort hits desc
| limit 30
Saved Queries #
Save frequently used queries in the console so the whole team can share them. They can also be managed as code with CloudFormation or Terraform (aws_cloudwatch_query_definition).
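The same queries can also be run from a script, for example in a scheduled report. Below is a minimal sketch that assumes `logs_client` is a boto3 CloudWatch Logs client; the function names are made up for illustration. It starts a query, polls until it finishes, and flattens each result row into a plain dict.

```python
import time


def rows_to_dicts(results):
    """Flatten Logs Insights rows ([{field, value}, ...]) into dicts."""
    return [
        # @ptr is an internal pointer to the log event, not a query field.
        {cell["field"]: cell["value"] for cell in row if cell["field"] != "@ptr"}
        for row in results
    ]


def run_query(logs_client, log_group, query, start, end, poll_seconds=1.0):
    """Start a Logs Insights query and block until it finishes.

    start and end are epoch seconds.
    """
    query_id = logs_client.start_query(
        logGroupName=log_group,
        startTime=start,
        endTime=end,
        queryString=query,
    )["queryId"]
    while True:
        response = logs_client.get_query_results(queryId=query_id)
        if response["status"] not in ("Scheduled", "Running"):
            break
        time.sleep(poll_seconds)
    if response["status"] != "Complete":
        raise RuntimeError(f"query ended with status {response['status']}")
    return rows_to_dicts(response["results"])
```

A call might look like `run_query(boto3.client("logs"), "/ecs/blog-api", "fields path | limit 5", start, end)`. Passing the client in keeps the function easy to test with a stub.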
3) CloudWatch Metrics — the core #
ECS Container Insights #
Default ECS metrics are sparse. Enable Container Insights to get task- and service-level CPU, memory, network, disk, and running task counts all at once.
aws ecs update-cluster-settings \
  --cluster blog-cluster \
  --settings name=containerInsights,value=enabled

There’s an extra cost (~$1–3/month for small clusters), but it’s essential for production.
Monitoring table — what to watch #
| Metric | Resource | Meaning | Alarm threshold (example) |
|---|---|---|---|
| HTTPCode_Target_5XX_Count | ALB | Backend 5xx | 5-min sum ≥ 5 |
| HTTPCode_ELB_5XX_Count | ALB | ALB self 5xx (mostly 0 healthy hosts) | 5-min sum ≥ 1 |
| TargetResponseTime (p99) | ALB | Response time p99 | 5-min p99 ≥ 1.0s |
| UnHealthyHostCount | Target Group | Dead task count | 5-min average ≥ 1 |
| CPUUtilization (Service) | ECS | Service average CPU | 5-min average ≥ 80% |
| MemoryUtilization (Service) | ECS | Memory | 5-min average ≥ 85% |
| RunningTaskCount | ECS | Running task count | Differs from desired count |
| CPUUtilization | RDS | DB CPU | 5-min average ≥ 80% |
| DatabaseConnections | RDS | Connection count | ≥ 80% of max_connections |
| FreeStorageSpace | RDS | Free disk | < 5GB |
| ReadLatency / WriteLatency | RDS | Disk latency | > 50ms |
Custom Metrics #
Custom metrics are emitted by the application itself. With Embedded Metric Format (EMF), you write them as structured log lines, so no separate API call is needed.
import json
import time

def emit_metric(metric_name, value, unit="Count", **dims):
    payload = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": "BlogApp",
                "Dimensions": [list(dims.keys())],
                "Metrics": [{"Name": metric_name, "Unit": unit}],
            }],
        },
        metric_name: value,
        **dims,
    }
    # The EMF document must be the whole log line. Write it straight to stdout:
    # going through the JSON log formatter would nest it under "message".
    print(json.dumps(payload), flush=True)

emit_metric("PostCreated", 1, env="prod")
emit_metric("CommentCreated", 1, env="prod")
# Caution: every distinct dimension value becomes its own metric
emit_metric("LoginFailed", 1, source_ip="...")

CloudWatch parses these log lines and automatically creates metrics such as PostCreated in the BlogApp namespace. No PutMetricData API call is required, which saves both cost and latency.
4) Alarms — calling people when thresholds break #
aws cloudwatch put-metric-alarm \
  --alarm-name "blog-alb-5xx-burst" \
  --metric-name HTTPCode_Target_5XX_Count \
  --namespace AWS/ApplicationELB \
  --statistic Sum \
  --period 60 \
  --evaluation-periods 5 \
  --datapoints-to-alarm 3 \
  --threshold 5 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --treat-missing-data notBreaching \
  --dimensions Name=LoadBalancer,Value=app/blog-alb/abc123 \
  --alarm-actions arn:aws:sns:ap-northeast-2:123456789012:ops-alerts

Key options:
| Option | Meaning |
|---|---|
| period | Data point unit (60 = 1 minute) |
| evaluation-periods | How many points to evaluate |
| datapoints-to-alarm | How many of those crossing fires the alarm |
| treat-missing-data | When data is missing — notBreaching recommended |
| comparison-operator | >= / > / < / <= |
The 5/3 pattern (“3 out of the last 5 data points cross the threshold”) filters out momentary spikes while still catching real incidents.
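To see why this pattern ignores a lone spike, here is a simplified model of the evaluation. It is only a sketch: CloudWatch's real evaluation handles missing data with more nuance, and `alarm_fires` is a name invented here. `None` stands for a missing data point and is counted as not breaching, mirroring `--treat-missing-data notBreaching`.

```python
from typing import Optional, Sequence


def alarm_fires(
    datapoints: Sequence[Optional[float]],
    threshold: float,
    evaluation_periods: int = 5,
    datapoints_to_alarm: int = 3,
) -> bool:
    """Return True if enough of the most recent datapoints breach.

    Uses >= to match GreaterThanOrEqualToThreshold.
    """
    # Only the last `evaluation_periods` points are considered.
    window = datapoints[-evaluation_periods:]
    breaching = sum(1 for v in window if v is not None and v >= threshold)
    return breaching >= datapoints_to_alarm


# One-minute 5xx sums, oldest first
print(alarm_fires([0, 0, 40, 0, 0], threshold=5))  # single spike
print(alarm_fires([0, 6, 0, 7, 9], threshold=5))   # sustained errors
```

The first call stays quiet because only one of five points breaches. The second fires because three of the five do, even though they are not consecutive.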
Composite Alarm #
Combine multiple alarms. “ALB 5xx alarm AND task running fine” → real backend problem.
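aws cloudwatch put-composite-alarm \
  --alarm-name "blog-real-incident" \
  --alarm-rule "ALARM('blog-alb-5xx-burst') AND OK('blog-running-tasks-low')"

OK() matches an alarm that is in the OK state. The composite alarm fires only when the 5xx alarm is in ALARM while blog-running-tasks-low is OK, meaning tasks are running but still returning errors. If the task-count alarm is also firing, the composite stays quiet because that alarm already explains the 5xx errors, which reduces noise.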
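To make the rule's behaviour concrete, this small sketch evaluates the same expression against hand-written alarm states. The function is purely illustrative and does not call CloudWatch.

```python
def composite_fires(states: dict) -> bool:
    """Evaluate ALARM('blog-alb-5xx-burst') AND OK('blog-running-tasks-low')."""
    return (
        states["blog-alb-5xx-burst"] == "ALARM"
        and states["blog-running-tasks-low"] == "OK"
    )


# 5xx errors while the task count is healthy: a real backend problem
print(composite_fires({"blog-alb-5xx-burst": "ALARM", "blog-running-tasks-low": "OK"}))
# 5xx errors because tasks are down: the task-count alarm already covers it
print(composite_fires({"blog-alb-5xx-burst": "ALARM", "blog-running-tasks-low": "ALARM"}))
```

Note that INSUFFICIENT_DATA on the task-count alarm also keeps the composite quiet, since `OK()` only matches the OK state.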
5) SNS → Slack — reaching humans #
CloudWatch Alarm
│
▼
SNS Topic (ops-alerts)
│
├── Email subscription (ops team)
├── SMS subscription (oncall)
├── Lambda subscription ← convert to Slack webhook
└── PagerDuty / OpsGenie
SNS → Slack Lambda #
import json
import os
import urllib.request

WEBHOOK = os.environ["SLACK_WEBHOOK"]

def handler(event, context):
    for record in event["Records"]:
        # The SNS message body is the alarm state change, as a JSON string
        msg = json.loads(record["Sns"]["Message"])
        text = (
            f":rotating_light: *{msg['AlarmName']}*\n"
            f"Region: {msg['Region']}\n"
            f"State: {msg['NewStateValue']} (was {msg['OldStateValue']})\n"
            f"Reason: {msg['NewStateReason']}\n"
        )
        req = urllib.request.Request(
            WEBHOOK,
            data=json.dumps({"text": text}).encode(),
            headers={"Content-Type": "application/json"},
        )
        # Passing data makes this a POST; time out rather than hang the invocation
        urllib.request.urlopen(req, timeout=5)

This follows the Lambda pattern from Advanced #3. Subscribe the function to the SNS topic, and you’re done.
Alarm message format #
A good alarm message has:
- What broke (alarm name)
- How much broke (threshold / actual value)
- Where (region, service)
- When (timestamp)
- Links — direct to console / dashboard / Logs Insights
Links matter most. At 3am, the on-call engineer reading Slack clicks once and lands directly in context.
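As one way to cover that checklist, the sketch below formats an alarm notification with a console deep link. Several things here are assumptions: that the SNS message carries `AlarmArn`, `Trigger`, and `StateChangeTime` fields, that the console URL pattern shown is still current, and the function names themselves. The region code is taken from the ARN because the message's `Region` field is a display name.

```python
from urllib.parse import quote


def alarm_console_url(alarm_arn: str) -> str:
    """Build a CloudWatch console deep link from an alarm ARN.

    ARN layout: arn:aws:cloudwatch:<region>:<account>:alarm:<name>
    """
    parts = alarm_arn.split(":", 6)  # the name itself may contain colons
    region, name = parts[3], parts[6]
    return (
        f"https://{region}.console.aws.amazon.com/cloudwatch/home"
        f"?region={region}#alarmsV2:alarm/{quote(name, safe='')}"
    )


def format_alarm(msg: dict) -> str:
    """Render what / how much / where / when / link as Slack text."""
    trigger = msg.get("Trigger", {})
    lines = [
        f":rotating_light: *{msg['AlarmName']}* is {msg['NewStateValue']}",
        f"Metric: {trigger.get('MetricName')} (threshold {trigger.get('Threshold')})",
        f"Reason: {msg['NewStateReason']}",
        f"Where: {msg['Region']}",
        f"When: {msg['StateChangeTime']}",
    ]
    if "AlarmArn" in msg:
        # Slack renders <url|label> as a clickable link.
        lines.append(f"<{alarm_console_url(msg['AlarmArn'])}|Open alarm in console>")
    return "\n".join(lines)
```

The link line is skipped when the ARN is missing, so the message still posts rather than failing in the middle of an incident.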
6) X-Ray — distributed tracing #
Metrics tell you “5xx is up.” Logs tell you why. X-Ray answers “where did this request spend 5 seconds?”
Request: POST /posts                    4.2s
│
├── ALB                                 0.01s
│
└── ECS api                             4.15s
    │
    ├── auth.verify_token               0.05s
    │
    ├── db.posts.insert                 3.80s  ← suspect
    │   └── RDS PostgreSQL              3.78s
    │       └── (slow query)
    │
    └── notify.publish (SNS)            0.30s
        └── SNS:Publish                 0.28s
FastAPI/Django integration #
pip install aws-xray-sdk

from fastapi import FastAPI
from aws_xray_sdk.core import patch, xray_recorder
# Verify this import against your installed aws-xray-sdk version: the SDK's
# bundled middleware targets Flask, Django, aiohttp, and Bottle, so a FastAPI
# app may need a third-party or hand-written middleware instead.
from aws_xray_sdk.ext.fastapi.middleware import XRayMiddleware

xray_recorder.configure(service="blog-api")
app = FastAPI()
app.add_middleware(XRayMiddleware, recorder=xray_recorder)

# SQLAlchemy tracing: patch SQLAlchemy Core so queries show up as subsegments
patch(["sqlalchemy_core"])

Sidecar — X-Ray Daemon #
In ECS, the X-Ray Daemon container runs as a sidecar in the same task definition:
{
  "containerDefinitions": [
    { "name": "api", ... },
    {
      "name": "xray-daemon",
      "image": "public.ecr.aws/xray/aws-xray-daemon:latest",
      "portMappings": [{ "containerPort": 2000, "protocol": "udp" }],
      "essential": false
    }
  ]
}

The app sends trace segments over UDP to 127.0.0.1:2000, and the daemon batches them to the X-Ray service. The task role needs xray:PutTraceSegments and xray:PutTelemetryRecords (the AWSXRayDaemonWriteAccess managed policy covers both).
Where it shines most #
| Case | X-Ray value |
|---|---|
| Single container + single DB | Moderate — Logs alone may suffice |
| Multiple microservice calls | Very high — see which step is slow at a glance |
| External API dependency | Very high — verify whether the external service is actually slow |
| Lambda + DynamoDB | Very high — separates Lambda cold starts from external calls |
Sampling #
Tracing every request is expensive. Use sampling rules to trace only 5–10% of requests:
{
  "version": 2,
  "rules": [{
    "description": "Default",
    "host": "*",
    "http_method": "*",
    "url_path": "*",
    "fixed_target": 1,
    "rate": 0.05
  }],
  "default": { "fixed_target": 1, "rate": 0.05 }
}

Set health-check paths like /health to 0% so traces don’t drown in noise.
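The two numbers in a rule work together: `fixed_target` requests are traced each second no matter what, and `rate` applies to the remainder. The sketch below is a simplified model of that decision, not the SDK's actual sampler; the class name and the injected random generator are choices made here for illustration.

```python
import random


class SamplingRule:
    """Trace `fixed_target` requests per second, then `rate` of the rest."""

    def __init__(self, fixed_target: int, rate: float):
        self.fixed_target = fixed_target
        self.rate = rate
        self._second = None  # wall-clock second the reservoir belongs to
        self._used = 0       # reservoir slots taken in that second

    def should_trace(self, now: float, rng: random.Random) -> bool:
        second = int(now)
        if second != self._second:  # new second: refill the reservoir
            self._second, self._used = second, 0
        if self._used < self.fixed_target:
            self._used += 1
            return True
        return rng.random() < self.rate


default_rule = SamplingRule(fixed_target=1, rate=0.05)
health_rule = SamplingRule(fixed_target=0, rate=0.0)  # e.g. for /health
```

With the default rule, a quiet service still gets one trace per second, while a busy one is traced at roughly 5%. A rule with both values at zero never traces, which is what you want for health checks.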
7) Dashboard — one screen #
Put operational signals in one CloudWatch Dashboard:
[1] Requests/min (ALB)   [2] 5xx rate (ALB)      [3] p99 latency
[4] ECS CPU (Service)    [5] ECS Memory          [6] Running tasks
[7] RDS CPU              [8] RDS Connections     [9] RDS FreeStorage

resource "aws_cloudwatch_dashboard" "blog" {
  dashboard_name = "blog-overview"
  dashboard_body = jsonencode({
    widgets = [
      {
        type = "metric",
        x = 0, y = 0, width = 8, height = 6,
        properties = {
          metrics = [["AWS/ApplicationELB", "RequestCount", "LoadBalancer", "app/blog-alb/abc123"]],
          period = 60, stat = "Sum", region = "ap-northeast-2",
          title = "Requests/min"
        }
      }
      # ... 8 more
    ]
  })
}
Periodic review #
Once a week, the on-call engineer scans the dashboard to spot gradually worsening areas. Alarms catch only acute incidents; slow degradation is caught faster by human eyes.
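Writing nine widgets by hand is repetitive, so the 3×3 grid from the dashboard layout can also be generated. This is a sketch under assumptions: the ECS service name `blog-api` and the RDS identifier `blog-db` are placeholders, and the statistic chosen for each panel is a judgment call rather than something the layout specifies. CloudWatch dashboards use a 24-column grid, so three widgets of width 8 fill one row.

```python
import json

# Dimension lists per namespace. "blog-api" and "blog-db" are placeholders.
DIMS = {
    "AWS/ApplicationELB": ["LoadBalancer", "app/blog-alb/abc123"],
    "AWS/ECS": ["ClusterName", "blog-cluster", "ServiceName", "blog-api"],
    "ECS/ContainerInsights": ["ClusterName", "blog-cluster", "ServiceName", "blog-api"],
    "AWS/RDS": ["DBInstanceIdentifier", "blog-db"],
}

# (title, namespace, metric, stat) in reading order, left to right
PANELS = [
    ("Requests/min", "AWS/ApplicationELB", "RequestCount", "Sum"),
    ("5xx count", "AWS/ApplicationELB", "HTTPCode_Target_5XX_Count", "Sum"),
    ("p99 latency", "AWS/ApplicationELB", "TargetResponseTime", "p99"),
    ("ECS CPU", "AWS/ECS", "CPUUtilization", "Average"),
    ("ECS Memory", "AWS/ECS", "MemoryUtilization", "Average"),
    ("Running tasks", "ECS/ContainerInsights", "RunningTaskCount", "Average"),
    ("RDS CPU", "AWS/RDS", "CPUUtilization", "Average"),
    ("RDS Connections", "AWS/RDS", "DatabaseConnections", "Average"),
    ("RDS FreeStorage", "AWS/RDS", "FreeStorageSpace", "Minimum"),
]


def build_dashboard(region="ap-northeast-2", columns=3, width=8, height=6) -> str:
    """Return a dashboard body (JSON string) with panels laid out in a grid."""
    widgets = []
    for i, (title, namespace, metric, stat) in enumerate(PANELS):
        widgets.append({
            "type": "metric",
            "x": (i % columns) * width,    # column position
            "y": (i // columns) * height,  # row position
            "width": width,
            "height": height,
            "properties": {
                "metrics": [[namespace, metric, *DIMS[namespace]]],
                "period": 60, "stat": stat, "region": region, "title": title,
            },
        })
    return json.dumps({"widgets": widgets})
```

The resulting string could be passed to boto3's `put_dashboard(DashboardName=..., DashboardBody=...)`, or written to a file that Terraform reads for `dashboard_body`.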
Pitfalls — production monitoring #
1) Alert Fatigue — too many alarms #
At 30 alarms a day, everyone soon starts ignoring them. A recommended tiering:
| Alarm tier | Frequency | Channel |
|---|---|---|
| Critical | 1–2/month | PagerDuty / SMS |
| Warning | 1–2/week | Slack #ops |
| Info | Frequent | Slack #ops-info (quiet channel) |
Keep fewer than 5 alarms that actually wake people up.
2) Logs growing infinitely #
Missing retention setup leads to bill shock 6 months later. Set retention on every log group — easily done all at once with Terraform.
3) Retention too short #
Some time after an incident, someone says “let’s check the logs,” but with 7-day retention they have already expired. Trying to export after the fact is too late. Core log groups deserve 30+ days of retention.
4) X-Ray 100% sampling #
Cost explodes. 5–10% sampling + 100% on errors / slow requests (X-Ray sampling rules).
5) Alarms without SLOs #
Where did the alarm threshold come from — “I picked 80%”? Without an SLO (e.g., p99 latency < 500ms, 99% of the time), thresholds are arbitrary. Define your SLO first, then derive the threshold from it.
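As one illustration of deriving a threshold from an SLO, take a hypothetical availability target such as "99% of requests succeed" and an expected request volume per alarm window. Both numbers are assumptions for the sake of the arithmetic, and the function name is invented here.

```python
import math


def error_threshold(slo_success_ratio: float, requests_per_window: int) -> int:
    """Smallest error count in one window that violates the SLO."""
    allowed = (1.0 - slo_success_ratio) * requests_per_window
    # Round first so float noise (9.999999999999998) does not shift the floor.
    return math.floor(round(allowed, 9)) + 1


# 99% success with about 1,000 requests per 5-minute window
print(error_threshold(0.99, 1000))
```

Here 10 errors per window are within budget, so the alarm threshold for a `GreaterThanOrEqualToThreshold` comparison is the next integer up. If traffic changes, the threshold changes with it, which is the point of starting from the SLO.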
6) Dashboard exists, never viewed #
A dashboard that is built and never looked at is the same as no dashboard. Add a 30-minute dashboard review to the weekly on-call meeting.
7) Alarms don’t reach people #
Email only → buried in the inbox. Use SMS / PagerDuty / Slack mentions — channels that actually reach you.
Wrapping up #
What we covered in this post:
- 4 pillars — Metrics / Logs / Traces / Events
- CloudWatch Logs — automatic awslogs, retention setup, structured JSON, 7 operational queries
- CloudWatch Metrics — Container Insights enabled, ECS / RDS / ALB core metrics and thresholds
- EMF — emit metrics through logs without PutMetricData
- Alarms — period × evaluation × datapoints pattern, treat-missing-data
- Composite Alarm — reduce noise
- SNS → Lambda → Slack — reach humans with alarms
- X-Ray — distributed tracing, sidecar daemon, sampling for cost control
- Dashboard — 9-widget single screen, IaC-ified
- Pitfalls — alert fatigue, retention, sampling, missing SLO, ignored dashboard, alarm channels
Next — Cost and track wrap-up #
Now the system runs well, and alarms fire on incidents. One last topic remains: how much it all costs, plus a retrospective on the track’s 27 posts.
In #6, “Cost optimization and dashboards — wrapping up the track,” we’ll cover Cost Explorer analysis, Savings Plans / Spot Fargate, Right Sizing, tag enforcement, cost dashboards, and how the 27 posts of the AWS track come together as one system.