AWS Basics #7: CloudWatch Intro — Logs and Metrics
#1 through #6 gave us the AWS setup foundation. Now for the other axis of operations — knowing what’s running where, and what it’s doing.
CloudWatch is AWS’s standard observability service. Almost every AWS service emits metrics into CloudWatch by default, and logs land in CloudWatch Logs. Production visibility starts there.
This post covers CloudWatch’s four components — Logs / Metrics / Alarms / Dashboards — in one go.
Big picture — the four components of CloudWatch #
| Component | What it is | Common use |
|---|---|---|
| Logs | Store / search text logs | Logs from EC2 / Lambda / ECS / API Gateway |
| Metrics | Time-series numbers (CPU%, request counts, etc.) | Every AWS service auto-emits |
| Alarms | Alert / act when a metric crosses a threshold | Production alerts, autoscaling |
| Dashboards | Pages of graphs / widgets | At-a-glance per team / service |
The four interlock: logs → metrics → alarms → dashboards.
CloudWatch Logs #
Log group and log stream #
Log Group — usually one application / service
└── Log Stream — usually one process / container
    └── Log Event — one line

| Item | Example |
|---|---|
| Log Group | /aws/lambda/my-function, /ecs/my-service, /var/log/myapp |
| Log Stream | Lambda execution-environment ID, ECS Task ID, EC2 instance ID |
| Log Event | One line of text + timestamp |
Lambda sends logs to CloudWatch Logs automatically, and ECS Fargate does once the container definition uses the awslogs driver. EC2 needs the CloudWatch Agent or a sidecar agent (fluent-bit, etc.).
Retention — the most important setting #
Default retention is forever. Leave it alone and logs pile up indefinitely → cost runaway. Right after signup / for every new log group, set retention.
| Item | Recommended retention |
|---|---|
| General application logs | 30–90 days |
| Debug / development logs | 7 days |
| Security / audit logs (CloudTrail) | 1–7 years (cheaper to ship to S3) |
| Lambda logs | 14–30 days |
# One log group
aws logs put-retention-policy \
  --log-group-name /aws/lambda/my-function \
  --retention-in-days 30

# Every existing log group in the current region
aws logs describe-log-groups --query 'logGroups[].logGroupName' --output text \
  | tr '\t' '\n' \
  | while read -r name; do
      aws logs put-retention-policy --log-group-name "$name" --retention-in-days 30
    done

On accounts where old logs have piled up, this one loop can cut CloudWatch costs by more than half.
Auto-applying retention to new log groups #
Two approaches.
Approach 1: EventBridge + Lambda — react to CreateLogGroup events and apply automatically (common in production).
Approach 2: Force via policy — deny creation if retention isn’t set explicitly (slightly heavy-handed).
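A minimal sketch of Approach 1, assuming an EventBridge rule (not shown) already forwards CloudTrail `CreateLogGroup` API-call events to this Lambda. The 30-day default, the `extract_log_group_name` helper, and the injectable `logs_client` parameter are illustrative choices, not something this series prescribes.

```python
"""Sketch: apply a default retention whenever a log group is created."""

DEFAULT_RETENTION_DAYS = 30  # assumption: mirrors the 30-day example above


def extract_log_group_name(event):
    """Return the new log group's name from a CloudTrail-via-EventBridge event, or None."""
    detail = event.get("detail", {})
    if detail.get("eventName") != "CreateLogGroup":
        return None
    return (detail.get("requestParameters") or {}).get("logGroupName")


def handler(event, context, logs_client=None):
    """Lambda entry point; logs_client is injectable so the logic can be tested offline."""
    name = extract_log_group_name(event)
    if name is None:
        return {"applied": False}
    if logs_client is None:
        import boto3  # imported lazily so the module loads without the AWS SDK
        logs_client = boto3.client("logs")
    logs_client.put_retention_policy(
        logGroupName=name, retentionInDays=DEFAULT_RETENTION_DAYS
    )
    return {"applied": True, "logGroupName": name}
```

The function's execution role would need permission to call `logs:PutRetentionPolicy`, and CloudTrail has to be recording management events for the rule to receive anything.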
Lambda log shipping #
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def handler(event, context):
    # Anything logged here ends up in the function's log group.
    logger.info("Received event: %s", event)
    return {"ok": True}

stdout / stderr flow automatically into CloudWatch Logs. The log group is named /aws/lambda/<function-name>.
EC2 / ECS — CloudWatch Agent #
EC2 isn’t automatic. Install the CloudWatch Agent.
sudo yum install -y amazon-cloudwatch-agent   # Amazon Linux
# or, on Ubuntu (amd64)
wget https://s3.amazonaws.com/amazoncloudwatch-agent/ubuntu/amd64/latest/amazon-cloudwatch-agent.deb
sudo dpkg -i amazon-cloudwatch-agent.deb

Configure at /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json:
{
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/myapp/*.log",
            "log_group_name": "/myapp/server",
            "log_stream_name": "{instance_id}",
            "retention_in_days": 30
          }
        ]
      }
    }
  },
  "metrics": {
    "metrics_collected": {
      "mem": { "measurement": ["mem_used_percent"] },
      "disk": { "measurement": ["used_percent"], "resources": ["*"] }
    }
  }
}

ECS Fargate ships logs automatically when the container definition uses the awslogs driver — covered in detail in Advanced #1.
Logs Insights — query-based search #
The search / analysis tool for CloudWatch Logs. It has its own pipe-based query syntax, loosely SQL-like.
# Latest 100 events
fields @timestamp, @message
| sort @timestamp desc
| limit 100

# Latest 50 events containing ERROR
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 50

# Lambda duration statistics in 5-minute buckets
fields @timestamp, @duration
| filter @type = "REPORT"
| stats avg(@duration), max(@duration), count(*) by bin(5m)

# 5xx count per path
fields @timestamp, status, path
| filter status >= 500
| stats count(*) as errors by path
| sort errors desc

Common commands:
| Command | What it is |
|---|---|
| fields | Fields to show |
| filter | Conditional filter |
| parse | Extract fields from a string |
| stats | Aggregate (count, avg, max, percentile) |
| sort | Sort |
| limit | Maximum results |
| bin(5m) | Time bucket |
Logs Insights cost note #
Queries bill by GB scanned (~$0.005/GB). An unbounded time range over a big log group is a cost incident. Always tighten the time range.
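One way to make a tight time range the default is to build the query arguments through a helper that always sets one. The sketch below is illustrative: `bounded_query_kwargs` and `estimated_scan_cost` are hypothetical helper names, the argument keys follow the CloudWatch Logs StartQuery API, and the price constant is the approximate figure quoted above.

```python
import time

PRICE_PER_GB_SCANNED = 0.005  # USD, the approximate figure quoted above


def bounded_query_kwargs(log_group, query, hours=1, now=None):
    """Build StartQuery arguments limited to the last `hours` hours."""
    end = int(now if now is not None else time.time())
    return {
        "logGroupName": log_group,
        "startTime": end - hours * 3600,  # epoch seconds
        "endTime": end,
        "queryString": query,
    }


def estimated_scan_cost(gb_scanned):
    """Rough query cost in USD for a given amount of scanned data."""
    return gb_scanned * PRICE_PER_GB_SCANNED


kwargs = bounded_query_kwargs(
    "/aws/lambda/my-function",
    "fields @timestamp, @message | filter @message like /ERROR/ | limit 50",
    hours=1,
)
# With credentials configured, these could be passed to boto3:
#   boto3.client("logs").start_query(**kwargs)
print(kwargs["endTime"] - kwargs["startTime"])  # window length in seconds
print(f"${estimated_scan_cost(500):.2f}")        # cost of scanning 500 GB
```

Scanning 500 GB at that rate is about $2.50 per query, which adds up quickly when a dashboard widget reruns the query on every refresh.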
CloudWatch Metrics #
A metric is a time-series number. Almost every AWS service emits them automatically.
Frequently watched metrics #
| Service | Frequently watched |
|---|---|
| EC2 | CPUUtilization, NetworkIn/Out, DiskReadOps |
| RDS | CPUUtilization, DatabaseConnections, FreeStorageSpace, ReadLatency |
| Lambda | Invocations, Errors, Duration, Throttles, ConcurrentExecutions |
| ECS | CPUUtilization, MemoryUtilization (per service / task) |
| ALB | RequestCount, TargetResponseTime, HTTPCode_Target_5XX_Count |
| API Gateway | Count, Latency, 4XXError, 5XXError |
| S3 | BucketSizeBytes, NumberOfObjects (once a day) |
| DynamoDB | ConsumedReadCapacityUnits / ConsumedWriteCapacityUnits, ThrottledRequests |
Metric dimensions #
The same metric splits by dimension.
Namespace: AWS/EC2
Metric: CPUUtilization
Dimensions:
  - InstanceId: i-1234567890
  - InstanceId: i-2345678901
  - InstanceId: i-3456789012

Different dimensions are different metrics. Metric count = cost, so dimension explosion = cost explosion.
Statistic #
| Statistic | What it is |
|---|---|
| Sum | Sum — Invocations, RequestCount |
| Average | Mean — CPU, Latency |
| Maximum | Max — spike detection |
| Minimum | Min |
| p95 / p99 | Percentiles — Latency |
| SampleCount | Number of data points |
In most cases, Average + p95 is meaningful. p99 / p99.9 directly affect SLAs / user experience.
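A small illustration of why Average alone can mislead and a percentile helps. The latency values are made up, and the nearest-rank `percentile` helper is a simplification for illustration, not necessarily how CloudWatch computes percentiles internally.

```python
def percentile(values, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(values)
    rank = max(1, -(-p * len(ordered) // 100))  # ceil(p * n / 100) in integer math
    return ordered[rank - 1]


# Hypothetical latencies in ms: 90 fast requests and 10 slow ones.
latencies = [100] * 90 + [2000] * 10

average = sum(latencies) / len(latencies)
print(average)                    # 290.0
print(percentile(latencies, 50))  # 100
print(percentile(latencies, 95))  # 2000
```

The average of 290 ms describes no real request: most users see 100 ms and one in ten sees 2 seconds, which only the percentile surfaces.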
Standard vs high-resolution #
| Kind | Resolution | Cost |
|---|---|---|
| Standard | 1 minute | Standard |
| High-res | 1 second | Expensive — only short spikes |
Standard 1-minute is enough for most cases.
Sending custom metrics #
The application sends them directly.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_data(
    Namespace="MyApp",
    MetricData=[{
        "MetricName": "OrderCreated",
        "Value": 1,
        "Unit": "Count",
        "Dimensions": [
            {"Name": "Environment", "Value": "prod"},
            {"Name": "Region", "Value": "ap-northeast-2"},
        ],
    }],
)

Cost: $0.30 per metric per month. Each combination of dimensions is a separate metric — never use high-cardinality dimensions like user ID.
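The arithmetic behind that warning, as a rough sketch. It assumes the flat $0.30 figure quoted above and treats every possible combination of dimension values as actually emitted, so it is an upper bound; `max_metric_count` and `monthly_cost` are hypothetical helper names.

```python
from math import prod

PRICE_PER_METRIC_MONTH = 0.30  # USD, the figure quoted above


def max_metric_count(cardinalities):
    """Upper bound on distinct metrics: the product of per-dimension value counts."""
    return prod(cardinalities.values())


def monthly_cost(cardinalities, metric_names=1):
    """Worst-case monthly cost in USD for `metric_names` metrics sharing these dimensions."""
    return metric_names * max_metric_count(cardinalities) * PRICE_PER_METRIC_MONTH


print(max_metric_count({"Environment": 3, "Region": 2}))  # 6
print(round(monthly_cost({"UserId": 10_000}), 2))          # 3000.0
```

Three environments across two regions stay at six metrics, while a single UserId dimension with ten thousand values lands at roughly $3,000 a month for one metric name.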
EMF (Embedded Metric Format) — the Lambda pattern #
In Lambda, put_metric_data itself adds cost / latency. Write a special JSON shape into the log and CloudWatch auto-converts it into a metric.
import json
import time

# One EMF record: the "_aws" block declares the metric, the top-level keys carry the values.
print(json.dumps({
    "_aws": {
        "Timestamp": int(time.time() * 1000),  # milliseconds since epoch
        "CloudWatchMetrics": [{
            "Namespace": "MyApp",
            "Dimensions": [["Environment"]],
            "Metrics": [{"Name": "OrderCreated", "Unit": "Count"}],
        }],
    },
    "Environment": "prod",
    "OrderCreated": 1,
}))

SDKs like aws-embedded-metrics-python make this nicer.
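If pulling in an SDK feels like too much, the same shape can be wrapped in a small function. This is a sketch only: `emf_record` is a hypothetical helper, it handles a single metric with a single dimension set, and the optional `now` parameter exists purely to make it testable.

```python
import json
import time


def emf_record(namespace, name, value, unit="Count", dimensions=None, now=None):
    """Return one EMF log line (a JSON string) for a single metric."""
    dimensions = dimensions or {}
    timestamp_ms = int((now if now is not None else time.time()) * 1000)
    record = {
        "_aws": {
            "Timestamp": timestamp_ms,
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                "Dimensions": [list(dimensions)],  # one dimension set, by key name
                "Metrics": [{"Name": name, "Unit": unit}],
            }],
        },
        name: value,    # the metric value lives at the top level
        **dimensions,   # so do the dimension values
    }
    return json.dumps(record)


# In Lambda, printing the line to stdout is enough for it to reach CloudWatch Logs.
print(emf_record("MyApp", "OrderCreated", 1, dimensions={"Environment": "prod"}))
```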
Metric Filter — make a metric from logs #
Turn information already in logs (ERROR occurrences, response times, etc.) into a metric.
CloudWatch → Log groups → pick a group → Metric filters → Create
- Filter pattern: ERROR
- Metric namespace: MyApp
- Metric name: ErrorCount
- Metric value: 1

Each ERROR now bumps the metric by +1. Use it in alarms / dashboards.
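The same console steps can be expressed as arguments to the CloudWatch Logs PutMetricFilter API. A minimal sketch: `metric_filter_kwargs` and the `-filter` naming suffix are hypothetical, and the actual call is left commented out so the snippet runs without credentials.

```python
def metric_filter_kwargs(log_group, pattern="ERROR",
                         namespace="MyApp", metric_name="ErrorCount"):
    """Arguments for PutMetricFilter, mirroring the console fields above."""
    return {
        "logGroupName": log_group,
        "filterName": f"{metric_name}-filter",  # hypothetical naming scheme
        "filterPattern": pattern,
        "metricTransformations": [{
            "metricNamespace": namespace,
            "metricName": metric_name,
            "metricValue": "1",  # the API takes this value as a string
        }],
    }


kwargs = metric_filter_kwargs("/aws/lambda/my-function")
# With credentials configured: boto3.client("logs").put_metric_filter(**kwargs)
print(kwargs["metricTransformations"][0]["metricName"])  # ErrorCount
```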
ERROR                        # contains ERROR
[..., level="ERROR", ...]    # field on a structured log
{ $.level = "ERROR" }        # JSON log key

CloudWatch Alarms #
An action when a metric crosses a threshold. The home base of alerting.
First alarm — Lambda Errors #
CloudWatch → Alarms → Create alarm
- Metric: AWS/Lambda → Errors → Function: my-function
- Statistic: Sum
- Period: 1 minute
- Threshold: > 0 for 1 datapoint within 5 minutes
- Action: SNS → notification topic
- Name: lambda-my-function-errors

Alarm states #
| State | Meaning |
|---|---|
| OK | Within threshold |
| ALARM | Threshold breached — actions fire |
| INSUFFICIENT_DATA | Not enough data — newly created / no metric data |
How missing data is treated is configurable per alarm (the “treat missing data as” setting). New alarms often sit in INSUFFICIENT_DATA while they collect their first datapoints, so at that stage you can usually ignore it.
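The first alarm from the console walkthrough, written as arguments for the CloudWatch PutMetricAlarm API. This is a sketch under assumptions: the helper name and the SNS topic ARN are placeholders, and `TreatMissingData="notBreaching"` is one reasonable choice for a function that is idle much of the time, not the only one.

```python
def lambda_errors_alarm_kwargs(function_name, sns_topic_arn):
    """Arguments for PutMetricAlarm: any error in 1 of the last 5 one-minute periods."""
    return {
        "AlarmName": f"lambda-{function_name}-errors",
        "Namespace": "AWS/Lambda",
        "MetricName": "Errors",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": 60,                 # seconds
        "EvaluationPeriods": 5,       # look at the last 5 periods
        "DatapointsToAlarm": 1,       # 1 breaching datapoint is enough
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # no invocations means no data, treated as OK
        "AlarmActions": [sns_topic_arn],
    }


kwargs = lambda_errors_alarm_kwargs(
    "my-function",
    "arn:aws:sns:ap-northeast-2:123456789012:alerts",  # placeholder ARN
)
# With credentials configured: boto3.client("cloudwatch").put_metric_alarm(**kwargs)
print(kwargs["AlarmName"])  # lambda-my-function-errors
```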
Composite Alarms #
AND / OR combinations of multiple alarms, for patterns like “ALB 5xx ≥ 1% AND CPU > 80%”.

ALARM("alb-5xx") AND ALARM("ec2-high-cpu")

Effective at reducing false positives.
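A rule like the one above can be assembled in code and handed to the CloudWatch PutCompositeAlarm API. A sketch only: `all_in_alarm`, the composite alarm name, and the topic ARN are placeholders, and the two child alarms are assumed to exist already.

```python
def all_in_alarm(*alarm_names):
    """Build a composite AlarmRule that fires only when every named alarm is in ALARM."""
    return " AND ".join(f'ALARM("{name}")' for name in alarm_names)


rule = all_in_alarm("alb-5xx", "ec2-high-cpu")
print(rule)  # ALARM("alb-5xx") AND ALARM("ec2-high-cpu")

kwargs = {
    "AlarmName": "alb-5xx-and-high-cpu",  # hypothetical name
    "AlarmRule": rule,
    "AlarmActions": ["arn:aws:sns:ap-northeast-2:123456789012:alerts"],  # placeholder ARN
}
# With credentials configured: boto3.client("cloudwatch").put_composite_alarm(**kwargs)
```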
Alarm actions #
| Action | What it is |
|---|---|
| SNS Topic | Fanout to email / Slack / SMS / Lambda, etc. |
| EC2 Action | Stop / terminate / reboot / recover an instance |
| Auto Scaling | Scale ASG in / out |
| Systems Manager | Create an OpsItem |
In practice, the large majority of production alerting is SNS → Slack / email.
Anomaly Detection #
Auto-learn a baseline (band) → alarm when out of band. Effective on patterned metrics like traffic / CPU. Fewer false positives than static thresholds.
SNS integration — how alerts are routed #
Most alarms head to an SNS Topic and fan out from there.
| Subscription | Where to |
|---|---|
| HTTPS | A webhook endpoint (Slack incoming webhooks expect their own payload shape, so route those through Lambda or AWS Chatbot) |
| Lambda | Transform and forward |
| SMS | Phone (rarely) |
| SQS | A queue |
Slack integration — the Lambda pattern #
Either send to the webhook directly or use AWS Chatbot.
import json
import os
import urllib.request

WEBHOOK = os.environ["SLACK_WEBHOOK"]

def handler(event, context):
    # The CloudWatch alarm arrives as a JSON string inside the SNS record.
    msg = json.loads(event["Records"][0]["Sns"]["Message"])
    payload = json.dumps({
        "text": f"🚨 *{msg['AlarmName']}* — {msg['NewStateValue']}",
        # "blocks": [...]  # elided in the original; a literal [...] would not serialize
    }).encode()
    req = urllib.request.Request(WEBHOOK, data=payload,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)

Deeper coverage in Advanced #4 API Gateway + Lambda and Advanced #5 EventBridge / SQS / SNS.
CloudWatch Dashboards #
Pages of widgets. At-a-glance per team / service. Definable as JSON / managed as code.
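As a rough idea of what "definable as JSON" looks like, the sketch below builds a one-widget dashboard body. The widget layout values, the `lambda_errors_widget` helper, and the dashboard name are illustrative assumptions; the body would be passed to the CloudWatch PutDashboard API.

```python
import json


def lambda_errors_widget(function_name, region, x=0, y=0):
    """One metric widget graphing a Lambda function's Errors as a per-minute sum."""
    return {
        "type": "metric",
        "x": x, "y": y, "width": 12, "height": 6,
        "properties": {
            "title": f"{function_name} errors",
            "metrics": [["AWS/Lambda", "Errors", "FunctionName", function_name]],
            "stat": "Sum",
            "period": 60,
            "region": region,
        },
    }


body = json.dumps({"widgets": [lambda_errors_widget("my-function", "ap-northeast-2")]})
# With credentials configured:
#   boto3.client("cloudwatch").put_dashboard(DashboardName="my-service", DashboardBody=body)
print(len(json.loads(body)["widgets"]))  # 1
```

Keeping the body in version control means a dashboard change goes through review like any other code.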
Common dashboards #
| Kind | What it is |
|---|---|
| Service dashboard | Core metrics for one service (requests, latency, errors, infra) |
| Infrastructure dashboard | EC2/RDS CPU / memory / network |
| Business dashboard | Signups / payments / orders, etc. |
| On-call dashboard | Active alarms / recent incidents / KPIs |
Widget kinds #
- Metric graphs (line, stacked, number)
- Logs (Logs Insights query results)
- Text (Markdown — dashboard guidance)
- Alarm status
A good dashboard = “thirty seconds on this page tells you the system’s state.”
Settings to turn on right after signup #
By the end of this series the following should be in place.
| Item | Where |
|---|---|
| Auto-apply retention to new log groups | Console / EventBridge + Lambda |
| Lambda Errors alarm | Per function |
| RDS FreeStorageSpace alarm | Per DB |
| ALB 5xx alarm | Per LB |
| Billing alarm (#3) | Per account |
| GuardDuty findings alarm (#6) | Per account |
These six are the alerting baseline for small operations.
Common pitfalls #
1) Forever-retention logs #
The most common cost incident. Set retention per log group + automate for new groups. One line that cuts cost by more than half.
2) High-cardinality custom metrics #
Dimensions: [{Name: 'UserId', Value: user_id}] with 10K users means 10K separate metrics, roughly $3,000/month at $0.30 each. Per-user data belongs in logs (Logs Insights).
3) Unbounded-time Logs Insights queries #
A big log group with no time range = GB-scale scan cost. Always tighten the time range.
4) Alarm Period too short #
1-minute, 1 datapoint alarms → an alert per transient spike. Usually 5 minutes / 3 datapoints balances noise.
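A toy model of the difference, reading "5 minutes / 3 datapoints" as 3 breaching datapoints out of the last 5 one-minute periods (that reading is an assumption). The CPU values are invented, and `alarm_fires` ignores details such as missing-data handling that the real evaluation includes.

```python
def alarm_fires(datapoints, threshold, evaluation_periods, datapoints_to_alarm):
    """True if the most recent `evaluation_periods` points contain at least
    `datapoints_to_alarm` values above `threshold`."""
    window = datapoints[-evaluation_periods:]
    breaches = sum(1 for value in window if value > threshold)
    return breaches >= datapoints_to_alarm


spike = [40, 42, 95]              # evaluated at the moment of a one-minute spike
sustained = [40, 85, 90, 92, 88]  # a rise that lasts several minutes

print(alarm_fires(spike, 80, 1, 1))      # True: 1-of-1 alerts on the spike
print(alarm_fires(spike, 80, 5, 3))      # False: 3-of-5 ignores it
print(alarm_fires(sustained, 80, 5, 3))  # True: a sustained rise still alarms
```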
5) Alarm without an action #
Create an alarm without SNS / action and it just turns red in the console — no one notices. Attach an action when you create it.
6) Dashboards built then ignored #
Right after building everyone looks; weeks in it’s forgotten. Embed in on-call / daily standup as the first stop.
7) Skipping Metric Math #
Metric Math computes ratios / sums / transformations across metrics, for example “5xx / total = error rate”. Used well, dashboards / alarms get a lot cleaner.
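A sketch of the error-rate example as a Metric Math definition, in the list shape accepted by the `Metrics` parameter of PutMetricAlarm and by GetMetricData's `MetricDataQueries`. The `error_rate_metrics` helper, the 5-minute period, and the load balancer value are illustrative assumptions.

```python
def error_rate_metrics(load_balancer):
    """Metric Math definition computing ALB target 5xx / total requests as a percentage."""
    def stat(metric_id, metric_name):
        return {
            "Id": metric_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/ApplicationELB",
                    "MetricName": metric_name,
                    "Dimensions": [{"Name": "LoadBalancer", "Value": load_balancer}],
                },
                "Period": 300,
                "Stat": "Sum",
            },
            "ReturnData": False,  # inputs only; the expression is what gets returned
        }

    return [
        stat("errors", "HTTPCode_Target_5XX_Count"),
        stat("total", "RequestCount"),
        {
            "Id": "error_rate",
            "Expression": "100 * errors / total",
            "Label": "5xx error rate (%)",
            "ReturnData": True,
        },
    ]


metrics = error_rate_metrics("app/my-alb/1234567890abcdef")  # placeholder LB identifier
print([m["Id"] for m in metrics])  # ['errors', 'total', 'error_rate']
```

Alarming on the ratio rather than the raw 5xx count keeps the threshold meaningful as traffic grows.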
Wrap-up #
What we covered:
- The four components of CloudWatch — Logs / Metrics / Alarms / Dashboards
- Logs — group → stream → event. Set retention right after signup. CloudWatch Agent for EC2 collection
- Logs Insights — fields / filter / stats / sort / parse + bin(5m). Tight time range
- Metrics — AWS services emit automatically. Dimensions / statistics (Avg, p95, p99). No high-cardinality dimensions
- Metric Filter — extract metrics from logs (ERROR, etc.)
- EMF — metrics through logs in Lambda (an alternative to put_metric_data)
- Alarms — threshold + Period + Datapoints. SNS / EC2 actions / ASG. Composite / Anomaly
- SNS — alarm fanout. Slack / email / Lambda
- Dashboards — service / infra / business / on-call
- Pitfalls — forever retention, high-cardinality dimensions, unbounded Insights queries, Period too short, no action, ignored dashboards
Next series — AWS Intermediate #
This wraps up the seven AWS Basics. Console / account / IAM / cost / CLI / SSO / security / CloudWatch — the toolbox for getting started on AWS is in one place.
Now to actually build resources. The seven posts of AWS Intermediate cover the core pieces of backend operations.
AWS Intermediate #1 EC2 and VPC basics ties EC2 (the virtual machine) together with the VPC it lives in (Subnet / Internet Gateway / Route Table / Security Group / NACL) in a single walkthrough.