CloudWatch Intro — Logs / Metrics
The structure of CloudWatch Logs / Metrics / Alarms / Dashboards, log groups and retention, Metric Filters, and the basics of Logs Insights queries — the observability tool that becomes the eye of all operations.
Chapter 2 through Chapter 6 gathered the foundation of AWS setup. Accounts, IAM, cost, CLI, SSO, and security formed the mental map and daily setup before entering the console. Now we cover the other axis of operations: the tool for seeing what is doing what, and where.
CloudWatch is AWS’s observability standard. Almost every service in AWS sends metrics to CloudWatch by default, and CloudWatch Logs receives the logs too. Operations’ first field of view starts here.
This chapter, the last of Part 1, sorts out CloudWatch’s four components — Logs / Metrics / Alarms / Dashboards — in one go. Deeper observability continues in Chapter 26 monitoring and X-Ray, together with distributed tracing.
The big picture — CloudWatch’s four components #
| Component | What | Common use |
|---|---|---|
| Logs | Text log storage / search | Logs of EC2 / Lambda / ECS / API Gateway |
| Metrics | Time-series numbers (CPU%, request count, etc.) | Every AWS service sends automatically |
| Alarms | Notify / act when a metric crosses a threshold | Operational alerts, auto-scaling |
| Dashboards | A page of graphs / widgets | An at-a-glance view per team / service |
These four are woven into one flow. Logs → Metrics → Alarms → Dashboards.
CloudWatch Logs #
Log groups and log streams #
Log Group — usually per application / service
└── Log Stream — usually per process / container
    └── Log Event — a single entry

| Item | Example |
|---|---|
| Log Group | /aws/lambda/my-function, /ecs/my-service, /var/log/myapp |
| Log Stream | Lambda execution-environment ID, ECS Task ID, EC2 instance ID |
| Log Event | One line of text + timestamp |
Lambda sends logs to CloudWatch Logs automatically, and ECS Fargate does too once the awslogs driver is set in the container definition. EC2 needs a CloudWatch Agent or an auxiliary tool (fluent-bit, etc.).
Retention — the most important setting #
The default retention is forever. Leave it as-is and logs pile up forever, exploding the cost. Right after sign-up, and for every new log group, you have to set retention.
| Item | Recommended retention |
|---|---|
| General application logs | 30 ~ 90 days |
| Debug / development logs | 7 days |
| Security / audit logs (CloudTrail) | 1 ~ 7 years (cheaper sent to S3) |
| Lambda logs | 14 ~ 30 days |
aws logs put-retention-policy \
  --log-group-name /aws/lambda/my-function \
  --retention-in-days 30

# Apply 30 days to every existing log group.
# Note: this overwrites any retention already set, including longer audit retention.
aws logs describe-log-groups --query 'logGroups[].logGroupName' --output text \
  | tr '\t' '\n' \
  | while read -r name; do
      aws logs put-retention-policy --log-group-name "$name" --retention-in-days 30
    done

This setting prevents more than half of CloudWatch cost incidents (the log-flood incident in Chapter 3 cost management is resolved by this one setting).
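The shell loop above sets 30 days on every group, including ones that were deliberately given a longer retention. A more conservative variant, sketched here in Python with boto3, touches only the groups that have no policy at all. The function names and the `DEFAULT_RETENTION_DAYS` constant are illustrative, not part of any AWS API, and the sketch assumes credentials that allow `logs:DescribeLogGroups` and `logs:PutRetentionPolicy`.

```python
from typing import Iterable

DEFAULT_RETENTION_DAYS = 30  # assumption: same value as the shell loop above


def groups_missing_retention(log_groups: Iterable[dict]) -> list:
    """Return the names of log groups that have no retention policy.

    describe_log_groups omits the "retentionInDays" key for groups that
    never expire, so a missing key means "keep forever".
    """
    return [g["logGroupName"] for g in log_groups if "retentionInDays" not in g]


def apply_default_retention(dry_run: bool = True) -> list:
    """Set DEFAULT_RETENTION_DAYS on every group without a policy.

    With dry_run=True nothing is changed; the names are only returned.
    """
    import boto3  # imported lazily so the helper above can be tested offline

    logs = boto3.client("logs")
    targets = []
    for page in logs.get_paginator("describe_log_groups").paginate():
        for name in groups_missing_retention(page["logGroups"]):
            if not dry_run:
                logs.put_retention_policy(
                    logGroupName=name, retentionInDays=DEFAULT_RETENTION_DAYS
                )
            targets.append(name)
    return targets
```

Calling `apply_default_retention()` first with the default `dry_run=True` lists what would change; passing `dry_run=False` applies it.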
Auto retention for new log groups #
There are two ways to apply retention automatically to newly created log groups. The first is with EventBridge + Lambda, applying it automatically on receiving a CreateLogGroup event, which is often used in practice. The second is enforcing it with a policy that denies creation if retention isn’t specified, which is slightly overkill.
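The EventBridge + Lambda approach could look roughly like the following. It assumes a CloudTrail trail is recording management events and that an EventBridge rule forwards `CreateLogGroup` calls (source `aws.logs`, detail type "AWS API Call via CloudTrail") to this function. `RETENTION_DAYS` and `log_group_from_event` are names chosen for this sketch.

```python
RETENTION_DAYS = 30  # assumption: default retention for newly created groups


def log_group_from_event(event: dict):
    """Extract the new log group's name from a CloudTrail-sourced event.

    Returns None when the event is not a CreateLogGroup call.
    """
    detail = event.get("detail", {})
    if detail.get("eventName") != "CreateLogGroup":
        return None
    return (detail.get("requestParameters") or {}).get("logGroupName")


def handler(event, context):
    name = log_group_from_event(event)
    if name is None:
        return {"skipped": True}

    import boto3  # lazy import keeps the parsing helper testable offline

    boto3.client("logs").put_retention_policy(
        logGroupName=name, retentionInDays=RETENTION_DAYS
    )
    return {"logGroupName": name, "retentionInDays": RETENTION_DAYS}
```

The function's execution role needs `logs:PutRetentionPolicy`. Groups that should keep a different retention can be adjusted afterwards, since this handler only runs at creation time.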
Sending Lambda logs #
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def handler(event, context):
    logger.info("Received event: %s", event)
    return {"ok": True}

stdout / stderr flow automatically to CloudWatch Logs. The log group name is /aws/lambda/<function-name>.
EC2 / ECS — CloudWatch Agent #
EC2 isn’t automatic. It needs a CloudWatch Agent install.
sudo yum install -y amazon-cloudwatch-agent   # Amazon Linux
# or, on Ubuntu (amd64)
wget https://s3.amazonaws.com/amazoncloudwatch-agent/ubuntu/amd64/latest/amazon-cloudwatch-agent.deb
sudo dpkg -i amazon-cloudwatch-agent.deb

The config file is /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json.
{
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/myapp/*.log",
            "log_group_name": "/myapp/server",
            "log_stream_name": "{instance_id}",
            "retention_in_days": 30
          }
        ]
      }
    }
  },
  "metrics": {
    "metrics_collected": {
      "mem": { "measurement": ["mem_used_percent"] },
      "disk": { "measurement": ["used_percent"], "resources": ["*"] }
    }
  }
}

ECS Fargate is automatic if you write the awslogs driver in the container definition. This is covered in detail in Chapter 15 ECS and Fargate.
Logs Insights — search with queries #
The search / analysis tool of CloudWatch Logs. It uses its own query syntax, loosely SQL-like, in which commands are chained with pipes (|).
# Latest 100 events
fields @timestamp, @message
| sort @timestamp desc
| limit 100

# Latest 50 events containing ERROR
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 50

# Lambda duration statistics in 5-minute buckets (from REPORT lines)
fields @timestamp, @duration
| filter @type = "REPORT"
| stats avg(@duration), max(@duration), count(*) by bin(5m)

# 5xx count per path (assumes JSON logs with status and path fields)
fields @timestamp, status, path
| filter status >= 500
| stats count(*) as errors by path
| sort errors desc

Commonly used commands:
| Command | What |
|---|---|
| fields | Fields to show |
| filter | Condition filter |
| parse | Extract fields from a string |
| stats | Aggregate (count, avg, max, percentile) |
| sort | Sort |
| limit | Max number of results |
| bin(5m) | Time bucket |
Logs Insights cost caution #
You’re charged per GB scanned at query time (~$0.005/GB). Querying a large log group with an unlimited time range causes a cost incident. Always keep the time range narrow.
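One way to make the narrow time range a habit is to build it into the code that runs the query. The sketch below uses the boto3 `start_query` and `get_query_results` calls. The helper names, the 15-minute default, and the use of 10^9 bytes per GB are assumptions made for illustration, and the price constant is only the approximate figure quoted above.

```python
import time

PRICE_PER_GB_SCANNED = 0.005  # USD, approximate figure quoted above


def bounded_query_params(log_group, query, minutes=15, now=None):
    """Build start_query arguments limited to the last `minutes` minutes."""
    end = int(now if now is not None else time.time())
    return {
        "logGroupName": log_group,
        "startTime": end - minutes * 60,  # epoch seconds
        "endTime": end,
        "queryString": query,
        "limit": 100,
    }


def estimated_scan_cost(bytes_scanned: float) -> float:
    """Rough USD cost of a query that scanned `bytes_scanned` bytes."""
    return bytes_scanned / 1e9 * PRICE_PER_GB_SCANNED


def run_query(params, poll_seconds=1.0):
    """Run the query and return (rows, estimated cost in USD)."""
    import boto3  # lazy import so the helpers above work without AWS access

    logs = boto3.client("logs")
    query_id = logs.start_query(**params)["queryId"]
    while True:
        result = logs.get_query_results(queryId=query_id)
        if result["status"] not in ("Scheduled", "Running"):
            break
        time.sleep(poll_seconds)
    scanned = result.get("statistics", {}).get("bytesScanned", 0.0)
    return result["results"], estimated_scan_cost(scanned)
```

The `bytesScanned` statistic returned with the results gives a quick after-the-fact check on how much a query cost.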
CloudWatch Metrics #
A metric is a time-series number. Almost every AWS service sends its metrics automatically.
Commonly viewed metrics #
| Service | Commonly viewed metrics |
|---|---|
| EC2 | CPUUtilization, NetworkIn/Out, DiskReadOps |
| RDS | CPUUtilization, DatabaseConnections, FreeStorageSpace, ReadLatency |
| Lambda | Invocations, Errors, Duration, Throttles, ConcurrentExecutions |
| ECS | CPUUtilization, MemoryUtilization (per Service / Task) |
| ALB | RequestCount, TargetResponseTime, HTTPCode_Target_5XX_Count |
| API Gateway | Count, Latency, 4XXError, 5XXError |
| S3 | BucketSizeBytes, NumberOfObjects (once a day) |
| DynamoDB | ConsumedReadCapacityUnits / ConsumedWriteCapacityUnits, ThrottledRequests |
A metric’s dimensions #
The same metric is split by dimensions.
Service: AWS/EC2
Metric: CPUUtilization
Dimensions:
  - InstanceId: i-1234567890
  - InstanceId: i-2345678901
  - InstanceId: i-3456789012

Different dimensions are separate metrics. The number of metrics is the cost, so if dimensions explode you get a cost incident.
Statistic #
| Statistic | What |
|---|---|
| Sum | Total — Invocations, RequestCount |
| Average | Mean — CPU, Latency |
| Maximum | Max — spike detection |
| Minimum | Min |
| p95 / p99 | Percentile — Latency |
| SampleCount | Number of data points |
In most cases Average + p95 are meaningful. p99 / p99.9 directly affect the SLA and user experience.
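A small worked example shows why the percentiles matter alongside the average. The latencies are made up, and the nearest-rank method used here is one common definition of a percentile; CloudWatch's own calculation may differ in detail.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest value such that at least
    p percent of the data is at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in ms: 90 fast requests, 8 slow, 2 very slow.
latencies = [100] * 90 + [500] * 8 + [3000] * 2

average = sum(latencies) / len(latencies)  # 190.0
p95 = percentile(latencies, 95)            # 500
p99 = percentile(latencies, 99)            # 3000
```

The average of 190 ms looks healthy, while p95 shows that one request in twenty takes at least 500 ms and p99 exposes the 3-second tail.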
Standard vs high-resolution #
| Kind | Resolution | Cost |
|---|---|---|
| Standard | 1 minute | Standard |
| High-resolution | 1 second | Expensive — only for short spikes |
Standard 1-minute is plenty in most cases.
Sending custom metrics #
The application sends directly.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_data(
    Namespace="MyApp",
    MetricData=[{
        "MetricName": "OrderCreated",
        "Value": 1,
        "Unit": "Count",
        "Dimensions": [
            {"Name": "Environment", "Value": "prod"},
            {"Name": "Region", "Value": "ap-northeast-2"},
        ],
    }],
)

The cost is $0.30 / month per metric, separate per combination of dimensions. Never use high-cardinality dimensions like a user ID.
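The arithmetic behind that warning is worth spelling out. In the worst case, the number of distinct metrics is the product of the distinct values of each dimension. The sketch below applies the flat $0.30 figure quoted above and ignores any volume discounts; the function names and the cardinalities are illustrative.

```python
from math import prod

PRICE_PER_METRIC_MONTH = 0.30  # USD, the per-metric figure quoted above


def metric_series_count(cardinalities: dict) -> int:
    """Worst-case number of distinct metrics for one metric name:
    the product of the distinct values of each dimension."""
    return prod(cardinalities.values())


def monthly_cost(cardinalities: dict, metric_names: int = 1) -> float:
    """Flat-rate monthly estimate in USD, ignoring volume discounts."""
    return metric_series_count(cardinalities) * metric_names * PRICE_PER_METRIC_MONTH


# Environment x Region stays small: 2 * 3 = 6 metrics, about $1.80 a month.
small = monthly_cost({"Environment": 2, "Region": 3})

# Adding a UserId dimension with 10,000 users multiplies everything:
# 2 * 3 * 10,000 = 60,000 metrics, about $18,000 a month at the flat rate.
large = monthly_cost({"Environment": 2, "Region": 3, "UserId": 10_000})
```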
EMF (Embedded Metric Format) — a pattern for Lambda #
In Lambda, the put_metric_data call itself is a cost and latency burden. Write a specific JSON format to the log and CloudWatch automatically converts it to a metric.
import json
import time

print(json.dumps({
    "_aws": {
        "Timestamp": int(time.time() * 1000),  # epoch milliseconds
        "CloudWatchMetrics": [{
            "Namespace": "MyApp",
            "Dimensions": [["Environment"]],
            "Metrics": [{"Name": "OrderCreated", "Unit": "Count"}],
        }],
    },
    "Environment": "prod",
    "OrderCreated": 1,
}))

An SDK like aws-embedded-metrics-python helps with this more cleanly.
Metric Filter — making metrics from logs #
Convert information already in logs (an ERROR occurred, response time, etc.) into a metric.
CloudWatch → Log groups → select a group → Metric filters → Create
- Filter pattern: ERROR
- Metric namespace: MyApp
- Metric name: ErrorCount
- Metric value: 1

Now every time an ERROR occurs, the metric goes +1. You can use it in alarms or dashboards.
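The same filter can be created from code with the boto3 `put_metric_filter` call, which makes it easier to repeat per log group. This is a sketch: the filter name `error-count` is hypothetical, and `defaultValue` is an optional addition that is not part of the console steps above.

```python
def error_count_filter(log_group: str) -> dict:
    """Arguments for logs.put_metric_filter mirroring the console steps."""
    return {
        "logGroupName": log_group,
        "filterName": "error-count",  # hypothetical name
        "filterPattern": "ERROR",
        "metricTransformations": [{
            "metricNamespace": "MyApp",
            "metricName": "ErrorCount",
            "metricValue": "1",   # the API expects this as a string
            "defaultValue": 0.0,  # optional: report 0 when logs arrive but none match
        }],
    }


def create_filter(log_group: str) -> None:
    import boto3  # lazy import keeps the builder above testable offline

    boto3.client("logs").put_metric_filter(**error_count_filter(log_group))
```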
ERROR # contains the word ERROR
[..., level="ERROR", ...] # a field of a structured log
{ $.level = "ERROR" } # a key of a JSON log

CloudWatch Alarms #
When a metric crosses a threshold, it acts. The home base of alerts.
A first alarm — Lambda Errors #
CloudWatch → Alarms → Create alarm
- Metric: AWS/Lambda → Errors → Function: my-function
- Statistic: Sum
- Period: 1 minute
- Threshold: > 0 for 1 datapoint within 5 minutes
- Action: SNS → notification topic
- Name: lambda-my-function-errors

An alarm’s states #
| State | Meaning |
|---|---|
| OK | Within threshold |
| ALARM | Threshold exceeded — action fires |
| INSUFFICIENT_DATA | Not enough data — newly created / metric not coming |
Whether to treat INSUFFICIENT_DATA as an alarm (notify) is optional. It’s a state that often appears while evaluating a new alarm, so it’s generally ignored.
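The first alarm and the missing-data choice can be written down together as arguments to the boto3 `put_metric_alarm` call. In this sketch the topic ARN is a placeholder, and `TreatMissingData="notBreaching"` is one reasonable choice rather than a requirement: Lambda reports no Errors datapoints while a function is idle, and this setting keeps the alarm in OK instead of INSUFFICIENT_DATA during those gaps.

```python
def lambda_errors_alarm(function_name: str, topic_arn: str) -> dict:
    """Arguments for cloudwatch.put_metric_alarm for a Lambda Errors alarm."""
    return {
        "AlarmName": f"lambda-{function_name}-errors",
        "Namespace": "AWS/Lambda",
        "MetricName": "Errors",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": 60,               # 1-minute datapoints
        "EvaluationPeriods": 5,     # look at the last 5 minutes
        "DatapointsToAlarm": 1,     # any single breaching datapoint fires
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [topic_arn],
    }


def create_alarm(function_name: str, topic_arn: str) -> None:
    import boto3  # lazy import keeps the builder above testable offline

    boto3.client("cloudwatch").put_metric_alarm(
        **lambda_errors_alarm(function_name, topic_arn)
    )
```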
Composite Alarms #
Combine multiple alarms with AND / OR. A pattern like “ALB 5xx ≥ 1% AND CPU > 80%.”
ALARM("alb-5xx") AND ALARM("ec2-high-cpu")

Effective at reducing false positives.
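A rule string like the one above can be assembled from alarm names and passed to the boto3 `put_composite_alarm` call. The helper names and the composite alarm's name are illustrative, and the child alarms are assumed to exist already.

```python
def all_in_alarm(*alarm_names: str) -> str:
    """Build an AlarmRule that is true only when every named alarm is in ALARM."""
    return " AND ".join(f'ALARM("{name}")' for name in alarm_names)


def composite_alarm(name: str, rule: str, topic_arn: str) -> dict:
    """Arguments for cloudwatch.put_composite_alarm."""
    return {"AlarmName": name, "AlarmRule": rule, "AlarmActions": [topic_arn]}


def create_composite(name: str, rule: str, topic_arn: str) -> None:
    import boto3  # lazy import keeps the builders above testable offline

    boto3.client("cloudwatch").put_composite_alarm(
        **composite_alarm(name, rule, topic_arn)
    )


rule = all_in_alarm("alb-5xx", "ec2-high-cpu")
```

With this layout, only the composite alarm needs an SNS action; the child alarms can stay silent and serve purely as inputs.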
Alarm actions #
| Action | What |
|---|---|
| SNS Topic | Fan out to email / Slack / SMS / Lambda, etc. |
| EC2 Action | Instance stop / terminate / reboot / recover |
| Auto Scaling | ASG scale in / out |
| Systems Manager | Create an OpsItem |
In practice, the large majority of alarms simply go SNS → Slack / email.
Anomaly Detection #
After automatically learning a baseline (band), it raises an alarm when outside it. Effective for metrics with patterns like traffic or CPU, with fewer false positives than a static threshold.
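An anomaly detection alarm is defined with metric math rather than a fixed threshold. The sketch below builds `put_metric_alarm` arguments around the `ANOMALY_DETECTION_BAND` expression. The period, evaluation count, band width of 2, and the choice to alarm only above the band are assumptions for illustration.

```python
def anomaly_alarm(name, namespace, metric, dimensions, topic_arn, band_width=2):
    """Arguments for cloudwatch.put_metric_alarm using an anomaly band."""
    return {
        "AlarmName": name,
        "ComparisonOperator": "GreaterThanUpperThreshold",
        "EvaluationPeriods": 3,
        "ThresholdMetricId": "band",  # compare m1 against this expression
        "Metrics": [
            {
                "Id": "m1",
                "ReturnData": True,
                "MetricStat": {
                    "Metric": {
                        "Namespace": namespace,
                        "MetricName": metric,
                        "Dimensions": dimensions,
                    },
                    "Period": 300,
                    "Stat": "Average",
                },
            },
            {
                "Id": "band",
                "Expression": f"ANOMALY_DETECTION_BAND(m1, {band_width})",
                "Label": f"{metric} (expected)",
                "ReturnData": True,
            },
        ],
        "AlarmActions": [topic_arn],
    }


def create_anomaly_alarm(**kwargs) -> None:
    import boto3  # lazy import keeps the builder above testable offline

    boto3.client("cloudwatch").put_metric_alarm(**anomaly_alarm(**kwargs))
```

A larger `band_width` widens the expected band and makes the alarm less sensitive.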
SNS integration — how to send alerts #
Most alarms go to an SNS Topic, and from there fan out to the next step.
| Subscription | To where |
|---|---|
| HTTPS | A webhook endpoint you run (a Slack incoming webhook cannot confirm an SNS subscription, so put a Lambda in between) |
| Lambda | To another step after processing |
| SMS | Phone (rarely) |
| SQS | To a queue |
Slack integration — the Lambda pattern #
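Post to a Slack incoming webhook from a Lambda function subscribed to the topic, or use AWS Chatbot (since renamed Amazon Q Developer in chat applications).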
import json, os, urllib.request

WEBHOOK = os.environ["SLACK_WEBHOOK"]

def handler(event, context):
    # The CloudWatch alarm arrives as a JSON string inside the SNS record
    msg = json.loads(event["Records"][0]["Sns"]["Message"])
    payload = json.dumps({
        "text": f"🚨 *{msg['AlarmName']}* — {msg['NewStateValue']}",
        # "blocks": [...]  (optional Block Kit layout, elided here)
    }).encode()
    req = urllib.request.Request(WEBHOOK, data=payload,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)

Covered more deeply in Chapter 18 API Gateway and Lambda and Chapter 19 EventBridge / SQS / SNS.
CloudWatch Dashboards #
A page of widgets. Used for an at-a-glance view per team / service. You can define it in JSON and manage it as code.
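As a sketch of what "dashboard as code" can look like, the function below builds a dashboard body with one metric widget and one text widget, which can then be passed to the boto3 `put_dashboard` call. The layout, titles, and dashboard name are illustrative.

```python
import json


def service_dashboard(function_name: str, region: str) -> str:
    """Return a DashboardBody JSON string for a minimal Lambda dashboard."""
    widgets = [
        {
            "type": "metric",
            "x": 0, "y": 0, "width": 12, "height": 6,
            "properties": {
                "title": f"{function_name} invocations and errors",
                "metrics": [
                    ["AWS/Lambda", "Invocations", "FunctionName", function_name],
                    ["AWS/Lambda", "Errors", "FunctionName", function_name],
                ],
                "stat": "Sum",
                "period": 60,
                "region": region,
            },
        },
        {
            "type": "text",
            "x": 12, "y": 0, "width": 12, "height": 6,
            "properties": {"markdown": "## On-call notes\nCheck errors first."},
        },
    ]
    return json.dumps({"widgets": widgets})


def publish(name: str, body: str) -> None:
    import boto3  # lazy import keeps the builder above testable offline

    boto3.client("cloudwatch").put_dashboard(DashboardName=name, DashboardBody=body)
```

Keeping the body in version control means a dashboard change goes through the same review as any other code change.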
Commonly made dashboards #
| Kind | What |
|---|---|
| Service dashboard | A service’s core metrics (requests, latency, errors, infra) |
| Infra dashboard | CPU / memory / network of EC2/RDS |
| Business dashboard | Business metrics like sign-ups / payments / orders |
| On-call dashboard | Active alarms / recent incidents / key indicators |
Widget kinds #
- Metric graphs (line, stacked, number)
- Logs (Logs Insights query results)
- Text (Markdown — dashboard guidance)
- Alarm status
A good dashboard is one where “looking at this one page for 30 seconds tells you the system’s state.”
Settings to turn on right after sign-up #
By the time Part 1 ends, the following settings should be in place.
| Item | Where |
|---|---|
| Auto-apply retention to new log groups | Console / EventBridge + Lambda |
| Lambda Errors alarm | Per function |
| RDS FreeStorageSpace alarm | Per DB |
| ALB 5xx alarm | Per LB |
| Billing alarm (Chapter 3 cost management) | Account-level |
| GuardDuty findings alarm (Chapter 6 security basics) | Account-level |
These six are the alerting foundation of a small operation.
Common pitfalls #
- Log retention forever — the most common cost incident. Specify retention per log group and automate new groups. This single setting saves more than half the cost.
- High-cardinality custom metrics — set it like Dimensions: [{Name: 'UserId', Value: user_id}] and with 10,000 users you get 10,000 metrics (times any other dimension combinations), which is $3,000+ a month at $0.30 each. For per-user questions, logs (Logs Insights) are the answer.
- Unlimited-time Logs Insights queries — not setting a time range on a large log group incurs GB-scale scan cost. Always keep the time range narrow.
- An alarm’s Period too short — set an alarm at 1 minute / 1 datapoint and it floods on every transient spike. Usually about 5 minutes / 3 datapoints reduces the noise appropriately.
- Not attaching an alarm action — make an alarm and not attach SNS or an action and it just turns red in the console. Nobody knows. Attach the action when you make it.
- Making a dashboard but not looking at it — everyone looks right after making it, but it’s forgotten over time. Settle it as the first check item of on-call or the daily standup.
- Not using Metric Math — there’s Metric Math, which can compute ratios / sums / transforms of multiple metrics. Calculations like “5xx / total requests = error rate.” Used well, dashboards and alarms get much cleaner.
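The error-rate calculation from the Metric Math item can be expressed as a list of metric queries, usable as `MetricDataQueries` in `get_metric_data` or as `Metrics` in `put_metric_alarm`. This is a sketch: the IDs `errors`, `requests`, and `error_rate` are names chosen here, and the `LoadBalancer` dimension value shown in the usage line is a placeholder.

```python
def alb_error_rate_queries(load_balancer: str, period: int = 300) -> list:
    """Metric Math queries computing the ALB 5xx error rate in percent."""

    def stat(metric_id: str, metric_name: str) -> dict:
        # Input metrics are hidden (ReturnData False); only the ratio is returned.
        return {
            "Id": metric_id,
            "ReturnData": False,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/ApplicationELB",
                    "MetricName": metric_name,
                    "Dimensions": [{"Name": "LoadBalancer", "Value": load_balancer}],
                },
                "Period": period,
                "Stat": "Sum",
            },
        }

    return [
        stat("errors", "HTTPCode_Target_5XX_Count"),
        stat("requests", "RequestCount"),
        {
            "Id": "error_rate",
            "Expression": "100 * errors / requests",
            "Label": "5xx error rate (%)",
            "ReturnData": True,
        },
    ]


queries = alb_error_rate_queries("app/my-alb/0123456789abcdef")  # placeholder value
```

Alarming on the ratio rather than the raw 5xx count keeps the threshold meaningful as traffic grows.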
Exercises #
- Read the bulk-apply script in §“Retention — the most important setting” as a dry-run in your environment, and explain in one paragraph how this setting prevents the “log flood” incident in Chapter 3 cost management.
- From the commonly-viewed-metrics table in §“CloudWatch Metrics,” pick the Lambda and ALB rows and write, connecting to the §“Settings to turn on right after sign-up” table, which metric you should put an alarm on for each to catch the first operational incident.
- Based on the §“High-cardinality custom metrics” pitfall, explain from a cost perspective why you should use Logs Insights instead of a custom metric when you want to track per-user order counts.
In short: CloudWatch is the eye of operations, with Logs · Metrics · Alarms · Dashboards forming one flow from logs → metrics → alarms → dashboards. If you do not specify log retention, cost can explode, so that setting belongs right after sign-up. For metrics, dimensions drive cost, so high-cardinality dimensions are forbidden, and an alarm is meaningful only if you attach an SNS action.
Next chapter #
With this, Part 1 Getting Started with AWS ends. Console / account / IAM / cost / CLI / SSO / security / CloudWatch — the toolbox you need to start something on AWS came together in one place. Now it’s time to make real resources. In Chapter 8 EC2 and VPC, the first chapter of Part 2, we sort out in one stroke the structure of the virtual machine EC2 and the virtual network VPC it lives in — Subnet / Internet Gateway / Route Table / Security Group / NACL.