Chapter 7

CloudWatch Intro — Logs / Metrics

The structure of CloudWatch Logs / Metrics / Alarms / Dashboards, log groups and retention, Metric Filters, and the basics of Logs Insights queries — the observability tool that becomes the eye of all operations.

Chapter 2 through Chapter 6 gathered the foundation of AWS setup. Accounts, IAM, cost, CLI, SSO, and security formed the mental map and daily setup before entering the console. Now we cover the other axis of operations: the tool for seeing what is doing what, and where.

CloudWatch is AWS’s observability standard. Almost every service in AWS sends metrics to CloudWatch by default, and CloudWatch Logs receives the logs too. Operations’ first field of view starts here.

This chapter, the last of Part 1, sorts out CloudWatch’s four components — Logs / Metrics / Alarms / Dashboards — in one go. Deeper observability continues in Chapter 26 monitoring and X-Ray, together with distributed tracing.

The big picture — CloudWatch’s four components #

Component	What	Common use
Logs	Text log storage / search	Logs of EC2 / Lambda / ECS / API Gateway
Metrics	Time-series numbers (CPU%, request count, etc.)	Almost every AWS service sends them automatically
Alarms	Notify / act when a metric crosses a threshold	Operational alerts, auto-scaling
Dashboards	A page of graphs / widgets	An at-a-glance view per team / service

These four are woven into one flow. Logs → Metrics → Alarms → Dashboards.

CloudWatch Logs #

Log groups and log streams #

Structure

Log Group           — usually per application / service
  └── Log Stream    — usually per process / container
        └── Log Event — a single entry

Item	Example
Log Group	`/aws/lambda/my-function`, `/ecs/my-service`, `/var/log/myapp`
Log Stream	Lambda execution-environment ID, ECS Task ID, EC2 instance ID
Log Event	One line of text + timestamp

Lambda sends its logs to CloudWatch Logs automatically, and ECS Fargate does so once the awslogs log driver is set in the container definition. EC2 needs the CloudWatch Agent or an auxiliary tool (fluent-bit, etc.).

Retention — the most important setting #

The default retention is "never expire." Left as-is, logs accumulate indefinitely and the storage cost keeps growing. Set retention right after sign-up, and again for every new log group.

Item	Recommended retention
General application logs	30 ~ 90 days
Debug / development logs	7 days
Security / audit logs (CloudTrail)	1 ~ 7 years (cheaper sent to S3)
Lambda logs	14 ~ 30 days

Set retention

aws logs put-retention-policy \
  --log-group-name /aws/lambda/my-function \
  --retention-in-days 30

Apply to all log groups in bulk

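# Caution: this overwrites the retention of every log group, including long-retention audit groups.
aws logs describe-log-groups --query 'logGroups[].logGroupName' --output text \
  | tr '\t' '\n' \
  | while read -r name; do
      aws logs put-retention-policy --log-group-name "$name" --retention-in-days 30
    done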

This setting prevents more than half of CloudWatch cost incidents (the log-flood incident in Chapter 3 cost management is resolved by this one setting).
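The shell loop above resets retention on every group, which would also shorten groups deliberately kept longer. A gentler variant, sketched here with boto3, touches only groups that have no retention yet. The function names and the `dry_run` flag are illustrative assumptions, not part of any AWS tool.

```python
from typing import Iterable


def groups_missing_retention(log_groups: Iterable[dict]) -> list[str]:
    """Return names of log groups that have no retention policy.

    describe_log_groups omits the retentionInDays key for groups
    that never expire, so a missing key means "keep forever".
    """
    return [g["logGroupName"] for g in log_groups if "retentionInDays" not in g]


def apply_default_retention(days: int = 30, dry_run: bool = True) -> list[str]:
    """Set retention on unconfigured groups; with dry_run=True only list them."""
    import boto3  # imported here so the helper above works without AWS access

    logs = boto3.client("logs")
    touched = []
    for page in logs.get_paginator("describe_log_groups").paginate():
        for name in groups_missing_retention(page["logGroups"]):
            if not dry_run:
                logs.put_retention_policy(logGroupName=name, retentionInDays=days)
            touched.append(name)
    return touched
```

Calling `apply_default_retention()` with the default `dry_run=True` only reports which groups would change; pass `dry_run=False` to apply.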

Auto retention for new log groups #

There are two ways to apply retention automatically to newly created log groups. The first is with EventBridge + Lambda, applying it automatically on receiving a CreateLogGroup event, which is often used in practice. The second is enforcing it with a policy that denies creation if retention isn’t specified, which is slightly overkill.
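A minimal sketch of the first approach follows. It assumes CloudTrail is recording management events, so that EventBridge receives "AWS API Call via CloudTrail" events, and that a rule with the pattern below targets this function. The 30-day default and the optional `logs_client` parameter (there to make the handler testable) are assumptions of this sketch.

```python
from typing import Optional

DEFAULT_RETENTION_DAYS = 30  # assumed default; choose per environment

# EventBridge rule pattern that matches log group creation.
EVENT_PATTERN = {
    "source": ["aws.logs"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventSource": ["logs.amazonaws.com"],
        "eventName": ["CreateLogGroup"],
    },
}


def log_group_from_event(event: dict) -> Optional[str]:
    """Extract the new log group's name, or None for any other event."""
    detail = event.get("detail", {})
    if detail.get("eventName") != "CreateLogGroup":
        return None
    return (detail.get("requestParameters") or {}).get("logGroupName")


def handler(event, context, logs_client=None):
    name = log_group_from_event(event)
    if name is None:
        return {"applied": False}
    if logs_client is None:
        import boto3  # deferred so the parsing logic can be tested offline

        logs_client = boto3.client("logs")
    logs_client.put_retention_policy(
        logGroupName=name, retentionInDays=DEFAULT_RETENTION_DAYS
    )
    return {"applied": True, "logGroupName": name}
```

The function's execution role needs `logs:PutRetentionPolicy` for this to work.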

Sending Lambda logs #

Lambda — just print

import logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)

def handler(event, context):
    logger.info("Received event: %s", event)
    return {"ok": True}

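Anything written to stdout / stderr, including output from the logging module, flows automatically to CloudWatch Logs, provided the execution role allows it (the AWSLambdaBasicExecutionRole managed policy covers this). The log group name is /aws/lambda/<function-name>.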

EC2 / ECS — CloudWatch Agent #

EC2 isn’t automatic. It needs a CloudWatch Agent install.

Amazon Linux 2023 / Ubuntu

sudo yum install -y amazon-cloudwatch-agent       # Amazon Linux 2023
# or, on Ubuntu (amd64)
wget https://s3.amazonaws.com/amazoncloudwatch-agent/ubuntu/amd64/latest/amazon-cloudwatch-agent.deb
sudo dpkg -i amazon-cloudwatch-agent.deb

The config file is /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json.

Simple config

{
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/myapp/*.log",
            "log_group_name": "/myapp/server",
            "log_stream_name": "{instance_id}",
            "retention_in_days": 30
          }
        ]
      }
    }
  },
  "metrics": {
    "metrics_collected": {
      "mem":  { "measurement": ["mem_used_percent"] },
      "disk": { "measurement": ["used_percent"], "resources": ["*"] }
    }
  }
}

ECS Fargate is automatic if you write the awslogs driver in the container definition. This is covered in detail in Chapter 15 ECS and Fargate.

Logs Insights — search with queries #

The search / analysis tool of CloudWatch Logs. Its native query language is not SQL: a query is a chain of commands joined by the pipe character (|).

The simplest query

fields @timestamp, @message
| sort @timestamp desc
| limit 100

ERROR only

fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 50

Request-time distribution (Lambda)

fields @timestamp, @duration
| filter @type = "REPORT"
| stats avg(@duration), max(@duration), count(*) by bin(5m)

API Gateway: 5xx count by path (assumes JSON access logs that include status and path fields)

fields @timestamp, status, path
| filter status >= 500
| stats count(*) as errors by path
| sort errors desc

Commonly used commands:

Command	What
`fields`	Fields to show
`filter`	Condition filter
`parse`	Extract fields from a string
`stats`	Aggregate (count, avg, max, pct)
`sort`	Sort
`limit`	Max number of results
`bin(5m)`	Time bucket

Logs Insights cost caution #

You’re charged per GB scanned at query time (~$0.005/GB). Querying a large log group with an unlimited time range causes a cost incident. Always keep the time range narrow.
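One way to make a narrow time range the default is to run queries through a small wrapper that always sets it. The sketch below uses the boto3 `start_query` / `get_query_results` calls. The helper names and the 15-minute default are assumptions. At the rate quoted above, scanning 200 GB would cost about 200 × $0.005 = $1 per query.

```python
import time
from typing import Optional


def bounded_query_params(log_group: str, query: str,
                         minutes: int = 15,
                         now: Optional[float] = None) -> dict:
    """Build start_query arguments limited to the last `minutes` minutes."""
    end = int(now if now is not None else time.time())
    return {
        "logGroupName": log_group,
        "queryString": query,
        "startTime": end - minutes * 60,  # epoch seconds
        "endTime": end,
    }


def run_query(params: dict, poll_seconds: float = 1.0) -> list:
    """Start the query and poll until it leaves the Scheduled/Running states."""
    import boto3  # deferred so bounded_query_params works without AWS access

    logs = boto3.client("logs")
    query_id = logs.start_query(**params)["queryId"]
    while True:
        result = logs.get_query_results(queryId=query_id)
        if result["status"] not in ("Scheduled", "Running"):
            return result["results"]
        time.sleep(poll_seconds)
```

For example, `run_query(bounded_query_params("/aws/lambda/my-function", "fields @timestamp, @message | filter @message like /ERROR/ | limit 50"))` scans only the last 15 minutes.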

CloudWatch Metrics #

A metric is a time-series number. Almost every AWS service sends its metrics automatically.

Commonly viewed metrics #

Service	Commonly viewed metrics
EC2	`CPUUtilization`, `NetworkIn/Out`, `DiskReadOps`
RDS	`CPUUtilization`, `DatabaseConnections`, `FreeStorageSpace`, `ReadLatency`
Lambda	`Invocations`, `Errors`, `Duration`, `Throttles`, `ConcurrentExecutions`
ECS	`CPUUtilization`, `MemoryUtilization` (per Service / Task)
ALB	`RequestCount`, `TargetResponseTime`, `HTTPCode_Target_5XX_Count`
API Gateway	`Count`, `Latency`, `4XXError`, `5XXError`
S3	`BucketSizeBytes`, `NumberOfObjects` (once a day)
DynamoDB	`ConsumedReadCapacityUnits`, `ConsumedWriteCapacityUnits`, `ThrottledRequests`

A metric’s dimensions #

The same metric is split by dimensions.

Example: CPUUtilization

Namespace: AWS/EC2
Metric:  CPUUtilization
Dimensions:
  - InstanceId: i-1234567890
  - InstanceId: i-2345678901
  - InstanceId: i-3456789012

Each distinct combination of dimension values is a separate metric. Custom metrics are billed per metric, so if dimensions explode you get a cost incident.

Statistic #

Statistic	What
`Sum`	Total — Invocations, RequestCount
`Average`	Mean — CPU, Latency
`Maximum`	Max — spike detection
`Minimum`	Min
`p95` / `p99`	Percentile — Latency
`SampleCount`	Number of data points

In most cases Average + p95 are meaningful. p99 / p99.9 matter when tail latency directly affects the SLA and user experience.
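To see why a percentile is usually watched alongside the average, consider a small made-up latency sample. The nearest-rank percentile below is an illustrative definition only; CloudWatch computes percentiles with its own method.

```python
import math
from statistics import mean


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# 18 fast requests and 2 slow ones (latency in ms, made-up numbers)
latencies = [100.0] * 18 + [2000.0] * 2

print(mean(latencies))            # 290.0
print(percentile(latencies, 50))  # 100.0
print(percentile(latencies, 95))  # 2000.0
```

The average (290 ms) describes no real request, while p95 shows that one request in ten in this sample took 2 seconds.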

Standard vs high-resolution #

Kind	Resolution	Cost
Standard	1 minute	Standard
High-resolution	1 second	Expensive — only for short spikes

Standard 1-minute is plenty in most cases.

Sending custom metrics #

The application sends directly.

Python — boto3

import boto3

cloudwatch = boto3.client("cloudwatch")
cloudwatch.put_metric_data(
    Namespace="MyApp",
    MetricData=[{
        "MetricName": "OrderCreated",
        "Value": 1,
        "Unit": "Count",
        "Dimensions": [
            {"Name": "Environment", "Value": "prod"},
            {"Name": "Region", "Value": "ap-northeast-2"},
        ],
    }],
)

The cost is $0.30 / month per metric, separate per combination of dimensions. Never use high-cardinality dimensions like a user ID.
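The arithmetic behind that warning is a simple product. The sketch below estimates the upper bound, assuming every combination of dimension values is actually published. The function name is hypothetical and the price is the $0.30 figure quoted above.

```python
from math import prod

PRICE_PER_METRIC_USD = 0.30  # per metric per month, as quoted above


def monthly_metric_cost(dimension_cardinalities: dict[str, int],
                        metric_names: int = 1) -> float:
    """Upper-bound cost: one billable metric per name per dimension-value combination."""
    combinations = prod(dimension_cardinalities.values())
    return round(metric_names * combinations * PRICE_PER_METRIC_USD, 2)


# 3 environments x 2 regions: 6 metrics
print(monthly_metric_cost({"Environment": 3, "Region": 2}))  # 1.8
# A user ID dimension with 10,000 users: 10,000 metrics
print(monthly_metric_cost({"UserId": 10_000}))               # 3000.0
```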

EMF (Embedded Metric Format) — a pattern for Lambda #

In Lambda, the put_metric_data call itself is a cost and latency burden. Write a specific JSON format to the log and CloudWatch automatically converts it to a metric.

EMF format

import json
import time
print(json.dumps({
    "_aws": {
        "Timestamp": int(time.time() * 1000),
        "CloudWatchMetrics": [{
            "Namespace": "MyApp",
            "Dimensions": [["Environment"]],
            "Metrics": [{"Name": "OrderCreated", "Unit": "Count"}],
        }],
    },
    "Environment": "prod",
    "OrderCreated": 1,
}))

An SDK like aws-embedded-metrics-python helps with this more cleanly.

Metric Filter — making metrics from logs #

Convert information already in logs (an ERROR occurred, response time, etc.) into a metric.

Making it in the console

CloudWatch → Log groups → select a group → Metric filters → Create
- Filter pattern: ERROR
- Metric namespace: MyApp
- Metric name: ErrorCount
- Metric value: 1

Now every time an ERROR occurs, the metric goes +1. You can use it in alarms or dashboards.

Filter pattern examples

ERROR                       # contains the word ERROR
[..., level="ERROR", ...]   # a field of a structured log
{ $.level = "ERROR" }       # a key of a JSON log
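The same filter can be created from code with the boto3 `put_metric_filter` call. This sketch mirrors the console values above. The filter name `error-count` and the `defaultValue` of 0 (which publishes a zero when nothing matches) are assumptions added here.

```python
def error_count_filter(log_group: str, pattern: str = "ERROR") -> dict:
    """Build put_metric_filter arguments that count matching log events."""
    return {
        "logGroupName": log_group,
        "filterName": "error-count",  # hypothetical name
        "filterPattern": pattern,
        "metricTransformations": [{
            "metricNamespace": "MyApp",
            "metricName": "ErrorCount",
            "metricValue": "1",   # the API expects a string here
            "defaultValue": 0,    # emit 0 when no event matches
        }],
    }


def create_error_filter(log_group: str) -> None:
    import boto3  # deferred so the builder works without AWS access

    boto3.client("logs").put_metric_filter(**error_count_filter(log_group))
```

For JSON logs, pass the structured pattern instead, for example `error_count_filter("/myapp/server", '{ $.level = "ERROR" }')`.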

CloudWatch Alarms #

When a metric crosses a threshold, it acts. The home base of alerts.

A first alarm — Lambda Errors #

Console

CloudWatch → Alarms → Create alarm
- Metric: AWS/Lambda → Errors → Function: my-function
- Statistic: Sum
- Period: 1 minute
- Threshold: > 0, alarm when 1 out of the last 5 datapoints breaches (any error within 5 minutes)
- Action: SNS → notification topic
- Name: lambda-my-function-errors
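The console steps above map onto a single boto3 `put_metric_alarm` call, roughly as sketched here. The topic ARN is a placeholder, and `TreatMissingData="notBreaching"` is an assumption of this sketch: it treats periods without data points as healthy rather than as missing.

```python
def lambda_errors_alarm(function_name: str, topic_arn: str) -> dict:
    """Build put_metric_alarm arguments for 'any error within 5 minutes'."""
    return {
        "AlarmName": f"lambda-{function_name}-errors",
        "Namespace": "AWS/Lambda",
        "MetricName": "Errors",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": 60,               # 1-minute datapoints
        "EvaluationPeriods": 5,     # look at the last 5 datapoints
        "DatapointsToAlarm": 1,     # one breaching datapoint is enough
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # assumption: no data means no errors
        "AlarmActions": [topic_arn],
    }


def create_alarm(function_name: str, topic_arn: str) -> None:
    import boto3  # deferred so the builder works without AWS access

    boto3.client("cloudwatch").put_metric_alarm(
        **lambda_errors_alarm(function_name, topic_arn)
    )
```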

An alarm’s states #

State	Meaning
`OK`	Within threshold
`ALARM`	Threshold exceeded — action fires
`INSUFFICIENT_DATA`	Not enough data — newly created / metric not coming

Whether to treat INSUFFICIENT_DATA as an alarm (notify) is optional. It’s a state that often appears while evaluating a new alarm, so it’s generally ignored.

Composite Alarms #

Combine multiple alarms with AND / OR. A pattern like “ALB 5xx ≥ 1% AND CPU > 80%.”

Combination example

ALARM("alb-5xx") AND ALARM("ec2-high-cpu")

Effective at reducing false positives.
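As a sketch, the rule string can be assembled from existing alarm names and passed to the boto3 `put_composite_alarm` call. The helper name and the composite alarm's name are hypothetical.

```python
def composite_rule(*alarm_names: str, operator: str = "AND") -> str:
    """Join child alarms into a composite AlarmRule expression."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be AND or OR")
    return f" {operator} ".join(f'ALARM("{name}")' for name in alarm_names)


def create_composite(topic_arn: str) -> None:
    import boto3  # deferred so composite_rule works without AWS access

    boto3.client("cloudwatch").put_composite_alarm(
        AlarmName="service-degraded",  # hypothetical name
        AlarmRule=composite_rule("alb-5xx", "ec2-high-cpu"),
        AlarmActions=[topic_arn],
    )


print(composite_rule("alb-5xx", "ec2-high-cpu"))
# ALARM("alb-5xx") AND ALARM("ec2-high-cpu")
```

With this setup the notification action usually lives on the composite alarm only, so the child alarms can change state without paging anyone.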

Alarm actions #

Action	What
SNS Topic	Fan out to email / Slack / SMS / Lambda, etc.
EC2 Action	Instance stop / terminate / reboot / recover
Auto Scaling	ASG scale in / out
Systems Manager	Create an OpsItem

In practice, the bulk of alarm actions are SNS → Slack / email.

Anomaly Detection #

After automatically learning a baseline (band), it raises an alarm when outside it. Effective for metrics with patterns like traffic or CPU, with fewer false positives than a static threshold.
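An anomaly-detection alarm has no fixed threshold. Instead, it compares the metric with a band expression. A rough boto3 sketch follows; the band width of 2 standard deviations, the 5-minute period, and the 3 evaluation periods are assumptions to tune, and the function name is hypothetical.

```python
def anomaly_alarm(alarm_name: str, namespace: str, metric_name: str,
                  dimensions: list, stat: str = "Average",
                  band_width: int = 2) -> dict:
    """Build put_metric_alarm arguments that fire outside the learned band."""
    return {
        "AlarmName": alarm_name,
        "ComparisonOperator": "LessThanLowerOrGreaterThanUpperThreshold",
        "EvaluationPeriods": 3,
        "ThresholdMetricId": "band",  # points at the band expression below
        "Metrics": [
            {
                "Id": "m1",
                "ReturnData": True,
                "MetricStat": {
                    "Metric": {
                        "Namespace": namespace,
                        "MetricName": metric_name,
                        "Dimensions": dimensions,
                    },
                    "Period": 300,
                    "Stat": stat,
                },
            },
            {
                "Id": "band",
                "Expression": f"ANOMALY_DETECTION_BAND(m1, {band_width})",
                "Label": "expected range",
                "ReturnData": True,
            },
        ],
    }


def create_anomaly_alarm(**kwargs) -> None:
    import boto3  # deferred so the builder works without AWS access

    boto3.client("cloudwatch").put_metric_alarm(**anomaly_alarm(**kwargs))
```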

Most alarms go to an SNS Topic, and from there fan out to the next step.

Subscription	To where
Email	Email
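HTTPS	Any HTTPS endpoint (a Slack incoming webhook does not accept the SNS message format, so it needs a Lambda or AWS Chatbot in between)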
Lambda	To another step after processing
SMS	Phone (rarely)
SQS	To a queue

Slack integration — the Lambda pattern #

Either convert the SNS message in a Lambda and post it to a Slack webhook, or use AWS Chatbot (since renamed Amazon Q Developer in chat applications).

Lambda — SNS → Slack conversion

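import json, os, urllib.request

WEBHOOK = os.environ["SLACK_WEBHOOK"]  # Slack incoming-webhook URL

def handler(event, context):
    # Iterate rather than assume a single SNS record.
    for record in event["Records"]:
        # CloudWatch delivers the alarm state change as a JSON string in the SNS message.
        msg = json.loads(record["Sns"]["Message"])
        payload = json.dumps({
            "text": f"🚨 *{msg['AlarmName']}* — {msg['NewStateValue']}",
            # "blocks": [...]  # optional Block Kit layout goes here
        }).encode()
        req = urllib.request.Request(WEBHOOK, data=payload,
                                     headers={"Content-Type": "application/json"})
        # The timeout keeps a slow Slack response from stalling the function.
        with urllib.request.urlopen(req, timeout=5):
            pass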

Covered more deeply in Chapter 18 API Gateway and Lambda and Chapter 19 EventBridge / SQS / SNS.

CloudWatch Dashboards #

A page of widgets. Used for an at-a-glance view per team / service. You can define it in JSON and manage it as code.
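As a rough illustration of "dashboard as code," the sketch below builds a two-widget dashboard body and uploads it with the boto3 `put_dashboard` call. The layout, titles, and function names are assumptions made for this example.

```python
import json


def lambda_dashboard_body(function_name: str, region: str) -> str:
    """Return a dashboard body (JSON string) with one graph and one text widget."""
    widgets = [
        {
            "type": "metric", "x": 0, "y": 0, "width": 12, "height": 6,
            "properties": {
                "title": "Invocations / Errors",
                "region": region,
                "stat": "Sum",
                "period": 300,
                "metrics": [
                    ["AWS/Lambda", "Invocations", "FunctionName", function_name],
                    ["AWS/Lambda", "Errors", "FunctionName", function_name],
                ],
            },
        },
        {
            "type": "text", "x": 12, "y": 0, "width": 12, "height": 6,
            "properties": {"markdown": f"# {function_name}\nCheck errors first."},
        },
    ]
    return json.dumps({"widgets": widgets})


def publish_dashboard(name: str, function_name: str, region: str) -> None:
    import boto3  # deferred so the body builder works without AWS access

    boto3.client("cloudwatch").put_dashboard(
        DashboardName=name,
        DashboardBody=lambda_dashboard_body(function_name, region),
    )
```

Keeping the body in a function like this lets the same layout be stamped out per service and reviewed in version control.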

Commonly made dashboards #

Kind	What
Service dashboard	A service’s core metrics (requests, latency, errors, infra)
Infra dashboard	CPU / memory / network of EC2/RDS
Business dashboard	Business metrics like sign-ups / payments / orders
On-call dashboard	Active alarms / recent incidents / key indicators

Typical widget types:

Metric graphs (line, stacked, number)
Logs (Logs Insights query results)
Text (Markdown, for dashboard guidance)
Alarm status

A good dashboard is one where “looking at this one page for 30 seconds tells you the system’s state.”

Settings to turn on right after sign-up #

By the time Part 1 ends, the following settings should be in place.

Item	Where
Auto-apply retention to new log groups	Console / EventBridge + Lambda
Lambda Errors alarm	Per function
RDS FreeStorageSpace alarm	Per DB
ALB 5xx alarm	Per LB
Billing alarm (Chapter 3 cost management)	Account-level
GuardDuty findings alarm (Chapter 6 security basics)	Account-level

These six are the alerting foundation of a small operation.

Common pitfalls #

Log retention forever: the most common cost incident. Specify retention per log group and automate it for new groups. This single setting prevents more than half of CloudWatch cost incidents.
High-cardinality custom metrics: a dimension like Dimensions: [{Name: 'UserId', Value: user_id}] with 10,000 users creates at least 10,000 metrics (more once multiplied by other dimensions), which at $0.30 each comes to $3,000+ a month. For per-user analysis, logs (Logs Insights) are the answer.
Unlimited-time Logs Insights queries: querying a large log group without narrowing the time range scans, and bills for, every GB in that range. Always keep the time range narrow.
An alarm’s Period too short: an alarm set to 1 minute / 1 datapoint fires on every transient spike. About 5 minutes / 3 datapoints usually cuts the noise to a reasonable level.
Not attaching an alarm action: an alarm without SNS or another action just turns red in the console, and nobody notices. Attach the action when you create the alarm.
Making a dashboard but not looking at it: everyone looks right after it is made, then it is forgotten. Make it the first check item of on-call or the daily standup.
Not using Metric Math: Metric Math computes ratios / sums / transforms across multiple metrics, for example “5xx / total requests = error rate.” Used well, it makes dashboards and alarms much cleaner.
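The error-rate calculation from the Metric Math pitfall can be sketched with the boto3 `get_metric_data` call. The load balancer value and the query ids are placeholders. If no 5xx datapoints exist in a period, the expression has no value for that period.

```python
from datetime import datetime, timedelta, timezone


def alb_error_rate_queries(load_balancer: str, period: int = 300) -> list:
    """Two raw ALB metrics plus one expression that returns the 5xx percentage."""
    def raw(metric_id: str, metric_name: str) -> dict:
        return {
            "Id": metric_id,
            "ReturnData": False,  # inputs only; do not return them
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/ApplicationELB",
                    "MetricName": metric_name,
                    "Dimensions": [{"Name": "LoadBalancer", "Value": load_balancer}],
                },
                "Period": period,
                "Stat": "Sum",
            },
        }

    return [
        raw("errors", "HTTPCode_Target_5XX_Count"),
        raw("requests", "RequestCount"),
        {
            "Id": "error_rate",
            "Expression": "100 * errors / requests",
            "Label": "5xx error rate (%)",
            "ReturnData": True,
        },
    ]


def fetch_error_rate(load_balancer: str, hours: int = 1) -> list:
    import boto3  # deferred so the query builder works without AWS access

    end = datetime.now(timezone.utc)
    response = boto3.client("cloudwatch").get_metric_data(
        MetricDataQueries=alb_error_rate_queries(load_balancer),
        StartTime=end - timedelta(hours=hours),
        EndTime=end,
    )
    # Only the expression has ReturnData=True, so it is the single result.
    return response["MetricDataResults"][0]["Values"]
```

The same three-entry list also works as the `Metrics` argument of `put_metric_alarm`, which turns the ratio into an alarm on error rate rather than on raw error count.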

Exercises #

Read the bulk-apply script in §“Retention — the most important setting” as a dry-run in your environment, and explain in one paragraph how this setting prevents the “log flood” incident in Chapter 3 cost management.
From the commonly-viewed-metrics table in §“CloudWatch Metrics,” pick the Lambda and ALB rows. Using the §“Settings to turn on right after sign-up” table, write down which metric in each row you would put an alarm on to catch the first operational incident.
Based on the §“High-cardinality custom metrics” pitfall, explain from a cost perspective why you should use Logs Insights instead of a custom metric when you want to track per-user order counts.

In short: CloudWatch is the eye of operations, with Logs · Metrics · Alarms · Dashboards forming one flow from logs → metrics → alarms → dashboards. If you do not specify log retention, cost can explode, so that setting belongs right after sign-up. For metrics, dimensions drive cost, so high-cardinality dimensions are forbidden, and an alarm is meaningful only if you attach an SNS action.

Next chapter #

With this, Part 1 Getting Started with AWS ends. Console / account / IAM / cost / CLI / SSO / security / CloudWatch — the toolbox you need to start something on AWS came together in one place. Now it’s time to make real resources. In Chapter 8 EC2 and VPC, the first chapter of Part 2, we sort out in one stroke the structure of the virtual machine EC2 and the virtual network VPC it lives in — Subnet / Internet Gateway / Route Table / Security Group / NACL.