Chapter 7

CloudWatch Intro — Logs / Metrics

This chapter covers the structure of CloudWatch Logs / Metrics / Alarms / Dashboards, log groups and retention, Metric Filters, and the basics of Logs Insights queries. CloudWatch is the observability tool that serves as the eyes of all operations.

Chapters 2 through 6 laid the foundation of an AWS setup: accounts, IAM, cost, CLI, SSO, and security gave us the mental map and the day-to-day configuration needed before working in the console. Now we turn to the other axis of operations: the tool for seeing what is running, what it is doing, and where.

CloudWatch is AWS’s observability standard. Almost every service in AWS sends metrics to CloudWatch by default, and CloudWatch Logs receives the logs too. Operations’ first field of view starts here.

This chapter, the last of Part 1, covers CloudWatch’s four components (Logs / Metrics / Alarms / Dashboards) in one pass. Deeper observability, including distributed tracing, continues in Chapter 26 (monitoring and X-Ray).

The big picture — CloudWatch’s four components #

Component | What | Common use
Logs | Text log storage / search | Logs of EC2 / Lambda / ECS / API Gateway
Metrics | Time-series numbers (CPU%, request count, etc.) | Every AWS service sends automatically
Alarms | Notify / act when a metric crosses a threshold | Operational alerts, auto-scaling
Dashboards | A page of graphs / widgets | An at-a-glance view per team / service

These four are woven into one flow. Logs → Metrics → Alarms → Dashboards.

CloudWatch Logs #

Log groups and log streams #

Structure
Log Group           — usually per application / service
  └── Log Stream    — usually per process / container
        └── Log Event — a single entry
Item | Example
Log Group | /aws/lambda/my-function, /ecs/my-service, /var/log/myapp
Log Stream | Lambda execution-environment ID, ECS Task ID, EC2 instance ID
Log Event | One line of text + timestamp

Lambda sends its logs to CloudWatch Logs automatically, and ECS Fargate does so once the awslogs log driver is set in the container definition. EC2 needs the CloudWatch Agent or an auxiliary tool (fluent-bit, etc.).

Retention — the most important setting #

By default, log events never expire. Leave that as-is and logs accumulate indefinitely, and the storage cost grows with them. Set retention right after sign-up, and again for every new log group.

Item | Recommended retention
General application logs | 30 to 90 days
Debug / development logs | 7 days
Security / audit logs (CloudTrail) | 1 to 7 years (cheaper when sent to S3)
Lambda logs | 14 to 30 days
Set retention
aws logs put-retention-policy \
  --log-group-name /aws/lambda/my-function \
  --retention-in-days 30
Apply to all log groups in bulk
# Caution: this overwrites any retention already set, including longer audit-log policies.
aws logs describe-log-groups --query 'logGroups[].logGroupName' --output text \
  | tr '\t' '\n' \
  | while read -r name; do
      aws logs put-retention-policy --log-group-name "$name" --retention-in-days 30
    done

This setting prevents more than half of CloudWatch cost incidents (the log-flood incident in Chapter 3 cost management is resolved by this one setting).
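The bulk script sets 30 days on every group, including any that already carry a longer policy. As a hedged alternative, here is a minimal Python sketch that only touches groups with no retention at all and defaults to a dry run. It assumes a boto3 `logs` client is passed in; the function names are illustrative, not part of any AWS tooling.

```python
from typing import Any, Iterator


def groups_without_retention(logs: Any) -> Iterator[str]:
    """Yield the names of log groups that have no retention policy."""
    paginator = logs.get_paginator("describe_log_groups")
    for page in paginator.paginate():
        for group in page.get("logGroups", []):
            # Groups set to "never expire" carry no retentionInDays key.
            if "retentionInDays" not in group:
                yield group["logGroupName"]


def apply_default_retention(logs: Any, days: int = 30, dry_run: bool = True) -> list[str]:
    """Set `days` of retention on groups that have none; return their names."""
    names = list(groups_without_retention(logs))
    if not dry_run:
        for name in names:
            logs.put_retention_policy(logGroupName=name, retentionInDays=days)
    return names
```

With real credentials, `apply_default_retention(boto3.client("logs"))` lists what would change, and passing `dry_run=False` applies it.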

Auto retention for new log groups #

There are two ways to apply retention automatically to newly created log groups. The first is with EventBridge + Lambda, applying it automatically on receiving a CreateLogGroup event, which is often used in practice. The second is enforcing it with a policy that denies creation if retention isn’t specified, which is slightly overkill.
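The EventBridge + Lambda approach can be sketched as follows. This is a minimal sketch under stated assumptions: CloudTrail is recording management events so that the CreateLogGroup call reaches EventBridge as an "AWS API Call via CloudTrail" event, and the rule matches `eventName` = `CreateLogGroup`. The `RETENTION_DAYS` environment variable and the optional `logs` parameter are illustrative additions.

```python
import os
from typing import Any, Optional

# Hypothetical environment variable; falls back to 30 days.
RETENTION_DAYS = int(os.environ.get("RETENTION_DAYS", "30"))


def log_group_from_event(event: dict) -> Optional[str]:
    """Extract the new log group's name from a CloudTrail-via-EventBridge event."""
    detail = event.get("detail", {})
    if detail.get("eventName") != "CreateLogGroup":
        return None
    return (detail.get("requestParameters") or {}).get("logGroupName")


def handler(event: dict, context: Any, logs: Any = None) -> dict:
    name = log_group_from_event(event)
    if name is None:
        return {"applied": False}
    if logs is None:
        import boto3  # imported lazily so the parsing logic runs without AWS
        logs = boto3.client("logs")
    logs.put_retention_policy(logGroupName=name, retentionInDays=RETENTION_DAYS)
    return {"applied": True, "logGroupName": name, "retentionInDays": RETENTION_DAYS}
```

The function's role needs `logs:PutRetentionPolicy`; events without a log group name are ignored rather than failing the invocation.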

Sending Lambda logs #

Lambda — just print
import logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)

def handler(event, context):
    logger.info("Received event: %s", event)
    return {"ok": True}

stdout / stderr flow automatically to CloudWatch Logs. The log group name is /aws/lambda/<function-name>.

EC2 / ECS — CloudWatch Agent #

EC2 isn’t automatic. It needs a CloudWatch Agent install.

Amazon Linux 2023 / Ubuntu
sudo yum install -y amazon-cloudwatch-agent       # AL
# or
wget https://s3.amazonaws.com/amazoncloudwatch-agent/ubuntu/amd64/latest/amazon-cloudwatch-agent.deb
sudo dpkg -i amazon-cloudwatch-agent.deb

The config file is /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json.

Simple config
{
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/myapp/*.log",
            "log_group_name": "/myapp/server",
            "log_stream_name": "{instance_id}",
            "retention_in_days": 30
          }
        ]
      }
    }
  },
  "metrics": {
    "metrics_collected": {
      "mem":  { "measurement": ["mem_used_percent"] },
      "disk": { "measurement": ["used_percent"], "resources": ["*"] }
    }
  }
}

ECS Fargate sends logs automatically once you specify the awslogs log driver in the container definition. This is covered in detail in Chapter 15 ECS and Fargate.

Logs Insights — search with queries #

The search / analysis tool of CloudWatch Logs. It uses its own query syntax, with commands chained by the pipe character (|).

The simplest query
fields @timestamp, @message
| sort @timestamp desc
| limit 100
ERROR only
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 50
Request-time distribution (Lambda)
fields @timestamp, @duration
| filter @type = "REPORT"
| stats avg(@duration), max(@duration), count(*) by bin(5m)
API Gateway: 5xx count by path (assumes JSON access logs with status and path fields)
fields @timestamp, status, path
| filter status >= 500
| stats count(*) as errors by path
| sort errors desc

Commonly used commands:

Command | What
fields | Fields to show
filter | Condition filter
parse | Extract fields from a string
stats | Aggregate (count, avg, max, pct)
sort | Sort
limit | Max number of results
bin(5m) | Time bucket (a function used inside stats ... by)

Logs Insights cost caution #

You’re charged per GB scanned at query time (~$0.005/GB). Querying a large log group with an unlimited time range causes a cost incident. Always keep the time range narrow.
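To make the time range explicit, a query can also be started from code. The sketch below assumes a boto3 `logs` client is passed in and uses the `start_query` / `get_query_results` calls. The function names and the one-hour default are illustrative, and the price constant is the approximate rate quoted above, which varies by region.

```python
import time
from typing import Any, Optional

PRICE_PER_GB_USD = 0.005  # approximate rate quoted above; check your region


def scan_cost_usd(bytes_scanned: float, price_per_gb: float = PRICE_PER_GB_USD) -> float:
    """Estimate the charge for a query from the bytes it scanned."""
    return bytes_scanned / 1024**3 * price_per_gb


def run_bounded_query(logs: Any, log_group: str, query: str, hours: int = 1,
                      now: Optional[int] = None, poll_seconds: float = 1.0) -> dict:
    """Run a Logs Insights query over the last `hours` only and report its cost."""
    end = int(now if now is not None else time.time())
    start = end - hours * 3600  # the narrow window is what keeps the scan small
    query_id = logs.start_query(logGroupName=log_group, startTime=start,
                                endTime=end, queryString=query)["queryId"]
    while True:
        resp = logs.get_query_results(queryId=query_id)
        if resp["status"] not in ("Scheduled", "Running"):
            break
        time.sleep(poll_seconds)
    # Each result row is a list of {"field": ..., "value": ...} cells.
    rows = [{cell["field"]: cell["value"] for cell in row}
            for row in resp.get("results", [])]
    scanned = resp.get("statistics", {}).get("bytesScanned", 0.0)
    return {"status": resp["status"], "rows": rows, "cost_usd": scan_cost_usd(scanned)}
```

Logging `cost_usd` after each run makes an accidentally wide query visible immediately rather than on the bill.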

CloudWatch Metrics #

A metric is a time-series number. Almost every AWS service sends metrics automatically.

Commonly viewed metrics #

Service | Commonly viewed metrics
EC2 | CPUUtilization, NetworkIn/Out, DiskReadOps
RDS | CPUUtilization, DatabaseConnections, FreeStorageSpace, ReadLatency
Lambda | Invocations, Errors, Duration, Throttles, ConcurrentExecutions
ECS | CPUUtilization, MemoryUtilization (per Service / Task)
ALB | RequestCount, TargetResponseTime, HTTPCode_Target_5XX_Count
API Gateway | Count, Latency, 4XXError, 5XXError
S3 | BucketSizeBytes, NumberOfObjects (once a day)
DynamoDB | ConsumedReadCapacityUnits / ConsumedWriteCapacityUnits, ThrottledRequests

A metric’s dimensions #

The same metric is split by dimensions.

Example: CPUUtilization
Namespace: AWS/EC2
Metric:  CPUUtilization
Dimensions:
  - InstanceId: i-1234567890
  - InstanceId: i-2345678901
  - InstanceId: i-3456789012

Each distinct combination of dimension values is a separate metric. Custom metrics are billed per metric, so if dimension values explode you get a cost incident.

Statistic #

Statistic | What
Sum | Total (Invocations, RequestCount)
Average | Mean (CPU, Latency)
Maximum | Max (spike detection)
Minimum | Min
p95 / p99 | Percentile (Latency)
SampleCount | Number of data points

In most cases Average + p95 are meaningful. p99 / p99.9 directly affect the SLA and user experience.
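A small worked example shows why the tail percentiles matter alongside the average. The latencies are made-up sample data, and the nearest-rank percentile here is a simplification for illustration; CloudWatch computes its own percentiles server-side and may differ slightly.

```python
def percentile(values: list[float], p: int) -> float:
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, (p * len(ordered) + 99) // 100)  # ceil(p * n / 100)
    return ordered[rank - 1]


# Made-up sample: 95 fast requests and 5 very slow ones.
latencies_ms = [100] * 95 + [2000] * 5

average = sum(latencies_ms) / len(latencies_ms)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)

print(average, p95, p99)  # 195.0 100 2000
```

The average (195 ms) looks healthy and p95 still reads 100 ms, yet one request in twenty takes two seconds. Only p99 surfaces that, which is why the tail percentiles are the ones tied to the SLA.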

Standard vs high-resolution #

Kind | Resolution | Cost
Standard | 1 minute | Standard
High-resolution | 1 second | Expensive; only for short spikes

Standard 1-minute is plenty in most cases.

Sending custom metrics #

The application sends directly.

Python — boto3
import boto3

cloudwatch = boto3.client("cloudwatch")
cloudwatch.put_metric_data(
    Namespace="MyApp",
    MetricData=[{
        "MetricName": "OrderCreated",
        "Value": 1,
        "Unit": "Count",
        "Dimensions": [
            {"Name": "Environment", "Value": "prod"},
            {"Name": "Region", "Value": "ap-northeast-2"},
        ],
    }],
)

The cost is $0.30 / month per metric, separate per combination of dimensions. Never use high-cardinality dimensions like a user ID.
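The arithmetic behind that warning is short enough to write down. This is a rough upper-bound estimate: it assumes every combination of dimension values actually occurs, and it uses the $0.30 figure above as a flat rate. The function name is illustrative.

```python
from math import prod

PRICE_PER_METRIC_USD = 0.30  # the per-metric monthly rate quoted above


def monthly_metric_cost(metric_names: int, cardinalities: dict[str, int],
                        price: float = PRICE_PER_METRIC_USD) -> float:
    """Upper-bound monthly cost: names x product of dimension cardinalities x price."""
    combinations = prod(cardinalities.values())
    return metric_names * combinations * price


# Environment (3 values) x Region (2 values): 6 metrics, under $2 a month.
small = monthly_metric_cost(1, {"Environment": 3, "Region": 2})

# Adding UserId with 10,000 values multiplies that by 10,000.
large = monthly_metric_cost(1, {"Environment": 3, "Region": 2, "UserId": 10_000})
```

Each added dimension multiplies the metric count instead of adding to it, which is why a single high-cardinality dimension dominates the bill.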

EMF (Embedded Metric Format) — a pattern for Lambda #

In Lambda, the put_metric_data call itself is a cost and latency burden. Write a specific JSON format to the log and CloudWatch automatically converts it to a metric.

EMF format
import json
import time

# One JSON line on stdout; CloudWatch extracts the metric from the log event.
print(json.dumps({
    "_aws": {
        "Timestamp": int(time.time() * 1000),  # epoch milliseconds
        "CloudWatchMetrics": [{
            "Namespace": "MyApp",
            "Dimensions": [["Environment"]],
            "Metrics": [{"Name": "OrderCreated", "Unit": "Count"}],
        }],
    },
    "Environment": "prod",
    "OrderCreated": 1,
}))

An SDK like aws-embedded-metrics-python helps with this more cleanly.
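If you prefer not to add a dependency, the format is small enough to build by hand. The helper below is a minimal sketch, not the SDK's API: `emf_line` and its argument shapes are illustrative, and it emits a single dimension set containing all the dimensions given.

```python
import json
import time
from typing import Optional


def emf_line(namespace: str, metrics: dict, dimensions: dict,
             now_ms: Optional[int] = None) -> str:
    """Build one EMF log line.

    metrics:    {"OrderCreated": (1, "Count")}  name -> (value, unit)
    dimensions: {"Environment": "prod"}         name -> value
    """
    record = {
        "_aws": {
            "Timestamp": now_ms if now_ms is not None else int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                "Dimensions": [list(dimensions)],  # one set with every dimension name
                "Metrics": [{"Name": name, "Unit": unit}
                            for name, (_, unit) in metrics.items()],
            }],
        },
    }
    # Dimension and metric values live at the top level of the record.
    record.update(dimensions)
    record.update({name: value for name, (value, _) in metrics.items()})
    return json.dumps(record)
```

Inside a Lambda handler, `print(emf_line("MyApp", {"OrderCreated": (1, "Count")}, {"Environment": "prod"}))` is the whole integration.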

Metric Filter — making metrics from logs #

Convert information already in logs (an ERROR occurred, response time, etc.) into a metric.

Making it in the console
CloudWatch → Log groups → select a group → Metric filters → Create
- Filter pattern: ERROR
- Metric namespace: MyApp
- Metric name: ErrorCount
- Metric value: 1

Now every time an ERROR occurs, the metric goes +1. You can use it in alarms or dashboards.

Filter pattern examples
ERROR                       # contains the word ERROR
[..., level="ERROR", ...]   # a field of a structured log
{ $.level = "ERROR" }       # a key of a JSON log
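The console steps above map onto a single API call. The sketch below builds the arguments for boto3's `put_metric_filter`. The filter name `error-count` and the `defaultValue` of 0 are assumptions added for illustration, and the JSON pattern assumes the application writes structured logs with a `level` key.

```python
def error_count_filter(log_group: str, pattern: str = '{ $.level = "ERROR" }') -> dict:
    """Arguments for logs.put_metric_filter that count ERROR events as MyApp/ErrorCount."""
    return {
        "logGroupName": log_group,
        "filterName": "error-count",  # hypothetical name
        "filterPattern": pattern,
        "metricTransformations": [{
            "metricNamespace": "MyApp",
            "metricName": "ErrorCount",
            "metricValue": "1",   # the API takes this as a string
            "defaultValue": 0.0,  # reported when ingested events do not match
        }],
    }
```

Usage with real credentials would be `boto3.client("logs").put_metric_filter(**error_count_filter("/aws/lambda/my-function"))`. For plain-text logs, pass `pattern="ERROR"` instead.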

CloudWatch Alarms #

An alarm acts when a metric crosses a threshold. It is the home base of alerting.

A first alarm — Lambda Errors #

Console
CloudWatch → Alarms → Create alarm
- Metric: AWS/Lambda → Errors → Function: my-function
- Statistic: Sum
- Period: 1 minute
- Threshold: > 0 for 1 out of 5 datapoints (any 1 minute within the last 5)
- Action: SNS → notification topic
- Name: lambda-my-function-errors
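The same alarm can be expressed as code so it is created per function without clicking. This sketch builds the arguments for boto3's `put_metric_alarm`. The topic ARN in the usage note is a placeholder, and `TreatMissingData="notBreaching"` is an assumption added here so that periods with no invocations do not count against the alarm.

```python
def lambda_errors_alarm(function_name: str, topic_arn: str) -> dict:
    """Arguments for cloudwatch.put_metric_alarm mirroring the console steps."""
    return {
        "AlarmName": f"lambda-{function_name}-errors",
        "Namespace": "AWS/Lambda",
        "MetricName": "Errors",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": 60,                # 1-minute datapoints
        "EvaluationPeriods": 5,      # look at the last 5 datapoints
        "DatapointsToAlarm": 1,      # 1 breaching datapoint is enough
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # assumption: no data means no errors
        "AlarmActions": [topic_arn],
    }
```

Usage: `boto3.client("cloudwatch").put_metric_alarm(**lambda_errors_alarm("my-function", "arn:aws:sns:ap-northeast-2:123456789012:ops-alerts"))`, where the ARN is hypothetical.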

An alarm’s states #

State | Meaning
OK | Within threshold
ALARM | Threshold exceeded; action fires
INSUFFICIENT_DATA | Not enough data (newly created, or the metric is not arriving)

Whether to treat INSUFFICIENT_DATA as an alarm (notify) is optional. It’s a state that often appears while evaluating a new alarm, so it’s generally ignored.

Composite Alarms #

Combine multiple alarms with AND / OR. A pattern like “ALB 5xx ≥ 1% AND CPU > 80%.”

Combination example
ALARM("alb-5xx") AND ALARM("ec2-high-cpu")

Effective at reducing false positives.
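As a sketch, the rule string and the arguments for boto3's `put_composite_alarm` can be built like this. The helper names and the alarm name in the usage note are illustrative; the child alarms are assumed to exist already.

```python
def all_in_alarm(*alarm_names: str) -> str:
    """Rule that fires only when every named child alarm is in ALARM."""
    return " AND ".join(f'ALARM("{name}")' for name in alarm_names)


def composite_alarm(name: str, rule: str, topic_arn: str) -> dict:
    """Arguments for cloudwatch.put_composite_alarm."""
    return {"AlarmName": name, "AlarmRule": rule, "AlarmActions": [topic_arn]}


rule = all_in_alarm("alb-5xx", "ec2-high-cpu")
print(rule)  # ALARM("alb-5xx") AND ALARM("ec2-high-cpu")
```

A common arrangement is to attach the SNS action only to the composite alarm and leave the child alarms without actions, so a single page goes out instead of one per child.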

Alarm actions #

Action | What
SNS Topic | Fan out to email / Slack / SMS / Lambda, etc.
EC2 Action | Instance stop / terminate / reboot / recover
Auto Scaling | ASG scale in / out
Systems Manager | Create an OpsItem

In most operations, nearly every alarm action is SNS → Slack / email.

Anomaly Detection #

Anomaly detection learns a baseline band from the metric’s history and raises an alarm when the metric leaves that band. It works well for metrics with recurring patterns, such as traffic or CPU, and produces fewer false positives than a static threshold.
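An anomaly-detection alarm uses the same `put_metric_alarm` call as a static one, but compares the metric against a band expression instead of a fixed number. The sketch below builds those arguments. The band width of 2, the 5-minute period, and the three evaluation periods are illustrative choices, not recommendations from the text.

```python
def anomaly_alarm(name: str, namespace: str, metric_name: str,
                  dimensions: list[dict], band_width: int = 2,
                  period: int = 300) -> dict:
    """Arguments for cloudwatch.put_metric_alarm using an anomaly detection band."""
    return {
        "AlarmName": name,
        "ComparisonOperator": "GreaterThanUpperThreshold",  # alarm above the band only
        "EvaluationPeriods": 3,
        "ThresholdMetricId": "band",  # compare against the band, not a fixed number
        "Metrics": [
            {"Id": "m1", "ReturnData": True,
             "MetricStat": {
                 "Metric": {"Namespace": namespace, "MetricName": metric_name,
                            "Dimensions": dimensions},
                 "Period": period, "Stat": "Average"}},
            {"Id": "band", "ReturnData": True, "Label": "expected range",
             "Expression": f"ANOMALY_DETECTION_BAND(m1, {band_width})"},
        ],
    }
```

A wider band (a larger second argument) tolerates more deviation before alarming. Add `AlarmActions` the same way as for a static alarm.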

SNS integration — how to send alerts #

Most alarms go to an SNS Topic, and from there fan out to the next step.

Subscription | To where
Email | Email
HTTPS | An HTTPS endpoint (a Slack incoming webhook needs a Lambda in between to reshape the payload)
Lambda | To another step after processing
SMS | Phone (rarely)
SQS | To a queue

Slack integration — the Lambda pattern #

Either post to a Slack incoming webhook from a Lambda subscribed to the topic, or use AWS Chatbot.

Lambda — SNS → Slack conversion
import json, os, urllib.request

WEBHOOK = os.environ["SLACK_WEBHOOK"]

def handler(event, context):
    # The CloudWatch alarm arrives as a JSON string inside the SNS record.
    msg = json.loads(event["Records"][0]["Sns"]["Message"])
    payload = json.dumps({
        "text": f"🚨 *{msg['AlarmName']}*: {msg['NewStateValue']}",
        # "blocks": [...]  # optional Block Kit layout, elided here
    }).encode()
    req = urllib.request.Request(WEBHOOK, data=payload,
                                  headers={"Content-Type": "application/json"})
    # Time out instead of hanging until the Lambda itself times out.
    urllib.request.urlopen(req, timeout=5)

Covered more deeply in Chapter 18 API Gateway and Lambda and Chapter 19 EventBridge / SQS / SNS.

CloudWatch Dashboards #

A page of widgets. Used for an at-a-glance view per team / service. You can define it in JSON and manage it as code.
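As a minimal sketch of "dashboard as code", the function below builds a dashboard body with one metric widget and one text widget, ready to pass to boto3's `put_dashboard`. The layout numbers, titles, and dashboard name are illustrative.

```python
import json


def service_dashboard(function_name: str, region: str) -> str:
    """Return a dashboard body (JSON string) for cloudwatch.put_dashboard."""
    widgets = [
        {   # Left half: Lambda errors, summed over 5-minute periods.
            "type": "metric", "x": 0, "y": 0, "width": 12, "height": 6,
            "properties": {
                "title": "Lambda errors",
                "region": region,
                "stat": "Sum",
                "period": 300,
                "metrics": [["AWS/Lambda", "Errors", "FunctionName", function_name]],
            },
        },
        {   # Right half: Markdown guidance for whoever is on call.
            "type": "text", "x": 12, "y": 0, "width": 12, "height": 6,
            "properties": {"markdown": f"# {function_name}\nCheck errors first."},
        },
    ]
    return json.dumps({"widgets": widgets})
```

Usage: `boto3.client("cloudwatch").put_dashboard(DashboardName="my-service", DashboardBody=service_dashboard("my-function", "ap-northeast-2"))`. Keeping the function in version control gives the dashboard a review history.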

Commonly made dashboards #

Kind | What
Service dashboard | A service’s core metrics (requests, latency, errors, infra)
Infra dashboard | CPU / memory / network of EC2/RDS
Business dashboard | Business metrics like sign-ups / payments / orders
On-call dashboard | Active alarms / recent incidents / key indicators

Widget kinds #

  • Metric graphs (line, stacked, number)
  • Logs (Logs Insights query results)
  • Text (Markdown — dashboard guidance)
  • Alarm status

A good dashboard is one where “looking at this one page for 30 seconds tells you the system’s state.”

Settings to turn on right after sign-up #

By the time Part 1 ends, the following settings should be in place.

Item | Where
Auto-apply retention to new log groups | Console / EventBridge + Lambda
Lambda Errors alarm | Per function
RDS FreeStorageSpace alarm | Per DB
ALB 5xx alarm | Per LB
Billing alarm (Chapter 3 cost management) | Account-level
GuardDuty findings alarm (Chapter 6 security basics) | Account-level

These six are the alerting foundation of a small operation.

Common pitfalls #

  • Log retention forever: the most common cost incident. Specify retention per log group and automate it for new groups. This single setting saves more than half the cost.
  • High-cardinality custom metrics: set something like Dimensions: [{Name: 'UserId', Value: user_id}] and 10,000 users become 10,000 separate metrics, which at $0.30 each is $3,000+ a month. For per-user analysis, logs (Logs Insights) are the answer.
  • Unlimited-time Logs Insights queries: running a query on a large log group without a time range incurs GB-scale scan cost. Always keep the time range narrow.
  • An alarm’s Period too short: an alarm set to 1 minute / 1 datapoint fires on every transient spike. About 5 minutes / 3 datapoints usually reduces the noise to a reasonable level.
  • Not attaching an alarm action: an alarm with no SNS topic or other action just turns red in the console, and nobody knows. Attach the action when you create the alarm.
  • Making a dashboard but not looking at it: everyone looks right after it is made, then it is forgotten. Make it the first check item of on-call or the daily standup.
  • Not using Metric Math: Metric Math computes ratios / sums / transforms across multiple metrics, for example “5xx / total requests = error rate.” Used well, it makes dashboards and alarms much cleaner.
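The error-rate calculation from the last pitfall can be sketched as a set of queries for boto3's `get_metric_data`. The query IDs and label are illustrative, and the `LoadBalancer` dimension value in the usage note is a placeholder.

```python
def error_rate_queries(load_balancer: str, period: int = 300) -> list[dict]:
    """MetricDataQueries computing ALB 5xx error rate (%) with Metric Math."""
    def alb_sum(query_id: str, metric_name: str) -> dict:
        return {
            "Id": query_id,
            "ReturnData": False,  # inputs only; hide them from the result
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/ApplicationELB",
                    "MetricName": metric_name,
                    "Dimensions": [{"Name": "LoadBalancer", "Value": load_balancer}],
                },
                "Period": period,
                "Stat": "Sum",
            },
        }

    return [
        alb_sum("errors", "HTTPCode_Target_5XX_Count"),
        alb_sum("requests", "RequestCount"),
        {   # The Metric Math expression refers to the two inputs by Id.
            "Id": "error_rate",
            "Expression": "100 * errors / requests",
            "Label": "5xx error rate (%)",
            "ReturnData": True,
        },
    ]
```

Usage: pass the list as `MetricDataQueries` to `boto3.client("cloudwatch").get_metric_data(...)` together with `StartTime` and `EndTime`, using a `LoadBalancer` value such as the hypothetical `app/my-alb/0123456789abcdef`. The same three entries also work as the `Metrics` argument of `put_metric_alarm`, so the alarm can be on the ratio rather than on a raw count.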

Exercises #

  1. Read the bulk-apply script in §“Retention — the most important setting” as a dry-run in your environment, and explain in one paragraph how this setting prevents the “log flood” incident in Chapter 3 cost management.
  2. From the commonly-viewed-metrics table in §“CloudWatch Metrics,” take the Lambda and ALB rows and, referring to the table in §“Settings to turn on right after sign-up,” write down which metric you should alarm on for each in order to catch the first operational incident.
  3. Based on the §“High-cardinality custom metrics” pitfall, explain from a cost perspective why you should use Logs Insights instead of a custom metric when you want to track per-user order counts.

In short: CloudWatch is the eye of operations, with Logs · Metrics · Alarms · Dashboards forming one flow from logs → metrics → alarms → dashboards. If you do not specify log retention, cost can explode, so that setting belongs right after sign-up. For metrics, dimensions drive cost, so high-cardinality dimensions are forbidden, and an alarm is meaningful only if you attach an SNS action.

Next chapter #

With this, Part 1 Getting Started with AWS ends. Console / account / IAM / cost / CLI / SSO / security / CloudWatch — the toolbox you need to start something on AWS came together in one place. Now it’s time to make real resources. In Chapter 8 EC2 and VPC, the first chapter of Part 2, we sort out in one stroke the structure of the virtual machine EC2 and the virtual network VPC it lives in — Subnet / Internet Gateway / Route Table / Security Group / NACL.
