AWS Basics #7: CloudWatch Intro — Logs and Metrics

Infrastructure AWS CloudWatch Logs Monitoring

Friday, April 17, 2026

#1 through #6 gave us the AWS setup foundation. Now for the other axis of operations — knowing what’s running where, and what it’s doing.

CloudWatch is AWS’s built-in observability service. Almost every AWS service emits metrics into CloudWatch by default, and most can send their logs to CloudWatch Logs. That’s where visibility into production begins.

This post covers CloudWatch’s four components — Logs / Metrics / Alarms / Dashboards — in one go.

Big picture — the four components of CloudWatch #

Component	What it is	Common use
Logs	Store / search text logs	Logs from EC2 / Lambda / ECS / API Gateway
Metrics	Time-series numbers (CPU%, request counts, etc.)	Every AWS service auto-emits
Alarms	Alert / act when a metric crosses a threshold	Production alerts, autoscaling
Dashboards	Pages of graphs / widgets	At-a-glance per team / service

The four interlock: logs → metrics → alarms → dashboards.

CloudWatch Logs #

Log group and log stream #

Structure

Log Group           — usually one application / service
  └── Log Stream    — usually one process / container
        └── Log Event — one line

Item	Example
Log Group	`/aws/lambda/my-function`, `/ecs/my-service`, `/var/log/myapp`
Log Stream	Lambda execution-environment ID, ECS Task ID, EC2 instance ID
Log Event	One line of text + timestamp

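Lambda sends logs to CloudWatch Logs automatically, as long as the function’s execution role allows it. ECS Fargate does the same once the container definition uses the awslogs log driver. EC2 needs the CloudWatch Agent or another log shipper (fluent-bit, etc.).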

Retention — the most important setting #

By default a log group never expires its events. Leave it alone and logs pile up indefinitely → storage cost grows without bound. Set retention right after signup and on every new log group.

Item	Recommended retention
General application logs	30–90 days
Debug / development logs	7 days
Security / audit logs (CloudTrail)	1–7 years (cheaper to ship to S3)
Lambda logs	14–30 days

Set retention

aws logs put-retention-policy \
  --log-group-name /aws/lambda/my-function \
  --retention-in-days 30

Apply to every log group at once

aws logs describe-log-groups --query 'logGroups[].logGroupName' --output text \
  | tr '\t' '\n' \
  | while read -r name; do
      aws logs put-retention-policy --log-group-name "$name" --retention-in-days 30
    done

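On an account that has never set retention, this short loop is often the largest single CloudWatch saving; how much it saves depends on how much old log data has piled up. Note that it overwrites every group, including ones that deliberately keep logs longer (audit logs, for example), so exclude those first.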
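A blanket loop also resets groups that already have a longer retention on purpose. The following is a minimal sketch of a gentler variant, assuming boto3 is installed and credentials are configured. `DEFAULT_DAYS` and both helper names are illustrative, not part of any AWS tooling. It only touches groups that currently never expire.

```python
DEFAULT_DAYS = 30  # assumption: the default you want for unmanaged groups


def groups_missing_retention(log_groups):
    """Return the names of log groups that have no retention policy.

    describe_log_groups leaves out the 'retentionInDays' key for groups
    that are set to never expire.
    """
    return [g["logGroupName"] for g in log_groups if "retentionInDays" not in g]


def apply_default_retention(logs_client, days=DEFAULT_DAYS):
    """Set retention only on groups that currently never expire."""
    updated = []
    paginator = logs_client.get_paginator("describe_log_groups")
    for page in paginator.paginate():
        for name in groups_missing_retention(page["logGroups"]):
            logs_client.put_retention_policy(logGroupName=name, retentionInDays=days)
            updated.append(name)
    return updated


# Usage (needs boto3 and AWS credentials):
#   import boto3
#   print(apply_default_retention(boto3.client("logs")))
```

The client is passed in rather than created inside the function, so the selection logic can be checked against a stub before it runs on a real account.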

Auto-applying retention to new log groups #

Two approaches.

Approach 1: EventBridge + Lambda — react to CreateLogGroup events and apply automatically (common in production).

Approach 2: Force via policy — deny creation if retention isn’t set explicitly (slightly heavy-handed).
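Approach 1 can be sketched as a small Lambda behind an EventBridge rule. This is an illustration under assumptions rather than a production implementation: it assumes a CloudTrail trail is recording management events (EventBridge needs that to deliver API-call events), and the `RETENTION_DAYS` environment variable and the `logs_client` parameter are hypothetical conveniences.

```python
import os

# Assumption: the default comes from a hypothetical RETENTION_DAYS env var.
RETENTION_DAYS = int(os.environ.get("RETENTION_DAYS", "30"))

# EventBridge rule pattern that matches CreateLogGroup calls seen by CloudTrail.
EVENT_PATTERN = {
    "source": ["aws.logs"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventSource": ["logs.amazonaws.com"],
        "eventName": ["CreateLogGroup"],
    },
}


def handler(event, context, logs_client=None):
    """Apply the default retention to the log group named in the event."""
    if logs_client is None:
        import boto3  # bundled with the Lambda Python runtime
        logs_client = boto3.client("logs")
    name = event["detail"]["requestParameters"]["logGroupName"]
    logs_client.put_retention_policy(logGroupName=name, retentionInDays=RETENTION_DAYS)
    return {"logGroupName": name, "retentionInDays": RETENTION_DAYS}
```

The function’s role needs `logs:PutRetentionPolicy`. The optional third parameter exists only so the handler can be exercised with a stub client; Lambda itself calls it with `event` and `context`.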

Lambda log shipping #

Lambda — just print

import logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)

def handler(event, context):
    logger.info("Received event: %s", event)
    return {"ok": True}

stdout / stderr flows automatically into CloudWatch Logs. The log group is named /aws/lambda/<function-name>.

EC2 / ECS — CloudWatch Agent #

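EC2 isn’t automatic. Install the CloudWatch Agent and give the instance an IAM role that lets it write logs and metrics (the AWS-managed CloudWatchAgentServerPolicy covers this).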

Amazon Linux 2023 / Ubuntu

sudo yum install -y amazon-cloudwatch-agent       # AL
# or
wget https://s3.amazonaws.com/amazoncloudwatch-agent/ubuntu/amd64/latest/amazon-cloudwatch-agent.deb
sudo dpkg -i amazon-cloudwatch-agent.deb

Configure at /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json:

Simple config

{
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/myapp/*.log",
            "log_group_name": "/myapp/server",
            "log_stream_name": "{instance_id}",
            "retention_in_days": 30
          }
        ]
      }
    }
  },
  "metrics": {
    "metrics_collected": {
      "mem":  { "measurement": ["mem_used_percent"] },
      "disk": { "measurement": ["used_percent"], "resources": ["*"] }
    }
  }
}

ECS Fargate ships logs automatically when the container definition uses the awslogs driver — covered in detail in Advanced #1.

Logs Insights — query-based search #

The search / analysis tool for CloudWatch Logs. It has its own query language: commands chained with pipes, loosely SQL-like.

Simplest query

fields @timestamp, @message
| sort @timestamp desc
| limit 100

Errors only

fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 50

Request-time distribution (Lambda)

fields @timestamp, @duration
| filter @type = "REPORT"
| stats avg(@duration), max(@duration), count(*) by bin(5m)

API Gateway — 5xx count by path (assumes access logs in JSON with status and path fields)

fields @timestamp, status, path
| filter status >= 500
| stats count(*) as errors by path
| sort errors desc

Common commands:

Command	What it is
`fields`	Fields to show
`filter`	Conditional filter
`parse`	Extract fields from a string
`stats`	Aggregate (`count`, `avg`, `max`, `pct`)
`sort`	Sort
`limit`	Maximum results
`bin(5m)`	Time bucket

Logs Insights cost note #

Queries are billed per GB scanned (~$0.005/GB; pricing varies by region). An unbounded time range over a big log group scans all of it, which is how a single query turns into a cost incident. Always tighten the time range.
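One way to make the tight time range hard to forget is to run queries through a wrapper that always sets it. A minimal sketch, assuming boto3’s CloudWatch Logs client; `run_insights_query` and its defaults are illustrative names, not an AWS API.

```python
import time


def run_insights_query(logs_client, log_group, query, minutes=15,
                       poll_seconds=1.0, max_polls=60, now=None):
    """Run a Logs Insights query over the last `minutes` only."""
    end = int(now if now is not None else time.time())
    start = end - minutes * 60
    query_id = logs_client.start_query(
        logGroupName=log_group,
        startTime=start,  # epoch seconds
        endTime=end,
        queryString=query,
    )["queryId"]
    # Queries run asynchronously, so poll until a terminal status.
    for _ in range(max_polls):
        resp = logs_client.get_query_results(queryId=query_id)
        if resp["status"] in ("Complete", "Failed", "Cancelled", "Timeout"):
            break
        time.sleep(poll_seconds)
    # Each result row is a list of {"field": ..., "value": ...} pairs.
    rows = [{c["field"]: c["value"] for c in row} for row in resp["results"]]
    return resp["status"], rows


# Usage (needs boto3 and AWS credentials):
#   import boto3
#   status, rows = run_insights_query(
#       boto3.client("logs"), "/aws/lambda/my-function",
#       "fields @timestamp, @message | filter @message like /ERROR/ | limit 50")
```

Widening the window then becomes an explicit choice (`minutes=1440`) rather than a default.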

CloudWatch Metrics #

A metric is a time-series number. Almost every AWS service emits them automatically.

Frequently watched metrics #

Service	Frequently watched
EC2	`CPUUtilization`, `NetworkIn/Out`, `DiskReadOps`
RDS	`CPUUtilization`, `DatabaseConnections`, `FreeStorageSpace`, `ReadLatency`
Lambda	`Invocations`, `Errors`, `Duration`, `Throttles`, `ConcurrentExecutions`
ECS	`CPUUtilization`, `MemoryUtilization` (per service / task)
ALB	`RequestCount`, `TargetResponseTime`, `HTTPCode_Target_5XX_Count`
API Gateway	`Count`, `Latency`, `4XXError`, `5XXError`
S3	`BucketSizeBytes`, `NumberOfObjects` (once a day)
DynamoDB	`ConsumedReadCapacityUnits`, `ConsumedWriteCapacityUnits`, `ThrottledRequests`

Metric dimensions #

The same metric splits by dimension.

Example: CPUUtilization

Service: AWS/EC2
Metric:  CPUUtilization
Dimensions:
  - InstanceId: i-1234567890
  - InstanceId: i-2345678901
  - InstanceId: i-3456789012

Each distinct combination of dimension values is a separate metric. For custom metrics, metric count = cost, so dimension explosion = cost explosion.

Statistic #

Statistic	What it is
`Sum`	Sum — Invocations, RequestCount
`Average`	Mean — CPU, Latency
`Maximum`	Max — spike detection
`Minimum`	Min
`p95` / `p99`	Percentiles — Latency
`SampleCount`	Number of data points

For latency, Average + p95 covers most cases. p99 / p99.9 show the tail that the slowest requests actually hit, which is what SLAs and user experience usually hinge on.

Standard vs high-resolution #

Kind	Resolution	Cost
Standard	1 minute	Standard
High-res	1 second	Expensive — only short spikes

Standard 1-minute is enough for most cases. (EC2 itself publishes at 5-minute intervals unless detailed monitoring is enabled.)

Sending custom metrics #

The application sends them directly.

Python — boto3

import boto3

cloudwatch = boto3.client("cloudwatch")
cloudwatch.put_metric_data(
    Namespace="MyApp",
    MetricData=[{
        "MetricName": "OrderCreated",
        "Value": 1,
        "Unit": "Count",
        "Dimensions": [
            {"Name": "Environment", "Value": "prod"},
            {"Name": "Region", "Value": "ap-northeast-2"},
        ],
    }],
)

Cost: about $0.30 per custom metric per month (first pricing tier). Each combination of dimension values is a separate metric, so never use high-cardinality dimensions like user ID.
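The arithmetic behind that warning is simple enough to check before adding a dimension. A small sketch using the $0.30 figure above; the function names are illustrative, and the count is a worst case in which every combination of values actually occurs.

```python
from math import prod

PRICE_PER_METRIC_USD = 0.30  # per-metric monthly price quoted above


def metric_count(distinct_values_per_dimension):
    """Worst case: every combination of dimension values shows up."""
    return prod(distinct_values_per_dimension.values())


def monthly_cost_usd(distinct_values_per_dimension, metric_names=1):
    """Estimated monthly bill for `metric_names` metrics with these dimensions."""
    count = metric_names * metric_count(distinct_values_per_dimension)
    return round(count * PRICE_PER_METRIC_USD, 2)


# Two environments x three regions: 6 metrics.
print(monthly_cost_usd({"Environment": 2, "Region": 3}))  # 1.8
# Add a UserId dimension with 10,000 users: 60,000 metrics.
print(monthly_cost_usd({"Environment": 2, "Region": 3, "UserId": 10_000}))  # 18000.0
```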

EMF (Embedded Metric Format) — the Lambda pattern #

In Lambda, every put_metric_data call is a synchronous API request, which adds latency to the invocation and per-request cost. Instead, write a specific JSON shape into the log and CloudWatch extracts the metric from it asynchronously.

EMF format

import json
import time
print(json.dumps({
    "_aws": {
        "Timestamp": int(time.time() * 1000),
        "CloudWatchMetrics": [{
            "Namespace": "MyApp",
            "Dimensions": [["Environment"]],
            "Metrics": [{"Name": "OrderCreated", "Unit": "Count"}],
        }],
    },
    "Environment": "prod",
    "OrderCreated": 1,
}))

SDKs like aws-embedded-metrics-python make this nicer.

Metric Filter — make a metric from logs #

Turn information already in logs (ERROR occurrences, response times, etc.) into a metric.

Create in the console

CloudWatch → Log groups → pick a group → Metric filters → Create
- Filter pattern: ERROR
- Metric namespace: MyApp
- Metric name: ErrorCount
- Metric value: 1

Each ERROR now bumps the metric by +1. Use it in alarms / dashboards.

Filter pattern examples

ERROR                       # contains ERROR
[..., level="ERROR", ...]   # field in a space-delimited log
{ $.level = "ERROR" }       # JSON log key
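The same filter can be created from code instead of the console. A minimal sketch that builds the arguments for boto3’s `put_metric_filter`; the `filterName` scheme and the `defaultValue` of 0 are assumptions added for the example, not settings from the console steps above.

```python
def error_metric_filter(log_group, pattern="ERROR",
                        namespace="MyApp", metric_name="ErrorCount"):
    """Keyword arguments for the CloudWatch Logs put_metric_filter call."""
    return {
        "logGroupName": log_group,
        "filterName": f"{metric_name}-filter",  # hypothetical naming scheme
        "filterPattern": pattern,
        "metricTransformations": [{
            "metricNamespace": namespace,
            "metricName": metric_name,
            "metricValue": "1",   # add 1 per matching log event (API expects a string)
            "defaultValue": 0.0,  # reported when log events arrive but none match
        }],
    }


params = error_metric_filter("/aws/lambda/my-function", '{ $.level = "ERROR" }')

# With boto3 and credentials in place:
#   import boto3
#   boto3.client("logs").put_metric_filter(**params)
```

Publishing 0 when nothing matches keeps the metric continuous, so an alarm on it does not drift into INSUFFICIENT_DATA during quiet periods with log traffic.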

CloudWatch Alarms #

An action when a metric crosses a threshold. The home base of alerting.

First alarm — Lambda Errors #

Console

CloudWatch → Alarms → Create alarm
- Metric: AWS/Lambda → Errors → Function: my-function
- Statistic: Sum
- Period: 1 minute
- Threshold: > 0, datapoints to alarm: 1 out of 5 (any one breaching minute in the last five)
- Action: SNS → notification topic
- Name: lambda-my-function-errors
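The console steps above map onto a single `put_metric_alarm` call. A sketch of the arguments, assuming boto3; the SNS topic ARN in the usage comment is a placeholder, and `TreatMissingData="notBreaching"` is an added assumption (no invocations means no error datapoints) rather than something the console steps specify.

```python
def lambda_errors_alarm(function_name, sns_topic_arn):
    """Keyword arguments for CloudWatch put_metric_alarm."""
    return {
        "AlarmName": f"lambda-{function_name}-errors",
        "Namespace": "AWS/Lambda",
        "MetricName": "Errors",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": 60,             # seconds
        "EvaluationPeriods": 5,   # look at the last five 1-minute periods
        "DatapointsToAlarm": 1,   # alarm if any one of them breaches
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # assumption: idle function is healthy
        "AlarmActions": [sns_topic_arn],
    }


# With boto3 and credentials in place (the ARN is a placeholder):
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**lambda_errors_alarm(
#       "my-function", "arn:aws:sns:ap-northeast-2:123456789012:alerts"))
```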

Alarm states #

State	Meaning
`OK`	Within threshold
`ALARM`	Threshold breached — actions fire
`INSUFFICIENT_DATA`	Not enough data — newly created / no metric data

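Acting on INSUFFICIENT_DATA is optional. New alarms sit in it until enough datapoints arrive, and so do alarms on metrics that only report during activity (Lambda Errors with no invocations, for example), so you can usually ignore it. The alarm’s missing-data setting (missing / notBreaching / breaching / ignore) controls how gaps are evaluated.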

Composite Alarms #

AND / OR / NOT combinations of other alarms’ states. Patterns like “ALB 5xx ≥ 1% AND CPU > 80%”.

Combination example

ALARM("alb-5xx") AND ALARM("ec2-high-cpu")

Effective at reducing false positives.
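Rule strings like the one above are easy to mistype by hand, so it can help to generate them. A minimal sketch that builds the rule and the arguments for boto3’s `put_composite_alarm`; the helper names are illustrative, and the child alarms are assumed to exist already.

```python
def all_in_alarm(*alarm_names):
    """AlarmRule that fires only when every child alarm is in ALARM."""
    return " AND ".join(f'ALARM("{name}")' for name in alarm_names)


def composite_alarm(name, rule, sns_topic_arn):
    """Keyword arguments for CloudWatch put_composite_alarm."""
    return {"AlarmName": name, "AlarmRule": rule, "AlarmActions": [sns_topic_arn]}


rule = all_in_alarm("alb-5xx", "ec2-high-cpu")
print(rule)  # ALARM("alb-5xx") AND ALARM("ec2-high-cpu")

# With boto3 and credentials in place (the ARN is a placeholder):
#   import boto3
#   boto3.client("cloudwatch").put_composite_alarm(**composite_alarm(
#       "service-degraded", rule, "arn:aws:sns:ap-northeast-2:123456789012:alerts"))
```

A common arrangement is to attach the notification only to the composite and leave the child alarms without actions, so one incident produces one message.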

Alarm actions #

Action	What it is
SNS Topic	Fanout to email / Slack / SMS / Lambda, etc.
EC2 Action	Stop / terminate / reboot / recover an instance
Auto Scaling	Scale ASG in / out
Systems Manager	Create an OpsItem

In practice, most production alarms are SNS → Slack / email.

Anomaly Detection #

CloudWatch learns a baseline band from the metric’s history → alarm when the value leaves the band. Effective on metrics with a daily or weekly pattern, like traffic / CPU, where it tends to produce fewer false positives than a static threshold.
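An anomaly-detection alarm is an ordinary metric alarm whose threshold is a band expression instead of a number. A sketch of the `put_metric_alarm` arguments, assuming boto3; the band width of 2, the 5-minute period, and the three evaluation periods are illustrative choices to tune per metric.

```python
def anomaly_alarm(name, namespace, metric_name, dimensions,
                  band_width=2, period=300, sns_topic_arn=None):
    """Keyword arguments for put_metric_alarm with an anomaly-detection band."""
    return {
        "AlarmName": name,
        "ComparisonOperator": "GreaterThanUpperThreshold",  # above the band only
        "EvaluationPeriods": 3,
        "ThresholdMetricId": "band",  # points at the expression below
        "Metrics": [
            {
                "Id": "m1",
                "ReturnData": True,
                "MetricStat": {
                    "Metric": {"Namespace": namespace,
                               "MetricName": metric_name,
                               "Dimensions": dimensions},
                    "Period": period,
                    "Stat": "Average",
                },
            },
            {
                "Id": "band",
                "Expression": f"ANOMALY_DETECTION_BAND(m1, {band_width})",
                "Label": "expected range",
            },
        ],
        "AlarmActions": [sns_topic_arn] if sns_topic_arn else [],
    }


params = anomaly_alarm(
    "ec2-cpu-anomaly", "AWS/EC2", "CPUUtilization",
    [{"Name": "InstanceId", "Value": "i-1234567890"}],
)
# boto3.client("cloudwatch").put_metric_alarm(**params)
```

A larger `band_width` widens the band and makes the alarm less sensitive.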

SNS integration — how alerts are routed #

Most alarms head to an SNS Topic and fan out from there.

Subscription	Where to
Email	Email
HTTPS	Your own webhook endpoint (a Slack incoming webhook can’t take SNS’s message format directly; put a Lambda in between)
Lambda	Transform and forward
SMS	Phone (rarely)
SQS	A queue

Slack integration — the Lambda pattern #

Either post to the Slack webhook from a Lambda subscribed to the topic, or use AWS Chatbot (since renamed Amazon Q Developer in chat applications).

Lambda — SNS → Slack

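import json, os, urllib.request

WEBHOOK = os.environ["SLACK_WEBHOOK"]  # Slack incoming-webhook URL

def handler(event, context):
    # SNS delivers the CloudWatch alarm as a JSON string in Sns.Message
    for record in event["Records"]:
        msg = json.loads(record["Sns"]["Message"])
        payload = json.dumps({
            "text": f"🚨 *{msg['AlarmName']}* — {msg['NewStateValue']}",
            # "blocks": [...]  optional Block Kit layout; a literal [...] would not serialize
        }).encode()
        req = urllib.request.Request(WEBHOOK, data=payload,
                                      headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req, timeout=5)  # fail fast instead of hanging the invocation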

Deeper coverage in Advanced #4 API Gateway + Lambda and Advanced #5 EventBridge / SQS / SNS.

CloudWatch Dashboards #

Pages of widgets. At-a-glance per team / service. Definable as JSON / managed as code.
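“Managed as code” in practice means generating the dashboard body and pushing it with `put_dashboard`. A minimal sketch, assuming boto3; the 12 x 6 widget size, the `my-service` dashboard name, and the helper names are illustrative.

```python
import json


def metric_widget(title, metric, x=0, y=0, region="ap-northeast-2",
                  stat="Sum", period=300):
    """One metric-graph widget; `metric` is [namespace, name, dim, value, ...]."""
    return {
        "type": "metric",
        "x": x, "y": y, "width": 12, "height": 6,  # dashboard grid is 24 columns wide
        "properties": {
            "title": title,
            "metrics": [metric],
            "stat": stat,
            "period": period,
            "region": region,
        },
    }


def dashboard_body(widgets):
    """JSON string for the DashboardBody parameter of put_dashboard."""
    return json.dumps({"widgets": widgets})


body = dashboard_body([
    metric_widget("Invocations",
                  ["AWS/Lambda", "Invocations", "FunctionName", "my-function"]),
    metric_widget("Errors",
                  ["AWS/Lambda", "Errors", "FunctionName", "my-function"], x=12),
])

# With boto3 and credentials in place:
#   import boto3
#   boto3.client("cloudwatch").put_dashboard(DashboardName="my-service",
#                                            DashboardBody=body)
```

Keeping the generator in the service’s repository means a new metric and its graph can ship in the same change.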

Common dashboards #

Kind	What it is
Service dashboard	Core metrics for one service (requests, latency, errors, infra)
Infrastructure dashboard	EC2/RDS CPU / memory / network
Business dashboard	Signups / payments / orders, etc.
On-call dashboard	Active alarms / recent incidents / KPIs

Widget kinds #

Metric graphs (line, stacked, number)
Logs (Logs Insights query results)
Text (Markdown — dashboard guidance)
Alarm status

A good dashboard = “thirty seconds on this page tells you the system’s state.”

Settings to turn on right after signup #

By the end of this series the following should be in place.

Item	Where
Auto-apply retention to new log groups	Console / EventBridge + Lambda
Lambda Errors alarm	Per function
RDS FreeStorageSpace alarm	Per DB
ALB 5xx alarm	Per LB
Billing alarm (#3)	Per account
GuardDuty findings alarm (#6)	Per account

These six are the alerting baseline for small operations.

Common pitfalls #

1) Forever-retention logs #

The most common cost incident. Set retention per log group + automate for new groups. On an account that has never set it, that alone often removes the bulk of the log-storage bill.

2) High-cardinality custom metrics #

Dimensions: [{Name: 'UserId', Value: user_id}] with 10K users → 10K separate metrics for every metric name. At $0.30 each, that is about $3,000/month. Per-user data belongs in logs (Logs Insights).

3) Unbounded-time Logs Insights queries #

A big log group with no time range = GB-scale scan cost. Always tighten the time range.

4) Alarm Period too short #

1-minute, 1 datapoint alarms → an alert per transient spike. Usually 5 minutes / 3 datapoints balances noise.
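The effect of requiring several datapoints is easy to see on a toy series. This is a simplified model of M-out-of-N evaluation written for illustration; it is not CloudWatch’s actual evaluator, and it ignores missing data and the INSUFFICIENT_DATA state.

```python
def alarm_states(datapoints, threshold, evaluation_periods, datapoints_to_alarm):
    """ALARM when at least M of the last N datapoints exceed the threshold."""
    states = []
    for i in range(len(datapoints)):
        window = datapoints[max(0, i - evaluation_periods + 1): i + 1]
        breaches = sum(1 for v in window if v > threshold)
        states.append("ALARM" if breaches >= datapoints_to_alarm else "OK")
    return states


# CPU %: one transient spike, then a sustained one.
cpu = [40, 95, 42, 41, 90, 92, 94, 45]

# 1 of 1: fires on the transient spike as well.
print(alarm_states(cpu, 80, evaluation_periods=1, datapoints_to_alarm=1))
# ['OK', 'ALARM', 'OK', 'OK', 'ALARM', 'ALARM', 'ALARM', 'OK']

# 3 of 3: fires only once the breach has lasted three datapoints.
print(alarm_states(cpu, 80, evaluation_periods=3, datapoints_to_alarm=3))
# ['OK', 'OK', 'OK', 'OK', 'OK', 'OK', 'ALARM', 'OK']
```

The cost of the quieter setting is delay: the sustained breach is reported two datapoints later than with the 1-of-1 alarm.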

5) Alarm without an action #

Create an alarm without SNS / action and it just turns red in the console — no one notices. Attach an action when you create it.

6) Dashboards built then ignored #

Right after building everyone looks; weeks in it’s forgotten. Embed in on-call / daily standup as the first stop.

7) Skipping Metric Math #

Metric Math computes ratios / sums / transformations across metrics. “5xx / total = error rate” kinds of things. Used well, dashboards / alarms get a lot cleaner.
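As a concrete case, the “5xx / total” error rate can be expressed as a `Metrics` list that `put_metric_alarm` accepts. A sketch under assumptions: the ALB metric names come from the table earlier in this post, while the helper name, the 5-minute period, and the load balancer value in the usage line are illustrative.

```python
def error_rate_queries(load_balancer, period=300):
    """Metrics list for put_metric_alarm: 5xx as a percentage of requests."""
    dims = [{"Name": "LoadBalancer", "Value": load_balancer}]

    def stat(metric_id, metric_name):
        return {
            "Id": metric_id,
            "ReturnData": False,  # inputs only; the alarm watches the expression
            "MetricStat": {
                "Metric": {"Namespace": "AWS/ApplicationELB",
                           "MetricName": metric_name,
                           "Dimensions": dims},
                "Period": period,
                "Stat": "Sum",
            },
        }

    return [
        stat("m1", "HTTPCode_Target_5XX_Count"),
        stat("m2", "RequestCount"),
        {"Id": "e1", "Expression": "100 * m1 / m2",
         "Label": "5xx error rate (%)", "ReturnData": True},
    ]


queries = error_rate_queries("app/my-alb/0123456789abcdef")  # hypothetical ALB
# Pass as Metrics=queries to put_metric_alarm, with Threshold=1 and
# ComparisonOperator="GreaterThanOrEqualToThreshold" for "5xx >= 1%".
```

Alarming on the ratio rather than the raw 5xx count keeps the threshold meaningful as traffic grows.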

Wrap-up #

What we covered:

The four components of CloudWatch — Logs / Metrics / Alarms / Dashboards
Logs — group → stream → event. Set retention right after signup. CloudWatch Agent for EC2 collection
Logs Insights — fields / filter / stats / sort / parse + bin(5m). Tight time range
Metrics — AWS services emit automatically. Dimensions / statistics (Avg, p95, p99). No high-cardinality dimensions
Metric Filter — extract metrics from logs (ERROR, etc.)
EMF — metrics through logs in Lambda (an alternative to put_metric_data)
Alarms — threshold + Period + Datapoints. SNS / EC2 actions / ASG. Composite / Anomaly
SNS — alarm fanout. Slack / email / Lambda
Dashboards — service / infra / business / on-call
Pitfalls — forever retention, high-cardinality dimensions, unbounded Insights queries, Period too short, no action, ignored dashboards, skipping Metric Math

Next series — AWS Intermediate #

This wraps up the seven AWS Basics. Console / account / IAM / cost / CLI / SSO / security / CloudWatch — the toolbox for getting started on AWS is in one place.

Now to actually build resources. The seven posts of AWS Intermediate cover the core pieces of backend operations.

AWS Intermediate #1 EC2 and VPC basics threads EC2 (the virtual machine) together with the VPC it lives in — Subnet / Internet Gateway / Route Table / Security Group / NACL — onto a single line.
