AWS Basics #7: CloudWatch Intro — Logs and Metrics

10 min read

#1 through #6 gave us the AWS setup foundation. Now for the other axis of operations — knowing what’s running where, and what it’s doing.

CloudWatch is AWS’s observability standard. Almost every AWS service emits metrics into CloudWatch by default, and logs land in CloudWatch Logs. That’s where production visibility begins.

This post covers CloudWatch’s four components — Logs / Metrics / Alarms / Dashboards — in one go.

Big picture — the four components of CloudWatch #

| Component | What it is | Common use |
| --- | --- | --- |
| Logs | Store / search text logs | Logs from EC2 / Lambda / ECS / API Gateway |
| Metrics | Time-series numbers (CPU%, request counts, etc.) | Every AWS service auto-emits |
| Alarms | Alert / act when a metric crosses a threshold | Production alerts, autoscaling |
| Dashboards | Pages of graphs / widgets | At-a-glance per team / service |

The four interlock: logs → metrics → alarms → dashboards.

CloudWatch Logs #

Log group and log stream #

Structure
Log Group           — usually one application / service
  └── Log Stream    — usually one process / container
        └── Log Event — one line

| Item | Example |
| --- | --- |
| Log Group | /aws/lambda/my-function, /ecs/my-service, /var/log/myapp |
| Log Stream | Lambda execution-environment ID, ECS Task ID, EC2 instance ID |
| Log Event | One line of text + timestamp |

Lambda sends logs to CloudWatch Logs automatically, and ECS Fargate does so once the container definition uses the awslogs driver. EC2 needs the CloudWatch Agent or another log shipper (Fluent Bit, etc.).

Retention — the most important setting #

Default retention is forever. Leave it alone and logs pile up indefinitely → cost runaway. Right after signup / for every new log group, set retention.

| Item | Recommended retention |
| --- | --- |
| General application logs | 30–90 days |
| Debug / development logs | 7 days |
| Security / audit logs (CloudTrail) | 1–7 years (cheaper to ship to S3) |
| Lambda logs | 14–30 days |

Set retention
aws logs put-retention-policy \
  --log-group-name /aws/lambda/my-function \
  --retention-in-days 30
Apply to every log group at once
aws logs describe-log-groups --query 'logGroups[].logGroupName' --output text \
  | tr '\t' '\n' \
  | while read name; do
      aws logs put-retention-policy --log-group-name "$name" --retention-in-days 30
    done

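This short loop can cut CloudWatch Logs storage cost substantially; how much depends on how much old data the groups were holding, and ingestion charges are unaffected. Note that it sets 30 days on every group, overwriting any longer retention already in place, so exclude audit log groups first.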
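
For a gentler variant, here is a minimal boto3 sketch that only touches log groups with no retention policy, so groups that already have a longer setting (audit logs, for instance) are left alone. It relies on describe_log_groups omitting the retentionInDays key for never-expire groups; the function names and the 30-day default are illustrative choices, not part of any AWS API.

```python
def groups_without_retention(pages):
    """Yield names of log groups that have no retention policy.

    describe_log_groups leaves out the retentionInDays key for groups
    that never expire, so a missing key means "keep forever".
    """
    for page in pages:
        for group in page.get("logGroups", []):
            if "retentionInDays" not in group:
                yield group["logGroupName"]


def apply_default_retention(logs_client, days=30):
    """Set `days` of retention on every group that has none; return their names."""
    paginator = logs_client.get_paginator("describe_log_groups")
    updated = []
    for name in groups_without_retention(paginator.paginate()):
        logs_client.put_retention_policy(logGroupName=name, retentionInDays=days)
        updated.append(name)
    return updated


# Usage (needs AWS credentials):
#   import boto3
#   print(apply_default_retention(boto3.client("logs")))
```

Passing the client in keeps the helper easy to test with a stub before pointing it at a real account.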

Auto-applying retention to new log groups #

Two approaches.

Approach 1: EventBridge + Lambda — react to CreateLogGroup events and apply automatically (common in production).

Approach 2: Force via policy — deny creation if retention isn’t set explicitly (slightly heavy-handed).
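
Approach 1 can be sketched roughly as follows. This assumes a CloudTrail trail is recording management events (EventBridge only sees CreateLogGroup as an "AWS API Call via CloudTrail" event) and that the function's role allows logs:PutRetentionPolicy. The 30-day default and the optional logs_client parameter are illustrative choices.

```python
# EventBridge rule pattern that matches log group creation.
EVENT_PATTERN = {
    "source": ["aws.logs"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventSource": ["logs.amazonaws.com"],
        "eventName": ["CreateLogGroup"],
    },
}

DEFAULT_RETENTION_DAYS = 30  # assumed default; pick per the retention table above


def log_group_from_event(event):
    """Return the new log group's name, or None if this is not a CreateLogGroup event."""
    detail = event.get("detail", {})
    if detail.get("eventName") != "CreateLogGroup":
        return None
    return (detail.get("requestParameters") or {}).get("logGroupName")


def handler(event, context, logs_client=None):
    name = log_group_from_event(event)
    if name is None:
        return {"updated": None}
    if logs_client is None:
        import boto3  # bundled with the Lambda Python runtime
        logs_client = boto3.client("logs")
    logs_client.put_retention_policy(
        logGroupName=name, retentionInDays=DEFAULT_RETENTION_DAYS
    )
    return {"updated": name}
```

The extra logs_client argument exists only so the handler can be exercised locally with a stub; Lambda itself calls it with two arguments.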

Lambda log shipping #

Lambda — just print
import logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)

def handler(event, context):
    logger.info("Received event: %s", event)
    return {"ok": True}

stdout / stderr flows automatically into CloudWatch Logs. The log group is named /aws/lambda/<function-name>.

EC2 / ECS — CloudWatch Agent #

EC2 isn’t automatic. Install the CloudWatch Agent.

Amazon Linux 2023 / Ubuntu
sudo yum install -y amazon-cloudwatch-agent       # AL
# or
wget https://s3.amazonaws.com/amazoncloudwatch-agent/ubuntu/amd64/latest/amazon-cloudwatch-agent.deb
sudo dpkg -i amazon-cloudwatch-agent.deb

Configure at /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json:

Simple config
{
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/myapp/*.log",
            "log_group_name": "/myapp/server",
            "log_stream_name": "{instance_id}",
            "retention_in_days": 30
          }
        ]
      }
    }
  },
  "metrics": {
    "metrics_collected": {
      "mem":  { "measurement": ["mem_used_percent"] },
      "disk": { "measurement": ["used_percent"], "resources": ["*"] }
    }
  }
}

ECS Fargate ships logs automatically when the container definition uses the awslogs driver — covered in detail in Advanced #1.

Logs Insights — query-based search #

The search / analysis tool for CloudWatch Logs. SQL-ish syntax of its own.

Simplest query
fields @timestamp, @message
| sort @timestamp desc
| limit 100
Errors only
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 50
Request-time distribution (Lambda)
fields @timestamp, @duration
| filter @type = "REPORT"
| stats avg(@duration), max(@duration), count(*) by bin(5m)
API Gateway — 5xx ratio
fields @timestamp, status, path
| filter status >= 500
| stats count(*) as errors by path
| sort errors desc

Common commands:

| Command | What it is |
| --- | --- |
| fields | Fields to show |
| filter | Conditional filter |
| parse | Extract fields from a string |
| stats | Aggregate (count, avg, max, pct for percentiles) |
| sort | Sort |
| limit | Maximum results |
| bin(5m) | Time bucket |

Logs Insights cost note #

Queries bill by GB scanned (~$0.005/GB). An unbounded time range over a big log group is a cost incident. Always tighten the time range.
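
When queries are started from code, the time range has to be passed explicitly, which makes it easy to keep tight. A minimal sketch, assuming boto3's start_query (which takes startTime and endTime as epoch seconds); the helper names, the limit of 50, and the cost estimator built on the per-GB price above are illustrative.

```python
import time


def bounded_query_kwargs(log_group, query, minutes, now=None):
    """Build start_query arguments covering only the last `minutes` minutes."""
    end = int(now if now is not None else time.time())
    return {
        "logGroupName": log_group,
        "startTime": end - minutes * 60,  # epoch seconds
        "endTime": end,
        "queryString": query,
        "limit": 50,
    }


def estimated_scan_cost_usd(scanned_gb, price_per_gb=0.005):
    """Rough cost of one query at the per-GB price quoted in this post."""
    return scanned_gb * price_per_gb


# Usage (needs AWS credentials):
#   import boto3
#   logs = boto3.client("logs")
#   query_id = logs.start_query(**bounded_query_kwargs(
#       "/aws/lambda/my-function",
#       "fields @timestamp, @message | filter @message like /ERROR/",
#       minutes=15,
#   ))["queryId"]
#   results = logs.get_query_results(queryId=query_id)  # poll until status is Complete
```

At that price, a query that scans 200 GB costs about $1, which is small once and painful on a dashboard that refreshes every minute.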

CloudWatch Metrics #

A metric is a time-series number. Almost every AWS service emits them automatically.

Frequently watched metrics #

| Service | Frequently watched |
| --- | --- |
| EC2 | CPUUtilization, NetworkIn/Out, DiskReadOps |
| RDS | CPUUtilization, DatabaseConnections, FreeStorageSpace, ReadLatency |
| Lambda | Invocations, Errors, Duration, Throttles, ConcurrentExecutions |
| ECS | CPUUtilization, MemoryUtilization (per cluster / service; per-task numbers need Container Insights) |
| ALB | RequestCount, TargetResponseTime, HTTPCode_Target_5XX_Count |
| API Gateway | Count, Latency, 4XXError, 5XXError |
| S3 | BucketSizeBytes, NumberOfObjects (once a day) |
| DynamoDB | ConsumedReadCapacityUnits / ConsumedWriteCapacityUnits, ThrottledRequests |

Metric dimensions #

The same metric splits by dimension.

Example: CPUUtilization
Service: AWS/EC2
Metric:  CPUUtilization
Dimensions:
  - InstanceId: i-1234567890
  - InstanceId: i-2345678901
  - InstanceId: i-3456789012

Different dimensions are different metrics. Metric count = cost, so dimension explosion = cost explosion.
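
To make the arithmetic concrete, here is a small sketch that computes the worst-case number of distinct metrics for a set of dimensions and prices it at the $0.30 per metric per month quoted later in this post. The dimension names and cardinalities are made up for illustration, and the product is an upper bound, since only combinations that actually report data are billed.

```python
from math import prod

PRICE_PER_METRIC_USD = 0.30  # figure used in this post; check current regional pricing


def worst_case_metric_count(cardinalities):
    """Upper bound on distinct metrics: one per combination of dimension values."""
    return prod(cardinalities.values())


def monthly_cost_usd(metric_count):
    return metric_count * PRICE_PER_METRIC_USD


modest = {"Environment": 3, "Region": 4}
exploded = {"Environment": 3, "Region": 4, "UserId": 10_000}

print(worst_case_metric_count(modest), monthly_cost_usd(worst_case_metric_count(modest)))
print(worst_case_metric_count(exploded), monthly_cost_usd(worst_case_metric_count(exploded)))
```

Adding one high-cardinality dimension multiplies the count rather than adding to it, which is why the bill jumps so sharply.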

Statistic #

| Statistic | What it is |
| --- | --- |
| Sum | Sum (Invocations, RequestCount) |
| Average | Mean (CPU, Latency) |
| Maximum | Max (spike detection) |
| Minimum | Min |
| p95 / p99 | Percentiles (Latency) |
| SampleCount | Number of data points |

In most cases, Average + p95 is meaningful. p99 / p99.9 directly affect SLAs / user experience.
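
A tiny worked example shows why Average alone can mislead. The latency numbers are invented, and the nearest-rank percentile below is a simplification for illustration; CloudWatch's own percentile computation may differ in detail.

```python
import math
from statistics import mean


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# 90 fast requests and 10 slow ones, in milliseconds.
latencies_ms = [100] * 90 + [2000] * 10

print("avg:", mean(latencies_ms))
print("p50:", percentile(latencies_ms, 50))
print("p95:", percentile(latencies_ms, 95))
print("p99:", percentile(latencies_ms, 99))
```

Here the average is 290 ms, which looks acceptable, while p95 is 2000 ms: one request in ten is twenty times slower than the typical one.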

Standard vs high-resolution #

| Kind | Resolution | Cost |
| --- | --- | --- |
| Standard | 1 minute | Standard |
| High-res | 1 second | Expensive; reserve for short spikes |

Standard 1-minute is enough for most cases.

Sending custom metrics #

The application sends them directly.

Python — boto3
import boto3

cloudwatch = boto3.client("cloudwatch")
cloudwatch.put_metric_data(
    Namespace="MyApp",
    MetricData=[{
        "MetricName": "OrderCreated",
        "Value": 1,
        "Unit": "Count",
        "Dimensions": [
            {"Name": "Environment", "Value": "prod"},
            {"Name": "Region", "Value": "ap-northeast-2"},
        ],
    }],
)

Cost: $0.30 per metric per month. Each combination of dimensions is a separate metric — never use high-cardinality dimensions like user ID.

EMF (Embedded Metric Format) — the Lambda pattern #

In Lambda, put_metric_data itself adds cost / latency. Write a special JSON shape into the log and CloudWatch auto-converts it into a metric.

EMF format
import json
import time
print(json.dumps({
    "_aws": {
        "Timestamp": int(time.time() * 1000),
        "CloudWatchMetrics": [{
            "Namespace": "MyApp",
            "Dimensions": [["Environment"]],
            "Metrics": [{"Name": "OrderCreated", "Unit": "Count"}],
        }],
    },
    "Environment": "prod",
    "OrderCreated": 1,
}))

SDKs like aws-embedded-metrics-python make this nicer.

Metric Filter — make a metric from logs #

Turn information already in logs (ERROR occurrences, response times, etc.) into a metric.

Create in the console
CloudWatch → Log groups → pick a group → Metric filters → Create
- Filter pattern: ERROR
- Metric namespace: MyApp
- Metric name: ErrorCount
- Metric value: 1

Each ERROR now bumps the metric by +1. Use it in alarms / dashboards.

Filter pattern examples
ERROR                       # contains ERROR
[..., level="ERROR", ...]   # field on a structured log
{ $.level = "ERROR" }       # JSON log key
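
The same filter can be created from code. A minimal sketch using boto3's put_metric_filter, mirroring the console values above; the filter name "error-count" and the defaultValue of 0 (so the metric reports zero instead of going missing in quiet periods) are assumptions added here.

```python
def error_count_filter_kwargs(log_group):
    """Arguments for put_metric_filter that count lines containing ERROR."""
    return {
        "logGroupName": log_group,
        "filterName": "error-count",  # hypothetical name
        "filterPattern": "ERROR",
        "metricTransformations": [{
            "metricNamespace": "MyApp",
            "metricName": "ErrorCount",
            "metricValue": "1",   # the API expects a string here
            "defaultValue": 0.0,  # value emitted when no line matches
        }],
    }


# Usage (needs AWS credentials):
#   import boto3
#   boto3.client("logs").put_metric_filter(**error_count_filter_kwargs("/myapp/server"))
```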

CloudWatch Alarms #

An action when a metric crosses a threshold. The home base of alerting.

First alarm — Lambda Errors #

Console
CloudWatch → Alarms → Create alarm
- Metric: AWS/Lambda → Errors → Function: my-function
- Statistic: Sum
- Period: 1 minute
- Threshold: > 0 for 1 out of 5 datapoints (a 5-minute window)
- Action: SNS → notification topic
- Name: lambda-my-function-errors
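
The console steps above map onto a single put_metric_alarm call. A minimal sketch, assuming boto3; the topic ARN is a placeholder, and TreatMissingData="notBreaching" is an added assumption so that a function with no invocations (and therefore no Errors datapoints) stays in OK.

```python
def lambda_errors_alarm_kwargs(function_name, topic_arn):
    """Arguments for put_metric_alarm: any Lambda error within 5 minutes raises the alarm."""
    return {
        "AlarmName": f"lambda-{function_name}-errors",
        "Namespace": "AWS/Lambda",
        "MetricName": "Errors",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": 60,                 # seconds per datapoint
        "EvaluationPeriods": 5,       # look at the last 5 datapoints
        "DatapointsToAlarm": 1,       # alarm if any 1 of them breaches
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # assumption: no data means no errors
        "AlarmActions": [topic_arn],
    }


# Usage (needs AWS credentials; the ARN below is a placeholder):
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**lambda_errors_alarm_kwargs(
#       "my-function", "arn:aws:sns:us-east-1:123456789012:alerts"))
```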

Alarm states #

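| State | Meaning |
| --- | --- |
| OK | Within threshold |
| ALARM | Threshold breached; actions fire |
| INSUFFICIENT_DATA | Not enough data (newly created alarm, or no metric data) |

INSUFFICIENT_DATA does not trigger ALARM actions. New alarms pass through it while they collect their first datapoints, so you can usually ignore it. For a metric that should always be reporting, set the alarm's missing-data treatment to breaching so that silence also alerts.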

Composite Alarms #

AND / OR combinations of multiple alarms. “ALB 5xx ≥ 1% AND CPU > 80%” kinds of patterns.

Combination example
ALARM("alb-5xx") AND ALARM("ec2-high-cpu")

Effective at reducing false positives.
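
Created from code, a composite alarm is just a rule string over existing alarm names. A minimal sketch, assuming boto3's put_composite_alarm; the helper, the composite alarm's name, and the topic ARN are placeholders.

```python
def all_in_alarm(*alarm_names):
    """Build an AlarmRule that is true only when every named alarm is in ALARM."""
    return " AND ".join(f'ALARM("{name}")' for name in alarm_names)


rule = all_in_alarm("alb-5xx", "ec2-high-cpu")
print(rule)

# Usage (needs AWS credentials; names and ARN are placeholders):
#   import boto3
#   boto3.client("cloudwatch").put_composite_alarm(
#       AlarmName="service-degraded",
#       AlarmRule=rule,
#       AlarmActions=["arn:aws:sns:us-east-1:123456789012:alerts"],
#   )
```

One common arrangement is to attach the notification action only to the composite and leave the child alarms without actions, so a CPU spike alone stays quiet.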

Alarm actions #

| Action | What it is |
| --- | --- |
| SNS Topic | Fanout to email / Slack / SMS / Lambda, etc. |
| EC2 Action | Stop / terminate / reboot / recover an instance |
| Auto Scaling | Scale ASG in / out |
| Systems Manager | Create an OpsItem |

In practice, most production alarms simply go SNS → Slack / email.

Anomaly Detection #

Auto-learn a baseline (band) → alarm when out of band. Effective on patterned metrics like traffic / CPU, where it often produces fewer false positives than a static threshold.

SNS integration — how alerts are routed #

Most alarms head to an SNS Topic and fan out from there.

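| Subscription | Where to |
| --- | --- |
| Email | Email |
| HTTPS | Your own webhook endpoint (it must confirm the subscription; a Slack incoming webhook cannot, so Slack goes through Lambda or AWS Chatbot) |
| Lambda | Transform and forward |
| SMS | Phone (rarely) |
| SQS | A queue |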

Slack integration — the Lambda pattern #

Either have a Lambda post to the Slack incoming webhook, or use AWS Chatbot (since renamed Amazon Q Developer in chat applications).

Lambda — SNS → Slack
import json, os, urllib.request

WEBHOOK = os.environ["SLACK_WEBHOOK"]

def handler(event, context):
    msg = json.loads(event["Records"][0]["Sns"]["Message"])
    payload = json.dumps({
        "text": f"🚨 *{msg['AlarmName']}* — {msg['NewStateValue']}",
        # "blocks": [...]  (optional Block Kit layout, elided; a literal [...] is not JSON-serializable)
    }).encode()
    req = urllib.request.Request(WEBHOOK, data=payload,
                                  headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req, timeout=5)  # raises on a non-2xx response, so failures show up as Lambda errors

Deeper coverage in Advanced #4 API Gateway + Lambda and Advanced #5 EventBridge / SQS / SNS.

CloudWatch Dashboards #

Pages of widgets. At-a-glance per team / service. Definable as JSON / managed as code.
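
"Managed as code" can look like the following: a function that builds the dashboard body as JSON, which then goes to put_dashboard. This is a minimal sketch assuming boto3; the dashboard name, widget layout, and the choice of Lambda metrics are illustrative.

```python
import json


def service_dashboard_body(function_name, region):
    """Return a DashboardBody JSON string: a text header plus two Lambda metric graphs."""
    def graph(title, metric_name, stat, x):
        return {
            "type": "metric", "x": x, "y": 2, "width": 12, "height": 6,
            "properties": {
                "title": title,
                "metrics": [["AWS/Lambda", metric_name, "FunctionName", function_name]],
                "stat": stat,
                "period": 60,
                "region": region,
                "view": "timeSeries",
            },
        }

    widgets = [
        {"type": "text", "x": 0, "y": 0, "width": 24, "height": 2,
         "properties": {"markdown": f"# {function_name}\nOn-call: start here."}},
        graph("Errors", "Errors", "Sum", x=0),
        graph("Duration p95", "Duration", "p95", x=12),
    ]
    return json.dumps({"widgets": widgets})


# Usage (needs AWS credentials; the dashboard name is a placeholder):
#   import boto3
#   boto3.client("cloudwatch").put_dashboard(
#       DashboardName="my-function-service",
#       DashboardBody=service_dashboard_body("my-function", "us-east-1"),
#   )
```

Keeping the body in version control means a dashboard change goes through review like any other change.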

Common dashboards #

| Kind | What it is |
| --- | --- |
| Service dashboard | Core metrics for one service (requests, latency, errors, infra) |
| Infrastructure dashboard | EC2/RDS CPU / memory / network |
| Business dashboard | Signups / payments / orders, etc. |
| On-call dashboard | Active alarms / recent incidents / KPIs |

Widget kinds #

  • Metric graphs (line, stacked, number)
  • Logs (Logs Insights query results)
  • Text (Markdown — dashboard guidance)
  • Alarm status

A good dashboard = “thirty seconds on this page tells you the system’s state.”

Settings to turn on right after signup #

By the end of this series the following should be in place.

| Item | Where |
| --- | --- |
| Auto-apply retention to new log groups | Console / EventBridge + Lambda |
| Lambda Errors alarm | Per function |
| RDS FreeStorageSpace alarm | Per DB |
| ALB 5xx alarm | Per LB |
| Billing alarm (#3) | Per account |
| GuardDuty findings alarm (#6) | Per account |

These six are the alerting baseline for small operations.

Common pitfalls #

1) Forever-retention logs #

The most common cost incident. Set retention per log group + automate for new groups. A small change that can cut log storage cost substantially.

2) High-cardinality custom metrics #

Dimensions: [{Name: 'UserId', Value: user_id}] with 10K users creates 10K distinct metrics. At $0.30 per metric, that is $3,000+/month. Per-user data belongs in logs (Logs Insights).

3) Unbounded-time Logs Insights queries #

A big log group with no time range = GB-scale scan cost. Always tighten the time range.

4) Alarm Period too short #

1-minute, 1 datapoint alarms → an alert per transient spike. Usually 5 minutes / 3 datapoints balances noise.
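
The effect is easy to see in a simplified simulation of "M out of N" evaluation. This is an illustration only: real CloudWatch evaluation also handles missing data and period alignment, and the CPU series and threshold below are invented.

```python
def alarm_states(values, threshold, datapoints_to_alarm, evaluation_periods):
    """For each datapoint, report whether an M-out-of-N alarm would be in ALARM.

    M = datapoints_to_alarm, N = evaluation_periods (the trailing window size).
    """
    states = []
    for i in range(len(values)):
        window = values[max(0, i - evaluation_periods + 1): i + 1]
        breaching = sum(1 for v in window if v > threshold)
        states.append(breaching >= datapoints_to_alarm)
    return states


cpu = [40, 95, 42, 41, 90, 92, 94, 43]  # one transient spike, then a sustained rise

print(alarm_states(cpu, 80, datapoints_to_alarm=1, evaluation_periods=1))
print(alarm_states(cpu, 80, datapoints_to_alarm=3, evaluation_periods=3))
```

With 1 of 1, the lone spike at the second datapoint pages someone; with 3 of 3, only the sustained rise does.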

5) Alarm without an action #

Create an alarm without SNS / action and it just turns red in the console — no one notices. Attach an action when you create it.

6) Dashboards built then ignored #

Right after building everyone looks; weeks in it’s forgotten. Embed in on-call / daily standup as the first stop.

7) Skipping Metric Math #

Metric Math computes ratios / sums / transformations across metrics. “5xx / total = error rate” kinds of things. Used well, dashboards / alarms get a lot cleaner.
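
As a sketch of the error-rate case, the queries below fetch two ALB metrics and return only the computed ratio. It assumes boto3's get_metric_data; the query Ids, the 5-minute period, and the load balancer value are illustrative (the LoadBalancer dimension takes the "app/name/id" suffix of the ALB's ARN).

```python
def error_rate_queries(load_balancer, period=300):
    """MetricDataQueries that return 5xx responses as a percentage of all requests."""
    def stat(metric_id, metric_name):
        return {
            "Id": metric_id,  # Ids must start with a lowercase letter
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/ApplicationELB",
                    "MetricName": metric_name,
                    "Dimensions": [{"Name": "LoadBalancer", "Value": load_balancer}],
                },
                "Period": period,
                "Stat": "Sum",
            },
            "ReturnData": False,  # inputs only; do not return the raw series
        }

    return [
        stat("total", "RequestCount"),
        stat("errors", "HTTPCode_Target_5XX_Count"),
        {"Id": "error_rate", "Expression": "100 * errors / total",
         "Label": "5xx %", "ReturnData": True},
    ]


# Usage (needs AWS credentials; the load balancer value is a placeholder):
#   import boto3
#   from datetime import datetime, timedelta, timezone
#   end = datetime.now(timezone.utc)
#   boto3.client("cloudwatch").get_metric_data(
#       MetricDataQueries=error_rate_queries("app/my-alb/0123456789abcdef"),
#       StartTime=end - timedelta(hours=1),
#       EndTime=end,
#   )
```

An alarm on the ratio stays meaningful as traffic grows, whereas a fixed 5xx count threshold needs retuning.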

Wrap-up #

What we covered:

  • The four components of CloudWatch — Logs / Metrics / Alarms / Dashboards
  • Logs — group → stream → event. Set retention right after signup. CloudWatch Agent for EC2 collection
  • Logs Insights — fields / filter / stats / sort / parse + bin(5m). Tight time range
  • Metrics — AWS services emit automatically. Dimensions / statistics (Avg, p95, p99). No high-cardinality dimensions
  • Metric Filter — extract metrics from logs (ERROR, etc.)
  • EMF — metrics through logs in Lambda (an alternative to put_metric_data)
  • Alarms — threshold + Period + Datapoints. SNS / EC2 actions / ASG. Composite / Anomaly
  • SNS — alarm fanout. Slack / email / Lambda
  • Dashboards — service / infra / business / on-call
  • Pitfalls — forever retention, high-cardinality dimensions, unbounded Insights queries, Period too short, no action, ignored dashboards

Next series — AWS Intermediate #

This wraps up the seven AWS Basics. Console / account / IAM / cost / CLI / SSO / security / CloudWatch — the toolbox for getting started on AWS is in one place.

Now to actually build resources. The seven posts of AWS Intermediate cover the core pieces of backend operations.

AWS Intermediate #1 EC2 and VPC basics threads EC2 (the virtual machine) together with the VPC it lives in — Subnet / Internet Gateway / Route Table / Security Group / NACL — onto a single line.
