AWS Certified CloudOps Engineer - Associate (SOA-C03) #2 Domain 1-1 Monitoring — CloudWatch Metrics, Alarms, and Dashboards

Wednesday, May 27, 2026

In #1 Exam Introduction, I noted that among SOA-C03’s five domains, Monitoring, Logging, Recovery, and Performance is the largest at 22%. The starting point of that domain is CloudWatch. CloudWatch is not just a graphing tool — it is the observability layer that captures everything happening in your AWS environment as numbers (metrics) and records (logs). Nearly every operational action begins with “CloudWatch detects something.”

This post covers the first half of that observability layer: metrics, alarms, and dashboards. Logs and Logs Insights come in #3, and automated recovery after detection continues in #4.

What is a metric? #

A metric is a time series of numeric data points recorded over time. An EC2 instance’s CPUUtilization, an ELB’s RequestCount, and an SQS queue’s ApproximateNumberOfMessagesVisible are all metrics. CloudWatch organizes these metrics with the following structure.

Concept	Description	Example
Namespace	A container that groups related metrics. AWS services each use their own	`AWS/EC2`, `AWS/RDS`; a custom namespace such as `MyApp`
Metric name	The name of the measured item	`CPUUtilization`, `RequestCount`
Dimension	A key-value pair that identifies the metric	`InstanceId=i-0abc...`, `AutoScalingGroupName=web-asg`
Resolution	The interval between data points	Standard 60 seconds, high-resolution 1 second

The dimension is the key concept. The same CPUUtilization metric is recorded as a separate time series per InstanceId, and you can also view it grouped by ASG name. On the exam, the scenario “I want to set an alarm on just one specific instance” is ultimately a matter of narrowing the metric down by dimension.
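To make the role of dimensions concrete, here is a minimal sketch that models a metric's identity as namespace plus name plus dimension set. The `MetricKey` type, the `metric_key` helper, and the instance IDs are hypothetical names for illustration, not part of any AWS SDK.

```python
from typing import NamedTuple


class MetricKey(NamedTuple):
    """Hashable identity of one metric time series (illustrative model only)."""
    namespace: str
    name: str
    dimensions: frozenset  # set of (key, value) pairs


def metric_key(namespace, name, **dimensions):
    """Build the identity of a time series from its namespace, name, and dimensions."""
    return MetricKey(namespace, name, frozenset(dimensions.items()))


# Hypothetical instance IDs and ASG name, for illustration only.
per_instance_a = metric_key("AWS/EC2", "CPUUtilization", InstanceId="i-0aaa111122223333a")
per_instance_b = metric_key("AWS/EC2", "CPUUtilization", InstanceId="i-0bbb111122223333b")
per_asg = metric_key("AWS/EC2", "CPUUtilization", AutoScalingGroupName="web-asg")

# Same namespace and metric name, but three distinct time series.
distinct_series = {per_instance_a, per_instance_b, per_asg}
print(len(distinct_series))
```

An alarm on "one specific instance" corresponds to picking exactly one of these keys.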

Standard metrics vs. custom metrics #

Category	Standard metrics	Custom metrics
Provider	Automatically published by the AWS service	Published by the user via `PutMetricData`
Example	EC2 CPU, ELB latency	Application response time, queue backlog length
Memory/disk	Not in EC2 standard metrics	Must be published via the CloudWatch Agent to collect
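As a rough sketch of what publishing a custom metric involves, the helper below builds keyword arguments shaped like a `PutMetricData` request. The function name `build_put_metric_data`, the `ResponseTime` metric, and the `Service` dimension are assumptions made up for this example; only the request field names follow the API.

```python
def build_put_metric_data(namespace, name, value, unit="Milliseconds",
                          dimensions=None, high_resolution=False):
    """Return keyword arguments shaped like a PutMetricData request."""
    return {
        "Namespace": namespace,
        "MetricData": [{
            "MetricName": name,
            "Dimensions": [
                {"Name": key, "Value": val}
                for key, val in (dimensions or {}).items()
            ],
            "Value": float(value),
            "Unit": unit,
            # 1 = high resolution (1 second), 60 = standard resolution.
            "StorageResolution": 1 if high_resolution else 60,
        }],
    }


# Hypothetical application metric in the custom namespace "MyApp".
request = build_put_metric_data(
    "MyApp", "ResponseTime", 182.5, dimensions={"Service": "checkout"}
)
print(request)

# With boto3 installed and credentials configured, this could be sent as:
#   boto3.client("cloudwatch").put_metric_data(**request)
```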

One of the most common traps on the exam is that EC2 memory utilization and disk (filesystem) utilization are not standard metrics. These values live inside the guest OS, so the hypervisor cannot see them from outside the instance. To watch memory and disk with alarms, you must install the CloudWatch Agent on the instance and publish them as custom metrics. This pattern is almost a staple.
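A minimal CloudWatch Agent configuration for memory and disk might look like the following, written here as a Python dict and serialized to JSON. The namespace `MyApp/Host` and the choice of the `/` filesystem are assumptions for illustration; treat this as a sketch to compare against the agent documentation rather than a complete config.

```python
import json

# Sketch of the "metrics" section of a CloudWatch Agent config file.
agent_config = {
    "metrics": {
        "namespace": "MyApp/Host",  # hypothetical custom namespace
        # Tag every metric with the instance it came from.
        "append_dimensions": {"InstanceId": "${aws:InstanceId}"},
        "metrics_collected": {
            "mem": {"measurement": ["mem_used_percent"]},
            "disk": {"measurement": ["used_percent"], "resources": ["/"]},
        },
    }
}

config_json = json.dumps(agent_config, indent=2)
print(config_json)
```

The resulting metrics arrive as custom metrics, so alarms on them are created the same way as for any other metric.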

Alarms: the core of detection #

An alarm is a mechanism that periodically evaluates whether a metric meets a defined condition and changes its state. There are three states.

State	Meaning
`OK`	The metric is within the threshold
`ALARM`	The metric breaches the threshold
`INSUFFICIENT_DATA`	There isn’t enough data to evaluate (right after startup, missing data, etc.)

On the exam, the correct answer often hinges on three values you set when creating an alarm.

Period: the unit of time over which the metric is aggregated. e.g., 60 seconds
Evaluation Periods: how many periods to look at when judging
Datapoints to Alarm: how many of the evaluation periods must breach to go to ALARM

For example, with Period=60s, Evaluation Periods=5, Datapoints to Alarm=3, the alarm fires when 3 of the last 5 minutes exceed the threshold. This M of N configuration is the standard way to prevent alarms from misfiring on transient spikes, and it appears frequently on the exam.
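The M of N rule can be sketched in a few lines. This is a simplified model that assumes a "greater than threshold" comparison and no missing data; the function name `breaches_m_of_n` is made up for this example.

```python
def breaches_m_of_n(datapoints, threshold, evaluation_periods, datapoints_to_alarm):
    """Return True when at least M of the most recent N datapoints exceed the threshold."""
    window = datapoints[-evaluation_periods:]
    breaching = sum(1 for value in window if value > threshold)
    return breaching >= datapoints_to_alarm


# One CPU datapoint per 60-second period, oldest first.
sustained_load = [40, 95, 42, 91, 93]
single_spike = [40, 95, 42, 41, 43]

print(breaches_m_of_n(sustained_load, 80, 5, 3))  # True: 3 of the last 5 breach
print(breaches_m_of_n(single_spike, 80, 5, 3))    # False: only 1 of 5 breaches
```

With Datapoints to Alarm set to 1 instead of 3, the single spike would also trigger the alarm, which is exactly the noise the M of N setting is meant to filter out.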

Handling missing data #

You also configure how the alarm behaves when metric data doesn’t arrive.

Option	Behavior
`missing` (default)	Missing data points are not counted; if every data point in the evaluation range is missing, the alarm goes to `INSUFFICIENT_DATA`
`notBreaching`	Treat missing as normal
`breaching`	Treat missing as a breach
`ignore`	Keep the current alarm state

For metrics where data arrives sparsely, such as batch jobs, setting this wrong leaves the alarm stuck in INSUFFICIENT_DATA. The answer to the scenario “it’s an intermittent metric but the alarm won’t fire” is usually adjusting the missing-data handling option.
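The difference between the four options is easiest to see in the extreme case where every data point in the evaluation range is missing. The sketch below models only that case; the real evaluator has additional rules when some data points are present, and the function name is hypothetical.

```python
def state_when_all_datapoints_missing(treat_missing_data, current_state):
    """Resulting alarm state when no datapoints arrived in the evaluation range.

    Simplified model: covers only the all-missing case.
    """
    outcomes = {
        "missing": "INSUFFICIENT_DATA",
        "notBreaching": "OK",
        "breaching": "ALARM",
        "ignore": current_state,  # keep whatever state the alarm was in
    }
    return outcomes[treat_missing_data]


for option in ("missing", "notBreaching", "breaching", "ignore"):
    print(option, "->", state_when_all_datapoints_missing(option, current_state="OK"))
```

For a batch job that reports only a few times a day, the default `missing` parks the alarm in INSUFFICIENT_DATA between runs, while `notBreaching` keeps it in OK until a breaching value actually arrives.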

Alarm actions #

When an alarm changes state, it can trigger an action.

SNS notification: the most common form. Fans out to email, Slack, or Lambda
Auto Scaling policy: increases or decreases the ASG’s capacity
EC2 action: stop, terminate, reboot, or recover the instance
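Systems Manager action: create an OpsItem in OpsCenter or an incident in Incident Manager (Automation runbooks are typically started indirectly, for example through EventBridge)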

In particular, the EC2 automated recovery (recover) action recovers the same instance on new hardware when the instance is impaired by a hardware failure. “Automatically recover when an instance fails its system status check” is the staple pattern of attaching the recover action to a StatusCheckFailed_System metric alarm.
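The recover pattern can be sketched as the parameters of a metric alarm. The helper name `build_recover_alarm`, the instance ID, the alarm name, and the specific period and threshold values are assumptions for illustration; the field names follow the `PutMetricAlarm` API.

```python
def build_recover_alarm(instance_id, region):
    """Return keyword arguments shaped like a PutMetricAlarm request
    that recovers an instance when its system status check fails."""
    return {
        "AlarmName": f"recover-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 2,  # illustrative: two consecutive failed minutes
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # EC2 actions are referenced by a special "automate" ARN.
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:recover"],
    }


# Hypothetical instance ID and region.
recover_alarm = build_recover_alarm("i-0abc1234def567890", "us-east-1")
print(recover_alarm["AlarmActions"])

# With boto3 installed and credentials configured, this could be created as:
#   boto3.client("cloudwatch", region_name="us-east-1").put_metric_alarm(**recover_alarm)
```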

Composite alarms #

A composite alarm combines the states of several existing alarms using a logical expression (AND, OR, NOT).

ALARM("high-cpu") AND ALARM("high-latency")

The key benefit of a composite alarm is reducing alarm noise. If a notification arrives every time CPU briefly spikes, operators become desensitized. Combining alarms so that it only alerts “when both CPU and latency are high” means notifications only fire when there’s a real problem. It frequently appears as the answer to the scenario “there are too many notifications and I want to reduce them.”
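Here is a small sketch of the same idea: one helper builds the rule expression shown above, and another evaluates it locally against child alarm states to show the noise reduction. Both helper names, the composite alarm name, and the SNS topic ARN are hypothetical.

```python
def all_in_alarm_rule(*alarm_names):
    """Build a composite alarm rule that requires every named alarm to be in ALARM."""
    if not alarm_names:
        raise ValueError("at least one alarm name is required")
    return " AND ".join(f'ALARM("{name}")' for name in alarm_names)


def would_notify(child_states, *alarm_names):
    """Local simulation: the composite fires only if every child is in ALARM."""
    return all(child_states.get(name) == "ALARM" for name in alarm_names)


rule = all_in_alarm_rule("high-cpu", "high-latency")
print(rule)

# Request shaped like PutCompositeAlarm; the topic ARN is a placeholder.
composite_alarm = {
    "AlarmName": "cpu-and-latency",
    "AlarmRule": rule,
    "AlarmActions": ["arn:aws:sns:us-east-1:111122223333:ops-alerts"],
}

# A brief CPU spike alone does not notify; CPU and latency together do.
print(would_notify({"high-cpu": "ALARM", "high-latency": "OK"}, "high-cpu", "high-latency"))
print(would_notify({"high-cpu": "ALARM", "high-latency": "ALARM"}, "high-cpu", "high-latency"))
```

In this setup the child alarms would typically have no notification action of their own, so only the composite alarm reaches the operator.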

Dashboards #

A dashboard gathers multiple metric graphs and alarm states onto a single screen.

You can gather cross-region and cross-account widgets on one dashboard
Defined in JSON, so it can be reproduced as code via CloudFormation
Auto-refresh makes it usable as an operations status board (NOC)

On the exam, dashboards themselves carry little weight, but they show up as the answer to a requirement like “view metrics from multiple accounts and regions in one place.” Viewing metrics across accounts usually also requires setting up CloudWatch cross-account observability.
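Because a dashboard is just a JSON document, it can be generated in code. The sketch below builds a two-widget body with one widget per region; the helper name `cpu_widget`, the dashboard name, the instance IDs, and the regions are assumptions for illustration.

```python
import json


def cpu_widget(title, region, instance_id, x):
    """Return one metric widget showing CPUUtilization for a single instance."""
    return {
        "type": "metric",
        "x": x, "y": 0, "width": 12, "height": 6,
        "properties": {
            "title": title,
            "region": region,  # each widget can point at a different region
            "stat": "Average",
            "period": 300,
            "metrics": [["AWS/EC2", "CPUUtilization", "InstanceId", instance_id]],
        },
    }


# Hypothetical instances in two regions, placed side by side.
dashboard_body = json.dumps({
    "widgets": [
        cpu_widget("Web CPU (us-east-1)", "us-east-1", "i-0aaa111122223333a", x=0),
        cpu_widget("Web CPU (eu-west-1)", "eu-west-1", "i-0bbb111122223333b", x=12),
    ]
})
print(dashboard_body)

# With boto3 installed and credentials configured, this could be published as:
#   boto3.client("cloudwatch").put_dashboard(
#       DashboardName="ops-overview", DashboardBody=dashboard_body)
```

The same JSON string can be placed in the body of a CloudFormation dashboard resource, which is what makes dashboards reproducible as code.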

Metric Math and anomaly detection #

Metric Math: combines multiple metrics with a formula to create a new time series. For example, you compute “request success rate = success count / total count” and set an alarm on that result.
Anomaly Detection: learns past patterns to build a normal-range band, and fires an alarm when the metric leaves that band. You use it instead of a fixed threshold on workloads whose traffic fluctuates by time of day.

The answer to the scenario “the difference between daytime and nighttime traffic is large, so a fixed threshold makes alarms inaccurate” is anomaly detection.
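Both features are expressed through the `Metrics` list of a metric alarm. The sketch below builds one request for a Metric Math success-rate alarm and one for an anomaly detection alarm. The helper names, the `SuccessCount` and `TotalCount` metrics, the alarm names, and the evaluation settings are assumptions for illustration; the field names follow `PutMetricAlarm` as I understand it, so check them against the API reference before relying on them.

```python
def metric_stat(metric_id, namespace, name, stat, period=60, return_data=False):
    """Return one raw-metric entry for the Metrics list of an alarm."""
    return {
        "Id": metric_id,
        "ReturnData": return_data,
        "MetricStat": {
            "Metric": {"Namespace": namespace, "MetricName": name},
            "Period": period,
            "Stat": stat,
        },
    }


def build_success_rate_alarm(namespace, min_success_percent):
    """Alarm on a Metric Math expression: 100 * success / total."""
    return {
        "AlarmName": "low-success-rate",
        "ComparisonOperator": "LessThanThreshold",
        "Threshold": float(min_success_percent),
        "EvaluationPeriods": 5,
        "DatapointsToAlarm": 3,
        "Metrics": [
            metric_stat("success", namespace, "SuccessCount", "Sum"),
            metric_stat("total", namespace, "TotalCount", "Sum"),
            # Only the expression result is evaluated by the alarm.
            {"Id": "rate", "Expression": "100 * success / total",
             "Label": "SuccessRatePercent", "ReturnData": True},
        ],
    }


def build_anomaly_alarm(namespace, name, band_width=2):
    """Alarm when the metric leaves the learned anomaly detection band."""
    return {
        "AlarmName": f"{name}-anomaly",
        "ComparisonOperator": "LessThanLowerOrGreaterThanUpperThreshold",
        "EvaluationPeriods": 3,
        "ThresholdMetricId": "band",  # the band replaces a fixed Threshold
        "Metrics": [
            metric_stat("m1", namespace, name, "Average", period=300, return_data=True),
            {"Id": "band", "Expression": f"ANOMALY_DETECTION_BAND(m1, {band_width})",
             "Label": "Expected range", "ReturnData": True},
        ],
    }


rate_alarm = build_success_rate_alarm("MyApp", 99)
anomaly_alarm = build_anomaly_alarm("MyApp", "RequestCount")
print(rate_alarm["Metrics"][2]["Expression"])
print(anomaly_alarm["Metrics"][1]["Expression"])
```

Note that the anomaly detection request has no `Threshold` field at all: the band width controls sensitivity, with a larger number producing a wider band and fewer alarms.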

Exam question patterns #

EC2 memory/disk metrics aren’t visible → install the CloudWatch Agent (custom metrics)
Alarm misfires on a transient spike → raise Datapoints to Alarm for an M of N configuration
An intermittent metric but the alarm won’t fire → adjust the missing-data (treat missing data) option
Too many notifications → combine conditions with AND using a composite alarm
Time-of-day variation is large and you can’t fix a threshold → anomaly detection
You want to automatically revive on system status check failure → StatusCheckFailed_System alarm + EC2 recover action

Common Pitfalls #

1) Thinking memory/disk are in standard metrics #

EC2 standard metrics include CPU, network, and disk I/O, but not memory utilization or filesystem utilization inside the OS. The Agent is required.

2) Confusing Period and Evaluation Periods #

Period is “the time over which one data point is aggregated”; Evaluation Periods is “how many data points to look at.” Multiplying the two gives the length of the actual evaluation window: a 60-second Period with 5 Evaluation Periods covers 300 seconds.

3) Misunderstanding that a composite alarm evaluates metrics directly #

A composite alarm takes the states of other alarms as input, not metrics. The order is to first create single alarms and then combine them.

4) Ignoring the cost of high-resolution metrics #

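Running 1-second-resolution custom metrics usually costs more than running standard 60-second metrics: publishing every second means far more PutMetricData calls, and high-resolution alarms are priced higher than standard ones. An answer like “everything at 1 second” is usually wrong in a cost-constrained scenario.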

Summary #

What we covered in this post:

A metric is a time series organized by Namespace, Metric, Dimension, and Resolution. Narrow the target down by dimension
EC2 memory and disk are not standard metrics. Publish them as custom metrics via the CloudWatch Agent
Alarms evaluate M of N using Period, Evaluation Periods, and Datapoints to Alarm. Missing-data handling is a separate setting
Alarm actions are SNS, Auto Scaling, EC2 recover, and Systems Manager. EC2 automated recovery is a staple for system status check failures
Reduce alarm noise by combining conditions with a composite alarm. Use anomaly detection for dynamic thresholds on highly variable workloads
Dashboards can be reproduced as code via their JSON definition, with integrated cross-account and cross-region observation

Next: Domain 1-2 CloudWatch Logs and Logs Insights #

Now that metrics have captured “what happened,” next come logs, the detailed records of those events.

In #3 Domain 1-2 Monitoring — CloudWatch Logs, Logs Insights, and the Agent, I’ll cover the structure of log groups and log streams, how to collect logs with the CloudWatch Agent, log-based metric filters, how to analyze large volumes of logs with Logs Insights queries, and how to set up subscription filters to deliver logs in real time.