AWS Certified CloudOps Engineer - Associate (SOA-C03) #2 Domain 1-1 Monitoring — CloudWatch Metrics, Alarms, and Dashboards
In #1 Exam Introduction, I noted that among SOA-C03’s five domains, Monitoring, Logging, Recovery, and Performance is the largest at 22%. The starting point of that domain is CloudWatch. CloudWatch is not just a graphing tool — it is the observability layer that captures everything happening in your AWS environment as numbers (metrics) and records (logs). Nearly every operational action begins with “CloudWatch detects something.”
This post covers the first half of that observability layer: metrics, alarms, and dashboards. Logs and Logs Insights come in #3, and automated recovery after detection continues in #4.
What is a metric? #
A metric is a time series of numeric data points recorded over time. An EC2 instance’s CPUUtilization, an ELB’s RequestCount, and an SQS queue’s ApproximateNumberOfMessagesVisible are all metrics. CloudWatch organizes these metrics with the following structure.
| Concept | Description | Example |
|---|---|---|
| Namespace | A grouping of metrics, isolated per service | AWS/EC2, AWS/RDS; custom namespaces such as MyApp |
| Metric name | The name of the measured item | CPUUtilization, RequestCount |
| Dimension | A key-value pair that identifies the metric | InstanceId=i-0abc..., AutoScalingGroupName=web-asg |
| Resolution | The interval between data points | Standard 60 seconds, high-resolution 1 second |
The key is the dimension. The same CPUUtilization metric is recorded as a separate time series for each InstanceId, and you can also view it aggregated by Auto Scaling group name. On the exam, the scenario “I want to set an alarm on just one specific instance” is ultimately a problem of narrowing the metric down by dimension.
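To make the role of dimensions concrete, here is a minimal Python sketch that models a metric's identity as namespace + metric name + dimension set. The `MetricKey` class, the `metric_key` helper, and the instance IDs are hypothetical and exist only for illustration; this is not a CloudWatch API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricKey:
    """Hypothetical model: namespace, name, and dimensions identify one time series."""
    namespace: str
    name: str
    dimensions: frozenset  # set of (key, value) pairs


def metric_key(namespace, name, **dimensions):
    """Build a hashable identity for a metric from its dimensions."""
    return MetricKey(namespace, name, frozenset(dimensions.items()))


a = metric_key("AWS/EC2", "CPUUtilization", InstanceId="i-0aaa1111")
b = metric_key("AWS/EC2", "CPUUtilization", InstanceId="i-0bbb2222")
by_asg = metric_key("AWS/EC2", "CPUUtilization", AutoScalingGroupName="web-asg")

# Same namespace and metric name, but three distinct time series.
distinct_series = {a, b, by_asg}
print(len(distinct_series))
```

An alarm on "just one instance" corresponds to picking exactly one of these keys.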
Standard metrics vs. custom metrics #
| Category | Standard metrics | Custom metrics |
|---|---|---|
| Provider | Automatically published by the AWS service | Published by the user via PutMetricData |
| Example | EC2 CPU, ELB latency | Application response time, queue backlog length |
| Memory/disk | Not in EC2 standard metrics | Collected and published by the CloudWatch Agent |
One of the most common traps on the exam is that EC2 memory utilization and disk utilization are not standard metrics. Both are measured inside the guest OS, which the hypervisor cannot see into. To alarm on memory and disk, you must install the CloudWatch Agent on the instance and publish them as custom metrics. This pattern is an exam staple.
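As a rough illustration, the sketch below builds a minimal CloudWatch Agent configuration that collects memory and disk utilization. The chosen measurements, the `/` mount point, and the use of the `CWAgent` namespace are assumptions for this example; a real configuration would be tuned to your instances and written to the agent's JSON config file.

```python
import json

# Assumed minimal agent configuration (illustrative values only).
agent_config = {
    "metrics": {
        "namespace": "CWAgent",
        # Tag each data point with the instance it came from.
        "append_dimensions": {"InstanceId": "${aws:InstanceId}"},
        "metrics_collected": {
            "mem": {"measurement": ["mem_used_percent"]},
            "disk": {"measurement": ["used_percent"], "resources": ["/"]},
        },
    }
}

config_json = json.dumps(agent_config, indent=2)
print(config_json)
```

Because the agent attaches InstanceId as a dimension, alarms on these custom metrics can be narrowed to a single instance just like alarms on standard metrics.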
Alarms: the core of detection #
An alarm is a mechanism that periodically evaluates whether a metric meets a defined condition and changes its state. There are three states.
| State | Meaning |
|---|---|
| OK | The metric is within the threshold |
| ALARM | The metric breaches the threshold |
| INSUFFICIENT_DATA | There isn’t enough data to evaluate (right after startup, missing data, etc.) |
Three settings chosen when creating an alarm determine when it changes state, and exam answers often hinge on them.
- Period: the unit of time over which the metric is aggregated. e.g., 60 seconds
- Evaluation Periods: how many periods to look at when judging
- Datapoints to Alarm: how many of the evaluation periods must breach to go to ALARM
For example, with Period=60s, Evaluation Periods=5, Datapoints to Alarm=3, the alarm fires when 3 of the last 5 minutes exceed the threshold. This M of N configuration is the standard way to prevent alarms from misfiring on transient spikes, and it appears frequently on the exam.
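The M of N rule can be sketched in a few lines of Python. This is a simplified model, assuming a "greater than threshold" comparison and no missing data; the function name `m_of_n_breached` and the sample values are hypothetical.

```python
def m_of_n_breached(datapoints, threshold, evaluation_periods, datapoints_to_alarm):
    """Return True when at least M of the last N data points exceed the threshold."""
    window = datapoints[-evaluation_periods:]
    breaching = sum(1 for value in window if value > threshold)
    return breaching >= datapoints_to_alarm


# One data point per 60-second period, CPU threshold of 80%.
single_spike = [40, 95, 42, 41, 43]
sustained_load = [85, 40, 90, 88, 45]

print(m_of_n_breached(single_spike, 80, 5, 3))    # False: only 1 of 5 breaches
print(m_of_n_breached(sustained_load, 80, 5, 3))  # True: 3 of 5 breach
```

With 3 of 5, a single transient spike is absorbed while sustained load still triggers the alarm.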
Handling missing data #
You also configure how the alarm behaves when metric data doesn’t arrive.
| Option | Behavior |
|---|---|
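| missing (default) | Missing data points are left out of the evaluation; if every data point in the window is missing, the alarm goes to INSUFFICIENT_DATA |
| notBreaching | Treat missing data points as within the threshold |
| breaching | Treat missing data points as breaching the threshold |
| ignore | Keep the current alarm state |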
For metrics where data arrives sparsely, such as batch jobs, setting this wrong leaves the alarm stuck in INSUFFICIENT_DATA. The answer to the scenario “it’s an intermittent metric but the alarm won’t fire” is usually adjusting the missing-data handling option.
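The sketch below models only the simplest case, where every data point in the evaluation window is missing. Windows with partially missing data follow more detailed rules that are not modeled here, and the function name `state_when_all_missing` is hypothetical.

```python
def state_when_all_missing(treat_missing_data, current_state):
    """Resulting alarm state when no data points arrived in the evaluation window."""
    outcomes = {
        "missing": "INSUFFICIENT_DATA",  # default: nothing to evaluate
        "notBreaching": "OK",            # missing counts as within the threshold
        "breaching": "ALARM",            # missing counts as a breach
        "ignore": current_state,         # state is left unchanged
    }
    return outcomes[treat_missing_data]


for option in ("missing", "notBreaching", "breaching", "ignore"):
    print(option, "->", state_when_all_missing(option, current_state="OK"))
```

For a sparse batch-job metric, the default is what leaves the alarm sitting in INSUFFICIENT_DATA between runs.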
Alarm actions #
When an alarm changes state, it can trigger an action.
- SNS notification: the most common form. Fans out to email, Slack, or Lambda
- Auto Scaling policy: increases or decreases the ASG’s capacity
- EC2 action: stop, terminate, reboot, or recover the instance
- Systems Manager action: create an OpsItem, run Automation
In particular, the EC2 automated recovery (recover) action recovers the same instance on new hardware when the instance is impaired by a hardware failure. “Automatically recover when an instance fails its system status check” is the staple pattern of attaching the recover action to a StatusCheckFailed_System metric alarm.
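As an illustration of this pattern, the following sketch builds the parameters for such an alarm as a plain dictionary. The helper name `recover_alarm_params`, the instance ID, and the Period and EvaluationPeriods values are assumptions for the example, not recommended settings.

```python
def recover_alarm_params(instance_id, region):
    """Parameters for an alarm that recovers an instance on system status check failure."""
    return {
        "AlarmName": f"recover-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,            # assumed value
        "EvaluationPeriods": 2,  # assumed value
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        # The EC2 recover action is referenced by a special automate ARN.
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:recover"],
    }


params = recover_alarm_params("i-0123456789abcdef0", "us-east-1")
print(params["AlarmActions"][0])

# With boto3 installed and credentials configured, the dictionary could be used as:
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```

The dimension ties the alarm to one instance, which is also the instance the recover action acts on.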
Composite alarms #
A composite alarm strings several single-metric alarms together and combines them with a logical expression (AND, OR, NOT).
ALARM("high-cpu") AND ALARM("high-latency")
The key benefit of a composite alarm is reducing alarm noise. If a notification arrives every time CPU briefly spikes, operators become desensitized. Combining alarms so that it only alerts “when both CPU and latency are high” means notifications only fire when there’s a real problem. It frequently appears as the answer to the scenario “there are too many notifications and I want to reduce them.”
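A small sketch of how such a rule expression might be assembled in Python. The `alarm_rule` helper, the composite alarm name, and the SNS topic ARN are hypothetical; the child alarms `high-cpu` and `high-latency` are assumed to exist already.

```python
def alarm_rule(operator, *alarm_names):
    """Join child alarm names into a composite alarm rule expression."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be AND or OR")
    return f" {operator} ".join(f'ALARM("{name}")' for name in alarm_names)


rule = alarm_rule("AND", "high-cpu", "high-latency")
print(rule)

composite_params = {
    "AlarmName": "cpu-and-latency",  # hypothetical name
    "AlarmRule": rule,
    "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # placeholder ARN
}

# With boto3 installed and credentials configured, this could be used as:
#   boto3.client("cloudwatch").put_composite_alarm(**composite_params)
```

Only the composite alarm carries the notification action here, so the child alarms can change state without paging anyone.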
Dashboards #
A dashboard gathers multiple metric graphs and alarm states onto a single screen.
- You can gather cross-region and cross-account widgets on one dashboard
- Defined in JSON, so it can be reproduced as code via CloudFormation
- Auto-refresh makes it usable as an operations status board (NOC)
On the exam, dashboards themselves carry little weight, but they show up as the answer to a requirement like “view metrics from multiple accounts and regions in one place.” Viewing metrics from other accounts usually also requires setting up CloudWatch cross-account observability.
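To show what "defined in JSON" looks like, here is a minimal sketch that builds a dashboard body with two CPU widgets from different regions. The `cpu_widget` helper, the instance IDs, and the regions are placeholders, and cross-account widgets are not shown.

```python
import json


def cpu_widget(title, instance_id, region, x=0, y=0):
    """One metric widget showing CPUUtilization for a single instance."""
    return {
        "type": "metric",
        "x": x, "y": y, "width": 12, "height": 6,
        "properties": {
            "title": title,
            "region": region,
            "stat": "Average",
            "period": 300,
            "metrics": [["AWS/EC2", "CPUUtilization", "InstanceId", instance_id]],
        },
    }


dashboard = {
    "widgets": [
        cpu_widget("Web (us-east-1)", "i-0aaa1111", "us-east-1"),
        cpu_widget("Web (ap-northeast-1)", "i-0bbb2222", "ap-northeast-1", x=12),
    ]
}
dashboard_body = json.dumps(dashboard)
print(dashboard_body)

# With boto3 installed and credentials configured, this could be used as:
#   boto3.client("cloudwatch").put_dashboard(
#       DashboardName="ops-overview", DashboardBody=dashboard_body)
```

Because the body is plain JSON, the same definition can be kept in version control and deployed through a template.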
Metric Math and anomaly detection #
- Metric Math: combines multiple metrics with a formula to create a new metric. For example, you compute “request success rate = success count / total count” and set an alarm on that result.
- Anomaly Detection: learns past patterns to build a normal-range band, and fires an alarm when the metric leaves that band. You use it instead of a fixed threshold on workloads whose traffic fluctuates by time of day.
The answer to the scenario “the difference between daytime and nighttime traffic is large, so a fixed threshold makes alarms inaccurate” is anomaly detection.
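Both ideas can be expressed in the `Metrics` structure that metric alarms accept. The sketch below is illustrative only: the `MyApp` namespace, the `SuccessCount` and `RequestCount` metric names, the alarm name, the band width of 2, and the `ReturnData` choices are all assumptions.

```python
def metric_stat(metric_id, namespace, name, stat="Sum", period=60):
    """One raw metric entry for an alarm's Metrics list."""
    return {
        "Id": metric_id,
        "MetricStat": {
            "Metric": {"Namespace": namespace, "MetricName": name},
            "Period": period,
            "Stat": stat,
        },
    }


# Metric Math: success rate (%) derived from two hypothetical custom metrics.
success_rate_metrics = [
    {**metric_stat("ok", "MyApp", "SuccessCount"), "ReturnData": False},
    {**metric_stat("total", "MyApp", "RequestCount"), "ReturnData": False},
    {"Id": "rate", "Expression": "100 * ok / total",
     "Label": "SuccessRatePercent", "ReturnData": True},
]

# Anomaly detection: alarm when the metric leaves the expected band.
anomaly_alarm = {
    "AlarmName": "request-count-anomaly",
    "ComparisonOperator": "LessThanLowerOrGreaterThanUpperThreshold",
    "EvaluationPeriods": 3,
    "ThresholdMetricId": "band",
    "Metrics": [
        metric_stat("m1", "MyApp", "RequestCount"),
        {"Id": "band", "Expression": "ANOMALY_DETECTION_BAND(m1, 2)",
         "Label": "Expected range"},
    ],
}

print(success_rate_metrics[-1]["Expression"])
print(anomaly_alarm["Metrics"][-1]["Expression"])
```

Note that the anomaly alarm has no fixed Threshold value; it points at the band expression through ThresholdMetricId instead.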
Exam question patterns #
- EC2 memory/disk metrics aren’t visible → install the CloudWatch Agent (custom metrics)
- Alarm misfires on a transient spike → raise Datapoints to Alarm for an M of N configuration
- An intermittent metric but the alarm won’t fire → adjust the missing-data (treat missing data) option
- Too many notifications → combine conditions with AND using a composite alarm
- Time-of-day variation is large and you can’t fix a threshold → anomaly detection
- You want to automatically revive on system status check failure → StatusCheckFailed_System alarm + EC2 recover action
Common Pitfalls #
1) Thinking memory/disk are in standard metrics #
EC2 standard metrics include CPU, network, and disk I/O, but not memory utilization or filesystem utilization inside the OS. The Agent is required.
2) Confusing Period and Evaluation Periods #
Period is “the time over which one data point is aggregated”; Evaluation Periods is “how many to look at.” The two multiplied together is the length of the actual judgment window.
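The relationship is simple arithmetic, sketched here with a hypothetical helper name:

```python
def evaluation_window_seconds(period_seconds, evaluation_periods):
    """Length of the window an alarm looks at when judging its state."""
    return period_seconds * evaluation_periods


print(evaluation_window_seconds(60, 5))   # 300 seconds = 5 minutes
print(evaluation_window_seconds(300, 3))  # 900 seconds = 15 minutes
```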
3) Misunderstanding that a composite alarm evaluates metrics directly #
A composite alarm takes the states of other alarms as input, not metrics. The order is to first create single alarms and then combine them.
4) Ignoring the cost of high-resolution metrics #
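Monitoring at 1-second resolution costs more than at the standard 60 seconds: high-resolution alarms are priced higher than standard alarms, and publishing data points more often means more PutMetricData calls. An answer like “everything at 1 second” is usually wrong in a cost-constrained scenario.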
Summary #
What we covered in this post:
- A metric is a time series organized by Namespace, Metric, Dimension, and Resolution. Narrow the target down by dimension
- EC2 memory and disk are not standard metrics. Publish them as custom metrics via the CloudWatch Agent
- Alarms judge with M of N via Period, Evaluation Periods, and Datapoints to Alarm. The missing-data handling option is separate
- Alarm actions are SNS, Auto Scaling, EC2 recover, and Systems Manager. EC2 automated recovery is a staple for system status check failures
- Reduce alarm noise by combining conditions with a composite alarm. Use anomaly detection for dynamic thresholds on highly variable workloads
- Dashboards can be reproduced as code via their JSON definition, with integrated cross-account and cross-region observation
Next: Domain 1-2 CloudWatch Logs and Logs Insights #
Now that metrics have captured “what happened,” next come logs, the detailed records of those events.
In #3 Domain 1-2 Monitoring — CloudWatch Logs, Logs Insights, and the Agent, I’ll cover the structure of log groups and log streams, how to collect logs with the CloudWatch Agent, log-based metric filters, how to analyze large volumes of logs with Logs Insights queries, and how to set up subscription filters to deliver logs in real time.