AWS Certified CloudOps Engineer - Associate (SOA-C03) #3 Domain 1-2 Monitoring — CloudWatch Logs, Logs Insights, Agent

In #2 we used metrics to organize “what happened.” Metrics are numbers, so you can spot trends quickly, but they don’t tell you why it happened. That answer lives in the logs. This post organizes the structure of CloudWatch Logs and the operational flow of collecting, analyzing, and delivering logs.

Log structure: log groups and log streams #

CloudWatch Logs organizes logs in two levels.

Concept | Description | Example
Log Group | A bundle of logs of the same kind. The unit of retention, permissions, and encryption | /aws/lambda/my-func, /var/log/nginx
Log Stream | The sequence of log events from a single source (instance, container, Lambda execution environment) | One stream per instance ID

The unit of configuration is the log group. You set retention, KMS encryption, and access permissions on the log group. Log streams hold the actual log events within it, split by source.

Retention and cost #

The default retention setting of a log group is "Never expire." In other words, if you don’t configure it, logs pile up forever and storage cost keeps rising. In operations, setting a retention period (e.g., 30 days, 90 days) on each log group is the baseline. If you need long-term retention, the standard approach is to export the logs to S3 and move them to a cheaper storage class.

For the “log cost keeps increasing” scenario, the first answer is setting retention, and long-term retention is S3 export + lifecycle.
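As a minimal sketch of that first step, the helper below finds log groups that still have no retention policy. It assumes input shaped like the `logGroups` list returned by the DescribeLogGroups API, where a group set to "Never expire" simply has no `retentionInDays` key. The helper name and the sample data are hypothetical, and the boto3 calls are shown only as comments.

```python
from typing import Iterable


def groups_without_retention(log_groups: Iterable[dict]) -> list[str]:
    """Return the names of log groups that have no retention policy set."""
    return [g["logGroupName"] for g in log_groups if "retentionInDays" not in g]


# Hypothetical sample, shaped like the "logGroups" list from DescribeLogGroups.
sample = [
    {"logGroupName": "/aws/lambda/my-func", "retentionInDays": 30},
    {"logGroupName": "/var/log/nginx"},  # no key: logs never expire
]

unbounded = groups_without_retention(sample)
print(unbounded)

# With boto3, applying a 30-day policy would look roughly like this
# (not executed here, since it needs AWS credentials):
#   logs = boto3.client("logs")
#   for name in unbounded:
#       logs.put_retention_policy(logGroupName=name, retentionInDays=30)
```

Running a check like this on a schedule is one way to catch new log groups that were created without a retention setting.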

CloudWatch Agent: collecting logs and OS metrics #

Log files and OS metrics from EC2 or on-premises servers don’t flow into CloudWatch by default. You have to install the CloudWatch Agent.

  • Log collection — sends files like /var/log/... to a designated log group
  • OS metric collection — publishes the memory and disk utilization we saw in #2 as custom metrics
  • Configuration — an agent config file (JSON). The recommended pattern is to store the config in Systems Manager Parameter Store and deploy it consistently to many instances
  • Permissions — the instance needs an IAM Role with CloudWatchAgentServerPolicy attached

When the exam says “deploy the same agent configuration consistently to multiple EC2 instances,” the answer is usually store the config in SSM Parameter Store + deploy the agent with SSM.
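To make the config file concrete, here is a rough sketch that builds a small agent configuration as a Python dict and serializes it to JSON. The section and key names follow the CloudWatch Agent configuration file format as I understand it; the file path, log group name, namespace, and the function name are placeholders chosen for illustration.

```python
import json


def build_agent_config(log_path: str, log_group: str, namespace: str = "MyApp/Host") -> dict:
    """Build a minimal CloudWatch Agent config: one log file plus memory and disk metrics."""
    return {
        "agent": {"metrics_collection_interval": 60},
        "logs": {
            "logs_collected": {
                "files": {
                    "collect_list": [
                        {
                            "file_path": log_path,
                            "log_group_name": log_group,
                            # The agent replaces {instance_id} with the EC2 instance ID.
                            "log_stream_name": "{instance_id}",
                        }
                    ]
                }
            }
        },
        "metrics": {
            "namespace": namespace,
            "append_dimensions": {"InstanceId": "${aws:InstanceId}"},
            "metrics_collected": {
                "mem": {"measurement": ["mem_used_percent"]},
                "disk": {"measurement": ["used_percent"], "resources": ["/"]},
            },
        },
    }


config_json = json.dumps(
    build_agent_config("/var/log/nginx/access.log", "/var/log/nginx"), indent=2
)
print(config_json)

# Storing it for fleet-wide use could look like this with boto3 (not executed here).
# The parameter name is an assumption; the managed policy is generally scoped to
# parameters whose names start with "AmazonCloudWatch-".
#   ssm = boto3.client("ssm")
#   ssm.put_parameter(Name="AmazonCloudWatch-linux", Type="String",
#                     Value=config_json, Overwrite=True)
```

Keeping the config in one parameter means every instance reads the same JSON, so a change is made once and rolled out through SSM rather than edited host by host.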

Metric Filter: extracting metrics from logs #

Logs are text, but you can convert the number of occurrences of a specific pattern into a metric. That’s a metric filter.

For example, you count the ERROR string in an application log to create an ErrorCount metric, and you set an alarm on that metric.

Filter pattern: "ERROR"
→ Metric: MyApp/ErrorCount
→ Alarm: notify via SNS when ERROR exceeds 10 in 5 minutes

The key point is that you can’t set an alarm on the log itself. The order is: turn the log into a metric, then set an alarm on that metric. The standard implementation of “notify when a specific message appears in the log” is a metric filter + alarm.
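The two steps can be sketched as the request parameters you would hand to boto3's `put_metric_filter` and `put_metric_alarm`. This is only an illustration: the log group name, filter name, alarm name, and SNS topic ARN are hypothetical, and the threshold mirrors the "more than 10 in 5 minutes" example above.

```python
def metric_filter_params(log_group: str) -> dict:
    """Parameters for logs.put_metric_filter: count lines containing ERROR."""
    return {
        "logGroupName": log_group,
        "filterName": "error-count",  # hypothetical name
        "filterPattern": '"ERROR"',
        "metricTransformations": [
            {
                "metricNamespace": "MyApp",
                "metricName": "ErrorCount",
                "metricValue": "1",  # each matching event adds 1
                "defaultValue": 0,   # report 0 when nothing matches
            }
        ],
    }


def alarm_params(sns_topic_arn: str) -> dict:
    """Parameters for cloudwatch.put_metric_alarm: more than 10 errors in 5 minutes."""
    return {
        "AlarmName": "myapp-error-count",  # hypothetical name
        "Namespace": "MyApp",
        "MetricName": "ErrorCount",
        "Statistic": "Sum",
        "Period": 300,  # seconds
        "EvaluationPeriods": 1,
        "Threshold": 10,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }


filter_request = metric_filter_params("/my-app/application")
alarm_request = alarm_params("arn:aws:sns:us-east-1:123456789012:ops-alerts")
print(filter_request["metricTransformations"][0]["metricName"], alarm_request["Threshold"])

# With boto3 (not executed here):
#   boto3.client("logs").put_metric_filter(**filter_request)
#   boto3.client("cloudwatch").put_metric_alarm(**alarm_request)
```

Note that the alarm refers only to the namespace and metric name. It has no reference to the log group at all, which is exactly why the metric filter has to exist first.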

Logs Insights: querying logs #

Logs Insights is a tool that analyzes large volumes of logs with its own query language: commands chained with pipes, with SQL-like aggregation such as stats ... by.

fields @timestamp, @message
| filter @message like /ERROR/
| stats count(*) by bin(5m)
| sort @timestamp desc

Where a metric filter “continuously aggregates a predefined pattern,” Logs Insights is a tool for digging in ad hoc after the fact. The answer to the troubleshooting scenario “analyze the logs from the time window of the incident to find the cause” is Logs Insights. You can query multiple log groups at once.
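For the incident scenario, a hedged sketch of how the query could be started from code: the helper below builds the parameters for boto3's `start_query`, which takes the time window as epoch seconds. The log group names and the incident window are made up for the example, and the query is a simplified variant that only lists matching lines.

```python
from datetime import datetime, timezone

QUERY = """fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 50"""


def start_query_params(log_groups, start: datetime, end: datetime) -> dict:
    """Build parameters for logs.start_query over an incident time window."""
    if start.tzinfo is None or end.tzinfo is None:
        raise ValueError("use timezone-aware datetimes to avoid local-time surprises")
    if end <= start:
        raise ValueError("end must be after start")
    return {
        "logGroupNames": list(log_groups),
        "startTime": int(start.timestamp()),  # epoch seconds
        "endTime": int(end.timestamp()),
        "queryString": QUERY,
    }


# Hypothetical incident window: one hour on 2024-01-01 (UTC).
params = start_query_params(
    ["/aws/lambda/my-func", "/var/log/nginx"],
    datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc),
    datetime(2024, 1, 1, 1, 0, tzinfo=timezone.utc),
)
print(params["startTime"], params["endTime"])

# With boto3 (not executed here), the query runs asynchronously:
#   logs = boto3.client("logs")
#   query_id = logs.start_query(**params)["queryId"]
#   result = logs.get_query_results(queryId=query_id)  # poll until status is "Complete"
```

Passing several names in `logGroupNames` is how the "query multiple log groups at once" point shows up in the API.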

Subscription Filter: real-time log delivery #

A subscription filter streams log events to another destination in near real time as they arrive. The destination can be one of the following.

Destination | Use
Lambda | Process, alert on, or deliver logs in real time
Kinesis Data Streams | Process high-volume logs as a stream
Amazon Data Firehose (formerly Kinesis Data Firehose) | Load into S3, OpenSearch, etc.
OpenSearch Service | Log search and visualization (OpenSearch Dashboards)

The answer to “send logs to an analytics system in real time” or “gather logs from multiple accounts into one place” is a subscription filter. For the cross-account case, the receiving account exposes a log destination and the sending accounts point their subscription filters at it.
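When the destination is Lambda, the function receives the log events as a base64-encoded, gzip-compressed JSON document under `event["awslogs"]["data"]`. The sketch below decodes that payload and picks out error lines. The handler logic and the sample event are illustrative only; a real function would forward or alert rather than return a list.

```python
import base64
import gzip
import json


def decode_subscription_event(event: dict) -> dict:
    """Decode the payload CloudWatch Logs delivers to a Lambda subscription."""
    compressed = base64.b64decode(event["awslogs"]["data"])
    return json.loads(gzip.decompress(compressed))


def handler(event, context):
    """Return the messages that contain ERROR (illustrative processing only)."""
    payload = decode_subscription_event(event)
    return [e["message"] for e in payload["logEvents"] if "ERROR" in e["message"]]


# Local round trip with a hypothetical payload, so the sketch runs without AWS.
sample_payload = {
    "messageType": "DATA_MESSAGE",
    "logGroup": "/aws/lambda/my-func",
    "logStream": "hypothetical-stream",
    "logEvents": [
        {"id": "1", "timestamp": 1704067200000, "message": "ERROR db timeout"},
        {"id": "2", "timestamp": 1704067201000, "message": "INFO request ok"},
    ],
}
sample_event = {
    "awslogs": {
        "data": base64.b64encode(
            gzip.compress(json.dumps(sample_payload).encode("utf-8"))
        ).decode("ascii")
    }
}

errors = handler(sample_event, None)
print(errors)
```

Building the sample event by compressing and encoding a known payload makes it easy to unit test the handler before wiring up the actual subscription.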

Log security: encryption and access #

  • Encryption — attach a KMS key to the log group to encrypt at rest. Meets compliance requirements for sensitive logs
  • Access control — control read/write permissions on the log group with IAM
  • Relationship with CloudTrail — you can also send CloudTrail logs (the record of API calls) to CloudWatch Logs and set metric filters and alarms. For example, “notify on root account login” is implemented as CloudTrail → CloudWatch Logs → metric filter → alarm
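As an illustration of the CloudTrail case, the sketch below pairs a commonly used filter pattern for root account activity with a small Python predicate that approximates the same condition, so the logic can be checked locally. The predicate is my own approximation rather than how CloudWatch evaluates patterns, and the sample events are trimmed, hypothetical CloudTrail records.

```python
# Filter pattern commonly used for root account usage on a CloudTrail log group.
ROOT_USAGE_PATTERN = (
    '{ $.userIdentity.type = "Root" '
    "&& $.userIdentity.invokedBy NOT EXISTS "
    '&& $.eventType != "AwsServiceEvent" }'
)


def is_root_activity(event: dict) -> bool:
    """Approximate the pattern above for a single CloudTrail event (local check only)."""
    identity = event.get("userIdentity", {})
    return (
        identity.get("type") == "Root"
        and "invokedBy" not in identity
        and event.get("eventType") != "AwsServiceEvent"
    )


# Hypothetical, heavily trimmed CloudTrail events.
root_login = {
    "eventName": "ConsoleLogin",
    "eventType": "AwsConsoleSignIn",
    "userIdentity": {"type": "Root"},
}
iam_user_call = {
    "eventName": "DescribeInstances",
    "eventType": "AwsApiCall",
    "userIdentity": {"type": "IAMUser", "userName": "alice"},
}

flags = [is_root_activity(e) for e in (root_login, iam_user_call)]
print(flags)

# The pattern string would be used as the filterPattern of a metric filter on the
# log group that receives CloudTrail events, followed by an alarm on that metric.
```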

Exam Question Patterns #

  • Log cost keeps increasing → set log group retention, long-term retention via S3 export
  • Deploy agent config consistently to multiple EC2 → SSM Parameter Store + SSM deployment
  • Notify when a specific message appears in the log → metric filter + alarm
  • Analyze logs from an incident time window after the fact → Logs Insights
  • Deliver logs to another system in real time → subscription filter (Lambda, Kinesis, OpenSearch)
  • Watch API events like root login or permission changes → CloudTrail → CloudWatch Logs → metric filter

Common Pitfalls #

1) Thinking you can set an alarm directly on a log #

CloudWatch alarms apply only to metrics. A log becomes an alarm target only after a metric filter turns it into a metric.

2) Assuming logs are collected without an agent #

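Managed services send logs without an agent: Lambda does so automatically, and ECS does so once the awslogs log driver is configured in the task definition. File logs from EC2 and on-premises servers, however, only flow in through the CloudWatch Agent.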

3) Assuming the default retention is short #

The default retention isn’t short: it’s "Never expire." You have to set a shorter period explicitly to manage cost.

4) Trying to use Logs Insights for continuous aggregation #

Logs Insights is an ad hoc query tool. For continuous monitoring and alarms, the right approach is to create a metric with a metric filter and set the alarm on it.

Summary #

What we covered in this post:

  • Logs consist of log groups (the configuration unit) and log streams (per source). Retention, encryption, and permissions go on the log group
  • Default retention is "Never expire" → set retention to manage cost, long-term retention via S3 export
  • Use the CloudWatch Agent to collect EC2 file logs and memory/disk metrics. Deploy the config consistently via SSM Parameter Store
  • Use a metric filter to turn logs into metrics and alarm on them. You can’t alarm on the log itself
  • Logs Insights is for after-the-fact ad hoc queries (troubleshooting); a subscription filter is for real-time delivery (Lambda, Kinesis, OpenSearch)
  • Send CloudTrail logs to CloudWatch Logs to build API event alarms

Next: Domain 1-3 Auto Recovery and Performance Optimization #

Now that we’ve covered detection with metrics and logs, the next step is automated response after detection. In #4 Domain 1-3 Monitoring: Auto Recovery and Performance Optimization, I’ll cover how to react to events with EventBridge, how to automate recovery with Systems Manager Automation, EC2 auto recovery and Auto Scaling’s self-healing, and the flow of diagnosing and optimizing performance with Compute Optimizer and CloudWatch.
