AWS Certified CloudOps Engineer - Associate (SOA-C03) #4 Domain 1-3 Monitoring — Automated Recovery and Performance Optimization

Posts #2 and #3 covered detection through metrics and logs. The next step in operations is to act on what was detected automatically, with no human hands involved. That is why SOA-C03 questions so often carry the condition “without manual intervention.” This post covers the building blocks of automated recovery, together with performance optimization, the last axis of the monitoring domain.

EventBridge — the hub that reacts to events #

EventBridge is a router: it matches events that occur in AWS against rules and forwards the matching events to targets. It’s the starting point of automated recovery.

| Component | Description |
| --- | --- |
| Event source | Where events come from: AWS services, custom applications, SaaS partners |
| Rule | Filters by event pattern, or triggers on a schedule (cron) |
| Target | Where the event is sent: Lambda, SSM Automation, SNS, Step Functions, etc. |

Two triggering methods are key.

  • Event pattern: reacts to state changes, such as “when an EC2 instance changes to the stopped state” or “when a Health event occurs”
  • Schedule: runs periodically with a cron/rate expression. It directly inherits the role of the old CloudWatch Events

An EventBridge rule is the starting point of the scenario “when an instance reaches a certain state, automatically do something.”
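As a rough illustration of the two trigger types, here is a minimal sketch in Python. The event pattern for a stopped EC2 instance and the two schedule expressions use the standard EventBridge syntax; the `matches` function is a deliberately simplified, hypothetical stand-in for EventBridge's real matching (which also supports prefix, numeric, and other operators), and the sample event carries only the fields needed here.

```python
import json

# Event pattern: react when an EC2 instance enters the "stopped" state.
STOPPED_PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["stopped"]},
}

# Schedule expressions: a fixed rate, and a cron expression (daily at 12:00 UTC).
SCHEDULES = {
    "every_5_minutes": "rate(5 minutes)",
    "daily_noon_utc": "cron(0 12 * * ? *)",
}


def matches(pattern, event):
    """Simplified matcher (assumption, not the real engine).

    Every key in the pattern must exist in the event. A list means
    "any of these values"; a nested dict means "recurse into the event".
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


# Trimmed sample event; real events carry more fields (account, region, time).
sample_event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"instance-id": "i-0123456789abcdef0", "state": "stopped"},
}

print(json.dumps(STOPPED_PATTERN, indent=2))
print(matches(STOPPED_PATTERN, sample_event))
```

The point of the sketch is that a pattern lists allowed values per field, and fields the pattern does not mention (such as `instance-id`) are ignored.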

Systems Manager Automation — recovery runbooks #

After detection (EventBridge), the action itself is carried out by Systems Manager Automation. You define a step-by-step procedure called a runbook (an Automation document) and run it.

  • AWS-provided runbooks: AWS-RestartEC2Instance, AWS-StopEC2Instance, and so on, ready to use immediately
  • Custom runbooks: bundle multiple steps (snapshot → recover → verify) into one definition
  • Permissions: the runbook runs under the IAM role passed to it as its assume role; if none is given, it runs with the permissions of whoever started it
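To make the “snapshot → recover → verify” idea concrete, here is a hedged sketch of a custom runbook, written as a Python dict and serialized to JSON. The step names, parameter names, and description are hypothetical; the actions (`aws:executeAwsApi`, `aws:changeInstanceState`, `aws:waitForAwsResourceProperty`) are standard Automation actions, but this definition has not been validated against Systems Manager, so treat it as a starting point rather than a finished document.

```python
import json

# Sketch of an Automation runbook (schema 0.3). Names are illustrative.
RUNBOOK = {
    "schemaVersion": "0.3",
    "description": "Snapshot a volume, restart the instance, verify status checks.",
    "assumeRole": "{{ AutomationAssumeRole }}",
    "parameters": {
        "InstanceId": {"type": "String"},
        "VolumeId": {"type": "String"},
        "AutomationAssumeRole": {"type": "String", "default": ""},
    },
    "mainSteps": [
        {   # Step 1: take a snapshot before touching the instance.
            "name": "snapshot",
            "action": "aws:executeAwsApi",
            "inputs": {"Service": "ec2", "Api": "CreateSnapshot",
                       "VolumeId": "{{ VolumeId }}"},
        },
        {   # Step 2a: stop the instance.
            "name": "stop",
            "action": "aws:changeInstanceState",
            "inputs": {"InstanceIds": ["{{ InstanceId }}"],
                       "DesiredState": "stopped"},
        },
        {   # Step 2b: start it again.
            "name": "start",
            "action": "aws:changeInstanceState",
            "inputs": {"InstanceIds": ["{{ InstanceId }}"],
                       "DesiredState": "running"},
        },
        {   # Step 3: wait until the instance status check reports "ok".
            "name": "verify",
            "action": "aws:waitForAwsResourceProperty",
            "inputs": {
                "Service": "ec2",
                "Api": "DescribeInstanceStatus",
                "InstanceIds": ["{{ InstanceId }}"],
                "PropertySelector": "$.InstanceStatuses[0].InstanceStatus.Status",
                "DesiredValues": ["ok"],
            },
        },
    ],
}

runbook_json = json.dumps(RUNBOOK, indent=2)
print(runbook_json)
```

Steps run in the order they appear in `mainSteps`, which is what lets one definition carry the whole recovery procedure.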

A representative automated recovery flow looks like this.

CloudWatch Alarm (or EventBridge rule)
  → run SSM Automation runbook
  → actions such as restart, replace, or tag the instance
  → notify the result via SNS

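A requirement like “when the disk fills up, automatically run a cleanup script” is implemented as alarm (or EventBridge rule) → SSM Automation runbook. An alarm typically reaches the runbook through an EventBridge rule that matches the alarm’s state-change event, and the runbook can run the script on the instance with an aws:runCommand step.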
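One way to wire up the alarm path is sketched below: an event pattern that matches a CloudWatch alarm entering the ALARM state, and a helper that builds the EventBridge target entry pointing at a runbook. The alarm name `disk-used-high`, the runbook name `Custom-CleanDisk`, the role ARN, and the shape of the `Input` parameters are all assumptions for illustration; the dicts are only built here, not sent to AWS.

```python
import json

# Matches the state-change event of one specific alarm (hypothetical name).
ALARM_PATTERN = {
    "source": ["aws.cloudwatch"],
    "detail-type": ["CloudWatch Alarm State Change"],
    "detail": {
        "alarmName": ["disk-used-high"],
        "state": {"value": ["ALARM"]},
    },
}


def automation_target(region, account_id, runbook, role_arn, parameters):
    """Build a target entry that starts an SSM Automation runbook.

    role_arn is the role EventBridge assumes to start the automation.
    parameters is passed to the runbook as a JSON string (assumed shape:
    parameter name mapped to a list of string values).
    """
    return {
        "Id": "run-" + runbook,
        "Arn": (f"arn:aws:ssm:{region}:{account_id}:"
                f"automation-definition/{runbook}:$DEFAULT"),
        "RoleArn": role_arn,
        "Input": json.dumps(parameters),
    }


target = automation_target(
    region="us-east-1",
    account_id="123456789012",
    runbook="Custom-CleanDisk",  # hypothetical custom runbook
    role_arn="arn:aws:iam::123456789012:role/eventbridge-start-automation",
    parameters={"InstanceId": ["i-0123456789abcdef0"]},
)
print(json.dumps(target, indent=2))
```

With boto3, the pattern would go to `events.put_rule(EventPattern=json.dumps(ALARM_PATTERN), ...)` and the target to `events.put_targets(Rule=..., Targets=[target])`.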

EC2 automated recovery and Auto Scaling self-healing #

Automated recovery also has simpler built-in mechanisms.

| Mechanism | Behavior | When |
| --- | --- | --- |
| EC2 automated recovery | Recovers the same instance on new hardware. Keeps the instance ID and IP addresses | System status check (hardware) failure |
| Auto Scaling health check | Terminates the unhealthy instance and replaces it with a new one | EC2 or ELB health check failure |

The difference between the two is the exam point.

  • Automated recovery keeps the same instance alive. Use it for hardware failure of a single instance.
  • Auto Scaling discards the unhealthy instance and replaces it with a new one. Use it for self-healing of stateless workloads.

If state is held on the instance and replacement is difficult, automated recovery is the answer; if there’s no state and the instance can be replaced at any time, Auto Scaling is. Another regular point: to catch application-level failures too, you must add the ELB health check to the Auto Scaling group, because the default EC2 check alone does not see them.
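The two mechanisms are configured in different places, which the following sketch tries to show. It only builds the keyword arguments for two boto3 calls (`cloudwatch.put_metric_alarm` and `autoscaling.update_auto_scaling_group`) without calling AWS. The alarm name, period, evaluation periods, and grace period are illustrative assumptions, not recommended values.

```python
def recover_alarm_kwargs(instance_id, region):
    """Arguments for cloudwatch.put_metric_alarm: recover the SAME instance
    when the system (hardware) status check fails."""
    return {
        "AlarmName": f"recover-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,                # seconds (illustrative)
        "EvaluationPeriods": 2,      # illustrative
        "Threshold": 1,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # Built-in EC2 action: recover the instance on new hardware.
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:recover"],
    }


def asg_elb_health_check_kwargs(asg_name, grace_seconds=300):
    """Arguments for autoscaling.update_auto_scaling_group: REPLACE instances
    that fail the load balancer health check."""
    return {
        "AutoScalingGroupName": asg_name,
        "HealthCheckType": "ELB",
        "HealthCheckGracePeriod": grace_seconds,
    }


alarm = recover_alarm_kwargs("i-0123456789abcdef0", "us-east-1")
asg = asg_elb_health_check_kwargs("web-asg")  # hypothetical group name
print(alarm["AlarmActions"], asg["HealthCheckType"])

# With boto3 installed and credentials configured, these would be used as:
#   boto3.client("cloudwatch").put_metric_alarm(**alarm)
#   boto3.client("autoscaling").update_auto_scaling_group(**asg)
```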

Performance optimization — the order of diagnosis #

In SOA-C03, performance isn’t a separate domain but is folded into the monitoring domain. The core is the flow of pinpointing the bottleneck with metrics and then choosing the right action.

| Resource | Bottleneck signal (metric) | Representative action |
| --- | --- | --- |
| Compute | Sustained high CPUUtilization | Scale up the instance type, expand the ASG, larger family |
| Memory | (CloudWatch agent) high memory utilization | Memory-optimized family (R series) |
| EBS | High VolumeQueueLength, IOPS limit | Raise gp3 IOPS, switch to io2 |
| RDS | High ReadIOPS, low FreeableMemory | Read Replica, scale up the instance, caching |
| Network | Bandwidth limit | Enhanced networking, larger instance |
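The “metric first, action second” order can be sketched as a small lookup. The thresholds below are hypothetical numbers chosen only for illustration (real baselines depend on the workload), and `mem_used_percent` is assumed to be the memory metric published by the CloudWatch agent.

```python
# Hypothetical thresholds for illustration; tune them to your own baselines.
THRESHOLDS = {
    "CPUUtilization": 80.0,      # percent, AWS/EC2 namespace
    "mem_used_percent": 85.0,    # percent, published by the CloudWatch agent
    "VolumeQueueLength": 4.0,    # pending I/O operations, AWS/EBS namespace
}

# Representative actions, taken from the table above.
ACTIONS = {
    "CPUUtilization": "compute: scale up the instance type or expand the ASG",
    "mem_used_percent": "memory: move to a memory-optimized family (R series)",
    "VolumeQueueLength": "EBS: raise gp3 IOPS or switch to io2",
}


def diagnose(averages):
    """Return the action for every metric whose sustained average
    exceeds its threshold. Missing metrics are treated as healthy."""
    return [
        ACTIONS[name]
        for name, limit in THRESHOLDS.items()
        if averages.get(name, 0.0) > limit
    ]


# Low CPU but a deep I/O queue: a bigger instance type would not help.
print(diagnose({"CPUUtilization": 35.0, "VolumeQueueLength": 12.0}))
```

In this example only the EBS action is returned, which is the case where scaling up the instance type would miss the actual bottleneck.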

EBS performance points #

EBS is an exam regular. On gp2, baseline IOPS is tied to volume size (3 IOPS per GiB), so the only way to buy more performance is to grow the volume. gp3 lets you raise IOPS and throughput separately, independent of capacity, so it’s the first answer when gp2 falls short on performance. If you need very high, consistent IOPS, the answer is io2 or io2 Block Express. The answer to “the disk is slow but capacity is plenty” is usually to switch to gp3 and raise IOPS.
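The arithmetic behind “slow but capacity is plenty” can be worked through in a few lines. The sketch assumes the published gp2 baseline rule (3 IOPS per GiB, with a floor of 100 and a cap of 16,000) and the gp3 baseline of 3,000 IOPS; the volume ID and the target IOPS and throughput values in the `modify_volume` arguments are made up for the example.

```python
GP3_BASELINE_IOPS = 3000  # gp3 baseline, independent of volume size


def gp2_baseline_iops(size_gib):
    """gp2 baseline: 3 IOPS per GiB, minimum 100, maximum 16,000."""
    return min(16000, max(100, 3 * size_gib))


def gp3_modify_kwargs(volume_id, iops, throughput_mibps):
    """Arguments for ec2.modify_volume to convert a volume to gp3 and
    set IOPS and throughput without changing its size."""
    return {
        "VolumeId": volume_id,
        "VolumeType": "gp3",
        "Iops": iops,
        "Throughput": throughput_mibps,
    }


size = 200  # GiB: plenty of capacity, but only 600 baseline IOPS on gp2
print(gp2_baseline_iops(size), "IOPS on gp2 vs", GP3_BASELINE_IOPS, "on gp3")

# Hypothetical volume ID and targets; with boto3 this would be
#   boto3.client("ec2").modify_volume(**kwargs)
kwargs = gp3_modify_kwargs("vol-0123456789abcdef0", iops=6000, throughput_mibps=250)
print(kwargs)
```

A 200 GiB gp2 volume would have to grow to 1,000 GiB just to reach the 3,000 IOPS that gp3 provides as its baseline at any size.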

Compute Optimizer and the balance of cost and performance #

Compute Optimizer analyzes past metrics and recommends whether a resource is over- or under-provisioned. Its targets include EC2 instances, Auto Scaling groups, EBS volumes, and Lambda functions.

  • Over-provisioned — move a low-utilization instance to a smaller type → cost savings
  • Under-provisioned — move an instance at its utilization limit to a larger type → performance improvement

Performance and cost are two sides of the same diagnosis. As the answer to “reduce cost but keep performance,” rightsizing based on Compute Optimizer recommendations comes up often. It’s also worth remembering that to get memory recommendations, you need CloudWatch Agent memory metrics.
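The reason memory metrics matter can be seen in a toy classifier. This is not Compute Optimizer's algorithm; the thresholds and the function are hypothetical and only illustrate how a verdict based on CPU alone can flip once memory utilization is visible.

```python
def classify(cpu_peak, mem_peak=None, low=40.0, high=90.0):
    """Toy rightsizing verdict from peak utilization percentages.

    mem_peak is None when no memory metric is available (for example,
    when the CloudWatch agent is not installed). Thresholds are
    hypothetical, not the ones Compute Optimizer uses.
    """
    peaks = [cpu_peak] if mem_peak is None else [cpu_peak, mem_peak]
    if max(peaks) > high:
        return "under-provisioned"
    if max(peaks) < low:
        return "over-provisioned"
    return "optimized"


# CPU alone suggests downsizing ...
print(classify(cpu_peak=20.0))
# ... but with memory visible, the same instance is at its limit.
print(classify(cpu_peak=20.0, mem_peak=95.0))
```

Downsizing the instance in the second case would save cost and break performance, which is exactly the trade-off the exam questions probe.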

Exam question patterns #

  • Automatically react to a state change → EventBridge rule
  • Automate multi-step recovery after detection → SSM Automation runbook
  • Automatically recover from single-instance hardware failure → EC2 automated recovery (same instance)
  • Self-heal a stateless workload → Auto Scaling + ELB health check (replacement)
  • The disk is slow but capacity is plenty → switch to gp3 and raise IOPS
  • Reduce cost but keep performance → Compute Optimizer rightsizing

Common Pitfalls #

1) Treating automated recovery and Auto Scaling as the same #

Automated recovery keeps the same instance alive; Auto Scaling discards and replaces it. The split is whether state is held.

2) Leaving the Auto Scaling health check on EC2 only #

The default EC2 health check only looks at the instance’s own status, that is, whether the instance is up. To catch application failures, you must also turn on the ELB health check.

3) Always scaling up the instance for a performance problem #

If the bottleneck is EBS IOPS, scaling up the instance type misses the mark. You must pinpoint the bottleneck resource with metrics first.

4) Trying to run Lambda periodically without EventBridge #

Lambda has no built-in scheduler. Periodic execution is triggered from outside, typically by an EventBridge schedule rule (or the newer EventBridge Scheduler).
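A scheduled Lambda needs three pieces: the rule, the target, and permission for EventBridge to invoke the function. The sketch below builds the arguments for the corresponding boto3 calls (`events.put_rule`, `events.put_targets`, `lambda.add_permission`) without calling AWS. The rule name, function ARN, rule ARN, and target ID are placeholders.

```python
def schedule_rule_requests(rule_name, function_arn, rule_arn,
                           expression="rate(5 minutes)"):
    """Return the keyword arguments for the three calls that make a
    Lambda function run on a schedule."""
    put_rule = {
        "Name": rule_name,
        "ScheduleExpression": expression,
        "State": "ENABLED",
    }
    put_targets = {
        "Rule": rule_name,
        "Targets": [{"Id": "lambda-target", "Arn": function_arn}],
    }
    # Without this resource-based permission, EventBridge cannot invoke
    # the function even though the rule fires.
    add_permission = {
        "FunctionName": function_arn,
        "StatementId": f"{rule_name}-invoke",
        "Action": "lambda:InvokeFunction",
        "Principal": "events.amazonaws.com",
        "SourceArn": rule_arn,
    }
    return put_rule, put_targets, add_permission


# Placeholder names and ARNs for illustration only.
rule, targets, permission = schedule_rule_requests(
    rule_name="nightly-cleanup",
    function_arn="arn:aws:lambda:us-east-1:123456789012:function:cleanup",
    rule_arn="arn:aws:events:us-east-1:123456789012:rule/nightly-cleanup",
    expression="cron(0 3 * * ? *)",  # every day at 03:00 UTC
)
print(rule)
print(targets)
print(permission)
```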

Summary #

What we covered in this post:

  • EventBridge is the start of automated recovery. Two triggers: event pattern (state change) and schedule (cron)
  • Automate multi-step action after detection with SSM Automation runbooks. Alarms and EventBridge are the triggers
  • EC2 automated recovery keeps the same instance alive (hardware failure); Auto Scaling discards and replaces (stateless)
  • Set the Auto Scaling health check to the ELB health check to self-heal even application failures
  • For performance, pinpoint the bottleneck resource with metrics, then act. For EBS, raising gp3 IOPS is the regular
  • Diagnose over- and under-provisioning with Compute Optimizer. Rightsize cost and performance together

Next: Domain 2-1 Multi-AZ and Auto Scaling #

We’ve finished the monitoring domain. Next is the second domain, reliability and business continuity.

In #5 Domain 2-1 Reliability — Multi-AZ, Auto Scaling, ELB Health Checks, I’ll tie together redundant configurations spanning Availability Zones, the capacity, policies, and lifecycle of Auto Scaling groups, ELB health checks and connection draining by type, and Route 53 failover.
