AWS Certified CloudOps Engineer - Associate (SOA-C03) #5 Domain 2-1 Reliability: Multi-AZ, Auto Scaling, and ELB Health Checks

Saturday, May 30, 2026

We finished the monitoring domain through #4. Starting with this post, we move on to the second domain, Reliability and Business Continuity (22%). The first axis of reliability is availability: operating the system so that “the service keeps running even when failures occur.” Multi-AZ, Auto Scaling, and ELB, which you studied as design concepts for SAA, are revisited here from an operational angle: how they behave in production and what to fix when something goes wrong.

The basics of availability: Availability Zone redundancy #

A Region is made up of multiple physically separated Availability Zones (AZs). For the service to survive even if an entire AZ fails, you must place resources across multiple AZs.

Configuration	Single-AZ	Multi-AZ
Impact of AZ failure	Service outage	Continues on another AZ
Typical use	Dev/temporary	Production workloads

The key point is that redundancy almost always starts at the AZ level. Multi-Region, which guards against the failure of an entire Region, costs considerably more, so the default answer on the associate exam is usually Multi-AZ.
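
To make the table concrete, here is a minimal sketch that compares how much capacity survives an AZ failure. The AZ names (`az-a`, `az-b`, `az-c`) and the round-robin placement are illustrative assumptions, not how AWS actually places instances.

```python
from collections import Counter
from itertools import cycle


def spread_across_azs(instance_count: int, azs: list[str]) -> Counter:
    """Assign instances to AZs round-robin, as an evenly balanced group would look."""
    placement = Counter({az: 0 for az in azs})
    for _, az in zip(range(instance_count), cycle(azs)):
        placement[az] += 1
    return placement


def surviving_capacity(placement: Counter, failed_az: str) -> int:
    """Count the instances still running after one AZ is lost."""
    return sum(count for az, count in placement.items() if az != failed_az)


# Hypothetical AZ names, used only for illustration.
single_az = spread_across_azs(6, ["az-a"])
multi_az = spread_across_azs(6, ["az-a", "az-b", "az-c"])

print(surviving_capacity(single_az, "az-a"))  # 0: full outage
print(surviving_capacity(multi_az, "az-a"))   # 4: service continues on two AZs
```

With three AZs, losing one still leaves two thirds of the fleet, which is why Multi-AZ is the baseline for production.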

Auto Scaling group (ASG) operations #

An ASG is a mechanism that maintains a set capacity and adjusts the instance count according to load. Three capacity values are central to operations.

Value	Meaning
Minimum	The minimum instance count always maintained
Desired	The current target count. Policies adjust this value
Maximum	The upper limit you can scale up to

Scaling policies #

Policy	Behavior	When to use
Target Tracking	Automatically adjusts to keep a target metric value (e.g., CPU 50%)	Most recommended. Simple and stable
Step Scaling	Adjusts by set amounts per threshold band	When fine-grained control is needed
Scheduled	Changes capacity at set times	When traffic patterns are predictable

A predictable pattern like “traffic surges every morning” calls for Scheduled Scaling, while “adjust automatically based on load” calls for Target Tracking.
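
The idea behind Target Tracking can be sketched as a proportional calculation clamped to the group's bounds. This is a simplified model written for this post, not the exact algorithm AWS uses (the real service also factors in warm-up and scales in conservatively); the function name and numbers are assumptions.

```python
import math


def target_tracking_desired(current_capacity: int, metric_value: float,
                            target_value: float, min_size: int, max_size: int) -> int:
    """Estimate a new Desired capacity for a per-instance metric such as average CPU.

    Simplified proportional model: if the metric is above target, add capacity in
    proportion; if below, remove it. The result is clamped to [min_size, max_size].
    """
    raw = math.ceil(current_capacity * metric_value / target_value)
    return max(min_size, min(max_size, raw))


# 4 instances at 80% CPU with a 50% target: scale out.
print(target_tracking_desired(4, 80, 50, min_size=2, max_size=10))   # 7
# 4 instances at 20% CPU: scale in, but never below Minimum.
print(target_tracking_desired(4, 20, 50, min_size=2, max_size=10))   # 2
# Extreme load: capped at Maximum.
print(target_tracking_desired(4, 200, 50, min_size=2, max_size=10))  # 10
```

The clamping step is the part worth remembering for the exam: no policy can push Desired outside Minimum and Maximum.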

Lifecycle hooks #

A lifecycle hook pauses an instance as it enters or leaves the ASG so that you can run a specific task before the transition completes.

Launch hook: ensures configuration and registration complete before the instance enters service
Terminate hook: handles wrap-up such as log collection and connection cleanup before the instance is terminated

The answer to a requirement like “don’t lose the logs of a terminating instance” is a terminate lifecycle hook. It holds the instance in a wait state, finishes the wrap-up work, and then releases it.
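
As a sketch of what this looks like in practice, the helpers below build the keyword arguments for the Auto Scaling `PutLifecycleHook` and `CompleteLifecycleAction` APIs. The group name, hook name, instance ID, and 300-second timeout are hypothetical values, and the boto3 calls are left commented out so the snippet runs without AWS credentials.

```python
def terminate_hook_params(asg_name: str, hook_name: str, timeout_seconds: int = 300) -> dict:
    """Keyword arguments for PutLifecycleHook on the terminating transition."""
    return {
        "AutoScalingGroupName": asg_name,
        "LifecycleHookName": hook_name,
        "LifecycleTransition": "autoscaling:EC2_INSTANCE_TERMINATING",
        "HeartbeatTimeout": timeout_seconds,  # how long the instance may stay in the wait state
        "DefaultResult": "CONTINUE",          # what happens if nobody completes the action in time
    }


def release_params(asg_name: str, hook_name: str, instance_id: str) -> dict:
    """Keyword arguments for CompleteLifecycleAction once the wrap-up work is done."""
    return {
        "AutoScalingGroupName": asg_name,
        "LifecycleHookName": hook_name,
        "InstanceId": instance_id,
        "LifecycleActionResult": "CONTINUE",
    }


# Hypothetical names, for illustration only.
hook = terminate_hook_params("web-asg", "ship-logs-before-terminate")
release = release_params("web-asg", "ship-logs-before-terminate", "i-0123456789abcdef0")

# With credentials configured, the calls would look like:
# import boto3
# autoscaling = boto3.client("autoscaling")
# autoscaling.put_lifecycle_hook(**hook)
# ... copy logs off the instance ...
# autoscaling.complete_lifecycle_action(**release)
print(hook["LifecycleTransition"])
```

If the wrap-up work never signals completion, the instance leaves the wait state when the heartbeat timeout expires, so size the timeout to the slowest realistic log upload.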

ELB: types and health checks #

ELB distributes traffic only to healthy targets. Distinguishing the types is an exam point.

Type	Layer	Typical use
ALB	L7 (HTTP/HTTPS)	Path/host-based routing, web applications
NLB	L4 (TCP/UDP)	Ultra-high performance, static IP, low latency
GWLB	L3	Inline security appliances such as firewalls

Health checks #

ELB sends periodic health check requests to the targets in a target group to determine whether they are healthy.

Configure the healthy threshold, unhealthy threshold, interval, timeout, and path
No traffic is sent to targets judged unhealthy
ALB judges by HTTP path (e.g., /health) and status code; NLB uses a TCP connection check by default and can also be configured for HTTP or HTTPS health checks

A scenario like “some instances aren’t responding but ELB keeps sending them traffic” is usually a health check path or threshold configuration problem.
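
To see how the thresholds interact, here is a small simulation of a target's state as check results arrive. It is a simplified model written for this post: the state flips only after enough consecutive results disagree with the current state. The 30-second interval in the last lines is an assumed value.

```python
def track_health(results: list[bool], healthy_threshold: int,
                 unhealthy_threshold: int, initially_healthy: bool = True) -> list[bool]:
    """Return the target's state after each health check result.

    A healthy target turns unhealthy after `unhealthy_threshold` consecutive failures;
    an unhealthy target turns healthy after `healthy_threshold` consecutive successes.
    """
    healthy = initially_healthy
    streak = 0  # consecutive results that disagree with the current state
    states = []
    for passed in results:
        if passed == healthy:
            streak = 0
        else:
            streak += 1
            needed = unhealthy_threshold if healthy else healthy_threshold
            if streak >= needed:
                healthy = passed
                streak = 0
        states.append(healthy)
    return states


checks = [True, False, False, False, True, True]
print(track_health(checks, healthy_threshold=2, unhealthy_threshold=3))
# [True, True, True, False, False, True]

# Rough detection time: a failing target keeps receiving traffic for about this long.
interval_seconds = 30  # assumed interval
print(3 * interval_seconds)  # 90
```

A high unhealthy threshold combined with a long interval is one way a failing instance can keep receiving traffic for minutes.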

Connection draining (Deregistration Delay) #

When a target is deregistered from a target group or its instance is terminated, the deregistration delay is how long the load balancer waits for in-flight requests to finish before it stops using the target. If this value is 0, or shorter than your longest requests, user requests are cut off during deployments and scale-downs. The answer to “some requests fail on every deployment” is the connection draining (deregistration delay) setting.
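
The effect of the delay can be sketched with a toy model: given how many seconds each in-flight request still needs, count the ones that would be cut off. The request durations are made-up numbers; the attribute key in the last lines is the target group attribute that controls the delay, shown as the payload you would pass to `modify_target_group_attributes`.

```python
def requests_cut_off(remaining_seconds: list[float], deregistration_delay: float) -> int:
    """Count in-flight requests still running when the deregistration delay expires."""
    return sum(1 for remaining in remaining_seconds if remaining > deregistration_delay)


# Hypothetical in-flight requests and the seconds each still needs to finish.
in_flight = [0.2, 1.5, 4.0, 12.0]

print(requests_cut_off(in_flight, 0))   # 4: every in-flight request fails
print(requests_cut_off(in_flight, 5))   # 1: only the 12-second request fails
print(requests_cut_off(in_flight, 30))  # 0: all requests finish

# Attribute payload for the target group (the value is passed as a string).
draining_attributes = [{"Key": "deregistration_delay.timeout_seconds", "Value": "30"}]
```

A longer delay protects slow requests but also makes each deployment and scale-down wait longer, so match it to your slowest realistic request rather than maximizing it.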

Route 53 failover #

Endpoint-level availability beyond AZs is configured with Route 53.

Routing policy	Behavior
Failover	If the primary is unhealthy, switch to the secondary
Weighted	Distribute by weight ratio (gradual rollout, A/B)
Latency	Route to the Region with the lowest latency
Multivalue	Return multiple healthy endpoints (simple distribution)

The premise of failover is a Route 53 health check. It periodically checks the endpoint and, if unhealthy, removes it from the DNS response and hands over to the secondary. The answer to “automatically switch to another Region during a Region failure” is Route 53 Failover routing + health checks. It’s worth remembering, too, that the switch is not instantaneous because of the DNS cache (TTL).
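
As a rough way to reason about how long a switch takes, the sketch below adds the time the health check needs to declare the primary unhealthy to the time cached DNS answers may live on. It is a back-of-the-envelope upper bound under assumed settings, not a guarantee; clients or resolvers that ignore the TTL can take longer.

```python
def worst_case_failover_seconds(request_interval: int, failure_threshold: int, ttl: int) -> int:
    """Rough upper bound on failover time.

    detection: consecutive failed checks needed before the endpoint counts as unhealthy
    ttl:       how long a resolver may keep serving the cached primary record
    """
    detection = request_interval * failure_threshold
    return detection + ttl


# Assumed settings: 30-second checks, 3 failures to trip, 300-second record TTL.
print(worst_case_failover_seconds(30, 3, 300))  # 390
# Lowering the TTL to 60 seconds shortens the switch considerably.
print(worst_case_failover_seconds(30, 3, 60))   # 150
# A faster check interval helps as well.
print(worst_case_failover_seconds(10, 3, 60))   # 90
```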

Exam question patterns #

Keep running even through an AZ failure → redundancy across multiple AZs
Predictable traffic surge → Scheduled Scaling
Adjust automatically with load → Target Tracking
Preserve the logs of a terminating instance → terminate lifecycle hook
Traffic keeps going to an unhealthy instance → check health check settings
Requests fail during deployment → connection draining (deregistration delay)
Automatic switch during a Region failure → Route 53 Failover + health checks

Common Pitfalls #

1) Assuming a change to Desired alone is permanent #

Scaling policies readjust Desired whenever they act, so a manual change lasts only until the next scaling activity. To pin the capacity, set Min and Max to the same value or adjust the policy.
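
A toy model makes the pitfall visible. The `Group` class below is a hypothetical stand-in for an ASG, not an AWS API: a manual change is accepted, and then the next policy action overwrites it unless Min and Max leave the policy no room.

```python
from dataclasses import dataclass


@dataclass
class Group:
    """Hypothetical stand-in for an ASG's three capacity values."""
    min_size: int
    max_size: int
    desired: int

    def set_desired(self, value: int) -> None:
        """Manual change: accepted only inside [min_size, max_size]."""
        if not self.min_size <= value <= self.max_size:
            raise ValueError("desired must be between min_size and max_size")
        self.desired = value

    def apply_policy(self, wanted: int) -> None:
        """A scaling policy picks its own value, clamped to the bounds."""
        self.desired = max(self.min_size, min(self.max_size, wanted))


group = Group(min_size=2, max_size=10, desired=4)
group.set_desired(8)   # the operator bumps capacity by hand
group.apply_policy(3)  # load is low, so the policy scales back in
print(group.desired)   # 3: the manual change did not stick

pinned = Group(min_size=6, max_size=6, desired=6)
pinned.apply_policy(3)
print(pinned.desired)  # 6: Min == Max leaves the policy no room
```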

2) Leaving the ASG health check as EC2 only #

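A repeat of the pitfall seen in #4. By default an ASG judges health only from EC2 status checks, so an instance whose application has hung still looks healthy. To catch application failures too, you must enable the ELB health check on the ASG.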
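
The difference between the two health check types can be sketched as a small decision function, followed by the keyword arguments you would pass to `UpdateAutoScalingGroup` to switch types. The decision function is a simplified model written for this post; the group name and grace period are assumed values.

```python
def should_replace(ec2_status_ok: bool, elb_target_healthy: bool, health_check_type: str) -> bool:
    """Decide whether the ASG would replace an instance (simplified model).

    EC2 status checks always count; the ELB result counts only when the
    group's health check type is "ELB".
    """
    if not ec2_status_ok:
        return True
    return health_check_type == "ELB" and not elb_target_healthy


# The instance is running, but the application has stopped answering.
print(should_replace(True, False, "EC2"))  # False: the broken instance stays in service
print(should_replace(True, False, "ELB"))  # True: the ASG replaces it


def elb_health_check_params(asg_name: str, grace_period_seconds: int = 300) -> dict:
    """Keyword arguments for UpdateAutoScalingGroup to enable ELB health checks."""
    return {
        "AutoScalingGroupName": asg_name,
        "HealthCheckType": "ELB",
        "HealthCheckGracePeriod": grace_period_seconds,  # time allowed for boot before checks count
    }


params = elb_health_check_params("web-asg")  # hypothetical group name
# boto3.client("autoscaling").update_auto_scaling_group(**params)
```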

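3) Relying on NLB's default TCP health check for application health #

NLB forwards traffic at L4, and its default TCP health check only confirms that the port accepts connections. To judge targets by application response, configure an HTTP or HTTPS health check on the target group; to route by path or host, ALB is the right choice.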

4) Assuming Route 53 failover is instantaneous #

The switch takes time because of the DNS TTL plus the health check interval and failure threshold. Lower the TTL to make the switch faster.

Summary #

What we covered in this post:

Redundancy starts at the AZ level. The default availability answer at the associate level is Multi-AZ
An ASG maintains and adjusts capacity with Min, Desired, and Max. Policies are Target Tracking (recommended), Step, and Scheduled
Lifecycle hooks insert tasks on launch and terminate. Use the terminate hook to preserve logs
ELB comes as ALB (L7), NLB (L4), and GWLB (L3). Health checks ensure traffic is distributed only to healthy targets
Connection draining protects in-flight requests during deployments and scale-downs
Route 53 Failover + health checks handle endpoint- and Region-level switching. Tune switch speed with DNS TTL

Next: Domain 2-2 Backup, Restore, and Disaster Recovery #

Now that availability has taken care of “keeping it running,” next comes “recovering without losing data.”

In #6 Domain 2-2 Reliability: Backup, Restore, and Disaster Recovery (DR), I’ll cover EBS snapshots and AMIs, RDS backups and snapshots, how to centrally manage backups with AWS Backup, the meaning of RPO and RTO, and the DR strategies that progress through backup, pilot light, warm standby, and multi-site.