AWS Certified CloudOps Engineer - Associate (SOA-C03) #5 Domain 2-1 Reliability: Multi-AZ, Auto Scaling, and ELB Health Checks
We wrapped up the monitoring domain in #4. Starting with this post, we move on to the second domain, Reliability and Business Continuity (22%). The first pillar of reliability is availability: operating the service so that it keeps running even when failures occur. Multi-AZ, Auto Scaling, and ELB, which you studied as design concepts for SAA, are revisited here from an operational angle: how they behave in production and what to fix when something goes wrong.
The basics of availability: Availability Zone redundancy #
A Region is made up of multiple physically separated Availability Zones (AZs). For the service to survive even if an entire AZ fails, you must place resources across multiple AZs.
| Configuration | Single-AZ | Multi-AZ |
|---|---|---|
| Impact of AZ failure | Service outage | Continues on another AZ |
| Typical use | Dev/temporary | Production workloads |
The key point is that redundancy almost always starts at the AZ level. Multi-Region, which guards against an entire Region failure, is costly, and the default answer on the associate exam is usually Multi-AZ.
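As a rough illustration of the Multi-AZ rule, the sketch below checks whether a set of subnets covers at least two AZs. The subnet IDs, the AZ names, and the `spans_multiple_azs` helper are hypothetical and not part of any AWS SDK.

```python
def spans_multiple_azs(subnet_to_az: dict[str, str], minimum_azs: int = 2) -> bool:
    """Return True if the subnets cover at least `minimum_azs` distinct AZs."""
    return len(set(subnet_to_az.values())) >= minimum_azs


# Hypothetical subnet-to-AZ mappings, for illustration only.
single_az = {"subnet-a1": "us-east-1a", "subnet-a2": "us-east-1a"}
multi_az = {"subnet-a1": "us-east-1a", "subnet-b1": "us-east-1b"}

print(spans_multiple_azs(single_az))  # False: two subnets, but one AZ
print(spans_multiple_azs(multi_az))   # True: two distinct AZs
```

Note that two subnets in the same AZ still count as Single-AZ: what matters is the number of distinct AZs, not the number of subnets.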
Auto Scaling group (ASG) operations #
An ASG is a mechanism that maintains a set capacity and adjusts the instance count according to load. Three capacity values are central to operations.
| Value | Meaning |
|---|---|
| Minimum | The minimum instance count always maintained |
| Desired | The current target count. Policies adjust this value |
| Maximum | The upper limit you can scale up to |
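The relationship between the three values can be sketched as a simple clamp: whatever capacity a policy asks for, the result stays between Minimum and Maximum. The `effective_desired` function is a hypothetical helper for intuition, not an AWS API.

```python
def effective_desired(requested: int, minimum: int, maximum: int) -> int:
    """Clamp a policy-requested capacity into the [minimum, maximum] range."""
    if minimum > maximum:
        raise ValueError("minimum must not exceed maximum")
    return max(minimum, min(requested, maximum))


print(effective_desired(10, minimum=2, maximum=6))  # 6: capped at Maximum
print(effective_desired(1, minimum=2, maximum=6))   # 2: raised to Minimum
print(effective_desired(9, minimum=3, maximum=3))   # 3: Min == Max pins capacity
```

The last call shows why setting Minimum and Maximum to the same value fixes the instance count regardless of what a policy requests.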
Scaling policies #
| Policy | Behavior | When |
|---|---|---|
| Target Tracking | Automatically adjusts to keep a target metric value (e.g., CPU 50%) | Most recommended. Simple and stable |
| Step Scaling | Adjusts by set amounts per threshold band | When fine-grained control is needed |
| Scheduled | Changes capacity at set times | When traffic patterns are predictable |
A predictable pattern like “traffic surges every morning” calls for Scheduled Scaling, while “adjust automatically based on load” calls for Target Tracking.
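To make the difference between Target Tracking and Step Scaling concrete, here is a minimal sketch. The proportional estimate is an assumption for intuition only (it treats the metric as inversely proportional to instance count), not the algorithm AWS actually runs, and the step bands are hypothetical values.

```python
import math


def target_tracking_estimate(current_capacity: int, metric_value: float, target_value: float) -> int:
    """Estimate the capacity that would bring an average metric back to target.

    Simplification: assumes the metric scales inversely with instance count.
    """
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    return max(1, math.ceil(current_capacity * metric_value / target_value))


def step_adjustment(metric_value: float, steps: list[tuple[float, int]]) -> int:
    """Return the capacity change for the highest band the metric has reached.

    `steps` holds (lower_bound, adjustment) pairs sorted by ascending lower_bound.
    """
    change = 0
    for lower_bound, adjustment in steps:
        if metric_value >= lower_bound:
            change = adjustment
    return change


# 4 instances at 75% CPU with a 50% target: roughly 6 instances are needed.
print(target_tracking_estimate(4, 75.0, 50.0))  # 6

# Hypothetical step bands: +1 at 60% CPU, +2 at 75%, +4 at 90%.
bands = [(60.0, 1), (75.0, 2), (90.0, 4)]
print(step_adjustment(80.0, bands))  # 2
```

With Target Tracking you only state the target and the service works out the capacity; with Step Scaling you define each band and its adjustment yourself, which is where the finer control comes from.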
Lifecycle hooks #
Lifecycle hooks let you run a custom task at the moment an instance enters or leaves the ASG.
- Launch hook: ensures configuration and registration complete before the instance enters service
- Terminate hook: handles wrap-up such as log collection and connection cleanup before the instance is terminated
The answer to a requirement like “don’t lose the logs of a terminating instance” is a terminate lifecycle hook. It holds the instance in a wait state, finishes the wrap-up work, and then releases it.
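The wait-then-release flow of a terminate hook can be sketched as a small state walk. The state names follow the ASG instance lifecycle; the wrap-up tasks (`upload_logs`, `close_connections`) and `run_terminate_hook` are hypothetical stand-ins, and the sketch ignores the heartbeat timeout that ends the wait state in a real ASG.

```python
# States an instance passes through when a terminate lifecycle hook is attached.
TERMINATE_FLOW = [
    "InService",
    "Terminating",
    "Terminating:Wait",     # the hook holds the instance here
    "Terminating:Proceed",
    "Terminated",
]


def upload_logs():
    """Hypothetical wrap-up task; a real one might copy logs to durable storage."""
    return "upload_logs"


def close_connections():
    """Hypothetical wrap-up task; a real one might close open connections."""
    return "close_connections"


def run_terminate_hook(wrap_up_tasks):
    """Walk through termination, running wrap-up tasks during the wait state.

    Returns the ordered timeline of states and completed tasks.
    """
    timeline = []
    for state in TERMINATE_FLOW:
        timeline.append(state)
        if state == "Terminating:Wait":
            for task in wrap_up_tasks:
                timeline.append("done: " + task())
    return timeline


for entry in run_terminate_hook([upload_logs, close_connections]):
    print(entry)
```

The point to take away is the ordering: every wrap-up task finishes between `Terminating:Wait` and `Terminating:Proceed`, before the instance is actually gone.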
ELB: types and health checks #
ELB distributes traffic only to healthy targets. Telling the three types apart is a frequent exam point.
| Type | Layer | Typical use |
|---|---|---|
| ALB | L7 (HTTP/HTTPS) | Path/host-based routing, web applications |
| NLB | L4 (TCP/UDP) | Ultra-high performance, static IP, low latency |
| GWLB | L3 | Inline security appliances such as firewalls |
Health checks #
ELB sends periodic health check requests to the instances in a target group to determine whether they are healthy.
- Configure the healthy threshold, unhealthy threshold, interval, timeout, and path
- No traffic is sent to targets judged unhealthy
- ALB judges by HTTP path (e.g., /health) and status code; NLB judges by TCP connection by default
A scenario like “some instances aren’t responding but ELB keeps sending them traffic” is usually a health check path or threshold configuration problem.
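How the thresholds interact can be sketched as a small state tracker: a target changes state only after enough consecutive results contradict its current state. This is a simplified model with hypothetical function names and example values, and the detection estimate ignores the timeout setting.

```python
def track_health(results, healthy_threshold=3, unhealthy_threshold=2, initially_healthy=True):
    """Replay health check results and return the target's state after each check.

    `results` is an iterable of booleans (True = check passed). The state flips
    only after the configured number of consecutive passes or failures.
    """
    healthy = initially_healthy
    streak = 0  # consecutive results that contradict the current state
    states = []
    for passed in results:
        if passed == healthy:
            streak = 0
        else:
            streak += 1
            needed = unhealthy_threshold if healthy else healthy_threshold
            if streak >= needed:
                healthy = not healthy
                streak = 0
        states.append("healthy" if healthy else "unhealthy")
    return states


def approx_detection_seconds(interval_seconds, unhealthy_threshold):
    """Rough time a failed target keeps receiving traffic (timeout not modeled)."""
    return interval_seconds * unhealthy_threshold


checks = [True, False, False, True, True, True]
print(track_health(checks))
print(approx_detection_seconds(interval_seconds=30, unhealthy_threshold=5))  # 150
```

A long interval combined with a high unhealthy threshold means a broken target can keep receiving traffic for minutes, which is one way the scenario above arises. A health check path that returns success even when the application is broken is the other.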
Connection draining (Deregistration Delay) #
This is how long the load balancer waits for in-flight requests to finish after a target is deregistered from a target group or its instance is terminated. If this value is 0, user requests are cut off during deployments and scale-downs. The answer to “some requests fail on every deployment” is the connection draining (deregistration delay) setting.
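The effect of the delay value can be sketched with a toy model: each in-flight request needs some remaining time, and it completes only if that time fits within the delay. The `drain` helper and the request durations are hypothetical.

```python
def drain(in_flight_seconds, deregistration_delay):
    """Split in-flight requests into those that finish and those that are cut off.

    `in_flight_seconds` lists how many more seconds each in-flight request needs.
    A request completes only if it finishes within the deregistration delay.
    """
    completed = [s for s in in_flight_seconds if s <= deregistration_delay]
    dropped = [s for s in in_flight_seconds if s > deregistration_delay]
    return completed, dropped


requests = [0.2, 1.5, 12.0]  # hypothetical remaining times in seconds

print(drain(requests, deregistration_delay=0))   # everything is cut off
print(drain(requests, deregistration_delay=5))   # the 12-second request is cut off
print(drain(requests, deregistration_delay=30))  # everything completes
```

The practical takeaway is to set the delay at least as long as your slowest normal request, while keeping it short enough that deployments and scale-downs do not stall.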
Route 53 failover #
Endpoint-level availability beyond AZs is configured with Route 53.
| Routing policy | Behavior |
|---|---|
| Failover | If the primary is unhealthy, switch to the secondary |
| Weighted | Distribute by weight ratio (gradual rollout, A/B) |
| Latency | Route to the Region with the lowest latency |
| Multivalue | Return multiple healthy endpoints (simple distribution) |
Failover routing depends on a Route 53 health check. The health check periodically probes the endpoint and, if the endpoint is unhealthy, Route 53 removes it from the DNS response and hands over to the secondary. The answer to “automatically switch to another Region during a Region failure” is Route 53 Failover routing + health checks. Remember, too, that the switch is not instantaneous: resolvers and clients keep using the cached answer until the DNS TTL expires.
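The failover decision and the reason the switch takes time can be sketched as follows. Both functions are hypothetical helpers, the endpoint names and numbers are example values, and the estimate is a rough upper bound (detection time plus cache lifetime), not a guarantee.

```python
def resolve_failover(primary, secondary, primary_healthy):
    """Return the endpoint a failover record pair would answer with.

    Only the primary's health is modeled; the secondary is assumed healthy.
    """
    return primary if primary_healthy else secondary


def approx_switch_seconds(check_interval, failure_threshold, ttl):
    """Rough time until clients reach the secondary after the primary fails.

    Detection time (interval x threshold) plus the time cached answers may live.
    """
    return check_interval * failure_threshold + ttl


# Hypothetical endpoint names.
print(resolve_failover("app.us-east-1.example.com", "app.us-west-2.example.com", primary_healthy=False))

# Same health check settings, two different TTLs (example values).
print(approx_switch_seconds(check_interval=30, failure_threshold=3, ttl=300))  # 390
print(approx_switch_seconds(check_interval=30, failure_threshold=3, ttl=60))   # 150
```

Lowering the TTL shortens only the cache portion of the delay; the detection portion is governed by the health check interval and failure threshold.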
Exam question patterns #
- Keep running even through an AZ failure → redundancy across multiple AZs
- Predictable traffic surge → Scheduled Scaling
- Adjust automatically with load → Target Tracking
- Preserve the logs of a terminating instance → terminate lifecycle hook
- Traffic keeps going to an unhealthy instance → check health check settings
- Requests fail during deployment → connection draining (deregistration delay)
- Automatic switch during a Region failure → Route 53 Failover + health checks
Common Pitfalls #
1) Thinking changing only Desired is permanent #
A manual change to Desired lasts only until the next scaling activity, because scaling policies readjust it. To pin the capacity, set Min and Max to the same value, or change the policy itself.
2) Leaving the ASG health check as EC2 only #
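This repeats the pitfall from #4. With the EC2 health check alone, the ASG notices only instance-level failures, so an instance whose application has hung stays in service. To catch application failures too, you must enable the ELB health check on the ASG.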
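The difference between the two health check types can be sketched as a replacement decision. This is a simplified model with a hypothetical `should_replace` helper; it covers only the EC2 and ELB types discussed here.

```python
def should_replace(ec2_status_ok, elb_target_healthy, health_check_type="EC2"):
    """Decide whether the ASG would treat an instance as unhealthy and replace it.

    With type "EC2" only the instance status counts; with "ELB" a failed load
    balancer health check also marks the instance unhealthy.
    """
    if health_check_type not in ("EC2", "ELB"):
        raise ValueError("this sketch models only the EC2 and ELB types")
    if not ec2_status_ok:
        return True
    return health_check_type == "ELB" and not elb_target_healthy


# The instance is running, but the application no longer answers health checks.
print(should_replace(ec2_status_ok=True, elb_target_healthy=False, health_check_type="EC2"))  # False
print(should_replace(ec2_status_ok=True, elb_target_healthy=False, health_check_type="ELB"))  # True
```

In the first case the load balancer stops sending traffic to the instance, but the ASG never replaces it, so capacity silently shrinks.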
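3) Expecting HTTP path health checks on NLB by default #
NLB is L4, and its health checks default to a plain TCP connection check. If the health check must judge by HTTP response, configure an HTTP/HTTPS health check on the target group explicitly. If you need path- or host-based routing, ALB is the right choice.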
4) Assuming Route 53 failover is instantaneous #
The switch takes time because of the DNS TTL and the health check interval. Lower the TTL to make the switch faster.
Summary #
What we covered in this post:
- Redundancy starts at the AZ level. The default availability answer at the associate level is Multi-AZ
- An ASG maintains and adjusts capacity with Min, Desired, and Max. Policies are Target Tracking (recommended), Step, and Scheduled
- Lifecycle hooks insert tasks on launch and terminate. Use the terminate hook to preserve logs
- ELB comes as ALB (L7), NLB (L4), and GWLB (L3). Health checks ensure traffic goes only to healthy targets
- Connection draining protects in-flight requests during deployments and scale-downs
- Route 53 Failover + health checks handle endpoint- and Region-level switching. Tune switch speed with DNS TTL
Next: Domain 2-2 Backup, Restore, and Disaster Recovery #
Now that availability has taken care of “keeping it running,” next comes “recovering without losing data.”
In #6 Domain 2-2 Reliability: Backup, Restore, and Disaster Recovery (DR), I’ll cover EBS snapshots and AMIs, RDS backups and snapshots, how to centrally manage backups with AWS Backup, the meaning of RPO and RTO, and the DR strategies that progress through backup, pilot light, warm standby, and multi-site.