AWS Certified CloudOps Engineer - Associate (SOA-C03) #5 Domain 2-1 Reliability: Multi-AZ, Auto Scaling, and ELB Health Checks

The posts up through #4 covered the monitoring domain. Starting with this post, we move on to the second domain, Reliability and Business Continuity (22%). The first axis of reliability is availability operations: keeping the service running even when failures occur. Multi-AZ, Auto Scaling, and ELB, which you studied as design concepts in SAA, are revisited here from the angle of how they behave in operation and what to fix when something goes wrong.

The basics of availability: Availability Zone redundancy #

A Region is made up of multiple physically separated Availability Zones (AZs). For the service to survive even if an entire AZ fails, you must place resources across multiple AZs.

| Configuration | Single-AZ | Multi-AZ |
| --- | --- | --- |
| Impact of AZ failure | Service outage | Continues on another AZ |
| Typical use | Dev/temporary | Production workloads |

The key point is that redundancy almost always starts at the AZ level. Multi-Region, which guards against an entire Region failure, is costly, and the default answer on the associate exam is usually Multi-AZ.

Auto Scaling group (ASG) operations #

An ASG is a mechanism that maintains a set capacity and adjusts the instance count according to load. Three capacity values are central to operations.

| Value | Meaning |
| --- | --- |
| Minimum | The minimum instance count always maintained |
| Desired | The current target count. Policies adjust this value |
| Maximum | The upper limit you can scale up to |
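
As a rough mental model of how the three values relate, the sketch below treats a policy's requested count as something that is always bounded by Minimum and Maximum. The function name `effective_desired` is hypothetical and exists only for illustration; it is not an AWS API.

```python
def effective_desired(minimum: int, requested: int, maximum: int) -> int:
    """Bound a requested instance count to the [minimum, maximum] range.

    Illustrative model only: a scaling policy can move Desired,
    but never below Minimum or above Maximum.
    """
    if minimum > maximum:
        raise ValueError("minimum must not exceed maximum")
    return max(minimum, min(requested, maximum))


# A policy asking for 10 instances in a group capped at 6 ends up at 6.
print(effective_desired(minimum=2, requested=10, maximum=6))
```

This is also why setting Minimum and Maximum to the same number pins the group at that size.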

Scaling policies #

| Policy | Behavior | When |
| --- | --- | --- |
| Target Tracking | Automatically adjusts to keep a target metric value (e.g., CPU 50%) | Most recommended. Simple and stable |
| Step Scaling | Adjusts by set amounts per threshold band | When fine-grained control is needed |
| Scheduled | Changes capacity at set times | When traffic patterns are predictable |

A predictable pattern like “traffic surges every morning” calls for Scheduled Scaling, while “adjust automatically based on load” calls for Target Tracking.
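
To make the two cases concrete, here is a minimal sketch that builds the request parameters for each policy type as plain dictionaries. The helper names, the group name `web-asg`, and the policy and action names are assumptions made up for this example; the dictionary keys follow the boto3 Auto Scaling client's `put_scaling_policy` and `put_scheduled_update_group_action` calls.

```python
def target_tracking_policy(asg_name: str, target_cpu: float) -> dict:
    """Parameters for a Target Tracking policy on average CPU."""
    return {
        "AutoScalingGroupName": asg_name,
        "PolicyName": f"{asg_name}-cpu-target",  # hypothetical name
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingConfiguration": {
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ASGAverageCPUUtilization"
            },
            "TargetValue": target_cpu,
        },
    }


def scheduled_action(asg_name: str, cron: str, desired: int) -> dict:
    """Parameters for a recurring Scheduled Scaling action."""
    return {
        "AutoScalingGroupName": asg_name,
        "ScheduledActionName": f"{asg_name}-scheduled",  # hypothetical name
        "Recurrence": cron,  # cron expression, UTC unless a time zone is set
        "DesiredCapacity": desired,
    }


tracking = target_tracking_policy("web-asg", 50.0)
morning = scheduled_action("web-asg", "0 8 * * *", 6)
```

With credentials configured, these could be passed along as `boto3.client("autoscaling").put_scaling_policy(**tracking)` and `put_scheduled_update_group_action(**morning)`.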

Lifecycle hooks #

A lifecycle hook pauses an instance as it enters or leaves the ASG so that a custom task can run before the transition completes.

  • Launch hook: ensures configuration and registration complete before the instance enters service
  • Terminate hook: handles wrap-up such as log collection and connection cleanup before the instance is terminated

The answer to a requirement like “don’t lose the logs of a terminating instance” is a terminate lifecycle hook. It holds the instance in a wait state, finishes the wrap-up work, and then releases it.
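
The hold-then-release flow can be sketched as two sets of request parameters: one that registers the terminate hook, and one that releases the instance once the wrap-up is done. The helper names, the hook name `collect-logs`, the group name, and the 300-second timeout are assumptions for illustration; the keys follow the boto3 Auto Scaling client's `put_lifecycle_hook` and `complete_lifecycle_action` calls.

```python
def terminate_hook(asg_name: str, hook_name: str, timeout_s: int) -> dict:
    """Parameters that register a hook on instance termination."""
    return {
        "AutoScalingGroupName": asg_name,
        "LifecycleHookName": hook_name,
        "LifecycleTransition": "autoscaling:EC2_INSTANCE_TERMINATING",
        "HeartbeatTimeout": timeout_s,  # how long the instance may wait
        "DefaultResult": "CONTINUE",    # proceed if nobody responds in time
    }


def release_instance(asg_name: str, hook_name: str, instance_id: str) -> dict:
    """Parameters that end the wait state after the wrap-up work is done."""
    return {
        "AutoScalingGroupName": asg_name,
        "LifecycleHookName": hook_name,
        "InstanceId": instance_id,
        "LifecycleActionResult": "CONTINUE",
    }


hook = terminate_hook("web-asg", "collect-logs", 300)
done = release_instance("web-asg", "collect-logs", "i-0123456789abcdef0")
```

In this sketch the log-collection step itself is left out; whatever performs it would send the second request when it finishes.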

ELB: types and health checks #

ELB distributes traffic only to healthy targets. Distinguishing the types is an exam point.

| Type | Layer | Typical use |
| --- | --- | --- |
| ALB | L7 (HTTP/HTTPS) | Path/host-based routing, web applications |
| NLB | L4 (TCP/UDP) | Ultra-high performance, static IP, low latency |
| GWLB | L3 | Inline security appliances such as firewalls |

Health checks #

ELB sends periodic health check requests to the instances in a target group to determine whether they are healthy.

  • Configure the healthy threshold, unhealthy threshold, interval, timeout, and path
  • No traffic is sent to targets judged unhealthy
  • ALB judges by HTTP/HTTPS path (e.g., /health) and status code; NLB defaults to a TCP connection check but can also be configured with HTTP or HTTPS checks

A scenario like “some instances aren’t responding but ELB keeps sending them traffic” is usually a health check path or threshold configuration problem.
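
To see why thresholds matter, here is a small simulation of the consecutive-count logic: a target only changes state after enough consecutive results that contradict its current state. This is a simplified model written for this post (the function name `simulate_health` is hypothetical), not the load balancer's actual implementation.

```python
def simulate_health(results, healthy_threshold, unhealthy_threshold,
                    initially_healthy=True):
    """Return the target's state after each health check result.

    results: iterable of booleans, True for a passed check.
    """
    healthy = initially_healthy
    streak = 0  # consecutive results that contradict the current state
    states = []
    for ok in results:
        if ok == healthy:
            streak = 0
        else:
            streak += 1
            needed = healthy_threshold if ok else unhealthy_threshold
            if streak >= needed:
                healthy = ok
                streak = 0
        states.append(healthy)
    return states


# Three failures in a row flip the target to unhealthy; two passes restore it.
checks = [True, False, False, False, True, True]
print(simulate_health(checks, healthy_threshold=2, unhealthy_threshold=3))
# prints [True, True, True, False, False, True]
```

In this model, a target that fails intermittently but never reaches the unhealthy threshold in a row keeps receiving traffic, which is one way the scenario above arises.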

Connection draining (Deregistration Delay) #

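When a target is deregistered from a target group or terminated, this is how long the load balancer keeps it in a draining state so that in-flight requests can finish (300 seconds by default). "Connection draining" is the Classic Load Balancer name for the feature; on ALB and NLB target groups it is called deregistration delay. If this value is 0, user requests are cut off during deployments and scale-downs. The answer to “some requests fail on every deployment” is the connection draining (deregistration delay) setting.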
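
As a minimal sketch, the setting is a target group attribute. The helper below builds the request parameters for the boto3 `elbv2` client's `modify_target_group_attributes` call; the function name is hypothetical and the ARN in the example is a placeholder, not a real resource.

```python
def deregistration_delay_params(target_group_arn: str, seconds: int) -> dict:
    """Parameters that set the deregistration delay on a target group."""
    if not 0 <= seconds <= 3600:
        raise ValueError("deregistration delay must be between 0 and 3600 seconds")
    return {
        "TargetGroupArn": target_group_arn,
        "Attributes": [
            {
                "Key": "deregistration_delay.timeout_seconds",
                "Value": str(seconds),  # attribute values are strings
            }
        ],
    }


# Placeholder ARN for illustration only.
params = deregistration_delay_params("arn:aws:elasticloadbalancing:example", 120)
```

A value somewhat longer than the slowest expected request is a common starting point; very long values slow down deployments and scale-ins.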

Route 53 failover #

Endpoint-level availability beyond AZs is configured with Route 53.

| Routing policy | Behavior |
| --- | --- |
| Failover | If the primary is unhealthy, switch to the secondary |
| Weighted | Distribute by weight ratio (gradual rollout, A/B) |
| Latency | Route to the Region with the lowest latency |
| Multivalue | Return multiple healthy endpoints (simple distribution) |

The premise of failover is a Route 53 health check. It periodically checks the endpoint and, if unhealthy, removes it from the DNS response and hands over to the secondary. The answer to “automatically switch to another Region during a Region failure” is Route 53 Failover routing + health checks. It’s worth remembering, too, that the switch is not instantaneous because of the DNS cache (TTL).
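
A back-of-the-envelope estimate shows why the switch is not instantaneous. The sketch below adds the time the health check needs to declare the endpoint unhealthy to the time a resolver may keep serving the cached record. It is a rough upper bound under the assumption that those two delays simply add up; the function name and the example values are illustrative, not measured figures.

```python
def worst_case_failover_seconds(check_interval_s: int,
                                failure_threshold: int,
                                record_ttl_s: int) -> int:
    """Rough upper bound on the time until clients reach the secondary.

    Detection: failure_threshold consecutive failed checks.
    Propagation: cached answers may live for up to one full TTL.
    """
    detection = check_interval_s * failure_threshold
    return detection + record_ttl_s


# 30 s interval, 3 failed checks, 60 s TTL
print(worst_case_failover_seconds(30, 3, 60))  # prints 150
# Faster checks shorten detection but not the cached TTL
print(worst_case_failover_seconds(10, 3, 60))  # prints 90
```

Clients or resolvers that cache beyond the TTL would stretch this further, so treat the number as an estimate rather than a guarantee.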

Exam question patterns #

  • Keep running even through an AZ failure → redundancy across multiple AZs
  • Predictable traffic surge → Scheduled Scaling
  • Adjust automatically with load → Target Tracking
  • Preserve the logs of a terminating instance → terminate lifecycle hook
  • Traffic keeps going to an unhealthy instance → check health check settings
  • Requests fail during deployment → connection draining (deregistration delay)
  • Automatic switch during a Region failure → Route 53 Failover + health checks

Common Pitfalls #

1) Thinking changing only Desired is permanent #

A manual change to Desired is not permanent: an attached scaling policy readjusts it the next time it acts. To pin the capacity, set Min and Max to the same value or adjust the policy.

2) Leaving the ASG health check as EC2 only #

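A repeat of the pitfall seen in #4. With only EC2 health checks, the ASG replaces an instance only when the instance itself fails its status checks. To catch application failures too, you must enable the ELB health check on the ASG so that targets the load balancer marks unhealthy are replaced.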
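
As a minimal sketch, switching the health check type is a single group update. The helper below builds the parameters for the boto3 Auto Scaling client's `update_auto_scaling_group` call; the function name, the group name, and the 300-second grace period are assumptions chosen for the example.

```python
def enable_elb_health_check(asg_name: str, grace_period_s: int) -> dict:
    """Parameters that make the ASG honor load balancer health checks."""
    return {
        "AutoScalingGroupName": asg_name,
        "HealthCheckType": "ELB",
        # Time after launch during which failed checks are ignored,
        # so instances that are still booting are not replaced.
        "HealthCheckGracePeriod": grace_period_s,
    }


params = enable_elb_health_check("web-asg", 300)
```

The grace period should cover the application's normal startup time; too short a value can cause a loop of launches and terminations.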

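3) Expecting L7 routing on NLB #

NLB is L4 and forwards connections without inspecting HTTP content, so path- or host-based routing calls for ALB. Health checks are a separate matter: an NLB target group defaults to TCP checks, but it can also be configured with HTTP or HTTPS checks against a path.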

4) Assuming Route 53 failover is instantaneous #

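The switch takes time for two reasons: the health check needs several consecutive failed checks before it reports the endpoint unhealthy, and resolvers keep serving the cached answer until the record's TTL expires. Lower the TTL to make the switch faster.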

Summary #

What we covered in this post:

  • Redundancy starts at the AZ level. The default availability answer at the associate level is Multi-AZ
  • An ASG maintains and adjusts capacity with Min, Desired, and Max. Policies are Target Tracking (recommended), Step, and Scheduled
  • Lifecycle hooks insert tasks on launch and terminate. Use the terminate hook to preserve logs
  • ELB comes as ALB (L7), NLB (L4), and GWLB (L3). Health checks ensure traffic is distributed only to healthy targets
  • Connection draining protects in-flight requests during deployments and scale-downs
  • Route 53 Failover + health checks handle endpoint- and Region-level switching. Tune switch speed with DNS TTL

Next: Domain 2-2 Backup, Restore, and Disaster Recovery #

Now that availability has taken care of “keeping it running,” next comes “recovering without losing data.”

In #6 Domain 2-2 Reliability: Backup, Restore, and Disaster Recovery (DR), I’ll cover EBS snapshots and AMIs, RDS backups and snapshots, how to centrally manage backups with AWS Backup, the meaning of RPO and RTO, and the DR strategies that progress through backup, pilot light, warm standby, and multi-site.
