AWS Certified Solutions Architect - Associate (SAA-C03) #7 Domain 2-2 Resilient Architectures — DR Patterns


In #6 we nailed down high availability within a single Region. This time we deal with disaster recovery (DR) strategy for a bigger failure — a disaster that paralyzes an entire Region. A DR question is always a tug-of-war between “how fast must you recover (RTO), and how much data loss can you tolerate (RPO)” and “how much will you spend.”

RTO and RPO #

If you confuse these two metrics, you will get almost every DR question wrong.

| Metric | Meaning | Question |
| --- | --- | --- |
| RTO (Recovery Time Objective) | The time it takes to recover | “Within how many minutes/hours after a failure must recovery happen?” |
| RPO (Recovery Point Objective) | The amount of data loss that can be tolerated | “How much data can be lost since the last backup?” |

RTO is the time axis, RPO is the data (point in time) axis. If RPO is 5 minutes, you can lose at most 5 minutes of data, so you must replicate/back up that frequently. The shorter the RTO, the more resources you must prepare in advance. Making both small drives the cost up.
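The relationship between backup frequency and RPO can be checked with a few lines of arithmetic. The sketch below assumes periodic backups where a failure just before the next backup loses one full interval of data; the function names are illustrative, not part of any AWS SDK.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """With periodic backups, the worst case is a failure just before the
    next backup runs, which loses one full interval of data."""
    return backup_interval


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """True if the backup/replication interval keeps data loss within the RPO."""
    return worst_case_data_loss(backup_interval) <= rpo


def meets_rto(estimated_recovery_time: timedelta, rto: timedelta) -> bool:
    """True if the estimated time to restore service fits within the RTO."""
    return estimated_recovery_time <= rto


# An RPO of 5 minutes rules out hourly backups but allows 5-minute replication.
rpo = timedelta(minutes=5)
hourly_ok = meets_rpo(timedelta(hours=1), rpo)
five_minute_ok = meets_rpo(timedelta(minutes=5), rpo)
```

Here `hourly_ok` is `False` and `five_minute_ok` is `True`, which is the "you must replicate/back up that frequently" point in code form.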

The four DR strategies #

AWS’s standard DR strategies fall into four tiers based on the trade-off between cost and recovery speed. As you go from top to bottom, the cost rises and RTO/RPO shorten.

| Strategy | Steady state | RTO/RPO | Cost |
| --- | --- | --- | --- |
| Backup & Restore | Data backed up only | Longest (hours) | Cheapest |
| Pilot Light | Only the core (DB) running minimally | Short (tens of minutes) | Low |
| Warm Standby | A scaled-down full environment always running | Shorter (minutes) | Medium |
| Multi-Site Active/Active | Both sides running at full scale | Near 0 | Most expensive |

1) Backup & Restore #

You keep the data backed up, and on disaster you spin up fresh infrastructure from the backup to recover. In steady state you pay only backup storage costs. It is the cheapest but takes the longest to recover. If the clue is “minimize cost, and a long recovery time is acceptable,” this is the strategy.

2) Pilot Light #

Like the pilot light that stays lit so a gas heater can fire up at once, you always keep only the core element (usually the database) replicated and on while leaving the rest (application servers, etc.) off. When disaster strikes, you quickly start up the parts that were off. Because the DB is already up to date, RPO and RTO are shorter than with Backup & Restore.

3) Warm Standby #

A scaled-down version of the entire environment is always running. When disaster strikes, you only need to scale up that environment to production size. If Pilot Light is “keep only the core on and the rest off,” Warm Standby is “keep the whole thing on, even if small.” It recovers faster than Pilot Light and costs correspondingly more.

4) Multi-Site Active/Active (Hot Standby) #

Production-scale environments run simultaneously in both Regions and share the live traffic. Even if one Region dies, the other immediately takes all of it, so RTO/RPO is near 0. It is the most expensive strategy, but it comes closest to zero downtime.
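The difference between the four strategies comes down to what is already running in the recovery Region, and therefore what is left to do when disaster strikes. The following is a minimal sketch of that idea; the dictionary keys and step wording are illustrative labels taken from the descriptions above, not AWS terminology.

```python
# What already exists in the recovery Region in steady state, per strategy.
STANDBY_STATE = {
    "Backup & Restore": {"database": "backup only", "app_servers": "off"},
    "Pilot Light": {"database": "replicated", "app_servers": "off"},
    "Warm Standby": {"database": "replicated", "app_servers": "scaled down"},
    "Multi-Site Active/Active": {"database": "replicated", "app_servers": "full scale"},
}


def recovery_steps(strategy: str) -> list:
    """List the work still to be done after a Region failure."""
    state = STANDBY_STATE[strategy]
    steps = []
    if state["database"] == "backup only":
        steps.append("restore the database from backup")
    if state["app_servers"] == "off":
        steps.append("start application servers")
    elif state["app_servers"] == "scaled down":
        steps.append("scale application servers up to production size")
    # Every strategy ends by pointing users at the recovery Region.
    steps.append("send all traffic to the recovery Region")
    return steps


for name in STANDBY_STATE:
    print(name, "->", recovery_steps(name))
```

The fewer steps remain, the shorter the RTO, and the more you pay to keep things running in steady state.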

Cross-Region implementation options #

These are the AWS features that actually build a DR strategy.

  • Route 53 Failover routing — detects a primary Region failure via health checks and switches DNS to the secondary Region. The standard for automatic DR switchover.
  • RDS cross-Region read replica — keep a read replica in another Region and promote it when disaster strikes.
  • Aurora Global Database — replicates to multiple Regions with sub-second lag. Fast promotion on a Region failure.
  • DynamoDB global tables — multi-Region active/active replication. Suited to the Multi-Site pattern.
  • S3 Cross-Region Replication (CRR) — automatically replicates objects to a bucket in another Region.

If the requirement is “automatically switch to another Region on a Region failure,” the combination of Route 53 Failover + cross-Region replication is the answer.
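As a rough illustration of the Route 53 side, the sketch below builds the change batch for a PRIMARY/SECONDARY failover record pair. The field names (`SetIdentifier`, `Failover`, `HealthCheckId`) follow the Route 53 `ChangeResourceRecordSets` API; the domain name, IP addresses, and health check ID are hypothetical placeholders, and plain A records are assumed for brevity (real setups often use alias records pointing at load balancers).

```python
def failover_change_batch(name, primary_ip, secondary_ip, health_check_id, ttl=60):
    """Build a Route 53 change batch with a PRIMARY and a SECONDARY failover record."""

    def record(role, ip, extra):
        rrset = {
            "Name": name,
            "Type": "A",
            "SetIdentifier": f"{role.lower()}-region",  # must be unique per record
            "Failover": role,  # "PRIMARY" or "SECONDARY"
            "TTL": ttl,
            "ResourceRecords": [{"Value": ip}],
        }
        rrset.update(extra)
        return {"Action": "UPSERT", "ResourceRecordSet": rrset}

    return {
        "Comment": "Failover routing between two Regions",
        "Changes": [
            # The primary record is served only while its health check passes.
            record("PRIMARY", primary_ip, {"HealthCheckId": health_check_id}),
            record("SECONDARY", secondary_ip, {}),
        ],
    }


# Placeholder values for illustration only.
batch = failover_change_batch(
    "app.example.com.", "203.0.113.10", "198.51.100.10", "example-health-check-id"
)

# With boto3 installed and credentials configured, the batch would be applied as:
#   boto3.client("route53").change_resource_record_sets(
#       HostedZoneId="<your hosted zone ID>", ChangeBatch=batch)
```

The data replication itself (RDS replica, Aurora Global Database, DynamoDB global tables, S3 CRR) is configured separately; DNS failover only decides where requests go.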

Exam question patterns #

  • “Minimize cost, a long recovery time is acceptable.” → Backup & Restore
  • “The DB is always replicated, app servers start up on disaster.” → Pilot Light
  • “A scaled-down full environment is up and scales out on disaster.” → Warm Standby
  • “Near zero downtime, cost is no object, RTO/RPO≈0.” → Multi-Site Active/Active
  • “Automatic DNS switchover on a Region failure.” → Route 53 Failover routing
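  • “Multi-Region active/active DB.” → DynamoDB global tables (Aurora Global Database takes writes in a single primary Region, so it fits when the other Regions only need reads and fast promotion)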
  • “Of RTO and RPO, which is the amount of data loss?” → RPO

Common pitfalls #

1) Swapping RTO and RPO #

RTO is time, RPO is data loss (point in time). “How many minutes of data at most can be lost” is RPO.

2) Confusing Pilot Light and Warm Standby #

Pilot Light keeps only the core (DB) on and the rest off. Warm Standby keeps the whole thing at a scaled-down size always on.

3) Recommending Multi-Site for every system #

Multi-Site is the most expensive. If the requirement is “minimize cost” or “recovery time has some slack,” it is over-engineered. You must pick the minimum-cost strategy that meets the RTO/RPO requirement.
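The selection rule can be written as "walk the strategies from cheapest to most expensive and stop at the first one that meets both targets." The sketch below does exactly that. The RTO/RPO figures per strategy are illustrative assumptions chosen to match the rough orders of magnitude in the table above (hours, tens of minutes, minutes, near 0); they are not official AWS numbers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    rto_minutes: float  # assumed achievable recovery time
    rpo_minutes: float  # assumed achievable data-loss window


# Ordered cheapest first. All numbers are placeholders for illustration.
STRATEGIES = [
    Strategy("Backup & Restore", rto_minutes=240, rpo_minutes=240),
    Strategy("Pilot Light", rto_minutes=30, rpo_minutes=5),
    Strategy("Warm Standby", rto_minutes=5, rpo_minutes=1),
    Strategy("Multi-Site Active/Active", rto_minutes=1, rpo_minutes=0.1),
]


def cheapest_strategy(required_rto: float, required_rpo: float) -> Optional[Strategy]:
    """Return the cheapest strategy meeting both targets, or None if none does."""
    for strategy in STRATEGIES:
        if strategy.rto_minutes <= required_rto and strategy.rpo_minutes <= required_rpo:
            return strategy
    return None


choice = cheapest_strategy(required_rto=60, required_rpo=15)
print(choice.name if choice else "no strategy meets the requirement")
```

With these assumed figures, a 60-minute RTO and 15-minute RPO selects Pilot Light; Multi-Site is only chosen when nothing cheaper satisfies the requirement.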

4) Expecting automatic switchover without Route 53 #

Cross-Region automatic failover is usually handled by Route 53 health checks + Failover routing.
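Automatic does not mean instant: DNS failover still takes the time to detect the failure plus the time for cached DNS answers to expire. The estimate below is a rough sketch under stated assumptions. The defaults (30-second request interval, failure threshold of 3) reflect what are understood to be Route 53's standard health check settings and should be verified against the current documentation; the 60-second TTL is an arbitrary example, and resolver behavior and application startup time are ignored.

```python
def estimated_dns_failover_seconds(
    request_interval_s: int = 30,  # assumed health check interval
    failure_threshold: int = 3,    # consecutive failures before "unhealthy"
    record_ttl_s: int = 60,        # example TTL; clients may cache this long
) -> int:
    """Rough upper-bound estimate of how long DNS failover takes to reach clients."""
    detection = request_interval_s * failure_threshold
    return detection + record_ttl_s


standard = estimated_dns_failover_seconds()
fast = estimated_dns_failover_seconds(request_interval_s=10)
print(f"standard checks: ~{standard}s, fast checks: ~{fast}s")
```

Under these assumptions the standard settings give roughly 150 seconds and a 10-second interval roughly 90 seconds, which is worth comparing against the RTO before relying on DNS failover alone.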

Summary #

What we nailed down in this post:

  • RTO = recovery time, RPO = amount of data loss. Making both small raises the cost
  • The four strategies: Backup & Restore (cheap, slow) → Pilot Light (core only) → Warm Standby (scaled-down full) → Multi-Site (full scale, near-zero downtime)
  • Choosing the minimum-cost strategy that matches the required RTO/RPO is the heart of the correct answer
  • Cross-Region — Route 53 Failover + RDS cross-Region replication / Aurora Global / DynamoDB global tables / S3 CRR

Next — Domain 2-3 Backup Strategy #

The foundation of any DR strategy is ultimately a backup you can trust. As the last topic of the resilience domain, we cover backup.

In #8 Domain 2-3 Backup Strategy we will organize EBS snapshots (incremental, cross-Region copy) and Data Lifecycle Manager, RDS automated backups and point-in-time recovery (PITR), and AWS Backup for centrally managing backups across multiple services, plus immutable backups (Vault Lock).
