Chapter 30

Disaster Recovery & Backup — Backups · Cross-region DR · RTO/RPO

This chapter covers how to bring data and services back when an AZ or an entire region fails. Set RTO/RPO first, then put backups in place with Terraform using RDS PITR · S3 versioning and Cross-Region Replication · AWS Backup, and round out the picture with the cross-region DR patterns Pilot Light · Warm Standby · Multi-Site and Route 53 failover.

In Chapter 1 we said Multi-AZ is the operational baseline for surviving an AZ failure, and in Chapter 11 RDS we covered the Multi-AZ option. This chapter deals with a question one level above that — does the backup actually restore, and what do you do when an entire region collapses.

Disaster recovery (DR) stays invisible most of the time but decides a company’s survival when an incident hits. The key is to set two numbers first — how long you can stay down (RTO), and how much data you can afford to lose (RPO). Look at this chapter’s backup·DR design alongside the account separation of Chapter 29 security governance and the operations checklist of the Part 6 capstone.

First, Two Numbers: RTO and RPO

| Metric | Meaning | Question |
|---|---|---|
| RTO (Recovery Time Objective) | allowed time until recovery | “Within how many hours must we be back up?” |
| RPO (Recovery Point Objective) | allowed data loss | “How much can we afford to lose since the last backup?” |

RTO 4 hours / RPO 1 hour means: recover within 4 hours, but you may lose the hour of data just before the incident. The smaller these two numbers, the more sharply cost rises. So DR design starts not from “how robust” but from “what should RTO/RPO actually be for this service.” The answers for an internal admin tool and a payment system are completely different.

Backups: Not Losing Data

The foundation of DR is backups. The tools differ by service.

RDS: Automated Backups and PITR

  • Automated backups: for the retention period (up to 35 days), RDS keeps a daily snapshot plus transaction logs. Turn it on by setting backup_retention_period to a value greater than 0.
  • PITR (Point-in-Time Recovery): restores a new instance to an arbitrary point (down to the second) within the retention period. You can roll back to 5 minutes before you accidentally dropped a table.
  • Cross-region copy of manual / automated snapshots: keep a copy in another region to prepare for a region failure.
RDS backup retention + automated cross-region copy
resource "aws_db_instance" "main" {
  identifier              = "myapp-prod"
  # engine, instance_class, storage, and credentials omitted for brevity
  backup_retention_period = 14                  # 14 days of PITR
  backup_window           = "17:00-17:30"       # UTC (KST 02:00)
  copy_tags_to_snapshot   = true
}

# Replicate automated backups to the DR region (separate region provider)
resource "aws_db_instance_automated_backups_replication" "dr" {
  provider               = aws.dr            # e.g., us-west-2
  source_db_instance_arn = aws_db_instance.main.arn
  retention_period       = 14
}
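
The aws.dr provider referenced above has to be declared somewhere in the root module. A minimal sketch, assuming the primary region is ap-northeast-2 (a guess based on the KST comment on the backup window) and the DR region is us-west-2 as in the example comment:

```hcl
# Default provider: the primary region (assumed here to be Seoul).
provider "aws" {
  region = "ap-northeast-2"
}

# Aliased provider: any resource that sets `provider = aws.dr`
# is created in the DR region instead.
provider "aws" {
  alias  = "dr"
  region = "us-west-2"
}
```

If the source instance is encrypted, the replication resource will most likely also need a kms_key_id pointing at a KMS key in the DR region, since keys are regional.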

For recovery you specify a point in time and bring up a new instance. PITR always creates a new instance and never overwrites the original, so you can verify the restored data first and then switch the app’s connection over.

Restore RDS to just before the mistake
aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier myapp-prod \
  --target-db-instance-identifier myapp-restored \
  --restore-time 2026-05-24T09:25:00Z
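
The same restore can also be expressed in Terraform if you want the restored instance under IaC. This is a sketch rather than a prescribed procedure: the instance class and the myapp-dr-restored identifier are assumptions, and the second resource assumes the automated-backup replication to the DR region is already in place.

```hcl
# Primary region: restore to a specific timestamp as a new instance.
resource "aws_db_instance" "restored" {
  identifier          = "myapp-restored"
  instance_class      = "db.t4g.medium" # assumption: size it like the source
  skip_final_snapshot = true            # assumption: this instance is temporary

  restore_to_point_in_time {
    source_db_instance_identifier = aws_db_instance.main.identifier
    restore_time                  = "2026-05-24T09:25:00Z" # UTC
  }
}

# DR region: restore from the replicated automated backups.
resource "aws_db_instance" "dr_restored" {
  provider            = aws.dr
  identifier          = "myapp-dr-restored" # hypothetical name
  instance_class      = "db.t4g.medium"
  skip_final_snapshot = true
  # db_subnet_group_name and security groups for the DR VPC omitted.

  restore_to_point_in_time {
    # The replication resource's id is the ARN of the replicated backups.
    source_db_instance_automated_backups_arn = aws_db_instance_automated_backups_replication.dr.id
    use_latest_restorable_time               = true
  }
}
```

Using use_latest_restorable_time in the DR region reflects the usual goal there: lose as little data as possible, rather than rewind to a known-good moment.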

For Aurora Serverless v2, a non-prod / idle DB can be set to a minimum capacity of 0 ACU (auto-pause) so that compute cost goes to 0 while it’s stopped (only storage is billed). That said, the read copy in the DR region that must always stay current must not be paused, so keep it on at a low minimum ACU.
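
As a rough illustration of the auto-pause setting, here is a sketch of a non-prod cluster. The names, the capacity ceiling, and the pause delay are assumptions; scaling to 0 ACU also depends on the engine version, so pin engine_version to one that supports it.

```hcl
resource "aws_rds_cluster" "dev" {
  cluster_identifier          = "myapp-dev"         # hypothetical name
  engine                      = "aurora-postgresql"
  engine_mode                 = "provisioned"       # Serverless v2 runs in provisioned mode
  master_username             = "appadmin"          # hypothetical
  manage_master_user_password = true                # password stored in Secrets Manager
  skip_final_snapshot         = true                # acceptable for a dev cluster only

  serverlessv2_scaling_configuration {
    min_capacity             = 0   # 0 ACU allows the cluster to auto-pause
    max_capacity             = 4   # assumption: small ceiling for non-prod
    seconds_until_auto_pause = 600 # pause after 10 idle minutes
  }
}

resource "aws_rds_cluster_instance" "dev" {
  cluster_identifier = aws_rds_cluster.dev.id
  instance_class     = "db.serverless"
  engine             = aws_rds_cluster.dev.engine
  engine_version     = aws_rds_cluster.dev.engine_version
}
```

For the DR-region copy, the same block would use a small non-zero min_capacity (for example 0.5) so that replication never stops.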

S3: Versioning and Cross-Region Replication

  • Versioning: even if you overwrite or delete an object, the previous version remains. It’s the baseline defense against accidental deletion and ransomware.
  • Replication (CRR: Cross-Region Replication): automatically replicates to a bucket in another region / account to prepare for a region failure. Versioning must be turned on for both the source and the destination bucket.
  • Object Lock: locks regulated data so it cannot be deleted or overwritten for a set period.
S3 versioning + cross-region replication
resource "aws_s3_bucket_versioning" "main" {
  bucket = aws_s3_bucket.main.id
  versioning_configuration { status = "Enabled" }
}

resource "aws_s3_bucket_replication_configuration" "crr" {
  # Versioning must be enabled before replication can be configured.
  depends_on = [aws_s3_bucket_versioning.main]

  bucket = aws_s3_bucket.main.id
  role   = aws_iam_role.replication.arn
  rule {
    id     = "to-dr"
    status = "Enabled"
    destination {
      bucket        = aws_s3_bucket.dr.arn   # DR-region bucket
      storage_class = "STANDARD_IA"
    }
  }
}
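
The replication configuration above references a role and a DR bucket that are not shown. The following sketch fills in those pieces under stated assumptions: the role name is hypothetical, aws_s3_bucket.dr is assumed to be created with the aws.dr provider, and the permission set follows the commonly documented minimum for unencrypted, same-account replication.

```hcl
# The destination bucket needs versioning too.
resource "aws_s3_bucket_versioning" "dr" {
  provider = aws.dr
  bucket   = aws_s3_bucket.dr.id
  versioning_configuration { status = "Enabled" }
}

# Trust policy: S3 assumes this role to copy objects.
data "aws_iam_policy_document" "replication_assume" {
  statement {
    actions = ["sts:AssumeRole"]
    principals {
      type        = "Service"
      identifiers = ["s3.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "replication" {
  name               = "myapp-s3-replication" # hypothetical name
  assume_role_policy = data.aws_iam_policy_document.replication_assume.json
}

data "aws_iam_policy_document" "replication" {
  # Read the replication config and list the source bucket.
  statement {
    actions   = ["s3:GetReplicationConfiguration", "s3:ListBucket"]
    resources = [aws_s3_bucket.main.arn]
  }
  # Read object versions from the source.
  statement {
    actions = [
      "s3:GetObjectVersionForReplication",
      "s3:GetObjectVersionAcl",
      "s3:GetObjectVersionTagging",
    ]
    resources = ["${aws_s3_bucket.main.arn}/*"]
  }
  # Write replicas into the DR bucket.
  statement {
    actions   = ["s3:ReplicateObject", "s3:ReplicateDelete", "s3:ReplicateTags"]
    resources = ["${aws_s3_bucket.dr.arn}/*"]
  }
}

resource "aws_iam_role_policy" "replication" {
  name   = "replication"
  role   = aws_iam_role.replication.id
  policy = data.aws_iam_policy_document.replication.json
}
```

Buckets encrypted with KMS keys or owned by another account need additional permissions that this sketch does not cover.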

AWS Backup: Manage It All in One Place

When backups are configured separately for each service, it’s easy to miss one. AWS Backup lets you set backup policies for RDS · EBS · EFS · DynamoDB · S3 and more in one place, and automates retention · cross-region copy · backup verification.

AWS Backup plan — daily backup + DR-region copy
resource "aws_backup_plan" "daily" {
  name = "daily-with-dr-copy"
  rule {
    rule_name         = "daily"
    target_vault_name = aws_backup_vault.main.name
    schedule          = "cron(0 17 * * ? *)"   # daily at UTC 17:00
    lifecycle { delete_after = 35 }
    copy_action {
      destination_vault_arn = aws_backup_vault.dr.arn   # DR-region vault
      lifecycle { delete_after = 35 }
    }
  }
}

# Auto-select backup targets by tag (all resources tagged Backup=true)
resource "aws_backup_selection" "tagged" {
  name         = "tagged-resources"
  plan_id      = aws_backup_plan.daily.id
  iam_role_arn = aws_iam_role.backup.arn
  selection_tag {
    type  = "STRINGEQUALS"
    key   = "Backup"
    value = "true"
  }
}
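
The plan and selection above refer to two vaults and an IAM role that are not shown. A minimal sketch of those supporting resources follows; the vault and role names are hypothetical, and the role uses the AWS managed policies for backup and restore.

```hcl
resource "aws_backup_vault" "main" {
  name = "myapp-main" # hypothetical name
}

# The copy target lives in the DR region.
resource "aws_backup_vault" "dr" {
  provider = aws.dr
  name     = "myapp-dr" # hypothetical name
}

# Trust policy: AWS Backup assumes this role to take and restore backups.
data "aws_iam_policy_document" "backup_assume" {
  statement {
    actions = ["sts:AssumeRole"]
    principals {
      type        = "Service"
      identifiers = ["backup.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "backup" {
  name               = "myapp-aws-backup" # hypothetical name
  assume_role_policy = data.aws_iam_policy_document.backup_assume.json
}

resource "aws_iam_role_policy_attachment" "backup" {
  role       = aws_iam_role.backup.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSBackupServiceRolePolicyForBackup"
}

resource "aws_iam_role_policy_attachment" "restore" {
  role       = aws_iam_role.backup.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSBackupServiceRolePolicyForRestores"
}
```

S3 backups use separate managed policies, so a role that also covers S3 would need those attached as well.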

Combined with AWS Organizations from Chapter 29, you can apply backup policies across the whole organization at once, so new accounts also become backup targets automatically.

A backup is only a backup once you’ve restored it. A backup whose restore procedure you’ve never actually run, and whose RTO you’ve never measured, is just a “backup you believe you have.” We recommend a restore drill every quarter.
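
Part of the drill can be automated with AWS Backup restore testing. The following is a sketch under assumptions, not a complete drill: the resource names are hypothetical, it assumes a vault named aws_backup_vault.main and a role aws_iam_role.backup with restore permissions, and RDS restores may need extra restore metadata depending on your network setup.

```hcl
resource "aws_backup_restore_testing_plan" "quarterly" {
  name = "quarterly_restore_drill" # letters, numbers, and underscores only

  recovery_point_selection {
    algorithm            = "LATEST_WITHIN_WINDOW"
    include_vaults       = [aws_backup_vault.main.arn]
    recovery_point_types = ["SNAPSHOT"]
  }

  # 03:00 UTC on the first day of January, April, July, and October.
  schedule_expression = "cron(0 3 1 1,4,7,10 ? *)"
}

resource "aws_backup_restore_testing_selection" "rds" {
  name                      = "rds_selection"
  restore_testing_plan_name = aws_backup_restore_testing_plan.quarterly.name
  protected_resource_type   = "RDS"
  iam_role_arn              = aws_iam_role.backup.arn
  protected_resource_arns   = ["*"] # every RDS resource with a recovery point in the vault
}
```

An automated test shows that a recovery point can be restored and how long the restore job takes. It does not cover switching the application over, so the end-to-end RTO still has to be measured in a manual drill.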

Cross-region DR Patterns

An entire region collapsing is rare, but when it happens, Multi-AZ can’t stop it. How much you keep prepared in another region divides the options into four patterns. The further down the table, the smaller the RTO/RPO and the higher the cost.

| Pattern | Steady-state | RTO | Cost | Fit |
|---|---|---|---|---|
| Backup & Restore | only backups in another region | hours ~ days | lowest | internal tools, non-critical |
| Pilot Light | only the minimal core (DB replication) running | tens of minutes ~ hours | low | general services |
| Warm Standby | always running at reduced scale | minutes ~ tens of minutes | medium | revenue-impacting services |
| Multi-Site (Active-Active) | two regions handle traffic simultaneously | near 0 | highest | services where zero downtime is mandatory |

Pilot Light: The Practical Starting Point

For most services, the realistic first DR is Pilot Light.

  • Replicate only the DB to the DR region and keep it always current (RDS read replica or automated-backup replication).
  • Only define the app (ECS / Lambda) and the network as Terraform code, and normally leave it at desired_count = 0 or not applied at all.
  • On an incident, bring the app up in the DR region with Terraform (or raise desired_count), and switch traffic over with Route 53.
  • Steady-state cost is low — roughly the DB replication — and the RTO is “the time to bring the app up + DNS propagation.”
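
The list above might look roughly like this in Terraform. It is a sketch under assumptions: the dr_active and dr_task_count variables, the instance class, and the ECS cluster and task definition references are hypothetical and assumed to be defined elsewhere in the DR-region configuration.

```hcl
variable "dr_active" {
  description = "Set to true during an incident to bring the app up in the DR region."
  type        = bool
  default     = false
}

variable "dr_task_count" {
  description = "Number of app tasks to run once DR is activated."
  type        = number
  default     = 2
}

# Always on: a cross-region read replica keeps the data current.
resource "aws_db_instance" "dr_replica" {
  provider            = aws.dr
  identifier          = "myapp-prod-dr"            # hypothetical name
  replicate_source_db = aws_db_instance.main.arn   # cross-region replicas take the source ARN
  instance_class      = "db.t4g.medium"            # assumption
  skip_final_snapshot = true
}

# Defined but idle: the app runs zero tasks until dr_active is flipped.
resource "aws_ecs_service" "app_dr" {
  provider        = aws.dr
  name            = "myapp"
  cluster         = aws_ecs_cluster.dr.id
  task_definition = aws_ecs_task_definition.app_dr.arn
  desired_count   = var.dr_active ? var.dr_task_count : 0
  # launch_type, network_configuration, and load_balancer omitted for brevity.
}
```

During an incident, promoting the replica to a standalone writable instance is a separate step, followed by terraform apply -var dr_active=true to start the app.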

If the Chapter 32 capstone infrastructure is codified in Terraform, DR-region recovery simplifies to “apply the same module with the DR-region provider.” That’s why IaC is the prerequisite for DR (Chapter 25 Terraform intro).

Switching with Route 53

The regional cutover is handled by Route 53 (Chapter 12). With a health check + failover routing, DNS automatically moves to the DR region when the primary region fails its health check.

Route 53 failover — primary/secondary
resource "aws_route53_health_check" "primary" {
  fqdn              = "primary.example.com"
  type              = "HTTPS"
  resource_path     = "/health"
  failure_threshold = 3
  request_interval  = 30
}

resource "aws_route53_record" "primary" {
  zone_id        = data.aws_route53_zone.main.zone_id
  name           = "api.example.com"
  type           = "A"
  set_identifier = "primary"
  failover_routing_policy { type = "PRIMARY" }
  health_check_id = aws_route53_health_check.primary.id
  alias {
    name                   = aws_lb.primary.dns_name
    zone_id                = aws_lb.primary.zone_id
    evaluate_target_health = true
  }
}

resource "aws_route53_record" "secondary" {
  zone_id        = data.aws_route53_zone.main.zone_id
  name           = "api.example.com"
  type           = "A"
  set_identifier = "secondary"
  failover_routing_policy { type = "SECONDARY" }
  alias {
    name                   = aws_lb.dr.dns_name      # DR-region ALB
    zone_id                = aws_lb.dr.zone_id
    evaluate_target_health = true
  }
}

In Multi-Site, instead of failover you use weighted or latency routing to always distribute traffic across the two regions.
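
For comparison, a Multi-Site setup could use latency routing as sketched below. These records would replace the failover pair rather than sit alongside it, and the region names are assumptions carried over from the earlier provider sketch.

```hcl
# Users are sent to whichever region answers them with lower latency.
resource "aws_route53_record" "api_primary_region" {
  zone_id        = data.aws_route53_zone.main.zone_id
  name           = "api.example.com"
  type           = "A"
  set_identifier = "ap-northeast-2"
  latency_routing_policy {
    region = "ap-northeast-2" # assumption: primary region
  }
  alias {
    name                   = aws_lb.primary.dns_name
    zone_id                = aws_lb.primary.zone_id
    evaluate_target_health = true # an unhealthy ALB is taken out of rotation
  }
}

resource "aws_route53_record" "api_dr_region" {
  zone_id        = data.aws_route53_zone.main.zone_id
  name           = "api.example.com"
  type           = "A"
  set_identifier = "us-west-2"
  latency_routing_policy {
    region = "us-west-2" # assumption: second region
  }
  alias {
    name                   = aws_lb.dr.dns_name
    zone_id                = aws_lb.dr.zone_id
    evaluate_target_health = true
  }
}
```

Because both regions take writes in this pattern, the data layer has to support multi-region writes or a clear write-forwarding rule, which is where most of the cost and complexity comes from.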

A One-page Decision Flow

  1. Write down the RTO / RPO per service first. There is no single company-wide value.
  2. Lay backups (RDS PITR + S3 versioning + AWS Backup, including cross-region copy) across every service. This is the minimum line.
  3. If RTO/RPO can be tolerated in hours, Backup & Restore is enough.
  4. If you need minutes, step up to Pilot Light → Warm Standby.
  5. If zero downtime ties directly to revenue, consider Multi-Site, but it has the highest cost and operational complexity.
  6. Whatever the pattern, measure RTO for real with a restore drill.
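
The flow above can be captured as data so that each service's target and chosen pattern live next to the infrastructure code. The thresholds below (240, 30, and 5 minutes) and the example services are illustrative assumptions, not values prescribed by this chapter.

```hcl
variable "services" {
  description = "Per-service recovery targets, in minutes (example values)."
  type = map(object({
    rto_minutes = number
    rpo_minutes = number
  }))
  default = {
    admin_tool = { rto_minutes = 1440, rpo_minutes = 1440 }
    storefront = { rto_minutes = 60, rpo_minutes = 15 }
    payments   = { rto_minutes = 2, rpo_minutes = 1 }
  }
}

locals {
  # Map each service's RTO to the cheapest pattern that can plausibly meet it.
  dr_pattern = {
    for name, target in var.services : name => (
      target.rto_minutes >= 240 ? "backup-and-restore" :
      target.rto_minutes >= 30 ? "pilot-light" :
      target.rto_minutes >= 5 ? "warm-standby" :
      "multi-site"
    )
  }
}

output "dr_pattern" {
  value = local.dr_pattern
}
```

With the example values, admin_tool maps to backup-and-restore, storefront to pilot-light, and payments to multi-site. The RPO is carried in the same object so it can drive backup frequency or replication choices in a similar way.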

Exercises

  1. Write your service’s (or a hypothetical service’s) RTO and RPO as numbers, and write one paragraph on the rationale (the business impact when it’s down). Pick which of the four patterns in §“Cross-region DR Patterns” fits those numbers.
  2. An operator accidentally DROPped a prod table. State which of RDS PITR and S3 versioning recovers from this incident, and write the procedure for restoring to a new instance (without overwriting the original) and then bringing the service back up, based on the CLI in §“Backups.”
  3. Explain why Pilot Light is cheaper than Warm Standby from the angle of “what is running in steady-state” (DB only vs. a reduced app too), and write one paragraph on why the IaC of Chapter 25 Terraform intro becomes the prerequisite for Pilot Light.

In short: DR starts with setting two numbers per service, RTO (allowed downtime) and RPO (allowed data loss), and the smaller they are, the more sharply cost rises. The foundation is backups: RDS uses backup_retention_period + PITR + cross-region automated-backup replication, S3 uses versioning + CRR, and AWS Backup bundles RDS, EBS, S3, and more into one plan via tag selection and automates the DR-region copy too. Measure RTO for real with a restore drill. Cross-region DR goes Backup & Restore → Pilot Light → Warm Standby → Multi-Site, with smaller RTO and higher cost at each step, and for most services the realistic starting point is Pilot Light: replicate only the DB and bring the app up with Terraform. The switch is a Route 53 health check + failover routing, and IaC is the prerequisite for DR.

Next Chapter

In the next Chapter 31 Lambda in Depth, we add the production operations angle on top of Chapter 17 Lambda Basics. We cover cold starts and Provisioned Concurrency, Layers and container-image packaging, Lambda Powertools-based observability, combining with Step Functions, and the cost trade-offs.
