Disaster Recovery & Backup — Backups · Cross-region DR · RTO/RPO
This chapter covers how to bring data and services back when an AZ or an entire region fails. Set RTO/RPO first, then put backups in place with Terraform using RDS PITR, S3 versioning and Cross-Region Replication, and AWS Backup, and round out the chapter with the cross-region DR patterns (Pilot Light, Warm Standby, Multi-Site) and Route 53 failover.
In Chapter 1 we said Multi-AZ is the operational baseline for surviving an AZ failure, and in Chapter 11 RDS we covered the Multi-AZ option. This chapter deals with a question one level above that — does the backup actually restore, and what do you do when an entire region collapses.
Disaster recovery (DR) stays invisible most of the time but decides a company’s survival when an incident hits. The key is to set two numbers first: how long you can stay down (RTO), and how much data you can afford to lose (RPO). Read this chapter’s backup and DR design alongside the account separation in Chapter 29 (security governance) and the operations checklist in the Part 6 capstone.
First, Two Numbers: RTO and RPO
| Metric | Meaning | Question |
|---|---|---|
| RTO (Recovery Time Objective) | allowed time until recovery | “Within how many hours must we be back up?” |
| RPO (Recovery Point Objective) | allowed data loss | “How much can we afford to lose since the last backup?” |
RTO 4 hours / RPO 1 hour means: recover within 4 hours, but you may lose the hour of data just before the incident. The smaller these two numbers, the more sharply cost rises. So DR design starts not from “how robust” but from “what should RTO/RPO actually be for this service.” The answers for an internal admin tool and a payment system are completely different.
Backups: Not Losing Data
The foundation of DR is backups. The tools differ by service.
RDS: Automated Backups and PITR
- Automated backups: within the retention period (up to 35 days), RDS keeps a daily snapshot plus transaction logs. Turn it on with backup_retention_period.
- PITR (Point-in-Time Recovery): restores a new instance to an arbitrary point (down to the second) within the retention period. You can roll back to 5 minutes before you accidentally dropped a table.
- Cross-region copy of manual / automated snapshots: keep a copy in another region to prepare for a region failure.
resource "aws_db_instance" "main" {
  identifier              = "myapp-prod"
  # engine, instance_class, storage, and credentials omitted for brevity
  backup_retention_period = 14            # 14 days of PITR
  backup_window           = "17:00-17:30" # UTC (KST 02:00)
  copy_tags_to_snapshot   = true
}

# Replicate automated backups to the DR region (separate region provider)
resource "aws_db_instance_automated_backups_replication" "dr" {
  provider               = aws.dr # e.g., us-west-2
  source_db_instance_arn = aws_db_instance.main.arn
  retention_period       = 14
}

For recovery, you specify a point in time and bring up a new instance. PITR always restores to a new instance rather than overwriting the original, so verify the restored data first and then switch the app’s connection over.
aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier myapp-prod \
  --target-db-instance-identifier myapp-restored \
  --restore-time 2026-05-24T09:25:00Z

For Aurora Serverless v2, a non-prod / idle DB can be set to a minimum capacity of 0 ACU (auto-pause) so that compute cost goes to 0 while it’s paused (only storage is billed). That said, the read copy in the DR region that must always stay current must not be paused, so keep it on at a low minimum ACU.
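As a rough illustration of the auto-pause setting described above, the following sketch shows a non-prod Aurora Serverless v2 cluster with a 0 ACU minimum. The cluster name, engine version, user name, and capacity values are all assumptions, and the `seconds_until_auto_pause` argument is only available in recent versions of the AWS provider, so check it against the provider version you actually use.

```hcl
# Sketch only: names, engine version, and capacity values are assumptions.
resource "aws_rds_cluster" "dev" {
  cluster_identifier          = "myapp-dev"         # hypothetical name
  engine                      = "aurora-postgresql"
  engine_mode                 = "provisioned"       # Serverless v2 uses provisioned mode
  engine_version              = "16.4"              # assumed; must be a version that supports 0 ACU
  master_username             = "appadmin"          # hypothetical
  manage_master_user_password = true                # RDS stores the password in Secrets Manager
  skip_final_snapshot         = true                # acceptable for non-prod only

  serverlessv2_scaling_configuration {
    min_capacity             = 0   # 0 ACU allows the cluster to auto-pause
    max_capacity             = 2
    seconds_until_auto_pause = 600 # pause after 10 idle minutes (assumed value)
  }
}

resource "aws_rds_cluster_instance" "dev" {
  cluster_identifier = aws_rds_cluster.dev.id
  instance_class     = "db.serverless"
  engine             = aws_rds_cluster.dev.engine
  engine_version     = aws_rds_cluster.dev.engine_version
}
```

For the DR-region copy, the same block would keep `min_capacity` above 0 (for example 0.5) so that replication is never interrupted by a pause.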
S3: Versioning and Cross-Region Replication
- Versioning — even if you overwrite or delete an object, the previous version remains. It’s the baseline against accidental deletion and ransomware.
- Replication (CRR: Cross-Region Replication): automatically replicates to a bucket in another region / account to prepare for a region failure. Versioning must be turned on for both the source and the destination bucket.
- Object Lock — locks regulated data as undeletable for a set period.
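The replication configuration in this section refers to a destination bucket, `aws_s3_bucket.dr`, that lives in the DR region. A minimal sketch of that side might look like the following; the region, the bucket name, and the provider block itself are assumptions (skip the provider block if the `aws.dr` alias is already declared elsewhere in your configuration).

```hcl
# Assumed DR-region provider alias; the region is an example.
provider "aws" {
  alias  = "dr"
  region = "us-west-2"
}

# Destination bucket for replication (hypothetical name).
resource "aws_s3_bucket" "dr" {
  provider = aws.dr
  bucket   = "myapp-prod-dr-replica"
}

# Replication needs versioning on the destination bucket as well.
resource "aws_s3_bucket_versioning" "dr" {
  provider = aws.dr
  bucket   = aws_s3_bucket.dr.id

  versioning_configuration {
    status = "Enabled"
  }
}
```

The IAM role used for replication (`aws_iam_role.replication`) is not shown here; it needs read access to the source objects and replicate permissions on the destination bucket.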
resource "aws_s3_bucket_versioning" "main" {
  bucket = aws_s3_bucket.main.id
  versioning_configuration { status = "Enabled" }
}

resource "aws_s3_bucket_replication_configuration" "crr" {
  # Versioning must be enabled on the source bucket before replication is configured
  depends_on = [aws_s3_bucket_versioning.main]

  bucket = aws_s3_bucket.main.id
  role   = aws_iam_role.replication.arn

  rule {
    id     = "to-dr"
    status = "Enabled"

    destination {
      bucket        = aws_s3_bucket.dr.arn # DR-region bucket
      storage_class = "STANDARD_IA"
    }
  }
}

AWS Backup: Manage It All in One Place
When backups are configured separately for each service, it is easy to miss one. AWS Backup lets you set backup policies for RDS, EBS, EFS, DynamoDB, S3, and more in one place, and automates retention, cross-region copy, and backup verification.
resource "aws_backup_plan" "daily" {
  name = "daily-with-dr-copy"

  rule {
    rule_name         = "daily"
    target_vault_name = aws_backup_vault.main.name
    schedule          = "cron(0 17 * * ? *)" # daily at UTC 17:00

    lifecycle { delete_after = 35 }

    copy_action {
      destination_vault_arn = aws_backup_vault.dr.arn # DR-region vault
      lifecycle { delete_after = 35 }
    }
  }
}

# Auto-select backup targets by tag (all resources tagged Backup=true)
resource "aws_backup_selection" "tagged" {
  name         = "tagged-resources"
  plan_id      = aws_backup_plan.daily.id
  iam_role_arn = aws_iam_role.backup.arn

  selection_tag {
    type  = "STRINGEQUALS"
    key   = "Backup"
    value = "true"
  }
}

Combined with AWS Organizations from Chapter 29, you can apply backup policies across the whole organization in bulk, so new accounts also become backup targets automatically.
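The plan and selection in this section reference two vaults and an IAM role that are not shown. One way they might be defined is sketched here; the vault and role names are hypothetical, and the `aws.dr` provider alias is assumed to be the same DR-region alias used in the RDS example.

```hcl
# Primary-region vault (hypothetical name).
resource "aws_backup_vault" "main" {
  name = "myapp-main"
}

# DR-region vault that receives the copies; assumes the aws.dr alias exists.
resource "aws_backup_vault" "dr" {
  provider = aws.dr
  name     = "myapp-dr"
}

# Trust policy that lets the AWS Backup service assume the role.
data "aws_iam_policy_document" "backup_assume" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["backup.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "backup" {
  name               = "myapp-backup" # hypothetical name
  assume_role_policy = data.aws_iam_policy_document.backup_assume.json
}

# AWS-managed policy with the permissions AWS Backup needs to take backups.
resource "aws_iam_role_policy_attachment" "backup" {
  role       = aws_iam_role.backup.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSBackupServiceRolePolicyForBackup"
}
```

With these in place, opting a resource into the plan is a matter of adding the tag `Backup = "true"` to it.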
A backup is only a backup once you’ve restored it. A backup whose restore procedure you’ve never actually run, and whose RTO you’ve never measured, is just a “backup you believe you have.” We recommend a restore drill every quarter.
Cross-region DR Patterns
An entire region collapsing is rare, but when it happens, Multi-AZ can’t stop it. DR approaches split into four patterns according to how much you keep prepared in another region. The further down the table, the smaller the RTO/RPO and the higher the cost.
| Pattern | Steady-state | RTO | Cost | Fit |
|---|---|---|---|---|
| Backup & Restore | only backups in another region | hours ~ days | lowest | internal tools, non-critical |
| Pilot Light | only the minimal core (DB replication) running | tens of minutes ~ hours | low | general services |
| Warm Standby | always running at reduced scale | minutes ~ tens of minutes | medium | revenue-impacting services |
| Multi-Site (Active-Active) | two regions handle traffic simultaneously | near 0 | highest | services where zero downtime is mandatory |
Pilot Light: The Practical Starting Point
For most services, the realistic first DR is Pilot Light.
- Replicate only the DB to the DR region and keep it current (an RDS cross-region read replica, or automated-backup replication if a restore step during failover is acceptable).
- Define the app (ECS / Lambda) and the network only as Terraform code, and normally leave it at desired_count = 0 or not applied at all.
- On an incident, bring the app up in the DR region with Terraform (or raise desired_count), and switch traffic over with Route 53.
- Steady-state cost is low, roughly the DB replication, and the RTO is “the time to bring the app up + DNS propagation.”
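The steps above can be sketched in Terraform roughly as follows: a cross-region read replica that stays running, plus an ECS service that is defined but held at zero tasks until a flag is flipped. Every variable, name, instance class, and task count here is an assumption for illustration, and the `aws.dr` alias is assumed to point at the DR region.

```hcl
variable "dr_active" {
  description = "Set to true during a DR event to start the app in the DR region"
  type        = bool
  default     = false
}

# Hypothetical inputs describing DR-region infrastructure that already exists.
variable "dr_cluster_arn" { type = string }
variable "dr_task_definition_arn" { type = string }
variable "dr_subnet_ids" { type = list(string) }
variable "dr_security_group_ids" { type = list(string) }

# Always-on part of Pilot Light: a cross-region read replica of the primary DB.
# Cross-region replicas take the source ARN; an encrypted source also needs
# kms_key_id set to a key in the DR region, and usually a db_subnet_group_name.
resource "aws_db_instance" "dr_replica" {
  provider            = aws.dr
  identifier          = "myapp-prod-dr"
  replicate_source_db = aws_db_instance.main.arn
  instance_class      = "db.t4g.medium"
  skip_final_snapshot = true
}

# Dormant part: the service exists, but runs zero tasks in steady state.
resource "aws_ecs_service" "app_dr" {
  provider        = aws.dr
  name            = "myapp"
  cluster         = var.dr_cluster_arn
  task_definition = var.dr_task_definition_arn
  launch_type     = "FARGATE"
  desired_count   = var.dr_active ? 2 : 0

  network_configuration {
    subnets         = var.dr_subnet_ids
    security_groups = var.dr_security_group_ids
  }
}
```

During an incident, running `terraform apply -var="dr_active=true"` would raise the task count; promoting the replica to a writable instance is a separate step that this sketch does not cover.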
If the Chapter 32 capstone infrastructure is codified in Terraform, DR-region recovery simplifies to “apply the same module with the DR-region provider.” That’s why IaC is the prerequisite for DR (Chapter 25 Terraform intro).
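Applying the same module with the DR-region provider might look like the sketch below. The module path `./modules/app` and its `desired_count` input are hypothetical, and the `aws.dr` alias is assumed to be declared for the DR region.

```hcl
# Primary region: the module runs with the default provider.
module "app_primary" {
  source = "./modules/app" # hypothetical module path

  providers = {
    aws = aws
  }

  desired_count = 2 # hypothetical module input
}

# DR region: same module, different provider, scaled to zero until needed.
module "app_dr" {
  source = "./modules/app"

  providers = {
    aws = aws.dr
  }

  desired_count = 0
}
```

Because both regions come from one module, a fix applied to the primary is picked up by the DR definition on the next apply, which helps keep the two from drifting apart.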
Switching with Route 53
The regional cutover is handled by Route 53 (Chapter 12). With a health check + failover routing, when the primary region dies, DNS automatically moves to the DR region.
resource "aws_route53_health_check" "primary" {
  fqdn              = "primary.example.com"
  type              = "HTTPS"
  resource_path     = "/health"
  failure_threshold = 3
  request_interval  = 30
}

resource "aws_route53_record" "primary" {
  zone_id         = data.aws_route53_zone.main.zone_id
  name            = "api.example.com"
  type            = "A"
  set_identifier  = "primary"
  health_check_id = aws_route53_health_check.primary.id

  failover_routing_policy { type = "PRIMARY" }

  alias {
    name                   = aws_lb.primary.dns_name
    zone_id                = aws_lb.primary.zone_id
    evaluate_target_health = true
  }
}

resource "aws_route53_record" "secondary" {
  zone_id        = data.aws_route53_zone.main.zone_id
  name           = "api.example.com"
  type           = "A"
  set_identifier = "secondary"

  failover_routing_policy { type = "SECONDARY" }

  alias {
    name                   = aws_lb.dr.dns_name # DR-region ALB
    zone_id                = aws_lb.dr.zone_id
    evaluate_target_health = true
  }
}

In Multi-Site, instead of failover you use weighted or latency routing to always distribute traffic across the two regions.
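For the Multi-Site case, a latency-routed record set might be sketched as follows. The input variables are hypothetical; the map keys are assumed to be AWS region names so that they can double as the `set_identifier` and the latency region.

```hcl
variable "zone_id" {
  description = "Route 53 hosted zone ID (hypothetical input)"
  type        = string
}

variable "regional_albs" {
  description = "ALB DNS name and ALB hosted zone ID, keyed by AWS region name"
  type = map(object({
    dns_name = string
    zone_id  = string
  }))
}

# One latency record per region; Route 53 answers with the lowest-latency
# region whose target is healthy.
resource "aws_route53_record" "api_latency" {
  for_each = var.regional_albs

  zone_id        = var.zone_id
  name           = "api.example.com"
  type           = "A"
  set_identifier = each.key

  latency_routing_policy {
    region = each.key
  }

  alias {
    name                   = each.value.dns_name
    zone_id                = each.value.zone_id
    evaluate_target_health = true
  }
}
```

These records would replace the failover pair rather than sit beside it, since records with the same name and type have to share one routing policy.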
A One-page Decision Flow
- Write down the RTO / RPO per service first. There’s no company-wide single value.
- Lay backups (RDS PITR + S3 versioning + AWS Backup, including cross-region copy) across every service. This is the minimum line.
- If RTO/RPO can be tolerated in hours, Backup & Restore is enough.
- If you need minutes, step up to Pilot Light → Warm Standby.
- If zero downtime ties directly to revenue, consider Multi-Site, but it has the highest cost and operational complexity.
- Whatever the pattern, measure RTO for real with a restore drill.
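One way to keep this decision written down next to the infrastructure is to record the targets in Terraform and derive the pattern from them. The sketch below is only illustrative: the service names, the numbers, and especially the minute thresholds are assumptions, not values from this chapter.

```hcl
locals {
  # Per-service targets in minutes (hypothetical services and numbers).
  recovery_targets = {
    admin_tool = { rto_minutes = 1440, rpo_minutes = 1440 }
    web_api    = { rto_minutes = 60, rpo_minutes = 15 }
    payments   = { rto_minutes = 1, rpo_minutes = 0 }
  }

  # Map each RTO to a pattern. The thresholds are assumed cut-offs that
  # roughly follow "hours", "tens of minutes", "minutes", and "near zero".
  dr_pattern = {
    for name, t in local.recovery_targets :
    name => (
      t.rto_minutes >= 240 ? "backup-and-restore" :
      t.rto_minutes >= 30 ? "pilot-light" :
      t.rto_minutes >= 5 ? "warm-standby" :
      "multi-site"
    )
  }
}

output "dr_pattern" {
  value = local.dr_pattern
}
```

With the values above, `admin_tool` maps to backup-and-restore, `web_api` to pilot-light, and `payments` to multi-site. The mapping only considers RTO; a tight RPO can push a service to a more expensive pattern on its own.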
Exercises
- Write your service’s (or a hypothetical service’s) RTO and RPO as numbers, and write one paragraph on the rationale (the business impact when it’s down). Pick which of the four patterns in §“Cross-region DR Patterns” fits those numbers.
- An operator accidentally DROPped a prod table. Write which of RDS PITR and S3 versioning brings this incident back, and the procedure for restoring to a new instance without overwriting the original and then standing the service back up, basing it on the CLI in §“Backups.”
- Explain why Pilot Light is cheaper than Warm Standby from the angle of “what is running in steady-state” (DB only vs. a reduced app too), and write one paragraph on why the IaC of Chapter 25 Terraform intro becomes the prerequisite for Pilot Light.
In short: DR starts with setting two numbers per service, RTO (allowed downtime) and RPO (allowed data loss), and the smaller these are, the more sharply cost rises. The foundation is backups: RDS uses backup_retention_period + PITR + cross-region automated-backup replication, S3 uses versioning + CRR, and AWS Backup bundles RDS, EBS, S3, and more into one plan via tag selection and automates the DR-region copy too. Measure RTO for real with a restore drill. Cross-region DR goes Backup & Restore → Pilot Light → Warm Standby → Multi-Site, with smaller RTO and higher cost as you go, and for most services the realistic starting point is Pilot Light, replicating only the DB and bringing the app up with Terraform. The switch is Route 53 health check + failover routing, and IaC is the prerequisite for DR.
Next Chapter
The next chapter, Chapter 31 Lambda in Depth, adds the production operations angle on top of Chapter 17 Lambda Basics. It covers cold starts and Provisioned Concurrency, Layers and container-image packaging, Lambda Powertools-based observability, combining with Step Functions, and the cost trade-offs.