AWS Certified CloudOps Engineer - Associate (SOA-C03) #6 Domain 2-2 Reliability — Backup, Restore, and Disaster Recovery (DR)

Sunday, May 31, 2026

If #5 locked in the availability that keeps “the service running,” this post covers the backup and disaster recovery that let you “restore data without losing it.” In the reliability domain, availability and data protection go together. The most damaging operational incident is not an instance going down but data disappearing, so the exam gives significant weight to backup strategies and recovery objectives.

The basic units of backup #

Resource	Backup means	Characteristics
EBS volume	Snapshot	Incremental storage, kept in S3, within the region
EC2 instance	AMI	Volume snapshot + metadata. Recreates the instance
RDS	Automated backup / manual snapshot	Supports point-in-time recovery (PITR)
EFS / DynamoDB / others	Per-service backup or AWS Backup	Can be centrally managed

Properties of EBS snapshots #

EBS snapshots are incremental. Only the first snapshot is full; subsequent ones store only the changed blocks. So even taking them frequently does not increase cost linearly. Snapshots are stored in S3 (not directly visible to the user), and you can copy them to another region or account to set up geographic distribution or isolation.
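
To make the incremental behavior concrete, here is a minimal Python sketch. It is a toy model, not how EBS is implemented: the volume is assumed to be a small dict of block index to content, and `snapshot_storage` is a hypothetical helper that counts how many blocks each snapshot would newly store.

```python
def snapshot_storage(block_states):
    """Return how many blocks each successive snapshot newly stores.

    block_states: list of dicts (block index -> content), one per snapshot.
    Toy model only: real EBS block tracking is internal to AWS.
    """
    stored = []
    previous = {}
    for state in block_states:
        # Only blocks whose content differs from the previous snapshot are stored.
        changed = {i: data for i, data in state.items() if previous.get(i) != data}
        stored.append(len(changed))
        previous = state
    return stored


volume_v1 = {0: "a", 1: "b", 2: "c", 3: "d"}
volume_v2 = {0: "a", 1: "B", 2: "c", 3: "d"}  # one block changed
volume_v3 = {0: "a", 1: "B", 2: "C", 3: "D"}  # two more blocks changed

per_snapshot = snapshot_storage([volume_v1, volume_v2, volume_v3])
print(per_snapshot)       # [4, 1, 2]
print(sum(per_snapshot))  # 7, versus 12 for three full copies
```

The first snapshot stores all four blocks, and the later ones store only what changed, which is why frequent snapshots of a mostly static volume stay cheap.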

RDS backup: automated backup vs. snapshot #

Aspect	Automated backup	Manual snapshot
Retention	Up to 35 days, deleted after the retention period	Permanent until you delete it yourself
Point-in-time recovery (PITR)	Possible	Not possible (only the moment it was taken)
When the instance is deleted	Deleted along with it (by default)	Remains

A recurring operations theme is “I want to keep the backup even after deleting the instance.” By default, automated backups are deleted with the instance. You can choose to retain them at deletion time, but they still expire when the retention period ends. So if you need permanent retention, you must take a separate manual snapshot.
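
The comparison table can be read as a small decision function. The sketch below is an illustrative model only: `restore_options` and its parameters are hypothetical names, it calls no AWS API, and it assumes the default behavior in which automated backups go away with the instance.

```python
from datetime import datetime, timedelta


def restore_options(target, now, retention_days, manual_snapshots,
                    instance_deleted=False):
    """List which backup types could restore the database to `target`.

    Illustrative model of the table above, not an AWS API call.
    """
    options = []
    window_start = now - timedelta(days=retention_days)
    # Automated backups allow any point in time inside the retention window,
    # but (by default) they are gone once the instance is deleted.
    if not instance_deleted and window_start <= target <= now:
        options.append("automated backup (PITR)")
    # A manual snapshot restores only the exact moment it was taken.
    if target in manual_snapshots:
        options.append("manual snapshot")
    return options


now = datetime(2026, 5, 31, 12, 0)
old_snapshot = datetime(2026, 3, 1, 0, 0)  # taken about 91 days ago
recent = datetime(2026, 5, 20, 8, 30)

print(restore_options(recent, now, 35, [old_snapshot]))
# ['automated backup (PITR)']
print(restore_options(old_snapshot, now, 35, [old_snapshot]))
# ['manual snapshot']
print(restore_options(recent, now, 35, [old_snapshot], instance_deleted=True))
# []
```

The last call is the exam scenario: once the instance is deleted, only the manual snapshot is left, and it can restore only to the moment it was taken.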

AWS Backup: centralized backup management #

Configuring backups separately for each service leads to gaps and inconsistency. AWS Backup centrally manages the backups of multiple services under a single policy.

Component	Role
Backup Plan	A policy defining backup frequency, retention, and time window
Backup Vault	Where backups are stored. Carries the access policy and encryption settings
Resource Assignment	Designates backup targets by tag or ID

The key benefits are as follows.

Tag-based bulk application — back up all resources carrying a specific tag under one policy
Cross-region / cross-account copy — include replication of backups to another region in the policy
Backup Vault Lock — lock backups so they cannot be modified or deleted (WORM), guarding against ransomware and accidental deletion. Addresses compliance requirements
Centralized reporting — see at a glance which resources fall outside policy

The answer to “enforce standard backup policy across dozens of accounts and multiple services and prove compliance” is AWS Backup (+ Organizations integration + Vault Lock).
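
As a rough sketch of how a Backup Plan and a tag-based Resource Assignment fit together, the Python below builds the request payloads that boto3's `backup` client accepts for `create_backup_plan` and `create_backup_selection`. The helper names, the plan name, the vault names, the `backup=daily` tag, the account ID, and the ARNs are all placeholder assumptions. The snippet only builds dictionaries and does not call AWS.

```python
def build_backup_plan(name, vault, schedule, retention_days, copy_vault_arn=None):
    """Build the BackupPlan payload: one rule with frequency, retention, optional copy."""
    rule = {
        "RuleName": f"{name}-rule",
        "TargetBackupVaultName": vault,
        "ScheduleExpression": schedule,
        "Lifecycle": {"DeleteAfterDays": retention_days},
    }
    if copy_vault_arn:
        # Cross-region or cross-account copy is declared inside the rule itself.
        rule["CopyActions"] = [{
            "DestinationBackupVaultArn": copy_vault_arn,
            "Lifecycle": {"DeleteAfterDays": retention_days},
        }]
    return {"BackupPlanName": name, "Rules": [rule]}


def build_tag_selection(name, iam_role_arn, tag_key, tag_value):
    """Build the BackupSelection payload that targets resources by tag."""
    return {
        "SelectionName": name,
        "IamRoleArn": iam_role_arn,
        "ListOfTags": [{
            "ConditionType": "STRINGEQUALS",
            "ConditionKey": tag_key,
            "ConditionValue": tag_value,
        }],
    }


# All names, the account ID, and the ARNs below are placeholders.
plan = build_backup_plan(
    name="daily-35d",
    vault="primary-vault",
    schedule="cron(0 5 ? * * *)",  # every day at 05:00 UTC
    retention_days=35,
    copy_vault_arn="arn:aws:backup:us-west-2:123456789012:backup-vault:dr-vault",
)
selection = build_tag_selection(
    name="tagged-backup-daily",
    iam_role_arn="arn:aws:iam::123456789012:role/service-role/AWSBackupDefaultServiceRole",
    tag_key="backup",
    tag_value="daily",
)

# With credentials configured, the payloads could be submitted like this:
#   import boto3
#   backup = boto3.client("backup")
#   plan_id = backup.create_backup_plan(BackupPlan=plan)["BackupPlanId"]
#   backup.create_backup_selection(BackupPlanId=plan_id, BackupSelection=selection)
```

Any resource tagged `backup=daily` would then be picked up by the plan without per-resource configuration, which is the point of tag-based assignment.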

RPO and RTO: recovery objectives #

Two metrics are the criteria for choosing a DR strategy.

Metric	Definition	In one line
RPO (Recovery Point Objective)	The acceptable amount of data loss, measured in time	“How far into the past can you afford to go back?”
RTO (Recovery Time Objective)	The acceptable time to recover	“How quickly must you restore?”

The shorter the RPO, the more frequently you must take backups (or replicate in real time); the shorter the RTO, the more you must keep a standby environment running ahead of time. Making both short increases cost, so the exam framing is to choose the lowest-cost strategy that meets the requirement.
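
A short worked example may help separate the two metrics. The Python sketch below uses made-up timestamps and a hypothetical `measure_incident` helper: data loss is measured backward from the failure to the last backup (compare with RPO), and downtime is measured forward from the failure to restoration (compare with RTO).

```python
from datetime import datetime, timedelta


def measure_incident(backup_times, failure_time, restored_time):
    """Return (data_loss, downtime) for one incident as timedeltas."""
    # Everything written after the last completed backup is lost.
    last_backup = max(t for t in backup_times if t <= failure_time)
    data_loss = failure_time - last_backup
    # Downtime runs from the failure until service is back.
    downtime = restored_time - failure_time
    return data_loss, downtime


# Illustrative timeline: backups every 6 hours, failure at 11:30.
backups = [datetime(2026, 5, 31, hour) for hour in (0, 6, 12)]
failure = datetime(2026, 5, 31, 11, 30)
restored = datetime(2026, 5, 31, 13, 0)

data_loss, downtime = measure_incident(backups, failure, restored)
rpo, rto = timedelta(hours=6), timedelta(hours=1)

print(data_loss, data_loss <= rpo)  # 5:30:00 True
print(downtime, downtime <= rto)    # 1:30:00 False
```

Here the 6-hour backup interval satisfies a 6-hour RPO, but the 90-minute restore misses a 1-hour RTO. Fixing that means changing the recovery strategy, not the backup frequency.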

Four DR strategies #

Based on the trade-off between RPO/RTO and cost, DR splits into four strategies. This is the same framework covered in the Solutions Architect Associate (SAA) exam, revisited through an operations lens.

Strategy	Steady-state cost	RTO	Configuration
Backup & Restore	Lowest	Longest (hours)	Only backups in another region. Restore from scratch on failure
Pilot Light	Low	Moderate (tens of minutes)	Keep only the core (DB replication) on, the rest off. Start up on failure
Warm Standby	Medium	Short (minutes)	Keep a scaled-down full environment always on. Scale up on failure
Multi-Site (Hot)	Highest	Shortest (near zero)	Run both sides at full capacity simultaneously

The selection criterion is simple.

If minimum cost is the priority and slow recovery is acceptable → Backup & Restore
If fast recovery is the priority and you can accept the cost → Warm Standby or Multi-Site
The compromise between them is Pilot Light

When two conditions come together, like “RTO of a few minutes, cost minimized,” the correct answer is the cheapest strategy that still meets the RTO (usually Pilot Light or Warm Standby).
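
The selection rule can be sketched as "filter by RTO, then take the cheapest." In the Python below, the `typical_rto_minutes` values are illustrative assumptions chosen only to preserve the ordering in the table. They are not AWS-published figures, and real RTOs depend on the workload.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    cost_rank: int               # 1 = cheapest steady-state cost
    typical_rto_minutes: float   # assumed, illustrative value


# Assumed RTO figures, for illustration only.
STRATEGIES = [
    Strategy("Backup & Restore", 1, 240),
    Strategy("Pilot Light", 2, 60),
    Strategy("Warm Standby", 3, 5),
    Strategy("Multi-Site", 4, 1),
]


def cheapest_strategy(required_rto_minutes):
    """Return the lowest-cost strategy whose RTO meets the requirement, or None."""
    candidates = [s for s in STRATEGIES
                  if s.typical_rto_minutes <= required_rto_minutes]
    if not candidates:
        return None
    return min(candidates, key=lambda s: s.cost_rank)


print(cheapest_strategy(480).name)  # Backup & Restore
print(cheapest_strategy(90).name)   # Pilot Light
print(cheapest_strategy(5).name)    # Warm Standby
print(cheapest_strategy(1).name)    # Multi-Site
```

The requirement acts as a filter first and cost breaks the tie, which mirrors how the exam questions are phrased.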

Exam Question Patterns #

Keep the backup even after deleting the instance → manual snapshot (automated backups are deleted along with it)
Enforce standard policy on backups across multiple services → AWS Backup + tag-based policy
Make backups impossible to delete or alter → Backup Vault Lock (WORM)
Keep backups against a region failure → copy snapshots/backups to another region
Minimum cost, slow recovery acceptable → Backup & Restore
RTO of a few minutes required → Warm Standby
RPO/RTO near zero → Multi-Site

Common Pitfalls #

1) Thinking automated backups are retained permanently #

Automated backups disappear once the retention period (up to 35 days) passes or, by default, when you delete the instance. For permanent retention, take a manual snapshot.

2) Misunderstanding snapshots as full copies #

EBS snapshots are incremental. Taking them frequently adds only the blocks that changed since the previous snapshot.

3) Choosing a DR strategy by cost alone #

The cheapest, Backup & Restore, is not always the answer. You must read the RTO/RPO requirement first and pick the lowest-cost strategy that meets it.

4) Assuming multi-region as the default #

Most availability requirements are satisfied by Multi-AZ. Multi-region DR is for when there is an explicit requirement, such as surviving a region failure or meeting a regulatory mandate.

Summary #

What we covered in this post:

The backup units are EBS snapshot (incremental), AMI, and RDS backup. Snapshots can be copied to another region or account
RDS automated backups are deleted along with the instance by default; permanent retention requires a manual snapshot
Use AWS Backup to centrally manage multiple services under a tag-based policy. Vault Lock provides WORM protection
RPO (acceptable data loss) and RTO (acceptable recovery time) are the criteria for choosing a DR strategy
The four DR strategies: Backup & Restore, Pilot Light, Warm Standby, Multi-Site. A trade-off between cost and RTO
The key is to read the requirement first and pick the lowest-cost strategy that meets it

Next: Domain 3-1 CloudFormation and IaC #

With data protection done, we’ve finished the reliability domain. Next is the third domain, deployment, provisioning, and automation.

In #7 Domain 3-1 Deployment: CloudFormation in Depth and IaC, I’ll cover CloudFormation’s stack and template structure, change sets and drift detection, StackSets for deploying across multiple accounts and regions, and the relationship with other IaC tools like CDK and Terraform.