AWS Certified CloudOps Engineer - Associate (SOA-C03) #6 Domain 2-2 Reliability — Backup, Restore, and Disaster Recovery (DR)
Where #5 covered the availability that keeps the service running, this post covers the backup and disaster recovery that let you restore data without losing it. In the reliability domain, availability and data protection go together. The most serious operational incident is not an instance going down but data disappearing, so the exam gives significant weight to backup strategies and recovery objectives.
The basic units of backup #
| Resource | Backup means | Characteristics |
|---|---|---|
| EBS volume | Snapshot | Incremental, stored in AWS-managed S3, scoped to the region where it was taken |
| EC2 instance | AMI | Volume snapshot + metadata. Recreates the instance |
| RDS | Automated backup / manual snapshot | Supports point-in-time recovery (PITR) |
| EFS / DynamoDB / others | Per-service backup or AWS Backup | Can be centrally managed |
Properties of EBS snapshots #
EBS snapshots are incremental. Only the first snapshot is full; each subsequent one stores only the blocks that changed since the previous snapshot, so taking snapshots frequently does not multiply storage cost by the number of snapshots. Snapshots are stored in S3 buckets managed by AWS (you cannot browse them directly), and you can copy them to another region or account for geographic distribution or isolation.
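The incremental behavior can be illustrated with a small model. This is only a sketch, not how EBS is implemented: the `snapshot_sizes` function and the dictionary-of-blocks representation are hypothetical, and it ignores details such as deleted blocks and snapshot deletion.

```python
def snapshot_sizes(volume_states):
    """Return how many blocks each successive snapshot would store.

    Each state maps a block index to its content. The first snapshot stores
    every block; later ones store only blocks that differ from the state
    captured by the previous snapshot.
    """
    sizes = []
    previous = {}
    for state in volume_states:
        changed = {i for i, data in state.items() if previous.get(i) != data}
        sizes.append(len(changed))
        previous = dict(state)
    return sizes


# Three daily states of a toy volume (hypothetical data).
day1 = {0: "a", 1: "b", 2: "c", 3: "d"}
day2 = {**day1, 1: "B"}            # one block rewritten
day3 = {**day2, 3: "D", 4: "e"}    # one block rewritten, one block added

print(snapshot_sizes([day1, day2, day3]))  # [4, 1, 2]
```

Three snapshots of a volume that grows to five blocks store seven blocks in total here, rather than the thirteen that three full copies would need.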
RDS backup: automated backup vs. snapshot #
| Aspect | Automated backup | Manual snapshot |
|---|---|---|
| Retention | Up to 35 days, deleted after the retention period | Permanent until you delete it yourself |
| Point-in-time recovery (PITR) | Possible | Not possible (only the moment it was taken) |
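| When the instance is deleted | Deleted along with it by default (if you choose to retain them, they still expire at the end of the retention period) | Remains |
A recurring operations theme is “I want to keep the backup even after deleting the instance.” Automated backups are removed with the instance by default, and even retained ones expire with the retention period, so if you need permanent retention you must take a separate manual snapshot.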
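As a rough sketch of that sequence, the function below takes a manual snapshot, waits for it to become available, and only then deletes the instance. It assumes `rds` behaves like `boto3.client("rds")`; the function name and the identifiers in the usage comment are hypothetical.

```python
def delete_instance_keeping_backup(rds, instance_id, snapshot_id):
    """Take a manual snapshot, wait for it, then delete the DB instance.

    `rds` is assumed to be an RDS client such as boto3.client("rds").
    """
    # Manual snapshots are kept until you delete them yourself.
    rds.create_db_snapshot(
        DBInstanceIdentifier=instance_id,
        DBSnapshotIdentifier=snapshot_id,
    )
    # Block until the snapshot is usable before removing the source.
    rds.get_waiter("db_snapshot_available").wait(
        DBSnapshotIdentifier=snapshot_id
    )
    rds.delete_db_instance(
        DBInstanceIdentifier=instance_id,
        SkipFinalSnapshot=True,        # the manual snapshot above is the kept copy
        DeleteAutomatedBackups=True,   # the default, written out for clarity
    )


# Usage (hypothetical identifiers, requires boto3 and credentials):
#   import boto3
#   delete_instance_keeping_backup(boto3.client("rds"), "orders-db", "orders-db-archive")
```

Passing the client in as a parameter keeps the function easy to exercise with a stub before pointing it at a real account.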
AWS Backup: centralized backup management #
Configuring backups separately for each service leads to gaps and inconsistency. AWS Backup centrally manages the backups of multiple services under a single policy.
| Component | Role |
|---|---|
| Backup Plan | A policy defining backup frequency, retention, and time window |
| Backup Vault | Where backups are stored. Carries the access policy and encryption key |
| Resource Assignment | Designates backup targets by tag or ID |
The key benefits are as follows.
- Tag-based bulk application: back up all resources carrying a specific tag under one policy
- Cross-region / cross-account copy: include replication of backups to another region or account in the policy
- Backup Vault Lock: lock backups so they cannot be modified or deleted (WORM), which guards against ransomware and accidental deletion and helps meet compliance requirements
- Centralized reporting: see at a glance which resources fall outside policy
The answer to “enforce a standard backup policy across dozens of accounts and multiple services and prove compliance” is AWS Backup (+ Organizations integration + Vault Lock).
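To make the Backup Plan and Resource Assignment components concrete, here is a minimal sketch that builds the request bodies as plain dictionaries. The field names follow the AWS Backup `create_backup_plan` and `create_backup_selection` request shapes; the helper names, vault names, ARNs, account ID, schedule, and tag are all illustrative assumptions.

```python
def build_backup_plan(plan_name, vault_name, copy_vault_arn, retention_days=35):
    """Build a BackupPlan document with one daily rule and a cross-region copy."""
    return {
        "BackupPlanName": plan_name,
        "Rules": [
            {
                "RuleName": "daily",
                "TargetBackupVaultName": vault_name,
                "ScheduleExpression": "cron(0 5 ? * * *)",  # 05:00 UTC every day
                "StartWindowMinutes": 60,
                "CompletionWindowMinutes": 180,
                "Lifecycle": {"DeleteAfterDays": retention_days},
                "CopyActions": [
                    {
                        "DestinationBackupVaultArn": copy_vault_arn,
                        "Lifecycle": {"DeleteAfterDays": retention_days},
                    }
                ],
            }
        ],
    }


def build_tag_selection(selection_name, iam_role_arn, tag_key, tag_value):
    """Build a BackupSelection that targets every resource carrying one tag."""
    return {
        "SelectionName": selection_name,
        "IamRoleArn": iam_role_arn,
        "ListOfTags": [
            {
                "ConditionType": "STRINGEQUALS",
                "ConditionKey": tag_key,
                "ConditionValue": tag_value,
            }
        ],
    }


# Hypothetical names and ARNs, for illustration only.
plan = build_backup_plan(
    "standard-daily",
    "primary-vault",
    "arn:aws:backup:us-west-2:111122223333:backup-vault:dr-vault",
)
selection = build_tag_selection(
    "tagged-resources",
    "arn:aws:iam::111122223333:role/service-role/AWSBackupDefaultServiceRole",
    "backup",
    "daily",
)
print(plan["Rules"][0]["ScheduleExpression"])
print(selection["ListOfTags"][0])
```

With a real client these would be passed as `BackupPlan=plan` to `create_backup_plan` and, together with the returned plan ID, as `BackupSelection=selection` to `create_backup_selection`.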
RPO and RTO: recovery objectives #
Two metrics are the criteria for choosing a DR strategy.
| Metric | Definition | In one line |
|---|---|---|
| RPO (Recovery Point Objective) | The maximum acceptable data loss, measured as a span of time | “How far into the past can you afford to go back?” |
| RTO (Recovery Time Objective) | The acceptable time to recover | “How quickly must you restore?” |
The shorter the RPO, the more frequently you must take backups (or replicate in real time); the shorter the RTO, the more you must keep a standby environment running ahead of time. Making both short increases cost, so the exam framing is to choose the lowest-cost strategy that meets the requirement.
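The link between backup frequency and RPO is simple arithmetic, sketched below under the simplifying assumption that a failure can strike just before the next backup runs, so the worst-case loss equals the backup interval. The function names are hypothetical, and the model ignores how long a backup takes to complete.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval):
    """Worst case: the failure happens just before the next backup would run."""
    return backup_interval


def meets_rpo(backup_interval, rpo):
    """True if periodic backups at this interval can satisfy the RPO."""
    return worst_case_data_loss(backup_interval) <= rpo


rpo = timedelta(hours=4)
print(meets_rpo(timedelta(hours=24), rpo))  # False: a daily backup can lose up to 24 hours
print(meets_rpo(timedelta(hours=1), rpo))   # True: an hourly backup loses at most 1 hour
```

An RPO shorter than any practical backup interval is the signal to move from periodic backups to continuous replication or point-in-time recovery.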
Four DR strategies #
Based on the trade-off between RPO/RTO and cost, DR splits into four strategies. This is the same framework covered in the Solutions Architect Associate (SAA) exam, revisited through an operations lens.
| Strategy | Steady-state cost | RTO | Configuration |
|---|---|---|---|
| Backup & Restore | Lowest | Longest (hours) | Only backups in another region. Restore from scratch on failure |
| Pilot Light | Low | Medium (tens of minutes) | Keep only the core (such as DB replication) running, with the rest switched off. Start the rest on failure |
| Warm Standby | Medium | Short (minutes) | Keep a scaled-down full environment always on. Scale up on failure |
| Multi-Site (Hot) | Highest | Shortest (near zero) | Run both sides at full capacity simultaneously |
The selection criterion is simple.
- If minimum cost is the priority and slow recovery is acceptable → Backup & Restore
- If fast recovery is the priority and you can accept the cost → Warm Standby or Multi-Site
- The compromise between them is Pilot Light
When two conditions appear together, such as “RTO of a few minutes, cost minimized,” the correct answer is the cheapest strategy that still meets the RTO (usually Pilot Light or Warm Standby, depending on how short the RTO is).
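That selection rule can be written down as a small sketch: filter to the strategies whose RTO meets the requirement, then take the cheapest. The RTO figures in the table below are illustrative assumptions chosen to match the rough ordering above, not AWS-published numbers, and `cheapest_strategy` is a hypothetical helper.

```python
from datetime import timedelta

# (name, relative cost rank, ASSUMED typical RTO). Illustrative values only.
STRATEGIES = [
    ("Backup & Restore", 1, timedelta(hours=24)),
    ("Pilot Light", 2, timedelta(minutes=60)),
    ("Warm Standby", 3, timedelta(minutes=10)),
    ("Multi-Site", 4, timedelta(0)),
]


def cheapest_strategy(required_rto):
    """Return the lowest-cost strategy whose assumed RTO meets the requirement."""
    candidates = [s for s in STRATEGIES if s[2] <= required_rto]
    return min(candidates, key=lambda s: s[1])[0]


print(cheapest_strategy(timedelta(hours=48)))    # Backup & Restore
print(cheapest_strategy(timedelta(hours=2)))     # Pilot Light
print(cheapest_strategy(timedelta(minutes=15)))  # Warm Standby
print(cheapest_strategy(timedelta(minutes=1)))   # Multi-Site
```

The order of operations is the point: the requirement filters first and cost decides second, which is why the cheapest strategy overall is not automatically the answer.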
Exam Question Patterns #
- Keep the backup even after deleting the instance → manual snapshot (automated backups are deleted along with it)
- Enforce standard policy on backups across multiple services → AWS Backup + tag-based policy
- Make backups impossible to delete or alter → Backup Vault Lock (WORM)
- Keep backups against a region failure → copy snapshots/backups to another region
- Minimum cost, slow recovery acceptable → Backup & Restore
- RTO of a few minutes required → Warm Standby
- RPO/RTO near zero → Multi-Site
Common Pitfalls #
1) Thinking automated backups are retained permanently #
Automated backups disappear once the retention period (up to 35 days) passes, and by default they are also removed when you delete the instance. For permanent retention, take a manual snapshot.
2) Misunderstanding snapshots as full copies #
EBS snapshots are incremental. Taking them frequently adds only the blocks that changed since the previous snapshot.
3) Choosing a DR strategy by cost alone #
The cheapest, Backup & Restore, is not always the answer. You must read the RTO/RPO requirement first and pick the lowest-cost strategy that meets it.
4) Assuming multi-region as the default #
Most availability requirements are satisfied by Multi-AZ. Multi-region DR is for cases with an explicit requirement, such as surviving a region failure or meeting a regulatory mandate.
Summary #
What we covered in this post:
- The backup units are EBS snapshot (incremental), AMI, and RDS backup. Snapshots can be copied to another region or account
- RDS automated backups are deleted along with the instance by default; for permanent retention, take a manual snapshot
- Use AWS Backup to centrally manage multiple services under a tag-based policy. Vault Lock for WORM
- RPO (acceptable data loss) and RTO (acceptable recovery time) are the criteria for choosing a DR strategy
- The four DR strategies: Backup & Restore, Pilot Light, Warm Standby, Multi-Site. A trade-off between cost and RTO
- The key is to read the requirement first and pick the lowest-cost strategy that meets it
Next: Domain 3-1 CloudFormation and IaC #
With data protection done, we’ve finished the reliability domain. Next is the third domain, deployment, provisioning, and automation.
In #7 Domain 3-1 Deployment: CloudFormation in Depth and IaC, I’ll cover CloudFormation’s stack and template structure, change sets and drift detection, StackSets for deploying across multiple accounts and regions, and the relationship with other IaC tools like CDK and Terraform.