AWS Certified CloudOps Engineer - Associate (SOA-C03) #6 Domain 2-2 Reliability — Backup, Restore, and Disaster Recovery (DR)

Post #5 covered the availability that keeps the service running. This post covers backup and disaster recovery, which make sure you can restore data without losing it. In the reliability domain, availability and data protection go together. The most serious incident in operations is not an instance going down but data disappearing, so the exam gives significant weight to backup strategies and recovery objectives.

The basic units of backup #

| Resource | Backup means | Characteristics |
|---|---|---|
| EBS volume | Snapshot | Incremental storage, kept in S3, within the region |
| EC2 instance | AMI | Volume snapshot + metadata. Recreates the instance |
| RDS | Automated backup / manual snapshot | Supports point-in-time recovery (PITR) |
| EFS / DynamoDB / others | Per-service backup or AWS Backup | Can be centrally managed |

Properties of EBS snapshots #

EBS snapshots are incremental. Only the first snapshot is a full copy; each later one stores only the blocks that changed since the previous snapshot. Taking snapshots frequently therefore does not multiply storage cost by the number of snapshots. Snapshots are stored in S3 (not directly visible to the user), and you can copy them to another region or account for geographic distribution or isolation.
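To make the incremental behavior concrete, here is a minimal sketch in plain Python. It is a toy model, not the EBS API: the volume is a hypothetical dictionary of 1,000 blocks, and `snapshot_delta` is an illustrative helper that keeps only the blocks that changed since the previous snapshot.

```python
def snapshot_delta(volume, last_snapshot_state):
    """Return only the blocks that differ from the state at the previous snapshot."""
    return {
        block_id: data
        for block_id, data in volume.items()
        if last_snapshot_state.get(block_id) != data
    }


# Hypothetical volume with 1,000 blocks in use.
volume = {block_id: "v0" for block_id in range(1000)}
last_state = {}  # no snapshot taken yet
stored_per_snapshot = []

# Each round: rewrite some blocks, then take a snapshot.
changes = [range(0), range(0, 40), range(100, 125)]
for round_number, changed_blocks in enumerate(changes):
    for block_id in changed_blocks:
        volume[block_id] = f"v{round_number}"
    delta = snapshot_delta(volume, last_state)
    stored_per_snapshot.append(len(delta))
    last_state = dict(volume)

print(stored_per_snapshot)         # [1000, 40, 25]
print(sum(stored_per_snapshot))    # 1065 blocks stored incrementally
print(len(volume) * len(changes))  # 3000 blocks if every snapshot were a full copy
```

In this toy run, three snapshots store 1,065 blocks in total instead of 3,000, which is why frequent snapshots cost far less than the snapshot count suggests.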

RDS backup: automated backup vs. snapshot #

| Aspect | Automated backup | Manual snapshot |
|---|---|---|
| Retention | Up to 35 days; deleted after the retention period | Kept until you delete it yourself |
| Point-in-time recovery (PITR) | Possible | Not possible (only the moment it was taken) |
| When the instance is deleted | Deleted along with it by default | Remains |

A recurring operations theme is “I want to keep the backup even after deleting the instance.” By default, automated backups disappear with the instance, so if you need permanent retention you must take a separate manual snapshot (for example, a final snapshot at deletion time).
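The difference between the two backup types can be sketched as a small decision function. This is a simplified model in plain Python, not the RDS API: `pitr_window` and `restore_option` are hypothetical helpers, the five-minute lag before the latest restorable time is an assumption, and the model follows the default behavior where automated backups are removed with the instance.

```python
from datetime import datetime, timedelta, timezone


def pitr_window(now, retention_days, lag=timedelta(minutes=5)):
    """Return the (earliest, latest) times reachable by point-in-time recovery.

    `lag` is an assumed gap between now and the latest restorable time.
    """
    if not 1 <= retention_days <= 35:
        raise ValueError("retention must be 1 to 35 days when automated backups are on")
    return now - timedelta(days=retention_days), now - lag


def restore_option(target, now, retention_days, manual_snapshots, instance_deleted=False):
    """Return 'pitr', 'manual-snapshot', or None for a desired restore time."""
    if not instance_deleted:
        earliest, latest = pitr_window(now, retention_days)
        if earliest <= target <= latest:
            return "pitr"
    # A manual snapshot restores only the exact moment it was taken.
    if target in manual_snapshots:
        return "manual-snapshot"
    return None


now = datetime(2025, 1, 31, 12, 0, tzinfo=timezone.utc)
manual = {datetime(2024, 12, 1, 0, 0, tzinfo=timezone.utc)}
yesterday = datetime(2025, 1, 30, 9, 30, tzinfo=timezone.utc)

print(restore_option(yesterday, now, 7, manual))                         # pitr
print(restore_option(min(manual), now, 7, manual))                       # manual-snapshot
print(restore_option(yesterday, now, 7, manual, instance_deleted=True))  # None
```

The last call is the exam scenario: once the instance is gone, only the manual snapshot is left, and it covers a single point in time.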

AWS Backup: centralized backup management #

Configuring backups separately for each service leads to gaps and inconsistency. AWS Backup centrally manages the backups of multiple services under a single policy.

| Component | Role |
|---|---|
| Backup Plan | A policy defining backup frequency, retention, and time window |
| Backup Vault | Where backups are stored. Access policy and encryption |
| Resource Assignment | Designates backup targets by tag or ID |

The key benefits are as follows.

  • Tag-based bulk application — back up all resources carrying a specific tag under one policy
  • Cross-region / cross-account copy — include replication of backups to another region in the policy
  • Backup Vault Lock — lock backups so they cannot be modified or deleted (WORM), guarding against ransomware and accidental deletion. Addresses compliance requirements
  • Centralized reporting — see at a glance which resources fall outside policy

The answer to “enforce standard backup policy across dozens of accounts and multiple services and prove compliance” is AWS Backup (+ Organizations integration + Vault Lock).
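As a rough illustration of how the three components fit together, the sketch below builds a backup plan and a tag-based resource assignment as plain Python dictionaries. The field names follow the request shapes of the AWS Backup `CreateBackupPlan` and `CreateBackupSelection` operations as I understand them, so check them against the API reference before use. The plan name, vault names, role ARN, account ID, and tag are all made-up placeholders, and nothing here calls AWS.

```python
def build_backup_plan(name, vault, schedule, retain_days, copy_to_vault_arn=None):
    """Build a one-rule backup plan document (frequency, window, retention)."""
    rule = {
        "RuleName": f"{name}-rule",
        "TargetBackupVaultName": vault,
        "ScheduleExpression": schedule,
        "StartWindowMinutes": 60,
        "CompletionWindowMinutes": 180,
        "Lifecycle": {"DeleteAfterDays": retain_days},
    }
    if copy_to_vault_arn:
        # Optional copy to a vault in another region or account.
        rule["CopyActions"] = [
            {
                "DestinationBackupVaultArn": copy_to_vault_arn,
                "Lifecycle": {"DeleteAfterDays": retain_days},
            }
        ]
    return {"BackupPlanName": name, "Rules": [rule]}


def build_tag_selection(name, role_arn, tag_key, tag_value):
    """Build a resource assignment that targets every resource carrying one tag."""
    return {
        "SelectionName": name,
        "IamRoleArn": role_arn,
        "ListOfTags": [
            {
                "ConditionType": "STRINGEQUALS",
                "ConditionKey": tag_key,
                "ConditionValue": tag_value,
            }
        ],
    }


# Placeholder values for illustration only.
plan = build_backup_plan(
    name="daily-35d",
    vault="primary-vault",
    schedule="cron(0 5 * * ? *)",  # every day at 05:00 UTC
    retain_days=35,
    copy_to_vault_arn="arn:aws:backup:us-west-2:111122223333:backup-vault:dr-vault",
)
selection = build_tag_selection(
    name="tagged-resources",
    role_arn="arn:aws:iam::111122223333:role/ExampleBackupRole",
    tag_key="backup",
    tag_value="daily",
)

print(plan["Rules"][0]["ScheduleExpression"])
print(selection["ListOfTags"][0])
```

With boto3, documents like these would be passed as `BackupPlan=plan` to `create_backup_plan` and as `BackupSelection=selection` (together with the returned plan ID) to `create_backup_selection`. Vault Lock is configured on the vault itself and is not part of this sketch.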

RPO and RTO: recovery objectives #

Two metrics are the criteria for choosing a DR strategy.

| Metric | Definition | In one line |
|---|---|---|
| RPO (Recovery Point Objective) | The acceptable amount of data loss in time | “How far into the past can you afford to go back?” |
| RTO (Recovery Time Objective) | The acceptable time to recover | “How quickly must you restore?” |

The shorter the RPO, the more frequently you must take backups (or replicate in real time); the shorter the RTO, the more you must keep a standby environment running ahead of time. Making both short increases cost, so the exam framing is to choose the lowest-cost strategy that meets the requirement.
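A quick worked example helps here. The sketch below assumes periodic backups, where the worst case is a failure just before the next backup runs, so the data loss can be as large as one full backup interval. The recovery steps and their durations are invented numbers for illustration.

```python
from datetime import timedelta


def worst_case_loss(backup_interval):
    """With periodic backups, up to one full interval of data can be lost."""
    return backup_interval


def estimated_recovery_time(steps):
    """Sum the durations of the recovery steps, assuming they run one after another."""
    return sum(steps.values(), timedelta())


def check_objectives(backup_interval, steps, rpo, rto):
    """Report whether a backup schedule and recovery procedure meet RPO and RTO."""
    return {
        "rpo_met": worst_case_loss(backup_interval) <= rpo,
        "rto_met": estimated_recovery_time(steps) <= rto,
    }


# Hypothetical recovery procedure: 115 minutes in total.
steps = {
    "detect and decide": timedelta(minutes=15),
    "restore from backup": timedelta(minutes=90),
    "switch traffic": timedelta(minutes=10),
}
rpo = timedelta(hours=4)
rto = timedelta(hours=2)

print(check_objectives(timedelta(hours=24), steps, rpo, rto))  # {'rpo_met': False, 'rto_met': True}
print(check_objectives(timedelta(hours=1), steps, rpo, rto))   # {'rpo_met': True, 'rto_met': True}
```

Daily backups miss a 4-hour RPO even though the restore itself is fast enough. Moving to hourly backups fixes the RPO without touching the recovery procedure, which shows that the two objectives are tuned by different levers.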

Four DR strategies #

Based on the trade-off between RPO/RTO and cost, DR splits into four strategies. This is the same framework covered for SAA (Solutions Architect - Associate), revisited through an operations lens.

| Strategy | Steady-state cost | RTO | Configuration |
|---|---|---|---|
| Backup & Restore | Lowest | Longest (hours) | Only backups in another region. Restore from scratch on failure |
| Pilot Light | Low | Long (tens of minutes) | Keep only the core (DB replication) on, the rest off. Start up on failure |
| Warm Standby | Medium | Short (minutes) | Keep a scaled-down full environment always on. Scale up on failure |
| Multi-Site (Hot) | Highest | Shortest (near zero) | Run both sides at full capacity simultaneously |

The selection criterion is simple.

  • If minimum cost is the priority and slow recovery is acceptable → Backup & Restore
  • If fast recovery is the priority and you can accept the cost → Warm Standby or Multi-Site
  • The compromise between them is Pilot Light

When two conditions come together, like “RTO of a few minutes, cost minimized,” the correct answer is the cheapest strategy that still meets the RTO (usually Pilot Light or Warm Standby).
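The selection rule can be sketched as a short function: walk the strategies from cheapest to most expensive and stop at the first one whose achievable RTO fits the requirement. The RTO figures attached to each strategy are illustrative assumptions chosen to match the table's "hours / tens of minutes / minutes / near zero" wording, not official numbers.

```python
from datetime import timedelta

# Ordered from lowest to highest steady-state cost.
# The achievable RTO values are rough assumptions for illustration.
STRATEGIES = [
    ("Backup & Restore", timedelta(hours=8)),
    ("Pilot Light", timedelta(minutes=30)),
    ("Warm Standby", timedelta(minutes=5)),
    ("Multi-Site", timedelta(0)),  # treated as near zero
]


def pick_strategy(required_rto):
    """Return the cheapest strategy whose achievable RTO meets the requirement."""
    for name, achievable_rto in STRATEGIES:
        if achievable_rto <= required_rto:
            return name
    return None


print(pick_strategy(timedelta(hours=12)))    # Backup & Restore
print(pick_strategy(timedelta(minutes=45)))  # Pilot Light
print(pick_strategy(timedelta(minutes=5)))   # Warm Standby
print(pick_strategy(timedelta(minutes=1)))   # Multi-Site
```

Because the list is ordered by cost, the first match is automatically the lowest-cost strategy that meets the requirement, which is the framing the exam expects.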

Exam Question Patterns #

  • Keep the backup even after deleting the instance → manual snapshot (automated backups are deleted along with it)
  • Enforce standard policy on backups across multiple services → AWS Backup + tag-based policy
  • Make backups impossible to delete or alter → Backup Vault Lock (WORM)
  • Keep backups against a region failure → copy snapshots/backups to another region
  • Minimum cost, slow recovery acceptable → Backup & Restore
  • RTO of a few minutes required → Warm Standby
  • RPO/RTO near zero → Multi-Site

Common Pitfalls #

1) Thinking automated backups are retained permanently #

Automated backups disappear once the retention period (up to 35 days) passes or, by default, when you delete the instance. For permanent retention, take a manual snapshot.

2) Misunderstanding snapshots as full copies #

EBS snapshots are incremental. Even taking them frequently only adds the changed blocks.

3) Choosing a DR strategy by cost alone #

The cheapest, Backup & Restore, is not always the answer. You must read the RTO/RPO requirement first and pick the lowest-cost strategy that meets it.

4) Assuming multi-region as the default #

Most availability requirements are satisfied by Multi-AZ. Multi-region DR is for when there is an explicit requirement such as region failure or regulation.

Summary #

What we covered in this post:

  • The backup units are EBS snapshot (incremental), AMI, and RDS backup. Snapshots can be copied to another region or account
  • RDS automated backups are deleted along with the instance by default; for permanent retention, take a manual snapshot
  • Use AWS Backup to centrally manage multiple services under a tag-based policy. Vault Lock for WORM
  • RPO (acceptable data loss) and RTO (acceptable recovery time) are the criteria for choosing a DR strategy
  • The four DR strategies: Backup & Restore, Pilot Light, Warm Standby, Multi-Site. A trade-off between cost and RTO
  • The key is to read the requirement first and pick the lowest-cost strategy that meets it

Next: Domain 3-1 CloudFormation and IaC #

With data protection done, we’ve finished the reliability domain. Next is the third domain, deployment, provisioning, and automation.

In #7 Domain 3-1 Deployment: CloudFormation in Depth and IaC, I’ll cover CloudFormation’s stack and template structure, change sets and drift detection, StackSets for deploying across multiple accounts and regions, and the relationship with other IaC tools like CDK and Terraform.
