Chapter 11

RDS — managed DB, backups, parameter groups

An overview of RDS, AWS's managed relational DB service: a comparison with a DB on EC2, automated backups, snapshots and PITR, Multi-AZ, parameter / option groups, and the operational flow of minor vs major upgrades.

Chapter 10 (S3) covered object storage; this chapter turns to relational DBs. AWS’s managed relational DB service is RDS (Relational Database Service). It lets you launch and operate PostgreSQL / MySQL / MariaDB / Oracle / SQL Server / Aurora from one console.

In this chapter we start from RDS’s managed model and cover, in order, automated backups and PITR, Multi-AZ and Read Replicas, parameter / option groups, and upgrades. The private subnets RDS lives in build on the subnet design from Chapter 8 EC2 and VPC Basics, password management continues in Chapter 20 Secrets Manager and Parameter Store, and the big picture of backup and restore continues in Chapter 30 Disaster Recovery & Backup.

A DB on EC2 vs RDS #

Everyone new to the cloud hesitates over this at least once: “Should I launch a single EC2 instance and install PostgreSQL myself, or go with RDS?”

Item	DB on EC2	RDS
Install / setup	Yourself	Console click
Patches / minor upgrades	Yourself	Click (or automatic)
Backups	Yourself (`pg_dump`, cron)	Automatic + PITR
Multi-AZ failover	Yourself (Patroni, etc.)	Turn on the option
Read Replica	Yourself (replication setup)	Console click
Monitoring	Yourself (`pg_stat_*`)	CloudWatch + Performance Insights
Cost	Instance cost only	Instance + license + managed premium
Freedom	Touch the OS / extensions / kernel freely	Limited (e.g., no superuser)

For a production system, RDS is almost always the answer. A DB on EC2 makes sense only in special cases, such as needing an extension that RDS doesn't support or needing OS-level tuning.

Choosing an engine #

The engines RDS supports are as follows.

RDS engines

PostgreSQL  ── the first candidate for new projects. Rich JSONB / extensions
MySQL       ── the most common choice. Emphasis on compatibility
MariaDB     ── a MySQL fork. Almost identical to MySQL
Oracle      ── enterprise with expensive licenses
SQL Server  ── the Microsoft ecosystem
Aurora      ── AWS's own engine. PostgreSQL / MySQL compatible

The characteristics of Aurora #

Aurora is a cloud-native DB built by AWS. It’s PostgreSQL / MySQL wire-compatible, so you can move with almost no code changes.

	Aurora	RDS PostgreSQL/MySQL
Storage	Distributed (6 copies automatically)	EBS
Max size	128 TB auto-expanding	64 TB
Read Replica	Up to 15 (shared storage, lag usually in milliseconds)	Up to 15 for PostgreSQL / MySQL (asynchronous)
Failover time	< 30 seconds	1~2 minutes
Cost	About 20% pricier than RDS	Standard
New features	Serverless v2, Global Database	RDS standard

When operational scale and availability matter, Aurora; when cost and simplicity matter, RDS PostgreSQL.

Aurora Serverless v2 is Aurora with usage-based billing and automatic capacity scaling. It’s attractive when traffic is uneven, and it has almost no cold start (the main drawback of v1 was resolved).

Launching an RDS instance #

Creating RDS PostgreSQL

aws rds create-db-instance \
  --db-instance-identifier my-postgres \
  --db-instance-class db.t3.micro \
  --engine postgres \
  --engine-version 16.4 \
  --master-username postgres \
  --master-user-password "very-strong-password" \
  --allocated-storage 20 \
  --storage-type gp3 \
  --vpc-security-group-ids sg-0abc... \
  --db-subnet-group-name my-db-subnet-group \
  --backup-retention-period 7 \
  --multi-az \
  --no-publicly-accessible

The frequently touched options are as follows.

Option	Role
`db-instance-class`	Instance class. `db.t3` (small, burstable), `db.m5` (general purpose), `db.r5` (memory optimized)
`engine` / `engine-version`	Engine and version
`allocated-storage`	Disk GB. `storage-type=gp3` is the default
`multi-az`	Automatically places a Standby in another AZ
`publicly-accessible`	Whether the instance gets a public IP. Keep it off (`--no-publicly-accessible`) in production
`backup-retention-period`	Automated backup retention in days (0 ~ 35; 0 turns automated backups off)

DB Subnet Group #

RDS requires you to specify in advance the subnets it may place instances in, including the Standby of a Multi-AZ configuration. That set of subnets is the DB Subnet Group. It has to span at least two AZs, and it is usually made up of private subnets.

Creating a DB Subnet Group

aws rds create-db-subnet-group \
  --db-subnet-group-name my-db-subnet-group \
  --db-subnet-group-description "DB private subnets" \
  --subnet-ids subnet-0a1... subnet-0b2... subnet-0c3...

The DB lives in a private subnet (Chapter 8 VPC) and is not exposed directly to the internet. At the SG level, you allow inbound traffic only from the app server's SG.
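The "only the app server SG" rule can be sketched as one ingress rule that names the app's security group as the source instead of a CIDR range. The function name and both group IDs are placeholders, and the default port 5432 assumes PostgreSQL.

```shell
# Allow inbound DB traffic to the DB security group only from the
# application's security group (no CIDR ranges involved).
# Usage: allow_app_sg_to_db <db-sg-id> <app-sg-id> [port]
allow_app_sg_to_db() {
  db_sg="$1"
  app_sg="$2"
  port="${3:-5432}"   # 5432 = PostgreSQL default; pass 3306 for MySQL

  aws ec2 authorize-security-group-ingress \
    --group-id "$db_sg" \
    --protocol tcp \
    --port "$port" \
    --source-group "$app_sg"
}
```

Calling `allow_app_sg_to_db <db-sg-id> <app-sg-id>` adds the rule; because the source is a security group, new app servers are covered as soon as they join that group.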

Automated backup — the core value of managed #

RDS’s real value is backups.

Automated Backup #

If backup-retention-period is greater than 0, automated backup is on.

A daily snapshot is taken during the backup window.
Transaction logs are uploaded to S3 every 5 minutes.
Both are kept for the retention period (1 ~ 35 days).
When the DB is deleted, its automated backups are deleted with it by default. To keep a copy, take a final snapshot (omit --skip-final-snapshot and pass --final-db-snapshot-identifier) or retain the backups with --no-delete-automated-backups.

Point-in-Time Recovery (PITR) #

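An RDS instance with automated backup on can be restored to any point within the retention period. You can pick the restore time to the second, up to the latest restorable time, which usually trails the present by about 5 minutes because transaction logs are shipped every 5 minutes.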

Restoring to a point 3 hours ago

aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier my-postgres \
  --target-db-instance-identifier my-postgres-restored \
  --restore-time 2026-05-24T08:30:00Z

A restore always creates a new instance, so the application has to be pointed at the new endpoint afterwards. The source is left as-is. A request like “I need the state from just before 3:27 a.m. this morning” is possible.
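A relative restore such as "3 hours ago" needs a concrete UTC timestamp. The sketch below computes one and passes it to the same restore command; the function names are hypothetical, and the `date` invocation assumes GNU date with a fallback to BSD/macOS syntax.

```shell
# Print a UTC timestamp N hours in the past, formatted for --restore-time.
# Tries GNU date first, then falls back to BSD/macOS date syntax.
restore_time_hours_ago() {
  hours="$1"
  date -u -d "${hours} hours ago" +%Y-%m-%dT%H:%M:%SZ 2>/dev/null ||
    date -u -v-"${hours}"H +%Y-%m-%dT%H:%M:%SZ
}

# Show how far forward PITR can currently reach, then restore a NEW
# instance from the chosen point. The source instance is not modified.
# Usage: restore_hours_ago <source-id> <target-id> <hours>
restore_hours_ago() {
  source_id="$1"
  target_id="$2"
  restore_time="$(restore_time_hours_ago "$3")"

  # Latest point that PITR can reach for this instance right now.
  aws rds describe-db-instances \
    --db-instance-identifier "$source_id" \
    --query 'DBInstances[0].LatestRestorableTime' \
    --output text

  aws rds restore-db-instance-to-point-in-time \
    --source-db-instance-identifier "$source_id" \
    --target-db-instance-identifier "$target_id" \
    --restore-time "$restore_time"
}
```

For example, `restore_hours_ago my-postgres my-postgres-restored 3` restores the state from three hours before the call into a separate instance.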

Manual Snapshot #

A manual snapshot is a backup you create explicitly, separate from automated backups. It survives deletion of the DB and has no retention limit.

Manual snapshot

aws rds create-db-snapshot \
  --db-instance-identifier my-postgres \
  --db-snapshot-identifier my-postgres-2026-05-24-prerelease

The cases where you use a manual snapshot in operations are as follows.

Just before a major upgrade.
Just before a large migration.
A final snapshot when deleting a DB.
A copy to another region / account (a DR configuration).
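The last case, a copy to another region, might look like the sketch below. The function names, the label argument, and the account ID and regions passed in are placeholders; an encrypted snapshot would additionally need a KMS key in the destination region (`--kms-key-id`), which this sketch leaves out.

```shell
# Build a dated snapshot identifier such as my-postgres-2026-05-24-prerelease.
snapshot_name() {
  printf '%s-%s-%s\n' "$1" "$(date -u +%Y-%m-%d)" "$2"
}

# Take a manual snapshot, wait for it, then copy it to a second region.
# Usage: snapshot_and_copy <instance-id> <label> <account-id> <src-region> <dr-region>
snapshot_and_copy() {
  name="$(snapshot_name "$1" "$2")"
  account_id="$3"
  src_region="$4"
  dr_region="$5"

  aws rds create-db-snapshot --region "$src_region" \
    --db-instance-identifier "$1" \
    --db-snapshot-identifier "$name" || return 1

  # Block until the snapshot is usable.
  aws rds wait db-snapshot-available --region "$src_region" \
    --db-snapshot-identifier "$name" || return 1

  # A cross-region copy is issued in the destination region and
  # refers to the source snapshot by its ARN.
  aws rds copy-db-snapshot --region "$dr_region" \
    --source-region "$src_region" \
    --source-db-snapshot-identifier \
      "arn:aws:rds:${src_region}:${account_id}:snapshot:${name}" \
    --target-db-snapshot-identifier "$name"
}
```

Putting the date and a label in the identifier makes it easier to tell later which snapshot was taken before which change.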

Multi-AZ — high availability #

When you turn on the --multi-az option, RDS automatically creates a Standby instance in another AZ and keeps it in sync with the Primary.

The shape of Multi-AZ

   ┌──────────────────────────────────┐
   │           VPC                    │
   │                                  │
   │    AZ a              AZ b        │
   │    ┌──────┐          ┌──────┐    │
   │    │ Pri  │ ◀══════▶ │Stand │    │
   │    │mary  │ sync repl│ by   │    │
   │    └──────┘          └──────┘    │
   │       ▲                          │
   │       │ DNS endpoint             │
   │       │ (automatic failover)     │
   └───────┼──────────────────────────┘
           │
       app server

Synchronous replication — a transaction is committed only after it has also been written to the Standby.
Automatic failover on failure — the Standby is promoted to Primary and the DNS endpoint is repointed to it, typically within 1 ~ 2 minutes.
No read distribution — the Standby serves no traffic, not even reads (a difference from Aurora).
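One way to see the failover behavior described above is to trigger it on purpose on a test instance. This is a minimal sketch with hypothetical function names; it assumes a Multi-AZ instance (the reboot fails with `--force-failover` on a single-AZ instance), and the waiter may return early if the instance status has not changed yet.

```shell
# Print the AZ of the Primary and of the Standby for a Multi-AZ instance.
show_azs() {
  aws rds describe-db-instances \
    --db-instance-identifier "$1" \
    --query 'DBInstances[0].[AvailabilityZone,SecondaryAvailabilityZone]' \
    --output text
}

# Force a failover and show the AZs before and after.
# Run it against a non-production instance first.
# Usage: failover_drill <instance-id>
failover_drill() {
  show_azs "$1" || return 1

  # Reboot and make the Standby take over as Primary.
  aws rds reboot-db-instance \
    --db-instance-identifier "$1" \
    --force-failover || return 1

  # Wait until the instance reports "available" again.
  aws rds wait db-instance-available \
    --db-instance-identifier "$1" || return 1

  show_azs "$1"
}
```

After a successful drill the two AZ values should be swapped, while the endpoint the application connects to stays the same.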

The cost of Multi-AZ #

Redundancy roughly doubles the instance / storage cost. For learning or side projects, a single AZ is enough; for production, use Multi-AZ.

Multi-AZ Cluster (option) #

The newer Multi-AZ DB Cluster option for PostgreSQL / MySQL has readable standbys and a failover time typically under 35 seconds. However, it spans three AZs (one writer and two readers, so the cost of three instances).

Read Replica — read distribution #

A Read Replica is a read-only copy maintained by asynchronous replication. It takes read-heavy traffic off the primary.

Creating a Read Replica

aws rds create-db-instance-read-replica \
  --db-instance-identifier my-postgres-read-1 \
  --source-db-instance-identifier my-postgres \
  --availability-zone ap-northeast-2c

The characteristics are as follows.

Asynchronous replication — there’s a slight delay (usually ms ~ seconds).
Possible across regions too — used for global read distribution / DR.
Up to 15 per source for PostgreSQL / MySQL / MariaDB (5 for Oracle / SQL Server) — the same ceiling as Aurora.
Promoting a replica splits it off as a standalone, writable instance.

The suitability of a Read Replica #

Role	Suitability
Read traffic distribution	⭐⭐⭐
Analytics / reporting	⭐⭐⭐
Backup / DR	⭐⭐ (snapshots are safer)
Automatic failover	No — a Read Replica isn’t auto-promoted

If read traffic isn’t heavy, a Multi-AZ Cluster is simpler than a Read Replica.

Parameter groups and option groups #

In RDS, a DB engine’s settings (like max_connections, shared_buffers) are managed with a parameter group.

Parameter Group #

Creating a custom parameter group

aws rds create-db-parameter-group \
  --db-parameter-group-name my-postgres-16-params \
  --db-parameter-group-family postgres16 \
  --description "Custom params for my workload"

aws rds modify-db-parameter-group \
  --db-parameter-group-name my-postgres-16-params \
  --parameters \
    "ParameterName=max_connections,ParameterValue=200,ApplyMethod=pending-reboot" \
    "ParameterName=log_statement,ParameterValue=ddl,ApplyMethod=immediate"

There are two kinds of parameters.

Static — applied after a DB reboot (max_connections, etc.).
Dynamic — applied immediately (log_statement, etc.).
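Rather than memorizing which parameters are static, you can ask RDS. The sketch below reads a parameter's `ApplyType` from the parameter group and maps it to the `ApplyMethod` value used in modify-db-parameter-group. Both function names are hypothetical, and the `grep` filter is an assumption to discard empty pages in paginated text output.

```shell
# Map a parameter's ApplyType (static / dynamic) to the ApplyMethod
# value that modify-db-parameter-group expects.
apply_method_for() {
  case "$1" in
    dynamic) echo "immediate" ;;
    static)  echo "pending-reboot" ;;
    *)       echo "unknown apply type: $1" >&2; return 1 ;;
  esac
}

# Look up the ApplyType of one parameter in a parameter group.
# Usage: parameter_apply_type <group-name> <parameter-name>
parameter_apply_type() {
  aws rds describe-db-parameters \
    --db-parameter-group-name "$1" \
    --query "Parameters[?ParameterName=='$2'].ApplyType" \
    --output text | grep -E '^(static|dynamic)$'
}
```

For example, `apply_method_for "$(parameter_apply_type my-postgres-16-params max_connections)"` should print `pending-reboot`, matching the static case above.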

The frequently touched parameters are as follows.

Parameter	PostgreSQL	MySQL
Max connections	`max_connections`	`max_connections`
Query logging	`log_min_duration_statement`	`slow_query_log`
Memory	`shared_buffers`, `work_mem`	`innodb_buffer_pool_size`
Timezone	`timezone`	`time_zone`

Option Group #

An option group turns on engine-specific extra features (e.g., SQL Server’s SSIS, Oracle’s OEM). For PostgreSQL / MySQL it’s rarely touched.

Upgrades — the operational flow #

RDS splits engine version upgrades into two kinds.

Minor Upgrade — safe #

A minor upgrade like 16.3 → 16.4. Usually security patches and small improvements. If you turn on the auto-apply option, it happens automatically during the maintenance window.

Turning on automatic minor upgrades

aws rds modify-db-instance \
  --db-instance-identifier my-postgres \
  --auto-minor-version-upgrade \
  --apply-immediately

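Downtime is typically 30 seconds ~ 5 minutes. Don't count on Multi-AZ to hide it: on a Multi-AZ instance an engine upgrade is applied to the Primary and the Standby together. The standby-first-then-fail-over sequence applies to OS maintenance and to Multi-AZ DB Clusters, which upgrade the readers first.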

Major Upgrade — careful #

A major upgrade like PostgreSQL 16 → 17. It can break compatibility, so follow a procedure like this.

Create a manual snapshot (for rollback).
Try the same version migration in a test environment.
If possible, upgrade a Read Replica first.
Schedule a downtime window outside operating hours.
Run aws rds modify-db-instance --engine-version 17.0 --allow-major-version-upgrade (RDS rejects a major version change without that flag).
Monitor the upgrade.
If a problem arises, restore a new instance from the snapshot.
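The snapshot and upgrade steps of this procedure can be sketched as a script. The function names and the snapshot naming scheme are hypothetical, and the version comparison assumes PostgreSQL 10+ style versions, where the number before the first dot is the major version (it does not hold for MySQL's 8.0 vs 8.4).

```shell
# Succeed when the target's major version differs from the current one,
# e.g. 16.4 -> 17.0. Assumes PostgreSQL 10+ style "MAJOR.MINOR" versions.
is_major_upgrade() {
  [ "${1%%.*}" != "${2%%.*}" ]
}

# List the major versions RDS accepts as a direct upgrade target.
# Usage: valid_major_targets <engine> <current-version>
valid_major_targets() {
  aws rds describe-db-engine-versions \
    --engine "$1" \
    --engine-version "$2" \
    --query 'DBEngineVersions[0].ValidUpgradeTarget[?IsMajorVersionUpgrade].EngineVersion' \
    --output text
}

# Take a rollback snapshot first, then request the upgrade.
# Usage: upgrade_with_snapshot <instance-id> <current-version> <target-version>
upgrade_with_snapshot() {
  snapshot_id="$1-pre-upgrade-$(date -u +%Y%m%d%H%M)"

  aws rds create-db-snapshot \
    --db-instance-identifier "$1" \
    --db-snapshot-identifier "$snapshot_id" || return 1
  aws rds wait db-snapshot-available \
    --db-snapshot-identifier "$snapshot_id" || return 1

  if is_major_upgrade "$2" "$3"; then
    # A major version change must be allowed explicitly.
    aws rds modify-db-instance --db-instance-identifier "$1" \
      --engine-version "$3" --allow-major-version-upgrade --apply-immediately
  else
    aws rds modify-db-instance --db-instance-identifier "$1" \
      --engine-version "$3" --apply-immediately
  fi
}
```

Running `valid_major_targets postgres 16.4` before scheduling the window shows whether the target version can be reached in one step.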

Before a major upgrade, check compatibility issues first, such as PostgreSQL’s deprecated SQL or MySQL’s strict mode changes.

Blue/Green Deployment #

RDS’s Blue/Green Deployment is a way to reduce the downtime of a major upgrade or a large change. RDS creates a replicated copy (green) of the current environment (blue), you apply and test the change on green, and writes pause only briefly at the switchover.

Creating Blue/Green

aws rds create-blue-green-deployment \
  --blue-green-deployment-name my-postgres-bg \
  --source arn:aws:rds:ap-northeast-2:123456789012:db:my-postgres \
  --target-engine-version 17.0

Performance Insights — performance monitoring #

RDS’s performance monitoring tool. You see, in graphs, which SQL takes the most time.

The look of Performance Insights

time axis ──▶
DB Load ▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮
        │ ── SELECT ... FROM users WHERE ...
        │ ── UPDATE products SET ...
        │ ── lock:relation

7 days of data retention are free; longer retention costs extra.
It analyzes slow queries / locks / waits.
Patterns like an application’s N+1 queries show up in the graph.

RDS Proxy — a connection pool #

When Lambda functions or containers connect to RDS, every new connection pays for a TCP / TLS handshake and takes up a connection slot on the DB. RDS Proxy is a managed connection pool that sits in front of the DB.

The cases where you use it are as follows.

Lambda + RDS — the Proxy pools the new connections that would otherwise be opened on every invocation (Chapter 18 API Gateway and Lambda).
Container auto-scaling — it absorbs the connection surge as the number of instances grows.
Failover recovery — the Proxy holds client connections and routes them to the new Primary after a failover.

The cost is per vCPU-hour of the DB instance behind the Proxy. It can be overkill for a small workload.

Common pitfalls #

Public RDS — If you create it with publicly-accessible=true and the SG allows 0.0.0.0/0, brute-force attacks arrive within a few days. In production, always keep it in a private subnet and allow only the app SG.
Putting master-user-password in git — A plaintext password in a script or Terraform gets exposed. Manage it with Secrets Manager (Chapter 20).
Running production without Multi-AZ — If you turn off Multi-AZ to save cost, an AZ failure takes the DB down for as long as the failure lasts (potentially hours) or until you restore it into another AZ. Turn it on in production.
backup-retention 0 — If you turn off automated backups to save cost, PITR is turned off at the same time. Point-in-time recovery during an incident becomes impossible. At least 7 days is recommended.
Deleting without a Final Snapshot — If you delete the DB quickly with --skip-final-snapshot, the data is permanently lost. For automation like terraform destroy, force a final snapshot.
Leaving Storage Auto-Scaling off — When the disk fills up, writes fail, usually at the worst possible hour. Turn on auto-expansion with the --max-allocated-storage option.

Turning on Storage Auto-Scaling

aws rds modify-db-instance \
  --db-instance-identifier my-postgres \
  --max-allocated-storage 200

Mistaking a Read Replica for a failover target — A Read Replica doesn’t auto-failover. It needs a manual promote. Automatic failover is Multi-AZ.
Connection leaks — If the app doesn’t close connections and fills max_connections, new requests are rejected. Check PgBouncer / RDS Proxy or the app’s pool settings.
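Several of the pitfalls above are visible in an instance's settings, so they can be checked with a script. This is a minimal sketch with hypothetical function names; the 7-day threshold follows the recommendation above, and it assumes the CLI's text output prints booleans as True / False and a missing value as None.

```shell
# Check one instance's settings against the flag-related pitfalls.
# Arguments mirror one row of describe-db-instances text output:
#   <PubliclyAccessible> <MultiAZ> <BackupRetentionPeriod> <MaxAllocatedStorage>
# Prints one warning per finding and returns non-zero if any were found.
audit_settings() {
  findings=0
  case "$1" in
    [Tt]rue) echo "WARN: publicly accessible"; findings=1 ;;
  esac
  case "$2" in
    [Ff]alse) echo "WARN: Multi-AZ is off"; findings=1 ;;
  esac
  if [ "$3" -lt 7 ]; then
    echo "WARN: backup retention is $3 days (under 7)"; findings=1
  fi
  case "$4" in
    None|null|"") echo "WARN: storage auto-scaling is off"; findings=1 ;;
  esac
  return "$findings"
}

# Fetch the four settings for an instance and audit them.
# Usage: audit_instance <instance-id>
audit_instance() {
  settings="$(aws rds describe-db-instances \
    --db-instance-identifier "$1" \
    --query 'DBInstances[0].[PubliclyAccessible,MultiAZ,BackupRetentionPeriod,MaxAllocatedStorage]' \
    --output text)" || return 1
  # Intentionally unquoted: split the tab-separated row into four arguments.
  audit_settings $settings
}
```

Because `audit_instance` returns non-zero on any finding, it can serve as a gate in a deployment script. Password handling, final snapshots on delete, and connection leaks are not visible in these settings and still need their own checks.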

Exercises #

Without looking at the §“A DB on EC2 vs RDS” table, write down three operational tasks you don’t have to do yourself when you choose RDS. And conversely, note one special situation where a DB on EC2 is needed.
Compare the three — automated backup (PITR), manual snapshot, and Read Replica — from a backup / DR standpoint, and for the two situations “I need to undo a DELETE I ran wrong last night” and “I need a safety net just before a major upgrade”, pick what you’d use for each, based on §“Automated backup”. This comparison is expanded again in Chapter 30 Disaster Recovery & Backup.
For the create-db-instance command that launches a production RDS instance, state in one sentence each which flag to set to which value to prevent the first six items in §“Common pitfalls” (e.g., publicly-accessible, backup-retention-period, multi-az).

In short: RDS is a managed relational DB, and for a production system RDS is almost always the answer; Aurora is AWS’s own engine that adds distributed storage and fast failover. Automated backup and PITR restore to any point within the retention period, up to the latest restorable time (usually about 5 minutes behind the present), and a manual snapshot survives even if you delete the DB. Multi-AZ is synchronous replication plus automatic failover, but the standby can’t be used for reads; a Read Replica is just an asynchronous read copy, not an automatic failover target. The production baseline is a private subnet, publicly-accessible=false, and 7+ days of backup.

Next chapter #

We’ve got the DB domain in hand. Next, Chapter 12 Route 53 moves on to DNS, the first point where users meet our system. We’ll lay out domain operations — domain registration and Hosted Zones, record kinds and Alias, and routing policies (Failover / Latency / Geolocation).