RDS — managed DB, backups, parameter groups
AWS's managed relational DB service, RDS. A comparison with a DB on EC2, automated backups and snapshots and PITR, Multi-AZ, parameter / option groups, and the operational flow of minor vs major upgrades.
Chapter 10 (S3) covered object storage; this chapter turns to relational databases. AWS’s managed relational DB service is RDS (Relational Database Service). You can launch and operate PostgreSQL / MySQL / MariaDB / Oracle / SQL Server / Aurora from one console.
In this chapter we start from RDS’s managed model and lay out, in order, automated backups and PITR, Multi-AZ and Read Replicas, parameter / option groups, and upgrades. The private subnet that RDS lives in builds on the subnet design of Chapter 8 EC2 and VPC Basics, password management continues in Chapter 20 Secrets Manager and Parameter Store, and the big picture of backup and restore continues in Chapter 30 Disaster Recovery & Backup.
A DB on EC2 vs RDS #
Everyone who is new to the cloud hesitates over this at least once: “Should I launch a single EC2 and install PostgreSQL myself, or go with RDS?”
| Item | DB on EC2 | RDS |
|---|---|---|
| Install / setup | Yourself | Console click |
| Patches / minor upgrades | Yourself | Click (or automatic) |
| Backups | Yourself (pg_dump, cron) | Automatic + PITR |
| Multi-AZ failover | Yourself (Patroni, etc.) | Turn on the option |
| Read Replica | Yourself (replication setup) | Console click |
| Monitoring | Yourself (pg_stat_*) | CloudWatch + Performance Insights |
| Cost | Instance cost only | Instance + license + managed premium |
| Freedom | Touch the OS / extensions / kernel freely | Limited (e.g., no superuser) |
For an operational system, RDS is almost always the answer. A DB on EC2 is used only in special cases where an extension is unsupported on RDS or where OS-level tuning is needed.
Choosing an engine #
The engines RDS supports are as follows.
PostgreSQL ── the first candidate for new projects. Rich JSONB / extensions
MySQL ── the most common choice. Emphasis on compatibility
MariaDB ── a MySQL fork. Almost identical to MySQL
Oracle ── enterprise with expensive licenses
SQL Server ── the Microsoft ecosystem
Aurora ── AWS's own engine. PostgreSQL / MySQL compatible
The characteristics of Aurora #
Aurora is a cloud-native DB built by AWS. It’s PostgreSQL / MySQL wire-compatible, so you can move with almost no code changes.
| | Aurora | RDS PostgreSQL/MySQL |
|---|---|---|
| Storage | Distributed (6 copies automatically) | EBS |
| Max size | 128 TB auto-expanding | 64 TB |
| Read Replica | Up to 15 (millisecond sync) | 5 (asynchronous) |
| Failover time | < 30 seconds | 1~2 minutes |
| Cost | About 20% pricier than RDS | Standard |
| New features | Serverless v2, Global Database | RDS standard |
When operational scale and availability matter, Aurora; when cost and simplicity matter, RDS PostgreSQL.
Aurora Serverless v2 is a usage-billed, auto-scaling configuration of Aurora. It’s attractive when traffic is uneven, and it has almost no cold start (the main drawback of v1 was resolved).
Launching an RDS instance #
aws rds create-db-instance \
--db-instance-identifier my-postgres \
--db-instance-class db.t3.micro \
--engine postgres \
--engine-version 16.4 \
--master-username postgres \
--master-user-password "very-strong-password" \
--allocated-storage 20 \
--storage-type gp3 \
--vpc-security-group-ids sg-0abc... \
--db-subnet-group-name my-db-subnet-group \
--backup-retention-period 7 \
--multi-az \
--no-publicly-accessible
The frequently touched options are as follows.
| Option | Role |
|---|---|
| db-instance-class | Instance class: db.t3 (small, burstable), db.m5 (general purpose), db.r5 (memory optimized) |
| engine / engine-version | Engine and version |
| allocated-storage | Disk size in GiB. storage-type=gp3 is the default |
| multi-az | Automatically places a Standby in another AZ |
| publicly-accessible | Whether the instance gets a public IP. Keep it false in production |
| backup-retention-period | Automated backup retention in days (0 ~ 35) |
DB Subnet Group #
RDS requires you to specify in advance the subnets it may place instances in, including the Standby of a Multi-AZ configuration. That’s the DB Subnet Group. Usually it spans private subnets in two or more AZs.
aws rds create-db-subnet-group \
--db-subnet-group-name my-db-subnet-group \
--db-subnet-group-description "DB private subnets" \
--subnet-ids subnet-0a1... subnet-0b2... subnet-0c3...
The DB lives in a private subnet (Chapter 8 VPC) and is not exposed directly to the internet. At the security group level, you allow inbound traffic only from the app server’s SG.
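The rule “only the app server SG may reach the DB” can be sketched as a single ingress rule on the DB’s security group. This is a minimal sketch, not a definitive setup: the SG IDs are placeholders, the helper name db_ingress_cmd is hypothetical, and port 5432 assumes PostgreSQL. The helper only prints the command so it can be reviewed before anyone runs it.

```shell
# Sketch: build the ingress rule that lets only the app SG reach the DB SG.
# The SG IDs are placeholders; 5432 assumes PostgreSQL (MySQL uses 3306).
db_ingress_cmd() {
  db_sg=$1
  app_sg=$2
  port=${3:-5432}
  printf 'aws ec2 authorize-security-group-ingress --group-id %s --protocol tcp --port %s --source-group %s\n' \
    "$db_sg" "$port" "$app_sg"
}

# Print the command for review instead of running it.
db_ingress_cmd sg-0db-placeholder sg-0app-placeholder
```

Using --source-group instead of a CIDR range means the rule follows the app servers as they are replaced or scaled, without anyone having to track their IP addresses.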
Automated backup — the core value of managed #
RDS’s real value is backups.
Automated Backup #
If backup-retention-period is greater than 0, automated backup is on.
- A snapshot of the storage volume is taken once a day during the backup window.
- Transaction logs are saved every 5 minutes.
- Both are kept for the retention period (1 ~ 35 days).
- When the DB is deleted, its automated backups are deleted with it by default. Taking a final snapshot (SkipFinalSnapshot=false) keeps a copy of the data.
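The retention rules above can be checked before a value is ever sent to RDS. The following is a small sketch under stated assumptions: check_retention is a hypothetical helper name, and the instance name my-postgres in the comment is the one used in this chapter’s examples.

```shell
# Sketch: validate a backup-retention-period value before applying it.
# RDS accepts 0..35 days; 0 turns automated backups (and therefore PITR) off.
check_retention() {
  days=$1
  case $days in
    ''|*[!0-9]*) echo "invalid"; return 1 ;;
  esac
  if [ "$days" -gt 35 ]; then
    echo "invalid"; return 1
  elif [ "$days" -eq 0 ]; then
    echo "backups-off"
  else
    echo "pitr-on"
  fi
}

# Reading the current value from a real instance (requires AWS credentials):
#   aws rds describe-db-instances --db-instance-identifier my-postgres \
#     --query 'DBInstances[0].BackupRetentionPeriod' --output text
check_retention 7   # prints: pitr-on
```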
Point-in-Time Recovery (PITR) #
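An RDS instance with automated backups on can be restored to any second within the retention period, up to the latest restorable time. Because transaction logs are saved every 5 minutes, the latest restorable time usually trails the present by up to about 5 minutes.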
aws rds restore-db-instance-to-point-in-time \
--source-db-instance-identifier my-postgres \
--target-db-instance-identifier my-postgres-restored \
--restore-time 2026-05-24T08:30:00Z
A restore always creates a new instance; the source is left as-is. A request like “I need the state from just before 3:27 a.m. this morning” is possible.
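A restore targets either a specific moment or the newest state RDS can reproduce. The sketch below picks between the CLI’s --restore-time and --use-latest-restorable-time flags, which cannot be combined. The helper name pitr_target_args is hypothetical and the timestamps are illustrative.

```shell
# Sketch: choose the restore target for restore-db-instance-to-point-in-time.
# With a timestamp, restore to that moment; without one, restore to the
# latest restorable time. The CLI accepts only one of the two flags.
pitr_target_args() {
  if [ -n "${1:-}" ]; then
    printf '%s %s\n' "--restore-time" "$1"
  else
    printf '%s\n' "--use-latest-restorable-time"
  fi
}

# Intended use (requires AWS credentials; the substitution is left unquoted
# on purpose so the flag and its value become separate arguments):
#   aws rds restore-db-instance-to-point-in-time \
#     --source-db-instance-identifier my-postgres \
#     --target-db-instance-identifier my-postgres-restored \
#     $(pitr_target_args 2026-05-24T03:26:59Z)
pitr_target_args 2026-05-24T03:26:59Z
pitr_target_args
```

The current upper bound of the window can be read from the LatestRestorableTime field that describe-db-instances returns for the instance.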
Manual Snapshot #
A manual snapshot is a backup you create explicitly, separate from automated backups. It doesn’t disappear even when you delete the DB, and it has no retention deadline.
aws rds create-db-snapshot \
--db-instance-identifier my-postgres \
--db-snapshot-identifier my-postgres-2026-05-24-prerelease
The cases where you use a manual snapshot in operations are as follows.
- Just before a major upgrade.
- Just before a large migration.
- A final snapshot when deleting a DB.
- A copy to another region / account (a DR configuration).
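Since manual snapshots never expire, a predictable naming scheme makes them easier to find and clean up later. The sketch below builds identifiers in the same instance-date-label shape as the example above and checks them against RDS’s identifier rules. Both helper names are hypothetical.

```shell
# Sketch: build and validate a manual snapshot identifier.
# RDS identifiers: 1-255 letters, digits or hyphens; must start with a
# letter; must not end with a hyphen or contain two hyphens in a row.
snapshot_id() {
  # $1 = instance, $2 = label, $3 = date (defaults to today, UTC)
  printf '%s-%s-%s\n' "$1" "${3:-$(date -u +%Y-%m-%d)}" "$2"
}

valid_snapshot_id() {
  [ "${#1}" -le 255 ] || return 1
  case $1 in
    ''|[!A-Za-z]*|*-|*--*|*[!A-Za-z0-9-]*) return 1 ;;
  esac
  return 0
}

id=$(snapshot_id my-postgres prerelease 2026-05-24)
if valid_snapshot_id "$id"; then echo "$id"; fi
```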
Multi-AZ — high availability #
When you turn on the --multi-az option, RDS automatically creates a Standby instance in another AZ and replicates to it synchronously.
┌──────────────────────────────────┐
│ VPC                              │
│                                  │
│     AZ a              AZ b       │
│   ┌──────┐          ┌──────┐     │
│   │ Pri  │ ◀══════▶ │Stand │     │
│   │ mary │ sync repl│  by  │     │
│   └──────┘          └──────┘     │
│       ▲                          │
│       │ DNS endpoint             │
│       │ (automatic failover)     │
└───────┼──────────────────────────┘
        │
    app server
- Synchronous replication: every transaction is also written to the Standby before the commit is acknowledged.
- Automatic failover on failure: typically within 1 ~ 2 minutes the Standby becomes Primary and the DNS endpoint points to it.
- No read distribution: the Standby can’t serve reads (a difference from Aurora).
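Because failover works by repointing the DNS endpoint, clients see a short window of failed connections and should retry rather than give up. A minimal sketch of that retry loop follows. The function name retry_with_backoff and the endpoint in the comment are hypothetical; pg_isready is the connection check that ships with the PostgreSQL client tools.

```shell
# Sketch: retry a connection check with exponential backoff, so a client
# rides through the window in which the endpoint's DNS record is repointed.
# Usage: retry_with_backoff <max_attempts> <base_delay_seconds> <command...>
retry_with_backoff() {
  max=$1
  delay=$2
  shift 2
  attempt=1
  while :; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max" ]; then
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}

# Example with a hypothetical endpoint. 7 attempts starting at 2 seconds
# wait 2+4+8+16+32+64 = 126 seconds in total between attempts:
#   retry_with_backoff 7 2 pg_isready -h my-postgres.example.internal -p 5432
retry_with_backoff 3 0 true && echo "reachable"
```

In a real application the same idea usually lives in the driver or connection pool settings; the shell version is only meant to make the timing visible.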
The cost of Multi-AZ #
The price of redundancy is 2x the instance / storage cost. For learning or side projects, single AZ; for operations, Multi-AZ.
Multi-AZ Cluster (option) #
The newer Multi-AZ DB Cluster option for PostgreSQL / MySQL has one writer and two readable standbys, and failover typically completes in under 35 seconds. However, it spans three AZs, so you pay for three instances.
Read Replica — read distribution #
A Read Replica is a read-only copy maintained by asynchronous replication. It takes read-heavy traffic off the primary.
aws rds create-db-instance-read-replica \
--db-instance-identifier my-postgres-read-1 \
--source-db-instance-identifier my-postgres \
--availability-zone ap-northeast-2c
The characteristics are as follows.
- Asynchronous replication — there’s a slight delay (usually ms ~ seconds).
- Possible across regions too — used for global read distribution / DR.
- Up to 5 — Aurora has 15.
- Promoting a replica splits it off as a standalone, writable instance.
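Since replication is asynchronous, a caller that needs fresh data may want to check the lag before reading from a replica. The sketch below compares a lag value against a threshold. The helper name replica_fresh_enough and the 5-second threshold are assumptions for illustration; the lag itself would come from the CloudWatch ReplicaLag metric, which is reported in seconds.

```shell
# Sketch: decide whether a replica is fresh enough to serve a read.
# $1 = observed lag in seconds, $2 = maximum acceptable lag in seconds.
replica_fresh_enough() {
  awk -v lag="$1" -v max="$2" 'BEGIN { exit !(lag + 0 <= max + 0) }'
}

# Fetching the metric (requires AWS credentials; times are placeholders):
#   aws cloudwatch get-metric-statistics --namespace AWS/RDS \
#     --metric-name ReplicaLag \
#     --dimensions Name=DBInstanceIdentifier,Value=my-postgres-read-1 \
#     --start-time 2026-05-24T08:00:00Z --end-time 2026-05-24T08:05:00Z \
#     --period 60 --statistics Average
if replica_fresh_enough 0.8 5; then
  echo "read from replica"
else
  echo "read from primary"
fi
```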
The suitability of a Read Replica #
| Role | Suitability |
|---|---|
| Read traffic distribution | ⭐⭐⭐ |
| Analytics / reporting | ⭐⭐⭐ |
| Backup / DR | ⭐⭐ (snapshots are safer) |
| Automatic failover | No — a Read Replica isn’t auto-promoted |
If read traffic isn’t heavy, a Multi-AZ Cluster is simpler than a Read Replica, because its readable standbys give you failover and some read capacity in one configuration.
Parameter groups and option groups #
In RDS, a DB engine’s settings (like max_connections, shared_buffers) are managed with a parameter group.
Parameter Group #
aws rds create-db-parameter-group \
--db-parameter-group-name my-postgres-16-params \
--db-parameter-group-family postgres16 \
--description "Custom params for my workload"
aws rds modify-db-parameter-group \
--db-parameter-group-name my-postgres-16-params \
--parameters \
"ParameterName=max_connections,ParameterValue=200,ApplyMethod=pending-reboot" \
"ParameterName=log_statement,ParameterValue=ddl,ApplyMethod=immediate"
There are two kinds of parameters.
- Static: applied after a DB reboot (max_connections, etc.).
- Dynamic: applied immediately (log_statement, etc.).
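The static / dynamic distinction decides which ApplyMethod to request. The sketch below maps one to the other and produces the argument format used by modify-db-parameter-group above. The helper name param_spec is hypothetical; the ApplyType values static and dynamic are the ones describe-db-parameters reports.

```shell
# Sketch: map a parameter's ApplyType to the ApplyMethod to request.
# static  -> pending-reboot (takes effect after the next reboot)
# dynamic -> immediate
param_spec() {
  name=$1
  value=$2
  apply_type=$3
  case $apply_type in
    static)  method=pending-reboot ;;
    dynamic) method=immediate ;;
    *) echo "unknown ApplyType: $apply_type" >&2; return 1 ;;
  esac
  printf 'ParameterName=%s,ParameterValue=%s,ApplyMethod=%s\n' "$name" "$value" "$method"
}

# Looking up ApplyType for one parameter (requires AWS credentials):
#   aws rds describe-db-parameters --db-parameter-group-name my-postgres-16-params \
#     --query "Parameters[?ParameterName=='max_connections'].ApplyType" --output text
param_spec max_connections 200 static
param_spec log_statement ddl dynamic
```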
The frequently touched parameters are as follows.
| Parameter | PostgreSQL | MySQL |
|---|---|---|
| Max connections | max_connections | max_connections |
| Query logging | log_min_duration_statement | slow_query_log |
| Memory | shared_buffers, work_mem | innodb_buffer_pool_size |
| Timezone | timezone | time_zone |
Option Group #
An option group turns on engine-specific extra features (e.g., SQL Server’s SSIS, Oracle’s OEM). For PostgreSQL / MySQL it’s rarely touched.
Upgrades — the operational flow #
RDS splits engine version upgrades into two kinds.
Minor Upgrade — safe #
A minor upgrade like 16.3 → 16.4. Usually security patches and small improvements. If you turn on the auto-apply option, RDS applies it automatically during the maintenance window.
aws rds modify-db-instance \
--db-instance-identifier my-postgres \
--auto-minor-version-upgrade \
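--apply-immediately
Downtime is 30 seconds ~ 5 minutes. Multi-AZ by itself doesn’t remove this downtime for an engine upgrade, because RDS upgrades the Primary and the Standby together. A Multi-AZ DB Cluster shortens it by upgrading the readable standbys first, failing over, and then upgrading the old writer.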
Major Upgrade — careful #
A major upgrade like PostgreSQL 16 → 17. It may break compatibility, so the procedure is as follows.
- Create a manual snapshot (for rollback).
- Rehearse the same version upgrade in a test environment.
- If possible, upgrade a Read Replica first.
- Schedule a downtime window outside operating hours.
- Run aws rds modify-db-instance with --engine-version 17.0 and --allow-major-version-upgrade (RDS rejects a major upgrade without the latter).
- Monitor the upgrade.
- If a problem arises, restore a new instance from the snapshot.
Before a major upgrade, check compatibility issues first, such as PostgreSQL’s deprecated SQL or MySQL’s strict mode changes.
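Whether a version change counts as minor or major decides which flags modify-db-instance needs. The sketch below classifies a change and emits the matching flags. It assumes PostgreSQL 10+ numbering, where the first component is the major version; MySQL-style numbering (8.0 → 8.4) would need a different rule. Both helper names are hypothetical, and the check looks only at major versions, so it does not detect a minor downgrade.

```shell
# Sketch: classify an engine version change as minor or major.
# Assumes PostgreSQL 10+ numbering (first component = major version).
upgrade_kind() {
  from_major=${1%%.*}
  to_major=${2%%.*}
  if [ "$from_major" -eq "$to_major" ]; then
    echo minor
  elif [ "$to_major" -gt "$from_major" ]; then
    echo major
  else
    echo unsupported   # RDS has no in-place downgrade
  fi
}

# Emit the modify-db-instance flags that the change requires.
upgrade_flags() {
  case $(upgrade_kind "$1" "$2") in
    major) printf '%s\n' "--engine-version $2 --allow-major-version-upgrade" ;;
    minor) printf '%s\n' "--engine-version $2" ;;
    *) return 1 ;;
  esac
}

upgrade_kind 16.3 16.4   # prints: minor
upgrade_kind 16.4 17.0   # prints: major
```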
Blue/Green Deployment #
RDS’s Blue/Green Deployment is a way to reduce the downtime of a major upgrade or a large change. RDS creates a replicated copy (green) of the current environment (blue), the change is applied to green, and writes pause only briefly at the switchover.
aws rds create-blue-green-deployment \
--blue-green-deployment-name my-postgres-bg \
--source arn:aws:rds:ap-northeast-2:123456789012:db:my-postgres \
--target-engine-version 17.0
Performance Insights — performance monitoring #
Performance Insights is RDS’s performance monitoring tool. It shows, as graphs, which SQL statements account for the most database load.
time axis ──▶
DB Load ▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮
│ ── SELECT ... FROM users WHERE ...
│ ── UPDATE products SET ...
│ ── lock:relation
- 7 days of history are free; longer retention costs extra.
- It analyzes slow queries / locks / waits.
- Patterns like an application’s N+1 queries show up in the graph.
RDS Proxy — a connection pool #
When connecting to RDS from Lambda or containers, every new connection pays for a TCP / TLS handshake, which is expensive. RDS Proxy provides a connection pool as a managed service.
The cases where you use it are as follows.
- Lambda + RDS: the proxy pools the new connections that would otherwise be opened on every invocation (Chapter 18 API Gateway and Lambda).
- Container auto-scaling: it absorbs the connection surge as the number of instances grows.
- Failover recovery: the proxy keeps client connections open and routes them to the new primary after a failover.
The cost is per vCPU hour. It can be overkill for a small workload.
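Whether a pool or proxy is needed at all comes down to simple arithmetic: the number of app instances times the connections each one holds must stay below max_connections. A back-of-the-envelope sketch follows. The helper name and all numbers are illustrative, not recommendations.

```shell
# Sketch: how many connections each app instance may hold without the
# fleet exceeding max_connections. All numbers are illustrative.
pool_size_per_instance() {
  max_connections=$1
  reserved=$2        # kept free for admin sessions, migrations, monitoring
  instances=$3
  [ "$instances" -gt 0 ] || return 1
  echo $(( (max_connections - reserved) / instances ))
}

pool_size_per_instance 200 20 12   # prints: 15
```

If auto-scaling can push the instance count high enough that this number approaches zero, that is the point where a shared pool such as RDS Proxy starts to pay for itself.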
Common pitfalls #
- Public RDS: If you create it with publicly-accessible=true and the SG allows 0.0.0.0/0, brute-force attempts arrive within a few days. In production, keep it in a private subnet and allow only the app SG.
- Putting master-user-password in git: A plaintext password in a script or Terraform gets exposed. Manage it with Secrets Manager (Chapter 20).
- Operating without Multi-AZ on: If you turn off Multi-AZ to save cost, the DB can be down for 1 ~ 2 hours during an AZ failure. Turn it on in production.
- backup-retention 0: If you turn off automated backups to save cost, PITR is turned off at the same time, and recovery during an incident becomes impossible. At least 7 days is recommended.
- Deleting without a Final Snapshot: If you delete the DB quickly with --skip-final-snapshot and no manual snapshot exists, the data is permanently lost. For automation like terraform destroy, force a final snapshot.
- Turning off Storage Auto-Scaling: Writes fail at dawn once the disk fills up. Turn on auto-expansion with the --max-allocated-storage option.
aws rds modify-db-instance \
--db-instance-identifier my-postgres \
--max-allocated-storage 200
- Mistaking a Read Replica for a failover target: A Read Replica doesn’t fail over automatically; it needs a manual promote. Automatic failover is the job of Multi-AZ.
- Connection leaks: If the app doesn’t close connections and max_connections fills up, new requests are rejected. Check the app’s pool settings, or put PgBouncer / RDS Proxy in front.
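For the storage pitfall above, a free-space check is a useful backstop even with auto-expansion on. The sketch below turns the CloudWatch FreeStorageSpace metric (bytes) and the instance’s allocated storage (GiB) into a percentage. The helper names and the 10% default threshold are assumptions for illustration.

```shell
# Sketch: percentage of allocated storage still free.
# $1 = FreeStorageSpace in bytes, $2 = allocated storage in GiB.
free_storage_pct() {
  awk -v free="$1" -v alloc="$2" \
    'BEGIN { printf "%d\n", free * 100 / (alloc * 1073741824) }'
}

# $3 = alert threshold in percent (assumed default: 10).
storage_alarm() {
  pct=$(free_storage_pct "$1" "$2")
  if [ "$pct" -lt "${3:-10}" ]; then
    echo "LOW ($pct% free)"
  else
    echo "ok ($pct% free)"
  fi
}

storage_alarm 1073741824 20   # 1 GiB free of 20 GiB, prints: LOW (5% free)
```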
Exercises #
- Without looking at the §“A DB on EC2 vs RDS” table, write down three operational tasks you don’t have to do yourself when you choose RDS. And conversely, note one special situation where a DB on EC2 is needed.
- Compare the three — automated backup (PITR), manual snapshot, and Read Replica — from a backup / DR standpoint, and for the two situations “I need to undo a DELETE I ran wrong last night” and “I need a safety net just before a major upgrade”, pick what you’d use for each, based on §“Automated backup”. This comparison is expanded again in Chapter 30 Disaster Recovery & Backup.
- For the create-db-instance command that launches an operational RDS, state in one sentence each which flag to set to which value to prevent the first six items in §“Common pitfalls” (e.g., publicly-accessible, backup-retention-period, multi-az).
In short: RDS is a managed relational DB, and for an operational system RDS is almost always the answer; Aurora is AWS’s own engine that adds distributed storage and fast failover. Automated backup and PITR can restore to any point within the retention period, up to the latest restorable time (usually within the last 5 minutes), and a manual snapshot survives even if you delete the DB. Multi-AZ is synchronous replication plus automatic failover, but the standby can’t be used for reads; a Read Replica is just an asynchronous read copy, not an automatic failover target. The operational baseline is a private subnet, publicly-accessible=false, and 7+ days of backup.
Next chapter #
We’ve got the DB domain in hand. Next, Chapter 12 Route 53 moves on to DNS, the first point where users meet our system. We’ll lay out domain operations — domain registration and Hosted Zones, record kinds and Alias, and routing policies (Failover / Latency / Geolocation).