AWS Intermediate #4: RDS — managed DB, backups, parameter groups
If #3 S3 was the object layer, now we move to the relational DB layer. AWS’s managed relational DB service is RDS (Relational Database Service). From a single console you can launch and operate PostgreSQL / MySQL / MariaDB / Oracle / SQL Server / Aurora.
In this post we line up the RDS managed model → automated backups and PITR → Multi-AZ and Read Replica → parameter / option groups → upgrades.
DB on EC2 vs RDS #
Everyone hesitates when first moving to the cloud. “Should I spin up an EC2 and install PostgreSQL myself, or go with RDS?”
| Item | DB on EC2 | RDS |
|---|---|---|
| Install / setup | DIY | Console click |
| Patches / minor upgrades | DIY | Click (or auto) |
| Backup | DIY (pg_dump, cron) | Auto + PITR |
| Multi-AZ failover | DIY (Patroni, etc.) | Toggle option |
| Read Replica | DIY (replication setup) | Console click |
| Monitoring | DIY (pg_stat_*) | CloudWatch + Performance Insights |
| Cost | Instance only | Instance + license + managed premium |
| Freedom | OS / extensions / kernel everything | Limited (e.g., no superuser) |
For production, RDS is almost always the right answer. DB-on-EC2 is for special cases: an extension RDS does not support, or a workload that needs OS-level tuning.
Engine choice #
Engines RDS supports:
PostgreSQL ── First pick for new projects. JSONB / rich extensions
MySQL ── Most common choice. Compatibility-driven
MariaDB ── MySQL fork. Almost identical to MySQL
Oracle ── Enterprise with expensive license
SQL Server ── Microsoft ecosystem
Aurora ── AWS's own engine. PostgreSQL / MySQL compatible
Where Aurora sits #
Aurora is AWS’s cloud-native DB. Wire-compatible with PostgreSQL / MySQL, so you can move with almost no code changes.
| | Aurora | RDS PostgreSQL/MySQL |
|---|---|---|
| Storage | Distributed (auto 6 copies) | EBS |
| Max size | 128 TB auto-scale | 64 TB |
| Read Replica | Up to 15 (shared storage, lag typically in milliseconds) | Async; historically 5, up to 15 on current MySQL / MariaDB / PostgreSQL |
| Failover time | < 30 sec | 1–2 min |
| Cost | ~20% more than RDS | Standard |
| New features | Serverless v2, Global Database | RDS basics |
If scale / availability matter most, Aurora. If cost / simplicity matter, RDS PostgreSQL.
Aurora Serverless v2 is usage-based auto-scaling RDS — attractive for workloads with uneven traffic. Cold starts are nearly gone (the v1 weakness fixed).
Launching an RDS instance #
aws rds create-db-instance \
--db-instance-identifier my-postgres \
--db-instance-class db.t3.micro \
--engine postgres \
--engine-version 16.4 \
--master-username postgres \
--master-user-password "very-strong-password" \
--allocated-storage 20 \
--storage-type gp3 \
--vpc-security-group-ids sg-0abc... \
--db-subnet-group-name my-db-subnet-group \
--backup-retention-period 7 \
--multi-az \
--no-publicly-accessible
Common options:
| Option | Description |
|---|---|
| db-instance-class | Instance type. db.t3 (small), db.m5 (general), db.r5 (memory) |
| engine / engine-version | Engine and version |
| allocated-storage | Disk GB. storage-type=gp3 is the default |
| multi-az | Standby auto-placed in another AZ |
| publicly-accessible | Public IP. false in production |
| backup-retention-period | Auto-backup retention days (0–35) |
DB Subnet Group #
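Every RDS instance in a VPC is placed through a DB Subnet Group: a pre-declared set of subnets that RDS may use, which is also how Multi-AZ knows where it can put the standby. The group must span at least two AZs, and in practice you list two or more private subnets.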
aws rds create-db-subnet-group \
--db-subnet-group-name my-db-subnet-group \
--db-subnet-group-description "DB private subnets" \
--subnet-ids subnet-0a1... subnet-0b2... subnet-0c3...
The DB sits in a private subnet (#1 VPC) and is never directly exposed to the internet. Inbound access is SG-by-SG: the DB’s security group allows the DB port only from the app servers’ security group, not from an IP range.
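The SG-by-SG rule can be sketched as a small wrapper around `aws ec2 authorize-security-group-ingress`. The function name and both security group IDs are placeholders introduced here for illustration, and port 5432 assumes PostgreSQL.

```shell
# Allow PostgreSQL traffic into the DB security group only from the
# app servers' security group. Both IDs are hypothetical placeholders.
allow_app_to_db() {
  db_sg="$1"    # security group attached to the RDS instance
  app_sg="$2"   # security group attached to the app servers
  aws ec2 authorize-security-group-ingress \
    --group-id "$db_sg" \
    --protocol tcp \
    --port 5432 \
    --source-group "$app_sg"
}

# Usage (placeholder IDs):
# allow_app_to_db sg-DB_PLACEHOLDER sg-APP_PLACEHOLDER
```

Because the source is a security group rather than a CIDR, new app servers get access as soon as they join the app SG, with no rule changes on the DB side.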
Automated backups — the core value of managed #
The real value of RDS lives in backups.
Automated Backup #
If backup-retention-period > 0, automated backups are on.
- Daily full backup (during the backup window)
- Transaction logs uploaded every 5 minutes
- Kept for the retention period (1–35 days)
- Removed when the DB is deleted, unless you choose to retain automated backups at delete time; a final snapshot (SkipFinalSnapshot=false) preserves the data itself
Point-in-Time Recovery (PITR) #
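With automated backups on, RDS can restore to any second within the retention window by replaying transaction logs on top of the daily backup. The only gap is at the leading edge: the latest restorable time typically trails the present by about 5 minutes.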
aws rds restore-db-instance-to-point-in-time \
--source-db-instance-identifier my-postgres \
--target-db-instance-identifier my-postgres-restored \
--restore-time 2026-04-21T08:30:00Z
Restore creates a new instance, so the original stays intact. “I need exactly the state at 03:27 this morning” becomes entirely doable.
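Before picking a --restore-time, it can help to check how far forward the restorable window currently reaches. A minimal sketch, assuming the instance name from this post; the function name is made up for illustration.

```shell
# Print the retention period (days) and the latest restorable time for one
# instance, so a --restore-time can be sanity-checked before restoring.
show_restore_window() {
  aws rds describe-db-instances \
    --db-instance-identifier "$1" \
    --query 'DBInstances[0].[BackupRetentionPeriod,LatestRestorableTime]' \
    --output text
}

# Usage:
# show_restore_window my-postgres
```

A restore time later than the printed LatestRestorableTime is rejected, so this is a quick way to see whether the last few minutes are already covered.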
Manual Snapshot #
Backups separate from automated, taken explicitly. They survive even if the DB is deleted, with no retention limit.
aws rds create-db-snapshot \
--db-instance-identifier my-postgres \
--db-snapshot-identifier my-postgres-2026-04-21-prerelease
Operational uses:
- Snapshot right before a major upgrade
- Snapshot right before a big migration
- Final snapshot when deleting
- Copy across regions / accounts (for DR)
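The cross-region copy for DR might look like the sketch below. The function name, the target snapshot name, and the DR region are assumptions for illustration; the account ID and source region reuse the examples from this post.

```shell
# Copy a manual snapshot into another region for DR.
# Arguments: source snapshot ARN, target snapshot name, destination region.
# The command runs against the destination region and pulls from the source.
copy_snapshot_to_region() {
  aws rds copy-db-snapshot \
    --source-db-snapshot-identifier "$1" \
    --target-db-snapshot-identifier "$2" \
    --region "$3"
}

# Usage (hypothetical DR region ap-northeast-1):
# copy_snapshot_to_region \
#   arn:aws:rds:ap-northeast-2:123456789012:snapshot:my-postgres-2026-04-21-prerelease \
#   my-postgres-2026-04-21-prerelease-dr \
#   ap-northeast-1
```

Encrypted snapshots additionally need a KMS key that exists in the destination region (--kms-key-id), which this sketch leaves out.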
Multi-AZ — high availability #
With --multi-az, RDS auto-replicates a standby into another AZ.
┌──────────────────────────────────┐
│ VPC                              │
│                                  │
│    AZ a               AZ b       │
│   ┌──────┐          ┌──────┐     │
│   │ Pri  │ ◀══════▶ │Stand │     │
│   │mary  │ sync repl│ by   │     │
│   └──────┘          └──────┘     │
│       ▲                          │
│       │ DNS endpoint             │
│       │ (auto failover)          │
└───────┼──────────────────────────┘
        │
   App servers
- Synchronous replication: the standby has every committed transaction
- Auto failover on outage: typically within 1–2 minutes the standby becomes primary and the DNS endpoint repoints
- Reads not load-balanced: the standby is not used for reads (different from Aurora)
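Failover is easier to trust once you have watched it happen. One way to rehearse it, sketched here with a hypothetical function name, is a reboot with forced failover, which only works on a Multi-AZ instance and does cause a brief outage.

```shell
# Trigger a failover drill: reboot the instance and force the standby
# to take over. Run this against a test instance first.
failover_drill() {
  aws rds reboot-db-instance \
    --db-instance-identifier "$1" \
    --force-failover
}

# Usage:
# failover_drill my-postgres
```

During the drill, the thing to watch is whether the app reconnects on its own once the DNS endpoint repoints.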
Cost of Multi-AZ #
The cost of duplication is 2x instance / storage cost. Single AZ for learning / side projects, Multi-AZ for production.
Multi-AZ Cluster (option) #
The newer Multi-AZ DB Cluster for PostgreSQL / MySQL has readable standbys and failover typically under 35 seconds. The trade-off is that it spans 3 AZs, so you pay for 3 instances.
Read Replica — read distribution #
A Read Replica is an asynchronously replicated read-only copy. Distributes read load on read-heavy workloads.
aws rds create-db-instance-read-replica \
--db-instance-identifier my-postgres-read-1 \
--source-db-instance-identifier my-postgres \
--availability-zone ap-northeast-2c
Properties:
- Async replication — slight lag (usually ms–seconds)
- Cross-region possible — global read distribution / DR
- Up to 15 per source on current RDS for MySQL / MariaDB / PostgreSQL (the older limit was 5; Aurora also allows 15)
- Can be promoted to a standalone instance
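Promotion is a single call. A minimal sketch, using the replica name from the example above and a hypothetical function name:

```shell
# Promote a read replica to a standalone, writable instance.
# This is one-way: replication from the source stops for good.
promote_replica() {
  aws rds promote-read-replica \
    --db-instance-identifier "$1"
}

# Usage:
# promote_replica my-postgres-read-1
```

The promoted instance keeps its own endpoint, so the app has to be pointed at it explicitly; nothing repoints automatically the way the Multi-AZ endpoint does.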
Where Read Replica fits #
| Use | Fit |
|---|---|
| Read traffic distribution | ⭐⭐⭐ |
| Analytics / reporting | ⭐⭐⭐ |
| Backup / DR | ⭐⭐ (snapshots are safer) |
| Auto failover | ❌ — Read Replicas don’t auto-promote |
If read traffic isn’t huge, Multi-AZ Cluster is simpler than Read Replica.
Parameter group and option group #
DB engine settings (max_connections, shared_buffers, etc.) are managed in RDS via parameter groups.
Parameter Group #
aws rds create-db-parameter-group \
--db-parameter-group-name my-postgres-16-params \
--db-parameter-group-family postgres16 \
--description "Custom params for my workload"
aws rds modify-db-parameter-group \
--db-parameter-group-name my-postgres-16-params \
--parameters \
"ParameterName=max_connections,ParameterValue=200,ApplyMethod=pending-reboot" \
"ParameterName=log_statement,ParameterValue=ddl,ApplyMethod=immediate"
Types:
- Static: applies after DB reboot (max_connections, …)
- Dynamic: applies immediately (log_statement, …)
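Rather than memorizing which parameters are static, you can ask RDS. A sketch with a hypothetical function name, using the parameter group created above:

```shell
# Show whether a parameter is static or dynamic in a given parameter group,
# along with its current value.
show_apply_type() {
  group="$1"   # parameter group name
  param="$2"   # parameter to look up
  aws rds describe-db-parameters \
    --db-parameter-group-name "$group" \
    --query "Parameters[?ParameterName=='$param'].[ParameterName,ApplyType,ParameterValue]" \
    --output text
}

# Usage:
# show_apply_type my-postgres-16-params max_connections
```

The ApplyType column tells you which ApplyMethod to use in modify-db-parameter-group: pending-reboot for static parameters, immediate for dynamic ones.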
Common parameters:
| Parameter | PostgreSQL | MySQL |
|---|---|---|
| Max connections | max_connections | max_connections |
| Query logging | log_min_duration_statement | slow_query_log |
| Memory | shared_buffers, work_mem | innodb_buffer_pool_size |
| Timezone | timezone | time_zone |
Option Group #
The group for enabling engine-specific extras (e.g., SSIS for SQL Server, OEM for Oracle). Hardly used for PostgreSQL / MySQL.
Upgrades — operational work #
RDS distinguishes two kinds of engine version upgrade: minor and major.
Minor Upgrade — safe #
Like 16.3 → 16.4. Usually security patches + small improvements. Turn on auto minor version upgrade and RDS applies them during the instance’s maintenance window.
aws rds modify-db-instance \
--db-instance-identifier my-postgres \
--auto-minor-version-upgrade \
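--apply-immediately
Downtime is 30 sec to 5 min. On a Multi-AZ instance deployment RDS upgrades the primary and standby at the same time, so Multi-AZ does not shorten an engine upgrade; the standby first → failover → old primary sequence applies to OS-level maintenance.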
Major Upgrade — careful #
Like PostgreSQL 16 → 17. Things can break. Procedure:
- Take a manual snapshot (for rollback)
- Try the same version migration in a test environment
- Upgrade Read Replicas first (when possible)
- Schedule downtime outside business hours
aws rds modify-db-instance --db-instance-identifier my-postgres --engine-version 17.0 --allow-major-version-upgrade
- Monitor the upgrade
- On issues, restore a new instance from the snapshot
Before a major upgrade, audit compatibility issues like PostgreSQL deprecated SQL / MySQL strict mode changes.
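Part of that audit is confirming that the target version is actually reachable from the current one. A sketch, with a hypothetical function name, that lists the upgrade targets RDS offers for a given engine version:

```shell
# List valid upgrade targets for an engine version. The second output
# column is True for major version upgrades and False for minor ones.
list_upgrade_targets() {
  aws rds describe-db-engine-versions \
    --engine "$1" \
    --engine-version "$2" \
    --query 'DBEngineVersions[0].ValidUpgradeTarget[].[EngineVersion,IsMajorVersionUpgrade]' \
    --output text
}

# Usage:
# list_upgrade_targets postgres 16.4
```

If the version you planned to jump to is not in the list, the upgrade needs an intermediate step or a different target.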
Blue/Green Deployment #
RDS’s Blue/Green Deployment is a newer approach to reducing downtime for major upgrades and large changes. A replica (green) is built in the background, and only the final cutover is brief.
aws rds create-blue-green-deployment \
--blue-green-deployment-name my-postgres-bg \
--source arn:aws:rds:ap-northeast-2:123456789012:db:my-postgres \
--target-engine-version 17.0
Performance Insights — the performance tool #
RDS’s performance monitoring tool. It shows which SQL statements consume the most time, visualized on a graph.
Time axis ──▶
DB Load ▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮
│ ── SELECT ... FROM users WHERE ...
│ ── UPDATE products SET ...
│ ── lock:relation
- 7 days free, anything more costs extra
- Slow queries / locks / wait analysis
- N+1 patterns we meet in Django Advanced #3 query optimization show up on the graph
RDS Proxy — connection pool #
When Lambda or containers connect to RDS, the overhead of a full TCP / TLS handshake on every invocation is costly. RDS Proxy is a managed connection pool that eliminates this.
Where it helps:
- Lambda + RDS — new connection per invocation → pool via Proxy
- Container auto-scaling — connections explode as instances multiply
- Auto-recovery on failover
Cost is per vCPU-hour — overkill for small workloads.
Common pitfalls #
1) Public RDS #
publicly-accessible=true and SG 0.0.0.0/0 → brute force in days. Production: always private subnet + only the app SG.
2) master-user-password in git #
Plain password in scripts / Terraform → leaked. Use Secrets Manager (Advanced #6).
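One option, sketched here as an assumption rather than the approach the later Secrets Manager post necessarily takes, is to let RDS itself generate the master password and keep it in Secrets Manager. The function name is hypothetical.

```shell
# Hand the master password over to an RDS-managed secret in Secrets Manager,
# so no password needs to appear in scripts or Terraform.
enable_managed_password() {
  aws rds modify-db-instance \
    --db-instance-identifier "$1" \
    --manage-master-user-password \
    --apply-immediately
}

# Usage:
# enable_managed_password my-postgres
```

The flag cannot be combined with --master-user-password; after the change, the app reads the credential from the secret instead of from config.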
3) Multi-AZ off in production #
Cost-cut and turned Multi-AZ off → 1–2 hour DB outage during AZ failure. Production: turn it on.
4) backup-retention 0 #
Cost-cut and disabled automated backups → PITR is off too. Recovery impossible after an incident. Recommend at least 7 days.
5) Deleting without final snapshot #
Deleting with --skip-final-snapshot for speed → permanent data loss. Force final snapshot in automation like terraform destroy.
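A safer delete for scripts could look like the sketch below. The function name and the "-final" naming scheme are assumptions for illustration.

```shell
# Delete an instance but always leave a final snapshot behind.
# The snapshot is named <instance>-final (hypothetical naming scheme).
delete_with_final_snapshot() {
  aws rds delete-db-instance \
    --db-instance-identifier "$1" \
    --no-skip-final-snapshot \
    --final-db-snapshot-identifier "$1-final"
}

# Usage:
# delete_with_final_snapshot my-postgres
```

If a snapshot with that name already exists the call fails, which is the desired behavior here: the delete stops rather than proceeding without a snapshot.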
6) Storage Auto-Scaling off #
Disk fills up at 3am → the instance goes storage-full and writes fail. Turn on auto-scaling with max-allocated-storage.
aws rds modify-db-instance \
--db-instance-identifier my-postgres \
--max-allocated-storage 200
7) Read Replica as a failover #
Read Replicas don’t auto-failover. They need manual promote. Auto failover is Multi-AZ.
8) Connection leak #
App doesn’t close connections, fills max_connections → new requests rejected. Check PgBouncer / RDS Proxy or the app pool config.
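A quick way to spot this risk before it bites is to compare the worst case (every app instance filling its pool) against max_connections. The sketch below is plain shell arithmetic; the function name and all the numbers are illustrative assumptions.

```shell
# Worst-case connection count: every app instance fills its pool.
connection_budget() {
  instances="$1"         # number of app containers / tasks
  pool_size="$2"         # max connections per app pool
  max_connections="$3"   # value set in the parameter group
  needed=$((instances * pool_size))
  if [ "$needed" -gt "$max_connections" ]; then
    echo "over: need $needed, limit $max_connections"
  else
    echo "ok: need $needed, limit $max_connections"
  fi
}

# Usage:
# connection_budget 12 20 200   # prints "over: need 240, limit 200"
```

When the result is "over", the fix is a smaller per-instance pool, a pooler such as PgBouncer or RDS Proxy in front, or a higher max_connections on a larger instance.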
Wrap-up #
What we took home this time:
- RDS = AWS’s managed relational DB. PostgreSQL / MySQL / Aurora are the common picks
- Aurora = AWS’s own engine. Distributed storage, faster failover, low-lag replicas
- Place in private subnets via DB Subnet Group. publicly-accessible=false is the production default
- Automated backup + PITR = restore to any point in the retention window (up to about 5 minutes ago)
- Manual Snapshot = explicit, survives DB deletion
- Multi-AZ = sync replication + auto failover, but standby is unreadable
- Read Replica = async copy, for read distribution / analytics. No auto failover
- Manage engine settings in parameter groups. Static / Dynamic difference
- Minor upgrade = can be auto. Major upgrade = use Blue/Green
- Reinforce with Performance Insights + RDS Proxy
- Pitfalls — public, password, Multi-AZ off, backup 0, missing final snapshot, storage auto-scale, RR-as-failover, connection leak
Next — Route 53 #
The DB piece is set. Now to the place where users first meet our system — DNS.
In #5 Route 53 — domains and DNS we’ll line up domain registration / Hosted Zones / record types and Aliases / routing policies (Failover / Latency / Geolocation).