AWS Intermediate #4: RDS — managed DB, backups, parameter groups

If #3 S3 was the object layer, now we move to the relational DB layer. AWS’s managed relational DB service is RDS (Relational Database Service). From a single console you can launch and operate PostgreSQL / MySQL / MariaDB / Oracle / SQL Server / Aurora.

In this post we line up the RDS managed model → automated backups and PITR → Multi-AZ and Read Replica → parameter / option groups → upgrades.

DB on EC2 vs RDS #

Everyone hesitates when first moving to the cloud. “Should I spin up an EC2 and install PostgreSQL myself, or go with RDS?”

Item                      DB on EC2                            RDS
Install / setup           DIY                                  Console click
Patches / minor upgrades  DIY                                  Click (or auto)
Backup                    DIY (pg_dump, cron)                  Auto + PITR
Multi-AZ failover         DIY (Patroni, etc.)                  Toggle option
Read Replica              DIY (replication setup)              Console click
Monitoring                DIY (pg_stat_*)                      CloudWatch + Performance Insights
Cost                      Instance only                        Instance + license + managed premium
Freedom                   OS / extensions / kernel everything  Limited (e.g., no superuser)

For most production workloads, RDS is the right default. DB-on-EC2 is for special cases: when an extension you need isn’t supported on RDS, or when you need OS-level tuning.

Engine choice #

Engines RDS supports:

RDS engines
PostgreSQL  ── First pick for new projects. JSONB / rich extensions
MySQL       ── Most common choice. Compatibility-driven
MariaDB     ── MySQL fork. Almost identical to MySQL
Oracle      ── Enterprise with expensive license
SQL Server  ── Microsoft ecosystem
Aurora      ── AWS's own engine. PostgreSQL / MySQL compatible

Where Aurora sits #

Aurora is AWS’s cloud-native DB. Wire-compatible with PostgreSQL / MySQL, so you can move with almost no code changes.

               Aurora                          RDS PostgreSQL/MySQL
Storage        Distributed (auto 6 copies)     EBS
Max size       128 TB auto-scale               64 TB
Read Replica   Up to 15 (millisecond sync)     5 (async)
Failover time  < 30 sec                        1–2 min
Cost           ~20% more than RDS              Standard
New features   Serverless v2, Global Database  RDS basics

If scale / availability matter most, Aurora. If cost / simplicity matter, RDS PostgreSQL.

Aurora Serverless v2 is usage-based auto-scaling RDS — attractive for workloads with uneven traffic. Cold starts are nearly gone (the v1 weakness fixed).

Launching an RDS instance #

Create RDS PostgreSQL
aws rds create-db-instance \
  --db-instance-identifier my-postgres \
  --db-instance-class db.t3.micro \
  --engine postgres \
  --engine-version 16.4 \
  --master-username postgres \
  --master-user-password "very-strong-password" \
  --allocated-storage 20 \
  --storage-type gp3 \
  --vpc-security-group-ids sg-0abc... \
  --db-subnet-group-name my-db-subnet-group \
  --backup-retention-period 7 \
  --multi-az \
  --no-publicly-accessible

Common options:

Option                   Description
db-instance-class        Instance type. db.t3 (small), db.m5 (general), db.r5 (memory)
engine / engine-version  Engine and version
allocated-storage        Disk GB. storage-type=gp3 is the default
multi-az                 Standby auto-placed in another AZ
publicly-accessible      Public IP. false in production
backup-retention-period  Auto-backup retention days (0–35)
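
Creation takes several minutes, so scripts usually wait before connecting. A minimal sketch, assuming the instance identifier from the command above; the function name wait_and_print_endpoint is made up for illustration.

```shell
# Block until the instance reports "available", then print its endpoint
# host and port (tab-separated) for use in connection strings.
wait_and_print_endpoint() {
  local id="$1"
  aws rds wait db-instance-available --db-instance-identifier "$id" &&
  aws rds describe-db-instances \
    --db-instance-identifier "$id" \
    --query 'DBInstances[0].Endpoint.[Address,Port]' \
    --output text
}

# Example:
#   wait_and_print_endpoint my-postgres
```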

DB Subnet Group #

RDS needs to know in advance which subnets it may place an instance in. That’s the DB Subnet Group, and it must cover at least two AZs. Usually it holds two or more private subnets across AZs.

Create a DB Subnet Group
aws rds create-db-subnet-group \
  --db-subnet-group-name my-db-subnet-group \
  --db-subnet-group-description "DB private subnets" \
  --subnet-ids subnet-0a1... subnet-0b2... subnet-0c3...

The DB sits in a private subnet (#1 VPC) and is never directly exposed to the internet. The DB security group allows inbound traffic only from the app server security group (an SG-to-SG rule).
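
The SG-to-SG rule can be sketched as a small shell function around aws ec2 authorize-security-group-ingress. The function name allow_app_to_db and both security group IDs are placeholders, and the port defaults to 5432 on the assumption that the engine is PostgreSQL.

```shell
# Allow inbound DB traffic into the DB security group only from the
# app security group. $1 = DB SG id, $2 = app SG id, $3 = port (optional).
allow_app_to_db() {
  local db_sg="$1" app_sg="$2" port="${3:-5432}"
  aws ec2 authorize-security-group-ingress \
    --group-id "$db_sg" \
    --protocol tcp \
    --port "$port" \
    --source-group "$app_sg"
}

# Example (hypothetical IDs):
#   allow_app_to_db sg-0db0000000000000 sg-0app000000000000
```

Referencing the app SG instead of a CIDR range means new app servers gain access as soon as they join that group, with no rule changes.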

Automated backups — the core value of managed #

The real value of RDS lives in backups.

Automated Backup #

If backup-retention-period > 0, automated backups are on.

  • Daily full backup (during the backup window)
  • Transaction logs saved every 5 minutes
  • Kept for the retention period (1–35 days)
  • Removed when the DB is deleted, unless you retain them (--no-delete-automated-backups) or keep the data in a final snapshot (SkipFinalSnapshot=false)

Point-in-Time Recovery (PITR) #

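RDS with automated backups on lets you restore to any point within the retention window, down to the second. Transaction logs are saved every 5 minutes, so the latest restorable time usually trails the present by up to about 5 minutes.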

Restore to a point 3 hours ago
aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier my-postgres \
  --target-db-instance-identifier my-postgres-restored \
  --restore-time 2026-04-21T08:30:00Z

Restore creates a new instance — the original stays intact. “I need exactly the state at 03:27 this morning” becomes entirely doable.
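
Before restoring, it helps to check how far forward the restorable window currently reaches, and afterwards to wait until the new instance is usable. A minimal sketch; the function names are made up for illustration, and the restored instance may still need its own security group and parameter group settings.

```shell
# Print the most recent time the instance can currently be restored to.
latest_restorable_time() {
  aws rds describe-db-instances \
    --db-instance-identifier "$1" \
    --query 'DBInstances[0].LatestRestorableTime' \
    --output text
}

# Restore to a NEW instance at the given UTC timestamp, then block
# until that instance is available. $1 = source, $2 = target, $3 = time.
restore_and_wait() {
  local source="$1" target="$2" when="$3"
  aws rds restore-db-instance-to-point-in-time \
    --source-db-instance-identifier "$source" \
    --target-db-instance-identifier "$target" \
    --restore-time "$when" &&
  aws rds wait db-instance-available \
    --db-instance-identifier "$target"
}

# Example:
#   latest_restorable_time my-postgres
#   restore_and_wait my-postgres my-postgres-restored 2026-04-21T08:30:00Z
```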

Manual Snapshot #

Snapshots you take explicitly, separate from automated backups. They survive even if the DB is deleted and have no retention limit.

Manual snapshot
aws rds create-db-snapshot \
  --db-instance-identifier my-postgres \
  --db-snapshot-identifier my-postgres-2026-04-21-prerelease

Operational uses:

  • Snapshot right before a major upgrade
  • Snapshot right before a big migration
  • Final snapshot when deleting
  • Copy across regions / accounts (for DR)
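
The cross-region copy in the last bullet can be sketched like this. The function name is hypothetical, and the ARN in the example reuses the account ID and region that appear elsewhere in this post. Encrypted snapshots also need --kms-key-id pointing at a key in the destination region, which this sketch leaves out.

```shell
# Copy a manual snapshot into another region for DR.
# $1 = source snapshot ARN, $2 = new snapshot name,
# $3 = source region,       $4 = destination region.
copy_snapshot_to_region() {
  local source_arn="$1" target_name="$2" source_region="$3" dest_region="$4"
  aws rds copy-db-snapshot \
    --source-db-snapshot-identifier "$source_arn" \
    --target-db-snapshot-identifier "$target_name" \
    --source-region "$source_region" \
    --region "$dest_region"
}

# Example (destination region chosen arbitrarily):
#   copy_snapshot_to_region \
#     arn:aws:rds:ap-northeast-2:123456789012:snapshot:my-postgres-2026-04-21-prerelease \
#     my-postgres-2026-04-21-prerelease-dr \
#     ap-northeast-2 ap-northeast-1
```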

Multi-AZ — high availability #

With --multi-az, RDS auto-replicates a standby into another AZ.

Shape of Multi-AZ
   ┌──────────────────────────────────┐
   │           VPC                    │
   │                                  │
   │    AZ a              AZ b        │
   │    ┌──────┐          ┌──────┐    │
   │    │ Pri  │ ◀══════▶ │Stand │    │
   │    │mary  │ sync repl│ by   │    │
   │    └──────┘          └──────┘    │
   │       ▲                          │
   │       │ DNS endpoint             │
   │       │ (auto failover)          │
   └───────┼──────────────────────────┘
       App servers

  • Synchronous replication — standby has every committed transaction
  • Auto failover on outage — within 30 sec to 2 min, standby becomes primary and the DNS endpoint repoints
  • Reads not load-balanced — standby is not used for reads (different from Aurora)
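
Failover is worth rehearsing before a real outage. A minimal sketch of a failover drill, assuming a Multi-AZ instance; the function names are made up for illustration. Rebooting with --force-failover makes RDS switch to the standby, so run it only when a short interruption is acceptable.

```shell
# Print the primary AZ and the standby AZ of a Multi-AZ instance.
show_azs() {
  aws rds describe-db-instances \
    --db-instance-identifier "$1" \
    --query 'DBInstances[0].[AvailabilityZone,SecondaryAvailabilityZone]' \
    --output text
}

# Trigger a failover to the standby by rebooting with --force-failover.
force_failover() {
  aws rds reboot-db-instance \
    --db-instance-identifier "$1" \
    --force-failover
}

# Example drill:
#   show_azs my-postgres        # before
#   force_failover my-postgres
#   show_azs my-postgres        # after: the two AZs should have swapped
```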

Cost of Multi-AZ #

Multi-AZ roughly doubles instance and storage cost, because the standby is a full second copy. Single AZ for learning / side projects, Multi-AZ for production.

Multi-AZ Cluster (option) #

The newer Multi-AZ DB Cluster for PostgreSQL / MySQL has readable standbys and failover typically under 35 seconds. The trade-off is that it spans 3 AZs, so you pay for three instances.

Read Replica — read distribution #

A Read Replica is an asynchronously replicated read-only copy. Distributes read load on read-heavy workloads.

Create a Read Replica
aws rds create-db-instance-read-replica \
  --db-instance-identifier my-postgres-read-1 \
  --source-db-instance-identifier my-postgres \
  --availability-zone ap-northeast-2c

Properties:

  • Async replication — slight lag (usually ms–seconds)
  • Cross-region possible — global read distribution / DR
  • Up to 5 (Aurora has 15)
  • Can be promoted to a standalone instance
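
Promotion is a manual, one-way step: once promoted, the replica stops replicating and becomes an independent instance. A minimal sketch with hypothetical function names:

```shell
# List the read replicas attached to a source instance.
list_replicas() {
  aws rds describe-db-instances \
    --db-instance-identifier "$1" \
    --query 'DBInstances[0].ReadReplicaDBInstanceIdentifiers' \
    --output text
}

# Promote a replica to a standalone instance and wait until it is usable.
promote_replica() {
  aws rds promote-read-replica --db-instance-identifier "$1" &&
  aws rds wait db-instance-available --db-instance-identifier "$1"
}

# Example:
#   list_replicas my-postgres
#   promote_replica my-postgres-read-1
```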

Where Read Replica fits #

Use                        Fit
Read traffic distribution  ⭐⭐⭐
Analytics / reporting      ⭐⭐⭐
Backup / DR                ⭐⭐ (snapshots are safer)
Auto failover              ❌ (Read Replicas don’t auto-promote)

If read traffic isn’t huge, Multi-AZ Cluster is simpler than Read Replica.

Parameter group and option group #

DB engine settings (max_connections, shared_buffers, etc.) are managed in RDS via parameter groups.

Parameter Group #

Create a custom parameter group
aws rds create-db-parameter-group \
  --db-parameter-group-name my-postgres-16-params \
  --db-parameter-group-family postgres16 \
  --description "Custom params for my workload"

aws rds modify-db-parameter-group \
  --db-parameter-group-name my-postgres-16-params \
  --parameters \
    "ParameterName=max_connections,ParameterValue=200,ApplyMethod=pending-reboot" \
    "ParameterName=log_statement,ParameterValue=ddl,ApplyMethod=immediate"

Types:

  • Static — applies after DB reboot (max_connections, …)
  • Dynamic — applies immediately (log_statement, …)
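
A custom parameter group does nothing until it is attached to an instance, and static parameters stay pending until a reboot. A minimal sketch of attach, check, and reboot; the function names are made up for illustration.

```shell
# Attach a parameter group to an instance.
attach_parameter_group() {
  aws rds modify-db-instance \
    --db-instance-identifier "$1" \
    --db-parameter-group-name "$2" \
    --apply-immediately
}

# Print each attached parameter group with its apply status
# (for example in-sync or pending-reboot).
parameter_apply_status() {
  aws rds describe-db-instances \
    --db-instance-identifier "$1" \
    --query 'DBInstances[0].DBParameterGroups[].[DBParameterGroupName,ParameterApplyStatus]' \
    --output text
}

# Reboot only when some parameter change is waiting for one.
reboot_if_pending() {
  if parameter_apply_status "$1" | grep -q 'pending-reboot'; then
    aws rds reboot-db-instance --db-instance-identifier "$1"
  fi
}

# Example:
#   attach_parameter_group my-postgres my-postgres-16-params
#   reboot_if_pending my-postgres
```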

Common parameters:

Parameter        PostgreSQL                  MySQL
Max connections  max_connections             max_connections
Query logging    log_min_duration_statement  slow_query_log
Memory           shared_buffers, work_mem    innodb_buffer_pool_size
Timezone         timezone                    time_zone

Option Group #

The group for enabling engine-specific extras (e.g., SSIS for SQL Server, OEM for Oracle). Hardly used for PostgreSQL / MySQL.

Upgrades — operational work #

RDS splits engine upgrades into two kinds: minor and major.

Minor Upgrade — safe #

Like 16.3 → 16.4. Usually security patches + small improvements. Toggle auto-apply and they happen during the maintenance window.

Enable auto minor upgrades
aws rds modify-db-instance \
  --db-instance-identifier my-postgres \
  --auto-minor-version-upgrade \
  --apply-immediately

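Downtime is typically 30 sec to 5 min. Multi-AZ does not automatically shorten an engine upgrade: on a Multi-AZ DB instance, primary and standby are upgraded together. The rolling pattern (standby first → failover → old primary) applies to OS maintenance and to Multi-AZ DB clusters.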
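
Automatic minor upgrades are applied in the instance's maintenance window, so it pays to pick that window yourself and check what is queued. A minimal sketch; the function names are hypothetical and the window in the example is an arbitrary choice (UTC, at least 30 minutes long).

```shell
# List pending maintenance actions: resource ARN and the action name.
pending_maintenance() {
  aws rds describe-pending-maintenance-actions \
    --query 'PendingMaintenanceActions[].[ResourceIdentifier,PendingMaintenanceActionDetails[0].Action]' \
    --output text
}

# Set the weekly maintenance window. Format: ddd:hh24:mi-ddd:hh24:mi (UTC).
set_maintenance_window() {
  aws rds modify-db-instance \
    --db-instance-identifier "$1" \
    --preferred-maintenance-window "$2"
}

# Example:
#   set_maintenance_window my-postgres sun:18:00-sun:18:30
#   pending_maintenance
```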

Major Upgrade — careful #

Like PostgreSQL 16 → 17. Things can break. Procedure:

  1. Take a manual snapshot (for rollback)
  2. Try the same version migration in a test environment
  3. Upgrade Read Replicas first (when possible)
  4. Schedule downtime outside business hours
  5. aws rds modify-db-instance --db-instance-identifier my-postgres --engine-version 17.0 --allow-major-version-upgrade --apply-immediately
  6. Monitor the upgrade
  7. On issues, restore a new instance from the snapshot

Before a major upgrade, audit compatibility issues like PostgreSQL deprecated SQL / MySQL strict mode changes.
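
Not every version can jump straight to every newer major version, so it is worth asking RDS which targets are valid from where you are. A minimal sketch with a hypothetical function name:

```shell
# Print the major versions that the given engine version can be
# upgraded to directly. $1 = engine, $2 = current engine version.
major_upgrade_targets() {
  local engine="$1" version="$2"
  aws rds describe-db-engine-versions \
    --engine "$engine" \
    --engine-version "$version" \
    --query 'DBEngineVersions[0].ValidUpgradeTarget[?IsMajorVersionUpgrade==`true`].EngineVersion' \
    --output text
}

# Example:
#   major_upgrade_targets postgres 16.4
```

Use one of the versions it prints as the --engine-version value in the upgrade command.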

Blue/Green Deployment #

RDS’s Blue/Green Deployment is a newer approach to reducing downtime for major upgrades and large changes. A replica (green) is built in the background, and only the final cutover is brief.

Create a Blue/Green deployment
aws rds create-blue-green-deployment \
  --blue-green-deployment-name my-postgres-bg \
  --source arn:aws:rds:ap-northeast-2:123456789012:db:my-postgres \
  --target-engine-version 17.0
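
Creating the deployment only builds the green side; the cutover is a separate step. A minimal sketch that switches over only once the deployment reports AVAILABLE. The function names are made up, and the identifier in the example is a placeholder for the bgd-... ID that the create call returns.

```shell
# Print the current status of a Blue/Green deployment.
bluegreen_status() {
  aws rds describe-blue-green-deployments \
    --blue-green-deployment-identifier "$1" \
    --query 'BlueGreenDeployments[0].Status' \
    --output text
}

# Switch over only when the green environment is ready.
# The timeout is in seconds; 300 is used here as an example value.
switchover_when_ready() {
  local id="$1" status
  status="$(bluegreen_status "$id")"
  if [ "$status" = "AVAILABLE" ]; then
    aws rds switchover-blue-green-deployment \
      --blue-green-deployment-identifier "$id" \
      --switchover-timeout 300
  else
    echo "deployment $id is $status, not switching over" >&2
    return 1
  fi
}

# Example (placeholder identifier):
#   switchover_when_ready bgd-EXAMPLE
```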

Performance Insights — the performance tool #

RDS’s performance monitoring tool. It shows which SQL statements consume the most time, visualized on a graph.

Where Performance Insights sits
Time axis ──▶
DB Load ▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮
        │ ── SELECT ... FROM users WHERE ...
        │ ── UPDATE products SET ...
        │ ── lock:relation
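
Performance Insights is switched on per instance. A minimal sketch, assuming the instance class supports it; the function name is hypothetical, and 7 days is used as the retention value here.

```shell
# Enable Performance Insights with 7-day retention.
enable_performance_insights() {
  aws rds modify-db-instance \
    --db-instance-identifier "$1" \
    --enable-performance-insights \
    --performance-insights-retention-period 7 \
    --apply-immediately
}

# Example:
#   enable_performance_insights my-postgres
```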

RDS Proxy — connection pool #

When Lambda or containers connect to RDS, opening a new connection (TCP / TLS handshake plus authentication) on every invocation is costly. RDS Proxy is a managed connection pool that removes most of this overhead by reusing connections.

Where it helps:

  • Lambda + RDS — new connection per invocation → pool via Proxy
  • Container auto-scaling — connections explode as instances multiply
  • Auto-recovery on failover

Cost is per vCPU-hour — overkill for small workloads.

Common pitfalls #

1) Public RDS #

publicly-accessible=true and SG 0.0.0.0/0 → brute force in days. Production: always private subnet + only the app SG.

2) master-user-password in git #

Plain password in scripts / Terraform → leaked. Use Secrets Manager (Advanced #6).
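
One way to keep the password out of scripts entirely is to let RDS generate and store it in Secrets Manager. A minimal sketch for an existing instance; the function names are made up, and this assumes the instance's configuration allows managed master passwords.

```shell
# Hand the master password over to Secrets Manager.
enable_managed_password() {
  aws rds modify-db-instance \
    --db-instance-identifier "$1" \
    --manage-master-user-password \
    --apply-immediately
}

# Print the ARN of the secret that now holds the master credentials.
master_secret_arn() {
  aws rds describe-db-instances \
    --db-instance-identifier "$1" \
    --query 'DBInstances[0].MasterUserSecret.SecretArn' \
    --output text
}

# Example:
#   enable_managed_password my-postgres
#   master_secret_arn my-postgres
```

The same --manage-master-user-password flag can replace --master-user-password at creation time.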

3) Multi-AZ off in production #

Cost-cut and turned Multi-AZ off → 1–2 hour DB outage during AZ failure. Production: turn it on.

4) backup-retention 0 #

Cost-cut and disabled automated backups → PITR is off too. Recovery impossible after an incident. Recommend at least 7 days.

5) Deleting without final snapshot #

Deleting with --skip-final-snapshot for speed → permanent data loss. Force final snapshot in automation like terraform destroy.
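
Two guard rails help here: deletion protection on the instance, and a delete helper that always names a final snapshot. A minimal sketch; the function names and the snapshot naming scheme are assumptions.

```shell
# Turn on deletion protection so a stray delete call is rejected.
protect_instance() {
  aws rds modify-db-instance \
    --db-instance-identifier "$1" \
    --deletion-protection
}

# Delete an instance, always taking a timestamped final snapshot first.
delete_with_final_snapshot() {
  local id="$1"
  local snap="${id}-final-$(date -u +%Y%m%d-%H%M%S)"
  aws rds delete-db-instance \
    --db-instance-identifier "$id" \
    --no-skip-final-snapshot \
    --final-db-snapshot-identifier "$snap"
}

# Example:
#   protect_instance my-postgres
#   delete_with_final_snapshot my-postgres
```

With deletion protection on, the delete call fails until protection is switched off again (--no-deletion-protection), which is the point.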

6) Storage Auto-Scaling off #

Disk fills up at 3am → writes fail. Turn on auto-scaling with max-allocated-storage.

Enable Storage Auto-Scaling
aws rds modify-db-instance \
  --db-instance-identifier my-postgres \
  --max-allocated-storage 200

7) Read Replica as a failover #

Read Replicas don’t auto-failover. They need manual promote. Auto failover is Multi-AZ.

8) Connection leak #

App doesn’t close connections, fills max_connections → new requests rejected. Check PgBouncer / RDS Proxy or the app pool config.
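
Several of the pitfalls above (public access, Multi-AZ off, backups disabled, storage auto-scaling off) are visible in describe-db-instances, so a small audit can flag them across an account. A minimal sketch with hypothetical function names; it relies on the CLI's text output printing booleans as True / False and missing values as None.

```shell
# One tab-separated row per instance:
# id, PubliclyAccessible, MultiAZ, BackupRetentionPeriod, MaxAllocatedStorage
audit_rds() {
  aws rds describe-db-instances \
    --query 'DBInstances[].[DBInstanceIdentifier,PubliclyAccessible,MultiAZ,BackupRetentionPeriod,MaxAllocatedStorage]' \
    --output text
}

# Print one line per risky setting found.
flag_risky() {
  audit_rds | awk -F'\t' '
    $2 == "True"  { print $1 ": publicly accessible" }
    $3 == "False" { print $1 ": Multi-AZ off" }
    $4 == "0"     { print $1 ": automated backups disabled" }
    $5 == "None"  { print $1 ": storage auto-scaling off" }
  '
}

# Example:
#   flag_risky
```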

Wrap-up #

What we took home this time:

  • RDS = AWS’s managed relational DB. PostgreSQL / MySQL / Aurora are the common picks
  • Aurora = AWS’s own engine. Distributed storage, faster failover, more RRs
  • Place in private subnets via DB Subnet Group. publicly-accessible=false is the production default
  • Automated backup + PITR = restore to any point in the retention window (latest restorable time trails by ~5 min)
  • Manual Snapshot = explicit, survives DB deletion
  • Multi-AZ = sync replication + auto failover, but standby is unreadable
  • Read Replica = async copy, for read distribution / analytics. No auto failover
  • Manage engine settings in parameter groups. Static / Dynamic difference
  • Minor upgrade = can be auto. Major upgrade = use Blue/Green
  • Reinforce with Performance Insights + RDS Proxy
  • Pitfalls — public, password, Multi-AZ off, backup 0, missing final snapshot, storage auto-scale, RR-as-failover, connection leak

Next — Route 53 #

The DB piece is set. Now to the place where users first meet our system — DNS.

In #5 Route 53 — domains and DNS we’ll line up domain registration / Hosted Zones / record types and Aliases / routing policies (Failover / Latency / Geolocation).
