AWS in Practice #2: RDS Integration and Migration Operations
In #1 we put the blog API up on ECS Fargate, but its DB was in memory. This post moves that to RDS Postgres Multi-AZ and lays out how to run migrations without killing production traffic.
If Intermediate #4 RDS was the first RDS instance we created in the console, this is the production version — VPC isolation, Secrets injection, and migration patterns all in one go.
The big picture — the DB’s layout #
Fargate Task (Private Subnet)
│
│ DATABASE_URL (from Secrets Manager)
▼
sg-rds ← 5432 ← sg-fargate
│
▼
RDS Postgres Multi-AZ (DB Subnet Group)
Primary (AZ-a)
Standby (AZ-c) ← Promoted on failoverThree core principles:
- The DB isn’t visible from the internet — Private subnet or an isolated DB Subnet Group
- Passwords aren’t in code — Secrets Manager → Task Definition’s
secretsfield - Migrations are split into a deploy step — no
migrateon container start
1) Create the DB Subnet Group #
RDS can only live inside a DB Subnet Group. It needs subnets across at least 2 AZs (whether you turn on Multi-AZ or not).
aws rds create-db-subnet-group \
--db-subnet-group-name blog-db-subnets \
--db-subnet-group-description "Blog API DB subnets" \
--subnet-ids subnet-db-a subnet-db-cThese subnets must be without a route to the internet gateway. External access only goes through the ALB / Fargate.
2) Dedicated DB Security Group #
sg-rds inbound:
port 5432 ← sg-fargate ← Fargate tasks only
port 5432 ← sg-bastion ← (optional) operator jump host
sg-rds outbound:
not needed (narrow from default)Important: source as sg-fargate (another SG), not an IP range. No matter how many Fargate tasks you have, the rule applies automatically. No SG updates needed when CIDRs change.
aws ec2 create-security-group \
--group-name sg-rds \
--description "RDS allow from Fargate" \
--vpc-id $VPC_ID
aws ec2 authorize-security-group-ingress \
--group-id $SG_RDS \
--protocol tcp --port 5432 \
--source-group $SG_FARGATE3) Create the password in Secrets Manager #
Same pattern as Advanced #6. Two approaches:
| Approach | Description |
|---|---|
| Manual creation | Generate a strong random password and enter it in RDS |
| RDS auto-integration with Secrets Manager | Check “Manage in Secrets Manager” in the RDS console → password auto-generated + auto-rotation |
Auto-integration is recommended for production. Manual is for learning.
aws secretsmanager create-secret \
--name blog-api/db \
--secret-string '{
"username": "blog_admin",
"password": "S3cr3t-r4nd0m-32-bytes-...",
"engine": "postgres",
"host": "PLACEHOLDER",
"port": 5432,
"dbname": "blogdb"
}'After the DB is created, just update host.
4) Create the RDS instance #
aws rds create-db-instance \
--db-instance-identifier blog-db \
--db-instance-class db.t4g.small \
--engine postgres --engine-version 16.4 \
--allocated-storage 20 --storage-type gp3 \
--master-username blog_admin \
--master-user-password "S3cr3t-r4nd0m-32-bytes-..." \
--db-name blogdb \
--vpc-security-group-ids $SG_RDS \
--db-subnet-group-name blog-db-subnets \
--multi-az \
--publicly-accessible false \
--backup-retention-period 7 \
--deletion-protection \
--enable-performance-insights \
--storage-encrypted \
--auto-minor-version-upgradeProduction options:
| Option | Meaning |
|---|---|
db.t4g.small | ARM-based, ~20% cheaper than x86 (t3), good starting point for small workloads |
gp3 | Latest SSD, allows separate IOPS / throughput tuning |
--multi-az | Automatic standby — failover takes 60–120s |
--publicly-accessible false | No internet access. Always false |
--backup-retention-period 7 | 7-day automatic backups (PITR available) |
--deletion-protection | Prevents accidental deletion. Even from CLI, requires a separate option |
--enable-performance-insights | Per-query performance analysis (free for 7 days) |
--storage-encrypted | Disk encryption with KMS — always on |
What failover looks like #
How Multi-AZ behaves:
Primary (AZ-a) ─ sync replication ─▶ Standby (AZ-c)
│ │
× failure │
▼ ▼
Failover triggers Standby → Primary promoted
│
▼
DNS endpoint auto-updates to new Primary (60–120s)From the application’s view, the endpoint hostname stays the same (blog-db.xxxx.ap-northeast-2.rds.amazonaws.com). DNS TTL is short (~5s) so connections naturally route to the new Primary after failover. But the connection pool needs to validate disconnected connections — zombie connections in the pool produce 504s for a while.
5) Inject Secrets into the Task #
Don’t put the password in plain text in environment variables; reference the ARN through secrets in the Task Definition.
{
"containerDefinitions": [
{
"name": "api",
"image": "...",
"secrets": [
{
"name": "DATABASE_URL",
"valueFrom": "arn:aws:secretsmanager:ap-northeast-2:123456789012:secret:blog-api/db-AbCdEf:url::"
}
],
"environment": [
{ "name": "ENVIRONMENT", "value": "production" }
]
}
]
}Shape of valueFrom:
arn:aws:secretsmanager:<region>:<account>:secret:<name>-<random>:<json-key>::To pull just one key from the JSON, specify the key like :url::. To pull the whole JSON, leave the key part empty.
Recommended secret shape #
{
"url": "postgresql://blog_admin:PASSWORD@blog-db.xxxx.rds.amazonaws.com:5432/blogdb",
"username": "blog_admin",
"password": "PASSWORD",
"host": "blog-db.xxxx.rds.amazonaws.com",
"port": 5432,
"dbname": "blogdb"
}Having a pre-assembled url key makes things easier on the app side — read in one line.
IAM permissions #
executionRole needs access to Secrets.
{
"Version": "2012-10-17",
"Statement": [{
"Effect": "Allow",
"Action": ["secretsmanager:GetSecretValue"],
"Resource": "arn:aws:secretsmanager:ap-northeast-2:123456789012:secret:blog-api/db-*"
}]
}Without this the task ends up STOPPED with
ResourceInitializationError— exactly the pitfall covered in #1’s pitfalls.
6) Verify the first connection #
psql "postgresql://blog_admin@blog-db.xxxx.ap-northeast-2.rds.amazonaws.com:5432/blogdb"
# Enter password → SELECT 1;You can only access from inside the VPC, so you need one of: a jump host / VPC-mode CloudShell / Session Manager port forwarding.
Session Manager port forward (recommended) #
A clean way to reach RDS through an ECS task — no bastion host:
aws ecs update-service \
--cluster blog-cluster --service blog-api \
--enable-execute-command --force-new-deployment
# Get into the task
TASK=$(aws ecs list-tasks --cluster blog-cluster --service-name blog-api \
--query 'taskArns[0]' --output text)
aws ecs execute-command --cluster blog-cluster --task $TASK \
--container api --interactive --command "/bin/sh"Run psql directly inside the task — powerful for ad-hoc debugging. Needs SSM / iam:PassRole / ecs:ExecuteCommand permissions.
7) The operational side of migrations #
Most incidents come from migrations. Two things matter.
Where do you run them? #
A) Auto on container start (CMD before uvicorn/gunicorn)
─ Simple. Small teams.
─ Risk: N replicas run simultaneously → locks / race
─ Risk: failure means the container itself doesn't come up → rollback awkward
B) Separate ECS RunTask (deploy stage)
─ One execution. No locks/races.
─ Result tracked separately.
─ Recommended.
C) CodeDeploy lifecycle hook
─ As part of blue/green deploy
─ pre-traffic-shift or before-allow-trafficProduction recommendation is B (RunTask). CI pushes a new image → migration RunTask → on success, Service update.
aws ecs run-task \
--cluster blog-cluster \
--task-definition blog-api-migrate:5 \
--launch-type FARGATE \
--network-configuration "awsvpcConfiguration={
subnets=[subnet-aaa],
securityGroups=[sg-fargate],
assignPublicIp=ENABLED
}" \
--count 1 \
--started-by "deploy-$(git rev-parse --short HEAD)"blog-api-migrate is a separate Task Definition — same image, different command.
{
"family": "blog-api-migrate",
"containerDefinitions": [{
"name": "migrate",
"image": "...:v123",
"command": ["alembic", "upgrade", "head"],
"essential": true
}]
}For Django: ["python", "manage.py", "migrate", "--noinput"].
Backward Compatible — Blue/Green-friendly #
During a deployment, the previous version and the new version are running simultaneously (the very nature of a rolling deployment). For both versions to use the same DB without breaking, migrations must be backward and forward compatible.
| Risky change | Safe flow (3 steps) |
|---|---|
| Drop column | 1) Deploy code that stops using it → 2) drop column migration → 3) cleanup |
| Rename column | 1) Add new column + dual write → 2) read only the new column → 3) drop the old |
| Add NOT NULL | 1) Allow NULL + backfill default → 2) make code always fill → 3) add NOT NULL |
| Big index | CREATE INDEX CONCURRENTLY (PG) — no lock. Alembic op.create_index() option |
The principle: a migration completes within a bounded window, and the resulting DB schema must work correctly with both the old version and the new version of the code.
ALTER TABLE lock pitfalls
#
A common case in PostgreSQL:
ALTER TABLE posts ADD COLUMN status VARCHAR(20) NOT NULL DEFAULT 'draft';In PG 11+, DEFAULT is instant (just metadata change). But some patterns still rewrite the entire row → big-table lock → production frozen.
The safe approach:
-- 1) Just add the column (allow NULL)
ALTER TABLE posts ADD COLUMN status VARCHAR(20);
-- 2) Backfill (in batches)
UPDATE posts SET status = 'draft' WHERE status IS NULL AND id BETWEEN 1 AND 10000;
-- ... repeat
-- 3) NOT NULL + default
ALTER TABLE posts ALTER COLUMN status SET DEFAULT 'draft';
ALTER TABLE posts ALTER COLUMN status SET NOT NULL;The same operational mindset as N+1 / index issues in Django Advanced #3 query optimization.
8) Backup and recovery #
RDS automated backups support PITR (Point-In-Time Recovery).
| Type | Description |
|---|---|
| Automated backup | During retention period (7–35 days), daily + WAL every 5 minutes |
| Manual snapshot | Explicit — independent of retention, kept until explicitly deleted |
| PITR | Restore to any point within the automated backup window (creates a new instance) |
aws rds create-db-snapshot \
--db-snapshot-identifier blog-db-pre-deploy-$(date +%Y%m%d-%H%M) \
--db-instance-identifier blog-dbTake one before a risky migration, and in the event of an incident, restore to a new instance → swap DNS, done.
9) Connection pools and IAM auth #
Two more pieces come in as production traffic grows.
RDS Proxy #
If each task maintains its own pool, task count × pool size = total connections. PostgreSQL connections are memory-heavy, so this can quickly exhaust the DB.
In the line of Advanced #1 ECS, RDS Proxy steps in.
Fargate (10 tasks × pool 20) ──▶ RDS Proxy ──▶ RDS
Total 200 connections ▼
Actual RDS connections: 20–50Pros: shorter failover (~30s), connection burst protection, IAM auth option. Cons: hourly cost (~$0.015 vCPU/h).
IAM auth #
Use IAM tokens instead of a password to access RDS. Access is controlled by the task role’s IAM policy. It’s safer, but you have to manage token TTL (15 min), ORM compatibility, and TLS enforcement — operations become a step more complex. Secrets Manager is sufficient for small systems.
Pitfalls — what often breaks in production #
1) password authentication failed — rotation issue
#
Secrets Manager auto-rotation is on, but tasks have the old password cached. The Task Definition’s secrets field fetches the ARN on each task start, but if the app reads the secret once at startup and holds it in memory, the value never refreshes.
Fix:
- Read once on app start + once more on connection failure
- Restart tasks (deployment) at rotation time
2) RDS Free Tier ended but still running #
Free Tier is 12 months — once over, ~$30/month is auto-billed. Catch it early with Basics #3 billing alerts.
3) Single-AZ in production #
To save costs, dropping --multi-az means AZ failure = DB down = all traffic down. Multi-AZ is recommended even in small production. Loss from one incident vs ~$30–50/month.
4) Forgot deletion-protection and accidentally deleted
#
One click in the console and RDS is gone. Always enable deletion-protection. Disabling it requires a separate update call → an intentional act.
5) terraform destroy wipes RDS too
#
Covered in #4 IaC. Protect with terraform’s lifecycle block.
resource "aws_db_instance" "blog" {
# ...
deletion_protection = true
skip_final_snapshot = false
final_snapshot_identifier = "blog-db-final-${formatdate("YYYYMMDD-hhmm", timestamp())}"
lifecycle {
prevent_destroy = true
}
}6) Migration lock that doesn’t release #
PostgreSQL’s ALTER TABLE waits indefinitely without a lock_timeout. A lock held during production traffic blocks all queries.
SET lock_timeout = '5s';
ALTER TABLE posts ADD COLUMN status TEXT;
-- If lock can't be acquired within 5s, error outIn Alembic: op.execute("SET lock_timeout = '5s'") at the start of the migration.
Wrapping up #
What we covered in this post:
- The DB’s layout — Private DB Subnet Group, only sg-fargate inbound on sg-rds
- Secrets Manager — no plaintext password, inject ARN via Task Definition’s
secrets, executionRole permissions - RDS production options — Multi-AZ, gp3, deletion-protection, storage-encrypted, performance-insights, auto-minor-upgrade
- Failover — 60–120s, endpoint stays the same, connection pool needs validation
- Session Manager — psql from a task without bastion,
enableExecuteCommand - Migration — separate RunTask not container start, blue/green-compatible, ALTER TABLE lock splitting
- Backups — automated PITR + manual snapshots before risky deploys
- Connection pools — RDS Proxy for big traffic, Secrets enough for small systems
- Pitfalls — password rotation cache, Free Tier expiry, single-AZ, missing deletion-protection, infinite migration lock
Next — CI/CD #
You’ve now run the build → ECR push → Service update → migration flow once by hand. You can’t keep doing that by hand.
In #3 CI/CD — GitHub Actions + ECR + ECS we automate this flow with GitHub Actions OIDC, plus auto-rollback on deploy failure (deployment circuit breaker), CodeDeploy’s blue/green options — making a single git push the end of deployment.