AWS in Practice #2: RDS Integration and Migration Operations

Infrastructure AWS RDS Postgres Migrations

Sunday, May 3, 2026

10 min read

In #1 we put the blog API up on ECS Fargate, but its DB was in memory. This post moves that to RDS Postgres Multi-AZ and lays out how to run migrations without killing production traffic.

If Intermediate #4 RDS was the first RDS instance we created in the console, this is the production version — VPC isolation, Secrets injection, and migration patterns all in one go.

The big picture — the DB’s layout #

What this post adds

Fargate Task (Private Subnet)
    │
    │ DATABASE_URL  (from Secrets Manager)
    ▼
sg-rds  ← 5432 ← sg-fargate
    │
    ▼
RDS Postgres Multi-AZ  (DB Subnet Group)
    Primary  (AZ-a)
    Standby  (AZ-c)   ← Promoted on failover

Three core principles:

The DB isn’t visible from the internet — Private subnet or an isolated DB Subnet Group
Passwords aren’t in code — Secrets Manager → Task Definition’s secrets field
Migrations are split into a deploy step — no migrate on container start

1) Create the DB Subnet Group #

RDS can only live inside a DB Subnet Group. It needs subnets across at least 2 AZs (whether you turn on Multi-AZ or not).

DB Subnet Group

aws rds create-db-subnet-group \
  --db-subnet-group-name blog-db-subnets \
  --db-subnet-group-description "Blog API DB subnets" \
  --subnet-ids subnet-db-a subnet-db-c

These subnets must be without a route to the internet gateway. External access only goes through the ALB / Fargate.

2) Dedicated DB Security Group #

The SG's layout

sg-rds  inbound:
   port 5432  ← sg-fargate     ← Fargate tasks only
   port 5432  ← sg-bastion     ← (optional) operator jump host

sg-rds  outbound:
   not needed (narrow from default)

Important: source as sg-fargate (another SG), not an IP range. No matter how many Fargate tasks you have, the rule applies automatically. No SG updates needed when CIDRs change.

Create the DB SG

aws ec2 create-security-group \
  --group-name sg-rds \
  --description "RDS allow from Fargate" \
  --vpc-id $VPC_ID

aws ec2 authorize-security-group-ingress \
  --group-id $SG_RDS \
  --protocol tcp --port 5432 \
  --source-group $SG_FARGATE

3) Create the password in Secrets Manager #

Same pattern as Advanced #6. Two approaches:

Approach	Description
Manual creation	Generate a strong random password and enter it in RDS
RDS auto-integration with Secrets Manager	Check “Manage in Secrets Manager” in the RDS console → password auto-generated + auto-rotation

Auto-integration is recommended for production. Manual is for learning.

Manual — create Secrets

aws secretsmanager create-secret \
  --name blog-api/db \
  --secret-string '{
    "username": "blog_admin",
    "password": "S3cr3t-r4nd0m-32-bytes-...",
    "engine": "postgres",
    "host": "PLACEHOLDER",
    "port": 5432,
    "dbname": "blogdb"
  }'

After the DB is created, just update host.

4) Create the RDS instance #

RDS Postgres Multi-AZ

aws rds create-db-instance \
  --db-instance-identifier blog-db \
  --db-instance-class db.t4g.small \
  --engine postgres --engine-version 16.4 \
  --allocated-storage 20 --storage-type gp3 \
  --master-username blog_admin \
  --master-user-password "S3cr3t-r4nd0m-32-bytes-..." \
  --db-name blogdb \
  --vpc-security-group-ids $SG_RDS \
  --db-subnet-group-name blog-db-subnets \
  --multi-az \
  --publicly-accessible false \
  --backup-retention-period 7 \
  --deletion-protection \
  --enable-performance-insights \
  --storage-encrypted \
  --auto-minor-version-upgrade

Production options:

Option	Meaning
`db.t4g.small`	ARM-based, ~20% cheaper than x86 (`t3`), good starting point for small workloads
`gp3`	Latest SSD, allows separate IOPS / throughput tuning
`--multi-az`	Automatic standby — failover takes 60–120s
`--publicly-accessible false`	No internet access. Always false
`--backup-retention-period 7`	7-day automatic backups (PITR available)
`--deletion-protection`	Prevents accidental deletion. Even from CLI, requires a separate option
`--enable-performance-insights`	Per-query performance analysis (free for 7 days)
`--storage-encrypted`	Disk encryption with KMS — always on

What failover looks like #

How Multi-AZ behaves:

Auto-failover during incident

Primary (AZ-a)  ─ sync replication ─▶  Standby (AZ-c)
    │                                    │
    × failure                            │
    ▼                                    ▼
Failover triggers                  Standby → Primary promoted
                                        │
                                        ▼
DNS endpoint auto-updates to new Primary (60–120s)

From the application’s view, the endpoint hostname stays the same (blog-db.xxxx.ap-northeast-2.rds.amazonaws.com). DNS TTL is short (~5s) so connections naturally route to the new Primary after failover. But the connection pool needs to validate disconnected connections — zombie connections in the pool produce 504s for a while.

5) Inject Secrets into the Task #

Don’t put the password in plain text in environment variables; reference the ARN through secrets in the Task Definition.

task-definition.json (excerpt)

{
  "containerDefinitions": [
    {
      "name": "api",
      "image": "...",
      "secrets": [
        {
          "name": "DATABASE_URL",
          "valueFrom": "arn:aws:secretsmanager:ap-northeast-2:123456789012:secret:blog-api/db-AbCdEf:url::"
        }
      ],
      "environment": [
        { "name": "ENVIRONMENT", "value": "production" }
      ]
    }
  ]
}

Shape of valueFrom:

arn:aws:secretsmanager:<region>:<account>:secret:<name>-<random>:<json-key>::

To pull just one key from the JSON, specify the key like :url::. To pull the whole JSON, leave the key part empty.

Recommended secret shape #

blog-api/db secret JSON

{
  "url": "postgresql://blog_admin:PASSWORD@blog-db.xxxx.rds.amazonaws.com:5432/blogdb",
  "username": "blog_admin",
  "password": "PASSWORD",
  "host": "blog-db.xxxx.rds.amazonaws.com",
  "port": 5432,
  "dbname": "blogdb"
}

Having a pre-assembled url key makes things easier on the app side — read in one line.

IAM permissions #

executionRole needs access to Secrets.

executionRole policy addition

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": ["secretsmanager:GetSecretValue"],
    "Resource": "arn:aws:secretsmanager:ap-northeast-2:123456789012:secret:blog-api/db-*"
  }]
}

Without this the task ends up STOPPED with ResourceInitializationError — exactly the pitfall covered in #1’s pitfalls.

6) Verify the first connection #

From bastion or CloudShell

psql "postgresql://blog_admin@blog-db.xxxx.ap-northeast-2.rds.amazonaws.com:5432/blogdb"
# Enter password → SELECT 1;

You can only access from inside the VPC, so you need one of: a jump host / VPC-mode CloudShell / Session Manager port forwarding.

Session Manager port forward (recommended) #

A clean way to reach RDS through an ECS task — no bastion host:

Enable enableExecuteCommand on the Fargate task

aws ecs update-service \
  --cluster blog-cluster --service blog-api \
  --enable-execute-command --force-new-deployment

# Get into the task
TASK=$(aws ecs list-tasks --cluster blog-cluster --service-name blog-api \
   --query 'taskArns[0]' --output text)
aws ecs execute-command --cluster blog-cluster --task $TASK \
   --container api --interactive --command "/bin/sh"

Run psql directly inside the task — powerful for ad-hoc debugging. Needs SSM / iam:PassRole / ecs:ExecuteCommand permissions.

7) The operational side of migrations #

Most incidents come from migrations. Two things matter.

Where do you run them? #

Comparing options

A) Auto on container start (CMD before uvicorn/gunicorn)
   ─ Simple. Small teams.
   ─ Risk: N replicas run simultaneously → locks / race
   ─ Risk: failure means the container itself doesn't come up → rollback awkward

B) Separate ECS RunTask (deploy stage)
   ─ One execution. No locks/races.
   ─ Result tracked separately.
   ─ Recommended.

C) CodeDeploy lifecycle hook
   ─ As part of blue/green deploy
   ─ pre-traffic-shift or before-allow-traffic

Production recommendation is B (RunTask). CI pushes a new image → migration RunTask → on success, Service update.

Run a migration task once

aws ecs run-task \
  --cluster blog-cluster \
  --task-definition blog-api-migrate:5 \
  --launch-type FARGATE \
  --network-configuration "awsvpcConfiguration={
       subnets=[subnet-aaa],
       securityGroups=[sg-fargate],
       assignPublicIp=ENABLED
     }" \
  --count 1 \
  --started-by "deploy-$(git rev-parse --short HEAD)"

blog-api-migrate is a separate Task Definition — same image, different command.

migrate task definition (excerpt)

{
  "family": "blog-api-migrate",
  "containerDefinitions": [{
    "name": "migrate",
    "image": "...:v123",
    "command": ["alembic", "upgrade", "head"],
    "essential": true
  }]
}

For Django: ["python", "manage.py", "migrate", "--noinput"].

Backward Compatible — Blue/Green-friendly #

During a deployment, the previous version and the new version are running simultaneously (the very nature of a rolling deployment). For both versions to use the same DB without breaking, migrations must be backward and forward compatible.

Risky change	Safe flow (3 steps)
Drop column	1) Deploy code that stops using it → 2) drop column migration → 3) cleanup
Rename column	1) Add new column + dual write → 2) read only the new column → 3) drop the old
Add NOT NULL	1) Allow NULL + backfill default → 2) make code always fill → 3) add NOT NULL
Big index	`CREATE INDEX CONCURRENTLY` (PG) — no lock. Alembic op.create_index() option

The principle: a migration completes within a bounded window, and the resulting DB schema must work correctly with both the old version and the new version of the code.

`ALTER TABLE` lock pitfalls #

A common case in PostgreSQL:

This one line can stop production

ALTER TABLE posts ADD COLUMN status VARCHAR(20) NOT NULL DEFAULT 'draft';

In PG 11+, DEFAULT is instant (just metadata change). But some patterns still rewrite the entire row → big-table lock → production frozen.

The safe approach:

Split into 3 steps

-- 1) Just add the column (allow NULL)
ALTER TABLE posts ADD COLUMN status VARCHAR(20);

-- 2) Backfill (in batches)
UPDATE posts SET status = 'draft' WHERE status IS NULL AND id BETWEEN 1 AND 10000;
-- ... repeat

-- 3) NOT NULL + default
ALTER TABLE posts ALTER COLUMN status SET DEFAULT 'draft';
ALTER TABLE posts ALTER COLUMN status SET NOT NULL;

The same operational mindset as N+1 / index issues in Django Advanced #3 query optimization.

8) Backup and recovery #

RDS automated backups support PITR (Point-In-Time Recovery).

Type	Description
Automated backup	During retention period (7–35 days), daily + WAL every 5 minutes
Manual snapshot	Explicit — independent of retention, kept until explicitly deleted
PITR	Restore to any point within the automated backup window (creates a new instance)

Manual snapshot before deploy

aws rds create-db-snapshot \
  --db-snapshot-identifier blog-db-pre-deploy-$(date +%Y%m%d-%H%M) \
  --db-instance-identifier blog-db

Take one before a risky migration, and in the event of an incident, restore to a new instance → swap DNS, done.

9) Connection pools and IAM auth #

Two more pieces come in as production traffic grows.

RDS Proxy #

If each task maintains its own pool, task count × pool size = total connections. PostgreSQL connections are memory-heavy, so this can quickly exhaust the DB.

In the line of Advanced #1 ECS, RDS Proxy steps in.

Where Proxy fits in

Fargate (10 tasks × pool 20) ──▶ RDS Proxy ──▶ RDS
       Total 200 connections           ▼
                                Actual RDS connections: 20–50

Pros: shorter failover (~30s), connection burst protection, IAM auth option. Cons: hourly cost (~$0.015 vCPU/h).

IAM auth #

Use IAM tokens instead of a password to access RDS. Access is controlled by the task role’s IAM policy. It’s safer, but you have to manage token TTL (15 min), ORM compatibility, and TLS enforcement — operations become a step more complex. Secrets Manager is sufficient for small systems.

Pitfalls — what often breaks in production #

1) `password authentication failed` — rotation issue #

Secrets Manager auto-rotation is on, but tasks have the old password cached. The Task Definition’s secrets field fetches the ARN on each task start, but if the app reads the secret once at startup and holds it in memory, the value never refreshes.

Fix:

Read once on app start + once more on connection failure
Restart tasks (deployment) at rotation time

2) RDS Free Tier ended but still running #

Free Tier is 12 months — once over, ~$30/month is auto-billed. Catch it early with Basics #3 billing alerts.

3) Single-AZ in production #

To save costs, dropping --multi-az means AZ failure = DB down = all traffic down. Multi-AZ is recommended even in small production. Loss from one incident vs ~$30–50/month.

4) Forgot `deletion-protection` and accidentally deleted #

One click in the console and RDS is gone. Always enable deletion-protection. Disabling it requires a separate update call → an intentional act.

5) `terraform destroy` wipes RDS too #

Covered in #4 IaC. Protect with terraform’s lifecycle block.

Terraform protection

resource "aws_db_instance" "blog" {
  # ...
  deletion_protection      = true
  skip_final_snapshot      = false
  final_snapshot_identifier = "blog-db-final-${formatdate("YYYYMMDD-hhmm", timestamp())}"

  lifecycle {
    prevent_destroy = true
  }
}

6) Migration lock that doesn’t release #

PostgreSQL’s ALTER TABLE waits indefinitely without a lock_timeout. A lock held during production traffic blocks all queries.

Safety guard

SET lock_timeout = '5s';
ALTER TABLE posts ADD COLUMN status TEXT;
-- If lock can't be acquired within 5s, error out

In Alembic: op.execute("SET lock_timeout = '5s'") at the start of the migration.

Wrapping up #

What we covered in this post:

The DB’s layout — Private DB Subnet Group, only sg-fargate inbound on sg-rds
Secrets Manager — no plaintext password, inject ARN via Task Definition’s secrets, executionRole permissions
RDS production options — Multi-AZ, gp3, deletion-protection, storage-encrypted, performance-insights, auto-minor-upgrade
Failover — 60–120s, endpoint stays the same, connection pool needs validation
Session Manager — psql from a task without bastion, enableExecuteCommand
Migration — separate RunTask not container start, blue/green-compatible, ALTER TABLE lock splitting
Backups — automated PITR + manual snapshots before risky deploys
Connection pools — RDS Proxy for big traffic, Secrets enough for small systems
Pitfalls — password rotation cache, Free Tier expiry, single-AZ, missing deletion-protection, infinite migration lock

Next — CI/CD #

You’ve now run the build → ECR push → Service update → migration flow once by hand. You can’t keep doing that by hand.

In #3 CI/CD — GitHub Actions + ECR + ECS we automate this flow with GitHub Actions OIDC, plus auto-rollback on deploy failure (deployment circuit breaker), CodeDeploy’s blue/green options — making a single git push the end of deployment.

The big picture — the DB’s layout #

1) Create the DB Subnet Group #

2) Dedicated DB Security Group #

3) Create the password in Secrets Manager #

4) Create the RDS instance #

What failover looks like #

5) Inject Secrets into the Task #

Recommended secret shape #

IAM permissions #

6) Verify the first connection #

Session Manager port forward (recommended) #

7) The operational side of migrations #

Where do you run them? #

Backward Compatible — Blue/Green-friendly #

ALTER TABLE lock pitfalls #

8) Backup and recovery #

9) Connection pools and IAM auth #

RDS Proxy #

IAM auth #

Pitfalls — what often breaks in production #

1) password authentication failed — rotation issue #

2) RDS Free Tier ended but still running #

3) Single-AZ in production #

4) Forgot deletion-protection and accidentally deleted #

5) terraform destroy wipes RDS too #

6) Migration lock that doesn’t release #

Wrapping up #

Next — CI/CD #

`ALTER TABLE` lock pitfalls #

1) `password authentication failed` — rotation issue #

4) Forgot `deletion-protection` and accidentally deleted #

5) `terraform destroy` wipes RDS too #