AWS in Practice #2: RDS Integration and Migration Operations

10 min read

In #1 we put the blog API up on ECS Fargate, but its DB was in memory. This post moves that to RDS Postgres Multi-AZ and lays out how to run migrations without killing production traffic.

If Intermediate #4 RDS was the first RDS instance we created in the console, this is the production version — VPC isolation, Secrets injection, and migration patterns all in one go.

The big picture — the DB’s layout #

What this post adds
Fargate Task (Private Subnet)
    │ DATABASE_URL  (from Secrets Manager)
sg-rds  ← 5432 ← sg-fargate
RDS Postgres Multi-AZ  (DB Subnet Group)
    Primary  (AZ-a)
    Standby  (AZ-c)   ← Promoted on failover

Three core principles:

  1. The DB isn’t visible from the internet — Private subnet or an isolated DB Subnet Group
  2. Passwords aren’t in code — Secrets Manager → Task Definition’s secrets field
  3. Migrations are split into a deploy step — no migrate on container start

1) Create the DB Subnet Group #

RDS can only live inside a DB Subnet Group. It needs subnets across at least 2 AZs (whether you turn on Multi-AZ or not).

DB Subnet Group
aws rds create-db-subnet-group \
  --db-subnet-group-name blog-db-subnets \
  --db-subnet-group-description "Blog API DB subnets" \
  --subnet-ids subnet-db-a subnet-db-c

These subnets must be without a route to the internet gateway. External access only goes through the ALB / Fargate.

2) Dedicated DB Security Group #

The SG's layout
sg-rds  inbound:
   port 5432  ← sg-fargate     ← Fargate tasks only
   port 5432  ← sg-bastion     ← (optional) operator jump host

sg-rds  outbound:
   not needed (narrow from default)

Important: source as sg-fargate (another SG), not an IP range. No matter how many Fargate tasks you have, the rule applies automatically. No SG updates needed when CIDRs change.

Create the DB SG
aws ec2 create-security-group \
  --group-name sg-rds \
  --description "RDS allow from Fargate" \
  --vpc-id $VPC_ID

aws ec2 authorize-security-group-ingress \
  --group-id $SG_RDS \
  --protocol tcp --port 5432 \
  --source-group $SG_FARGATE

3) Create the password in Secrets Manager #

Same pattern as Advanced #6. Two approaches:

ApproachDescription
Manual creationGenerate a strong random password and enter it in RDS
RDS auto-integration with Secrets ManagerCheck “Manage in Secrets Manager” in the RDS console → password auto-generated + auto-rotation

Auto-integration is recommended for production. Manual is for learning.

Manual — create Secrets
aws secretsmanager create-secret \
  --name blog-api/db \
  --secret-string '{
    "username": "blog_admin",
    "password": "S3cr3t-r4nd0m-32-bytes-...",
    "engine": "postgres",
    "host": "PLACEHOLDER",
    "port": 5432,
    "dbname": "blogdb"
  }'

After the DB is created, just update host.

4) Create the RDS instance #

RDS Postgres Multi-AZ
aws rds create-db-instance \
  --db-instance-identifier blog-db \
  --db-instance-class db.t4g.small \
  --engine postgres --engine-version 16.4 \
  --allocated-storage 20 --storage-type gp3 \
  --master-username blog_admin \
  --master-user-password "S3cr3t-r4nd0m-32-bytes-..." \
  --db-name blogdb \
  --vpc-security-group-ids $SG_RDS \
  --db-subnet-group-name blog-db-subnets \
  --multi-az \
  --publicly-accessible false \
  --backup-retention-period 7 \
  --deletion-protection \
  --enable-performance-insights \
  --storage-encrypted \
  --auto-minor-version-upgrade

Production options:

OptionMeaning
db.t4g.smallARM-based, ~20% cheaper than x86 (t3), good starting point for small workloads
gp3Latest SSD, allows separate IOPS / throughput tuning
--multi-azAutomatic standby — failover takes 60–120s
--publicly-accessible falseNo internet access. Always false
--backup-retention-period 77-day automatic backups (PITR available)
--deletion-protectionPrevents accidental deletion. Even from CLI, requires a separate option
--enable-performance-insightsPer-query performance analysis (free for 7 days)
--storage-encryptedDisk encryption with KMS — always on

What failover looks like #

How Multi-AZ behaves:

Auto-failover during incident
Primary (AZ-a)  ─ sync replication ─▶  Standby (AZ-c)
    │                                    │
    × failure                            │
    ▼                                    ▼
Failover triggers                  Standby → Primary promoted
DNS endpoint auto-updates to new Primary (60–120s)

From the application’s view, the endpoint hostname stays the same (blog-db.xxxx.ap-northeast-2.rds.amazonaws.com). DNS TTL is short (~5s) so connections naturally route to the new Primary after failover. But the connection pool needs to validate disconnected connections — zombie connections in the pool produce 504s for a while.

5) Inject Secrets into the Task #

Don’t put the password in plain text in environment variables; reference the ARN through secrets in the Task Definition.

task-definition.json (excerpt)
{
  "containerDefinitions": [
    {
      "name": "api",
      "image": "...",
      "secrets": [
        {
          "name": "DATABASE_URL",
          "valueFrom": "arn:aws:secretsmanager:ap-northeast-2:123456789012:secret:blog-api/db-AbCdEf:url::"
        }
      ],
      "environment": [
        { "name": "ENVIRONMENT", "value": "production" }
      ]
    }
  ]
}

Shape of valueFrom:

arn:aws:secretsmanager:<region>:<account>:secret:<name>-<random>:<json-key>::

To pull just one key from the JSON, specify the key like :url::. To pull the whole JSON, leave the key part empty.

Recommended secret shape #

blog-api/db secret JSON
{
  "url": "postgresql://blog_admin:PASSWORD@blog-db.xxxx.rds.amazonaws.com:5432/blogdb",
  "username": "blog_admin",
  "password": "PASSWORD",
  "host": "blog-db.xxxx.rds.amazonaws.com",
  "port": 5432,
  "dbname": "blogdb"
}

Having a pre-assembled url key makes things easier on the app side — read in one line.

IAM permissions #

executionRole needs access to Secrets.

executionRole policy addition
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": ["secretsmanager:GetSecretValue"],
    "Resource": "arn:aws:secretsmanager:ap-northeast-2:123456789012:secret:blog-api/db-*"
  }]
}

Without this the task ends up STOPPED with ResourceInitializationError — exactly the pitfall covered in #1’s pitfalls.

6) Verify the first connection #

From bastion or CloudShell
psql "postgresql://blog_admin@blog-db.xxxx.ap-northeast-2.rds.amazonaws.com:5432/blogdb"
# Enter password → SELECT 1;

You can only access from inside the VPC, so you need one of: a jump host / VPC-mode CloudShell / Session Manager port forwarding.

Session Manager port forward (recommended) #

A clean way to reach RDS through an ECS task — no bastion host:

Enable enableExecuteCommand on the Fargate task
aws ecs update-service \
  --cluster blog-cluster --service blog-api \
  --enable-execute-command --force-new-deployment

# Get into the task
TASK=$(aws ecs list-tasks --cluster blog-cluster --service-name blog-api \
   --query 'taskArns[0]' --output text)
aws ecs execute-command --cluster blog-cluster --task $TASK \
   --container api --interactive --command "/bin/sh"

Run psql directly inside the task — powerful for ad-hoc debugging. Needs SSM / iam:PassRole / ecs:ExecuteCommand permissions.

7) The operational side of migrations #

Most incidents come from migrations. Two things matter.

Where do you run them? #

Comparing options
A) Auto on container start (CMD before uvicorn/gunicorn)
   ─ Simple. Small teams.
   ─ Risk: N replicas run simultaneously → locks / race
   ─ Risk: failure means the container itself doesn't come up → rollback awkward

B) Separate ECS RunTask (deploy stage)
   ─ One execution. No locks/races.
   ─ Result tracked separately.
   ─ Recommended.

C) CodeDeploy lifecycle hook
   ─ As part of blue/green deploy
   ─ pre-traffic-shift or before-allow-traffic

Production recommendation is B (RunTask). CI pushes a new image → migration RunTask → on success, Service update.

Run a migration task once
aws ecs run-task \
  --cluster blog-cluster \
  --task-definition blog-api-migrate:5 \
  --launch-type FARGATE \
  --network-configuration "awsvpcConfiguration={
       subnets=[subnet-aaa],
       securityGroups=[sg-fargate],
       assignPublicIp=ENABLED
     }" \
  --count 1 \
  --started-by "deploy-$(git rev-parse --short HEAD)"

blog-api-migrate is a separate Task Definition — same image, different command.

migrate task definition (excerpt)
{
  "family": "blog-api-migrate",
  "containerDefinitions": [{
    "name": "migrate",
    "image": "...:v123",
    "command": ["alembic", "upgrade", "head"],
    "essential": true
  }]
}

For Django: ["python", "manage.py", "migrate", "--noinput"].

Backward Compatible — Blue/Green-friendly #

During a deployment, the previous version and the new version are running simultaneously (the very nature of a rolling deployment). For both versions to use the same DB without breaking, migrations must be backward and forward compatible.

Risky changeSafe flow (3 steps)
Drop column1) Deploy code that stops using it → 2) drop column migration → 3) cleanup
Rename column1) Add new column + dual write → 2) read only the new column → 3) drop the old
Add NOT NULL1) Allow NULL + backfill default → 2) make code always fill → 3) add NOT NULL
Big indexCREATE INDEX CONCURRENTLY (PG) — no lock. Alembic op.create_index() option

The principle: a migration completes within a bounded window, and the resulting DB schema must work correctly with both the old version and the new version of the code.

ALTER TABLE lock pitfalls #

A common case in PostgreSQL:

This one line can stop production
ALTER TABLE posts ADD COLUMN status VARCHAR(20) NOT NULL DEFAULT 'draft';

In PG 11+, DEFAULT is instant (just metadata change). But some patterns still rewrite the entire row → big-table lock → production frozen.

The safe approach:

Split into 3 steps
-- 1) Just add the column (allow NULL)
ALTER TABLE posts ADD COLUMN status VARCHAR(20);

-- 2) Backfill (in batches)
UPDATE posts SET status = 'draft' WHERE status IS NULL AND id BETWEEN 1 AND 10000;
-- ... repeat

-- 3) NOT NULL + default
ALTER TABLE posts ALTER COLUMN status SET DEFAULT 'draft';
ALTER TABLE posts ALTER COLUMN status SET NOT NULL;

The same operational mindset as N+1 / index issues in Django Advanced #3 query optimization.

8) Backup and recovery #

RDS automated backups support PITR (Point-In-Time Recovery).

TypeDescription
Automated backupDuring retention period (7–35 days), daily + WAL every 5 minutes
Manual snapshotExplicit — independent of retention, kept until explicitly deleted
PITRRestore to any point within the automated backup window (creates a new instance)
Manual snapshot before deploy
aws rds create-db-snapshot \
  --db-snapshot-identifier blog-db-pre-deploy-$(date +%Y%m%d-%H%M) \
  --db-instance-identifier blog-db

Take one before a risky migration, and in the event of an incident, restore to a new instance → swap DNS, done.

9) Connection pools and IAM auth #

Two more pieces come in as production traffic grows.

RDS Proxy #

If each task maintains its own pool, task count × pool size = total connections. PostgreSQL connections are memory-heavy, so this can quickly exhaust the DB.

In the line of Advanced #1 ECS, RDS Proxy steps in.

Where Proxy fits in
Fargate (10 tasks × pool 20) ──▶ RDS Proxy ──▶ RDS
       Total 200 connections           ▼
                                Actual RDS connections: 20–50

Pros: shorter failover (~30s), connection burst protection, IAM auth option. Cons: hourly cost (~$0.015 vCPU/h).

IAM auth #

Use IAM tokens instead of a password to access RDS. Access is controlled by the task role’s IAM policy. It’s safer, but you have to manage token TTL (15 min), ORM compatibility, and TLS enforcement — operations become a step more complex. Secrets Manager is sufficient for small systems.

Pitfalls — what often breaks in production #

1) password authentication failed — rotation issue #

Secrets Manager auto-rotation is on, but tasks have the old password cached. The Task Definition’s secrets field fetches the ARN on each task start, but if the app reads the secret once at startup and holds it in memory, the value never refreshes.

Fix:

  • Read once on app start + once more on connection failure
  • Restart tasks (deployment) at rotation time

2) RDS Free Tier ended but still running #

Free Tier is 12 months — once over, ~$30/month is auto-billed. Catch it early with Basics #3 billing alerts.

3) Single-AZ in production #

To save costs, dropping --multi-az means AZ failure = DB down = all traffic down. Multi-AZ is recommended even in small production. Loss from one incident vs ~$30–50/month.

4) Forgot deletion-protection and accidentally deleted #

One click in the console and RDS is gone. Always enable deletion-protection. Disabling it requires a separate update call → an intentional act.

5) terraform destroy wipes RDS too #

Covered in #4 IaC. Protect with terraform’s lifecycle block.

Terraform protection
resource "aws_db_instance" "blog" {
  # ...
  deletion_protection      = true
  skip_final_snapshot      = false
  final_snapshot_identifier = "blog-db-final-${formatdate("YYYYMMDD-hhmm", timestamp())}"

  lifecycle {
    prevent_destroy = true
  }
}

6) Migration lock that doesn’t release #

PostgreSQL’s ALTER TABLE waits indefinitely without a lock_timeout. A lock held during production traffic blocks all queries.

Safety guard
SET lock_timeout = '5s';
ALTER TABLE posts ADD COLUMN status TEXT;
-- If lock can't be acquired within 5s, error out

In Alembic: op.execute("SET lock_timeout = '5s'") at the start of the migration.

Wrapping up #

What we covered in this post:

  • The DB’s layout — Private DB Subnet Group, only sg-fargate inbound on sg-rds
  • Secrets Manager — no plaintext password, inject ARN via Task Definition’s secrets, executionRole permissions
  • RDS production options — Multi-AZ, gp3, deletion-protection, storage-encrypted, performance-insights, auto-minor-upgrade
  • Failover — 60–120s, endpoint stays the same, connection pool needs validation
  • Session Manager — psql from a task without bastion, enableExecuteCommand
  • Migration — separate RunTask not container start, blue/green-compatible, ALTER TABLE lock splitting
  • Backups — automated PITR + manual snapshots before risky deploys
  • Connection pools — RDS Proxy for big traffic, Secrets enough for small systems
  • Pitfalls — password rotation cache, Free Tier expiry, single-AZ, missing deletion-protection, infinite migration lock

Next — CI/CD #

You’ve now run the build → ECR push → Service update → migration flow once by hand. You can’t keep doing that by hand.

In #3 CI/CD — GitHub Actions + ECR + ECS we automate this flow with GitHub Actions OIDC, plus auto-rollback on deploy failure (deployment circuit breaker), CodeDeploy’s blue/green options — making a single git push the end of deployment.

X