23 Chapter

RDS integration and migration operations

RDS Postgres Multi-AZ inside the VPC, Security Group design, injecting the password via Secrets Manager, the operational flow of Alembic / Django migrations, and a blue/green-compatible migration pattern, all in one place.

In Chapter 22 Infra skeleton we brought up the blog API on ECS Fargate, but the DB lived in memory. This chapter moves that part over to RDS Postgres Multi-AZ and organizes how to run migrations without killing production traffic.

If Chapter 11 RDS was the first RDS built from the console, this chapter covers operational patterns on top of it. We handle VPC separation, Secrets injection, and the migration patterns all in one go. As Part 4’s second chapter, it’s the step that lifts the DB you used to build by hand in the console up to an operable form.

The big picture — where to put the DB #

What this chapter adds

Fargate Task (Private Subnet)
    │
    │ DATABASE_URL  (from Secrets Manager)
    ▼
sg-rds  ← 5432 ← sg-fargate
    │
    ▼
RDS Postgres Multi-AZ  (DB Subnet Group)
    Primary  (AZ-a)
    Standby  (AZ-c)   ← promoted on failover

There are three core principles.

The DB is invisible from the internet. Place it in a private subnet or an isolated DB Subnet Group.
The password is not in the code. Inject it from Secrets Manager via the Task Definition’s secrets field.
Separate migrations into a deploy step. Don’t run migrate at container startup.

1) Create the DB Subnet Group #

RDS can only live inside a DB Subnet Group. Whether you turn Multi-AZ on or not, you need subnets spanning at least 2 AZs.

DB Subnet Group

aws rds create-db-subnet-group \
  --db-subnet-group-name blog-db-subnets \
  --db-subnet-group-description "Blog API DB subnets" \
  --subnet-ids subnet-db-a subnet-db-c

These subnets must be configured with no route to an internet gateway. External access happens only through the ALB / Fargate.

2) A dedicated DB Security Group #

The SG's role

sg-rds  inbound:
   port 5432  ← sg-fargate     ← Fargate tasks only
   port 5432  ← sg-bastion     ← (optional) operator jump host

sg-rds  outbound:
   not needed (narrow it down from default)

Here too, set the source to sg-fargate (another SG), not an IP range. No matter how many Fargate tasks grow, it applies automatically, and there’s no need to update the SG when a CIDR changes.

Create the DB SG

aws ec2 create-security-group \
  --group-name sg-rds \
  --description "RDS allow from Fargate" \
  --vpc-id $VPC_ID

aws ec2 authorize-security-group-ingress \
  --group-id $SG_RDS \
  --protocol tcp --port 5432 \
  --source-group $SG_FARGATE

3) Create the password with Secrets Manager #

Exactly the pattern from Chapter 20 Secrets / Parameter Store. There are two ways.

Method	Role
Manual creation	generate a strong random value yourself and enter it identically into RDS
RDS auto-integration with Secrets Manager	check “Manage in Secrets Manager” in the RDS console → password auto-generated + auto-rotated

For production, auto-integration is recommended. Manual is for learning.

Manual — create the secret

aws secretsmanager create-secret \
  --name blog-api/db \
  --secret-string '{
    "username": "blog_admin",
    "password": "S3cr3t-r4nd0m-32-bytes-...",
    "engine": "postgres",
    "host": "PLACEHOLDER",
    "port": 5432,
    "dbname": "blogdb"
  }'

After the DB is created, update only host.

4) Create the RDS instance #

RDS Postgres Multi-AZ

aws rds create-db-instance \
  --db-instance-identifier blog-db \
  --db-instance-class db.t4g.small \
  --engine postgres --engine-version 16.4 \
  --allocated-storage 20 --storage-type gp3 \
  --master-username blog_admin \
  --master-user-password "S3cr3t-r4nd0m-32-bytes-..." \
  --db-name blogdb \
  --vpc-security-group-ids $SG_RDS \
  --db-subnet-group-name blog-db-subnets \
  --multi-az \
  --publicly-accessible false \
  --backup-retention-period 7 \
  --deletion-protection \
  --enable-performance-insights \
  --storage-encrypted \
  --auto-minor-version-upgrade

Organizing the production options:

Option	Meaning
`db.t4g.small`	ARM-based, ~20% cheaper than x86 (`t3`), a starting point for small workloads
`gp3`	latest SSD, IOPS / throughput can be decoupled
`--multi-az`	automatic Standby — failover in 60 ~ 120 seconds
`--publicly-accessible false`	not reachable from the internet. Must be false
`--backup-retention-period 7`	automated backups retained for 7 days (PITR possible)
`--deletion-protection`	prevents accidental deletion. Needs a separate option even via CLI
`--enable-performance-insights`	per-query performance analysis (7 days free)
`--storage-encrypted`	disk encryption with KMS — always turn it on

The shape of a failover #

Let’s see how Multi-AZ works.

Automatic failover on a fault

Primary (AZ-a)  ─ synchronous replication ─▶  Standby (AZ-c)
    │                              │
    × fault                        │
    ▼                              ▼
failover triggered             Standby → Primary promotion
                                  │
                                  ▼
DNS endpoint auto-updates to the new Primary (60~120s)

From the application’s point of view, the endpoint hostname stays the same (blog-db.xxxx.ap-northeast-2.rds.amazonaws.com). The DNS TTL is short (~5s), so after failover it naturally reconnects to the new Primary. However, the connection pool must validate dead connections. If zombie connections linger in the pool, you’ll get 504s for a while.

5) Injecting the Secret into the Task #

Instead of putting the password in an environment variable in plaintext, point at the ARN with the Task Definition’s secrets.

task-definition.json (excerpt)

{
  "containerDefinitions": [
    {
      "name": "api",
      "image": "...",
      "secrets": [
        {
          "name": "DATABASE_URL",
          "valueFrom": "arn:aws:secretsmanager:ap-northeast-2:123456789012:secret:blog-api/db-AbCdEf:url::"
        }
      ],
      "environment": [
        { "name": "ENVIRONMENT", "value": "production" }
      ]
    }
  ]
}

The form of valueFrom is as follows.

arn:aws:secretsmanager:<region>:<account>:secret:<name>-<random>:<json-key>::

To pull just one key of the JSON, specify the key like :url::. To receive the whole JSON, leave the key part empty.

The shape inside the Secret (recommended) #

blog-api/db secret JSON

{
  "url": "postgresql://blog_admin:PASSWORD@blog-db.xxxx.rds.amazonaws.com:5432/blogdb",
  "username": "blog_admin",
  "password": "PASSWORD",
  "host": "blog-db.xxxx.rds.amazonaws.com",
  "port": 5432,
  "dbname": "blogdb"
}

Keeping an already-assembled url key is convenient because the app can take it directly.

IAM permissions #

The executionRole must be able to access Secrets.

Add to the executionRole policy

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": ["secretsmanager:GetSecretValue"],
    "Resource": "arn:aws:secretsmanager:ap-northeast-2:123456789012:secret:blog-api/db-*"
  }]
}

If you omit it, the task ends up STOPPED with ResourceInitializationError. That’s the role covered in the pitfalls of Chapter 22 Infra skeleton.

6) Verifying the first connection #

From a Bastion or CloudShell

psql "postgresql://blog_admin@blog-db.xxxx.ap-northeast-2.rds.amazonaws.com:5432/blogdb"
# enter password → SELECT 1;

Since it’s reachable only inside the VPC, you need one of: a jump host / CloudShell VPC environment / Session Manager port forwarding.

Session Manager port forward (recommended) #

A clean way to get to RDS through an ECS task without a bastion host.

Turn on enableExecuteCommand for the Fargate Task

aws ecs update-service \
  --cluster blog-cluster --service blog-api \
  --enable-execute-command --force-new-deployment

# step into the task
TASK=$(aws ecs list-tasks --cluster blog-cluster --service-name blog-api \
   --query 'taskArns[0]' --output text)
aws ecs execute-command --cluster blog-cluster --task $TASK \
   --container api --interactive --command "/bin/sh"

By using psql directly inside the task, this is powerful for ad-hoc debugging. It needs SSM / iam:PassRole / ecs:ExecuteCommand permissions.

7) Operating migrations #

Most incidents come from migrations. Two things are key.

Where to run them #

Comparing options

A) Automatically at container startup (CMD before uvicorn/gunicorn)
   ─ simple. small team.
   ─ risk: N replicas run at the same time → lock / race
   ─ risk: on failure the container itself won't come up → tricky rollback

B) A separate ECS RunTask (deploy step)
   ─ runs once. no lock/race.
   ─ separate result tracking.
   ─ recommended.

C) CodeDeploy lifecycle hook
   ─ as part of a blue/green deployment
   ─ pre-traffic-shift or before-allow-traffic

The production recommendation is B (RunTask). The flow is: CI pushes a new image → migration RunTask → on success, Service update.

Run the migration task once

aws ecs run-task \
  --cluster blog-cluster \
  --task-definition blog-api-migrate:5 \
  --launch-type FARGATE \
  --network-configuration "awsvpcConfiguration={
       subnets=[subnet-aaa],
       securityGroups=[sg-fargate],
       assignPublicIp=ENABLED
     }" \
  --count 1 \
  --started-by "deploy-$(git rev-parse --short HEAD)"

blog-api-migrate is a separate Task Definition. Same image, but a different command.

migrate task definition (excerpt)

{
  "family": "blog-api-migrate",
  "containerDefinitions": [{
    "name": "migrate",
    "image": "...:v123",
    "command": ["alembic", "upgrade", "head"],
    "essential": true
  }]
}

For Django it’s ["python", "manage.py", "migrate", "--noinput"].

Backward Compatible — Blue/Green compatibility #

At one moment during deployment, the old version and the new version are alive at the same time (the definition of a rolling deployment). For these two versions not to break while looking at the same DB, the migration must be compatible in both directions.

Risky change	Safe flow (3 steps)
Drop a column	1) deploy with the code no longer using it → 2) drop-column migration → 3) cleanup
Rename a column	1) add the new column + dual-write to both → 2) read only the new column → 3) drop the old column
Add NOT NULL	1) allow NULL + fill the default → 2) make the code always fill it → 3) add NOT NULL
Large index	`CREATE INDEX CONCURRENTLY` (PG) — no lock. Alembic op.create_index() option

The principle is this. One migration finishes within N minutes, and the DB that migration was applied to must work correctly with both the old-version code and the new-version code.

The lock pitfall of `ALTER TABLE` #

A pitfall you commonly hit in PostgreSQL.

This setting can halt production

ALTER TABLE posts ADD COLUMN status VARCHAR(20) NOT NULL DEFAULT 'draft';

In PG 11+, DEFAULT is instant (metadata change only). But some patterns still lead to a full-row rewrite → a large table lock → production frozen.

The safe way is to split it into steps.

Split into 3 steps

-- 1) add only the column (allow NULL)
ALTER TABLE posts ADD COLUMN status VARCHAR(20);

-- 2) backfill (in batches)
UPDATE posts SET status = 'draft' WHERE status IS NULL AND id BETWEEN 1 AND 10000;
-- ... repeat

-- 3) NOT NULL + default
ALTER TABLE posts ALTER COLUMN status SET DEFAULT 'draft';
ALTER TABLE posts ALTER COLUMN status SET NOT NULL;

8) Backup and recovery #

RDS automated backups even support PITR (Point-In-Time Recovery).

Type	Role
Automated backup	daily during the retention period (7 ~ 35 days) + WAL every 5 minutes
Manual snapshot	explicit — independent of retention, kept until explicitly deleted
PITR	restore to an arbitrary point within the automated-backup window (as a new instance)

Manual snapshot before deployment

aws rds create-db-snapshot \
  --db-snapshot-identifier blog-db-pre-deploy-$(date +%Y%m%d-%H%M) \
  --db-instance-identifier blog-db

Take one right before a risky migration, and on an incident you just restore to a new instance and swap the DNS — done. The deeper patterns of disaster recovery are covered in Chapter 30 disaster recovery & backup.

9) Connection pools and IAM authentication #

As production traffic grows, two more things come in.

RDS Proxy #

If each task holds its own pool, then task × pool size = the number of concurrent connections. PG uses a lot of memory per connection. Leave it as is and the DB dies.

As an extension of Chapter 15 ECS Fargate, RDS Proxy plays its part.

The shape with the Proxy in place

Fargate (10 task × pool 20) ──▶ RDS Proxy ──▶ RDS
        200 total                  ▼
                              actual RDS connections: 20~50

Its upsides are shorter failover time (down to ~30 seconds), preventing connection floods, and an IAM auth option. Its downside is the hourly cost (~$0.015 vCPU/h).

IAM authentication #

Access RDS with an IAM token instead of a password. Control it through the Task role’s IAM policy. It’s safe, but token TTL (15 minutes) management, the compatibility of some ORMs, and forced TLS come along with it, making the operational shape one step more complex. For a small system, Secrets Manager is enough.

Pitfalls — the parts that often break in production #

1) `password authentication failed` — a rotation problem #

This is when Secrets Manager auto-rotation is on but the task is cached with the old password. The Task Definition’s secrets calls the ARN every time, but if the app reads it once at startup and holds onto it, the cache doesn’t refresh. The fixes are as follows.

Read once at app startup + once more on a connection failure
Restart the task to match the rotation timing (deployment)

2) The RDS Free Tier ended but it’s still running in production #

The Free Tier is 12 months. Once it ends, ~$30 a month is billed automatically. Detect it early with the billing alerts from Chapter 3 cost management.

3) Single-AZ operation #

If you drop --multi-az to save cost, then AZ failure = DB down = all traffic down. Even a small production deployment should use Multi-AZ. Weigh the loss of a single failure against the ~$30 ~ 50 a month.

4) Accidental deletion without `deletion-protection` on #

One click in the console makes RDS disappear. Always turn on deletion-protection. Turning it off needs a separate update call, so it becomes a deliberate act.

5) `terraform destroy` wipes out the RDS too #

A pitfall we’ll cover in Chapter 25 Terraform intro. Protect it with a terraform lifecycle block.

Terraform protection

resource "aws_db_instance" "blog" {
  # ...
  deletion_protection      = true
  skip_final_snapshot      = false
  final_snapshot_identifier = "blog-db-final-${formatdate("YYYYMMDD-hhmm", timestamp())}"

  lifecycle {
    prevent_destroy = true
  }
}

6) A migration lock that never releases #

PG’s ALTER TABLE waits without a lock_timeout, indefinitely. A large lock during production traffic makes all queries wait.

Safety guard

SET lock_timeout = '5s';
ALTER TABLE posts ADD COLUMN status TEXT;
-- if it can't grab the lock within 5s, it bails out with an error

For Alembic, put op.execute("SET lock_timeout = '5s'") at the start of the migration.

Terraform companion — the DB skeleton as code #

Moving the DB Subnet Group · SG · RDS · Secrets you built with the CLI above to Terraform looks like this. Setting the password to manage_master_user_password = true so that RDS creates and rotates it directly in Secrets Manager makes §3’s manual secret-creation step disappear and means you never touch a plaintext password in the code (Chapter 20).

rds.tf

resource "aws_db_subnet_group" "main" {
  name       = "blog-db"
  subnet_ids = var.db_subnet_ids        # DB subnets in 2 AZs
}

resource "aws_security_group" "rds" {
  name_prefix = "blog-rds-"
  vpc_id      = var.vpc_id
}

# allow only 5432 coming from the Fargate SG
resource "aws_security_group_rule" "rds_from_fargate" {
  type                     = "ingress"
  security_group_id        = aws_security_group.rds.id
  source_security_group_id = var.fargate_sg_id
  from_port = 5432, to_port = 5432, protocol = "tcp"
}

resource "aws_db_instance" "main" {
  identifier              = "blog-db"
  engine                  = "postgres"
  engine_version          = "17.2"
  instance_class          = "db.t4g.micro"
  allocated_storage       = 20
  storage_encrypted       = true          # incident prevention: encryption at rest
  multi_az                = true           # incident prevention: AZ-failure failover
  deletion_protection     = true           # incident prevention: block accidental deletion
  backup_retention_period = 14             # PITR 14 days (Chapter 30)
  db_subnet_group_name    = aws_db_subnet_group.main.name
  vpc_security_group_ids  = [aws_security_group.rds.id]

  db_name                     = "blog"
  username                    = "blog"
  manage_master_user_password = true       # → Secrets Manager auto-create & rotate
}

In the Task definition, inject the secret ARN that RDS created via secrets (continuing from the Task Definition in Chapter 22).

ecs-api.tf — secret injection (excerpt)

secrets = [{
  name      = "DB_SECRET"
  valueFrom = aws_db_instance.main.master_user_secret[0].secret_arn
}]

Run migrations as a one-off Task (aws ecs run-task ... command=["alembic","upgrade","head"]), not as the service, as in §7. This code carries straight over into the rds.tf of the Part 6 capstone.

Exercises #

From this chapter’s RDS creation command, write one sentence each on what incident the three options --storage-encrypted, --deletion-protection, and --multi-az each prevent. Also mark which item under §“Pitfalls” each connects to.
Write out, without looking at §“Backward Compatible,” the 3 steps for safely applying the risky change of adding NOT NULL to a column. Explain in one paragraph how these 3 steps mesh with the rolling deployment (old and new versions coexisting) seen in Chapter 22.
Organize, grounded in §“Where to run them,” the reason for separating migrations into a dedicated RunTask (option B) rather than running them automatically at container startup (option A). It helps to recall in advance how this RunTask gets automated in Chapter 24 CI/CD.

In short: put RDS in a DB Subnet Group that is invisible from the internet, and allow only sg-fargate inbound on sg-rds. Inject the password from Secrets Manager via the Task Definition’s secrets, and turn on Multi-AZ, encryption, and deletion protection. Separate migrations into a dedicated RunTask rather than container startup, and keep production traffic alive with two-way compatibility and lock_timeout.

Next chapter #

We’ve now run the flow of image build → ECR push → Service update → migration by hand once. We can’t do it by hand every time. In the next Chapter 24 CI/CD — GitHub Actions + ECR + ECS, we’ll automate this flow with GitHub Actions OIDC, and build a flow where deployment finishes with a single git push — including auto-rollback on a failed deployment (deployment circuit breaker) and CodeDeploy’s blue/green option.