Chapter 24

CI/CD — GitHub Actions + ECR + ECS

Access-key-free GitHub Actions with OIDC, ECR push, automatic Task Definition updates, ECS Service rolling deployment, deployment circuit breaker and auto-rollback, all the way to CodeDeploy blue/green. A deployment flow that finishes in a single git push.

In Chapter 22 Infrastructure skeleton we stood up an ECS Service by hand, and in Chapter 23 RDS integration we ran RDS and migrations by hand. This chapter binds all that manual work into a single git push.

As the third chapter of Part 4, it covers the following.

GitHub Actions ↔ AWS authentication without access keys — OIDC
the build → ECR push → Task Definition update → Service update → migration workflow
auto-rollback — Deployment Circuit Breaker
progressive deployment — CodeDeploy blue/green / canary
a comparison with CodePipeline — when to use which

The big picture #

A flow that finishes in one git push

git push (main)
   │
   ▼
GitHub Actions
   │
   ├─ 1) Test                    ← pytest / npm test
   │
   ├─ 2) AWS OIDC assume-role   ← no access keys
   │
   ├─ 3) Build & push image     ← <git-sha> tag
   │       ECR: blog-api:abc1234
   │
   ├─ 4) Run migrations         ← ecs run-task (blog-api-migrate)
   │       wait → check exit code
   │
   ├─ 5) Update Task Definition ← new revision with new image
   │
   ├─ 6) Update Service          ← rolling deployment
   │
   └─ 7) Wait services-stable    ← 5~10 min
           circuit breaker auto-rollback on failure

The goal of this chapter is to make this flow run in one pass.

1) GitHub OIDC — access-key-free authentication #

The old pattern was to issue access keys from an IAM user and store them in GitHub Secrets. That brings risks like git-history exposure, the obligation to rotate keys, and poor traceability.

The OIDC (OpenID Connect) pattern has GitHub issue a short-lived token (15 minutes) on every workflow run, and makes AWS IAM trust that token.

The shape of OIDC

GitHub Actions Job starts
   │
   ▼
GitHub OIDC Provider issues a JWT
   {sub: "repo:myorg/blog-api:ref:refs/heads/main", aud: "sts.amazonaws.com"}
   │
   ▼
aws-actions/configure-aws-credentials
   ├─ STS:AssumeRoleWithWebIdentity
   ├─ AWS validates the sub claim against the trust policy
   ▼
temporary credentials (AccessKey / SecretKey / SessionToken)  TTL: 1h

Once only — registering the IAM OIDC Provider #

OIDC Provider

aws iam create-open-id-connect-provider \
  --url https://token.actions.githubusercontent.com \
  --client-id-list sts.amazonaws.com \
  --thumbprint-list 6938fd4d98bab03faadb97b34396831e3780aea1

The thumbprint is the SHA-1 fingerprint of a certificate in the chain that serves GitHub's OIDC endpoint. If you create the provider in the AWS console, it is fetched automatically.

IAM Role — Trust Policy #

Trust Policy of the github-actions-deploy role

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Federated": "arn:aws:iam::123456789012:oidc-provider/token.actions.githubusercontent.com"
    },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
      "StringEquals": {
        "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
      },
      "StringLike": {
        "token.actions.githubusercontent.com:sub": "repo:myorg/blog-api:environment:production"
      }
    }
  }]
}

The sub pattern is the key.

Pattern	Meaning
`repo:myorg/blog-api:ref:refs/heads/main`	main branch only
`repo:myorg/blog-api:ref:refs/tags/*`	tag pushes only
`repo:myorg/blog-api:environment:production`	only after passing the environment gate
`repo:myorg/blog-api:*`	dangerous: a workflow on any branch, tag, or PR in the repo can assume this role

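For production, the recommendation is an environment gate whose deployment branch rules allow only main or tags. Note that when a job declares environment:, GitHub issues the sub claim in the environment form instead of the ref form, so the trust policy has to match that form.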
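To get a feel for how the patterns in the table behave, here is a minimal sketch that uses the shell's own glob matching as a stand-in for IAM's StringLike. It is an approximation (IAM treats only * and ? as wildcards, while shell globs also give meaning to brackets), and the function name sub_matches and the sample claims are hypothetical.

```shell
# Approximate IAM's StringLike with shell glob matching.
# sub_matches CLAIM PATTERN  -> exit status 0 if the claim matches the pattern
sub_matches() {
  case "$1" in
    $2) return 0 ;;   # $2 is unquoted on purpose so that * acts as a wildcard
    *)  return 1 ;;
  esac
}

# Sample sub claims (illustrative values)
MAIN_SUB='repo:myorg/blog-api:ref:refs/heads/main'
PR_SUB='repo:myorg/blog-api:pull_request'

sub_matches "$MAIN_SUB" 'repo:myorg/blog-api:ref:refs/heads/main' && echo "main: allowed"
sub_matches "$PR_SUB"   'repo:myorg/blog-api:ref:refs/heads/main' || echo "PR: denied by the main-only pattern"
sub_matches "$PR_SUB"   'repo:myorg/blog-api:*'                   && echo "PR: allowed by the wildcard pattern"
```

The last line is the reason the wildcard row is marked dangerous: the pattern accepts claims that have nothing to do with main.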

Permissions Policy #

Grant only the actions deployment needs.

github-actions-deploy permissions

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ECR",
      "Effect": "Allow",
      "Action": [
        "ecr:GetAuthorizationToken",
        "ecr:BatchCheckLayerAvailability",
        "ecr:GetDownloadUrlForLayer",
        "ecr:PutImage",
        "ecr:InitiateLayerUpload",
        "ecr:UploadLayerPart",
        "ecr:CompleteLayerUpload"
      ],
      "Resource": "*"
    },
    {
      "Sid": "ECS",
      "Effect": "Allow",
      "Action": [
        "ecs:RegisterTaskDefinition",
        "ecs:DescribeTaskDefinition",
        "ecs:UpdateService",
        "ecs:DescribeServices",
        "ecs:RunTask",
        "ecs:DescribeTasks",
        "ecs:ListTasks"
      ],
      "Resource": "*"
    },
    {
      "Sid": "PassRole",
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": [
        "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
        "arn:aws:iam::123456789012:role/blog-api-task-role"
      ]
    }
  ]
}

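If iam:PassRole is missing, RegisterTaskDefinition fails: assigning an IAM role to a Task Definition counts as “passing” that role, so it needs its own permission. Note that this policy has no CloudWatch Logs permissions, so the aws logs tail call in the workflow's migration-failure path would also need a read permission such as logs:FilterLogEvents on the /ecs/blog-api-migrate log group.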

2) The GitHub Actions workflow #

.github/workflows/deploy.yml

name: Deploy to ECS

on:
  push:
    branches: [main]
  workflow_dispatch:

permissions:
  id-token: write   # OIDC token issuance — required
  contents: read

env:
  AWS_REGION: ap-northeast-2
  ECR_REPOSITORY: blog-api
  ECS_CLUSTER: blog-cluster
  ECS_SERVICE: blog-api
  TASK_FAMILY: blog-api
  MIGRATE_FAMILY: blog-api-migrate

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.14" }
      - run: pip install -r requirements.txt -r requirements-dev.txt
      - run: pytest -q

  deploy:
    needs: test
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4

      # 1) AWS OIDC
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-actions-deploy
          aws-region: ${{ env.AWS_REGION }}

      # 2) ECR login
      - name: Login to ECR
        id: ecr
        uses: aws-actions/amazon-ecr-login@v2

      # 3) Build & push
      - name: Build and push
        id: build
        env:
          REGISTRY: ${{ steps.ecr.outputs.registry }}
          TAG: ${{ github.sha }}
        run: |
          docker build --platform=linux/amd64 \
            -t $REGISTRY/$ECR_REPOSITORY:$TAG \
            -t $REGISTRY/$ECR_REPOSITORY:latest .
          docker push $REGISTRY/$ECR_REPOSITORY:$TAG
          docker push $REGISTRY/$ECR_REPOSITORY:latest
          echo "image=$REGISTRY/$ECR_REPOSITORY:$TAG" >> $GITHUB_OUTPUT

      # 4) Migration RunTask
      - name: Run DB migrations
        env:
          IMAGE: ${{ steps.build.outputs.image }}
        run: |
          # register a new revision of the migration task definition with the new image
          DEF=$(aws ecs describe-task-definition --task-definition $MIGRATE_FAMILY \
            --query 'taskDefinition' --output json)
          NEW=$(echo "$DEF" | jq --arg I "$IMAGE" \
            '.containerDefinitions[0].image=$I |
             {family,taskRoleArn,executionRoleArn,networkMode,containerDefinitions,
              volumes,placementConstraints,requiresCompatibilities,cpu,memory} |
             with_entries(select(.value != null))')   # drop unset fields; the CLI rejects null values
          NEW_ARN=$(aws ecs register-task-definition \
            --cli-input-json "$NEW" \
            --query 'taskDefinition.taskDefinitionArn' --output text)

          # RunTask
          TASK_ARN=$(aws ecs run-task --cluster $ECS_CLUSTER \
            --task-definition $NEW_ARN --launch-type FARGATE \
            --network-configuration "awsvpcConfiguration={subnets=[${{ secrets.MIGRATE_SUBNET_ID }}],securityGroups=[${{ secrets.FARGATE_SG_ID }}],assignPublicIp=ENABLED}" \
            --started-by "deploy-${{ github.sha }}" \
            --query 'tasks[0].taskArn' --output text)

          echo "Migration task: $TASK_ARN"
          aws ecs wait tasks-stopped --cluster $ECS_CLUSTER --tasks $TASK_ARN

          # check exit code (fail if non-zero)
          EXIT=$(aws ecs describe-tasks --cluster $ECS_CLUSTER --tasks $TASK_ARN \
            --query 'tasks[0].containers[0].exitCode' --output text)
          if [ "$EXIT" != "0" ]; then
            echo "Migration failed (exit=$EXIT)"
            aws logs tail /ecs/blog-api-migrate --since 10m
            exit 1
          fi

      # 5) Update Service Task Definition
      - name: Render service task definition
        id: render
        uses: aws-actions/amazon-ecs-render-task-definition@v1
        with:
          task-definition: ops/task-definition.json
          container-name: api
          image: ${{ steps.build.outputs.image }}

      # 6) Deploy to ECS Service
      - name: Deploy
        uses: aws-actions/amazon-ecs-deploy-task-definition@v2
        with:
          task-definition: ${{ steps.render.outputs.task-definition }}
          service: ${{ env.ECS_SERVICE }}
          cluster: ${{ env.ECS_CLUSTER }}
          wait-for-service-stability: true
          wait-for-minutes: 15

The key steps laid out:

Step	Meaning
`id-token: write`	permission to request an OIDC token; without it the job cannot obtain a token, so the credentials step fails before it ever reaches STS
`environment: production`	GitHub environment gate — manual approval, secret isolation
`aws-actions/amazon-ecs-render-task-definition`	base JSON + new image → generate new JSON
`aws-actions/amazon-ecs-deploy-task-definition`	RegisterTaskDefinition + UpdateService + wait
`wait-for-service-stability`	wait until stable — the step fails on failure

3) Deployment Circuit Breaker — auto-rollback #

A feature we touched on briefly in Chapter 22 Infrastructure skeleton. If a new deployment doesn’t come up, it automatically reverts to the previous task definition.

Enabling the Circuit Breaker on the Service

aws ecs update-service \
  --cluster blog-cluster --service blog-api \
  --deployment-configuration "deploymentCircuitBreaker={enable=true,rollback=true},maximumPercent=200,minimumHealthyPercent=100"

How it works:

When a new task can’t reach a healthy state, ECS counts it.
Once the number of failed task launches reaches a threshold derived from the service's desired count, ECS marks the deployment as failed.
If rollback=true, it automatically returns to the previous task definition.
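How many failures it takes can be estimated up front. The sketch below assumes the rule described in the ECS documentation at the time of writing (half the desired count, rounded up, clamped to the range 3 to 200); the function name cb_failure_threshold is hypothetical, and the numbers should be checked against the current documentation before relying on them.

```shell
# Estimate the circuit breaker's failure threshold for a given desired count.
# Assumed rule: 0.5 x desired count, rounded up, clamped to 3..200.
cb_failure_threshold() {
  desired=$1
  threshold=$(( (desired + 1) / 2 ))   # integer form of ceil(desired / 2)
  if [ "$threshold" -lt 3 ]; then threshold=3; fi
  if [ "$threshold" -gt 200 ]; then threshold=200; fi
  echo "$threshold"
}

cb_failure_threshold 2     # small service: the minimum applies
cb_failure_threshold 25    # mid-sized service: half the desired count, rounded up
```

For a small service the minimum dominates, which is why a failed deployment with desired count 2 still needs several failed launches before the rollback starts.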

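On the GitHub Actions side, wait-for-service-stability is an input rather than a return value: with it set to true, the Deploy step fails if the service does not reach a steady state within wait-for-minutes, and that fails the workflow too.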
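Because a rolled-back service can itself settle into a steady state, it may be worth confirming after the wait that the service really runs the new revision. The following is a sketch under that assumption, not part of the original workflow; deployed_expected_revision is a hypothetical helper, and the expected ARN is whatever revision the deploy step registered.

```shell
# Return 0 only if the service's PRIMARY deployment uses the expected task definition.
# Args: CLUSTER SERVICE EXPECTED_TASK_DEF_ARN
deployed_expected_revision() {
  actual=$(aws ecs describe-services --cluster "$1" --services "$2" \
    --query "services[0].deployments[?status=='PRIMARY'] | [0].taskDefinition" \
    --output text)
  if [ "$actual" != "$3" ]; then
    echo "Service runs $actual, expected $3 (possibly rolled back)" >&2
    return 1
  fi
}

# Usage in a step after the deploy (NEW_TASK_DEF_ARN is a placeholder):
#   deployed_expected_revision "$ECS_CLUSTER" "$ECS_SERVICE" "$NEW_TASK_DEF_ARN" || exit 1
```

The check only needs ecs:DescribeServices, which the deploy role above already has.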

Manual rollback #

For when auto-rollback doesn’t trigger or you need to investigate after the fact.

Manual rollback to a previous revision

PREV=$(aws ecs describe-task-definition --task-definition blog-api:42 \
  --query 'taskDefinition.taskDefinitionArn' --output text)

aws ecs update-service \
  --cluster blog-cluster --service blog-api \
  --task-definition $PREV \
  --force-new-deployment
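The revision number above is hard-coded. As a small sketch, the helper below derives the revision just before a given one from its ARN or family:revision string. The name previous_revision is hypothetical, and N-1 is only an assumption about which revision is known to be good; it is not necessarily the one that was last deployed successfully.

```shell
# Derive "family:N-1" from a task definition ARN or a "family:N" string.
previous_revision() {
  ref=${1##*/}       # drop everything up to the last "/" (the ARN prefix, if any)
  family=${ref%:*}   # part before the last ":"
  rev=${ref##*:}     # part after the last ":"
  if [ "$rev" -le 1 ]; then
    echo "no earlier revision for $family" >&2
    return 1
  fi
  echo "$family:$(( rev - 1 ))"
}

# Usage (the current ARN can come from describe-services):
#   CURRENT=$(aws ecs describe-services --cluster blog-cluster --services blog-api \
#     --query 'services[0].taskDefinition' --output text)
#   previous_revision "$CURRENT"
```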

4) Progressive deployment — Canary / Blue-Green #

With the default ECS rolling update, a new task starts receiving traffic as soon as it becomes healthy. When you need a more conservative rollout, CodeDeploy fills that role.

Blue/Green #

The shape of Blue/Green

Blue (currently in production)  ←──── 100% traffic
   │
   ▼
Stand up Green, the new version (Blue stays alive)
   │
   ▼
Validate Green via the ALB Listener's Test traffic
   │
   ▼
Listener's 100% traffic → Green
   │
   ▼
Wait timer (10~60 min) — if no issue, terminate Blue
                       if there's an issue, switch the Listener back to Blue (instant rollback)

The upsides are as follows.

instant rollback — just revert the Listener
explicitly secures time to validate the new version

The downsides are as follows.

double the resources (during deployment)
the ALB Listener pattern is slightly complex (Test listener + Production listener)
heavier setup than ECS Rolling

Canary #

Canary

Linear: shift 10% at a time at a fixed interval (at 5-minute intervals, about 50 minutes to reach 100%)
Canary: shift 10% first, wait 5 minutes, then shift the remaining 90% at once
AllAtOnce: shift 100% immediately (the fastest form of Blue/Green)

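CodeDeploy’s predefined ECS deployment configurations include the following. The predefined linear configurations use 1- or 3-minute intervals; other intervals, such as 5 minutes, need a custom deployment configuration.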

CodeDeployDefault.ECSAllAtOnce
CodeDeployDefault.ECSLinear10PercentEvery1Minutes
CodeDeployDefault.ECSCanary10Percent5Minutes
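To make the difference between the two shapes concrete, here is a small sketch that prints the share of traffic on the new task set over time. It assumes the first increment is applied at the start of the shift, and the function names are hypothetical; it models the schedule only and calls no CodeDeploy API.

```shell
# Print the share of traffic on the new (green) task set over time.
# linear_schedule STEP_PERCENT INTERVAL_MINUTES
linear_schedule() {
  pct=$1
  t=0
  while [ "$pct" -lt 100 ]; do
    echo "t=${t}m green=${pct}%"
    pct=$(( pct + $1 ))
    t=$(( t + $2 ))
  done
  echo "t=${t}m green=100%"
}

# canary_schedule FIRST_PERCENT WAIT_MINUTES
canary_schedule() {
  echo "t=0m green=${1}%"
  echo "t=${2}m green=100%"
}

linear_schedule 10 1   # the shape of ECSLinear10PercentEvery1Minutes
canary_schedule 10 5   # the shape of ECSCanary10Percent5Minutes
```

Linear spreads the risk over many small steps, while Canary concentrates it into one small exposure followed by a single large shift.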

Which one to use #

Situation	Recommendation
small operation / side project	ECS Rolling + Circuit Breaker
high-traffic operation, risky change	CodeDeploy Blue/Green Linear
ML inference / large memory models	Blue/Green (needs warmup time)

This book assumes ECS Rolling + Circuit Breaker as the default. Blue/Green is a story for after traffic has grown.

5) Comparison with CodePipeline #

Besides GitHub Actions, there’s AWS-native CI/CD.

	GitHub Actions	CodePipeline
Trigger	push / PR / schedule	CodeCommit / GitHub / S3 / ECR push
Build	runners pool (hosted/self-hosted)	CodeBuild
Deploy	direct calls or actions	CodeDeploy / ECS / CFN / Lambda
Price	hosted per-minute / self-hosted free	$1/month per pipeline + CodeBuild
Upsides	code and workflow in one place, rich ecosystem	AWS-native integration, consistent IAM
Downsides	OIDC setup / separate secret management	weak integration with external services

If your code is on GitHub, GitHub Actions is the natural choice. If your code is on CodeCommit per company security policy, it’s CodePipeline.

6) Environment separation — dev / staging / prod #

Branch by environment within one workflow.

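.github/workflows/deploy.yml (environment selected by branch)

on:
  push:
    branches: [main, develop]

jobs:
  deploy:
    runs-on: ubuntu-latest
    # A job-level `if` cannot read the matrix context, so the environment
    # is derived from the branch name instead: main -> prod, develop -> dev.
    environment: ${{ github.ref_name == 'main' && 'prod' || 'dev' }}
    steps:
      # Assumption: DEPLOY_ROLE_ARN and ECS_CLUSTER are variables defined on each
      # GitHub environment (github-actions-deploy-dev / -prod, blog-cluster-dev / -prod).
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ vars.DEPLOY_ROLE_ARN }}
          aws-region: ap-northeast-2
      - run: echo "Deploying to ${{ vars.ECS_CLUSTER }}"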

You can attach separate secrets, required reviewers, and a wait timer to each GitHub environment (dev, prod). For the production environment, add required reviewers and a 5-minute wait timer to prevent mistakes. Keep in mind that GitHub lets the job proceed once any one of the listed reviewers approves.

7) Managing secrets and variables #

	Where to put it
AWS Account ID	GitHub vars
cluster name / service name	GitHub vars or workflow env
DB password / API key	AWS Secrets Manager (Chapter 23)
GitHub deploy role ARN	GitHub vars
Slack webhook (CI notifications)	GitHub secrets

The principle is as follows. App secrets go in AWS Secrets Manager, and GitHub secrets hold only the tokens CI itself needs.

Pitfalls — things you often meet in the CI/CD flow #

1) `aws sts get-caller-identity` returns 401 #

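Suspect the OIDC setup. The failure typically surfaces in the Configure AWS credentials step, for example as Not authorized to perform sts:AssumeRoleWithWebIdentity. Check in this order.

Is permissions: id-token: write present in the workflow?
Does the IAM Role trust policy’s sub pattern exactly match what the workflow actually sends (repo:org/repo:ref:..., or repo:org/repo:environment:NAME for a job that declares environment:)?
Is the OIDC Provider thumbprint up to date?
Is the Role’s aud condition sts.amazonaws.com?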
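When the sub check is the suspect, it helps to look at the claims GitHub actually sends rather than guess. The sketch below decodes the payload of a JWT for debugging. The function name jwt_payload is hypothetical, it does not verify the signature, and it assumes a base64 command that accepts -d.

```shell
# Print the decoded payload (claims) of a JWT. Debugging only: no signature check.
jwt_payload() {
  # take the middle segment and convert base64url to standard base64
  payload=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # base64url omits padding; restore it so base64 -d accepts the input
  case $(( ${#payload} % 4 )) in
    2) payload="${payload}==" ;;
    3) payload="${payload}=" ;;
  esac
  printf '%s' "$payload" | base64 -d
}

# Usage inside a job that has id-token: write (both variables are set by the runner):
#   TOKEN=$(curl -s -H "Authorization: bearer $ACTIONS_ID_TOKEN_REQUEST_TOKEN" \
#     "$ACTIONS_ID_TOKEN_REQUEST_URL&audience=sts.amazonaws.com" | jq -r '.value')
#   jwt_payload "$TOKEN" | jq '{sub, aud}'
```

Comparing the printed sub with the trust policy's pattern character by character usually settles the second item on the checklist.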

2) The Service updates even though the Migration failed #

If you don’t check the exit code of run-task, the workflow proceeds to the next step even when the migration fails. Always check the exitCode from aws ecs describe-tasks and exit 1 on non-zero.

3) Deploying with the `latest` tag #

If the Task Definition image is :latest, you can’t trace which code is currently running. Pin the image by its ECR digest (@sha256:...) or use a git SHA tag.
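Resolving a pushed tag to its digest takes one CLI call. The following is a sketch; image_ref_by_digest is a hypothetical helper, and it needs the ecr:DescribeImages permission, which the deploy policy shown earlier does not include.

```shell
# Print "REGISTRY/REPOSITORY@sha256:..." for a tag that has already been pushed.
# Args: REGISTRY REPOSITORY TAG
image_ref_by_digest() {
  digest=$(aws ecr describe-images --repository-name "$2" \
    --image-ids imageTag="$3" \
    --query 'imageDetails[0].imageDigest' --output text)
  echo "$1/$2@$digest"
}

# Usage after the push step (variable names follow the workflow above):
#   image_ref_by_digest "$REGISTRY" "$ECR_REPOSITORY" "$TAG"
```

A digest reference keeps pointing at the same image even if the tag is later moved.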

4) IP shortage for the Migration RunTask #

Free Tier’s default subnet is small on IPs, so the production task + migration task trying to grab IPs at the same time can fail. Separate a dedicated SG / subnet for migrations, or verify there are IPs available in the deploy window.
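One way to verify this before running the migration is to read the subnet's free address count. This is a sketch; require_free_ips is a hypothetical helper, the minimum is yours to choose, and the call needs ec2:DescribeSubnets, which the deploy policy shown earlier does not include.

```shell
# Fail early if the migration subnet has fewer than MIN_FREE addresses left.
# Args: SUBNET_ID MIN_FREE
require_free_ips() {
  free=$(aws ec2 describe-subnets --subnet-ids "$1" \
    --query 'Subnets[0].AvailableIpAddressCount' --output text)
  if [ "$free" -lt "$2" ]; then
    echo "Subnet $1 has only $free free IPs (need $2)" >&2
    return 1
  fi
  echo "Subnet $1 has $free free IPs"
}

# Usage before run-task (the subnet ID is a placeholder):
#   require_free_ips "$MIGRATE_SUBNET_ID" 2 || exit 1
```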

5) The Circuit Breaker rolls back even a healthy deployment #

Too short a health-check grace period + a long boot time makes a healthy deployment look unhealthy. Set health-check-grace-period-seconds to the app’s boot time + margin (e.g., 90 seconds for Django).
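Applied to the service, that setting looks roughly like the sketch below. set_grace_period is a hypothetical wrapper, and the boot time and margin are values you measure for your own app.

```shell
# Set the service's health check grace period to boot time + margin.
# Args: CLUSTER SERVICE BOOT_SECONDS MARGIN_SECONDS
set_grace_period() {
  grace=$(( $3 + $4 ))
  aws ecs update-service --cluster "$1" --service "$2" \
    --health-check-grace-period-seconds "$grace" \
    --query 'service.healthCheckGracePeriodSeconds' --output text
}

# Usage: a 60-second boot plus a 30-second margin gives 90 seconds
#   set_grace_period blog-cluster blog-api 60 30
```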

6) GitHub Actions OIDC audience cache #

You changed sub or aud, but the old value still comes through. A token is issued per job run, so a change only shows up in a newly started job; clearing workflow caches has no effect on it.

7) `ecs-deploy-task-definition` getting stuck #

With wait-for-service-stability: true set, if wait-for-minutes is too short, even a healthy deployment is treated as a failure. Set it conservatively to 15 ~ 20 minutes.

Exercises #

Write two reasons, on the basis of §“GitHub OIDC,” why OIDC is more secure than the IAM-user access-key approach. Also explain in one sentence why setting the Trust Policy’s sub pattern to repo:myorg/blog-api:* is dangerous.
Explain, on the basis of §“Pitfall 2,” what incident arises if you don’t check the exit code at the migration RunTask step, and point to which part of the workflow YAML is responsible for this check.
Lay out, on the basis of the §“Which one to use” table, in which situations you’d choose ECS Rolling + Circuit Breaker versus CodeDeploy Blue/Green. It helps to recall in advance where the deployment_circuit_breaker block goes when moving this deployment configuration into code in Chapter 25 Terraform intro.

In short: GitHub Actions OIDC assumes an AWS role with a short-lived token and no access keys, and restricts which branches or environments can deploy via the Trust Policy’s sub pattern. The workflow goes test → build/push → migration RunTask (check exit code) → Service deploy → wait-stable, and the Deployment Circuit Breaker auto-rolls back on failure. App secrets go in Secrets Manager; only CI’s own tokens go in GitHub secrets.

Next chapter #

Deployment is automated. But the infrastructure itself — VPC / SG / RDS / ALB / ECS — is still held in your hands via the console and CLI. Could you stand up an identical new environment once more? In the next Chapter 25 IaC — Terraform intro we move the infrastructure into code. We cover the shape of provider / resource / state, the S3 + DynamoDB backend, separating dev/prod with modules, and the flow of codifying Chapter 22’s infrastructure step by step.