Chapter 24

CI/CD — GitHub Actions + ECR + ECS

Access-key-free GitHub Actions with OIDC, ECR push, automatic Task Definition updates, ECS Service rolling deployment, deployment circuit breaker and auto-rollback, all the way to CodeDeploy blue/green. A deployment flow that finishes in a single git push.

In Chapter 22 Infrastructure skeleton we stood up an ECS Service by hand, and in Chapter 23 RDS integration we ran RDS and migrations by hand. This chapter binds all that manual work into a single git push.

As the third chapter of Part 4, this chapter covers the following.

  • GitHub Actions ↔ AWS authentication without access keys — OIDC
  • the build → ECR push → Task Definition update → Service update → migration workflow
  • auto-rollback — Deployment Circuit Breaker
  • progressive deployment — CodeDeploy blue/green / canary
  • a comparison with CodePipeline — when to use which

The big picture #

A flow that finishes in one git push
git push (main)
GitHub Actions
   ├─ 1) Test                    ← pytest / npm test
   ├─ 2) AWS OIDC assume-role   ← no access keys
   ├─ 3) Build & push image     ← <git-sha> tag
   │       ECR: blog-api:abc1234
   ├─ 4) Run migrations         ← ecs run-task (blog-api-migrate)
   │       wait → check exit code
   ├─ 5) Update Task Definition ← new revision with new image
   ├─ 6) Update Service          ← rolling deployment
   └─ 7) Wait services-stable    ← 5~10 min
           circuit breaker auto-rollback on failure

The goal of this chapter is to make this flow run in one pass.

1) GitHub OIDC — access-key-free authentication #

The old pattern was to issue access keys from an IAM user and store them in GitHub Secrets. Long-lived keys bring risks: they can leak (for example through a commit or a log), they have to be rotated by hand, and it is hard to trace which workflow run used them.

The OIDC (OpenID Connect) pattern has GitHub issue a short-lived token (valid for only a few minutes) for each workflow job, and has AWS IAM trust that token.

The shape of OIDC
GitHub Actions Job starts
GitHub OIDC Provider issues a JWT
   {sub: "repo:myorg/blog-api:ref:refs/heads/main", aud: "sts.amazonaws.com"}
aws-actions/configure-aws-credentials
   ├─ STS:AssumeRoleWithWebIdentity
   ├─ AWS validates the sub claim against the trust policy
temporary credentials (AccessKey / SecretKey / SessionToken)  TTL: 1h

Once only — registering the IAM OIDC Provider #

OIDC Provider
aws iam create-open-id-connect-provider \
  --url https://token.actions.githubusercontent.com \
  --client-id-list sts.amazonaws.com \
  --thumbprint-list 6938fd4d98bab03faadb97b34396831e3780aea1

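The thumbprint is the SHA-1 fingerprint of a certificate in the TLS chain of GitHub's OIDC endpoint (the certificate authority that signed it, not the site certificate itself). The AWS console fetches it automatically. AWS has also stated that for GitHub's provider it validates the endpoint against its own trusted root CAs, so a stale thumbprint is an unlikely cause of failures today.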

IAM Role — Trust Policy #

Trust Policy of the github-actions-deploy role
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Federated": "arn:aws:iam::123456789012:oidc-provider/token.actions.githubusercontent.com"
    },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
      "StringEquals": {
        "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
      },
      "StringLike": {
        "token.actions.githubusercontent.com:sub": "repo:myorg/blog-api:ref:refs/heads/main"
      }
    }
  }]
}

The sub pattern is the key.

Pattern | Meaning
repo:myorg/blog-api:ref:refs/heads/main | main branch only
repo:myorg/blog-api:ref:refs/tags/* | tag pushes only
repo:myorg/blog-api:environment:production | only after passing the environment gate
repo:myorg/blog-api:* | dangerous: this role can be used from any workflow run, even a PR

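The production recommendation is environment gate + main/tag only. One caveat: when a job declares environment:, GitHub issues the sub claim in the environment form (repo:myorg/blog-api:environment:production), and the claim no longer contains the ref. The deploy job in this chapter sets environment: production, so the role it assumes must trust that environment pattern; the main/tag restriction is then enforced on the GitHub side by the environment's deployment branch rules.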
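
To see why the wildcard pattern is dangerous, the matching can be imitated locally. This is a rough sketch: `string_like` is a hypothetical helper that approximates IAM's StringLike operator (`*` for any run of characters, `?` for exactly one), and the sample sub values follow the formats GitHub documents for branch pushes, pull requests, and environments.

```python
import re


def string_like(pattern: str, value: str) -> bool:
    """Approximate IAM StringLike: '*' matches any run, '?' matches one character."""
    regex = "".join(
        ".*" if ch == "*" else "." if ch == "?" else re.escape(ch)
        for ch in pattern
    )
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None


MAIN_ONLY = "repo:myorg/blog-api:ref:refs/heads/main"
ANYTHING = "repo:myorg/blog-api:*"

# sub claims GitHub would send for different kinds of runs
subs = {
    "push to main": "repo:myorg/blog-api:ref:refs/heads/main",
    "pull request": "repo:myorg/blog-api:pull_request",
    "production environment": "repo:myorg/blog-api:environment:production",
}

for label, sub in subs.items():
    main_only = string_like(MAIN_ONLY, sub)
    wildcard = string_like(ANYTHING, sub)
    print(f"{label:24} main-only={main_only!s:5} wildcard={wildcard}")
```

The wildcard pattern accepts all three subjects, including the pull request, while the main-only pattern accepts just the first.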

Permissions Policy #

Grant only the actions deployment needs.

github-actions-deploy permissions
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ECR",
      "Effect": "Allow",
      "Action": [
        "ecr:GetAuthorizationToken",
        "ecr:BatchCheckLayerAvailability",
        "ecr:GetDownloadUrlForLayer",
        "ecr:PutImage",
        "ecr:InitiateLayerUpload",
        "ecr:UploadLayerPart",
        "ecr:CompleteLayerUpload"
      ],
      "Resource": "*"
    },
    {
      "Sid": "ECS",
      "Effect": "Allow",
      "Action": [
        "ecs:RegisterTaskDefinition",
        "ecs:DescribeTaskDefinition",
        "ecs:UpdateService",
        "ecs:DescribeServices",
        "ecs:RunTask",
        "ecs:DescribeTasks",
        "ecs:ListTasks"
      ],
      "Resource": "*"
    },
    {
      "Sid": "PassRole",
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": [
        "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
        "arn:aws:iam::123456789012:role/blog-api-task-role"
      ]
    }
  ]
}

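If iam:PassRole is missing, RegisterTaskDefinition fails. Attaching an IAM role to a Task Definition counts as “passing” that role to ECS, so it needs its own permission. Note that the workflow below also calls aws logs tail when a migration fails; reading those logs needs CloudWatch Logs read permissions (such as logs:FilterLogEvents) that this policy does not include.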
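
A quick pre-flight check can catch a missing PassRole entry before the deploy does. The sketch below is an illustration under simplifying assumptions: `missing_pass_role` is a hypothetical helper, it only understands exact ARN matches and a bare `*` resource, and it ignores Deny statements and conditions.

```python
def _as_list(value):
    """IAM accepts either a single string or a list for Action and Resource."""
    if value is None:
        return []
    return [value] if isinstance(value, (str, dict)) else list(value)


def missing_pass_role(policy: dict, task_definition: dict) -> list[str]:
    """Return role ARNs used by the task definition that iam:PassRole does not cover."""
    allowed: set[str] = set()
    for statement in _as_list(policy.get("Statement")):
        if statement.get("Effect") != "Allow":
            continue
        if "iam:PassRole" not in _as_list(statement.get("Action")):
            continue
        allowed.update(_as_list(statement.get("Resource")))

    used = [task_definition.get(key) for key in ("executionRoleArn", "taskRoleArn")]
    return [arn for arn in used if arn and "*" not in allowed and arn not in allowed]


# Example: the policy forgot the task role.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": "iam:PassRole",
        "Resource": ["arn:aws:iam::123456789012:role/ecsTaskExecutionRole"],
    }],
}
task_definition = {
    "family": "blog-api",
    "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
    "taskRoleArn": "arn:aws:iam::123456789012:role/blog-api-task-role",
}

print(missing_pass_role(policy, task_definition))
```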

2) The GitHub Actions workflow #

.github/workflows/deploy.yml
name: Deploy to ECS

on:
  push:
    branches: [main]
  workflow_dispatch:

permissions:
  id-token: write   # OIDC token issuance — required
  contents: read

env:
  AWS_REGION: ap-northeast-2
  ECR_REPOSITORY: blog-api
  ECS_CLUSTER: blog-cluster
  ECS_SERVICE: blog-api
  TASK_FAMILY: blog-api
  MIGRATE_FAMILY: blog-api-migrate

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.14" }
      - run: pip install -r requirements.txt -r requirements-dev.txt
      - run: pytest -q

  deploy:
    needs: test
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4

      # 1) AWS OIDC
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-actions-deploy
          aws-region: ${{ env.AWS_REGION }}

      # 2) ECR login
      - name: Login to ECR
        id: ecr
        uses: aws-actions/amazon-ecr-login@v2

      # 3) Build & push
      - name: Build and push
        id: build
        env:
          REGISTRY: ${{ steps.ecr.outputs.registry }}
          TAG: ${{ github.sha }}
        run: |
          docker build --platform=linux/amd64 \
            -t $REGISTRY/$ECR_REPOSITORY:$TAG \
            -t $REGISTRY/$ECR_REPOSITORY:latest .
          docker push $REGISTRY/$ECR_REPOSITORY:$TAG
          docker push $REGISTRY/$ECR_REPOSITORY:latest
          echo "image=$REGISTRY/$ECR_REPOSITORY:$TAG" >> $GITHUB_OUTPUT

      # 4) Migration RunTask
      - name: Run DB migrations
        env:
          IMAGE: ${{ steps.build.outputs.image }}
        run: |
          set -euo pipefail

          # register a new revision of the migration task definition with the new image
          DEF=$(aws ecs describe-task-definition --task-definition "$MIGRATE_FAMILY" \
            --query 'taskDefinition' --output json)
          # keep only the fields register-task-definition accepts, and drop nulls
          # (e.g. an unset taskRoleArn), which --cli-input-json rejects
          NEW=$(echo "$DEF" | jq --arg I "$IMAGE" \
            '.containerDefinitions[0].image=$I |
             {family,taskRoleArn,executionRoleArn,networkMode,containerDefinitions,
              volumes,placementConstraints,requiresCompatibilities,cpu,memory} |
             with_entries(select(.value != null))')
          NEW_ARN=$(aws ecs register-task-definition \
            --cli-input-json "$NEW" \
            --query 'taskDefinition.taskDefinitionArn' --output text)

          # RunTask (the shorthand network configuration is kept on one line)
          TASK_ARN=$(aws ecs run-task --cluster "$ECS_CLUSTER" \
            --task-definition "$NEW_ARN" --launch-type FARGATE \
            --network-configuration "awsvpcConfiguration={subnets=[${{ secrets.MIGRATE_SUBNET_ID }}],securityGroups=[${{ secrets.FARGATE_SG_ID }}],assignPublicIp=ENABLED}" \
            --started-by "deploy-${{ github.sha }}" \
            --query 'tasks[0].taskArn' --output text)

          # run-task can return no task at all (e.g. placement failure)
          if [ -z "$TASK_ARN" ] || [ "$TASK_ARN" = "None" ]; then
            echo "Migration task could not be started"
            exit 1
          fi

          echo "Migration task: $TASK_ARN"
          aws ecs wait tasks-stopped --cluster "$ECS_CLUSTER" --tasks "$TASK_ARN"

          # check exit code (fail if non-zero, or if the container never ran)
          EXIT=$(aws ecs describe-tasks --cluster "$ECS_CLUSTER" --tasks "$TASK_ARN" \
            --query 'tasks[0].containers[0].exitCode' --output text)
          if [ "$EXIT" != "0" ]; then
            echo "Migration failed (exit=$EXIT)"
            aws logs tail /ecs/blog-api-migrate --since 10m
            exit 1
          fi

      # 5) Update Service Task Definition
      - name: Render service task definition
        id: render
        uses: aws-actions/amazon-ecs-render-task-definition@v1
        with:
          task-definition: ops/task-definition.json
          container-name: api
          image: ${{ steps.build.outputs.image }}

      # 6) Deploy to ECS Service
      - name: Deploy
        uses: aws-actions/amazon-ecs-deploy-task-definition@v2
        with:
          task-definition: ${{ steps.render.outputs.task-definition }}
          service: ${{ env.ECS_SERVICE }}
          cluster: ${{ env.ECS_CLUSTER }}
          wait-for-service-stability: true
          wait-for-minutes: 15

The key steps laid out:

Step | Meaning
id-token: write | permission to request an OIDC token. Without it the job cannot obtain a token, so configure-aws-credentials fails before it ever calls STS
environment: production | GitHub environment gate: manual approval, secret isolation
aws-actions/amazon-ecs-render-task-definition | base JSON + new image → generate new JSON
aws-actions/amazon-ecs-deploy-task-definition | RegisterTaskDefinition + UpdateService + wait
wait-for-service-stability | wait until the service is stable; the step fails if it does not stabilize in time
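
The "base JSON + new image" step is simple enough to imitate in a few lines. This is a sketch of the idea only, not the action's actual implementation: `render_task_definition` is a hypothetical function, and the base document is a trimmed stand-in for ops/task-definition.json.

```python
import copy


def render_task_definition(base: dict, container_name: str, image: str) -> dict:
    """Return a copy of the task definition with one container's image replaced."""
    rendered = copy.deepcopy(base)  # leave the checked-in base untouched
    for container in rendered["containerDefinitions"]:
        if container["name"] == container_name:
            container["image"] = image
            return rendered
    raise ValueError(f"no container named {container_name!r} in task definition")


# Trimmed stand-in for ops/task-definition.json
base = {
    "family": "blog-api",
    "containerDefinitions": [
        {"name": "api", "image": "PLACEHOLDER", "essential": True},
    ],
}

new_image = "123456789012.dkr.ecr.ap-northeast-2.amazonaws.com/blog-api:abc1234"
rendered = render_task_definition(base, "api", new_image)
print(rendered["containerDefinitions"][0]["image"])
```

Because only the image field changes, everything else in the checked-in JSON (CPU, memory, roles, environment) stays under version control.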

3) Deployment Circuit Breaker — auto-rollback #

A feature we touched on briefly in Chapter 22 Infrastructure skeleton. If a new deployment doesn’t come up, it automatically reverts to the previous task definition.

Enabling the Circuit Breaker on the Service
aws ecs update-service \
  --cluster blog-cluster --service blog-api \
  --deployment-configuration \
    "deploymentCircuitBreaker={enable=true,rollback=true},maximumPercent=200,minimumHealthyPercent=100"

How it works:

  1. When a new task fails to start or fails its health checks, ECS counts the failure.
  2. Once the failures reach a threshold derived from the service's desired count, ECS marks the deployment as failed.
  3. If rollback=true, ECS automatically redeploys the last task definition that reached a steady state.
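
The failure threshold in step 2 scales with the service size. The sketch below assumes the rule as AWS describes it at the time of writing: half the desired count, rounded up, with a floor of 3 and a ceiling of 200. Treat those three numbers as assumptions and check them against the current ECS documentation before relying on them.

```python
import math

# Assumed bounds, taken from AWS's description of the circuit breaker.
MIN_THRESHOLD = 3
MAX_THRESHOLD = 200


def failure_threshold(desired_count: int) -> int:
    """Failed task launches tolerated before the deployment is marked failed."""
    scaled = math.ceil(0.5 * desired_count)
    return min(MAX_THRESHOLD, max(MIN_THRESHOLD, scaled))


for desired in (1, 10, 25, 400):
    print(desired, failure_threshold(desired))
```

For a small service the floor dominates, which is why a one-task side project still needs several failed launches before a rollback begins.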

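In the GitHub Actions workflow, the wait-for-service-stability wait is what surfaces this: a deployment that never stabilizes fails the step, so the workflow fails too. Because a rolled-back service becomes stable again on the previous revision, it is also worth confirming that the primary deployment's rolloutState is COMPLETED and that it runs the new task definition.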
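
One way to tell a real success from a rollback is to inspect the service after the wait. The following sketch works on one entry of the `services` list returned by `aws ecs describe-services`; `deployment_succeeded` is a hypothetical helper, and the sample data is hand-written to show a rolled-back service.

```python
def deployment_succeeded(service: dict, expected_task_definition: str) -> bool:
    """True only if the PRIMARY deployment completed on the expected revision."""
    primary = next(
        (d for d in service.get("deployments", []) if d.get("status") == "PRIMARY"),
        None,
    )
    if primary is None:
        return False
    return (
        primary.get("rolloutState") == "COMPLETED"
        and primary.get("taskDefinition") == expected_task_definition
    )


ARN_PREFIX = "arn:aws:ecs:ap-northeast-2:123456789012:task-definition/blog-api"

# After a circuit-breaker rollback the service is stable again, but on revision 41.
rolled_back = {
    "serviceName": "blog-api",
    "deployments": [
        {"status": "PRIMARY", "rolloutState": "COMPLETED",
         "taskDefinition": f"{ARN_PREFIX}:41"},
    ],
}

print(deployment_succeeded(rolled_back, f"{ARN_PREFIX}:42"))
print(deployment_succeeded(rolled_back, f"{ARN_PREFIX}:41"))
```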

Manual rollback #

For when auto-rollback doesn’t trigger or you need to investigate after the fact.

Manual rollback to a previous revision
PREV=$(aws ecs describe-task-definition --task-definition blog-api:42 \
  --query 'taskDefinition.taskDefinitionArn' --output text)

aws ecs update-service \
  --cluster blog-cluster --service blog-api \
  --task-definition $PREV \
  --force-new-deployment

4) Progressive deployment — Canary / Blue-Green #

Default ECS rolling has a new task take traffic the moment it becomes healthy. If you need a more conservative shape, CodeDeploy plays the role.

Blue/Green #

The shape of Blue/Green
Blue (currently in production)  ←──── 100% traffic
Stand up Green, the new version (Blue stays alive)
Validate Green via the ALB Listener's Test traffic
Listener's 100% traffic → Green
Wait timer (10~60 min) — if no issue, terminate Blue
                       if there's an issue, return to Blue with one Listener line (instant rollback)

The upsides are as follows.

  • instant rollback — just revert the Listener
  • explicitly secures time to validate the new version

The downsides are as follows.

  • double the resources (during deployment)
  • the ALB Listener pattern is slightly complex (Test listener + Production listener)
  • heavier setup than ECS Rolling

Canary #

Canary
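Linear (10% every 5 minutes): 100% after roughly 45 to 50 minutes; this interval needs a custom deployment configuration, since the predefined ECS linear configurations shift every 1 or 3 minutes
Canary (10% → wait 5 minutes → remaining 90% at once)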
AllAtOnce (immediately 100% — the fastest shape of Blue/Green)

CodeDeploy’s deployment configuration names are as follows.

  • CodeDeployDefault.ECSAllAtOnce
  • CodeDeployDefault.ECSLinear10PercentEvery1Minutes
  • CodeDeployDefault.ECSCanary10Percent5Minutes
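
The difference between the linear and canary shapes is easiest to see as a timeline. This is an illustrative sketch only: both functions are hypothetical, and it assumes the first increment is shifted at minute 0, which is why the linear schedule finishes one interval earlier than "steps × interval" would suggest.

```python
def linear_schedule(step_percent: int, interval_minutes: int) -> list[tuple[int, int]]:
    """(minute, percent of traffic on the new version) for a linear shift."""
    schedule = []
    percent, minute = 0, 0
    while percent < 100:
        percent = min(100, percent + step_percent)
        schedule.append((minute, percent))
        minute += interval_minutes
    return schedule


def canary_schedule(first_percent: int, wait_minutes: int) -> list[tuple[int, int]]:
    """Shift a small slice first, then everything else after the wait."""
    return [(0, first_percent), (wait_minutes, 100)]


# Shapes corresponding to ECSLinear10PercentEvery1Minutes and ECSCanary10Percent5Minutes
print(linear_schedule(10, 1))
print(canary_schedule(10, 5))
```

With linear shifting, a bad release is exposed to a slowly growing share of users; with canary, it is exposed to a fixed small share for the whole wait and then to everyone at once.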

Which one to use #

Situation | Recommendation
small operation / side project | ECS Rolling + Circuit Breaker
high-traffic operation, risky change | CodeDeploy Blue/Green Linear
ML inference / large memory models | Blue/Green (needs warmup time)

This book assumes ECS Rolling + Circuit Breaker as the default. Blue/Green is a story for after traffic has grown.

5) Comparison with CodePipeline #

Besides GitHub Actions, there’s AWS-native CI/CD.

Aspect | GitHub Actions | CodePipeline
Trigger | push / PR / schedule | CodeCommit / GitHub / S3 / ECR push
Build | runners pool (hosted/self-hosted) | CodeBuild
Deploy | direct calls or actions | CodeDeploy / ECS / CFN / Lambda
Price | hosted per-minute / self-hosted free | $1/month per pipeline + CodeBuild
Upsides | code and workflow in one place, rich ecosystem | AWS-native integration, consistent IAM
Downsides | OIDC setup / separate secret management | weak integration with external services

If your code is on GitHub, GitHub Actions is the natural choice. If your code is on CodeCommit per company security policy, it’s CodePipeline.

6) Environment separation — dev / staging / prod #

Branch by environment within one workflow.

.github/workflows/deploy.yml (environment selected by branch)
on:
  push:
    branches: [main, develop]

jobs:
  deploy:
    runs-on: ubuntu-latest
    # A job-level `if` cannot read the matrix context, so a matrix filtered by
    # branch does not work here. Derive the target from the branch instead.
    environment: ${{ github.ref == 'refs/heads/main' && 'prod' || 'dev' }}
    env:
      ECS_CLUSTER: ${{ github.ref == 'refs/heads/main' && 'blog-cluster-prod' || 'blog-cluster-dev' }}
      DEPLOY_ROLE: arn:aws:iam::123456789012:role/github-actions-deploy-${{ github.ref == 'refs/heads/main' && 'prod' || 'dev' }}

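You can attach separate secrets, required reviewers, and a wait timer to GitHub environments (dev, prod). For the production environment, attach required reviewers + a 5-minute wait timer to prevent mistakes. Note that GitHub needs only one of the listed reviewers to approve, so a strict 2-person approval cannot be expressed with this setting alone.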
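
The branch-to-environment mapping is small, but a typo in it sends a deploy to the wrong cluster, so it can be worth writing down as data and testing. A minimal sketch, with a hypothetical `resolve_target` helper and the names used in this section:

```python
TARGETS = {
    "refs/heads/develop": {
        "env": "dev",
        "cluster": "blog-cluster-dev",
        "role": "arn:aws:iam::123456789012:role/github-actions-deploy-dev",
    },
    "refs/heads/main": {
        "env": "prod",
        "cluster": "blog-cluster-prod",
        "role": "arn:aws:iam::123456789012:role/github-actions-deploy-prod",
    },
}


def resolve_target(ref: str) -> dict:
    """Return the deploy target for a git ref, refusing anything unmapped."""
    try:
        return TARGETS[ref]
    except KeyError:
        raise ValueError(f"no deploy target configured for {ref!r}") from None


print(resolve_target("refs/heads/main")["cluster"])
```

Refusing unknown refs outright, rather than falling back to dev, keeps a feature branch from deploying anywhere by accident.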

7) Managing secrets and variables #

Item | Where to put it
AWS Account ID | GitHub vars
cluster name / service name | GitHub vars or workflow env
DB password / API key | AWS Secrets Manager (Chapter 23)
GitHub deploy role ARN | GitHub vars
Slack webhook (CI notifications) | GitHub secrets

The principle is as follows. App secrets go in AWS Secrets Manager, and GitHub secrets hold only the tokens CI itself needs.

Pitfalls — things you often meet in the CI/CD flow #

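1) configure-aws-credentials fails, or aws sts get-caller-identity has no credentials #

Suspect the OIDC setup. A typical symptom is the error "Not authorized to perform sts:AssumeRoleWithWebIdentity". The order of checks is as follows.

  • Is permissions: id-token: write present in the workflow?
  • Does the IAM Role trust policy’s sub pattern exactly match what the job sends? A job with environment: set sends repo:org/repo:environment:<name>, not repo:org/repo:ref:...
  • Is the Role’s aud condition sts.amazonaws.com?
  • Is the OIDC Provider registered in this account, with a current thumbprint?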

2) The Service updates even though the Migration failed #

If you don’t check the exit code of run-task, the workflow proceeds to the next step even when the migration fails. Always check the exitCode from aws ecs describe-tasks and exit 1 on non-zero.
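
The exit-code check has one subtle case: a container that never started has no exit code at all, and that must count as a failure too. The sketch below interprets a `describe-tasks` response; `migration_failed` is a hypothetical helper and the sample responses are hand-written.

```python
def migration_failed(response: dict) -> tuple[bool, str]:
    """Decide from a describe-tasks response whether the migration failed."""
    failures = response.get("failures") or []
    if failures:
        return True, f"describe-tasks failure: {failures[0].get('reason')}"

    tasks = response.get("tasks") or []
    if not tasks:
        return True, "no task returned"

    task = tasks[0]
    for container in task.get("containers", []):
        name = container.get("name", "?")
        code = container.get("exitCode")
        if code is None:
            # The container never ran, e.g. the image could not be pulled.
            reason = container.get("reason") or task.get("stoppedReason") or "unknown"
            return True, f"{name} has no exit code ({reason})"
        if code != 0:
            return True, f"{name} exited with {code}"
    return False, "all containers exited with 0"


ok = {"tasks": [{"containers": [{"name": "migrate", "exitCode": 0}]}]}
bad = {"tasks": [{"containers": [{"name": "migrate", "exitCode": 1}]}]}
never_ran = {"tasks": [{"stoppedReason": "CannotPullContainerError",
                        "containers": [{"name": "migrate"}]}]}

for sample in (ok, bad, never_ran):
    print(migration_failed(sample))
```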

3) Deploying with the latest tag #

If the Task Definition image is :latest, you can’t trace which code is currently running. Specify down to the ECR image digest (@sha256:...) or use a git SHA tag.
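
A small guard can reject mutable image references before a task definition is registered. This is a sketch under a simple assumption: an image counts as traceable if it carries a digest or any tag other than latest. `is_pinned` is a hypothetical helper.

```python
def is_pinned(image: str) -> bool:
    """True if the image reference has a digest or a tag other than 'latest'."""
    if "@sha256:" in image:
        return True
    # Look for a tag only in the last path segment, so a registry port is not mistaken for one.
    last_segment = image.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return False  # no tag means the implicit 'latest'
    tag = last_segment.rsplit(":", 1)[1]
    return tag != "latest"


REPO = "123456789012.dkr.ecr.ap-northeast-2.amazonaws.com/blog-api"

for ref in (f"{REPO}:latest", f"{REPO}:abc1234", f"{REPO}@sha256:{'0' * 64}", REPO):
    print(is_pinned(ref), ref)
```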

4) IP shortage for the Migration RunTask #

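A small subnet can run out of free IPs. Every Fargate task takes its own network interface and private IP, and during a rolling deployment the old tasks, the new tasks, and the migration task can all hold addresses at once (a /28 subnet, for example, has only 11 usable addresses after the 5 that AWS reserves). Use a dedicated, sufficiently large subnet for migrations, or verify there are IPs available in the deploy window.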
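
The headroom can be estimated before the deploy. The sketch below uses Python's `ipaddress` module and the fact that AWS reserves 5 addresses in every subnet; the CIDR blocks and task counts are made-up examples, and `can_place` is a hypothetical helper.

```python
import ipaddress

AWS_RESERVED_PER_SUBNET = 5  # the first four addresses and the last one


def usable_ips(cidr: str) -> int:
    """Addresses in the subnet that tasks and other resources can actually use."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET


def can_place(cidr: str, in_use: int, needed: int) -> bool:
    """True if the subnet still has room for `needed` more network interfaces."""
    return usable_ips(cidr) - in_use >= needed


print(usable_ips("10.0.1.0/28"))
print(usable_ips("10.0.1.0/24"))

# 4 running tasks, a rolling deploy that may start 4 more, plus 1 migration task
print(can_place("10.0.1.0/28", in_use=4, needed=4 + 1))
# The same deploy when other resources already hold 4 more addresses
print(can_place("10.0.1.0/28", in_use=8, needed=4 + 1))
```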

5) The Circuit Breaker rolls back even a healthy deployment #

Too short a health-check grace period + a long boot time makes a healthy deployment look unhealthy. Set health-check-grace-period-seconds to the app’s boot time + margin (e.g., 90 seconds for Django).

6) GitHub Actions OIDC audience cache #

You changed sub or aud, but the old value still seems to come through. OIDC tokens are issued per job and are not part of any workflow cache, so clearing caches does nothing; start a new job so that a new token is issued.

7) ecs-deploy-task-definition getting stuck #

With wait-for-service-stability: true set, if wait-for-minutes is too short, even a healthy deployment is treated as a failure. Set it conservatively to 15 ~ 20 minutes.

Exercises #

  1. Write two reasons, on the basis of §“GitHub OIDC,” why OIDC is more secure than the IAM-user access-key approach. Also explain in one sentence why setting the Trust Policy’s sub pattern to repo:myorg/blog-api:* is dangerous.
  2. Explain, on the basis of §“Pitfall 2,” what incident arises if you don’t check the exit code at the migration RunTask step, and point to which part of the workflow YAML is responsible for this check.
  3. Lay out, on the basis of the §“Which one to use” table, in which situations you’d choose ECS Rolling + Circuit Breaker versus CodeDeploy Blue/Green. It helps to recall in advance where the deployment_circuit_breaker block goes when moving this deployment configuration into code in Chapter 25 Terraform intro.

In short: GitHub Actions OIDC assumes an AWS role with a short-lived token and no access keys, and restricts which branches or environments can deploy via the Trust Policy’s sub pattern. The workflow goes test → build/push → migration RunTask (check exit code) → Service deploy → wait-stable, and the Deployment Circuit Breaker auto-rolls back on failure. App secrets go in Secrets Manager; only CI’s own tokens go in GitHub secrets.

Next chapter #

Deployment is automated. But the infrastructure itself — VPC / SG / RDS / ALB / ECS — is still held in your hands via the console and CLI. Could you stand up an identical new environment once more? In the next Chapter 25 IaC — Terraform intro we move the infrastructure into code. We cover the shape of provider / resource / state, the S3 + DynamoDB backend, separating dev/prod with modules, and the flow of codifying Chapter 22’s infrastructure step by step.
