AWS in Practice #3: CI/CD — GitHub Actions + ECR + ECS

In #1 we launched the ECS Service by hand, and in #2 we ran RDS and migrations by hand. This post bundles all that manual work into a single git push.

What we’ll cover:

  • GitHub Actions ↔ AWS auth without access keys — OIDC
  • The build → ECR push → Task Definition update → Service update → migration workflow
  • Auto rollback — Deployment Circuit Breaker
  • Progressive deploy — a touch of CodeDeploy blue/green / canary
  • CodePipeline comparison — when to use which

The big picture #

A single git push that does everything
git push (main)
GitHub Actions
   ├─ 1) Test                    ← pytest / npm test
   ├─ 2) AWS OIDC assume-role   ← no access keys
   ├─ 3) Build & push image     ← <git-sha> tag
   │       ECR: blog-api:abc1234
   ├─ 4) Run migrations         ← ecs run-task (blog-api-migrate)
   │       wait → check exit code
   ├─ 5) Update Task Definition ← new revision with new image
   ├─ 6) Update Service          ← rolling deploy
   └─ 7) Wait services-stable    ← 5–10 min
           on failure, circuit breaker auto-rollback

This post’s goal is making this flow run in one go.

1) GitHub OIDC — auth without access keys #

The old pattern: IAM user → access key → saved in GitHub Secrets. It is risky: long-lived keys can leak (for example into git history or logs), have to be rotated by hand, and are hard to audit.

In the OIDC (OpenID Connect) pattern, GitHub issues a short-lived token (valid for minutes) for each job, and AWS IAM is configured to trust it.

OIDC shape
GitHub Actions Job starts
GitHub OIDC Provider issues a JWT
   {sub: "repo:myorg/blog-api:ref:refs/heads/main", aud: "sts.amazonaws.com"}
aws-actions/configure-aws-credentials
   ├─ STS:AssumeRoleWithWebIdentity
   ├─ AWS validates sub claim against trust policy
Temporary credentials (AccessKey / SecretKey / SessionToken)  TTL: 1h

One-time — register the IAM OIDC Provider #

OIDC Provider
aws iam create-open-id-connect-provider \
  --url https://token.actions.githubusercontent.com \
  --client-id-list sts.amazonaws.com \
  --thumbprint-list 6938fd4d98bab03faadb97b34396831e3780aea1

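The thumbprint is the SHA-1 fingerprint of a certificate in the chain that serves GitHub's OIDC endpoint, so it can change when that chain is reissued. If you create the provider in the AWS console, the thumbprint is fetched automatically.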
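For reference, a thumbprint of this kind is just a SHA-1 digest over a certificate's DER bytes. A minimal Python sketch, assuming you already hold the certificate in PEM or DER form; the function names are hypothetical, and which certificate in the chain to hash is left to the AWS documentation:

```python
import hashlib
import ssl


def der_thumbprint(der_bytes: bytes) -> str:
    """Return the lowercase hex SHA-1 digest of DER-encoded certificate bytes."""
    return hashlib.sha1(der_bytes).hexdigest()


def pem_thumbprint(pem_text: str) -> str:
    """Convert a PEM certificate to DER, then fingerprint it."""
    return der_thumbprint(ssl.PEM_cert_to_DER_cert(pem_text))
```

The result is a 40-character hex string, the same shape as the value passed to --thumbprint-list.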

IAM Role — Trust Policy #

Trust policy for the github-actions-deploy role
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Federated": "arn:aws:iam::123456789012:oidc-provider/token.actions.githubusercontent.com"
    },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
      "StringEquals": {
        "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
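      },
      "StringLike": {
        "token.actions.githubusercontent.com:sub": "repo:myorg/blog-api:environment:production"
      }
    }
  }]
}

The pattern in sub is the key. A job that declares environment: sends the environment form of sub instead of the ref form, and the deploy job in this post runs with environment: production, so the policy matches repo:myorg/blog-api:environment:production.

Pattern                                       Meaning
repo:myorg/blog-api:ref:refs/heads/main       Only jobs on the main branch (with no environment set)
repo:myorg/blog-api:ref:refs/tags/*           Only tag pushes (with no environment set)
repo:myorg/blog-api:environment:production    Only jobs that run in the production environment
repo:myorg/blog-api:*                         Risky: any workflow in the repo, including PRs, can use this role

Production recommendation: match on the environment, and restrict which branches or tags may deploy to it with the environment's deployment branch rules in GitHub.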
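To see how these patterns behave, the two trust-policy conditions can be approximated in a few lines of Python. This is only a sketch of the matching logic, not how IAM is implemented; the function name and the sample claims are assumptions made for illustration:

```python
from fnmatch import fnmatchcase


def trust_policy_allows(claims: dict, sub_pattern: str,
                        audience: str = "sts.amazonaws.com") -> bool:
    """Approximate the aud (exact) and sub (wildcard) conditions.

    fnmatchcase is case-sensitive and supports '*' and '?', like StringLike.
    It also understands '[...]' sets, which IAM does not, so avoid brackets.
    """
    return (claims.get("aud") == audience
            and fnmatchcase(claims.get("sub", ""), sub_pattern))


AUD = "sts.amazonaws.com"
main_job = {"sub": "repo:myorg/blog-api:ref:refs/heads/main", "aud": AUD}
env_job = {"sub": "repo:myorg/blog-api:environment:production", "aud": AUD}
pr_job = {"sub": "repo:myorg/blog-api:pull_request", "aud": AUD}

# A job with an environment only matches the environment pattern.
print(trust_policy_allows(env_job, "repo:myorg/blog-api:environment:production"))
print(trust_policy_allows(env_job, "repo:myorg/blog-api:ref:refs/heads/main"))
# The catch-all pattern also lets a pull request job through.
print(trust_policy_allows(pr_job, "repo:myorg/blog-api:*"))
```

The last call is the reason the catch-all pattern is marked risky.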

Permissions Policy #

Only the actions needed for deployment:

github-actions-deploy permissions
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ECR",
      "Effect": "Allow",
      "Action": [
        "ecr:GetAuthorizationToken",
        "ecr:BatchCheckLayerAvailability",
        "ecr:GetDownloadUrlForLayer",
        "ecr:PutImage",
        "ecr:InitiateLayerUpload",
        "ecr:UploadLayerPart",
        "ecr:CompleteLayerUpload"
      ],
      "Resource": "*"
    },
    {
      "Sid": "ECS",
      "Effect": "Allow",
      "Action": [
        "ecs:RegisterTaskDefinition",
        "ecs:DescribeTaskDefinition",
        "ecs:UpdateService",
        "ecs:DescribeServices",
        "ecs:RunTask",
        "ecs:DescribeTasks",
        "ecs:ListTasks"
      ],
      "Resource": "*"
    },
    {
      "Sid": "PassRole",
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": [
        "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
        "arn:aws:iam::123456789012:role/blog-api-task-role"
      ]
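    },
    {
      "Sid": "MigrationLogs",
      "Effect": "Allow",
      "Action": "logs:FilterLogEvents",
      "Resource": "arn:aws:logs:ap-northeast-2:123456789012:log-group:/ecs/blog-api-migrate:*"
    }
  ]
}

Without iam:PassRole, RegisterTaskDefinition fails: embedding an IAM role in a Task Definition counts as “passing” that role, which requires explicit permission. The MigrationLogs statement is only there so the workflow can read the migration log group with aws logs tail when a migration fails.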
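A quick way to catch a missing PassRole entry before the deploy does is to compare the roles a task definition embeds with the policy's Resource list. A minimal sketch with hypothetical function names, assuming the Resource list holds exact ARNs (no wildcards):

```python
def roles_needing_passrole(task_definition: dict) -> set[str]:
    """Collect the role ARNs a task definition embeds (task role, execution role)."""
    keys = ("taskRoleArn", "executionRoleArn")
    return {task_definition[k] for k in keys if task_definition.get(k)}


def missing_passrole(task_definition: dict, allowed_resources: list[str]) -> set[str]:
    """Return embedded roles that the iam:PassRole Resource list does not cover."""
    return roles_needing_passrole(task_definition) - set(allowed_resources)


task_def = {
    "family": "blog-api",
    "taskRoleArn": "arn:aws:iam::123456789012:role/blog-api-task-role",
    "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
}
allowed = ["arn:aws:iam::123456789012:role/ecsTaskExecutionRole"]
print(missing_passrole(task_def, allowed))  # the task role is not covered
```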

2) GitHub Actions workflow #

.github/workflows/deploy.yml
name: Deploy to ECS

on:
  push:
    branches: [main]
  workflow_dispatch:

permissions:
  id-token: write   # OIDC token issuance — required
  contents: read

env:
  AWS_REGION: ap-northeast-2
  ECR_REPOSITORY: blog-api
  ECS_CLUSTER: blog-cluster
  ECS_SERVICE: blog-api
  TASK_FAMILY: blog-api
  MIGRATE_FAMILY: blog-api-migrate

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.14" }
      - run: pip install -r requirements.txt -r requirements-dev.txt
      - run: pytest -q

  deploy:
    needs: test
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4

      # 1) AWS OIDC
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-actions-deploy
          aws-region: ${{ env.AWS_REGION }}

      # 2) ECR login
      - name: Login to ECR
        id: ecr
        uses: aws-actions/amazon-ecr-login@v2

      # 3) Build & push
      - name: Build and push
        id: build
        env:
          REGISTRY: ${{ steps.ecr.outputs.registry }}
          TAG: ${{ github.sha }}
        run: |
          docker build --platform=linux/amd64 \
            -t $REGISTRY/$ECR_REPOSITORY:$TAG \
            -t $REGISTRY/$ECR_REPOSITORY:latest .
          docker push $REGISTRY/$ECR_REPOSITORY:$TAG
          docker push $REGISTRY/$ECR_REPOSITORY:latest
          echo "image=$REGISTRY/$ECR_REPOSITORY:$TAG" >> $GITHUB_OUTPUT

      # 4) Migration RunTask
      - name: Run DB migrations
        env:
          IMAGE: ${{ steps.build.outputs.image }}
        run: |
          # Register a new revision of the migrate task definition with the new image
          DEF=$(aws ecs describe-task-definition --task-definition $MIGRATE_FAMILY \
            --query 'taskDefinition' --output json)
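          # Keep only the fields RegisterTaskDefinition accepts and drop nulls
          # (e.g. an unset taskRoleArn), which the CLI would reject
          NEW=$(echo "$DEF" | jq --arg I "$IMAGE" \
            '.containerDefinitions[0].image=$I |
             {family,taskRoleArn,executionRoleArn,networkMode,containerDefinitions,
              volumes,placementConstraints,requiresCompatibilities,cpu,memory} |
             with_entries(select(.value != null))')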
          NEW_ARN=$(aws ecs register-task-definition \
            --cli-input-json "$NEW" \
            --query 'taskDefinition.taskDefinitionArn' --output text)

          # RunTask
          TASK_ARN=$(aws ecs run-task --cluster $ECS_CLUSTER \
            --task-definition $NEW_ARN --launch-type FARGATE \
            --network-configuration "awsvpcConfiguration={
                subnets=[${{ secrets.MIGRATE_SUBNET_ID }}],
                securityGroups=[${{ secrets.FARGATE_SG_ID }}],
                assignPublicIp=ENABLED
              }" \
            --started-by "deploy-${{ github.sha }}" \
            --query 'tasks[0].taskArn' --output text)

          echo "Migration task: $TASK_ARN"
          aws ecs wait tasks-stopped --cluster $ECS_CLUSTER --tasks $TASK_ARN

          # Check exit code (non-zero = fail)
          EXIT=$(aws ecs describe-tasks --cluster $ECS_CLUSTER --tasks $TASK_ARN \
            --query 'tasks[0].containers[0].exitCode' --output text)
          if [ "$EXIT" != "0" ]; then
            echo "Migration failed (exit=$EXIT)"
            aws logs tail /ecs/blog-api-migrate --since 10m
            exit 1
          fi

      # 5) Update Service Task Definition
      - name: Render service task definition
        id: render
        uses: aws-actions/amazon-ecs-render-task-definition@v1
        with:
          task-definition: ops/task-definition.json
          container-name: api
          image: ${{ steps.build.outputs.image }}

      # 6) Deploy to ECS Service
      - name: Deploy
        uses: aws-actions/amazon-ecs-deploy-task-definition@v2
        with:
          task-definition: ${{ steps.render.outputs.task-definition }}
          service: ${{ env.ECS_SERVICE }}
          cluster: ${{ env.ECS_CLUSTER }}
          wait-for-service-stability: true
          wait-for-minutes: 15

Key items:

Item                                            Meaning
id-token: write                                 Lets the job request an OIDC token. Without it, the credentials step fails before the role can be assumed
environment: production                         GitHub environment gate: manual approval, secret separation
aws-actions/amazon-ecs-render-task-definition   Base JSON + new image → new JSON
aws-actions/amazon-ecs-deploy-task-definition   RegisterTaskDefinition + UpdateService + wait
wait-for-service-stability                      Waits for the service to reach a steady state; the step fails if it does not get there within wait-for-minutes
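The "base JSON + new image → new JSON" idea is the same thing the migration step does with jq. A rough Python equivalent, as a sketch only: the function name is hypothetical, and the key list simply mirrors the one used in the jq filter of the workflow:

```python
import copy

# Fields copied into the new revision; everything else in the
# DescribeTaskDefinition output (revision, status, ARN, ...) is dropped.
REGISTERABLE_KEYS = (
    "family", "taskRoleArn", "executionRoleArn", "networkMode",
    "containerDefinitions", "volumes", "placementConstraints",
    "requiresCompatibilities", "cpu", "memory",
)


def render_task_definition(described: dict, image: str,
                           container_index: int = 0) -> dict:
    """Build register-task-definition input with a new image, skipping nulls."""
    new_def = {k: copy.deepcopy(described[k]) for k in REGISTERABLE_KEYS
               if described.get(k) is not None}
    new_def["containerDefinitions"][container_index]["image"] = image
    return new_def


described = {
    "family": "blog-api-migrate",
    "revision": 7,
    "status": "ACTIVE",
    "taskRoleArn": None,
    "cpu": "256",
    "memory": "512",
    "containerDefinitions": [{"name": "migrate", "image": "old"}],
}
rendered = render_task_definition(described, "registry.example/blog-api:abc1234")
print(sorted(rendered))
```

The deep copy keeps the described definition untouched, so the old image reference is still available for comparison or logging.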

3) Deployment Circuit Breaker — auto-rollback #

We touched on this briefly in #1: when a new deployment can’t come up, ECS automatically reverts to the previous task definition.

Enable Circuit Breaker on the Service
aws ecs update-service \
  --cluster blog-cluster --service blog-api \
  --deployment-configuration "
    deploymentCircuitBreaker={enable=true,rollback=true},
    maximumPercent=200,
    minimumHealthyPercent=100"

How it works:

  1. ECS counts the new tasks that fail to start or fail their health checks
  2. Once that failure count reaches a threshold, the deployment is marked failed
  3. With rollback=true, the service automatically reverts to the previous task definition
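
The count in step 2 is not arbitrary. Assuming the formula given in the AWS documentation (half the desired task count, rounded up, clamped to the range 3 to 200), it can be sketched as follows; the function name is hypothetical:

```python
import math


def circuit_breaker_threshold(desired_count: int) -> int:
    """Failed-task count at which the deployment is marked failed."""
    return min(200, max(3, math.ceil(0.5 * desired_count)))


for desired in (1, 10, 25, 400, 800):
    print(desired, circuit_breaker_threshold(desired))
```

For a small service the floor of 3 dominates, so even a single-task service gets three attempts before the deployment is declared failed.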

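In GitHub Actions, a deployment that never stabilizes makes wait-for-service-stability fail, so the workflow step fails too. If the rollback finishes inside the wait window, the service can look stable again on the old revision, so it is worth confirming which task definition is actually running once the step ends.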
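
One way to double-check the outcome after the wait is to look at the service's PRIMARY deployment. A minimal sketch, assuming the input is one entry of the services list returned by aws ecs describe-services; the function name and the sample ARNs are hypothetical:

```python
def deployment_landed(service: dict, expected_task_def_arn: str) -> bool:
    """True only if the PRIMARY deployment runs the expected revision and completed."""
    primary = next((d for d in service.get("deployments", [])
                    if d.get("status") == "PRIMARY"), None)
    if primary is None:
        return False
    return (primary.get("taskDefinition") == expected_task_def_arn
            and primary.get("rolloutState") == "COMPLETED")


NEW = "arn:aws:ecs:ap-northeast-2:123456789012:task-definition/blog-api:43"
OLD = "arn:aws:ecs:ap-northeast-2:123456789012:task-definition/blog-api:42"

rolled_back = {"deployments": [
    {"status": "PRIMARY", "taskDefinition": OLD, "rolloutState": "COMPLETED"},
]}
print(deployment_landed(rolled_back, NEW))  # service is stable, but on the old revision
```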

Manual rollback #

When auto-rollback didn’t fire or you need to investigate after the fact:

Manually roll back to a previous revision
PREV=$(aws ecs describe-task-definition --task-definition blog-api:42 \
  --query 'taskDefinition.taskDefinitionArn' --output text)

aws ecs update-service \
  --cluster blog-cluster --service blog-api \
  --task-definition $PREV \
  --force-new-deployment

4) Progressive deploys — Canary / Blue-Green #

By default, ECS rolling deployments send traffic to a new task as soon as it becomes healthy. For a more conservative approach, CodeDeploy steps in.

Blue/Green #

The Blue/Green shape
Blue (current production)  ←──── 100% traffic
Stand up Green (new version) — Blue still alive
Validate Green via the ALB Listener's Test traffic
Listener's 100% traffic → Green
Wait timer (10–60 min) — if no issue, terminate Blue
                       — if issues, one Listener line back to Blue (instant rollback)

Pros:

  • Instant rollback — flip the Listener back, done
  • Explicit time to validate the new version

Cons:

  • Double resources (during deploy)
  • ALB Listener pattern is slightly complex (Test listener + Production listener)
  • Heavier setup than ECS Rolling

Canary #

Canary
Linear (10% every N min; the predefined configs use 1 or 3 min, so roughly 10 or 30 min to 100%)
Canary (10% → 5 min wait → 90% in one shot)
AllAtOnce (instant 100% — fastest Blue/Green shape)

CodeDeploy deployment configuration names:

  • CodeDeployDefault.ECSAllAtOnce
  • CodeDeployDefault.ECSLinear10PercentEvery1Minutes
  • CodeDeployDefault.ECSCanary10Percent5Minutes
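
The difference between the linear and canary shapes is easiest to see as a schedule of (minute, percent of traffic on the new version). A small sketch with hypothetical function names, assuming the first shift happens at minute 0:

```python
def linear_schedule(step_percent: int, interval_min: int) -> list[tuple[int, int]]:
    """Equal steps at a fixed interval until 100% is on the new version."""
    if step_percent <= 0:
        raise ValueError("step_percent must be positive")
    schedule, minute, percent = [], 0, 0
    while percent < 100:
        percent = min(100, percent + step_percent)
        schedule.append((minute, percent))
        minute += interval_min
    return schedule


def canary_schedule(first_percent: int, wait_min: int) -> list[tuple[int, int]]:
    """A small first slice, a wait, then everything else in one shot."""
    return [(0, first_percent), (wait_min, 100)]


print(linear_schedule(10, 1))   # shape of ECSLinear10PercentEvery1Minutes
print(canary_schedule(10, 5))   # shape of ECSCanary10Percent5Minutes
```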

Which for which case? #

Case                                     Recommendation
Small production / side project          ECS Rolling + Circuit Breaker
Big traffic production, risky changes    CodeDeploy Blue/Green Linear
ML inference / large memory models       Blue/Green (warmup time needed)

This series assumes ECS Rolling + Circuit Breaker as the default. Blue/Green is for after traffic gets bigger.

5) Comparison with CodePipeline #

Beyond GitHub Actions, there’s AWS-native CI/CD.

           GitHub Actions                                    CodePipeline
Trigger    push / PR / schedule                              CodeCommit / GitHub / S3 / ECR push
Build      Runners pool (hosted/self-hosted)                 CodeBuild
Deploy     Direct calls or actions                           CodeDeploy / ECS / CFN / Lambda
Pricing    Hosted minutes / self-hosted free                 $1/month per pipeline + CodeBuild
Pros       Code and workflow in one place, rich ecosystem    AWS-native integration, IAM consistency
Cons       OIDC setup / separate secret management           Weaker external service integration

If your code is on GitHub, GitHub Actions is the natural choice. If company security policy requires code in CodeCommit, go with CodePipeline.

6) Environment separation — dev / staging / prod #

Branching by environment within a single workflow:

.github/workflows/deploy.yml (environment per branch)
on:
  push:
    branches: [main, develop]

jobs:
  deploy:
    runs-on: ubuntu-latest
    # A job-level `if` cannot read the matrix context, so the target is derived
    # from the branch name instead: main → prod, anything else (develop) → dev
    environment: ${{ github.ref_name == 'main' && 'prod' || 'dev' }}
    env:
      ECS_CLUSTER: ${{ github.ref_name == 'main' && 'blog-cluster-prod' || 'blog-cluster-dev' }}
      DEPLOY_ROLE: arn:aws:iam::123456789012:role/github-actions-deploy-${{ github.ref_name == 'main' && 'prod' || 'dev' }}

You can attach separate secrets, required reviewers, and wait timers to each GitHub environment (dev, prod). Put a 2-person approval + 5-min wait timer on the production environment to prevent mistakes. Each role's trust policy then matches its own environment in sub (environment:dev, environment:prod).
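
The branch-to-environment mapping is small enough to state as a lookup table, which also makes the "no match, no deploy" case explicit. A sketch in Python; the table and function name are hypothetical and only mirror the names used in the workflow:

```python
from typing import Optional

# Hypothetical mapping from git ref to deploy target.
TARGETS = {
    "refs/heads/develop": {
        "environment": "dev",
        "cluster": "blog-cluster-dev",
        "role": "arn:aws:iam::123456789012:role/github-actions-deploy-dev",
    },
    "refs/heads/main": {
        "environment": "prod",
        "cluster": "blog-cluster-prod",
        "role": "arn:aws:iam::123456789012:role/github-actions-deploy-prod",
    },
}


def deploy_target(ref: str) -> Optional[dict]:
    """Return the target for a ref, or None when the ref must not deploy."""
    return TARGETS.get(ref)


print(deploy_target("refs/heads/main"))
print(deploy_target("refs/heads/feature/login"))
```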

7) Handling secrets and variables #

Item                           Where to put
AWS Account ID                 GitHub vars
Cluster name / Service name    GitHub vars or workflow env
DB password / API keys         AWS Secrets Manager (#2)
GitHub deploy role ARN         GitHub vars
Slack webhook (CI alerts)      GitHub secrets

Principle: app secrets in AWS Secrets Manager, GitHub secrets only for tokens needed by CI itself.

Pitfalls — common issues in CI/CD #

1) aws sts get-caller-identity returns 401 #

Suspect the OIDC setup. Check in order:

  • Is permissions: id-token: write missing?
  • Does the trust policy’s sub pattern exactly match what the job sends: repo:org/repo:ref:... for a plain job, repo:org/repo:environment:... for a job that declares an environment?
  • Is the OIDC Provider thumbprint up to date?
  • Is the Role’s aud condition sts.amazonaws.com?

2) Service updates even when migration fails #

Without checking the run-task exit code, the workflow proceeds to the next step even when the migration has failed. Always check aws ecs describe-tasks exitCode and call exit 1 on non-zero.
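
The same check can be written against the describe-tasks JSON, which makes the edge cases explicit: a task that never started has no exitCode at all. A minimal sketch, assuming the input is the parsed output of aws ecs describe-tasks; the function name is hypothetical:

```python
def migration_succeeded(describe_tasks_response: dict) -> bool:
    """True only if the task exists and every container exited with code 0."""
    tasks = describe_tasks_response.get("tasks", [])
    if not tasks or describe_tasks_response.get("failures"):
        return False
    containers = tasks[0].get("containers", [])
    # A missing exitCode (container never ran) counts as a failure.
    return bool(containers) and all(c.get("exitCode") == 0 for c in containers)


ok = {"tasks": [{"containers": [{"name": "migrate", "exitCode": 0}]}], "failures": []}
never_started = {"tasks": [{"containers": [{"name": "migrate"}]}], "failures": []}
print(migration_succeeded(ok), migration_succeeded(never_started))
```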

3) Deploying with the latest tag #

If the Task Definition image is :latest, you can’t track which code is running. Specify the ECR image digest (@sha256:...) or the git SHA tag instead.
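
A small guard makes this rule hard to break by accident. The helper below is a hypothetical sketch that builds an image reference from either a git SHA tag or a digest and refuses :latest; the registry value in the example is a placeholder:

```python
def image_reference(registry: str, repository: str, *,
                    git_sha: str = "", digest: str = "") -> str:
    """Build a traceable image reference; prefer the digest when given."""
    if digest:
        if not digest.startswith("sha256:"):
            raise ValueError("digest must look like sha256:<hex>")
        return f"{registry}/{repository}@{digest}"
    if not git_sha or git_sha == "latest":
        raise ValueError("use a git SHA tag or a digest, not :latest")
    return f"{registry}/{repository}:{git_sha}"


print(image_reference("registry.example", "blog-api", git_sha="abc1234"))
```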

4) Migration RunTask can’t get an IP #

When subnets are small, production tasks and migration tasks compete for IPs, and RunTask can fail because no address is free. Use a separate subnet for migrations, or verify that IPs are available during the deploy window.
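
How tight a subnet is can be estimated up front: AWS reserves 5 addresses in every subnet, and each Fargate task in awsvpc mode takes one IP. A rough sketch with hypothetical function names and example CIDRs:

```python
import ipaddress

AWS_RESERVED_PER_SUBNET = 5  # first four addresses and the last one


def usable_ips(cidr: str) -> int:
    """Addresses left in a subnet after the AWS-reserved ones."""
    total = ipaddress.ip_network(cidr).num_addresses
    return max(0, total - AWS_RESERVED_PER_SUBNET)


def has_room(cidr: str, in_use: int, needed: int) -> bool:
    """True if the subnet can still hand out `needed` more addresses."""
    return usable_ips(cidr) - in_use >= needed


print(usable_ips("10.0.1.0/28"))                      # 11
print(has_room("10.0.1.0/28", in_use=10, needed=2))   # False
```

During a rolling deploy with maximumPercent=200 the service briefly needs twice its usual number of addresses, so the migration task competes with that peak, not with the steady state.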

5) Circuit Breaker rolls back even healthy deploys #

A health-check grace period that is shorter than the app’s boot time makes healthy deploys look unhealthy. Set health-check-grace-period-seconds to the app boot time plus a buffer (e.g. 90s for Django).

6) GitHub Actions OIDC audience cache #

You changed sub or aud but the old values keep showing up. The token is minted per job, so this is not a workflow cache issue: you need to start a fresh job to get a new token.

7) ecs-deploy-task-definition stuck #

With wait-for-service-stability: true, if wait-for-minutes is too short, even healthy deploys fail. Be conservative — 15–20 minutes.

Wrapping up #

What we covered in this post:

  • OIDC: IAM OIDC Provider + trust policy sub pattern, id-token: write permission
  • Permissions policy: ECR, ECS, and iam:PassRole
  • Workflow: test → OIDC → ECR push → migration RunTask → Service deploy → wait-stable
  • Circuit Breaker: enable=true, rollback=true; maximumPercent / minimumHealthyPercent shape the rolling deploy
  • Manual rollback: update-service to a previous task definition revision
  • Blue/Green & Canary: CodeDeploy’s ECSLinear10PercentEvery1Minutes etc.
  • CodePipeline comparison: natural choice based on code location
  • Environment separation: branch-to-environment mapping + GitHub environments with approval/wait
  • Secret management: app secrets in AWS Secrets Manager, only CI tokens in GitHub secrets
  • Pitfalls: OIDC 401, missing migration check, latest tag, IP shortage, grace period, stale token, stuck stability wait

Next — IaC #

Deployment is automated. But the infrastructure itself — VPC / SG / RDS / ALB / ECS — is still managed by hand through the console / CLI. Could you spin up another identical environment from scratch?

In #4 IaC — Terraform fundamentals we move infrastructure to code. The shape of provider / resource / state, S3+DynamoDB backend, modules for dev/prod separation, and the flow of code-ifying the #1 infrastructure line by line.
