AWS in Practice #3: CI/CD — GitHub Actions + ECR + ECS

Monday, May 4, 2026

In #1 we launched the ECS Service by hand, and in #2 we ran RDS and migrations by hand. This post bundles all that manual work into a single git push.

What we’ll cover:

GitHub Actions ↔ AWS auth without access keys — OIDC
The build → ECR push → Task Definition update → Service update → migration workflow
Auto rollback — Deployment Circuit Breaker
Progressive deploy — a touch of CodeDeploy blue/green / canary
CodePipeline comparison — when to use which

The big picture #

A single git push that does everything

git push (main)
   │
   ▼
GitHub Actions
   │
   ├─ 1) Test                    ← pytest / npm test
   │
   ├─ 2) AWS OIDC assume-role   ← no access keys
   │
   ├─ 3) Build & push image     ← <git-sha> tag
   │       ECR: blog-api:abc1234
   │
   ├─ 4) Run migrations         ← ecs run-task (blog-api-migrate)
   │       wait → check exit code
   │
   ├─ 5) Update Task Definition ← new revision with new image
   │
   ├─ 6) Update Service          ← rolling deploy
   │
   └─ 7) Wait services-stable    ← 5–10 min
           on failure, circuit breaker auto-rollback

This post’s goal is making this flow run in one go.

1) GitHub OIDC — auth without access keys #

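The old pattern: IAM user → access key → save in GitHub Secrets. This is risky: the key is long-lived, so a single leak (a log line, an accidental commit) stays exploitable until someone rotates it, rotation is manual work, and it is hard to audit which workflow used the key.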

The OIDC (OpenID Connect) pattern has GitHub issue a short-lived token (15 min) for each workflow run, which AWS IAM then trusts.

OIDC shape

GitHub Actions Job starts
   │
   ▼
GitHub OIDC Provider issues a JWT
   {sub: "repo:myorg/blog-api:ref:refs/heads/main", aud: "sts.amazonaws.com"}
   │
   ▼
aws-actions/configure-aws-credentials
   ├─ STS:AssumeRoleWithWebIdentity
   ├─ AWS validates sub claim against trust policy
   ▼
Temporary credentials (AccessKey / SecretKey / SessionToken)  TTL: 1h

One-time — register the IAM OIDC Provider #

OIDC Provider

aws iam create-open-id-connect-provider \
  --url https://token.actions.githubusercontent.com \
  --client-id-list sts.amazonaws.com \
  --thumbprint-list 6938fd4d98bab03faadb97b34396831e3780aea1

The thumbprint is the SHA1 of GitHub OIDC’s SSL cert. The AWS console GUI fetches it automatically.

IAM Role — Trust Policy #

Trust policy for the github-actions-deploy role

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Federated": "arn:aws:iam::123456789012:oidc-provider/token.actions.githubusercontent.com"
    },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
      "StringEquals": {
        "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
      },
      "StringLike": {
        "token.actions.githubusercontent.com:sub": "repo:myorg/blog-api:environment:production"
      }
    }
  }]
}

The pattern in sub is the key:

Pattern	Meaning
`repo:myorg/blog-api:ref:refs/heads/main`	Only main branch
`repo:myorg/blog-api:ref:refs/tags/*`	Only tag pushes
`repo:myorg/blog-api:environment:production`	Only those that pass the environment gate
`repo:myorg/blog-api:*`	Risky — even PRs can use this role

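Production recommendation: use the environment form, and restrict the environment itself to main or tags with deployment branch rules. Note that a job that sets environment: gets the environment form of sub instead of the ref form, so the deploy job in this post (environment: production) is only matched by `repo:myorg/blog-api:environment:production`.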
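To sanity-check a pattern before it goes into the trust policy, the wildcard matching can be approximated locally. This is a minimal sketch, assuming Python's fnmatch is a close enough stand-in for IAM's StringLike (which supports * and ?). The helper name sub_matches is made up for illustration, and fnmatch additionally treats [...] as a character class, which IAM does not.

```python
from fnmatch import fnmatchcase


def sub_matches(pattern: str, sub: str) -> bool:
    """Approximate IAM StringLike: case-sensitive match with * and ? wildcards."""
    return fnmatchcase(sub, pattern)


# Example sub claims for the three kinds of job discussed above.
main_push = "repo:myorg/blog-api:ref:refs/heads/main"
pull_request = "repo:myorg/blog-api:pull_request"
prod_env = "repo:myorg/blog-api:environment:production"

# The ref pattern does not match a job that runs in an environment.
print(sub_matches("repo:myorg/blog-api:ref:refs/heads/main", prod_env))
# The catch-all pattern also lets pull request jobs through.
print(sub_matches("repo:myorg/blog-api:*", pull_request))
```

The first check shows why a ref-based pattern stops working as soon as the job declares an environment, and the second shows why the catch-all pattern is the risky one.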

Permissions Policy #

Only the actions needed for deployment:

github-actions-deploy permissions

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ECR",
      "Effect": "Allow",
      "Action": [
        "ecr:GetAuthorizationToken",
        "ecr:BatchCheckLayerAvailability",
        "ecr:GetDownloadUrlForLayer",
        "ecr:PutImage",
        "ecr:InitiateLayerUpload",
        "ecr:UploadLayerPart",
        "ecr:CompleteLayerUpload"
      ],
      "Resource": "*"
    },
    {
      "Sid": "ECS",
      "Effect": "Allow",
      "Action": [
        "ecs:RegisterTaskDefinition",
        "ecs:DescribeTaskDefinition",
        "ecs:UpdateService",
        "ecs:DescribeServices",
        "ecs:RunTask",
        "ecs:DescribeTasks",
        "ecs:ListTasks"
      ],
      "Resource": "*"
    },
    {
      "Sid": "PassRole",
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": [
        "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
        "arn:aws:iam::123456789012:role/blog-api-task-role"
      ]
    }
  ]
}

Without iam:PassRole, RegisterTaskDefinition fails — embedding an IAM role in a Task Definition is considered “passing” that role, which requires explicit permission.

2) GitHub Actions workflow #

.github/workflows/deploy.yml

name: Deploy to ECS

on:
  push:
    branches: [main]
  workflow_dispatch:

permissions:
  id-token: write   # OIDC token issuance — required
  contents: read

env:
  AWS_REGION: ap-northeast-2
  ECR_REPOSITORY: blog-api
  ECS_CLUSTER: blog-cluster
  ECS_SERVICE: blog-api
  TASK_FAMILY: blog-api
  MIGRATE_FAMILY: blog-api-migrate

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.14" }
      - run: pip install -r requirements.txt -r requirements-dev.txt
      - run: pytest -q

  deploy:
    needs: test
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4

      # 1) AWS OIDC
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-actions-deploy
          aws-region: ${{ env.AWS_REGION }}

      # 2) ECR login
      - name: Login to ECR
        id: ecr
        uses: aws-actions/amazon-ecr-login@v2

      # 3) Build & push
      - name: Build and push
        id: build
        env:
          REGISTRY: ${{ steps.ecr.outputs.registry }}
          TAG: ${{ github.sha }}
        run: |
          docker build --platform=linux/amd64 \
            -t $REGISTRY/$ECR_REPOSITORY:$TAG \
            -t $REGISTRY/$ECR_REPOSITORY:latest .
          docker push $REGISTRY/$ECR_REPOSITORY:$TAG
          docker push $REGISTRY/$ECR_REPOSITORY:latest
          echo "image=$REGISTRY/$ECR_REPOSITORY:$TAG" >> $GITHUB_OUTPUT

      # 4) Migration RunTask
      - name: Run DB migrations
        env:
          IMAGE: ${{ steps.build.outputs.image }}
        run: |
          # Register a new revision of the migrate task definition with the new image
          DEF=$(aws ecs describe-task-definition --task-definition $MIGRATE_FAMILY \
            --query 'taskDefinition' --output json)
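          # Keep only the fields register-task-definition accepts, and drop nulls
          # (e.g. an unset taskRoleArn), which the CLI rejects as invalid input
          NEW=$(echo "$DEF" | jq --arg I "$IMAGE" \
            '.containerDefinitions[0].image=$I |
             {family,taskRoleArn,executionRoleArn,networkMode,containerDefinitions,
              volumes,placementConstraints,requiresCompatibilities,cpu,memory} |
             with_entries(select(.value != null))')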
          NEW_ARN=$(aws ecs register-task-definition \
            --cli-input-json "$NEW" \
            --query 'taskDefinition.taskDefinitionArn' --output text)

          # RunTask
          TASK_ARN=$(aws ecs run-task --cluster $ECS_CLUSTER \
            --task-definition $NEW_ARN --launch-type FARGATE \
            --network-configuration "awsvpcConfiguration={
                subnets=[${{ secrets.MIGRATE_SUBNET_ID }}],
                securityGroups=[${{ secrets.FARGATE_SG_ID }}],
                assignPublicIp=ENABLED
              }" \
            --started-by "deploy-${{ github.sha }}" \
            --query 'tasks[0].taskArn' --output text)

          echo "Migration task: $TASK_ARN"
          aws ecs wait tasks-stopped --cluster $ECS_CLUSTER --tasks $TASK_ARN

          # Check exit code (non-zero = fail)
          EXIT=$(aws ecs describe-tasks --cluster $ECS_CLUSTER --tasks $TASK_ARN \
            --query 'tasks[0].containers[0].exitCode' --output text)
          if [ "$EXIT" != "0" ]; then
            echo "Migration failed (exit=$EXIT)"
            aws logs tail /ecs/blog-api-migrate --since 10m
            exit 1
          fi

      # 5) Update Service Task Definition
      - name: Render service task definition
        id: render
        uses: aws-actions/amazon-ecs-render-task-definition@v1
        with:
          task-definition: ops/task-definition.json
          container-name: api
          image: ${{ steps.build.outputs.image }}

      # 6) Deploy to ECS Service
      - name: Deploy
        uses: aws-actions/amazon-ecs-deploy-task-definition@v2
        with:
          task-definition: ${{ steps.render.outputs.task-definition }}
          service: ${{ env.ECS_SERVICE }}
          cluster: ${{ env.ECS_CLUSTER }}
          wait-for-service-stability: true
          wait-for-minutes: 15

Key items:

Item	Meaning
`id-token: write`	Permission to request an OIDC token. Without it the job cannot obtain a token, so the credentials step fails before any role is assumed
`environment: production`	GitHub environment gate — manual approval, secret separation
`aws-actions/amazon-ecs-render-task-definition`	Base JSON + new image → new JSON
`aws-actions/amazon-ecs-deploy-task-definition`	RegisterTaskDefinition + UpdateService + wait
`wait-for-service-stability`	Wait for stable state — step fails on failure
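The "Base JSON + new image → new JSON" step is simple enough to sketch locally. The function below is an illustration of the idea, not the action's actual implementation; render_task_definition and the placeholder image value are hypothetical names chosen for this example.

```python
import copy
import json


def render_task_definition(base: dict, container_name: str, image: str) -> dict:
    """Return a copy of `base` with the named container's image replaced."""
    rendered = copy.deepcopy(base)  # leave the checked-in base definition untouched
    for container in rendered["containerDefinitions"]:
        if container["name"] == container_name:
            container["image"] = image
            return rendered
    raise ValueError(f"container {container_name!r} not found in task definition")


# Stand-in for the contents of ops/task-definition.json.
base = {
    "family": "blog-api",
    "containerDefinitions": [{"name": "api", "image": "placeholder"}],
}
new_image = "123456789012.dkr.ecr.ap-northeast-2.amazonaws.com/blog-api:abc1234"
rendered = render_task_definition(base, "api", new_image)
print(json.dumps(rendered, indent=2))
```

Raising on a missing container name mirrors the most common mistake here: a container-name in the workflow that does not match the name in the JSON file.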

3) Deployment Circuit Breaker — auto-rollback #

Something we touched briefly in #1. When a new deployment can’t come up, it automatically reverts to the previous task definition.

Enable Circuit Breaker on the Service

aws ecs update-service \
  --cluster blog-cluster --service blog-api \
  --deployment-configuration "deploymentCircuitBreaker={enable=true,rollback=true},maximumPercent=200,minimumHealthyPercent=100"

How it works:

ECS counts new tasks that fail to start or fail their health checks
Marks the deployment as failed once that count reaches a threshold derived from the service's desired count
With rollback=true, automatically reverts to the previous task definition

In GitHub Actions, wait-for-service-stability returns a failure, so the workflow step also fails.
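How many failed tasks it takes to trip the breaker depends on the desired count. The sketch below follows the formula as I understand it from the AWS documentation at the time of writing (half the desired count, rounded up, clamped between 3 and 200); treat the constants as assumptions and check the current docs before relying on them. The function name failure_threshold is illustrative.

```python
import math

# Assumed bounds, taken from the AWS docs at the time of writing.
MIN_THRESHOLD = 3
MAX_THRESHOLD = 200


def failure_threshold(desired_count: int) -> int:
    """Number of failed task launches after which the deployment is marked failed."""
    calculated = math.ceil(0.5 * desired_count)
    return max(MIN_THRESHOLD, min(MAX_THRESHOLD, calculated))


for desired in (1, 25, 400):
    print(desired, failure_threshold(desired))
```

For a small service the floor dominates, so a bad image takes a few launch attempts before the rollback starts, which is part of why the stability wait needs several minutes.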

Manual rollback #

When auto-rollback didn’t fire or you need to investigate after the fact:

Manually roll back to a previous revision

PREV=$(aws ecs describe-task-definition --task-definition blog-api:42 \
  --query 'taskDefinition.taskDefinitionArn' --output text)

aws ecs update-service \
  --cluster blog-cluster --service blog-api \
  --task-definition $PREV \
  --force-new-deployment

4) Progressive deploys — Canary / Blue-Green #

By default, ECS rolling deployments send traffic to a new task as soon as it becomes healthy. For a more conservative approach, CodeDeploy steps in.

Blue/Green #

The Blue/Green shape

Blue (current production)  ←──── 100% traffic
   │
   ▼
Stand up Green (new version) — Blue still alive
   │
   ▼
Validate Green via the ALB Listener's Test traffic
   │
   ▼
Listener's 100% traffic → Green
   │
   ▼
Wait timer (10–60 min) — if no issue, terminate Blue
                       — if issues, one Listener line back to Blue (instant rollback)

Pros:

Instant rollback — flip the Listener back, done
Explicit time to validate the new version

Cons:

Double resources (during deploy)
ALB Listener pattern is slightly complex (Test listener + Production listener)
Heavier setup than ECS Rolling

Canary #

Canary

Linear (10% every 5 min) — 50 min to 100%
Canary (10% → 5 min wait → 90% in one shot)
AllAtOnce (instant 100% — fastest Blue/Green shape)

CodeDeploy deployment configuration names:

CodeDeployDefault.ECSAllAtOnce
CodeDeployDefault.ECSLinear10PercentEvery1Minutes
CodeDeployDefault.ECSCanary10Percent5Minutes

Which for which case? #

Case	Recommendation
Small production / side project	ECS Rolling + Circuit Breaker
Big traffic production, risky changes	CodeDeploy Blue/Green Linear
ML inference / large memory models	Blue/Green (warmup time needed)

This series assumes ECS Rolling + Circuit Breaker as the default. Blue/Green is for after traffic gets bigger.

5) Comparison with CodePipeline #

Beyond GitHub Actions, there’s AWS-native CI/CD.

	GitHub Actions	CodePipeline
Trigger	push / PR / schedule	CodeCommit / GitHub / S3 / ECR push
Build	Runners pool (hosted/self-hosted)	CodeBuild
Deploy	Direct calls or actions	CodeDeploy / ECS / CFN / Lambda
Pricing	Hosted minutes / self-hosted free	$1/month per pipeline + CodeBuild
Pros	Code and workflow in one place, rich ecosystem	AWS-native integration, IAM consistency
Cons	OIDC setup / separate secret management	Weaker external service integration

If your code is on GitHub, GitHub Actions is the natural choice. If company security policy requires code in CodeCommit, go with CodePipeline.

6) Environment separation — dev / staging / prod #

Branching by environment within a single workflow:

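.github/workflows/deploy.yml (environment by branch)

on:
  push:
    branches: [main, develop]

jobs:
  deploy:
    runs-on: ubuntu-latest
    # A job-level `if` cannot read the matrix context, so a matrix entry cannot
    # be filtered by branch there. Pick the values from the branch name instead.
    environment: ${{ github.ref_name == 'main' && 'prod' || 'dev' }}
    env:
      ECS_CLUSTER: ${{ github.ref_name == 'main' && 'blog-cluster-prod' || 'blog-cluster-dev' }}
      # DEPLOY_ROLE is a name chosen for this example; pass it to role-to-assume
      DEPLOY_ROLE: ${{ github.ref_name == 'main' && 'arn:aws:iam::123456789012:role/github-actions-deploy-prod' || 'arn:aws:iam::123456789012:role/github-actions-deploy-dev' }}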

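You can attach separate secrets, required reviewers, and wait timers to each GitHub environment (dev, prod). Put required reviewers and a 5-min wait timer on the production environment to prevent mistakes. Note that GitHub needs only one of the listed reviewers to approve, so a strict 2-person approval has to be enforced elsewhere (for example, through pull request reviews on main).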

7) Handling secrets and variables #

	Where to put
AWS Account ID	GitHub vars
Cluster name / Service name	GitHub vars or workflow env
DB password / API keys	AWS Secrets Manager (#2)
GitHub deploy role ARN	GitHub vars
Slack webhook (CI alerts)	GitHub secrets

Principle: app secrets in AWS Secrets Manager, GitHub secrets only for tokens needed by CI itself.

Pitfalls — common issues in CI/CD #

1) `aws sts get-caller-identity` returns 401 #

Suspect OIDC setup. Check in order:

Missing permissions: id-token: write
Does the trust policy’s sub pattern exactly match the workflow’s actual repo:org/repo:ref:...?
Is the OIDC Provider thumbprint up to date?
Is the Role’s aud condition sts.amazonaws.com?

2) Service updates even when migration fails #

Without checking the run-task exit code, the workflow proceeds to the next step even when the migration has failed. Always check aws ecs describe-tasks exitCode and call exit 1 on non-zero.
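The same decision can be written as a small function over the describe-tasks JSON, which also makes the edge cases explicit. This is a sketch under the assumption that the response has the usual tasks / containers / exitCode shape; migration_succeeded is an illustrative name. A container that never started has no exitCode at all, and that must count as a failure too.

```python
def migration_succeeded(describe_tasks_response: dict) -> bool:
    """True only if every container of every returned task exited with code 0."""
    if describe_tasks_response.get("failures"):
        return False  # the API could not describe the task at all
    tasks = describe_tasks_response.get("tasks", [])
    if not tasks:
        return False  # nothing ran, so nothing succeeded
    for task in tasks:
        containers = task.get("containers", [])
        if not containers:
            return False
        for container in containers:
            # A missing exitCode (container never started) is treated as failure.
            if container.get("exitCode") != 0:
                return False
    return True


ok = {"tasks": [{"containers": [{"name": "migrate", "exitCode": 0}]}], "failures": []}
crashed = {"tasks": [{"containers": [{"name": "migrate", "exitCode": 1}]}], "failures": []}
never_started = {"tasks": [{"containers": [{"name": "migrate"}]}], "failures": []}

print(migration_succeeded(ok), migration_succeeded(crashed), migration_succeeded(never_started))
```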

3) Deploying with the `latest` tag #

If Task Definition image is :latest, you can’t track which code is running. Specify ECR image digest (@sha256:...) or git SHA tag.
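A cheap guard is to reject unpinned image references before registering a task definition. The check below is a minimal sketch with a made-up helper name, is_pinned. It treats a digest or any tag other than latest as pinned, and a missing tag as unpinned because Docker then defaults to latest.

```python
def is_pinned(image: str) -> bool:
    """True if the image reference uses a digest or an explicit non-latest tag."""
    if "@sha256:" in image:
        return True
    # Only look at the last path segment so a registry port (host:5000) is not
    # mistaken for a tag.
    last_segment = image.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return False  # no tag at all means :latest
    tag = last_segment.rsplit(":", 1)[1]
    return tag != "latest"


registry = "123456789012.dkr.ecr.ap-northeast-2.amazonaws.com"
print(is_pinned(f"{registry}/blog-api:abc1234"))
print(is_pinned(f"{registry}/blog-api:latest"))
```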

4) Migration RunTask can’t get an IP #

Every Fargate task gets its own ENI and private IP, so in a small subnet production tasks and migration tasks can compete for addresses and RunTask fails. Use a separate, adequately sized subnet for migrations, or verify that IPs are available during the deploy window.
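A quick capacity check makes the limit concrete. The sketch below assumes the standard rule that AWS reserves five addresses in every subnet; the CIDR values and the function names are examples, and other ENIs in the same subnet (load balancers, RDS, NAT gateways) are not counted here.

```python
import ipaddress

AWS_RESERVED_PER_SUBNET = 5  # first four addresses and the last one


def usable_ips(cidr: str) -> int:
    """Addresses in the subnet that AWS will actually hand out."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET


def fits(cidr: str, enis_needed: int) -> bool:
    """True if the subnet can hold `enis_needed` task ENIs (ignores other ENIs)."""
    return enis_needed <= usable_ips(cidr)


print(usable_ips("10.0.1.0/28"))
print(usable_ips("10.0.1.0/24"))
print(fits("10.0.1.0/28", 12))
```

A /28 leaves 11 usable addresses, so a service with a handful of tasks plus a migration task is already close to the edge.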

5) Circuit Breaker rolls back even healthy deploys #

Too short health-check grace period + long boot time → healthy deploys get misjudged unhealthy. Set health-check-grace-period-seconds to app boot + buffer (e.g. Django 90s).

6) GitHub Actions OIDC audience cache #

You changed sub or aud but the old values keep showing up. This is not a workflow cache issue — you need to start a fresh job to get a new token.

7) `ecs-deploy-task-definition` stuck #

With wait-for-service-stability: true, if wait-for-minutes is too short, even healthy deploys fail. Be conservative — 15–20 minutes.

Wrapping up #

What we covered in this post:

OIDC — IAM OIDC Provider + Trust policy sub pattern, id-token: write permission
Permissions policy — three groups: ECR / ECS / iam:PassRole
Workflow — test → OIDC → ECR push → migration RunTask → Service deploy → wait-stable
Circuit Breaker — enable=true, rollback=true, with maximumPercent / minimumHealthyPercent shaping the rolling deploy
Manual rollback — update-service to a previous task definition revision
Blue/Green & Canary — CodeDeploy’s ECSLinear10PercentEvery1Minutes etc.
CodePipeline comparison — natural choice based on code location
Environment separation — per-branch configuration + GitHub environments with approval/wait
Secret management — app secrets in AWS Secrets Manager, only CI tokens in GitHub secrets
Pitfalls — OIDC 401, missing migration check, latest tag, IP shortage, grace period, stale token, stuck stability wait

Next — IaC #

Deployment is automated. But the infrastructure itself — VPC / SG / RDS / ALB / ECS — is still managed by hand through the console / CLI. Could you spin up another identical environment from scratch?

In #4, IaC: Terraform fundamentals, we move infrastructure to code. It covers the shape of provider / resource / state, the S3+DynamoDB backend, modules for dev/prod separation, and the flow of turning the #1 infrastructure into code line by line.
