CI/CD — GitHub Actions + ECR + ECS
Access-key-free GitHub Actions with OIDC, ECR push, automatic Task Definition updates, ECS Service rolling deployment, deployment circuit breaker and auto-rollback, all the way to CodeDeploy blue/green. A deployment flow that finishes in a single git push.
In Chapter 22 Infrastructure skeleton we stood up an ECS Service by hand, and in Chapter 23 RDS integration we ran RDS and migrations by hand. This chapter binds all that manual work into a single git push.
This is the third chapter of Part 4. It covers the following.
- GitHub Actions ↔ AWS authentication without access keys — OIDC
- the build → ECR push → Task Definition update → Service update → migration workflow
- auto-rollback — Deployment Circuit Breaker
- progressive deployment — CodeDeploy blue/green / canary
- a comparison with CodePipeline — when to use which
The big picture #
git push (main)
    │
    ▼
GitHub Actions
    │
    ├─ 1) Test                      ← pytest / npm test
    │
    ├─ 2) AWS OIDC assume-role      ← no access keys
    │
    ├─ 3) Build & push image        ← <git-sha> tag
    │      ECR: blog-api:abc1234
    │
    ├─ 4) Run migrations            ← ecs run-task (blog-api-migrate)
    │      wait → check exit code
    │
    ├─ 5) Update Task Definition    ← new revision with new image
    │
    ├─ 6) Update Service            ← rolling deployment
    │
    └─ 7) Wait services-stable      ← 5~10 min
           circuit breaker auto-rollback on failure

The goal of this chapter is to make this flow run in one pass.
1) GitHub OIDC — access-key-free authentication #
The old pattern was to issue access keys from an IAM user and store them in GitHub Secrets. That brings risks like git-history exposure, the obligation to rotate keys, and poor traceability.
With the OIDC (OpenID Connect) pattern, GitHub issues a short-lived token (valid for only a few minutes) on every workflow run, and AWS IAM is configured to trust that token.
GitHub Actions Job starts
    │
    ▼
GitHub OIDC Provider issues a JWT
  {sub: "repo:myorg/blog-api:ref:refs/heads/main", aud: "sts.amazonaws.com"}
    │
    ▼
aws-actions/configure-aws-credentials
    ├─ STS:AssumeRoleWithWebIdentity
    ├─ AWS validates the sub claim against the trust policy
    ▼
temporary credentials (AccessKey / SecretKey / SessionToken) TTL: 1h

Once only: registering the IAM OIDC Provider #
aws iam create-open-id-connect-provider \
  --url https://token.actions.githubusercontent.com \
  --client-id-list sts.amazonaws.com \
  --thumbprint-list 6938fd4d98bab03faadb97b34396831e3780aea1

The thumbprint is the SHA-1 fingerprint of a CA certificate in the TLS chain of GitHub’s OIDC endpoint. If you create the provider in the AWS console, the console fetches it automatically.
IAM Role — Trust Policy #
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Federated": "arn:aws:iam::123456789012:oidc-provider/token.actions.githubusercontent.com"
    },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
      "StringEquals": {
        "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
      },
      "StringLike": {
        "token.actions.githubusercontent.com:sub": "repo:myorg/blog-api:ref:refs/heads/main"
      }
    }
  }]
}

The sub pattern is the key: it decides which workflow runs are allowed to assume the role.
| Pattern | Meaning |
|---|---|
| repo:myorg/blog-api:ref:refs/heads/main | main branch only |
| repo:myorg/blog-api:ref:refs/tags/* | tag pushes only |
| repo:myorg/blog-api:environment:production | only after passing the environment gate |
| repo:myorg/blog-api:* | dangerous: any branch, tag, or pull request run in the repository can use this role |
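The production recommendation is environment gate + main/tag only. Note that a job which declares environment: production presents the sub repo:myorg/blog-api:environment:production instead of the ref form, so the trust policy has to allow that value. The restriction to main or tags is then enforced on the GitHub side, through the environment’s deployment branch rules.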
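To see how the patterns in the table behave, here is a minimal sketch that approximates IAM's StringLike operator ("*" for any run of characters, "?" for exactly one). The function name string_like and the sample sub values are illustrative assumptions; the real evaluation happens inside AWS STS, not in your code.

```python
import re


def string_like(pattern: str, value: str) -> bool:
    """Approximate IAM StringLike: '*' matches any run, '?' matches one char."""
    regex = "".join(
        ".*" if ch == "*" else "." if ch == "?" else re.escape(ch)
        for ch in pattern
    )
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None


MAIN_ONLY = "repo:myorg/blog-api:ref:refs/heads/main"
ANY_REF = "repo:myorg/blog-api:*"

# Hypothetical sub claims, written in the shapes GitHub documents.
main_push = "repo:myorg/blog-api:ref:refs/heads/main"
pull_request = "repo:myorg/blog-api:pull_request"
prod_env = "repo:myorg/blog-api:environment:production"

for sub in (main_push, pull_request, prod_env):
    print(sub, "| main-only:", string_like(MAIN_ONLY, sub),
          "| wildcard:", string_like(ANY_REF, sub))
```

The wildcard pattern accepts all three values, including the pull request run, which is why the table marks it as dangerous. The main-only pattern rejects the environment-form sub, so a job with an environment needs the environment pattern in its trust policy.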
Permissions Policy #
Grant only the actions deployment needs.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ECR",
      "Effect": "Allow",
      "Action": [
        "ecr:GetAuthorizationToken",
        "ecr:BatchCheckLayerAvailability",
        "ecr:GetDownloadUrlForLayer",
        "ecr:PutImage",
        "ecr:InitiateLayerUpload",
        "ecr:UploadLayerPart",
        "ecr:CompleteLayerUpload"
      ],
      "Resource": "*"
    },
    {
      "Sid": "ECS",
      "Effect": "Allow",
      "Action": [
        "ecs:RegisterTaskDefinition",
        "ecs:DescribeTaskDefinition",
        "ecs:UpdateService",
        "ecs:DescribeServices",
        "ecs:RunTask",
        "ecs:DescribeTasks",
        "ecs:ListTasks"
      ],
      "Resource": "*"
    },
    {
      "Sid": "PassRole",
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": [
        "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
        "arn:aws:iam::123456789012:role/blog-api-task-role"
      ]
    }
  ]
}

If iam:PassRole is missing, RegisterTaskDefinition fails. Attaching an IAM role to a Task Definition counts as “passing” that role, so it needs a separate permission.
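One way to keep the PassRole resource list in sync with what you actually deploy is to read the role ARNs out of the task definition itself. The sketch below assumes a task definition dict shaped like the JSON you register (taskRoleArn and executionRoleArn keys); the helper name roles_needing_passrole is hypothetical.

```python
def roles_needing_passrole(task_definition: dict) -> list:
    """Return the role ARNs a deployer must be allowed to pass for this task definition."""
    candidates = (
        task_definition.get("taskRoleArn"),
        task_definition.get("executionRoleArn"),
    )
    # Drop unset roles and duplicates; sort for a stable, diff-friendly result.
    return sorted({arn for arn in candidates if arn})


# Illustrative task definition using the role names from the policy above.
task_def = {
    "family": "blog-api",
    "taskRoleArn": "arn:aws:iam::123456789012:role/blog-api-task-role",
    "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
}
print(roles_needing_passrole(task_def))
```

Comparing this list with the Resource array of the PassRole statement is a quick check before the first deployment.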
2) The GitHub Actions workflow #
name: Deploy to ECS

on:
  push:
    branches: [main]
  workflow_dispatch:

permissions:
  id-token: write   # OIDC token issuance (required)
  contents: read

env:
  AWS_REGION: ap-northeast-2
  ECR_REPOSITORY: blog-api
  ECS_CLUSTER: blog-cluster
  ECS_SERVICE: blog-api
  TASK_FAMILY: blog-api
  MIGRATE_FAMILY: blog-api-migrate

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.14" }
      - run: pip install -r requirements.txt -r requirements-dev.txt
      - run: pytest -q

  deploy:
    needs: test
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4

      # 1) AWS OIDC
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-actions-deploy
          aws-region: ${{ env.AWS_REGION }}

      # 2) ECR login
      - name: Login to ECR
        id: ecr
        uses: aws-actions/amazon-ecr-login@v2

      # 3) Build & push
      - name: Build and push
        id: build
        env:
          REGISTRY: ${{ steps.ecr.outputs.registry }}
          TAG: ${{ github.sha }}
        run: |
          docker build --platform=linux/amd64 \
            -t "$REGISTRY/$ECR_REPOSITORY:$TAG" \
            -t "$REGISTRY/$ECR_REPOSITORY:latest" .
          docker push "$REGISTRY/$ECR_REPOSITORY:$TAG"
          docker push "$REGISTRY/$ECR_REPOSITORY:latest"
          echo "image=$REGISTRY/$ECR_REPOSITORY:$TAG" >> "$GITHUB_OUTPUT"

      # 4) Migration RunTask
      - name: Run DB migrations
        env:
          IMAGE: ${{ steps.build.outputs.image }}
          SUBNET_ID: ${{ secrets.MIGRATE_SUBNET_ID }}
          SG_ID: ${{ secrets.FARGATE_SG_ID }}
        run: |
          # register a new revision of the migration task definition with the new image
          DEF=$(aws ecs describe-task-definition --task-definition "$MIGRATE_FAMILY" \
            --query 'taskDefinition' --output json)
          # keep only the fields register-task-definition accepts and drop nulls
          # (for example an unset taskRoleArn), which the CLI would reject
          NEW=$(echo "$DEF" | jq --arg I "$IMAGE" \
            '.containerDefinitions[0].image=$I |
             {family,taskRoleArn,executionRoleArn,networkMode,containerDefinitions,
              volumes,placementConstraints,requiresCompatibilities,cpu,memory} |
             with_entries(select(.value != null))')
          NEW_ARN=$(aws ecs register-task-definition \
            --cli-input-json "$NEW" \
            --query 'taskDefinition.taskDefinitionArn' --output text)

          # RunTask
          TASK_ARN=$(aws ecs run-task --cluster "$ECS_CLUSTER" \
            --task-definition "$NEW_ARN" --launch-type FARGATE \
            --network-configuration "awsvpcConfiguration={subnets=[$SUBNET_ID],securityGroups=[$SG_ID],assignPublicIp=ENABLED}" \
            --started-by "deploy-${{ github.sha }}" \
            --query 'tasks[0].taskArn' --output text)
          echo "Migration task: $TASK_ARN"
          aws ecs wait tasks-stopped --cluster "$ECS_CLUSTER" --tasks "$TASK_ARN"

          # check exit code (fail if non-zero)
          EXIT=$(aws ecs describe-tasks --cluster "$ECS_CLUSTER" --tasks "$TASK_ARN" \
            --query 'tasks[0].containers[0].exitCode' --output text)
          if [ "$EXIT" != "0" ]; then
            echo "Migration failed (exit=$EXIT)"
            aws logs tail /ecs/blog-api-migrate --since 10m
            exit 1
          fi

      # 5) Update Service Task Definition
      - name: Render service task definition
        id: render
        uses: aws-actions/amazon-ecs-render-task-definition@v1
        with:
          task-definition: ops/task-definition.json
          container-name: api
          image: ${{ steps.build.outputs.image }}

      # 6) Deploy to ECS Service
      - name: Deploy
        uses: aws-actions/amazon-ecs-deploy-task-definition@v2
        with:
          task-definition: ${{ steps.render.outputs.task-definition }}
          service: ${{ env.ECS_SERVICE }}
          cluster: ${{ env.ECS_CLUSTER }}
          wait-for-service-stability: true
          wait-for-minutes: 15

The key steps laid out:
| Step | Meaning |
|---|---|
| id-token: write | permission to request an OIDC token. Without it the credentials step cannot obtain a token and fails before STS is ever called |
| environment: production | GitHub environment gate: manual approval, secret isolation |
| aws-actions/amazon-ecs-render-task-definition | base JSON + new image → generate new JSON |
| aws-actions/amazon-ecs-deploy-task-definition | RegisterTaskDefinition + UpdateService + wait |
| wait-for-service-stability | wait until the service is stable; the step fails if it does not get there |
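The "base JSON + new image" step is easy to picture as a small pure function. The sketch below mirrors the idea of the jq filter in the migration step: copy the registerable fields, drop unset ones, and swap the image of one container. It is an illustration only, not the code of the render action, and the name render_task_definition and the sample data are assumptions.

```python
import copy

# Fields kept by the jq filter in the workflow's migration step.
REGISTER_KEYS = (
    "family", "taskRoleArn", "executionRoleArn", "networkMode",
    "containerDefinitions", "volumes", "placementConstraints",
    "requiresCompatibilities", "cpu", "memory",
)


def render_task_definition(described: dict, container_name: str, image: str) -> dict:
    """Return a registerable copy of `described` with one container's image replaced."""
    rendered = {
        key: copy.deepcopy(described[key])
        for key in REGISTER_KEYS
        if described.get(key) is not None  # skip unset fields instead of sending nulls
    }
    matched = False
    for container in rendered.get("containerDefinitions", []):
        if container.get("name") == container_name:
            container["image"] = image
            matched = True
    if not matched:
        raise ValueError(f"no container named {container_name!r}")
    return rendered


# Illustrative input resembling describe-task-definition output.
described = {
    "family": "blog-api",
    "revision": 41,            # read-only field, must not be sent back
    "status": "ACTIVE",        # read-only field, must not be sent back
    "taskRoleArn": None,
    "containerDefinitions": [{"name": "api", "image": "blog-api:old"}],
    "cpu": "256",
    "memory": "512",
}
new_def = render_task_definition(described, "api", "blog-api:abc1234")
print(new_def["containerDefinitions"][0]["image"])
```

Because the function deep-copies its input, the described definition stays untouched, which makes it safe to render several variants from the same base.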
3) Deployment Circuit Breaker — auto-rollback #
This is a feature we touched on briefly in Chapter 22 Infrastructure skeleton. If the tasks of a new deployment don’t come up, ECS automatically reverts the service to the previous task definition.
aws ecs update-service \
  --cluster blog-cluster --service blog-api \
  --deployment-configuration "deploymentCircuitBreaker={enable=true,rollback=true},maximumPercent=200,minimumHealthyPercent=100"

How it works:
- When a new task can’t reach a healthy state, ECS counts it.
- If it doesn’t reach healthy within a certain count / time, it judges the deployment as failed.
- If rollback=true, it automatically returns to the previous task definition.
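On the GitHub Actions side, the step that runs with wait-for-service-stability: true is then expected to end in failure, so the workflow fails too. It is worth confirming this once with a deliberately broken image.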
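The "certain count" above scales with the service's desired count. The sketch below encodes the rule as I understand it from the ECS documentation at the time of writing: half the desired count, rounded up, clamped between 3 and 200. Treat both the constants and the formula as assumptions to verify against the current docs; failure_threshold is a hypothetical helper name.

```python
import math

# Assumed bounds, taken from the ECS documentation at the time of writing.
MIN_THRESHOLD = 3
MAX_THRESHOLD = 200


def failure_threshold(desired_count: int) -> int:
    """Failed task launches tolerated before the deployment is marked failed."""
    half = math.ceil(0.5 * desired_count)
    return min(MAX_THRESHOLD, max(MIN_THRESHOLD, half))


for desired in (1, 4, 25, 1000):
    print(desired, "->", failure_threshold(desired))
```

For a small service the floor dominates, so even a single-task service has to fail several launches before a rollback starts. That is part of why a rollback can take a few minutes to kick in.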
Manual rollback #
Use this when auto-rollback doesn’t trigger, or when you need to return to a known revision while you investigate after the fact. Here blog-api:42 stands for the last revision you know to be good.
PREV=$(aws ecs describe-task-definition --task-definition blog-api:42 \
  --query 'taskDefinition.taskDefinitionArn' --output text)
aws ecs update-service \
  --cluster blog-cluster --service blog-api \
  --task-definition "$PREV" \
  --force-new-deployment

4) Progressive deployment: Canary / Blue-Green #
With the default ECS rolling update, a new task starts taking traffic the moment it becomes healthy. If you need something more conservative, CodeDeploy fills that role.
Blue/Green #
Blue (currently in production) ←──── 100% traffic
    │
    ▼
Stand up Green, the new version (Blue stays alive)
    │
    ▼
Validate Green via the ALB Listener's Test traffic
    │
    ▼
Listener's 100% traffic → Green
    │
    ▼
Wait timer (10~60 min): if no issue, terminate Blue
    if there's an issue, return to Blue with a single Listener change (instant rollback)

The upsides are as follows.
- instant rollback — just revert the Listener
- explicitly secures time to validate the new version
The downsides are as follows.
- double the resources (during deployment)
- the ALB Listener pattern is slightly complex (Test listener + Production listener)
- heavier setup than ECS Rolling
Canary #
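Linear (10% every 5 minutes): 100% in 50 minutes
Canary (10% → wait 5 minutes → remaining 90% at once)
AllAtOnce (100% immediately, the fastest shape of Blue/Green)

CodeDeploy’s predefined deployment configuration names include the following.
- CodeDeployDefault.ECSAllAtOnce
- CodeDeployDefault.ECSLinear10PercentEvery1Minutes
- CodeDeployDefault.ECSCanary10Percent5Minutes

Note that the predefined Linear configuration listed here shifts 10% every minute. A 5-minute interval like the Linear example above requires a custom deployment configuration.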
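The difference between the shapes is just the sequence of traffic percentages the new version receives. A minimal sketch, with hypothetical helper names, that lists those sequences (the waiting interval between steps is left out on purpose, since it depends on the configuration you pick):

```python
def linear_steps(percent_per_step: int) -> list:
    """Cumulative traffic on the new version after each linear shift."""
    if percent_per_step <= 0:
        raise ValueError("percent_per_step must be positive")
    steps, shifted = [], 0
    while shifted < 100:
        shifted = min(100, shifted + percent_per_step)
        steps.append(shifted)
    return steps


def canary_steps(first_percent: int) -> list:
    """Canary: a small first slice, then everything that remains in one shift."""
    return [first_percent, 100]


def all_at_once_steps() -> list:
    """AllAtOnce: a single shift of all traffic."""
    return [100]


print("linear 10%:", linear_steps(10))
print("canary 10%:", canary_steps(10))
print("all at once:", all_at_once_steps())
```

Linear exposes a problem to a slowly growing share of users over ten shifts, while Canary gives you one observation window at 10% and then commits.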
Which one to use #
| Situation | Recommendation |
|---|---|
| small operation / side project | ECS Rolling + Circuit Breaker |
| high-traffic operation, risky change | CodeDeploy Blue/Green Linear |
| ML inference / large memory models | Blue/Green (needs warmup time) |
This book assumes ECS Rolling + Circuit Breaker as the default. Blue/Green is a story for after traffic has grown.
5) Comparison with CodePipeline #
Besides GitHub Actions, there’s AWS-native CI/CD.
| | GitHub Actions | CodePipeline |
|---|---|---|
| Trigger | push / PR / schedule | CodeCommit / GitHub / S3 / ECR push |
| Build | runners pool (hosted/self-hosted) | CodeBuild |
| Deploy | direct calls or actions | CodeDeploy / ECS / CFN / Lambda |
| Price | hosted per-minute / self-hosted free | $1/month per pipeline + CodeBuild |
| Upsides | code and workflow in one place, rich ecosystem | AWS-native integration, consistent IAM |
| Downsides | OIDC setup / separate secret management | weak integration with external services |
If your code is on GitHub, GitHub Actions is the natural choice. If your code is on CodeCommit per company security policy, it’s CodePipeline.
6) Environment separation — dev / staging / prod #
Branch by environment within one workflow.
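on:
  push:
    branches: [main, develop]

jobs:
  deploy:
    runs-on: ubuntu-latest
    # A job-level `if` cannot read the matrix context, so the environment
    # is picked from the branch name with expressions instead.
    environment: ${{ github.ref_name == 'main' && 'prod' || 'dev' }}
    env:
      ECS_CLUSTER: ${{ github.ref_name == 'main' && 'blog-cluster-prod' || 'blog-cluster-dev' }}
      DEPLOY_ROLE: ${{ github.ref_name == 'main' && 'arn:aws:iam::123456789012:role/github-actions-deploy-prod' || 'arn:aws:iam::123456789012:role/github-actions-deploy-dev' }}
    # steps then use ${{ env.ECS_CLUSTER }} and role-to-assume: ${{ env.DEPLOY_ROLE }}

You can attach separate secrets, required reviewers, and a wait timer to each GitHub environment (dev, prod). For the production environment, attach required reviewers and a 5-minute wait timer to prevent mistakes. Keep in mind that GitHub lets the job proceed as soon as one of the listed reviewers approves, so listing two reviewers does not by itself enforce a 2-person approval.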
7) Managing secrets and variables #
| Item | Where to put it |
|---|---|
| AWS Account ID | GitHub vars |
| cluster name / service name | GitHub vars or workflow env |
| DB password / API key | AWS Secrets Manager (Chapter 23) |
| GitHub deploy role ARN | GitHub vars |
| Slack webhook (CI notifications) | GitHub secrets |
The principle is as follows. App secrets go in AWS Secrets Manager, and GitHub secrets hold only the tokens CI itself needs.
Pitfalls — things you often meet in the CI/CD flow #
1) aws sts get-caller-identity fails with a credentials error #
Suspect the OIDC setup. Check in the following order.
- Is permissions: id-token: write missing?
- Does the IAM Role trust policy’s sub pattern exactly match the sub the workflow actually sends (repo:org/repo:ref:..., or repo:org/repo:environment:... when the job sets environment:)?
- Is the OIDC Provider thumbprint up to date?
- Is the Role’s aud condition sts.amazonaws.com?
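When the checks above don't settle it, looking at the claims of the token the job actually received usually does. A JWT payload is plain base64url-encoded JSON, so it can be decoded without any library. The sketch below is for debugging only and does not verify the signature; fake_token exists only so the example runs offline and is not part of any GitHub or AWS API.

```python
import base64
import json


def jwt_claims(token: str) -> dict:
    """Return the payload claims of a JWT WITHOUT verifying the signature."""
    payload = token.split(".")[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload))


def fake_token(claims: dict) -> str:
    """Build an unsigned, hypothetical token so the example runs offline."""
    body = base64.urlsafe_b64encode(json.dumps(claims).encode()).rstrip(b"=").decode()
    return f"e30.{body}.signature-placeholder"


token = fake_token({
    "sub": "repo:myorg/blog-api:environment:production",
    "aud": "sts.amazonaws.com",
})
claims = jwt_claims(token)
print("sub:", claims["sub"])
print("aud:", claims["aud"])
```

Compare the printed sub character by character with the trust policy's condition. A mismatch between the ref form and the environment form is a common cause.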
2) The Service updates even though the Migration failed #
If you don’t check the exit code of run-task, the workflow proceeds to the next step even when the migration fails. Always check the exitCode from aws ecs describe-tasks and exit 1 on non-zero.
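The same check is easy to express on the parsed describe-tasks response, which also makes the edge cases explicit: a task that never started has no exit code at all, and that must count as a failure too. This is a sketch on a dict shaped like the describe-tasks JSON; the function names are hypothetical.

```python
from typing import Optional


def migration_exit_code(describe_tasks_response: dict) -> Optional[int]:
    """Exit code of the first container of the first task, or None if unavailable."""
    tasks = describe_tasks_response.get("tasks") or []
    if not tasks:
        return None  # the task was never found or never launched
    containers = tasks[0].get("containers") or []
    if not containers:
        return None
    return containers[0].get("exitCode")  # None when the container never ran


def migration_succeeded(describe_tasks_response: dict) -> bool:
    """Only an explicit exit code of 0 counts as success."""
    return migration_exit_code(describe_tasks_response) == 0


ok = {"tasks": [{"containers": [{"name": "migrate", "exitCode": 0}]}]}
failed = {"tasks": [{"containers": [{"name": "migrate", "exitCode": 1}]}]}
never_ran = {"tasks": [{"containers": [{"name": "migrate"}]}]}
not_found = {"tasks": [], "failures": [{"reason": "MISSING"}]}

for name, resp in [("ok", ok), ("failed", failed),
                   ("never_ran", never_ran), ("not_found", not_found)]:
    print(name, migration_succeeded(resp))
```

Treating "no exit code" as failure matches the shell check in the workflow, where a missing value is not equal to 0 either.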
3) Deploying with the latest tag #
If the Task Definition image is :latest, you can’t trace which code is currently running. Specify down to the ECR image digest (@sha256:...) or use a git SHA tag.
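A cheap guard is to reject untraceable image references before registering a task definition. The sketch below treats a reference as traceable when it carries a digest or an explicit tag other than latest; the name is_traceable and the sample references are illustrative assumptions.

```python
def is_traceable(image: str) -> bool:
    """True if the image reference pins a digest or a non-latest tag."""
    if "@sha256:" in image:
        return True
    # Look only at the last path segment so a registry port is not read as a tag.
    name = image.rsplit("/", 1)[-1]
    if ":" not in name:
        return False  # no tag means an implicit :latest
    return name.rsplit(":", 1)[1] != "latest"


REGISTRY = "123456789012.dkr.ecr.ap-northeast-2.amazonaws.com"
samples = [
    f"{REGISTRY}/blog-api:abc1234",
    f"{REGISTRY}/blog-api:latest",
    f"{REGISTRY}/blog-api",
    f"{REGISTRY}/blog-api@sha256:" + "0" * 64,
]
for ref in samples:
    print(is_traceable(ref), ref)
```

Running a check like this in the test job keeps a stray :latest from ever reaching the deploy job.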
4) IP shortage for the Migration RunTask #
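Each Fargate task takes one private IP from its subnet, and a rolling deployment with maximumPercent=200 can briefly double the number of running tasks. In a subnet with few free IPs, the production tasks and the migration task trying to grab IPs at the same time can fail. Give migrations a dedicated subnet (and SG), or verify that IPs are available in the deploy window.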
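The arithmetic is simple enough to check ahead of time. AWS reserves five addresses in every subnet, so the usable count is the CIDR size minus five. The peak demand below is a rough upper bound that assumes one IP per task and ignores anything else living in the subnet (load balancer nodes, endpoints); the helper names are hypothetical.

```python
import ipaddress

AWS_RESERVED_PER_SUBNET = 5  # first four addresses and the last one


def usable_ips(cidr: str) -> int:
    """Addresses a subnet can hand out after the AWS-reserved ones."""
    total = ipaddress.ip_network(cidr).num_addresses
    return max(0, total - AWS_RESERVED_PER_SUBNET)


def peak_ips_needed(desired_count: int, maximum_percent: int = 200,
                    migration_tasks: int = 1) -> int:
    """Rough upper bound on task IPs during a rolling deploy plus migrations."""
    return desired_count * maximum_percent // 100 + migration_tasks


subnet = "10.0.1.0/28"   # illustrative small subnet
desired = 6
print("usable:", usable_ips(subnet))
print("peak needed:", peak_ips_needed(desired))
print("fits:", peak_ips_needed(desired) <= usable_ips(subnet))
```

With these numbers a /28 is already too small for six tasks at maximumPercent=200, while a /24 leaves plenty of room.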
5) The Circuit Breaker rolls back even a healthy deployment #
Too short a health-check grace period + a long boot time makes a healthy deployment look unhealthy. Set health-check-grace-period-seconds to the app’s boot time + margin (e.g., 90 seconds for Django).
6) GitHub Actions OIDC audience cache #
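You changed what feeds sub or aud (for example, you added environment: to the job or changed the audience), but the old value still seems to come through. A token is issued per job, and its claims reflect the workflow as it was when that run started, so trigger a new run rather than re-running the old one. Clearing a workflow cache has no effect on the token.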
7) ecs-deploy-task-definition getting stuck #
With wait-for-service-stability: true set, if wait-for-minutes is too short, even a healthy deployment is treated as a failure. Set it conservatively to 15 ~ 20 minutes.
Exercises #
- Write two reasons, on the basis of §“GitHub OIDC,” why OIDC is more secure than the IAM-user access-key approach. Also explain in one sentence why setting the Trust Policy’s sub pattern to repo:myorg/blog-api:* is dangerous.
- Explain, on the basis of §“Pitfall 2,” what incident arises if you don’t check the exit code at the migration RunTask step, and point to which part of the workflow YAML is responsible for this check.
- Lay out, on the basis of the §“Which one to use” table, in which situations you’d choose ECS Rolling + Circuit Breaker versus CodeDeploy Blue/Green. It helps to recall in advance where the deployment_circuit_breaker block goes when moving this deployment configuration into code in Chapter 25 Terraform intro.
In short: GitHub Actions OIDC assumes an AWS role with a short-lived token and no access keys, and restricts which branches or environments can deploy via the Trust Policy’s sub pattern. The workflow goes test → build/push → migration RunTask (check exit code) → Service deploy → wait-stable, and the Deployment Circuit Breaker auto-rolls back on failure. App secrets go in Secrets Manager; only CI’s own tokens go in GitHub secrets.
Next chapter #
Deployment is automated. But the infrastructure itself — VPC / SG / RDS / ALB / ECS — is still held in your hands via the console and CLI. Could you stand up an identical new environment once more? In the next Chapter 25 IaC — Terraform intro we move the infrastructure into code. We cover the shape of provider / resource / state, the S3 + DynamoDB backend, separating dev/prod with modules, and the flow of codifying Chapter 22’s infrastructure step by step.