AWS in Practice #3: CI/CD — GitHub Actions + ECR + ECS
In #1 we launched the ECS Service by hand, and in #2 we ran RDS and migrations by hand. This post bundles all that manual work into a single git push.
What we’ll cover:
- GitHub Actions ↔ AWS auth without access keys — OIDC
- The build → ECR push → Task Definition update → Service update → migration workflow
- Auto rollback — Deployment Circuit Breaker
- Progressive deploy — a touch of CodeDeploy blue/green / canary
- CodePipeline comparison — when to use which
The big picture #
git push (main)
│
▼
GitHub Actions
│
├─ 1) Test ← pytest / npm test
│
├─ 2) AWS OIDC assume-role ← no access keys
│
├─ 3) Build & push image ← <git-sha> tag
│ ECR: blog-api:abc1234
│
├─ 4) Run migrations ← ecs run-task (blog-api-migrate)
│ wait → check exit code
│
├─ 5) Update Task Definition ← new revision with new image
│
├─ 6) Update Service ← rolling deploy
│
└─ 7) Wait services-stable ← 5–10 min
on failure, circuit breaker auto-rollback
This post’s goal is to make this whole flow run in one go.
1) GitHub OIDC — auth without access keys #
The old pattern: create an IAM user, generate an access key, and save it in GitHub Secrets. That is risky: a long-lived key can leak (for example through git history), rotation is manual overhead, and usage is hard to audit.
With the OIDC (OpenID Connect) pattern, GitHub issues a short-lived token (valid for only a few minutes) for each workflow run, and AWS IAM is configured to trust it.
GitHub Actions Job starts
│
▼
GitHub OIDC Provider issues a JWT
{sub: "repo:myorg/blog-api:ref:refs/heads/main", aud: "sts.amazonaws.com"}
│
▼
aws-actions/configure-aws-credentials
├─ STS:AssumeRoleWithWebIdentity
├─ AWS validates sub claim against trust policy
▼
Temporary credentials (AccessKey / SecretKey / SessionToken) TTL: 1h
One-time — register the IAM OIDC Provider #
aws iam create-open-id-connect-provider \
  --url https://token.actions.githubusercontent.com \
  --client-id-list sts.amazonaws.com \
  --thumbprint-list 6938fd4d98bab03faadb97b34396831e3780aea1
The thumbprint is the SHA-1 fingerprint of the TLS certificate for GitHub’s OIDC endpoint. The AWS console fetches it automatically.
IAM Role — Trust Policy #
{
"Version": "2012-10-17",
"Statement": [{
"Effect": "Allow",
"Principal": {
"Federated": "arn:aws:iam::123456789012:oidc-provider/token.actions.githubusercontent.com"
},
"Action": "sts:AssumeRoleWithWebIdentity",
"Condition": {
"StringEquals": {
"token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
},
"StringLike": {
"token.actions.githubusercontent.com:sub": "repo:myorg/blog-api:environment:production"
}
}
}]
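}
The pattern in sub is the key:
| Pattern | Meaning |
|---|---|
| repo:myorg/blog-api:ref:refs/heads/main | Only main branch |
| repo:myorg/blog-api:ref:refs/tags/* | Only tag pushes |
| repo:myorg/blog-api:environment:production | Only those that pass the environment gate |
| repo:myorg/blog-api:* | Risky — even PRs can use this role |
Production recommendation: environment gate + main/tag only. A job that sets environment: gets the environment form of sub (repo:myorg/blog-api:environment:production) instead of the ref form, so match on the environment in the trust policy and limit that environment to main or tags with its deployment branch rules.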
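To get a feel for how these patterns behave, here is a minimal Python sketch that approximates IAM's StringLike matching (`*` for any run of characters, `?` for exactly one). The `string_like` helper is hypothetical and only imitates the matching rule; the real evaluation happens inside AWS when the role is assumed.

```python
import re


def string_like(pattern: str, value: str) -> bool:
    """Approximate IAM StringLike: '*' matches any run of characters, '?' exactly one.

    This is an illustration only, not AWS's implementation.
    """
    regex = "".join(
        ".*" if ch == "*" else "." if ch == "?" else re.escape(ch)
        for ch in pattern
    )
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None


if __name__ == "__main__":
    # sub claim GitHub sends for a pull_request-triggered run
    pr_sub = "repo:myorg/blog-api:pull_request"
    for pattern in (
        "repo:myorg/blog-api:ref:refs/heads/main",
        "repo:myorg/blog-api:*",
    ):
        print(pattern, "->", string_like(pattern, pr_sub))
```

The second pattern accepts the pull-request subject, which is why the wildcard row in the table is marked risky.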
Permissions Policy #
Only the actions needed for deployment:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "ECR",
"Effect": "Allow",
"Action": [
"ecr:GetAuthorizationToken",
"ecr:BatchCheckLayerAvailability",
"ecr:GetDownloadUrlForLayer",
"ecr:PutImage",
"ecr:InitiateLayerUpload",
"ecr:UploadLayerPart",
"ecr:CompleteLayerUpload"
],
"Resource": "*"
},
{
"Sid": "ECS",
"Effect": "Allow",
"Action": [
"ecs:RegisterTaskDefinition",
"ecs:DescribeTaskDefinition",
"ecs:UpdateService",
"ecs:DescribeServices",
"ecs:RunTask",
"ecs:DescribeTasks",
"ecs:ListTasks"
],
"Resource": "*"
},
{
"Sid": "PassRole",
"Effect": "Allow",
"Action": "iam:PassRole",
"Resource": [
"arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
"arn:aws:iam::123456789012:role/blog-api-task-role"
]
}
]
}
Without iam:PassRole, RegisterTaskDefinition fails — embedding an IAM role in a Task Definition is considered “passing” that role, which requires explicit permission.
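A quick way to catch this before deploying is to compare the roles a task definition references with the ARNs listed under iam:PassRole. The sketch below assumes the policy lists exact ARNs (no wildcards), as in the example above; `roles_missing_pass_role` is a hypothetical helper, not part of any AWS SDK.

```python
def roles_missing_pass_role(task_definition: dict, passable_role_arns: set) -> list:
    """Return role ARNs used by the task definition that the deploy role may not pass.

    Assumes passable_role_arns holds exact ARNs copied from the PassRole statement.
    """
    referenced = [
        task_definition.get("taskRoleArn"),
        task_definition.get("executionRoleArn"),
    ]
    # Ignore unset roles, then keep only those absent from the policy
    return sorted({arn for arn in referenced if arn and arn not in passable_role_arns})


if __name__ == "__main__":
    task_def = {
        "family": "blog-api",
        "taskRoleArn": "arn:aws:iam::123456789012:role/blog-api-task-role",
        "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
    }
    allowed = {"arn:aws:iam::123456789012:role/ecsTaskExecutionRole"}
    print(roles_missing_pass_role(task_def, allowed))
```

An empty list means every role in the task definition is covered by the PassRole statement.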
2) GitHub Actions workflow #
name: Deploy to ECS
on:
  push:
    branches: [main]
  workflow_dispatch:
permissions:
  id-token: write   # OIDC token issuance — required
  contents: read
env:
  AWS_REGION: ap-northeast-2
  ECR_REPOSITORY: blog-api
  ECS_CLUSTER: blog-cluster
  ECS_SERVICE: blog-api
  TASK_FAMILY: blog-api
  MIGRATE_FAMILY: blog-api-migrate
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.14" }
      - run: pip install -r requirements.txt -r requirements-dev.txt
      - run: pytest -q
  deploy:
    needs: test
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4
      # 1) AWS OIDC
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-actions-deploy
          aws-region: ${{ env.AWS_REGION }}
      # 2) ECR login
      - name: Login to ECR
        id: ecr
        uses: aws-actions/amazon-ecr-login@v2
      # 3) Build & push
      - name: Build and push
        id: build
        env:
          REGISTRY: ${{ steps.ecr.outputs.registry }}
          TAG: ${{ github.sha }}
        run: |
          docker build --platform=linux/amd64 \
            -t "$REGISTRY/$ECR_REPOSITORY:$TAG" \
            -t "$REGISTRY/$ECR_REPOSITORY:latest" .
          docker push "$REGISTRY/$ECR_REPOSITORY:$TAG"
          docker push "$REGISTRY/$ECR_REPOSITORY:latest"
          echo "image=$REGISTRY/$ECR_REPOSITORY:$TAG" >> "$GITHUB_OUTPUT"
      # 4) Migration RunTask
      - name: Run DB migrations
        env:
          IMAGE: ${{ steps.build.outputs.image }}
          SUBNET_ID: ${{ secrets.MIGRATE_SUBNET_ID }}
          SG_ID: ${{ secrets.FARGATE_SG_ID }}
        run: |
          # Register a new revision of the migrate task definition with the new image
          DEF=$(aws ecs describe-task-definition --task-definition "$MIGRATE_FAMILY" \
            --query 'taskDefinition' --output json)
          # Keep only registrable fields and drop nulls (e.g. an unset taskRoleArn),
          # which register-task-definition would otherwise reject
          NEW=$(echo "$DEF" | jq --arg I "$IMAGE" \
            '.containerDefinitions[0].image=$I |
             {family,taskRoleArn,executionRoleArn,networkMode,containerDefinitions,
              volumes,placementConstraints,requiresCompatibilities,cpu,memory} |
             with_entries(select(.value != null))')
          NEW_ARN=$(aws ecs register-task-definition \
            --cli-input-json "$NEW" \
            --query 'taskDefinition.taskDefinitionArn' --output text)
          # RunTask
          TASK_ARN=$(aws ecs run-task --cluster "$ECS_CLUSTER" \
            --task-definition "$NEW_ARN" --launch-type FARGATE \
            --network-configuration "awsvpcConfiguration={subnets=[$SUBNET_ID],securityGroups=[$SG_ID],assignPublicIp=ENABLED}" \
            --started-by "deploy-${{ github.sha }}" \
            --query 'tasks[0].taskArn' --output text)
          echo "Migration task: $TASK_ARN"
          aws ecs wait tasks-stopped --cluster "$ECS_CLUSTER" --tasks "$TASK_ARN"
          # Check exit code (non-zero = fail)
          EXIT=$(aws ecs describe-tasks --cluster "$ECS_CLUSTER" --tasks "$TASK_ARN" \
            --query 'tasks[0].containers[0].exitCode' --output text)
          if [ "$EXIT" != "0" ]; then
            echo "Migration failed (exit=$EXIT)"
            aws logs tail /ecs/blog-api-migrate --since 10m
            exit 1
          fi
      # 5) Update Service Task Definition
      - name: Render service task definition
        id: render
        uses: aws-actions/amazon-ecs-render-task-definition@v1
        with:
          task-definition: ops/task-definition.json
          container-name: api
          image: ${{ steps.build.outputs.image }}
      # 6) Deploy to ECS Service
      - name: Deploy
        uses: aws-actions/amazon-ecs-deploy-task-definition@v2
        with:
          task-definition: ${{ steps.render.outputs.task-definition }}
          service: ${{ env.ECS_SERVICE }}
          cluster: ${{ env.ECS_CLUSTER }}
          wait-for-service-stability: true
          wait-for-minutes: 15
Key items:
| Item | Meaning |
|---|---|
| id-token: write | Permission to request the OIDC token. Without it the job cannot obtain a token, so the assume-role step fails |
| environment: production | GitHub environment gate — manual approval, secret separation |
| aws-actions/amazon-ecs-render-task-definition | Base JSON + new image → new JSON |
| aws-actions/amazon-ecs-deploy-task-definition | RegisterTaskDefinition + UpdateService + wait |
| wait-for-service-stability | Wait for stable state — the step fails if the service does not stabilize |
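The "Base JSON + new image → new JSON" step is simple enough to sketch in Python. This is an illustration of the idea only, not the action's actual implementation; `render_task_definition` is a hypothetical name.

```python
import copy


def render_task_definition(base: dict, container_name: str, image: str) -> dict:
    """Return a copy of the task definition with one container's image replaced."""
    rendered = copy.deepcopy(base)  # leave the base JSON untouched
    for container in rendered.get("containerDefinitions", []):
        if container.get("name") == container_name:
            container["image"] = image
            return rendered
    raise ValueError(f"container {container_name!r} not found in task definition")


if __name__ == "__main__":
    base = {
        "family": "blog-api",
        "containerDefinitions": [{"name": "api", "image": "placeholder"}],
    }
    new = render_task_definition(
        base,
        "api",
        "123456789012.dkr.ecr.ap-northeast-2.amazonaws.com/blog-api:abc1234",
    )
    print(new["containerDefinitions"][0]["image"])
```

Failing loudly when the container name is missing matters here: a silent no-op would redeploy the old image and still report success.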
3) Deployment Circuit Breaker — auto-rollback #
Something we touched briefly in #1. When a new deployment can’t come up, it automatically reverts to the previous task definition.
aws ecs update-service \
  --cluster blog-cluster --service blog-api \
  --deployment-configuration "deploymentCircuitBreaker={enable=true,rollback=true},maximumPercent=200,minimumHealthyPercent=100"
How it works:
- ECS counts the new tasks that fail to reach a healthy state
- It marks the deployment as failed if the tasks can’t become healthy within a certain count / time
- With rollback=true, it automatically reverts to the previous task definition
In GitHub Actions, wait-for-service-stability returns a failure, so the workflow step also fails.
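For a rough idea of how many failed tasks it takes to trip the breaker, the sketch below encodes the threshold rule as I understand it from the ECS documentation: half the desired count, rounded up, clamped between 3 and 200. Those three numbers are not stated in this post and are assumptions to verify against the current AWS docs; `failure_threshold` is a hypothetical helper.

```python
import math

# Assumed bounds, taken from the ECS documentation rather than from this post
MIN_THRESHOLD = 3
MAX_THRESHOLD = 200


def failure_threshold(desired_count: int) -> int:
    """Estimate how many failed task launches mark the deployment as failed."""
    calculated = math.ceil(0.5 * desired_count)
    return min(MAX_THRESHOLD, max(MIN_THRESHOLD, calculated))


if __name__ == "__main__":
    for desired in (1, 25, 400, 800):
        print(desired, "->", failure_threshold(desired))
```

For a small service the lower bound dominates, so even a single-task service needs several failed launches before the rollback starts. That is part of why a failed deploy can take minutes to surface.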
Manual rollback #
When auto-rollback didn’t fire or you need to investigate after the fact:
PREV=$(aws ecs describe-task-definition --task-definition blog-api:42 \
  --query 'taskDefinition.taskDefinitionArn' --output text)
aws ecs update-service \
  --cluster blog-cluster --service blog-api \
  --task-definition "$PREV" \
  --force-new-deployment
4) Progressive deploys — Canary / Blue-Green #
By default, ECS rolling deployments send traffic to a new task as soon as it becomes healthy. For a more conservative approach, CodeDeploy steps in.
Blue/Green #
Blue (current production) ←──── 100% traffic
│
▼
Stand up Green (new version) — Blue still alive
│
▼
Validate Green via the ALB Listener's Test traffic
│
▼
Listener's 100% traffic → Green
│
▼
Wait timer (10–60 min) — if no issue, terminate Blue
— if issues, one Listener line back to Blue (instant rollback)
Pros:
- Instant rollback — flip the Listener back, done
- Explicit time to validate the new version
Cons:
- Double resources (during deploy)
- ALB Listener pattern is slightly complex (Test listener + Production listener)
- Heavier setup than ECS Rolling
Canary #
Linear (10% every 5 min) — 50 min to 100%
Canary (10% → 5 min wait → 90% in one shot)
AllAtOnce (instant 100% — fastest Blue/Green shape)
CodeDeploy deployment configuration names:
- CodeDeployDefault.ECSAllAtOnce
- CodeDeployDefault.ECSLinear10PercentEvery1Minutes
- CodeDeployDefault.ECSCanary10Percent5Minutes
Which for which case? #
| Case | Recommendation |
|---|---|
| Small production / side project | ECS Rolling + Circuit Breaker |
| Big traffic production, risky changes | CodeDeploy Blue/Green Linear |
| ML inference / large memory models | Blue/Green (warmup time needed) |
This series assumes ECS Rolling + Circuit Breaker as the default. Blue/Green is for after traffic gets bigger.
5) Comparison with CodePipeline #
Beyond GitHub Actions, there’s AWS-native CI/CD.
| | GitHub Actions | CodePipeline |
|---|---|---|
| Trigger | push / PR / schedule | CodeCommit / GitHub / S3 / ECR push |
| Build | Runners pool (hosted/self-hosted) | CodeBuild |
| Deploy | Direct calls or actions | CodeDeploy / ECS / CFN / Lambda |
| Pricing | Hosted minutes / self-hosted free | $1/month per pipeline + CodeBuild |
| Pros | Code and workflow in one place, rich ecosystem | AWS-native integration, IAM consistency |
| Cons | OIDC setup / separate secret management | Weaker external service integration |
If your code is on GitHub, GitHub Actions is the natural choice. If company security policy requires code in CodeCommit, go with CodePipeline.
6) Environment separation — dev / staging / prod #
Branching by environment within a single workflow:
on:
  push:
    branches: [main, develop]
jobs:
  deploy:
    runs-on: ubuntu-latest
    # The matrix context is not available in a job-level `if`, so the
    # environment and its settings are selected from the branch name instead.
    environment: ${{ github.ref_name == 'main' && 'prod' || 'dev' }}
    env:
      # main -> prod, develop -> dev
      ECS_CLUSTER: ${{ github.ref_name == 'main' && 'blog-cluster-prod' || 'blog-cluster-dev' }}
      DEPLOY_ROLE: ${{ github.ref_name == 'main' && 'arn:aws:iam::123456789012:role/github-actions-deploy-prod' || 'arn:aws:iam::123456789012:role/github-actions-deploy-dev' }}
You can attach separate secrets, required reviewers, and wait timers to each GitHub environment (dev, prod). Put required reviewers (GitHub needs only one of them to approve) plus a 5-minute wait timer on the production environment to prevent mistakes.
7) Handling secrets and variables #
| Item | Where to put |
|---|---|
| AWS Account ID | GitHub vars |
| Cluster name / Service name | GitHub vars or workflow env |
| DB password / API keys | AWS Secrets Manager (#2) |
| GitHub deploy role ARN | GitHub vars |
| Slack webhook (CI alerts) | GitHub secrets |
Principle: app secrets in AWS Secrets Manager, GitHub secrets only for tokens needed by CI itself.
Pitfalls — common issues in CI/CD #
1) aws sts get-caller-identity returns 401 #
Suspect OIDC setup. Check in order:
- Is permissions: id-token: write missing?
- Does the trust policy’s sub pattern exactly match the workflow’s actual repo:org/repo:ref:...? (A job that sets environment: sends repo:org/repo:environment:<name> instead.)
- Is the OIDC Provider thumbprint up to date?
- Is the Role’s aud condition sts.amazonaws.com?
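When the checklist does not settle it, looking at the claims GitHub actually sent usually does. The sketch below decodes a JWT payload without verifying the signature, so it is for debugging only. `decode_jwt_claims` is a hypothetical helper, and obtaining the token itself (inside a job, via the ACTIONS_ID_TOKEN_REQUEST_URL endpoint) is left out.

```python
import base64
import json


def decode_jwt_claims(token: str) -> dict:
    """Decode the payload of a JWT WITHOUT verifying its signature (debugging only)."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected a JWT with three dot-separated parts")
    payload = parts[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64url padding
    return json.loads(base64.urlsafe_b64decode(payload))


if __name__ == "__main__":
    # Build a fake, unsigned token purely to demonstrate the decoding step
    def _segment(obj: dict) -> str:
        raw = json.dumps(obj).encode()
        return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()

    fake = ".".join([
        _segment({"alg": "none"}),
        _segment({"sub": "repo:myorg/blog-api:environment:production",
                  "aud": "sts.amazonaws.com"}),
        "signature",
    ])
    claims = decode_jwt_claims(fake)
    print(claims["sub"], claims["aud"])
```

Compare the printed sub and aud character by character with the trust policy conditions; a mismatch there explains most assume-role failures.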
2) Service updates even when migration fails #
Without checking the run-task exit code, the workflow proceeds to the next step even when the migration has failed. Always check aws ecs describe-tasks exitCode and call exit 1 on non-zero.
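The same check can be written against the describe-tasks response shape, which also covers the case where a container never started and has no exitCode at all. A minimal sketch, assuming the response has already been parsed into a dict; `migration_succeeded` is a hypothetical helper.

```python
def migration_succeeded(describe_tasks_response: dict) -> bool:
    """Return True only if the first task exists and every container exited with code 0."""
    tasks = describe_tasks_response.get("tasks", [])
    if not tasks:
        return False  # task not found; see "failures" in the response
    containers = tasks[0].get("containers", [])
    # exitCode is absent when a container never started (e.g. image pull failure),
    # so a missing value must count as a failure, not as success
    return bool(containers) and all(c.get("exitCode") == 0 for c in containers)


if __name__ == "__main__":
    ok = {"tasks": [{"containers": [{"name": "migrate", "exitCode": 0}]}]}
    never_started = {"tasks": [{"containers": [{"name": "migrate"}]}]}
    print(migration_succeeded(ok), migration_succeeded(never_started))
```

Treating "no exit code" as failure is the conservative choice: the deploy stops instead of rolling out code against an unmigrated schema.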
3) Deploying with the latest tag #
If the Task Definition image is :latest, you can’t track which code is running. Specify the ECR image digest (@sha256:...) or the git SHA tag instead.
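A small guard can reject unpinned references before a task definition is registered. The sketch below treats a digest or any tag other than latest as pinned; `is_pinned_image` is a hypothetical helper, and it does not check whether the tag is immutable in ECR.

```python
def is_pinned_image(image: str) -> bool:
    """True if the image is referenced by digest or by a tag other than 'latest'."""
    if "@sha256:" in image:
        return True
    # Only look for a tag after the last '/', because a registry host may contain a port
    last_segment = image.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return False  # no tag means the implicit 'latest'
    tag = last_segment.rsplit(":", 1)[1]
    return tag != "latest"


if __name__ == "__main__":
    registry = "123456789012.dkr.ecr.ap-northeast-2.amazonaws.com"
    for ref in (f"{registry}/blog-api:abc1234", f"{registry}/blog-api:latest"):
        print(ref, "->", is_pinned_image(ref))
```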
4) Migration RunTask can’t get an IP #
If the subnets are small and have few free IPs, production tasks and migration tasks can compete for addresses and the migration task fails to launch. Use a separate SG / subnet for migrations, or verify that IPs are available during the deploy window.
5) Circuit Breaker rolls back even healthy deploys #
Too short health-check grace period + long boot time → healthy deploys get misjudged unhealthy. Set health-check-grace-period-seconds to app boot + buffer (e.g. Django 90s).
6) GitHub Actions OIDC audience cache #
You changed sub or aud but the old values keep showing up. This is not a workflow cache issue — you need to start a fresh job to get a new token.
7) ecs-deploy-task-definition stuck #
With wait-for-service-stability: true, if wait-for-minutes is too short, even healthy deploys fail. Be conservative — 15–20 minutes.
Wrapping up #
What we covered in this post:
- OIDC — IAM OIDC Provider + Trust policy sub pattern, id-token: write permission
- Permissions policy — three groups: ECR / ECS / iam:PassRole
- Workflow — test → OIDC → ECR push → migration RunTask → Service deploy → wait-stable
- Circuit Breaker — enable=true, rollback=true, with maximumPercent / minimumHealthyPercent shaping the rolling deploy
- Manual rollback — update-service to a previous task definition revision
- Blue/Green & Canary — CodeDeploy’s ECSLinear10PercentEvery1Minutes etc.
- CodePipeline comparison — natural choice based on code location
- Environment separation — per-branch targets + GitHub environments with approval/wait
- Secret management — app secrets in AWS Secrets Manager, only CI tokens in GitHub secrets
- Pitfalls — OIDC 401, missing migration check, latest tag, IP shortage, grace period, stale token, stuck stability wait
Next — IaC #
Deployment is automated. But the infrastructure itself — VPC / SG / RDS / ALB / ECS — is still managed by hand through the console / CLI. Could you spin up another identical environment from scratch?
In #4 IaC — Terraform fundamentals we move infrastructure to code. The shape of provider / resource / state, S3+DynamoDB backend, modules for dev/prod separation, and the flow of code-ifying the #1 infrastructure line by line.