Chapter 24

CI/CD Pipeline

The myshop-api built through Chapter 23 still relies heavily on humans when a new version comes in. This chapter automates that process. With OIDC trust, GitHub Actions pushes a container image to AWS ECR without static keys, auto-commits the Helm values in the manifest repo, and ArgoCD, covered in Chapter 20, detects that change and syncs it to the cluster. We also cover PR approval gates, the dev / prod split, Argo Rollouts canary deployment, and image tag immutability.

After the DB integration of Chapter 23, myshop-api is a complete service with EKS, RDS, Secrets, and the connection pool all in place, but every release still depends on people. Someone builds and pushes the container, someone changes the manifest’s image tag, and someone runs helm upgrade. This chapter turns that flow into code. GitHub Actions pushes the image to ECR without static keys via OIDC trust, auto-commits the Helm values in the manifest repo, and ArgoCD, covered in Chapter 20 GitOps, detects that change and syncs it to the cluster.

The goal of this chapter is a state where one code push auto-deploys to dev, and one git tag queues up a prod deployment. We also cover the production-standard PR approval gate and canary automatic promote / rollback.

The two-repo model — separating code and manifests #

The most common pattern in GitOps is the separation of two repos. The model touched on in Chapter 20 GitOps §“One repo vs two repos” is shown here as a full production pipeline.

Repo | Role
myshop-api (application repo) | Source code, Dockerfile, GitHub Actions workflow
myshop-manifests (manifest repo) | Helm values, ArgoCD Application manifests, per-environment config

There are three benefits to this separation.

  • Separation of permissions — the reviewers for code changes and for infrastructure / deployment changes can differ.
  • Clarity of changes — looking at the git log, “which version was up in prod at this point” is clear.
  • ArgoCD only needs to watch one place — watch just the manifest repo and the desired state of every environment is captured.

The flow of a code push is captured in the following diagram.

One cycle of GitOps
[developer push] -> [GitHub Actions: build / test / ECR push]
              -> [auto-commit the image tag in the manifest repo]
              -> [ArgoCD detects the change]
              -> [deploy the new version to the cluster]

We unpack each stage one section at a time.

GitHub Actions — AWS credentials dynamically via OIDC #

The old way to call AWS APIs from GitHub Actions was to store an IAM user’s access key / secret key in GitHub Secrets. The problem with this method is obvious — the keys are static so rotation is hard, and once leaked the impact is large.

The new standard is OIDC trust. GitHub's OIDC provider issues a short-lived JWT to the workflow run, and AWS verifies that token against the registered provider before issuing temporary credentials. This is the same structure as the IRSA of Chapter 16 RBAC / ServiceAccount in depth: the ServiceAccount’s projected token is simply replaced by the GitHub Actions JWT.

Registering the OIDC provider (Terraform) #

terraform/modules/github-oidc/main.tf
resource "aws_iam_openid_connect_provider" "github" {
  url             = "https://token.actions.githubusercontent.com"
  client_id_list  = ["sts.amazonaws.com"]
  thumbprint_list = ["6938fd4d98bab03faadb97b34396831e3780aea1"]
}

resource "aws_iam_role" "github_actions_ecr_push" {
  name = "github-actions-myshop-api-ecr-push"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Principal = {
        Federated = aws_iam_openid_connect_provider.github.arn
      }
      Action = "sts:AssumeRoleWithWebIdentity"
      Condition = {
        StringEquals = {
          "token.actions.githubusercontent.com:aud" = "sts.amazonaws.com"
        }
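        StringLike = {
          # main-branch pushes and v* tag pushes, matching the triggers in build.yml
          "token.actions.githubusercontent.com:sub" = [
            "repo:myshop/myshop-api:ref:refs/heads/main",
            "repo:myshop/myshop-api:ref:refs/tags/v*",
          ]
        }
      }
    }]
  })
}

resource "aws_iam_role_policy_attachment" "ecr_push" {
  role       = aws_iam_role.github_actions_ecr_push.name
  policy_arn = aws_iam_policy.ecr_push.arn
}

The sub in Condition is the key: only a workflow triggered from the main branch or a v* tag of the myshop/myshop-api repo can take on this Role. Other repos, other branches, and forks are all rejected. Note that a job bound to a GitHub environment presents a sub of the form repo:myshop/myshop-api:environment:<name> instead of the ref form, so such a job needs its own pattern in this list. Where the IRSA trust policy of Chapter 16 isolated by namespace + ServiceAccount name, GitHub Actions isolates by repo + ref.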
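To make the effect of the sub condition concrete, here is a minimal Python sketch of how a StringLike pattern decides which workflow runs may assume the Role. It is only an approximation of IAM's matching ('*' for any run of characters, '?' for exactly one), and the helper names iam_string_like and may_assume are hypothetical. The example assumes a policy that lists only the main-branch pattern, and the candidate values follow the sub formats GitHub documents for branch pushes, tag pushes, and environment-bound jobs.

```python
import re


def iam_string_like(pattern: str, value: str) -> bool:
    """Approximate IAM StringLike: '*' matches any run of characters, '?' exactly one."""
    regex = "".join(
        ".*" if ch == "*" else "." if ch == "?" else re.escape(ch)
        for ch in pattern
    )
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None


def may_assume(sub: str, allowed_patterns: list) -> bool:
    """True if the token's sub claim matches at least one allowed pattern."""
    return any(iam_string_like(p, sub) for p in allowed_patterns)


# Assumption for this sketch: the policy lists only the main-branch pattern.
ALLOWED = ["repo:myshop/myshop-api:ref:refs/heads/main"]

candidates = [
    "repo:myshop/myshop-api:ref:refs/heads/main",       # push to main
    "repo:myshop/myshop-api:ref:refs/heads/feature-x",  # push to another branch
    "repo:myshop/myshop-api:ref:refs/tags/v1.5.0",      # tag push
    "repo:myshop/myshop-api:environment:production",    # job bound to an environment
    "repo:someone/myshop-api:ref:refs/heads/main",      # a fork under another owner
]

for sub in candidates:
    verdict = "allow" if may_assume(sub, ALLOWED) else "deny"
    print(f"{verdict:5}  {sub}")
```

Only the first candidate is allowed. A tag push or an environment-bound job carries a different sub, so each needs its own pattern (for example a wildcard such as repo:myshop/myshop-api:ref:refs/tags/v*) before it can assume the Role.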

Workflow — build and push #

.github/workflows/build.yml
name: Build and push

on:
  push:
    branches: [main]
    tags: ['v*']

permissions:
  id-token: write    # needed to issue the OIDC token
  contents: read

env:
  AWS_REGION: ap-northeast-2
  ECR_REPOSITORY: myshop-api

jobs:
  build:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - name: Set image tag
        id: meta
        run: |
          if [[ "$GITHUB_REF" == refs/tags/v* ]]; then
            echo "tag=${GITHUB_REF#refs/tags/v}" >> $GITHUB_OUTPUT
          else
            echo "tag=main-$(git rev-parse --short HEAD)" >> $GITHUB_OUTPUT
          fi

      - name: Configure AWS credentials (OIDC)
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-actions-myshop-api-ecr-push
          aws-region: ${{ env.AWS_REGION }}

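      - name: Login to ECR
        uses: aws-actions/amazon-ecr-login@v2

      # The type=gha cache below needs a Buildx builder; the default docker driver cannot export cache
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3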

      - name: Build and push
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: |
            123456789012.dkr.ecr.ap-northeast-2.amazonaws.com/${{ env.ECR_REPOSITORY }}:${{ steps.meta.outputs.tag }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

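      - name: Update manifest repo
        env:
          GH_TOKEN: ${{ secrets.MANIFESTS_REPO_TOKEN }}
          IMAGE_TAG: ${{ steps.meta.outputs.tag }}
          # v* tags target prod, main-branch pushes target dev
          TARGET_ENV: ${{ startsWith(github.ref, 'refs/tags/v') && 'prod' || 'dev' }}
        run: |
          # -f sends raw strings; -F would convert values such as "true" or "123" to JSON types
          gh api repos/myshop/myshop-manifests/dispatches \
            -f event_type=update-image \
            -f "client_payload[app]=myshop-api" \
            -f "client_payload[tag]=$IMAGE_TAG" \
            -f "client_payload[env]=$TARGET_ENV"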

We point out the three key steps.

  • Configure AWS credentials (OIDC): calls AssumeRoleWithWebIdentity on the IAM Role created above, presenting the OIDC token. This one step receives temporary credentials without static keys.
  • Build and push: builds the image with Docker Buildx and pushes it to ECR. Layers are cached in the GitHub Actions cache (type=gha).
  • Update manifest repo: triggers another workflow in the manifest repo with a repository_dispatch event. That workflow auto-commits the Helm values.
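The tag and dispatch logic of this workflow can be sketched in a few lines of Python, which makes it easy to check how each kind of ref is handled. This is an illustration rather than part of the pipeline: the function names are hypothetical, the 7-character short SHA is an assumption (git rev-parse --short may print more characters), and the env mapping assumes the dev / prod split this chapter describes, with v* tags going to prod.

```python
import json


def image_tag(github_ref: str, sha: str) -> str:
    """Mirror the 'Set image tag' step: v-tags become semver, anything else main-<short sha>."""
    prefix = "refs/tags/v"
    if github_ref.startswith(prefix):
        return github_ref[len(prefix):]
    return f"main-{sha[:7]}"  # assumption: 7-character short SHA


def target_env(github_ref: str) -> str:
    """Assumed mapping: tag builds target prod, branch builds target dev."""
    return "prod" if github_ref.startswith("refs/tags/v") else "dev"


def dispatch_body(github_ref: str, sha: str) -> dict:
    """Body of the repository_dispatch request sent to the manifest repo."""
    return {
        "event_type": "update-image",
        "client_payload": {
            "app": "myshop-api",
            "tag": image_tag(github_ref, sha),
            "env": target_env(github_ref),
        },
    }


# Hypothetical commit SHA, used only for the demonstration.
sha = "3f9c2ab41d7e5c0f9a8b7c6d5e4f3a2b1c0d9e8f"
for ref in ("refs/heads/main", "refs/tags/v1.5.0"):
    print(json.dumps(dispatch_body(ref, sha)))
```

Because the leading v is stripped, the git tag v1.5.0 yields the image tag 1.5.0, which matches the image reference used in the manifests later in this chapter.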

Auto-commit in the manifest repo #

In the manifest repo we put a workflow that receives the dispatch above and updates the values file.

myshop-manifests/.github/workflows/update-image.yml
name: Update image tag

on:
  repository_dispatch:
    types: [update-image]

jobs:
  update:
    runs-on: ubuntu-latest
    permissions:
      contents: write   # the job pushes a commit with the default GITHUB_TOKEN
    steps:
      - uses: actions/checkout@v4

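      - name: Update values
        env:
          APP: ${{ github.event.client_payload.app }}
          TAG: ${{ github.event.client_payload.tag }}
          ENV: ${{ github.event.client_payload.env }}
        run: |
          # Payload fields arrive as env vars, so they are never spliced into the script text
          yq -i '.image.tag = strenv(TAG)' "charts/$APP/values-$ENV.yaml"

      - name: Commit and push
        env:
          APP: ${{ github.event.client_payload.app }}
          TAG: ${{ github.event.client_payload.tag }}
          ENV: ${{ github.event.client_payload.env }}
        run: |
          git config user.name "github-actions[bot]"
          git config user.email "41898282+github-actions[bot]@users.noreply.github.com"
          git add charts/
          # Nothing staged means the tag was already set; exit cleanly instead of failing on an empty commit
          if git diff --cached --quiet; then
            echo "values already at $TAG, nothing to commit"
            exit 0
          fi
          git commit -m "chore: bump $APP to $TAG ($ENV)"
          git push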

When this commit goes into the manifest repo’s main branch, ArgoCD, which has been watching that change, auto-syncs it to the cluster. It’s the shape where the two files values-dev.yaml / values-prod.yaml of Chapter 22 become the targets of this chapter’s auto-commit.
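What the values update does to the file can be sketched in plain Python. This is a simplified stand-in for the yq expression, not a YAML parser: it assumes a values file with a top-level image: key followed by an indented tag: line, and the function name bump_image_tag is hypothetical.

```python
import re


def bump_image_tag(values_text: str, new_tag: str) -> str:
    """Return values_text with image.tag set to new_tag.

    Handles only the simple layout of a top-level `image:` key followed by an
    indented `tag:` line; raises ValueError if that line is not found.
    """
    out = []
    in_image = False
    replaced = False
    for line in values_text.splitlines():
        if re.match(r"^\S", line):
            # A new top-level key starts; track whether it is the image block.
            in_image = line.startswith("image:")
        elif in_image and not replaced:
            m = re.match(r"^(\s+)tag:", line)
            if m:
                line = f'{m.group(1)}tag: "{new_tag}"'
                replaced = True
        out.append(line)
    if not replaced:
        raise ValueError("image.tag not found")
    return "\n".join(out) + "\n"


# Hypothetical values-dev.yaml content, used only for the demonstration.
before = (
    "replicaCount: 2\n"
    "image:\n"
    "  repository: myshop-api\n"
    '  tag: "1.4.0"\n'
    "service:\n"
    "  tag: keep\n"
)
print(bump_image_tag(before, "1.5.0"))
```

Only the tag under image: changes; the tag key under service: is left alone. A change this small is what makes the manifest repo's git log read as a deployment history.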

ArgoCD — the watcher of the manifest repo #

We use the ArgoCD model covered in Chapter 20 GitOps directly. One Application CRD manifest handles the deployment of one myshop-api environment.

Installing ArgoCD #

ArgoCD Helm install
helm repo add argo https://argoproj.github.io/argo-helm
helm install argocd argo/argo-cd \
  -n argocd --create-namespace \
  --values argocd-values.yaml
argocd-values.yaml — partial
server:
  ingress:
    enabled: true
    ingressClassName: alb
    annotations:
      alb.ingress.kubernetes.io/scheme: internet-facing
      alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'
      alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:...
    hosts:
      - argocd.myshop.example.com

configs:
  cm:
    timeout.reconciliation: 30s

The ArgoCD UI is exposed at argocd.myshop.example.com. The AWS Load Balancer Controller installed in Chapter 22 resolves this Ingress to an ALB as well. In production it’s standard to put SSO (GitHub, Google) in front of it, and the RBAC model seen in Chapter 14 RBAC / NetworkPolicy / ResourceQuota carries over directly into the ArgoCD UI’s permission model.

The myshop-api Application #

argocd/applications/myshop-api-prod.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myshop-api-prod
  namespace: argocd
spec:
  project: myshop

  source:
    repoURL: https://github.com/myshop/myshop-manifests.git
    targetRevision: main
    path: charts/myshop-api
    helm:
      valueFiles:
        - values.yaml
        - values-prod.yaml

  destination:
    server: https://kubernetes.default.svc
    namespace: myshop

  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
      - PrunePropagationPolicy=foreground
      - ServerSideApply=true
    retry:
      limit: 5
      backoff:
        duration: 5s
        maxDuration: 3m

Let’s point out the tradeoffs among the three options.

  • automated — changes in git are reflected to the cluster immediately. A mode suitable for dev.
  • selfHeal: true — even if someone modifies directly with kubectl edit, it auto-recovers to the git manifest.
  • prune: true — objects that disappear from git are deleted from the cluster too.
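The interplay of these options is easier to see as a reconcile function. The following Python sketch is a deliberately simplified model, not ArgoCD's actual diffing logic: resources are plain dictionaries keyed by a hypothetical Kind/name string, and plan_sync is an illustrative name.

```python
def plan_sync(desired: dict, live: dict, prune: bool) -> list:
    """Compute the actions needed to move the live state to the desired (git) state.

    desired / live map a resource key such as "Deployment/myshop-api" to its spec.
    """
    actions = []
    for name, spec in sorted(desired.items()):
        if name not in live:
            actions.append(("create", name))
        elif live[name] != spec:
            actions.append(("update", name))
    if prune:
        # Objects that exist in the cluster but no longer in git are deleted.
        for name in sorted(live.keys() - desired.keys()):
            actions.append(("prune", name))
    return actions


# State in git (desired) vs. state in the cluster (live); the values are illustrative.
desired = {
    "Deployment/myshop-api": {"replicas": 3},
    "Service/myshop-api": {"port": 80},
}
live = {
    "Deployment/myshop-api": {"replicas": 5},  # someone ran kubectl edit
    "ConfigMap/old-settings": {},               # removed from git earlier
}

print(plan_sync(desired, live, prune=True))
print(plan_sync(desired, live, prune=False))
```

In this model, automated means the plan is executed whenever git changes, selfHeal means it is also executed whenever the live state drifts (the replicas edit above), and prune decides whether the leftover ConfigMap is deleted or left in place.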

The migration Job defined as a Helm hook in Chapter 23 needs no change: ArgoCD maps Helm hook annotations such as pre-install / pre-upgrade to its own PreSync hook. ArgoCD runs the migration Job before applying the new manifest and moves to the next stage only if that Job succeeds, so the flow is absorbed naturally into GitOps.

dev vs prod — the automatic sync split #

The pattern of turning off automatic sync for prod and going with a manual trigger is frequently used.

myshop-api-prod.yaml — manual sync
syncPolicy:
  syncOptions:
    - CreateNamespace=true
    - ServerSideApply=true
  # remove the automated block -> manual sync mode

The deployment flow branches as follows.

dev vs prod deployment flow
[dev]
git push -> GitHub Actions build -> ECR push
        -> manifest repo commit (values-dev.yaml)
        -> ArgoCD auto-sync -> deploy to the dev cluster

[prod]
git tag v1.5.0 -> GitHub Actions build -> ECR push
              -> manifest repo commit (values-prod.yaml)
              -> a human clicks "Sync" in the ArgoCD UI
              -> deploy to the prod cluster

The human gate for prod deployment is the safeguard. The manifest itself is reviewed via a git PR, and the actual application is confirmed once more by an operator. This double gate is covered in the change-management procedure of Chapter 26 Operations checklist.

The standard for bundling Applications — App of Apps #

It’s a pattern where, instead of applying Application manifests by hand into ArgoCD, one root Application watches the other Applications.

argocd/root.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root
  namespace: argocd
spec:
  project: default   # assumption: the built-in default project; the Application spec needs a project
  source:
    repoURL: https://github.com/myshop/myshop-manifests.git
    targetRevision: main
    path: argocd/applications
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true

When you make a new Application in the argocd/applications/ directory, it’s automatically registered in ArgoCD, and that Application syncs its own manifest. The cluster’s own operations come into GitOps too. It’s the stage where the model touched on in Chapter 20 §“App of Apps” settles in as the standard setup of full multi-environment operations.

Image Updater — moving image tag updates to ArgoCD #

The flow above had GitHub Actions commit to the manifest repo to update the image tag. ArgoCD Image Updater is an option that moves this step to ArgoCD.

myshop-api-prod.yaml — Image Updater annotation
metadata:
  annotations:
    argocd-image-updater.argoproj.io/image-list: api=123456789012.dkr.ecr.ap-northeast-2.amazonaws.com/myshop-api
    argocd-image-updater.argoproj.io/api.update-strategy: semver
    argocd-image-updater.argoproj.io/write-back-method: git
    # A leading / is resolved from the repo root; a relative path is resolved from spec.source.path
    argocd-image-updater.argoproj.io/write-back-target: helmvalues:/charts/myshop-api/values-prod.yaml

ArgoCD Image Updater polls ECR at a fixed interval (2 minutes by default) and, when it finds a new tag, auto-commits to the manifest repo. The commit step of GitHub Actions becomes unnecessary, but a new image is picked up only at the next poll, so immediacy drops. If you want the order of the code push and the manifest commit recorded clearly in git, the GitHub Actions commit model is more intuitive. This book’s standard path is the GitHub Actions commit, and we point out Image Updater only as an option for multi-cluster environments.
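The idea behind the semver update strategy can be illustrated with a short Python sketch: among the tags found in the registry, pick the highest version and ignore everything that is not a version. This is a simplification under stated assumptions (plain MAJOR.MINOR.PATCH tags only, no pre-release or constraint handling), and latest_semver is a hypothetical helper, not Image Updater's code.

```python
import re

# Plain MAJOR.MINOR.PATCH only; pre-release and build metadata are out of scope here.
SEMVER = re.compile(r"^(\d+)\.(\d+)\.(\d+)$")


def latest_semver(tags):
    """Return the highest MAJOR.MINOR.PATCH tag, or None if no tag is a version."""
    versions = []
    for tag in tags:
        m = SEMVER.match(tag)
        if m:
            # Compare numerically, so 1.10.0 sorts above 1.9.3.
            versions.append((tuple(int(part) for part in m.groups()), tag))
    return max(versions)[1] if versions else None


# Hypothetical tag list for the myshop-api repository.
tags = ["1.4.0", "1.10.0", "1.9.3", "main-3f9c2ab", "latest"]
print(latest_semver(tags))
```

The main-<sha> tags produced by branch builds are not versions, so under this strategy only releases built from git tags become update candidates for prod.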

Canary · blue-green — Argo Rollouts #

The standard Deployment’s RollingUpdate, covered in Chapter 4 Deployment / ReplicaSet, is the simplest zero-downtime deployment model. Argo Rollouts builds more sophisticated patterns on top of that model: canary, blue-green, and promotion after automatic analysis.

rollout.yaml — 5% canary -> analysis -> 100%
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: myshop-api
  namespace: myshop
spec:
  replicas: 10
  strategy:
    canary:
      canaryService: myshop-api-canary
      stableService: myshop-api-stable
      trafficRouting:
        alb:
          ingress: myshop-api
          servicePort: 80
      steps:
        - setWeight: 5
        - pause: { duration: 5m }
        - analysis:
            templates:
              - templateName: success-rate
        - setWeight: 25
        - pause: { duration: 10m }
        - setWeight: 50
        - pause: { duration: 10m }
        - setWeight: 100
  selector:
    matchLabels:
      app.kubernetes.io/name: myshop-api
  template:
    metadata:
      labels:
        app.kubernetes.io/name: myshop-api   # must match spec.selector
    spec:
      containers:
        - name: api
          image: 123456789012.dkr.ecr.ap-northeast-2.amazonaws.com/myshop-api:1.5.0
          # ... (same spec as the Deployment)

The new version transitions gradually in the order of 5 % traffic for 5 minutes → automatic analysis (Prometheus metric query) → on pass 25 % → 50 % → 100 %. If a failure is detected at the analysis stage, it auto-rolls back.
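The step sequence behaves like a small state machine, which the following Python sketch simulates. It is a model of the control flow only: pauses are skipped instead of waited on, the analysis is a callback supplied by the caller, and run_canary is a hypothetical name rather than anything provided by Argo Rollouts.

```python
# The same steps as the Rollout manifest, written as (kind, argument) pairs.
STEPS = [
    ("setWeight", 5),
    ("pause", "5m"),
    ("analysis", "success-rate"),
    ("setWeight", 25),
    ("pause", "10m"),
    ("setWeight", 50),
    ("pause", "10m"),
    ("setWeight", 100),
]


def run_canary(steps, analysis_passes):
    """Walk the steps in order; return (status, final canary weight in percent)."""
    weight = 0
    for kind, arg in steps:
        if kind == "setWeight":
            weight = arg
        elif kind == "analysis" and not analysis_passes(arg):
            # A failed analysis aborts the rollout and traffic returns to stable.
            return ("aborted", 0)
        # "pause": the real controller waits here; the sketch moves straight on.
    return ("promoted", weight)


print(run_canary(STEPS, lambda template: True))
print(run_canary(STEPS, lambda template: False))
```

The point of the ordering is that the analysis runs while only 5 % of traffic is exposed, so a bad version is stopped before the 25 % step is ever reached.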

analysistemplate.yaml — success-rate analysis
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  metrics:
    - name: success-rate
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090
          query: |
            sum(rate(http_requests_total{app="myshop-api",status=~"2.."}[5m]))
              / sum(rate(http_requests_total{app="myshop-api"}[5m]))
      successCondition: result[0] >= 0.99
      failureLimit: 1

The Prometheus metric we’ll cover in Chapter 25 Monitoring · alerts goes directly into the canary’s automatic promote / rollback decision at this stage. Argo Rollouts shows its true value when bound with the metric stack of Chapter 19 Observability — it’s the shape where the metric is used as the input data of automation, not as a dashboard a human looks at.
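What the analysis evaluates can be reproduced with a little arithmetic. The Python sketch below mirrors the query and the success condition under the assumption that request rates are already available per status code; the function names and the sample numbers are illustrative only.

```python
import re
from typing import Optional


def success_rate(rate_by_status: dict) -> Optional[float]:
    """Share of requests with a 2xx status, like the query's status=~"2.." matcher.

    Returns None when there is no traffic, since the ratio is undefined.
    """
    total = sum(rate_by_status.values())
    if total == 0:
        return None
    ok = sum(v for status, v in rate_by_status.items() if re.fullmatch(r"2..", status))
    return ok / total


def analysis_passes(rate: Optional[float], threshold: float = 0.99) -> bool:
    """Mirror of successCondition: result[0] >= 0.99."""
    return rate is not None and rate >= threshold


# Illustrative request rates per status code.
healthy = {"200": 985.0, "201": 10.0, "500": 5.0}
with_4xx = {"200": 980.0, "404": 15.0, "500": 5.0}

for sample in (healthy, with_4xx):
    rate = success_rate(sample)
    print(rate, analysis_passes(rate))
```

Note that with this query every non-2xx response counts against the canary, including client errors such as 404. Whether that is the desired definition of success is a design decision for the service.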

The standard for the PR flow — environments + required reviewers #

We also lay out GitHub Actions’s production-standard gate.

.github/workflows/build.yml — using an environment
jobs:
  build-prod:
    if: startsWith(github.ref, 'refs/tags/v')
    runs-on: ubuntu-latest
    environment:
      name: production
      url: https://api.myshop.example.com
    steps:
      - ...

If you create the environment: production in GitHub Settings and specify Required reviewers, a workflow going to that environment won’t start without a human’s approval. It’s the standard pattern that prevents a prod deployment from auto-starting on a single tag. Combined with the ArgoCD UI’s manual Sync, a double gate of the build stage + the deployment stage is created.

Checks for the first cycle #

These are the items to check at the point when GitHub Actions push → ECR → manifest commit → ArgoCD sync has gone around once.

Check the ECR image
aws ecr describe-images \
  --repository-name myshop-api \
  --region ap-northeast-2 \
  --query 'imageDetails[*].[imageTags,imagePushedAt]' \
  --output table
ArgoCD Application status
argocd app get myshop-api-prod
argocd app sync myshop-api-prod   # manual sync (for prod)
argocd app history myshop-api-prod
Is the deployed image tag correct?
kubectl get deployment myshop-api -n myshop \
  -o jsonpath='{.spec.template.spec.containers[0].image}'

If the three commands consistently point to the new tag, the deployment is working normally. The ArgoCD UI shows the same information visually, and drift between the manifest and the cluster is visible at a glance. If ArgoCD is stuck at OutOfSync, refer to the GitOps debugging section of Chapter 27 kubectl debugging patterns. The three most common causes are a format error in the values file, insufficient permission to pull the ECR image (which surfaces as ImagePullBackOff), and ArgoCD lacking valid credentials for the manifest repo.
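The comparison itself is simple enough to script. As a minimal sketch, assuming the image references have already been collected from the commands above, the following Python helpers extract the tag from each reference and check that all of them agree. The function names are hypothetical.

```python
from typing import Optional


def image_tag_of(image_ref: str) -> Optional[str]:
    """Extract the tag from an image reference, or None if it has no tag."""
    name = image_ref.split("@", 1)[0]       # drop an optional @sha256:... digest
    last_segment = name.rsplit("/", 1)[-1]  # a ':' before the last '/' is a registry port
    if ":" not in last_segment:
        return None
    return last_segment.rsplit(":", 1)[1]


def all_point_to(expected_tag: str, *image_refs: str) -> bool:
    """True if every image reference carries the expected tag."""
    return all(image_tag_of(ref) == expected_tag for ref in image_refs)


REGISTRY = "123456789012.dkr.ecr.ap-northeast-2.amazonaws.com"
deployed = f"{REGISTRY}/myshop-api:1.5.0"  # e.g. from the kubectl jsonpath query
desired = f"{REGISTRY}/myshop-api:1.5.0"   # e.g. rendered from values-prod.yaml

print(image_tag_of(deployed))
print(all_point_to("1.5.0", deployed, desired))
```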

One trap — the mutability of container image tags #

The production standard is keeping image tags immutable. If you let the same tag point to a different image, ArgoCD’s drift detection loses its meaning. The following setup is essential.

  • Enable immutable tags on the ECR repository — turn on image_tag_mutability = "IMMUTABLE" with Terraform.
  • Never use the latest tag in prod — always a git SHA or semver.
  • Image tag = git commit hash or git tag — which commit is up in which environment is visible at a glance.

If this setup is missing, the accident of “the tag that worked until yesterday is a different image today” happens. It’s the point where the source of truth of GitOps breaks. The principle touched on in Chapter 20 GitOps §“For git to be the single source” leads into a concrete form in this chapter’s ECR setup.
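Why mutable tags undermine drift detection can be shown with a toy registry. The Python class below is a model for illustration only, not ECR's API: a tag maps to an image digest, and in immutable mode an attempt to point an existing tag at a different digest is rejected. The digests are made-up placeholders.

```python
class TagExistsError(Exception):
    """Raised when an immutable tag would be pointed at a different image."""


class Registry:
    def __init__(self, immutable: bool):
        self.immutable = immutable
        self.tags = {}  # tag -> image digest

    def push(self, tag: str, digest: str) -> None:
        current = self.tags.get(tag)
        if self.immutable and current is not None and current != digest:
            raise TagExistsError(f"tag {tag} already points to {current}")
        self.tags[tag] = digest


mutable = Registry(immutable=False)
mutable.push("1.5.0", "sha256:aaa")
mutable.push("1.5.0", "sha256:bbb")  # silently replaces the image behind the tag
print(mutable.tags["1.5.0"])

immutable = Registry(immutable=True)
immutable.push("1.5.0", "sha256:aaa")
try:
    immutable.push("1.5.0", "sha256:bbb")
except TagExistsError as err:
    print("rejected:", err)
```

In the mutable case the manifest still says 1.5.0 both before and after the second push, so git shows no change even though a different image is now running. Git can only be the source of truth if a tag identifies exactly one image.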

Exercises #

  1. Apply this chapter’s GitHub OIDC Terraform manifest and set up ECR push without static keys from one repo in your own GitHub organization. Switch the pattern of Condition.sub between the two values repo:org/repo:ref:refs/heads/main and repo:org/repo:environment:production and compare which one each policy allows. In one paragraph, explain how environment-based isolation differs from branch-based isolation.
  2. Make the two ArgoCD Application manifests for dev and prod, and split them so dev is automated.prune + selfHeal and prod is manual sync. Measure how many seconds it takes for selfHeal to revert to the git value when you arbitrarily change the Deployment’s replicas with kubectl edit in dev. Explain in one paragraph why the same behavior is dangerous in prod, in the operational context of Chapter 26.
  3. Apply Argo Rollouts’s canary manifest and auto-promote myshop-api’s new version 5 % → 25 % → 100 %. Deliberately deploy a version that returns 5xx and observe the analysis stage detecting the failure and auto-rolling back. Note how the Prometheus query that becomes the input of this automatic analysis connects to the alert rules of Chapter 25.

In one line: the CI/CD standard for a production cluster is a GitOps pipeline where GitHub Actions OIDC, ECR, manifest repo auto-commit, and ArgoCD watch work as one flow. The two-repo separation solves permissions, change tracking, and ArgoCD’s single watch target at once, and it splits dev into automated sync + selfHeal and prod into manual sync + the double gate of GitHub environment. Argo Rollouts’s canary uses Prometheus metrics as automation input to move promote / rollback from a human’s hand to code. If image tag immutability is missing, GitOps’s source of truth breaks.

Next chapter #

At this point myshop-api has settled into a pattern where one code push auto-deploys to dev, and one git tag queues up a prod deployment. But there’s still no layer that looks into all those behaviors.

In the next chapter we fill that empty space. In Chapter 25 Monitoring · alerts we cover the observability stack of a production cluster composed of Prometheus + Grafana + Alertmanager + CloudWatch, and the core alert rule set. The metric · log · trace model of Chapter 19 Observability leads into a full AWS-coupled operational setup.
