Chapter 25

IaC — Terraform Intro

Why IaC, the shape of Terraform's provider / resource / state, team collaboration with an S3 + DynamoDB backend, environment separation with modules, and the flow of codifying the previous chapters' infrastructure step by step.

The infrastructure we built in Chapters 22 through 24 is still managed by hand through the console and CLI. If you were asked to stand up the same setup once more, would you work from memory or from notes? Either way, the result wobbles. This chapter moves that work into Terraform.

This is the fourth chapter of Part 4. It covers the following.

  • Why IaC — repeatability / code review / drift tracking
  • Terraform’s structure — provider, resource, data, variable, output, state
  • state is the real heart — the S3 + DynamoDB lock backend
  • modules — units of reuse, branching by environment
  • codifying Chapter 22’s ECS infrastructure step by step

Why IaC #

There are four pains you meet in a console-only operation.

  1. Not reproducible — stand up staging exactly like prod? With human memory, subtle differences always remain.
  2. Changes not trackable — “who changed the SG last week?” means digging through CloudTrail. With code, it’s git log.
  3. Not reviewable — a one-line edit to a production cluster’s SG inbound gets no colleague’s eyes on it.
  4. The burden of delete / recreate — one thing built wrong and you’re afraid to fix it.

IaC (Infrastructure as Code) expresses infrastructure as declarative code and solves all four of the above at once.

Tool | Role
Terraform | multi-cloud, the most standard; the star of this chapter
Pulumi | written in TypeScript / Python / Go; strong on dynamic logic
AWS CDK | TypeScript / Python → transpiled to CloudFormation
CloudFormation | AWS-native YAML/JSON; weak on dynamic expression
OpenTofu | the OSS fork of Terraform (after the license dispute)

This book standardizes on Terraform. But as of 2026 you should know about Terraform’s license change and the OpenTofu option before getting into the tool in earnest, so we touch on it once here.

Terraform vs OpenTofu — the 2026 fork #

In 2023, HashiCorp changed Terraform’s license from open source (MPL 2.0) to the BSL (Business Source License) 1.1. Individuals and most companies can still use it, but building a competing product with Terraform is restricted. In response, the community forked the last MPL version into OpenTofu, governed as open source under the Linux Foundation. In 2025, IBM acquired HashiCorp.

The key point is that the two are effectively compatible.

  • Same HCL syntax, same provider ecosystem, compatible state.
  • Only the CLI changes, from terraform to tofu (tofu init / tofu plan / tofu apply).
  • Every .tf file in this book works as-is on both Terraform and OpenTofu.

When to choose | Pick
Fully open source / community governance / avoiding the BSL matters | OpenTofu
You need HCP Terraform (cloud state · policy · team features) or commercial support | Terraform
Learning / side projects | Either is fine (identical syntax)

As of 2026, OpenTofu has matured enough that production adoptions (Boeing · Capital One, etc.) are growing. This book writes its explanations and commands against terraform, but if your company uses OpenTofu, just read the command as tofu and it works the same.

1) Terraform’s five components #

main.tf — the smallest structure
# 1) Provider — how to communicate with AWS
provider "aws" {
  region = "ap-northeast-2"
}

# 2) Resource — the actual infrastructure to create
resource "aws_ecr_repository" "blog_api" {
  name                 = "blog-api"
  image_tag_mutability = "MUTABLE"

  image_scanning_configuration {
    scan_on_push = true
  }
}

# 3) Data — look up an existing resource
data "aws_caller_identity" "current" {}

# 4) Variable — external input
variable "environment" {
  type    = string
  default = "dev"
}

# 5) Output — expose the result you created
output "ecr_url" {
  value = aws_ecr_repository.blog_api.repository_url
}

When the five components gather in one file, they become one unit of infrastructure.

The 4-step workflow #

Terraform's 4 steps
terraform init      # download providers, initialize backend
terraform plan      # preview what gets created/changed/deleted
terraform apply     # apply
terraform destroy   # delete

The output of plan is Terraform’s greatest value. It stops incidents before the code merge.

Example plan output
Terraform will perform the following actions:

  # aws_security_group.fargate will be created
  + resource "aws_security_group" "fargate" {
      + arn                    = (known after apply)
      + name                   = "sg-fargate"
      + ingress = [
          + {
              + from_port = 8000
              + to_port   = 8000
              + protocol  = "tcp"
              + ...
            },
        ]
    }

Plan: 1 to add, 0 to change, 0 to destroy.

Legend: + create / ~ update in place / - destroy / -/+ destroy and recreate. Watch -/+ most closely: the resource comes back with a new ID, and anything that referenced the old one breaks.

2) State — the real heart #

Terraform stores “the state of the infrastructure built so far” in state (a .tfstate file). This file is what lets the next plan compute the difference.

The role of state
the actual AWS infrastructure   ←──────  Terraform code
                                   state (the result of the last apply)

Terraform looks at the three-way consistency of code ↔ state ↔ AWS and then drafts the change plan.

What happens when state breaks #

Situation | Result
state lost | Terraform believes “nothing has been built yet” and tries to create resources that already exist
two people apply at once | state breaks, or one side overwrites the other’s changes
state file committed to git in plaintext | password / key exposure (many resources store secrets in state)

So local .tfstate is for learning only. For production, a remote backend is mandatory.

S3 + DynamoDB Backend #

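The most common production pattern. Terraform 1.10 and later can also lock natively in S3 (use_lockfile = true) without a DynamoDB table; this chapter uses the DynamoDB table, which also works on older versions.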

backend.tf
terraform {
  required_version = ">= 1.7"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }

  backend "s3" {
    bucket         = "myorg-terraform-state"
    key            = "blog-api/prod/terraform.tfstate"
    region         = "ap-northeast-2"
    dynamodb_table = "terraform-state-lock"
    encrypt        = true
  }
}

The setup laid out:

Component | Role
S3 bucket | stores the state file (versioning + encryption enabled)
DynamoDB table | the lock table; blocks concurrent applies
bucket key prefix | separates environments with the <project>/<env>/terraform.tfstate pattern
encrypt = true | server-side encryption of the state object (SSE-S3 by default; KMS if kms_key_id is set)

The one-time bootstrap to set up the backend #

The S3 bucket and the DynamoDB table themselves have to be created by someone first. It’s a chicken-and-egg problem. There are two flows.

  1. Create once manually via console / CLI (this chapter’s assumption)
  2. Create in a separate “bootstrap” folder with a local backend, then migrate the backend to S3

bootstrap (flow 1, AWS CLI)
aws s3api create-bucket \
  --bucket myorg-terraform-state \
  --region ap-northeast-2 \
  --create-bucket-configuration LocationConstraint=ap-northeast-2

aws s3api put-bucket-versioning \
  --bucket myorg-terraform-state \
  --versioning-configuration Status=Enabled

aws dynamodb create-table \
  --table-name terraform-state-lock \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST \
  --region ap-northeast-2

Never destroy these two resources with Terraform. The state lives inside them.
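
Flow 2 from the list above can be sketched in HCL as an alternative to the CLI commands. This is a minimal sketch, assuming a hypothetical bootstrap/main.tf that reuses the bucket and table names from the CLI example; the KMS encryption and public-access settings are assumptions added for illustration, not something this chapter prescribes.

```hcl
# bootstrap/main.tf (hypothetical file; alternative to the CLI flow, not in addition to it)
provider "aws" {
  region = "ap-northeast-2"
}

# Bucket that will hold every environment's state file.
resource "aws_s3_bucket" "state" {
  bucket = "myorg-terraform-state"

  lifecycle {
    prevent_destroy = true # a destroy of this bucket would take the state with it
  }
}

# Versioning lets you roll back to an earlier state object.
resource "aws_s3_bucket_versioning" "state" {
  bucket = aws_s3_bucket.state.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Assumption: default encryption with the AWS-managed KMS key.
resource "aws_s3_bucket_server_side_encryption_configuration" "state" {
  bucket = aws_s3_bucket.state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

# Assumption: the state bucket should never be public.
resource "aws_s3_bucket_public_access_block" "state" {
  bucket                  = aws_s3_bucket.state.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# Lock table; the S3 backend expects a string hash key named LockID.
resource "aws_dynamodb_table" "lock" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }

  lifecycle {
    prevent_destroy = true
  }
}
```

Apply it once with the default local backend. Afterwards you can add a backend "s3" block to the bootstrap folder and run terraform init -migrate-state so that the bootstrap state itself also lives in the bucket.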

3) Directory structure — separation by environment #

The real-world shape
infra/
├─ modules/
│   ├─ network/        ← VPC, Subnets, SGs
│   ├─ ecs-service/    ← ALB + Service + Auto Scaling
│   └─ rds/            ← DB
├─ envs/
│   ├─ dev/
│   │   ├─ main.tf
│   │   ├─ backend.tf
│   │   ├─ variables.tf
│   │   └─ terraform.tfvars
│   └─ prod/
│       ├─ main.tf
│       ├─ backend.tf
│       ├─ variables.tf
│       └─ terraform.tfvars
└─ bootstrap/          ← S3 / DynamoDB (once only)

Separate state by giving each environment a different backend key.

envs/dev/backend.tf
terraform {
  backend "s3" {
    bucket         = "myorg-terraform-state"
    key            = "blog-api/dev/terraform.tfstate"
    region         = "ap-northeast-2"
    dynamodb_table = "terraform-state-lock"
    encrypt        = true
  }
}

This fully separates dev and prod. A dev apply will never touch prod state.

4) Modules — units of reuse #

To avoid repeating the same infrastructure pattern across dev / prod, bundle it into a module.

modules/ecs-service/variables.tf
variable "name"          { type = string }
variable "cluster_arn"   { type = string }
variable "image"         { type = string }
variable "vpc_id"        { type = string }
variable "subnet_ids"    { type = list(string) }
variable "alb_sg_id"     { type = string }
variable "desired_count" {
  type    = number
  default = 2
}

variable "cpu" {
  type    = string
  default = "512"
}

variable "memory" {
  type    = string
  default = "1024"
}

variable "container_port" {
  type    = number
  default = 8000
}

modules/ecs-service/main.tf (excerpt)
resource "aws_security_group" "fargate" {
  name        = "sg-${var.name}-fargate"
  description = "Fargate task SG"
  vpc_id      = var.vpc_id

  ingress {
    from_port       = var.container_port
    to_port         = var.container_port
    protocol        = "tcp"
    security_groups = [var.alb_sg_id]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_lb_target_group" "this" {
  name        = "tg-${var.name}"
  port        = var.container_port
  protocol    = "HTTP"
  target_type = "ip"
  vpc_id      = var.vpc_id

  health_check {
    path                = "/health"
    healthy_threshold   = 2
    interval            = 15
  }
}

resource "aws_ecs_task_definition" "this" {
  family                   = var.name
  network_mode             = "awsvpc"
  requires_compatibilities = ["FARGATE"]
  cpu                      = var.cpu
  memory                   = var.memory
  execution_role_arn       = aws_iam_role.execution.arn
  task_role_arn            = aws_iam_role.task.arn

  container_definitions = jsonencode([{
    name  = "api"
    image = var.image
    portMappings = [{ containerPort = var.container_port, protocol = "tcp" }]
    logConfiguration = {
      logDriver = "awslogs"
      options = {
        "awslogs-group"         = aws_cloudwatch_log_group.this.name
        "awslogs-region"        = data.aws_region.current.name
        "awslogs-stream-prefix" = "api"
      }
    }
  }])
}

resource "aws_ecs_service" "this" {
  name            = var.name
  cluster         = var.cluster_arn
  task_definition = aws_ecs_task_definition.this.arn
  desired_count   = var.desired_count
  launch_type     = "FARGATE"

  network_configuration {
    subnets          = var.subnet_ids
    security_groups  = [aws_security_group.fargate.id]
    assign_public_ip = true
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.this.arn
    container_name   = "api"
    container_port   = var.container_port
  }

  deployment_circuit_breaker {
    enable   = true
    rollback = true
  }
}

output "target_group_arn" { value = aws_lb_target_group.this.arn }
output "service_name"     { value = aws_ecs_service.this.name }
output "fargate_sg_id"    { value = aws_security_group.fargate.id } # consumed by the rds module

Chapter 22’s console work has gathered here into one file.
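
The excerpt refers to a region lookup, a log group, and two IAM roles that it does not show. The sketch below is one plausible shape for them, assuming they live in the same module; the log group name, the retention period, and the role names are hypothetical choices, not the book's exact code.

```hcl
# modules/ecs-service/main.tf (supporting pieces; an illustrative sketch)

# Region lookup used by the awslogs-region option.
data "aws_region" "current" {}

# Log group the container writes to. Name and retention are assumptions.
resource "aws_cloudwatch_log_group" "this" {
  name              = "/ecs/${var.name}"
  retention_in_days = 30
}

# Trust policy shared by both roles: only ECS tasks may assume them.
data "aws_iam_policy_document" "ecs_tasks_assume" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["ecs-tasks.amazonaws.com"]
    }
  }
}

# Execution role: lets ECS pull the image and write logs on the task's behalf.
resource "aws_iam_role" "execution" {
  name               = "${var.name}-execution"
  assume_role_policy = data.aws_iam_policy_document.ecs_tasks_assume.json
}

resource "aws_iam_role_policy_attachment" "execution" {
  role       = aws_iam_role.execution.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}

# Task role: what the application code itself may call. No permissions attached here.
resource "aws_iam_role" "task" {
  name               = "${var.name}-task"
  assume_role_policy = data.aws_iam_policy_document.ecs_tasks_assume.json
}
```

Keeping the execution role and the task role separate means that permissions the application needs later (S3, SQS, and so on) go on the task role only, without widening what the ECS agent can do.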

Using the module #

envs/prod/main.tf
module "network" {
  source = "../../modules/network"
  name   = "blog-prod"
  cidr   = "10.0.0.0/16"
  azs    = ["ap-northeast-2a", "ap-northeast-2c"]
}

module "rds" {
  source              = "../../modules/rds"
  name                = "blog-prod"
  vpc_id              = module.network.vpc_id
  db_subnet_ids       = module.network.db_subnet_ids
  fargate_sg_id       = module.api.fargate_sg_id
  multi_az            = true
  instance_class      = "db.t4g.small"
  deletion_protection = true
}

module "api" {
  source         = "../../modules/ecs-service"
  name           = "blog-prod"
  cluster_arn    = aws_ecs_cluster.blog.arn
  image          = var.image  # injected by CI
  vpc_id         = module.network.vpc_id
  subnet_ids     = module.network.private_subnet_ids
  alb_sg_id      = module.network.alb_sg_id
  desired_count  = 4
  cpu            = "1024"
  memory         = "2048"
}

The dev environment is a lighter shape with desired_count = 1, multi_az = false, instance_class = "db.t4g.micro". The key is same module + different variables.
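
For comparison, the dev counterpart might look like the sketch below. It assumes the same module interfaces as the prod example, including a fargate_sg_id output on the ecs-service module; the CIDR, the cluster name, and deletion_protection = false are illustrative assumptions.

```hcl
# envs/dev/main.tf (a sketch; values other than the three named in the text are assumptions)
variable "image" {
  type = string # injected by CI, as in prod
}

resource "aws_ecs_cluster" "blog" {
  name = "blog-dev"
}

module "network" {
  source = "../../modules/network"
  name   = "blog-dev"
  cidr   = "10.1.0.0/16" # hypothetical, chosen not to overlap with prod
  azs    = ["ap-northeast-2a", "ap-northeast-2c"]
}

module "rds" {
  source              = "../../modules/rds"
  name                = "blog-dev"
  vpc_id              = module.network.vpc_id
  db_subnet_ids       = module.network.db_subnet_ids
  fargate_sg_id       = module.api.fargate_sg_id
  multi_az            = false
  instance_class      = "db.t4g.micro"
  deletion_protection = false # assumption: dev databases are disposable
}

module "api" {
  source        = "../../modules/ecs-service"
  name          = "blog-dev"
  cluster_arn   = aws_ecs_cluster.blog.arn
  image         = var.image
  vpc_id        = module.network.vpc_id
  subnet_ids    = module.network.private_subnet_ids
  alb_sg_id     = module.network.alb_sg_id
  desired_count = 1
  # cpu and memory are omitted, so the module defaults ("512" / "1024") apply
}
```

Only the values differ between the two files. A diff of envs/dev/main.tf against envs/prod/main.tf then reads as the complete list of differences between the environments.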

5) Terraform ↔ CI/CD integration #

This section connects Terraform to the GitHub Actions pipeline from Chapter 24 (CI/CD).

Two flows #

Flow | How it works
A. Separate infra / app | infra changes go through a separate PR + apply; the app deploy only updates the image
B. Bound in one workflow | image build → terraform apply puts the new image on the service

At first, A is recommended. Infra changes are heavy, app deploys are frequent. The two flows carry different risk levels.
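
With flow A, the deploy pipeline registers new task definition revisions outside Terraform, so the next terraform apply would try to move the service back to the revision in the code. A common way to avoid that tug-of-war is ignore_changes on the service. The sketch below is a hypothetical standalone variant of the module's service resource, with variables invented for illustration; it is not the chapter's module code.

```hcl
# Hypothetical standalone variant of the service resource for flow A.
variable "name"                { type = string }
variable "cluster_arn"         { type = string }
variable "task_definition_arn" { type = string }
variable "subnet_ids"          { type = list(string) }
variable "security_group_ids"  { type = list(string) }

resource "aws_ecs_service" "this" {
  name            = var.name
  cluster         = var.cluster_arn
  task_definition = var.task_definition_arn # used for the first deployment only
  desired_count   = 2
  launch_type     = "FARGATE"

  network_configuration {
    subnets         = var.subnet_ids
    security_groups = var.security_group_ids
  }

  lifecycle {
    # The deploy pipeline owns the running revision from here on;
    # Terraform no longer reports it as a difference.
    ignore_changes = [task_definition]
  }
}
```

The trade-off is that Terraform also stops rolling out its own task definition changes, such as a new cpu or memory value. Those reach the service only with the next app deploy.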

Plan as a PR comment #

.github/workflows/terraform-plan.yml
name: Terraform Plan
on:
  pull_request:
    paths: ['infra/**']

permissions:
  id-token: write
  contents: read
  pull-requests: write

jobs:
  plan:
    runs-on: ubuntu-latest
    defaults:
      run: { working-directory: infra/envs/prod }
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/terraform-plan
          aws-region: ap-northeast-2
      - uses: hashicorp/setup-terraform@v3
        with: { terraform_version: 1.9.0 }
      - run: terraform init
      - run: terraform plan -no-color -out=tfplan
      - name: Comment Plan
        uses: actions/github-script@v7
        with:
          script: |
            // github-script runs from the repository root, so point execSync at the env directory.
            const out = require('child_process')
              .execSync('terraform show -no-color tfplan', {
                cwd: 'infra/envs/prod',
                encoding: 'utf8',
              });
            // Await the API call so a failed comment fails the step.
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: '```\n' + out + '\n```'
            });

At the PR review stage, checking what’s changing in one place is the most effective way to stop production incidents before the code merge.

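The terraform-plan role needs only read-only permissions on the infrastructure, plus read access to the state bucket and write access to the lock table, because plan still acquires the state lock. Keep the apply permission in a separate role.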
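
One way to express such a plan-only role in Terraform is sketched below. It assumes the GitHub OIDC provider is already registered in the account, and the repository name myorg/blog-api is a hypothetical placeholder; the account ID, role name, and lock table name are taken from the examples in this chapter. The AWS-managed ReadOnlyAccess policy is an assumption that is usually broad enough for plan, not a requirement.

```hcl
# Trust policy: only pull_request runs of one repository may assume the role.
data "aws_iam_policy_document" "plan_trust" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = ["arn:aws:iam::123456789012:oidc-provider/token.actions.githubusercontent.com"]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:sub"
      values   = ["repo:myorg/blog-api:pull_request"] # hypothetical repository
    }
  }
}

resource "aws_iam_role" "terraform_plan" {
  name               = "terraform-plan"
  assume_role_policy = data.aws_iam_policy_document.plan_trust.json
}

# Read-only view of the account, used for refresh and diff (includes reading the state object).
resource "aws_iam_role_policy_attachment" "plan_read_only" {
  role       = aws_iam_role.terraform_plan.name
  policy_arn = "arn:aws:iam::aws:policy/ReadOnlyAccess"
}

# plan takes the state lock, so it needs write access to the lock table and nothing else.
data "aws_iam_policy_document" "plan_lock" {
  statement {
    actions   = ["dynamodb:GetItem", "dynamodb:PutItem", "dynamodb:DeleteItem"]
    resources = ["arn:aws:dynamodb:ap-northeast-2:123456789012:table/terraform-state-lock"]
  }
}

resource "aws_iam_role_policy" "plan_lock" {
  name   = "state-lock"
  role   = aws_iam_role.terraform_plan.id
  policy = data.aws_iam_policy_document.plan_lock.json
}
```

Because the role can read the state object, it can also read any secrets stored in state. Restricting the trust policy to a single repository is what keeps that exposure small.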

6) Drift tracking #

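Resources you change by hand in the console diverge from what the code declares (drift). terraform plan refreshes state against AWS, shows the difference, and proposes the changes that would bring the resource back in line with the code.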

Periodic drift check
terraform plan -detailed-exitcode
# exit 0 = no difference
# exit 1 = the plan itself failed (error)
# exit 2 = difference exists (not a failure)

Run it once a day in CI, and on exit 2, notify via Slack.

Pitfalls of Terraform operations #

1) State lock won’t release #

If an apply is killed mid-run (a second Ctrl-C, a cancelled CI job, a dropped connection), the DynamoDB lock entry can be left behind. The next plan or apply then fails with “Error acquiring the state lock.”

Force unlock (careful)
terraform force-unlock <LOCK_ID>

The LOCK_ID is shown in the error message. Always confirm that someone else isn’t actually working before doing this.

2) Editing state by hand #

Opening .tfstate in vim to edit it almost always ends in regret. Use the state commands instead.

state commands
terraform state list                       # list resources
terraform state show aws_ecr_repository.x  # detail one resource
terraform state rm aws_ecr_repository.x    # remove from state (doesn't delete the actual resource)
terraform state mv module.a.x module.b.x   # move a resource
terraform import aws_ecr_repository.x my-repo  # register an existing resource into state
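
Two of these commands also have declarative counterparts that go through plan and PR review like any other change. The sketch below assumes Terraform 1.7 or later (matching the required_version used in this chapter); the resource name legacy is a hypothetical example.

```hcl
# Adopt an existing repository: the declarative counterpart of `terraform import`.
import {
  to = aws_ecr_repository.blog_api
  id = "blog-api" # for ECR, the import ID is the repository name
}

resource "aws_ecr_repository" "blog_api" {
  name                 = "blog-api"
  image_tag_mutability = "MUTABLE"

  image_scanning_configuration {
    scan_on_push = true
  }
}

# Stop managing a resource without deleting it: the counterpart of `terraform state rm`.
# "legacy" is a hypothetical resource whose block has already been removed from the code.
removed {
  from = aws_ecr_repository.legacy

  lifecycle {
    destroy = false # forget it in state, leave the real repository alone
  }
}
```

The plan output then lists the import and the removal explicitly, so a reviewer sees the state change before it happens instead of learning about a CLI command after the fact.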

3) Passwords in state in plaintext #

aws_db_instance’s password and aws_secretsmanager_secret_version’s secret_string go into state in plaintext. State bucket encryption + access restriction are mandatory.

State bucket policy (example)
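data "aws_iam_policy_document" "state_bucket" {
  # Deny every request that does not arrive over TLS.
  statement {
    effect  = "Deny"
    actions = ["s3:*"]
    resources = [
      "arn:aws:s3:::myorg-terraform-state",
      "arn:aws:s3:::myorg-terraform-state/*",
    ]

    # A bucket policy needs an explicit principal; "*" covers every caller.
    principals {
      type        = "*"
      identifiers = ["*"]
    }

    condition {
      test     = "Bool"
      variable = "aws:SecureTransport"
      values   = ["false"]
    }
  }
}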

4) -/+ destroy/create #

If you see -/+ in a plan, the resource is destroyed and recreated, and its ID changes. For RDS, that means data loss. Read that part of the plan closely.

  # aws_db_instance.blog must be replaced
-/+ resource "aws_db_instance" "blog" {
      ~ engine_version = "16.3" -> "17.0"  # forces replacement
    }

Handle a change like this through a separate migration procedure instead of letting Terraform replace the instance. For engine upgrades, RDS provides its own in-place upgrade path.

5) Not pinning the provider version #

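If you leave version unspecified in required_providers, the next init can pull a new major version and break. Always use a constraint like ~> 5.0, which accepts any 5.x release and rejects 6.0, and commit the .terraform.lock.hcl file that init generates so everyone resolves the same exact version.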

6) terraform destroy incident #

Sooner or later someone runs destroy against the prod environment by accident. Put a guard in place.

Protecting important resources
resource "aws_db_instance" "blog" {
  # ...
  lifecycle {
    prevent_destroy = true
  }
}

Any plan that would destroy or replace a resource marked prevent_destroy = true fails with an error at the plan stage.
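
prevent_destroy and deletion_protection guard at different layers, and they can be combined. The sketch below is a self-contained illustration; the identifier, username, storage size, and snapshot name are hypothetical values, and only the engine version and instance class echo the examples in this chapter.

```hcl
variable "db_password" {
  type      = string
  sensitive = true # hidden in plan output, but still stored in state
}

resource "aws_db_instance" "blog" {
  identifier        = "blog-prod" # hypothetical
  engine            = "postgres"
  engine_version    = "16.3"
  instance_class    = "db.t4g.small"
  allocated_storage = 20     # hypothetical
  username          = "blog" # hypothetical
  password          = var.db_password
  multi_az          = true

  # AWS-side guard: the RDS API rejects a delete call, whoever sends it
  # (Terraform, the console, or the CLI).
  deletion_protection = true

  # If the instance is ever deleted on purpose, keep a final snapshot.
  skip_final_snapshot       = false
  final_snapshot_identifier = "blog-prod-final" # hypothetical

  # Terraform-side guard: a plan that would destroy or replace this
  # resource fails before anything is sent to AWS.
  lifecycle {
    prevent_destroy = true
  }
}
```

The two cover each other's gaps. prevent_destroy disappears if the whole resource block is deleted from the code, while deletion_protection still blocks the delete at the API. deletion_protection says nothing at plan time, while prevent_destroy stops the plan early.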

Exercises #

  1. Write out, without looking, the four pains of a console-only operation (§“Why IaC”), and connect, in one sentence each, which Terraform feature (plan / git history / PR review / state) solves each pain.
  2. From the §“What happens when state breaks” table, lay out three reasons local .tfstate is dangerous in production, and pair which risk the S3 backend and which the DynamoDB backend each prevent.
  3. Explain in one paragraph, in connection with the data from Chapter 23 RDS integration, why you must stop and look closely when you see -/+ in terraform plan output. Also write out, distinguishing them, what prevent_destroy and deletion_protection each prevent.

In short: IaC turns infrastructure into declarative code, solving reproducibility, traceability, review, and safe deletion at once. Terraform is built from the five elements provider / resource / data / variable / output and cycles through init → plan → apply → destroy. The heart is state, and for production an S3 + DynamoDB backend is mandatory. Separate dev and prod by giving the same module different variables, surface the plan as a PR comment to stop incidents before the merge, and protect -/+ recreation and terraform destroy with lifecycle.

Next chapter #

The infrastructure is now code and deployment is automated. Now it’s time to look seriously at whether it’s running, and whether it’s running well. Chapter 26 (Monitoring: CloudWatch alarms and X-Ray) covers the core metrics of ECS / RDS / ALB, operational Logs Insights queries, the flow of sending alarms to Slack, and capturing “why did only this request take 5 seconds?” with X-Ray distributed tracing.
