IaC — Terraform Intro
Why IaC, the shape of Terraform's provider / resource / state, team collaboration with an S3 + DynamoDB backend, environment separation with modules, and the flow of codifying the previous chapters' infrastructure step by step.
The infrastructure we built in Chapters 22 through 24 is still managed by hand through the console and CLI. Ask someone to stand up the same setup once more, from memory or from notes, and the result wobbles. This chapter moves that work into Terraform.
This is the fourth chapter of Part 4. It covers:
- Why IaC — repeatability / code review / drift tracking
- Terraform’s structure — provider, resource, data, variable, output, state
- state is the real heart — the S3 + DynamoDB lock backend
- modules — units of reuse, branching by environment
- codifying Chapter 22’s ECS infrastructure step by step
Why IaC #
There are four pains you meet in a console-only operation.
- Not reproducible — stand up staging exactly like prod? With human memory, subtle differences always remain.
- Changes not trackable — “who changed the SG last week?” means digging through CloudTrail. With code, it’s git log.
- Not reviewable — a one-line edit to a production cluster’s SG inbound gets no colleague’s eyes on it.
- The burden of delete / recreate — one thing built wrong and you’re afraid to fix it.
IaC (Infrastructure as Code) expresses infrastructure as declarative code and solves all four of the above at once.
| Tool | Role |
|---|---|
| Terraform | multi-cloud, the most standard. the star of this chapter |
| Pulumi | written in TypeScript / Python / Go. strong on dynamic logic |
| AWS CDK | TypeScript / Python → transpiled to CloudFormation |
| CloudFormation | AWS-native YAML/JSON. weak on dynamic expression |
| OpenTofu | the OSS fork of Terraform (after the license dispute) |
This book standardizes on Terraform. But as of 2026 you should know about Terraform’s license change and the OpenTofu option before getting into the tool in earnest, so we touch on it once here.
Terraform vs OpenTofu: the fork, as of 2026 #
In 2023, HashiCorp changed Terraform’s license from open source (MPL 2.0) to the BSL (Business Source License) 1.1. Individuals and most companies can still use it, but building a competing product with Terraform is restricted. In response, the community forked the last MPL version into OpenTofu, governed as open source under the Linux Foundation. In 2025, IBM acquired HashiCorp.
The key point is that the two are effectively compatible.
- Same HCL syntax, same provider ecosystem, compatible state.
- Only the CLI changes from terraform → tofu (tofu init / tofu plan / tofu apply).
- Every .tf file in this book works as-is on both Terraform and OpenTofu.
| When to choose | Pick |
|---|---|
| Fully open source / community governance / avoiding the BSL matters | OpenTofu |
| You need HCP Terraform (cloud state · policy · team features) or commercial support | Terraform |
| Learning / side projects | Either is fine (identical syntax) |
As of 2026, OpenTofu has matured enough that production adoptions (Boeing · Capital One, etc.) are growing. This book writes its explanations and commands against terraform, but if your company uses OpenTofu, just read the command as tofu and it works the same.
1) Terraform’s five components #
# 1) Provider: how to communicate with AWS
provider "aws" {
  region = "ap-northeast-2"
}
# 2) Resource: the actual infrastructure to create
resource "aws_ecr_repository" "blog_api" {
  name                 = "blog-api"
  image_tag_mutability = "MUTABLE"
  image_scanning_configuration {
    scan_on_push = true
  }
}
# 3) Data: look up an existing resource
data "aws_caller_identity" "current" {}
# 4) Variable: external input
variable "environment" {
  type    = string
  default = "dev"
}
# 5) Output: expose the result you created
output "ecr_url" {
  value = aws_ecr_repository.blog_api.repository_url
}
When the five components gather in one file, they become one unit of infrastructure.
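To see how the pieces reference one another, here is a minimal sketch. The "worker" repository, its naming scheme, and the tag key are hypothetical and exist only for illustration. The locals block is an extra construct, not one of the five components. Read the sketch on its own rather than pasting it next to the block above, since the variable and data source would then be declared twice.

```hcl
# Sketch only: the "worker" repository and its naming scheme are hypothetical.
variable "environment" {
  type    = string
  default = "dev"
}

data "aws_caller_identity" "current" {}

# locals are not one of the five components; used here to avoid repeating the prefix
locals {
  name_prefix = "blog-${var.environment}"
}

resource "aws_ecr_repository" "worker" {
  name = "${local.name_prefix}-worker" # the variable feeds the resource name

  tags = {
    Environment = var.environment
  }
}

output "account_id" {
  value = data.aws_caller_identity.current.account_id # a data source feeds an output
}

output "worker_repo_url" {
  value = aws_ecr_repository.worker.repository_url # a resource attribute feeds an output
}
```

Running terraform plan -var environment=prod against this sketch would change only the derived name and tag, which is the point: inputs flow in through variables, results flow out through outputs.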
The 4-step workflow #
terraform init # download providers, initialize backend
terraform plan # preview what gets created/changed/deleted
terraform apply # apply
terraform destroy # delete
The output of plan is Terraform’s greatest value. It stops incidents before the code merge.
Terraform will perform the following actions:
  # aws_security_group.fargate will be created
  + resource "aws_security_group" "fargate" {
      + arn     = (known after apply)
      + name    = "sg-fargate"
      + ingress = [
          + {
              + from_port = 8000
              + to_port   = 8000
              + protocol  = "tcp"
              + ...
            },
        ]
    }
Plan: 1 to add, 0 to change, 0 to destroy.
The symbols: + add / ~ change / - delete / -/+ recreate. Always look twice at -/+, because a recreated resource comes back with a new ID.
2) State — the real heart #
Terraform stores “the state of the infrastructure built so far” in state (a .tfstate file). This file is what lets the next plan compute the difference.
the actual AWS infrastructure ←────── Terraform code
│
▼
state (the result of the last apply)
Terraform looks at the three-way consistency of code ↔ state ↔ AWS and then drafts the change plan.
What happens when state breaks #
| Situation | Result |
|---|---|
| state lost | Terraform recognizes “nothing ever built” → tries to create resources that already exist |
| two people apply at once | state breaks, or one side overwrites the other’s changes |
| state file in git in plaintext | password / key exposure (many resources have secrets in state) |
So local .tfstate is for learning only. For production, a remote backend is mandatory.
S3 + DynamoDB Backend #
The most common production pattern.
terraform {
  required_version = ">= 1.7"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
  backend "s3" {
    bucket         = "myorg-terraform-state"
    key            = "blog-api/prod/terraform.tfstate"
    region         = "ap-northeast-2"
    dynamodb_table = "terraform-state-lock"
    encrypt        = true
  }
}
The setup laid out:
| Component | Role |
|---|---|
| S3 bucket | stores the state file (versioning + encryption enabled) |
| DynamoDB table | blocks concurrent applies (the lock table) |
| the bucket key prefix | separates environments with the <project>/<env>/terraform.tfstate pattern |
| encrypt = true | server-side encryption of the state object (S3-managed keys unless you also set kms_key_id) |
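Recent releases can also keep the lock in S3 itself, without a DynamoDB table. The sketch below assumes Terraform 1.10 or later (OpenTofu appears to have gained a similar option in its own 1.10 release); check the S3 backend documentation for your version before relying on it. The rest of this chapter stays with the DynamoDB table.

```hcl
terraform {
  required_version = ">= 1.10" # assumption: use_lockfile is not available before this

  backend "s3" {
    bucket       = "myorg-terraform-state"
    key          = "blog-api/prod/terraform.tfstate"
    region       = "ap-northeast-2"
    encrypt      = true
    use_lockfile = true # lock through a lock object written next to the state file
  }
}
```

The trade-off is one less resource to bootstrap, in exchange for requiring a newer CLI on every machine and CI runner that touches this state.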
The one-time bootstrap to set up the backend #
The S3 bucket and the DynamoDB table themselves have to be created by someone first. It’s a chicken-and-egg problem. There are two ways out.
- Create once manually via console / CLI (this chapter’s assumption)
- Create in a separate “bootstrap” folder with a local backend, then migrate the backend to S3
aws s3api create-bucket \
  --bucket myorg-terraform-state \
  --region ap-northeast-2 \
  --create-bucket-configuration LocationConstraint=ap-northeast-2
aws s3api put-bucket-versioning \
  --bucket myorg-terraform-state \
  --versioning-configuration Status=Enabled
aws dynamodb create-table \
  --table-name terraform-state-lock \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST \
  --region ap-northeast-2
Never destroy these two resources with Terraform. The state lives inside them.
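The second flow, a separate bootstrap folder that starts on a local backend, could look roughly like this. It is a sketch under assumptions: the resource labels, the KMS-based encryption choice, and the public access block are additions for illustration, not something the CLI commands above set up.

```hcl
# bootstrap/main.tf (sketch): no backend block, so state starts out local.
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "ap-northeast-2"
}

resource "aws_s3_bucket" "state" {
  bucket = "myorg-terraform-state"

  lifecycle {
    prevent_destroy = true # the state of every environment lives here
  }
}

resource "aws_s3_bucket_versioning" "state" {
  bucket = aws_s3_bucket.state.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Assumption: default encryption with the AWS-managed KMS key
resource "aws_s3_bucket_server_side_encryption_configuration" "state" {
  bucket = aws_s3_bucket.state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

# Assumption: state should never be publicly readable
resource "aws_s3_bucket_public_access_block" "state" {
  bucket                  = aws_s3_bucket.state.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource "aws_dynamodb_table" "lock" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```

After the first apply, you can add an S3 backend block to this folder and run terraform init -migrate-state so that even the bootstrap state ends up in the bucket it created.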
3) Directory structure — separation by environment #
infra/
├─ modules/
│ ├─ network/ ← VPC, Subnets, SGs
│ ├─ ecs-service/ ← ALB + Service + Auto Scaling
│ └─ rds/ ← DB
├─ envs/
│ ├─ dev/
│ │ ├─ main.tf
│ │ ├─ backend.tf
│ │ ├─ variables.tf
│ │ └─ terraform.tfvars
│ └─ prod/
│ ├─ main.tf
│ ├─ backend.tf
│ ├─ variables.tf
│ └─ terraform.tfvars
└─ bootstrap/ ← S3 / DynamoDB (once only)
Separate state by giving each environment a different backend key.
terraform {
  backend "s3" {
    bucket         = "myorg-terraform-state"
    key            = "blog-api/dev/terraform.tfstate"
    region         = "ap-northeast-2"
    dynamodb_table = "terraform-state-lock"
  }
}
This fully separates dev and prod. A dev apply will never touch prod state.
4) Modules — units of reuse #
To avoid repeating the same infrastructure pattern across dev / prod, bundle it into a module.
variable "name" { type = string }
variable "cluster_arn" { type = string }
variable "image" { type = string }
variable "vpc_id" { type = string }
variable "subnet_ids" { type = list(string) }
variable "alb_sg_id" { type = string }
# A block with more than one argument has to span multiple lines in HCL
variable "desired_count" {
  type    = number
  default = 2
}
variable "cpu" {
  type    = string
  default = "512"
}
variable "memory" {
  type    = string
  default = "1024"
}
variable "container_port" {
  type    = number
  default = 8000
}
resource "aws_security_group" "fargate" {
  name        = "sg-${var.name}-fargate"
  description = "Fargate task SG"
  vpc_id      = var.vpc_id
  # Only the ALB may reach the container port
  ingress {
    from_port       = var.container_port
    to_port         = var.container_port
    protocol        = "tcp"
    security_groups = [var.alb_sg_id]
  }
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
resource "aws_lb_target_group" "this" {
  name        = "tg-${var.name}"
  port        = var.container_port
  protocol    = "HTTP"
  target_type = "ip"
  vpc_id      = var.vpc_id
  health_check {
    path              = "/health"
    healthy_threshold = 2
    interval          = 15
  }
}
# aws_iam_role.execution, aws_iam_role.task, aws_cloudwatch_log_group.this and
# data.aws_region.current are assumed to be declared elsewhere in this module.
resource "aws_ecs_task_definition" "this" {
  family                   = var.name
  network_mode             = "awsvpc"
  requires_compatibilities = ["FARGATE"]
  cpu                      = var.cpu
  memory                   = var.memory
  execution_role_arn       = aws_iam_role.execution.arn
  task_role_arn            = aws_iam_role.task.arn
  container_definitions = jsonencode([{
    name         = "api"
    image        = var.image
    portMappings = [{ containerPort = var.container_port, protocol = "tcp" }]
    logConfiguration = {
      logDriver = "awslogs"
      options = {
        "awslogs-group"         = aws_cloudwatch_log_group.this.name
        "awslogs-region"        = data.aws_region.current.name
        "awslogs-stream-prefix" = "api"
      }
    }
  }])
}
resource "aws_ecs_service" "this" {
  name            = var.name
  cluster         = var.cluster_arn
  task_definition = aws_ecs_task_definition.this.arn
  desired_count   = var.desired_count
  launch_type     = "FARGATE"
  network_configuration {
    subnets          = var.subnet_ids
    security_groups  = [aws_security_group.fargate.id]
    assign_public_ip = true # only useful in public subnets; set false for private subnets behind NAT
  }
  load_balancer {
    target_group_arn = aws_lb_target_group.this.arn
    container_name   = "api"
    container_port   = var.container_port
  }
  deployment_circuit_breaker {
    enable   = true
    rollback = true
  }
}
output "target_group_arn" { value = aws_lb_target_group.this.arn }
output "service_name" { value = aws_ecs_service.this.name }
output "fargate_sg_id" { value = aws_security_group.fargate.id } # consumed by the rds module
Chapter 22’s console work is now gathered into one file.
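The module references two IAM roles, a log group, and a region data source without showing them. A minimal sketch of those pieces follows. The role names, the log group name, and the retention period are assumptions chosen for illustration; the task role is left without policies because what the application needs is not specified here.

```hcl
# Supporting declarations for the ecs-service module (sketch; names are hypothetical).
data "aws_region" "current" {}

resource "aws_cloudwatch_log_group" "this" {
  name              = "/ecs/${var.name}"
  retention_in_days = 30 # assumption: pick what your retention policy requires
}

# Trust policy shared by both roles: ECS tasks may assume them
data "aws_iam_policy_document" "ecs_tasks_assume" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["ecs-tasks.amazonaws.com"]
    }
  }
}

# Execution role: used by ECS itself to pull the image and write logs
resource "aws_iam_role" "execution" {
  name               = "${var.name}-execution"
  assume_role_policy = data.aws_iam_policy_document.ecs_tasks_assume.json
}

resource "aws_iam_role_policy_attachment" "execution" {
  role       = aws_iam_role.execution.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}

# Task role: what the application code is allowed to call; attach policies as needed
resource "aws_iam_role" "task" {
  name               = "${var.name}-task"
  assume_role_policy = data.aws_iam_policy_document.ecs_tasks_assume.json
}
```

Keeping the two roles separate means that granting the application access to, say, an S3 bucket never widens what the ECS agent itself can do.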
Using the module #
module "network" {
  source = "../../modules/network"
  name   = "blog-prod"
  cidr   = "10.0.0.0/16"
  azs    = ["ap-northeast-2a", "ap-northeast-2c"]
}
module "rds" {
  source              = "../../modules/rds"
  name                = "blog-prod"
  vpc_id              = module.network.vpc_id
  db_subnet_ids       = module.network.db_subnet_ids
  fargate_sg_id       = module.api.fargate_sg_id
  multi_az            = true
  instance_class      = "db.t4g.small"
  deletion_protection = true
}
module "api" {
  source        = "../../modules/ecs-service"
  name          = "blog-prod"
  cluster_arn   = aws_ecs_cluster.blog.arn
  image         = var.image # injected by CI
  vpc_id        = module.network.vpc_id
  subnet_ids    = module.network.private_subnet_ids
  alb_sg_id     = module.network.alb_sg_id
  desired_count = 4
  cpu           = "1024"
  memory        = "2048"
}
The dev environment is a lighter shape with desired_count = 1, multi_az = false, instance_class = "db.t4g.micro". The key is same module + different variables.
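For comparison, the dev side might look like the sketch below. The CIDR block, the cluster resource, and deletion_protection = false are assumptions for illustration; only desired_count, multi_az, and instance_class come from the description above. The module outputs used here, including fargate_sg_id, are assumed to be exported by the modules.

```hcl
# envs/dev/main.tf (sketch): same modules as prod, lighter values.
variable "image" {
  type = string # injected by CI
}

resource "aws_ecs_cluster" "blog" {
  name = "blog-dev" # hypothetical cluster name
}

module "network" {
  source = "../../modules/network"
  name   = "blog-dev"
  cidr   = "10.1.0.0/16" # hypothetical, chosen not to overlap with prod
  azs    = ["ap-northeast-2a", "ap-northeast-2c"]
}

module "rds" {
  source              = "../../modules/rds"
  name                = "blog-dev"
  vpc_id              = module.network.vpc_id
  db_subnet_ids       = module.network.db_subnet_ids
  fargate_sg_id       = module.api.fargate_sg_id
  multi_az            = false
  instance_class      = "db.t4g.micro"
  deletion_protection = false # assumption: dev databases are disposable
}

module "api" {
  source        = "../../modules/ecs-service"
  name          = "blog-dev"
  cluster_arn   = aws_ecs_cluster.blog.arn
  image         = var.image
  vpc_id        = module.network.vpc_id
  subnet_ids    = module.network.private_subnet_ids
  alb_sg_id     = module.network.alb_sg_id
  desired_count = 1
  # cpu and memory fall back to the module defaults ("512" / "1024")
}
```

Because the module code is shared, a diff between the dev and prod main.tf files shows every intentional difference between the two environments.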
5) Terraform ↔ CI/CD integration #
This section covers how to tie Terraform into the GitHub Actions workflows from Chapter 24 (CI/CD).
Two flows #
| Flow | Description |
|---|---|
| A. Separate infra / app | infra changes via a separate PR + apply, app deploy only updates the image |
| B. Bound in one workflow | image build → terraform apply puts the new image on the service |
At first, A is recommended. Infra changes are heavy, app deploys are frequent. The two flows carry different risk levels.
Plan as a PR comment #
name: Terraform Plan
on:
  pull_request:
    paths: ['infra/**']
permissions:
  id-token: write
  contents: read
  pull-requests: write
jobs:
  plan:
    runs-on: ubuntu-latest
    defaults:
      run: { working-directory: infra/envs/prod }
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/terraform-plan
          aws-region: ap-northeast-2
      - uses: hashicorp/setup-terraform@v3
        with: { terraform_version: "1.9.0" }
      - run: terraform init
      - run: terraform plan -no-color -out=tfplan
      - name: Comment Plan
        uses: actions/github-script@v7
        with:
          script: |
            // defaults.run does not apply to this step, so pass the directory explicitly
            const out = require('child_process')
              .execSync('terraform show -no-color tfplan', { cwd: 'infra/envs/prod' })
              .toString();
            // await so the step fails if the comment cannot be posted
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: '```\n' + out + '\n```'
            });
At the PR review stage, checking what’s changing in one place is the most effective way to stop production incidents before the code merge.
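The terraform-plan role only needs read-only permissions on the infrastructure itself, plus read access to the state bucket and write access to the lock table, because plan still takes the state lock. Keep the apply permission in a separate role.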
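One way to define such a plan role is sketched below. It assumes the GitHub OIDC provider is already registered in the account, uses a hypothetical repository name (myorg/blog-api), and leans on the broad AWS-managed ReadOnlyAccess policy for brevity. Since plan also acquires the state lock, the sketch adds item-level access to the lock table.

```hcl
# Sketch of a plan-only role for GitHub Actions (repository name is hypothetical).
# Assumption: the GitHub OIDC provider already exists in this account.
data "aws_iam_openid_connect_provider" "github" {
  url = "https://token.actions.githubusercontent.com"
}

data "aws_iam_policy_document" "plan_assume" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [data.aws_iam_openid_connect_provider.github.arn]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    # Only workflows from this repository may assume the role
    condition {
      test     = "StringLike"
      variable = "token.actions.githubusercontent.com:sub"
      values   = ["repo:myorg/blog-api:*"]
    }
  }
}

resource "aws_iam_role" "terraform_plan" {
  name               = "terraform-plan"
  assume_role_policy = data.aws_iam_policy_document.plan_assume.json
}

# Read-only on infrastructure (also covers reading the state object in S3)
resource "aws_iam_role_policy_attachment" "plan_read_only" {
  role       = aws_iam_role.terraform_plan.name
  policy_arn = "arn:aws:iam::aws:policy/ReadOnlyAccess"
}

# plan takes and releases the state lock, which needs write access to the lock table
data "aws_iam_policy_document" "plan_lock" {
  statement {
    actions   = ["dynamodb:GetItem", "dynamodb:PutItem", "dynamodb:DeleteItem"]
    resources = ["arn:aws:dynamodb:ap-northeast-2:123456789012:table/terraform-state-lock"]
  }
}

resource "aws_iam_role_policy" "plan_lock" {
  name   = "state-lock"
  role   = aws_iam_role.terraform_plan.id
  policy = data.aws_iam_policy_document.plan_lock.json
}
```

ReadOnlyAccess can still read the state file, and state may contain secrets, so a tighter hand-written policy is worth considering once the workflow is stable.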
6) Drift tracking #
Resources you change by hand in the console diverge from state (drift). terraform plan surfaces the difference as a proposed change that would bring the resource back in line with the code.
terraform plan -detailed-exitcode
# exit 0 = no difference
# exit 1 = error
# exit 2 = difference exists (not a failure)
Run it once a day in CI and, on exit 2, notify via Slack.
Pitfalls of Terraform operations #
1) State lock won’t release #
If an apply is killed before it can clean up (a forced second Ctrl-C, a cancelled CI job), the DynamoDB lock stays behind. The next plan or apply fails with “Error acquiring the state lock.”
terraform force-unlock <LOCK_ID>
The lock ID is shown in the error message. Always confirm that nobody else is actually running Terraform before you do this.
2) Editing state by hand #
Opening .tfstate in vim to edit it almost always ends in regret. Use the state commands instead.
terraform state list # list resources
terraform state show aws_ecr_repository.x # detail one resource
terraform state rm aws_ecr_repository.x # remove from state (doesn't delete the actual resource)
terraform state mv module.a.x module.b.x # move a resource
terraform import aws_ecr_repository.x my-repo # register an existing resource into state
3) Passwords in state in plaintext #
aws_db_instance’s password and aws_secretsmanager_secret_version’s secret_string go into state in plaintext. State bucket encryption + access restriction are mandatory.
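# Bucket policy document: refuse any request that does not use TLS
data "aws_iam_policy_document" "state_bucket" {
  statement {
    effect  = "Deny"
    actions = ["s3:*"]
    resources = [
      "arn:aws:s3:::myorg-terraform-state",
      "arn:aws:s3:::myorg-terraform-state/*",
    ]
    # A bucket policy statement needs a principal; "*" applies the deny to everyone
    principals {
      type        = "*"
      identifiers = ["*"]
    }
    condition {
      test     = "Bool"
      variable = "aws:SecureTransport"
      values   = ["false"]
    }
  }
}
4) -/+ destroy/create #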
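If you see -/+ in a plan, the resource is destroyed and created again, so its ID changes. For RDS, that means data loss. Read this part of the plan closely.
# aws_db_instance.blog must be replaced
-/+ resource "aws_db_instance" "blog" {
      ~ storage_encrypted = false -> true # forces replacement
    }
Handle a change like this through a separate migration procedure (snapshot and restore, for example) instead of letting apply replace the instance. An engine version bump, by contrast, is normally applied by RDS as an in-place upgrade rather than a replacement.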
5) Not pinning the provider version #
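If you leave version unspecified in required_providers, a later init can pull in a new major version of the provider and break the configuration. Always pin with a constraint like ~> 5.0, and commit .terraform.lock.hcl so that everyone resolves the same provider build.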
6) terraform destroy incident #
Sooner or later someone runs destroy against the prod environment by accident. Put a guard in place.
resource "aws_db_instance" "blog" {
  # ...
  lifecycle {
    prevent_destroy = true
  }
}
With prevent_destroy = true, any plan that would destroy or replace the resource fails with an error at the plan stage.
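prevent_destroy is a check inside Terraform: it only helps while the lifecycle block is still in the code. RDS has its own guard, deletion_protection, which makes AWS itself reject the delete call no matter who issues it. A sketch that combines the two follows. The identifier, sizing, user name, and snapshot name are hypothetical, and manage_master_user_password is an optional extra that lets RDS keep the password in Secrets Manager instead of in Terraform state.

```hcl
# Sketch: two independent guards on the same database (values are hypothetical).
resource "aws_db_instance" "blog" {
  identifier        = "blog-prod"
  engine            = "postgres"
  instance_class    = "db.t4g.small"
  allocated_storage = 20
  username          = "blog"

  # RDS generates and stores the password in Secrets Manager,
  # so no password value is written into the configuration
  manage_master_user_password = true

  # AWS-side guard: the delete API call is rejected while this is true
  deletion_protection = true

  # If the instance is ever deleted on purpose, keep a final snapshot
  skip_final_snapshot       = false
  final_snapshot_identifier = "blog-prod-final"

  # Terraform-side guard: a plan containing destroy or replace fails
  lifecycle {
    prevent_destroy = true
  }
}
```

To delete this database deliberately you would have to remove the lifecycle guard and turn off deletion_protection in a first apply, then destroy in a second, which is exactly the kind of friction you want on production data.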
Exercises #
- Write out, without looking, the four pains of a console-only operation (§“Why IaC”), and connect, in one sentence each, which Terraform feature (plan / git history / PR review / state) solves each pain.
- From the §“What happens when state breaks” table, lay out three reasons local .tfstate is dangerous in production, and pair which risk the S3 bucket and which the DynamoDB lock table each prevent.
- Explain in one paragraph, in connection with the data from Chapter 23 (RDS integration), why you must stop and look closely when you see -/+ in terraform plan output. Also write out, distinguishing them, what prevent_destroy and deletion_protection each prevent.
In short: IaC turns infrastructure into declarative code, solving reproducibility, traceability, review, and safe deletion at once. Terraform is built from the five elements provider / resource / data / variable / output and cycles through init → plan → apply → destroy. The heart is state, and for production an S3 + DynamoDB backend is mandatory. Separate dev and prod by giving the same module different variables, surface the plan as a PR comment to stop incidents before the merge, and guard against -/+ recreation and terraform destroy with lifecycle.
Next chapter #
The infrastructure is now code and deployment is automated. Now it’s time to look seriously at whether it’s running, and whether it’s running well. Chapter 26, on monitoring with CloudWatch alarms and X-Ray, covers the core metrics of ECS / RDS / ALB, operational Logs Insights queries, the flow of sending alarms to Slack, and capturing “why did only this request take 5 seconds?” with X-Ray distributed tracing.