AWS in Practice #4: IaC — Terraform Fundamentals
The infrastructure built in #1 through #3 is still managed by hand through the console / CLI. If we had to recreate it from scratch — from memory? from notes? — the results would be inconsistent. This post moves that to Terraform.
What we’ll cover:
- Why IaC — repeatability / code review / drift tracking
- The shape of Terraform — provider, resource, data, variable, output, state
- State is the real core — S3 + DynamoDB lock backend
- Modules — reusable units, environment branching
- Code-ifying #1’s ECS infrastructure line by line
Why IaC #
Four pains of the console-only approach:
- Can’t reproduce: “spin up staging just like prod”? Human memory always leaves subtle differences
- Can’t track changes: “who changed the SG last week?” means digging through CloudTrail. With code, it’s git log
- Can’t review: a one-line SG inbound change to the production cluster never gets peer eyes
- Delete / recreate fear: get one resource wrong and you’re afraid to fix it
IaC (Infrastructure as Code) expresses infrastructure as declarative code, addressing all four problems at once.
| Tool | Role |
|---|---|
| Terraform | Multi-cloud, the most standard. Star of this post |
| Pulumi | Written in TypeScript / Python / Go. Strong for dynamic logic |
| AWS CDK | TypeScript / Python → transpiles to CloudFormation |
| CloudFormation | AWS-native YAML/JSON. Weak in dynamic expression |
| OpenTofu | OSS fork of Terraform (after license dispute) |
This series standardizes on Terraform. If your company uses OpenTofu by policy, the core syntax shown here is the same.
1) Terraform’s five blocks #
# 1) Provider — how to talk to AWS
provider "aws" {
region = "ap-northeast-2"
}
# 2) Resource — actual infrastructure to create
resource "aws_ecr_repository" "blog_api" {
name = "blog-api"
image_tag_mutability = "MUTABLE"
image_scanning_configuration {
scan_on_push = true
}
}
# 3) Data — query existing resources
data "aws_caller_identity" "current" {}
# 4) Variable — external input
variable "environment" {
type = string
default = "dev"
}
# 5) Output — expose results
output "ecr_url" {
value = aws_ecr_repository.blog_api.repository_url
}
Five blocks in one place make a unit of infrastructure.
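The blocks matter mostly in how they reference each other. The sketch below is one way to wire them together, assuming the provider block above is already in place. The `locals` block and the names `tagged`, `common_tags`, and `tagged_ecr_url` are illustrative additions, not part of the original example.

```hcl
# Sketch only: "tagged", "common_tags" and "tagged_ecr_url" are hypothetical names.
variable "environment" {
  type    = string
  default = "dev"
}

data "aws_caller_identity" "current" {}

locals {
  # Combine an input variable and a data source into one derived value.
  common_tags = {
    Environment = var.environment
    AccountId   = data.aws_caller_identity.current.account_id
  }
}

resource "aws_ecr_repository" "tagged" {
  # String interpolation: evaluates to "blog-api-dev" with the default variable.
  name = "blog-api-${var.environment}"
  tags = local.common_tags
}

output "tagged_ecr_url" {
  # References an attribute that is only known after apply.
  value = aws_ecr_repository.tagged.repository_url
}
```

Terraform builds its dependency graph from these references, so the order of blocks in the file does not matter.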
The 4-step workflow #
terraform init # download providers, init backend
terraform plan # preview what gets created/changed/destroyed
terraform apply # apply
terraform destroy # delete
The output of plan is Terraform’s biggest value. It catches incidents before code merge.
Terraform will perform the following actions:
# aws_security_group.fargate will be created
+ resource "aws_security_group" "fargate" {
+ arn = (known after apply)
+ name = "sg-fargate"
+ ingress = [
+ {
+ from_port = 8000
+ to_port = 8000
+ protocol = "tcp"
+ ...
},
]
}
Plan: 1 to add, 0 to change, 0 to destroy.
Symbols: + add, ~ change, - destroy, -/+ replace (destroy and recreate; the resource gets a new ID, so always look closely at these).
2) State — the real core #
Terraform stores “the state of the infrastructure built so far” in state (the .tfstate file). This file has to exist for the next plan to compute the diff.
Actual AWS infrastructure ←────── Terraform code
│
▼
state (last apply's result)
Terraform compares three things (code, state, and the real AWS resources) and plans changes from the differences between them.
What happens if state breaks #
| Situation | Result |
|---|---|
| State lost | Terraform thinks “nothing was created” → tries to recreate existing resources |
| Two people apply simultaneously | State breaks or one overwrites the other’s changes |
| State file as plaintext in git | Password / key exposure (state contains secrets in many resources) |
Local .tfstate is for learning only. Production needs a remote backend.
S3 + DynamoDB Backend #
The most common production pattern.
terraform {
required_version = ">= 1.7"
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
}
}
backend "s3" {
bucket = "myorg-terraform-state"
key = "blog-api/prod/terraform.tfstate"
region = "ap-northeast-2"
dynamodb_table = "terraform-state-lock"
encrypt = true
}
}
Roles:
| Component | Role |
|---|---|
| S3 bucket | State file storage (versioning + encryption enabled) |
| DynamoDB table | Lock table that blocks concurrent applies |
| key | <project>/<env>/terraform.tfstate pattern for env separation |
| encrypt = true | Server-side encryption of the state object (S3-managed keys by default; KMS when kms_key_id is set) |
One-time bootstrap to set up the backend #
S3 and DynamoDB themselves need to exist first. A classic chicken-and-egg problem. Two approaches:
- Manual creation via console / CLI once (this post’s assumption)
- Create with local backend in a separate “bootstrap” folder, then migrate backend to S3
aws s3api create-bucket \
--bucket myorg-terraform-state \
--region ap-northeast-2 \
--create-bucket-configuration LocationConstraint=ap-northeast-2
aws s3api put-bucket-versioning \
--bucket myorg-terraform-state \
--versioning-configuration Status=Enabled
aws dynamodb create-table \
--table-name terraform-state-lock \
--attribute-definitions AttributeName=LockID,AttributeType=S \
--key-schema AttributeName=LockID,KeyType=HASH \
--billing-mode PAY_PER_REQUEST \
--region ap-northeast-2
Never destroy these two via Terraform. Your state lives inside.
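For the second approach (a separate bootstrap folder on the local backend), a minimal sketch might look like the following. It is an alternative to the CLI commands, not something to run in addition to them. The file location `bootstrap/main.tf`, the resource labels, and the choice of `aws:kms` are assumptions; the bucket and table names are the ones used in this post.

```hcl
# bootstrap/main.tf (hypothetical location). No backend block: state stays local here.
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "ap-northeast-2"
}

resource "aws_s3_bucket" "state" {
  bucket = "myorg-terraform-state"

  lifecycle {
    prevent_destroy = true # the state of every other stack lives here
  }
}

resource "aws_s3_bucket_versioning" "state" {
  bucket = aws_s3_bucket.state.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "state" {
  bucket = aws_s3_bucket.state.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms" # assumption: AWS-managed KMS key
    }
  }
}

resource "aws_s3_bucket_public_access_block" "state" {
  bucket                  = aws_s3_bucket.state.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource "aws_dynamodb_table" "lock" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID" # the attribute name the S3 backend expects

  attribute {
    name = "LockID"
    type = "S"
  }

  lifecycle {
    prevent_destroy = true
  }
}
```

The bootstrap stack's own local state file is small and rarely changes, so committing it or migrating it into the new bucket afterwards are both workable choices.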
3) Directory structure — environment separation #
infra/
├─ modules/
│ ├─ network/ ← VPC, Subnets, SGs
│ ├─ ecs-service/ ← ALB + Service + Auto Scaling
│ └─ rds/ ← DB
├─ envs/
│ ├─ dev/
│ │ ├─ main.tf
│ │ ├─ backend.tf
│ │ ├─ variables.tf
│ │ └─ terraform.tfvars
│ └─ prod/
│ ├─ main.tf
│ ├─ backend.tf
│ ├─ variables.tf
│ └─ terraform.tfvars
└─ bootstrap/ ← S3 / DynamoDB (one-time)
Use different backend keys per environment to keep state separate:
terraform {
  backend "s3" {
    bucket         = "myorg-terraform-state"
    key            = "blog-api/dev/terraform.tfstate"
    region         = "ap-northeast-2"
    dynamodb_table = "terraform-state-lock"
    encrypt        = true
  }
}
This separates dev and prod state. As long as each environment is applied from its own directory, a dev apply reads and writes only the dev state file.
4) Modules — units of reuse #
Don’t repeat the same infrastructure pattern in dev / prod.
variable "name" { type = string }
variable "cluster_arn" { type = string }
variable "image" { type = string }
variable "vpc_id" { type = string }
variable "subnet_ids" { type = list(string) }
variable "alb_sg_id" { type = string }
variable "desired_count" {
  type    = number
  default = 2
}
variable "cpu" {
  type    = string
  default = "512"
}
variable "memory" {
  type    = string
  default = "1024"
}
variable "container_port" {
  type    = number
  default = 8000
}
resource "aws_security_group" "fargate" {
name = "sg-${var.name}-fargate"
description = "Fargate task SG"
vpc_id = var.vpc_id
ingress {
from_port = var.container_port
to_port = var.container_port
protocol = "tcp"
security_groups = [var.alb_sg_id]
}
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
}
resource "aws_lb_target_group" "this" {
name = "tg-${var.name}"
port = var.container_port
protocol = "HTTP"
target_type = "ip"
vpc_id = var.vpc_id
health_check {
path = "/health"
healthy_threshold = 2
interval = 15
}
}
resource "aws_ecs_task_definition" "this" {
family = var.name
network_mode = "awsvpc"
requires_compatibilities = ["FARGATE"]
cpu = var.cpu
memory = var.memory
execution_role_arn = aws_iam_role.execution.arn
task_role_arn = aws_iam_role.task.arn
container_definitions = jsonencode([{
name = "api"
image = var.image
portMappings = [{ containerPort = var.container_port, protocol = "tcp" }]
logConfiguration = {
logDriver = "awslogs"
options = {
"awslogs-group" = aws_cloudwatch_log_group.this.name
"awslogs-region" = data.aws_region.current.name
"awslogs-stream-prefix" = "api"
}
}
}])
}
resource "aws_ecs_service" "this" {
name = var.name
cluster = var.cluster_arn
task_definition = aws_ecs_task_definition.this.arn
desired_count = var.desired_count
launch_type = "FARGATE"
network_configuration {
subnets = var.subnet_ids
security_groups = [aws_security_group.fargate.id]
assign_public_ip = true
}
load_balancer {
target_group_arn = aws_lb_target_group.this.arn
container_name = "api"
container_port = var.container_port
}
deployment_circuit_breaker {
enable = true
rollback = true
}
}
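output "target_group_arn" { value = aws_lb_target_group.this.arn }
output "service_name" { value = aws_ecs_service.this.name }
# Consumed by the rds module as module.api.fargate_sg_id
output "fargate_sg_id" { value = aws_security_group.fargate.id }
The console work from #1 now lives in this single file (the IAM roles, log group, and region data source it references are omitted here).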
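The module references `aws_iam_role.execution`, `aws_iam_role.task`, `aws_cloudwatch_log_group.this`, and `data.aws_region.current` without defining them. A minimal sketch of those pieces, meant to sit in the same module so that `var.name` resolves to the module variable, could look like this. The role names, log group path, and retention period are assumptions.

```hcl
# Sketch of the supporting resources the module refers to.
# Names, log group path and retention are illustrative assumptions.
data "aws_region" "current" {}

resource "aws_cloudwatch_log_group" "this" {
  name              = "/ecs/${var.name}"
  retention_in_days = 30
}

# Trust policy shared by both roles: ECS tasks may assume them.
data "aws_iam_policy_document" "ecs_tasks_assume" {
  statement {
    actions = ["sts:AssumeRole"]
    principals {
      type        = "Service"
      identifiers = ["ecs-tasks.amazonaws.com"]
    }
  }
}

# Execution role: used by ECS itself to pull the image and write logs.
resource "aws_iam_role" "execution" {
  name               = "${var.name}-execution"
  assume_role_policy = data.aws_iam_policy_document.ecs_tasks_assume.json
}

resource "aws_iam_role_policy_attachment" "execution" {
  role       = aws_iam_role.execution.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}

# Task role: used by the application code. It starts with no permissions;
# attach only what the API actually needs.
resource "aws_iam_role" "task" {
  name               = "${var.name}-task"
  assume_role_policy = data.aws_iam_policy_document.ecs_tasks_assume.json
}
```

Keeping the execution role and the task role separate means the application never inherits the permissions ECS needs for its own plumbing.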
Using the module #
module "network" {
source = "../../modules/network"
name = "blog-prod"
cidr = "10.0.0.0/16"
azs = ["ap-northeast-2a", "ap-northeast-2c"]
}
module "rds" {
source = "../../modules/rds"
name = "blog-prod"
vpc_id = module.network.vpc_id
db_subnet_ids = module.network.db_subnet_ids
fargate_sg_id = module.api.fargate_sg_id
multi_az = true
instance_class = "db.t4g.small"
deletion_protection = true
}
module "api" {
source = "../../modules/ecs-service"
name = "blog-prod"
cluster_arn = aws_ecs_cluster.blog.arn
image = var.image # injected by CI
vpc_id = module.network.vpc_id
subnet_ids = module.network.private_subnet_ids
alb_sg_id = module.network.alb_sg_id
desired_count = 4
cpu = "1024"
memory = "2048"
}
The dev environment uses a lighter configuration: desired_count = 1, multi_az = false, instance_class = "db.t4g.micro". Same module, different variables is the key.
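As a rough illustration, the dev counterpart might look like the sketch below. The module inputs mirror the prod example; the `blog-dev` name, the `10.1.0.0/16` CIDR, and `deletion_protection = false` are assumptions, and `aws_ecs_cluster.blog` and `var.image` are expected to be defined in this directory just as in prod.

```hcl
# envs/dev/main.tf (sketch). Values marked "assumption" are not from the original post.
module "network" {
  source = "../../modules/network"
  name   = "blog-dev"
  cidr   = "10.1.0.0/16" # assumption: a range that does not overlap prod
  azs    = ["ap-northeast-2a", "ap-northeast-2c"]
}

module "rds" {
  source              = "../../modules/rds"
  name                = "blog-dev"
  vpc_id              = module.network.vpc_id
  db_subnet_ids       = module.network.db_subnet_ids
  fargate_sg_id       = module.api.fargate_sg_id
  multi_az            = false
  instance_class      = "db.t4g.micro"
  deletion_protection = false # assumption: dev databases are disposable
}

module "api" {
  source        = "../../modules/ecs-service"
  name          = "blog-dev"
  cluster_arn   = aws_ecs_cluster.blog.arn
  image         = var.image
  vpc_id        = module.network.vpc_id
  subnet_ids    = module.network.private_subnet_ids
  alb_sg_id     = module.network.alb_sg_id
  desired_count = 1
  # cpu and memory are omitted, so the module defaults ("512" / "1024") apply.
}
```

Only the values differ between environments, so a diff between the two main.tf files shows how dev and prod diverge.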
5) Terraform ↔ CI/CD integration #
How to bundle with GitHub Actions from #3.
Two flows #
| Flow | Description |
|---|---|
| A. Separate infra / app | Infra changes via separate PR + apply, app deploy just updates the image |
| B. One bundled workflow | Image build → terraform apply puts new image into service |
A is recommended to start. Infrastructure changes are infrequent and high-risk; app deploys are frequent and lower-risk. Keeping them separate reflects that difference.
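One consequence of flow A: the app pipeline registers new task definition revisions outside Terraform, so the next plan would try to roll the service back to the revision Terraform knows about. A common way to handle this, sketched here against the module's service resource, is to tell Terraform to ignore that attribute. Treat this as one option rather than the required setup.

```hcl
# Sketch: the same service as in the module, with one lifecycle rule added.
resource "aws_ecs_service" "this" {
  name            = var.name
  cluster         = var.cluster_arn
  task_definition = aws_ecs_task_definition.this.arn
  desired_count   = var.desired_count
  launch_type     = "FARGATE"

  # network_configuration, load_balancer and deployment_circuit_breaker
  # blocks stay exactly as in the module.

  lifecycle {
    # The app pipeline owns the running revision; Terraform sets only the
    # initial one and does not revert later deploys.
    ignore_changes = [task_definition]
  }
}
```

If Auto Scaling adjusts the task count, adding `desired_count` to the same list avoids the equivalent tug-of-war there.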
Plan as a PR comment #
name: Terraform Plan
on:
  pull_request:
    paths: ['infra/**']
permissions:
  id-token: write
  contents: read
  pull-requests: write
jobs:
  plan:
    runs-on: ubuntu-latest
    defaults:
      run: { working-directory: infra/envs/prod }
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/terraform-plan
          aws-region: ap-northeast-2
      - uses: hashicorp/setup-terraform@v3
        with: { terraform_version: 1.9.0 }
      - run: terraform init
      - run: terraform plan -no-color -out=tfplan
      - name: Comment Plan
        uses: actions/github-script@v7
        with:
          script: |
            // Render the saved plan as text and post it on the PR.
            const out = require('child_process')
              .execSync('terraform show -no-color tfplan', { cwd: 'infra/envs/prod', encoding: 'utf8' });
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: '```\n' + out + '\n```'
            });
The PR review stage surfaces every planned change in one place, which makes it the most effective point for catching potential production incidents before code merges.
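The terraform-plan role only needs read-only access to the infrastructure, plus access to the state bucket and lock table, because plan also takes the state lock. Apply needs a separate role with write permissions.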
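A possible shape for that plan role is sketched below. It assumes the GitHub OIDC identity provider is already registered in the account, uses `myorg/blog-infra` as a placeholder repository, and leans on the broad AWS-managed `ReadOnlyAccess` policy. A tighter custom policy may suit a real account better.

```hcl
# Sketch under assumptions: the OIDC provider exists, "myorg/blog-infra" is a placeholder.
data "aws_iam_openid_connect_provider" "github" {
  url = "https://token.actions.githubusercontent.com"
}

data "aws_iam_policy_document" "plan_trust" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]
    principals {
      type        = "Federated"
      identifiers = [data.aws_iam_openid_connect_provider.github.arn]
    }
    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }
    condition {
      # Only workflows from this repository may assume the role.
      test     = "StringLike"
      variable = "token.actions.githubusercontent.com:sub"
      values   = ["repo:myorg/blog-infra:*"]
    }
  }
}

resource "aws_iam_role" "terraform_plan" {
  name               = "terraform-plan"
  assume_role_policy = data.aws_iam_policy_document.plan_trust.json
}

# Read access to the infrastructure being planned (includes reading the state object).
resource "aws_iam_role_policy_attachment" "plan_read_only" {
  role       = aws_iam_role.terraform_plan.name
  policy_arn = "arn:aws:iam::aws:policy/ReadOnlyAccess"
}

# plan takes and releases the state lock, which needs writes on the lock table.
data "aws_iam_policy_document" "plan_lock" {
  statement {
    actions   = ["dynamodb:GetItem", "dynamodb:PutItem", "dynamodb:DeleteItem"]
    resources = ["arn:aws:dynamodb:ap-northeast-2:123456789012:table/terraform-state-lock"]
  }
}

resource "aws_iam_role_policy" "plan_lock" {
  name   = "state-lock"
  role   = aws_iam_role.terraform_plan.id
  policy = data.aws_iam_policy_document.plan_lock.json
}
```

The apply role can reuse the same trust pattern with a stricter `sub` condition, for example one limited to the main branch.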
6) Drift tracking #
Anything changed by hand in the console diverges from state — this is called drift. terraform plan shows the diff, effectively asking “should I revert this?”
terraform plan -detailed-exitcode
# exit 0 = no diff
# exit 1 = error
# exit 2 = diff exists (not a failure)
Run this daily in CI and alert Slack on exit 2.
Pitfalls — Terraform operations #
1) State lock not released #
If an apply is interrupted with Ctrl-C, the DynamoDB lock can remain. The next apply then fails with “Error acquiring the state lock.”
terraform force-unlock <LOCK_ID>
The LOCK_ID is in the error message. Always confirm that no one else is actually working before doing this.
2) Manual state edits #
Opening .tfstate in vim and editing directly almost always ends in regret. Instead:
terraform state list # list resources
terraform state show aws_ecr_repository.x # show one resource
terraform state rm aws_ecr_repository.x # remove from state (doesn't delete actual resource)
terraform state mv module.a.x module.b.x # move resource
terraform import aws_ecr_repository.x my-repo # register existing resource into state
3) Plaintext password in state #
aws_db_instance password, aws_secretsmanager_secret_version secret_string — go into state as plaintext. State bucket encryption + access restriction is essential.
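One way to keep the database password out of state altogether is to let RDS manage it. The sketch below uses the AWS provider's `manage_master_user_password` argument; the identifier, username, storage size, and snapshot name are illustrative assumptions.

```hcl
# Sketch: RDS-managed master password, so no `password` argument is set.
resource "aws_db_instance" "blog" {
  identifier        = "blog-prod" # assumption
  engine            = "postgres"
  engine_version    = "16.3"
  instance_class    = "db.t4g.small"
  allocated_storage = 20     # assumption
  username          = "blog" # assumption

  # RDS generates the password and stores it in Secrets Manager.
  # State then holds a reference to the secret, not the password itself.
  manage_master_user_password = true

  storage_encrypted         = true
  deletion_protection       = true
  final_snapshot_identifier = "blog-prod-final" # assumption
}
```

Other secrets can still end up in state, so the bucket itself needs restricting as well, for example by denying any request that does not use TLS: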
data "aws_iam_policy_document" "state_bucket" {
statement {
effect = "Deny"
actions = ["s3:*"]
resources = ["arn:aws:s3:::myorg-terraform-state/*"]
condition {
test = "Bool"
variable = "aws:SecureTransport"
values = ["false"]
}
}
}
4) -/+ destroy/create #
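If -/+ shows in plan, the resource is destroyed and recreated with a new ID. For RDS, that means data loss. Look carefully at lines like this:
# aws_db_instance.blog must be replaced
-/+ resource "aws_db_instance" "blog" {
  ~ storage_encrypted = false -> true # forces replacement
}
A change like this requires a separate migration procedure, such as restoring from an encrypted snapshot copy. Not every risky-looking change is a replacement, though: a major engine version upgrade such as 16.3 to 17.0 can be done in place, and RDS has dedicated options for it.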
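For the in-place major version upgrade, the relevant AWS provider arguments are `allow_major_version_upgrade` and `apply_immediately`. The following is a sketch under assumptions: the identifier, storage size, username, and retention period are illustrative, and the target version string should be checked against what RDS actually offers for the engine.

```hcl
# Sketch: in-place major version upgrade instead of destroy/recreate.
resource "aws_db_instance" "blog" {
  identifier        = "blog-prod" # assumption
  engine            = "postgres"
  instance_class    = "db.t4g.small"
  allocated_storage = 20     # assumption
  username          = "blog" # assumption

  manage_master_user_password = true

  engine_version              = "17.0" # target version; verify availability first
  allow_major_version_upgrade = true   # without this, the major upgrade is rejected
  apply_immediately           = false  # wait for the next maintenance window

  backup_retention_period   = 7 # assumption: keep automated backups as a safety net
  deletion_protection       = true
  final_snapshot_identifier = "blog-prod-final" # assumption
}
```

With these arguments the plan should show `~` (update in place) on engine_version, not `-/+`. If it still shows `-/+`, stop and find out which attribute forces the replacement.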
5) Provider version not pinned #
Without version in required_providers, the next init may pull a breaking version. Always pin with a pattern like ~> 5.0.
6) terraform destroy accident #
Accidentally destroying production. Protection:
resource "aws_db_instance" "blog" {
# ...
lifecycle {
prevent_destroy = true
}
}
With prevent_destroy = true, plan fails with an error whenever a change would destroy or replace the resource.
Wrapping up #
What we covered in this post:
- Why IaC: reproducibility / tracking / review / safe destroy
- Five blocks: provider, resource, data, variable, output
- Workflow: init → plan → apply → destroy. Plan is the biggest value
- State: Terraform’s core. Local state is for learning only
- S3 + DynamoDB Backend: production standard, encrypt, versioning
- Bootstrap: create the backend itself via console / CLI or a separate bootstrap folder
- Directory structure: modules/ + envs/{dev,prod}/, separate backend keys per env
- Modules: same pattern, different variables. dev = light, prod = full options
- CI/CD integration: Plan as PR comment, separate plan/apply permissions
- Drift tracking: run plan -detailed-exitcode periodically
- Pitfalls: lock release, state edits, plaintext password, -/+, provider version, destroy protection
Next — Monitoring #
Infrastructure is now code and deployment is automated. Now it’s time to seriously look at whether it’s running / running well.
In #5 Monitoring — CloudWatch alarms and X-Ray we’ll cover the core metrics of ECS / RDS / ALB, operational queries in Logs Insights, sending alarms to Slack, and X-Ray distributed tracing for a one-line answer to “why did this one request take 5 seconds?”