AWS in Practice #4: IaC — Terraform Fundamentals

10 min read

The infrastructure built in #1 through #3 is still managed by hand through the console / CLI. If we had to recreate it from scratch — from memory? from notes? — the results would be inconsistent. This post moves that to Terraform.

What we’ll cover:

  • Why IaC — repeatability / code review / drift tracking
  • The shape of Terraform — provider, resource, data, variable, output, state
  • State is the real core — S3 + DynamoDB lock backend
  • Modules — reusable units, environment branching
  • Code-ifying #1’s ECS infrastructure line by line

Why IaC

Four pains of the console-only approach:

  1. Can’t reproduce — “spin up staging just like prod”? Human memory always leaves subtle differences
  2. Can’t track changes — “who changed the SG last week?” → digging through CloudTrail. With code, it’s git log
  3. Can’t review — that one-line SG inbound change to the production cluster doesn’t get peer eyes
  4. Delete / recreate fear — get one resource wrong and you’re afraid to fix it

IaC (Infrastructure as Code) expresses infrastructure as declarative code, addressing all four problems at once.

| Tool | Role |
|---|---|
| Terraform | Multi-cloud, the de facto standard. Star of this post |
| Pulumi | Written in TypeScript / Python / Go. Strong for dynamic logic |
| AWS CDK | TypeScript / Python, synthesized into CloudFormation templates |
| CloudFormation | AWS-native YAML/JSON. Weak in dynamic expression |
| OpenTofu | OSS fork of Terraform (created after HashiCorp's 2023 license change) |

This series standardizes on Terraform. If your company uses OpenTofu by policy, the syntax and workflow shown here carry over essentially unchanged.

1) Terraform’s five blocks

main.tf — the smallest shape
# 1) Provider — how to talk to AWS
provider "aws" {
  region = "ap-northeast-2"
}

# 2) Resource — actual infrastructure to create
resource "aws_ecr_repository" "blog_api" {
  name                 = "blog-api"
  image_tag_mutability = "MUTABLE"

  image_scanning_configuration {
    scan_on_push = true
  }
}

# 3) Data — query existing resources
data "aws_caller_identity" "current" {}

# 4) Variable — external input
variable "environment" {
  type    = string
  default = "dev"
}

# 5) Output — expose results
output "ecr_url" {
  value = aws_ecr_repository.blog_api.repository_url
}

Five blocks in one place make a unit of infrastructure.
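
The smallest shape above declares a data source and a variable but never reads them. As a minimal sketch of how the blocks reference each other, the hypothetical file below (naming.tf, assumed to sit next to main.tf and to rely on the variable and data blocks declared there) derives a name prefix from var.environment and exposes the account ID. The locals block, the log group, and the 14-day retention are illustrative assumptions, not part of the original setup.

```hcl
# naming.tf (hypothetical file next to main.tf; uses the variable and data blocks declared there)

locals {
  # var.<name> reads an input variable
  name_prefix = "blog-${var.environment}"

  # data.<type>.<name>.<attribute> reads a data source
  account_id = data.aws_caller_identity.current.account_id
}

# Illustrative resource: a log group named after the environment
resource "aws_cloudwatch_log_group" "api" {
  name              = "/ecs/${local.name_prefix}-api"
  retention_in_days = 14 # assumed value

  tags = {
    Environment = var.environment
  }
}

output "account_id" {
  value = local.account_id
}
```

Running terraform plan -var environment=staging would then show the log group as /ecs/blog-staging-api, while the default stays blog-dev.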

The 4-step workflow

Terraform's 4 steps
terraform init      # download providers, init backend
terraform plan      # preview what gets created/changed/destroyed
terraform apply     # apply
terraform destroy   # delete

The plan output is Terraform’s biggest value: it lets you catch incidents before the code is merged.

plan output example
Terraform will perform the following actions:

  # aws_security_group.fargate will be created
  + resource "aws_security_group" "fargate" {
      + arn                    = (known after apply)
      + name                   = "sg-fargate"
      + ingress = [
          + {
              + from_port = 8000
              + to_port   = 8000
              + protocol  = "tcp"
              + ...
            },
        ]
    }

Plan: 1 to add, 0 to change, 0 to destroy.

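Markers: + add / ~ change / - destroy / -/+ replace. A -/+ means an attribute that cannot be updated in place has changed, so Terraform destroys the resource and creates a new one with a new ID. Always stop and read these.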

2) State — the real core

Terraform stores “the state of the infrastructure built so far” in state (the .tfstate file). This file has to exist for the next plan to compute the diff.

Where state lives
Actual AWS infrastructure  ←──────  Terraform code
                              state (last apply's result)

Terraform compares three things (code ↔ state ↔ AWS) and plans changes from the differences between them.

What happens if state breaks

| Situation | Result |
|---|---|
| State lost | Terraform thinks “nothing was created” → tries to recreate existing resources |
| Two people apply simultaneously | State breaks or one overwrites the other’s changes |
| State file as plaintext in git | Password / key exposure (state contains secrets in many resources) |

Local .tfstate is for learning only. Production needs a remote backend.

S3 + DynamoDB Backend

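The most common production pattern. (Terraform 1.10 and later can also lock directly in S3 with use_lockfile; this post uses the DynamoDB lock, which works with the versions used throughout.)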

backend.tf
terraform {
  required_version = ">= 1.7"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }

  backend "s3" {
    bucket         = "myorg-terraform-state"
    key            = "blog-api/prod/terraform.tfstate"
    region         = "ap-northeast-2"
    dynamodb_table = "terraform-state-lock"
    encrypt        = true
  }
}

Roles:

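| Setting | Role |
|---|---|
| S3 bucket | State file storage (versioning + encryption enabled) |
| DynamoDB table | Lock table that blocks concurrent applies |
| key prefix | <project>/<env>/terraform.tfstate pattern for env separation |
| encrypt = true | Server-side encryption of the state object (SSE-S3 by default, SSE-KMS only if kms_key_id is also set) |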

One-time bootstrap to set up the backend

The S3 bucket and DynamoDB table must exist before terraform init can use them, a classic chicken-and-egg problem. Two approaches:

  1. Manual creation via console / CLI once (this post’s assumption)
  2. Create with local backend in a separate “bootstrap” folder, then migrate backend to S3

bootstrap (approach 1, AWS CLI)
aws s3api create-bucket \
  --bucket myorg-terraform-state \
  --region ap-northeast-2 \
  --create-bucket-configuration LocationConstraint=ap-northeast-2

aws s3api put-bucket-versioning \
  --bucket myorg-terraform-state \
  --versioning-configuration Status=Enabled

aws dynamodb create-table \
  --table-name terraform-state-lock \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST \
  --region ap-northeast-2

Never destroy these two via Terraform. Your state lives inside.
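
For the second approach, a bootstrap folder might look roughly like the sketch below. It has no backend block, so its own state stays local. The bucket and table names match the CLI commands above; the public access block and the prevent_destroy guards are additions of this sketch, not something the original setup prescribes.

```hcl
# bootstrap/main.tf (sketch; no backend block, so state is kept locally in this folder)
terraform {
  required_version = ">= 1.7"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "ap-northeast-2"
}

resource "aws_s3_bucket" "state" {
  bucket = "myorg-terraform-state"

  # Refuse any plan that would delete the bucket holding every state file
  lifecycle {
    prevent_destroy = true
  }
}

resource "aws_s3_bucket_versioning" "state" {
  bucket = aws_s3_bucket.state.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Assumption of this sketch: state should never be publicly reachable
resource "aws_s3_bucket_public_access_block" "state" {
  bucket                  = aws_s3_bucket.state.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource "aws_dynamodb_table" "lock" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID" # the key name the S3 backend expects

  attribute {
    name = "LockID"
    type = "S"
  }

  lifecycle {
    prevent_destroy = true
  }
}
```

If you later want the bootstrap state itself in S3, add a backend "s3" block to this folder and run terraform init -migrate-state to copy the local state over.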

3) Directory structure — environment separation

Production shape
infra/
├─ modules/
│   ├─ network/        ← VPC, Subnets, SGs
│   ├─ ecs-service/    ← ALB + Service + Auto Scaling
│   └─ rds/            ← DB
├─ envs/
│   ├─ dev/
│   │   ├─ main.tf
│   │   ├─ backend.tf
│   │   ├─ variables.tf
│   │   └─ terraform.tfvars
│   └─ prod/
│       ├─ main.tf
│       ├─ backend.tf
│       ├─ variables.tf
│       └─ terraform.tfvars
└─ bootstrap/          ← S3 / DynamoDB (one-time)

Use different backend keys per environment to keep state separate:

envs/dev/backend.tf
terraform {
  backend "s3" {
    bucket         = "myorg-terraform-state"
    key            = "blog-api/dev/terraform.tfstate" # only the key differs from prod
    region         = "ap-northeast-2"
    dynamodb_table = "terraform-state-lock"
    encrypt        = true
  }
}

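This keeps dev and prod state in separate objects, so an apply run from envs/dev reads and writes only the dev state. Both still share one bucket and one lock table, so for a hard guarantee restrict IAM access per key prefix as well.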

4) Modules — units of reuse

Don’t repeat the same infrastructure pattern in dev / prod.

modules/ecs-service/variables.tf
variable "name"          { type = string }
variable "cluster_arn"   { type = string }
variable "image"         { type = string }
variable "vpc_id"        { type = string }
variable "subnet_ids"    { type = list(string) }
variable "alb_sg_id"     { type = string }
# A single-line block may hold only one argument, so variables with a default are written out
variable "desired_count" {
  type    = number
  default = 2
}
variable "cpu" {
  type    = string
  default = "512"
}
variable "memory" {
  type    = string
  default = "1024"
}
variable "container_port" {
  type    = number
  default = 8000
}

modules/ecs-service/main.tf (excerpt)
resource "aws_security_group" "fargate" {
  name        = "sg-${var.name}-fargate"
  description = "Fargate task SG"
  vpc_id      = var.vpc_id

  ingress {
    from_port       = var.container_port
    to_port         = var.container_port
    protocol        = "tcp"
    security_groups = [var.alb_sg_id]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_lb_target_group" "this" {
  name        = "tg-${var.name}"
  port        = var.container_port
  protocol    = "HTTP"
  target_type = "ip"
  vpc_id      = var.vpc_id

  health_check {
    path                = "/health"
    healthy_threshold   = 2
    interval            = 15
  }
}

resource "aws_ecs_task_definition" "this" {
  family                   = var.name
  network_mode             = "awsvpc"
  requires_compatibilities = ["FARGATE"]
  cpu                      = var.cpu
  memory                   = var.memory
  execution_role_arn       = aws_iam_role.execution.arn
  task_role_arn            = aws_iam_role.task.arn

  container_definitions = jsonencode([{
    name  = "api"
    image = var.image
    portMappings = [{ containerPort = var.container_port, protocol = "tcp" }]
    logConfiguration = {
      logDriver = "awslogs"
      options = {
        "awslogs-group"         = aws_cloudwatch_log_group.this.name
        "awslogs-region"        = data.aws_region.current.name
        "awslogs-stream-prefix" = "api"
      }
    }
  }])
}

resource "aws_ecs_service" "this" {
  name            = var.name
  cluster         = var.cluster_arn
  task_definition = aws_ecs_task_definition.this.arn
  desired_count   = var.desired_count
  launch_type     = "FARGATE"

  network_configuration {
    subnets         = var.subnet_ids
    security_groups = [aws_security_group.fargate.id]
    # A public IP only helps in public subnets; consider false when subnet_ids are private
    assign_public_ip = true
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.this.arn
    container_name   = "api"
    container_port   = var.container_port
  }

  deployment_circuit_breaker {
    enable   = true
    rollback = true
  }
}

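output "target_group_arn" { value = aws_lb_target_group.this.arn }
output "service_name"     { value = aws_ecs_service.this.name }
# Read by the rds module as module.api.fargate_sg_id to allow DB access from the tasks
output "fargate_sg_id"    { value = aws_security_group.fargate.id }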

#1’s console work is now in this single file.
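
The excerpt references a few objects it does not show: data.aws_region.current, aws_cloudwatch_log_group.this, aws_iam_role.execution, and aws_iam_role.task. A minimal sketch of what those could look like follows. The file name, role names, log group path, and retention period are assumptions; only the resource addresses come from the excerpt.

```hcl
# modules/ecs-service/supporting.tf (hypothetical file name)

# Region of the configured provider, used for the awslogs-region option
data "aws_region" "current" {}

resource "aws_cloudwatch_log_group" "this" {
  name              = "/ecs/${var.name}" # assumed naming scheme
  retention_in_days = 30                 # assumed value
}

# Trust policy that lets ECS tasks assume both roles
data "aws_iam_policy_document" "ecs_tasks_assume" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["ecs-tasks.amazonaws.com"]
    }
  }
}

# Execution role: used by the ECS agent to pull the image and write logs
resource "aws_iam_role" "execution" {
  name               = "${var.name}-ecs-execution"
  assume_role_policy = data.aws_iam_policy_document.ecs_tasks_assume.json
}

resource "aws_iam_role_policy_attachment" "execution" {
  role       = aws_iam_role.execution.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}

# Task role: used by the application code itself; attach app-specific policies here
resource "aws_iam_role" "task" {
  name               = "${var.name}-ecs-task"
  assume_role_policy = data.aws_iam_policy_document.ecs_tasks_assume.json
}
```

Keeping the execution role and the task role separate means the application never inherits the permissions needed to pull images or create log streams.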

Using the module

envs/prod/main.tf
module "network" {
  source = "../../modules/network"
  name   = "blog-prod"
  cidr   = "10.0.0.0/16"
  azs    = ["ap-northeast-2a", "ap-northeast-2c"]
}

module "rds" {
  source              = "../../modules/rds"
  name                = "blog-prod"
  vpc_id              = module.network.vpc_id
  db_subnet_ids       = module.network.db_subnet_ids
  fargate_sg_id       = module.api.fargate_sg_id
  multi_az            = true
  instance_class      = "db.t4g.small"
  deletion_protection = true
}

module "api" {
  source        = "../../modules/ecs-service"
  name          = "blog-prod"
  cluster_arn   = aws_ecs_cluster.blog.arn
  image         = var.image # injected by CI
  vpc_id        = module.network.vpc_id
  subnet_ids    = module.network.private_subnet_ids
  alb_sg_id     = module.network.alb_sg_id
  desired_count = 4
  cpu           = "1024"
  memory        = "2048"
}

The dev environment uses a lighter configuration: desired_count = 1, multi_az = false, instance_class = "db.t4g.micro". Same module, different variables is the key.
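
As a sketch, the dev counterpart could look like this. The values for desired_count, multi_az, and instance_class are the ones named above; the blog-dev name, the 10.1.0.0/16 CIDR, deletion_protection = false, and the presence of an aws_ecs_cluster.blog resource in the dev folder are assumptions made for illustration.

```hcl
# envs/dev/main.tf (sketch)

module "network" {
  source = "../../modules/network"
  name   = "blog-dev"
  cidr   = "10.1.0.0/16" # assumed; any range that does not overlap prod
  azs    = ["ap-northeast-2a", "ap-northeast-2c"]
}

module "rds" {
  source              = "../../modules/rds"
  name                = "blog-dev"
  vpc_id              = module.network.vpc_id
  db_subnet_ids       = module.network.db_subnet_ids
  fargate_sg_id       = module.api.fargate_sg_id
  multi_az            = false          # single AZ is enough for dev
  instance_class      = "db.t4g.micro" # smaller instance
  deletion_protection = false          # assumed: dev databases are disposable
}

module "api" {
  source        = "../../modules/ecs-service"
  name          = "blog-dev"
  cluster_arn   = aws_ecs_cluster.blog.arn # assumes a dev cluster declared in this folder, as in prod
  image         = var.image
  vpc_id        = module.network.vpc_id
  subnet_ids    = module.network.private_subnet_ids
  alb_sg_id     = module.network.alb_sg_id
  desired_count = 1
  # cpu and memory are omitted, so the module defaults ("512" / "1024") apply
}
```

Only the arguments differ between the two environments; the module source paths are identical.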

5) Terraform ↔ CI/CD integration

How to combine this with the GitHub Actions setup from #3.

Two flows

| Flow | Description |
|---|---|
| A. Separate infra / app | Infra changes via separate PR + apply, app deploy just updates the image |
| B. One bundled workflow | Image build → terraform apply puts new image into service |

A is recommended to start. Infrastructure changes are infrequent and high-risk; app deploys are frequent and lower-risk. Keeping them separate reflects that difference.

Plan as a PR comment

.github/workflows/terraform-plan.yml
name: Terraform Plan
on:
  pull_request:
    paths: ['infra/**']

permissions:
  id-token: write
  contents: read
  pull-requests: write

jobs:
  plan:
    runs-on: ubuntu-latest
    defaults:
      run: { working-directory: infra/envs/prod }
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/terraform-plan
          aws-region: ap-northeast-2
      - uses: hashicorp/setup-terraform@v3
        with: { terraform_version: 1.9.0 }
      - run: terraform init
      - run: terraform plan -no-color -out=tfplan
      - name: Comment Plan
        uses: actions/github-script@v7
        with:
          script: |
            // Render the saved plan as text; cwd is relative to the repository root
            const out = require('child_process').execSync(
              'terraform show -no-color tfplan',
              { cwd: 'infra/envs/prod', encoding: 'utf8' }
            );
            // Await the API call so a failed comment fails the step
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: '```\n' + out + '\n```'
            });

The PR review stage surfaces what changes in one place — the most effective point for catching potential production incidents before code merges.

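The terraform-plan role only needs read-only access to the infrastructure, plus read access to the state bucket and write access to the lock table (plan takes the state lock too). Apply needs a separate, more privileged role.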
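
A rough sketch of such a plan role, assuming GitHub's OIDC provider is already registered in the account. The repository name myorg/blog-infra is hypothetical, and the AWS-managed ReadOnlyAccess policy is just one convenient starting point. The account ID, table name, and region are the ones used elsewhere in this post.

```hcl
locals {
  github_repo = "myorg/blog-infra" # hypothetical "org/repo"; replace with your own
}

# Assumes the GitHub OIDC provider already exists in this account
data "aws_iam_openid_connect_provider" "github" {
  url = "https://token.actions.githubusercontent.com"
}

# Only pull_request runs of this one repository may assume the role
data "aws_iam_policy_document" "plan_trust" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [data.aws_iam_openid_connect_provider.github.arn]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:sub"
      values   = ["repo:${local.github_repo}:pull_request"]
    }
  }
}

resource "aws_iam_role" "terraform_plan" {
  name               = "terraform-plan"
  assume_role_policy = data.aws_iam_policy_document.plan_trust.json
}

# Read-only view of the infrastructure (this also covers reading the state object)
resource "aws_iam_role_policy_attachment" "plan_readonly" {
  role       = aws_iam_role.terraform_plan.name
  policy_arn = "arn:aws:iam::aws:policy/ReadOnlyAccess"
}

# The one write permission plan needs: taking and releasing the state lock
data "aws_iam_policy_document" "plan_lock" {
  statement {
    actions   = ["dynamodb:GetItem", "dynamodb:PutItem", "dynamodb:DeleteItem"]
    resources = ["arn:aws:dynamodb:ap-northeast-2:123456789012:table/terraform-state-lock"]
  }
}

resource "aws_iam_role_policy" "plan_lock" {
  name   = "state-lock"
  role   = aws_iam_role.terraform_plan.id
  policy = data.aws_iam_policy_document.plan_lock.json
}
```

The apply role would follow the same shape, but with a trust condition tied to the main branch rather than pull requests and with write permissions on the resources it manages.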

6) Drift tracking

Anything changed by hand in the console diverges from state — this is called drift. terraform plan shows the diff, effectively asking “should I revert this?”

Periodic drift check
terraform plan -detailed-exitcode
# exit 0 = no diff
# exit 1 = error (the plan itself failed)
# exit 2 = diff exists (not a failure)

Run daily in CI, alert Slack on exit 2.

Pitfalls — Terraform operations

1) State lock not released

An apply that is killed mid-run (a second Ctrl-C, a cancelled CI job) can leave the DynamoDB lock behind. The next apply then fails with “Error acquiring the state lock.”

Force unlock (careful)
terraform force-unlock <LOCK_ID>

The lock ID is printed under “Lock Info” in the error message. Always confirm that no one else is actually working before doing this.

2) Manual state edits

Opening .tfstate in vim and editing directly almost always ends in regret. Instead:

state commands
terraform state list                       # list resources
terraform state show aws_ecr_repository.x  # show one resource
terraform state rm aws_ecr_repository.x    # remove from state (doesn't delete actual resource)
terraform state mv module.a.x module.b.x   # move resource
terraform import aws_ecr_repository.x my-repo  # register existing resource into state
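
Recent Terraform versions also offer declarative counterparts to import, state mv, and state rm, so these operations can go through the same PR and plan review as any other change. The sketch below assumes Terraform 1.7 or later (matching the required_version used in this post). The resource addresses are illustrative, and aws_ecr_repository.legacy is hypothetical.

```hcl
# Counterpart of `terraform import`: adopt an existing ECR repository into state.
# A matching resource "aws_ecr_repository" "blog_api" block must exist in the code.
import {
  to = aws_ecr_repository.blog_api
  id = "blog-api" # ECR repositories are imported by name
}

# Counterpart of `terraform state mv`: record that a resource moved into a module,
# so Terraform updates the address instead of planning a destroy + create.
moved {
  from = aws_security_group.fargate
  to   = module.api.aws_security_group.fargate
}

# Counterpart of `terraform state rm`: forget a resource without deleting it in AWS.
# The resource block itself must be removed from the code at the same time.
removed {
  from = aws_ecr_repository.legacy # hypothetical resource

  lifecycle {
    destroy = false
  }
}
```

Each block shows up in terraform plan before anything happens, and can be deleted from the code once the apply has gone through.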

3) Plaintext password in state

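Values such as the aws_db_instance password and the aws_secretsmanager_secret_version secret_string are stored in state as plaintext. Encrypting the state bucket and restricting who can read it are essential. The example policy below covers one piece of that: it rejects any request that is not made over TLS.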

State bucket policy (example)
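data "aws_iam_policy_document" "state_bucket" {
  # Deny every request to the bucket or its objects that is not made over TLS
  statement {
    effect    = "Deny"
    actions   = ["s3:*"]
    resources = ["arn:aws:s3:::myorg-terraform-state", "arn:aws:s3:::myorg-terraform-state/*"]
    # A bucket policy statement needs a principal; "*" covers every caller
    principals {
      type        = "*"
      identifiers = ["*"]
    }
    condition {
      test     = "Bool"
      variable = "aws:SecureTransport"
      values   = ["false"]
    }
  }
}

# Attach the document to the bucket
resource "aws_s3_bucket_policy" "state" {
  bucket = "myorg-terraform-state"
  policy = data.aws_iam_policy_document.state_bucket.json
}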
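
For the database password specifically, one way to keep it out of state altogether is to let RDS generate and store it. The sketch below assumes PostgreSQL and uses illustrative sizing and names. The key line is manage_master_user_password, which the AWS provider supports in the 5.x series pinned in this post.

```hcl
resource "aws_db_instance" "blog" {
  identifier        = "blog-prod" # assumed name
  engine            = "postgres"  # assumed engine
  engine_version    = "16.3"
  instance_class    = "db.t4g.small"
  allocated_storage = 20     # assumed size in GiB
  username          = "blog" # assumed master user name

  # RDS generates the master password and stores it in Secrets Manager,
  # so no password value is written to the Terraform state.
  # Do not set `password` together with this argument.
  manage_master_user_password = true

  storage_encrypted         = true
  deletion_protection       = true
  skip_final_snapshot       = false
  final_snapshot_identifier = "blog-prod-final" # assumed name
}

# State then holds only the ARN of the secret, not its value
output "db_master_secret_arn" {
  value = aws_db_instance.blog.master_user_secret[0].secret_arn
}
```

The application reads the credentials from Secrets Manager at runtime using that ARN. The state bucket still needs protecting, since other resources can carry secrets.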

4) -/+ destroy/create

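If -/+ shows in plan, Terraform will destroy the resource and create a new one. For RDS, that means data loss unless the data has been migrated first. Watch for output like this:

  # aws_db_instance.blog must be replaced
-/+ resource "aws_db_instance" "blog" {
      ~ storage_encrypted = false -> true # forces replacement
    }

A change like this requires a separate migration procedure (for example: snapshot, encrypted copy, restore). Not every RDS change replaces the instance: an engine_version bump is applied in place, and a major version upgrade additionally needs allow_major_version_upgrade = true.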

5) Provider version not pinned

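Without a version constraint in required_providers, terraform init on a fresh checkout (or init -upgrade) can pull a new major version with breaking changes. Always pin with a pattern like ~> 5.0, and commit .terraform.lock.hcl so everyone resolves the same provider build.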

6) terraform destroy accident

Accidentally destroying production. Protection:

Protect important resources
resource "aws_db_instance" "blog" {
  # ...
  lifecycle {
    prevent_destroy = true
  }
}

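With prevent_destroy = true, any plan that would destroy or replace the resource fails with an error. The guard lives inside the resource block, so deleting the whole block from the code removes the protection along with it.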

Wrapping up

What we covered in this post:

  • Why IaC — reproducibility / tracking / review / safe destroy
  • Five blocks — provider, resource, data, variable, output
  • Workflow — init → plan → apply → destroy. Plan is the biggest value
  • State — Terraform’s core. Local state is for learning only
  • S3 + DynamoDB Backend — production standard, encrypt, versioning
  • Bootstrap — create the backend itself once, via console / CLI or a separate bootstrap folder
  • Directory structure — modules/ + envs/{dev,prod}/, separate backend keys per env
  • Modules — same pattern, different variables. dev = light, prod = full options
  • CI/CD integration — Plan as PR comment, separate plan/apply permissions
  • Drift tracking — plan -detailed-exitcode periodically
  • Pitfalls — lock release, state edits, plaintext password, -/+, provider version, destroy protection

Next — Monitoring

Infrastructure is now code and deployment is automated. Now it’s time to seriously look at whether it’s running / running well.

In #5 Monitoring — CloudWatch alarms and X-Ray we’ll cover the core metrics of ECS / RDS / ALB, operational queries in Logs Insights, sending alarms to Slack, and X-Ray distributed tracing for a one-line answer to “why did this one request take 5 seconds?”
