Chapter 21

Step Functions Intro

The AWS workflow engine, all in one place. We cover the role of a State machine, the four states Task / Choice / Parallel / Map, Standard vs Express, the Amazon States Language (ASL), Lambda / ECS / SDK integration, Retry / Catch error handling, and patterns like Saga and Human-in-the-loop.

Chapter 17 (Lambda basics), Chapter 18 (API Gateway + Lambda), Chapter 19 (EventBridge / SQS / SNS), and Chapter 20 (Secrets Manager / Parameter Store) dealt with individual building blocks: functions, messages, and secrets. One level up, what remains is how to bundle multi-step function calls, branching, and parallelism into a single workflow.

Traditionally, this kind of flow was written inside one Lambda with try / except and if. But once there are more than about five steps, the flow becomes hard to see and hard to debug. Step Functions replaces that approach.

This chapter is the last of Part 3, Containers · Serverless. After it, Part 4 begins with Chapter 22, ECS Fargate deployment skeleton, where we move the mental model we built in the console into Terraform code and start operating a real backend on ECS Fargate.

What Step Functions does #

AWS Step Functions is a managed workflow (state machine) engine. Define several steps in JSON, and AWS takes responsibility for step progression / retry / failure handling / visualization.

The way one Lambda did everything #

monolithic Lambda — 5-step processing

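import time

# fetch_user, the send_* / notify / update helpers, run_billing, and
# RateLimitError are application code defined elsewhere.

def handler(event, context):
    # Step 1: load the user
    user = fetch_user(event["userId"])

    # Step 2: branch on the plan, hard-coded in the function body
    if user.plan == "pro":
        send_pro_email(user)
    else:
        send_basic_email(user)

    # Step 3: billing, with a hand-rolled single retry
    try:
        run_billing(user)
    except RateLimitError:
        time.sleep(60)  # the sleep counts against the Lambda timeout
        run_billing(user)

    # Steps 4 and 5: run one after the other, even though they are independent
    notify_slack(user)
    update_dashboard(user)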

The problems are as follows.

The whole flow, including the 60-second sleep, has to fit inside one Lambda’s 15-minute limit.
When one step fails, you have to dig through the logs to find where it failed.
The per-step retry policy is embedded inside the code.
Adding a variant of the same flow multiplies the if branches.
It’s hard for an operator to know “where is it now?”

The problem Step Functions solves #

The picture of Step Functions

input
   │
   ▼
┌──────────────────┐
│ FetchUser        │  Lambda call
└──────┬───────────┘
       ├ "pro" → SendProEmail
       └ "basic" → SendBasicEmail
       │
       ▼
┌──────────────────┐
│ RunBilling       │  retry: 3 times, backoff 60s
└──────┬───────────┘
       ▼
┌─ Parallel ───────────────┐
│ NotifySlack │ UpdateDash │
└─────────────┴────────────┘
       │
       ▼
done

Each step is drawn as a node, and for every execution the console shows at a glance which step it reached and where it stopped. The retry is declarative, and the branching is driven by the data rather than by code.

Standard vs Express #

Step Functions has two workflow types.

	Standard	Express
Execution time	Up to 1 year	Up to 5 minutes
Pricing model	Per state transition	Per call + memory + time (similar to Lambda)
Execution history	Retained 90 days, visualized	Short, CloudWatch Logs only
Execution start rate	Over 2,000 / sec	Over 100,000 / sec
Execution semantics	Exactly-once	At-least-once (asynchronous) / at-most-once (synchronous)
When it fits	Business workflows a human must trace (orders, refunds)	Short / high-throughput (event processing, data transformation)

Start with Standard. Move to Express once it is clear that the workload is short-running and called very frequently.

Amazon States Language (ASL) #

A workflow is defined in JSON, in a format called the Amazon States Language (ASL).

hello.asl.json

{
  "Comment": "first state machine",
  "StartAt": "SayHello",
  "States": {
    "SayHello": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:ap-northeast-2:123456789012:function:hello-fn",
      "End": true
    }
  }
}

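Execution begins at the state named by StartAt, and every state / step is defined inside States. Each state has a Type and either names its successor with Next or ends the execution with End: true (a Choice state names its successors inside its rules instead).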
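The structural rules above are easy to check mechanically before deploying. The following is a minimal sketch of such a check, not the official validator: the function name check_definition is hypothetical, and it only covers StartAt, Type, and Next / End / Choice / Catch targets, while the real ASL specification has many more rules.

```python
from typing import Any

# States that end an execution by themselves and need neither Next nor End.
TERMINAL_TYPES = {"Succeed", "Fail"}


def check_definition(definition: dict[str, Any]) -> list[str]:
    """Return a list of structural problems found in a state machine definition."""
    problems: list[str] = []
    states = definition.get("States", {})
    start = definition.get("StartAt")
    if start not in states:
        problems.append(f"StartAt points to unknown state: {start!r}")

    for name, state in states.items():
        state_type = state.get("Type")
        if state_type is None:
            problems.append(f"{name}: missing Type")
            continue
        # Collect every state this one can hand control to.
        targets = []
        if "Next" in state:
            targets.append(state["Next"])
        if state_type == "Choice":
            targets += [rule.get("Next") for rule in state.get("Choices", [])]
            if "Default" in state:
                targets.append(state["Default"])
        for catcher in state.get("Catch", []):
            targets.append(catcher.get("Next"))
        for target in targets:
            if target not in states:
                problems.append(f"{name}: transition to unknown state {target!r}")
        # Every other state type needs Next or End: true.
        if state_type not in TERMINAL_TYPES | {"Choice"}:
            if "Next" not in state and state.get("End") is not True:
                problems.append(f"{name}: needs Next or End: true")
    return problems


hello = {
    "StartAt": "SayHello",
    "States": {"SayHello": {"Type": "Task", "Resource": "arn:...", "End": True}},
}
print(check_definition(hello))  # []
```

Running a check like this in CI catches a misspelled state name before create-state-machine rejects the definition.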

Create + run #

create a state machine

SM_ARN=$(aws stepfunctions create-state-machine \
  --name hello-flow \
  --definition file://hello.asl.json \
  --role-arn arn:aws:iam::123456789012:role/stepfn-role \
  --type STANDARD \
  --query stateMachineArn --output text)

# run (quote the variable so the ARN is passed as one argument)
aws stepfunctions start-execution \
  --state-machine-arn "$SM_ARN" \
  --input '{"name":"world"}'

The console draws the states as a graph and, for each execution, colors every node according to its status.
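The same start-and-watch cycle can be scripted. Below is a sketch that assumes a boto3 Step Functions client is passed in and that the workflow is Standard; it relies on the client's start_execution and describe_execution calls, while the helper name run_and_wait and the 2-second poll interval are illustrative choices.

```python
import json
import time
from typing import Any, Callable


def run_and_wait(
    client: Any,
    state_machine_arn: str,
    payload: dict,
    poll_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Start one execution and poll until it leaves the RUNNING status."""
    started = client.start_execution(
        stateMachineArn=state_machine_arn,
        input=json.dumps(payload),  # the API expects the input as a JSON string
    )
    while True:
        info = client.describe_execution(executionArn=started["executionArn"])
        if info["status"] != "RUNNING":
            return info  # SUCCEEDED, FAILED, TIMED_OUT, or ABORTED
        sleep(poll_seconds)


# Usage with boto3 (not executed here; needs AWS credentials):
#   import boto3
#   result = run_and_wait(boto3.client("stepfunctions"), sm_arn, {"name": "world"})
#   print(result["status"], result.get("output"))
```

The client and the sleep function are parameters so the helper can be exercised with a fake client in a unit test, without touching AWS.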

The 4 core states #

1) Task — the actual work #

The most-used state. Calls external resources like Lambda / ECS Task / SDK / SNS / SQS.

Task — Lambda call

{
  "Type": "Task",
  "Resource": "arn:aws:lambda:ap-northeast-2:123456789012:function:fetch-user",
  "InputPath": "$",
  "OutputPath": "$",
  "ResultPath": "$.user",
  "Next": "BranchByPlan"
}

InputPath: which part of the incoming data to send to the function
OutputPath: which part to pass to the next state
ResultPath: where in the input data to merge the function’s result
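To make the order of the three paths concrete, here is a small simulation in Python. It is a simplified model written for this explanation, not the Step Functions implementation: it only understands plain dotted paths such as $.user, and the function names are hypothetical.

```python
import copy
from typing import Any, Callable, Optional


def get_path(data: Any, path: str) -> Any:
    """Resolve a simple dotted path such as "$" or "$.user.plan"."""
    node = data
    for key in path.split(".")[1:]:  # skip the leading "$"
        node = node[key]
    return node


def apply_result_path(state_input: Any, result: Any, result_path: Optional[str]) -> Any:
    """Combine a task result with the state input as ResultPath describes."""
    if result_path is None:  # ResultPath: null discards the result
        return state_input
    if result_path == "$":   # the default: the result replaces the input
        return result
    merged = copy.deepcopy(state_input)
    keys = result_path.split(".")[1:]
    node = merged
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = result
    return merged


def run_task(
    raw_input: Any,
    task: Callable[[Any], Any],
    input_path: str = "$",
    result_path: Optional[str] = "$",
    output_path: str = "$",
) -> Any:
    """Apply InputPath, run the task, then apply ResultPath and OutputPath."""
    task_input = get_path(raw_input, input_path)
    result = task(task_input)
    # ResultPath merges into the raw state input, not the InputPath selection.
    merged = apply_result_path(raw_input, result, result_path)
    return get_path(merged, output_path)


out = run_task(
    {"userId": "u-1"},
    task=lambda event: {"id": event["userId"], "plan": "pro"},  # stand-in for fetch-user
    result_path="$.user",
)
print(out)  # {'userId': 'u-1', 'user': {'id': 'u-1', 'plan': 'pro'}}
```

With ResultPath set to $.user, the original userId survives next to the fetched user, which is why a later Choice state can read $.user.plan.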

Service Integration — direct integration #

Besides invoking Lambda, a Task can call AWS SDK actions directly. The work you used to do with boto3 inside a Lambda is done straight from ASL.

put directly to DynamoDB

{
  "Type": "Task",
  "Resource": "arn:aws:states:::aws-sdk:dynamodb:putItem",
  "Parameters": {
    "TableName": "users",
    "Item": {
      "id": {"S.$": "$.user.id"},
      "name": {"S.$": "$.user.name"}
    }
  },
  "Next": "Done"
}

This removes one Lambda, and with it the code, the deployment, and the cold start.
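The ".$" suffix in keys such as "S.$" is what tells Step Functions to read the value from the state input instead of taking it literally. A rough Python model of that substitution, assuming only plain dotted paths (no context object, no intrinsic functions) and using the hypothetical name resolve_parameters:

```python
from typing import Any


def get_path(data: Any, path: str) -> Any:
    """Resolve a simple dotted path such as "$.user.id"."""
    node = data
    for key in path.split(".")[1:]:  # skip the leading "$"
        node = node[key]
    return node


def resolve_parameters(template: Any, state_input: Any) -> Any:
    """Replace every "key.$": "<path>" pair with "key": <value found at path>."""
    if isinstance(template, dict):
        resolved = {}
        for key, value in template.items():
            if key.endswith(".$"):
                resolved[key[:-2]] = get_path(state_input, value)
            else:
                resolved[key] = resolve_parameters(value, state_input)
        return resolved
    if isinstance(template, list):
        return [resolve_parameters(item, state_input) for item in template]
    return template  # literal value, passed through unchanged


params = {
    "TableName": "users",
    "Item": {"id": {"S.$": "$.user.id"}, "name": {"S.$": "$.user.name"}},
}
state_input = {"user": {"id": "u-1", "name": "Ada"}}
print(resolve_parameters(params, state_input))
# {'TableName': 'users', 'Item': {'id': {'S': 'u-1'}, 'name': {'S': 'Ada'}}}
```

The resolved dictionary has the same shape as the arguments of a DynamoDB put-item call, which is why no Lambda is needed in between.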

Optimized Integration #

Frequently used services have optimized integrations with shortcut ARNs. The .sync suffix makes the state wait until the job finishes and then return its result.

ECS Task synchronous execution

{
  "Type": "Task",
  "Resource": "arn:aws:states:::ecs:runTask.sync",
  "Parameters": {
    "Cluster": "prod-cluster",
    "TaskDefinition": "myapp:42",
    "LaunchType": "FARGATE",
    ...
  },
  "Next": "AfterEcs"
}

It waits until the ECS Task finishes (or fails). Step Functions does the polling automatically (connects to Chapter 15 ECS and Fargate).

2) Choice — branching #

Looks at a data value and decides the next state.

Choice

{
  "Type": "Choice",
  "Choices": [
    {
      "Variable": "$.user.plan",
      "StringEquals": "pro",
      "Next": "SendProEmail"
    },
    {
      "Variable": "$.user.plan",
      "StringEquals": "basic",
      "Next": "SendBasicEmail"
    }
  ],
  "Default": "SendDefaultEmail"
}
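How a Choice state picks its successor can be sketched in a few lines of Python. This is an illustration under simplifying assumptions: it supports only the StringEquals comparison and plain dotted paths, and next_state is a hypothetical helper name.

```python
from functools import reduce
from typing import Any


def lookup(data: Any, path: str) -> Any:
    """Resolve a simple dotted path such as "$.user.plan"."""
    return reduce(lambda node, key: node[key], path.split(".")[1:], data)


def next_state(choice_state: dict, state_input: dict) -> str:
    """Return the name of the state a Choice would transition to."""
    for rule in choice_state["Choices"]:  # rules are checked in order
        if lookup(state_input, rule["Variable"]) == rule["StringEquals"]:
            return rule["Next"]  # the first matching rule wins
    if "Default" in choice_state:
        return choice_state["Default"]
    raise RuntimeError("States.NoChoiceMatched")  # no rule matched and no Default


choice = {
    "Type": "Choice",
    "Choices": [
        {"Variable": "$.user.plan", "StringEquals": "pro", "Next": "SendProEmail"},
        {"Variable": "$.user.plan", "StringEquals": "basic", "Next": "SendBasicEmail"},
    ],
    "Default": "SendDefaultEmail",
}
print(next_state(choice, {"user": {"plan": "pro"}}))    # SendProEmail
print(next_state(choice, {"user": {"plan": "trial"}}))  # SendDefaultEmail
```

Leaving out Default turns an unexpected plan value into a failed execution, so it is usually worth keeping.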

3) Parallel — parallel branches #

Runs several branches at once and collects their results into one array, in branch order.

Parallel

{
  "Type": "Parallel",
  "Branches": [
    {
      "StartAt": "NotifySlack",
      "States": {
        "NotifySlack": { "Type": "Task", "Resource": "...", "End": true }
      }
    },
    {
      "StartAt": "UpdateDashboard",
      "States": {
        "UpdateDashboard": { "Type": "Task", "Resource": "...", "End": true }
      }
    }
  ],
  "Next": "Done"
}

Each branch runs and retries independently. If one branch fails (and isn’t caught), the whole thing fails.

4) Map — collection processing #

Repeats the same flow for each item in an array. The distributed version of for-each.

Map

{
  "Type": "Map",
  "ItemsPath": "$.orders",
  "MaxConcurrency": 10,
  "ItemProcessor": {
    "ProcessorConfig": { "Mode": "INLINE" },
    "StartAt": "ProcessOrder",
    "States": {
      "ProcessOrder": { "Type": "Task", "Resource": "...", "End": true }
    }
  },
  "End": true
}

With MaxConcurrency: 10, a list of 100 orders is processed up to 10 at a time, and the state finishes when every item is done. The Distributed Map mode also handles 10k to 1M items (bulk processing of S3 objects, ETL, etc.).
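As a local analogy for an inline Map with MaxConcurrency, the sketch below uses a thread pool. It is only a mental model, since Step Functions runs the iterations as managed states rather than threads, and run_map and the order fields are made-up names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable


def run_map(items: list, process: Callable[[Any], Any], max_concurrency: int = 10) -> list:
    """Apply `process` to every item with at most `max_concurrency` workers."""
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        # pool.map returns results in input order and re-raises the first
        # exception it meets, so one failed item fails the whole call.
        return list(pool.map(process, items))


orders = [{"orderId": n, "amount": n * 10} for n in range(100)]
results = run_map(orders, lambda order: {"orderId": order["orderId"], "ok": True})
print(len(results), results[0])  # 100 {'orderId': 0, 'ok': True}
```

As in the Map state, the output is a list with one entry per input item, in the same order as the input array.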

The auxiliary states #

Pass — data shaping only, no external call
Wait — wait for a set time / until a specific moment
Succeed / Fail — explicit termination

Error handling — Retry / Catch #

Error handling is a big part of a workflow engine’s value, and it is set declaratively per step.

Retry #

Retry — retry up to 3 times with backoff

{
  "Type": "Task",
  "Resource": "arn:aws:lambda:...:run-billing",
  "Retry": [
    {
      "ErrorEquals": ["States.TaskFailed", "BillingRateLimitError"],
      "IntervalSeconds": 5,
      "MaxAttempts": 3,
      "BackoffRate": 2.0
    },
    {
      "ErrorEquals": ["States.Timeout"],
      "IntervalSeconds": 30,
      "MaxAttempts": 1
    }
  ],
  "Next": "AfterBilling"
}

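ErrorEquals lists the error names a retrier applies to: States.TaskFailed is a general task failure, States.Timeout is a timeout, and any other name (here BillingRateLimitError) is a custom error thrown by the Lambda. With IntervalSeconds: 5 and BackoffRate: 2.0, the waits are 5 sec → 10 sec → 20 sec.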
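For comparison, this is roughly the loop the first retrier replaces if you had to write it by hand. It is a sketch, not Step Functions' internal logic: call_with_retry is a hypothetical helper, errors are matched by exception class name (which is how a Python Lambda's error name reaches ErrorEquals), and the sleep function is injectable so the example runs instantly.

```python
import time
from typing import Any, Callable, Sequence


def call_with_retry(
    task: Callable[[], Any],
    error_names: Sequence[str],
    interval_seconds: float = 1,
    max_attempts: int = 3,
    backoff_rate: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call task(); on a listed error, wait and retry up to max_attempts times."""
    delay = interval_seconds
    for attempt in range(max_attempts + 1):  # first call plus max_attempts retries
        try:
            return task()
        except Exception as exc:
            retryable = type(exc).__name__ in error_names
            if not retryable or attempt == max_attempts:
                raise  # unlisted error, or retries exhausted
            sleep(delay)
            delay *= backoff_rate  # each wait is backoff_rate times the previous


class BillingRateLimitError(Exception):
    """Hypothetical custom error, named after the one in the Retry block."""


attempts = {"count": 0}


def run_billing() -> str:
    attempts["count"] += 1
    if attempts["count"] <= 3:  # fail three times, then succeed
        raise BillingRateLimitError("slow down")
    return "billed"


waits: list = []
result = call_with_retry(
    run_billing, ["BillingRateLimitError"], interval_seconds=5, sleep=waits.append
)
print(result, waits)  # billed [5, 10.0, 20.0]
```

In ASL the same policy is four lines of JSON, and changing it does not require redeploying the function.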

Catch #

Catch — a separate flow on failure

{
  "Type": "Task",
  "Resource": "...",
  "Catch": [
    {
      "ErrorEquals": ["States.ALL"],
      "ResultPath": "$.error",
      "Next": "HandleFailure"
    }
  ],
  "Next": "AfterTask"
}

When the retries are exhausted (or no retrier matches the error), the execution moves to HandleFailure, where you can run a compensating transaction (rollback), send a notification, or push the item to a human-intervention queue.

Frequently used patterns #

1) Saga — compensating transactions #

Used in a distributed system where a single database transaction can’t span the services involved. Each step has a forward action plus a compensating action that undoes it on failure.

Saga

create order → payment → deduct stock → reserve shipping
   ↓             ↓          ↓             ↓ (failed!)
   cancel order  refund     restore stock  X

Put a Catch on each Task, and on failure run the compensation steps in reverse order.
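The control flow of a saga is easy to prototype locally before wiring it into Catch blocks. The sketch below is a plain-Python model of the figure above, with hypothetical step names and no-op actions standing in for real service calls.

```python
from typing import Callable

# (name, forward action, compensating action)
Step = tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: list[Step]) -> list[str]:
    """Run steps in order; on failure, compensate the completed steps in reverse."""
    log: list[str] = []
    done: list[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
            log.append(f"ok:{name}")
            done.append(step)
        except Exception:
            log.append(f"failed:{name}")
            # Undo only what actually completed, newest first.
            for done_name, _, compensate in reversed(done):
                compensate()
                log.append(f"compensated:{done_name}")
            break
    return log


def fail() -> None:
    raise RuntimeError("shipping unavailable")


def noop() -> None:
    pass


steps = [
    ("create_order", noop, noop),
    ("payment", noop, noop),
    ("deduct_stock", noop, noop),
    ("reserve_shipping", fail, noop),
]
print(run_saga(steps))
# ['ok:create_order', 'ok:payment', 'ok:deduct_stock', 'failed:reserve_shipping',
#  'compensated:deduct_stock', 'compensated:payment', 'compensated:create_order']
```

The failed step itself is not compensated, matching the X in the figure: there is nothing to undo for a step that never completed.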

2) Human-in-the-loop #

Waits, then proceeds when a person approves / rejects.

Wait for callback

{
  "Type": "Task",
  "Resource": "arn:aws:states:::sns:publish.waitForTaskToken",
  "Parameters": {
    "TopicArn": "...",
    "Message": {
      "TaskToken.$": "$$.Task.Token",
      "OrderId.$": "$.orderId"
    }
  },
  "Next": "AfterApproval"
}

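waitForTaskToken sends the token to the outside (email, a Slack bot, etc.) and pauses until someone calls the SendTaskSuccess or SendTaskFailure API with that token. In a Standard workflow the wait can last up to 1 year; Express workflows don’t support this pattern.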
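On the approver's side, whatever receives the click (a Lambda behind an approval link, a Slack bot handler) has to hand the token back. A minimal sketch, assuming a boto3 Step Functions client is passed in; the function name complete_approval, the ApprovalRejected error name, and the output shape are illustrative.

```python
import json
from typing import Any


def complete_approval(client: Any, task_token: str, approved: bool, order_id: str) -> None:
    """Resume the paused execution with the approver's decision."""
    if approved:
        client.send_task_success(
            taskToken=task_token,
            # The output becomes the result of the waiting Task state.
            output=json.dumps({"orderId": order_id, "approved": True}),
        )
    else:
        client.send_task_failure(
            taskToken=task_token,
            error="ApprovalRejected",  # hypothetical error name; a Catch can match on it
            cause=f"order {order_id} was rejected by the approver",
        )


# Usage with boto3 (not executed here; needs AWS credentials):
#   import boto3
#   complete_approval(boto3.client("stepfunctions"), token_from_message, True, "o-42")
```

A rejection surfaces in the workflow as a task failure, so it can be routed with the same Catch mechanism as any other error.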

3) Polling pattern #

Waits for a long external job to complete.

StartJob → WaitState(30s) → CheckJob → Choice
                                ↓
                         (continue) loop back to WaitState
                         (done) Next
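Written out as a state machine, the loop above could look like the following. It is generated from Python here so it can be printed or checked in a test; the state names, the $.job.status field, and its DONE / FAILED values are assumptions for illustration, and the "..." resources are left unfilled as in the other examples.

```python
import json

polling_definition = {
    "Comment": "Polling loop sketch; field names are illustrative",
    "StartAt": "StartJob",
    "States": {
        "StartJob": {
            "Type": "Task",
            "Resource": "...",  # starts the long-running external job
            "ResultPath": "$.job",
            "Next": "WaitState",
        },
        "WaitState": {"Type": "Wait", "Seconds": 30, "Next": "CheckJob"},
        "CheckJob": {
            "Type": "Task",
            "Resource": "...",  # asks the external system for the job status
            "ResultPath": "$.job",
            "Next": "IsJobDone",
        },
        "IsJobDone": {
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.job.status", "StringEquals": "DONE", "Next": "JobDone"},
                {"Variable": "$.job.status", "StringEquals": "FAILED", "Next": "JobFailed"},
            ],
            "Default": "WaitState",  # anything else: wait another 30 seconds
        },
        "JobDone": {"Type": "Succeed"},
        "JobFailed": {"Type": "Fail", "Error": "JobFailed", "Cause": "external job reported FAILED"},
    },
}

print(json.dumps(polling_definition, indent=2))
```

Every pass through the loop adds three state transitions (Wait, Task, Choice), which matters in Standard mode where transitions are billed.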

4) Express workflow — event processing #

EventBridge / SQS triggers it → short processing (1 ~ 3 Lambdas) → put the result in DynamoDB / S3.

Express’s high throughput and short time limit fit this naturally (connects to Chapter 19 EventBridge / SQS / SNS).

Compared with Lambda — when which #

Cases where one Lambda is enough #

Short processing of 1 ~ 2 steps
No need for visualization / human tracing
Very frequent calls + very short (Step Functions’ state-transition cost is a burden)

Cases where Step Functions fits #

3 or more steps + branching / retry / parallelism
A person needs to see the failure / progress
A long interaction with an external system (human approval, external API)
The workflow itself is a business asset (tracking the edit history)

How to use them together #

Most of the time the two are used together. Step Functions controls the flow, and each step delegates the actual work to a Lambda / ECS / SDK call.

Pitfalls you’ll often hit #

1) JSONPath typo #

In "Variable": "$.user.plan", a single misplaced dot or $ means the path no longer points at the value you intended, and the rule never matches. Check it one step at a time with the console’s input/output inspector.

2) Lambda’s output is too large #

The input / output payload limit of one Step Functions state is 256 KB. Put large data in S3 and pass only the key.

good pattern

{
  "s3Bucket": "myapp-pipeline",
  "s3Key": "jobs/abc123/input.json"
}
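Inside a Lambda, the pointer pattern can be wrapped in a small helper. The sketch below is one possible shape, not a prescribed API: to_state_output and the "inline" key are hypothetical, the uploader is passed in as a function so the example runs without AWS, and the limit is treated as 256 * 1024 bytes of serialized JSON.

```python
import json
from typing import Any, Callable

PAYLOAD_LIMIT_BYTES = 256 * 1024  # the state payload limit quoted above


def to_state_output(
    data: Any,
    bucket: str,
    key: str,
    upload: Callable[[str, str, bytes], None],
    limit: int = PAYLOAD_LIMIT_BYTES,
) -> dict:
    """Return the data inline if it is small, otherwise store it and return a pointer."""
    body = json.dumps(data).encode("utf-8")
    if len(body) <= limit:
        return {"inline": data}
    upload(bucket, key, body)  # e.g. a wrapper around S3 put_object
    return {"s3Bucket": bucket, "s3Key": key}


# Usage with boto3 (not executed here; needs AWS credentials):
#   import boto3
#   s3 = boto3.client("s3")
#   put = lambda bucket, key, body: s3.put_object(Bucket=bucket, Key=key, Body=body)
#   return to_state_output(rows, "myapp-pipeline", "jobs/abc123/input.json", put)
```

In practice it is safer to pass a lower limit than the hard maximum, because the state input that the result is merged into counts toward the same limit.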

3) Cost on every state transition #

Standard mode costs $0.025 per 1,000 state transitions. The total is roughly the number of transitions per execution × the number of executions, so a workflow padded with many short steps (such as Pass) costs more than you’d expect.
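The arithmetic is worth doing once with your own numbers. A small sketch using the price quoted above; the function name is made up, any free tier is ignored, and prices can differ by region.

```python
def standard_cost_usd(
    transitions_per_execution: int,
    executions: int,
    price_per_1000: float = 0.025,  # price quoted in the text above
) -> float:
    """Estimate the Standard workflow charge for state transitions only."""
    total_transitions = transitions_per_execution * executions
    return round(total_transitions * price_per_1000 / 1000, 4)


# A 10-transition workflow that runs one million times a month:
print(standard_cost_usd(10, 1_000_000))  # 250.0
# Trimming it to 6 transitions:
print(standard_cost_usd(6, 1_000_000))   # 150.0
```

The charge scales linearly with both factors, so removing four pass-through steps from a hot workflow saves 40% of the transition bill in this example.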

4) Lambda cold start on every step #

If each Task is a separate Lambda, every step can hit its own cold start. Enable Provisioned Concurrency on the latency-sensitive functions (together with Express when the flow is short), or combine several steps into one Lambda (see the cold start section of Chapter 17 Lambda basics).

5) Retry’s BackoffRate runaway #

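With IntervalSeconds: 1, MaxAttempts: 10, BackoffRate: 3.0, the waits grow 1 → 3 → 9 → 27 → 81 sec and beyond, adding up to about 8 hours in total, far longer than any user will wait. Compute whether the sum is reasonable, and consider capping each wait with MaxDelaySeconds.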
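Checking the sum takes a few lines. The helpers below are illustrative names, not part of any SDK; they assume the wait before retry n is IntervalSeconds × BackoffRate^(n-1) and ignore jitter and any MaxDelaySeconds cap.

```python
def total_wait(interval_seconds: float, max_attempts: int, backoff_rate: float) -> float:
    """Sum of all waits if every retry is used."""
    return sum(interval_seconds * backoff_rate ** n for n in range(max_attempts))


def attempts_within(budget_seconds: float, interval_seconds: float, backoff_rate: float) -> int:
    """Largest MaxAttempts whose total wait still fits in the budget."""
    if interval_seconds <= 0 or backoff_rate < 1:
        raise ValueError("interval must be positive and backoff_rate at least 1")
    attempts, waited, delay = 0, 0.0, float(interval_seconds)
    while waited + delay <= budget_seconds:
        waited += delay
        delay *= backoff_rate
        attempts += 1
    return attempts


print(total_wait(1, 10, 3.0))       # 29524.0 seconds, a little over 8 hours
print(attempts_within(60, 1, 3.0))  # 4  (1 + 3 + 9 + 27 = 40 seconds)
```

Working backwards from a time budget, as attempts_within does, is usually easier than guessing MaxAttempts and checking afterwards.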

6) Catch runs only once #

A Catch fires once, after the retries are exhausted. If the state it routes to fails as well, the execution fails unless that state has its own error handling, so consider giving the handler task a Retry too.

7) External calls invisible in the visualization #

Calls you make with boto3 inside a Lambda don’t appear in the visualization / tracing. Where possible, pull them out into an ASL Task state with Service Integration.

Exercises #

For a multi-step processing you have in mind (e.g., order → payment → shipping), judge against the criteria in §“Compared with Lambda — when which” whether one Lambda is enough or Step Functions is needed, and write the reason in one sentence.
Assuming you apply the Saga pattern to your workflow, draw a table of each forward step and its corresponding compensation step, like the figure in §“Frequently used patterns.” Also mark which step you should attach a Catch to.
From §“Pitfalls you’ll often hit,” explain why both the state-transition cost and the per-step cold start are proportional to the number of steps, and put together in one paragraph which choice from Chapter 17 Lambda basics (combining into one Lambda / Service Integration) you can use to reduce the steps.

In short: Step Functions is a managed workflow engine that bundles multi-step function and SDK calls in JSON (ASL), with visualization, retry, branching, and parallelism all declarative. For long business workflows that need tracing, use Standard; for short, high-throughput events, use Express. The core states are Task, Choice, Parallel, and Map, and Service Integration calls the SDK directly without Lambda. One or two steps are enough with one Lambda, and once you need branching, retry, or visualization at 3 or more steps, Step Functions fits; the common pitfalls are the 256 KB payload, state-transition cost, and multi-step cold start.

Next chapter #

The theory is all out. In Part 3 we gathered the toolbox of AWS operations: containers (ECS / ECR), serverless (Lambda / API Gateway), messaging (EventBridge / SQS / SNS), secrets (Secrets Manager / Parameter Store), and workflows (Step Functions). From the next Part 4’s Chapter 22 ECS Fargate deployment skeleton on, we move the mental model we built in the console into Terraform code and start putting a real backend on ECS Fargate in an operable form.