AWS Advanced #7: Step Functions

Posts #3 (Lambda), #4 (API Gateway), #5 (EventBridge / SQS / SNS), and #6 (Secrets Manager) covered functions / messaging / secrets. One piece is left: chaining several function calls, branches, and parallel paths into a single workflow.

Traditionally that orchestration lives in one Lambda full of try/except and if statements. Past about five steps, the flow gets hard to follow and hard to debug. That is where Step Functions fits.

This is the last post of the AWS Advanced series. Next comes the six-post practice track: operating an actual backend on ECS Fargate.

Where Step Functions fits #

AWS Step Functions is a managed workflow (state machine) engine. Define your steps in JSON; AWS handles step progression / retries / failure handling / visualization.

When one Lambda did it all #

monolithic Lambda — 5 steps
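import time

# fetch_user, send_pro_email, send_basic_email, run_billing, notify_slack,
# update_dashboard and RateLimitError are application code defined elsewhere.

def handler(event, context):
    user = fetch_user(event["userId"])
    # Branching is hard-coded
    if user.plan == "pro":
        send_pro_email(user)
    else:
        send_basic_email(user)

    # Hand-rolled retry: the sleep is billed Lambda time
    try:
        run_billing(user)
    except RateLimitError:
        time.sleep(60)
        run_billing(user)

    notify_slack(user)
    update_dashboard(user)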

Problems:

  • The whole flow has to finish within a single Lambda’s 15-minute timeout
  • When a step fails, you have to hunt through logs to figure out where
  • Per-step retry policy is buried in code
  • Adding another variant of the same flow → if-explosion
  • Operators struggle to know “where are we right now?”

What Step Functions solves #

the Step Functions picture
input
┌──────────────────┐
│ FetchUser        │  Lambda call
└──────┬───────────┘
       ├ "pro" → SendProEmail
       └ "basic" → SendBasicEmail
┌──────────────────┐
│ RunBilling       │  retry: 3 times, backoff 60s
└──────┬───────────┘
┌─ Parallel ───────────────┐
│ NotifySlack │ UpdateDash │
└─────────────┴────────────┘
done

Each step is drawn in the console, and for any execution you can see exactly where it stopped. Retries are declarative; branching is data-driven.

Standard vs Express #

Two Step Functions modes.

| | Standard | Express |
|---|---|---|
| Execution time | Up to 1 year | Up to 5 minutes |
| Pricing model | Per state transition | Per invocation + memory + duration (Lambda-like) |
| Execution history | Retained 90 days, visualized | Short, CloudWatch Logs only (logging must be enabled) |
| Throughput (execution starts) | Over 2,000 / sec | Over 100,000 / sec |
| Execution semantics | Exactly-once | At-least-once |
| Where it fits | Business workflows humans must trace (orders, refunds) | Short / high-throughput (event processing, data transformation) |

Start with Standard. Move to Express once it is clear the workload is short-running and invoked at high frequency.

Amazon States Language (ASL) #

Workflows are defined in JSON, called ASL.

hello.asl.json
{
  "Comment": "first state machine",
  "StartAt": "SayHello",
  "States": {
    "SayHello": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:ap-northeast-2:123456789012:function:hello-fn",
      "End": true
    }
  }
}

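Execution begins at the state named in StartAt; every state is defined under States. Each state has a Type and says where to go next: Next names the following state, and End: true finishes the workflow. Choice states route through their rules instead, and Succeed / Fail are always terminal.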
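Those structural rules are easy to check before deploying. Here is a minimal sketch, assuming a hypothetical helper named check_definition (it is not part of any AWS SDK) that only verifies StartAt and transition targets. The real service validates far more than this.

```python
def check_definition(definition: dict) -> list:
    """Return a list of structural problems (empty list means none found)."""
    problems = []
    states = definition.get("States", {})
    start = definition.get("StartAt")
    if start not in states:
        problems.append(f"StartAt names an unknown state: {start!r}")
    for name, state in states.items():
        if "Type" not in state:
            problems.append(f"{name}: missing Type")
        # Every state this one can transition to.
        targets = [state[key] for key in ("Next", "Default") if key in state]
        targets += [rule["Next"] for rule in state.get("Choices", [])]
        for target in targets:
            if target not in states:
                problems.append(f"{name}: unknown target {target!r}")
        terminal = state.get("End") is True or state.get("Type") in ("Succeed", "Fail")
        if not targets and not terminal:
            problems.append(f"{name}: needs Next or End: true")
    return problems


hello = {
    "Comment": "first state machine",
    "StartAt": "SayHello",
    "States": {
        "SayHello": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:ap-northeast-2:123456789012:function:hello-fn",
            "End": True,
        }
    },
}
print(check_definition(hello))  # an empty list: no problems found
```

Running something like this in CI catches a renamed state before create-state-machine rejects the definition.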

Create + execute #

create state machine
SM_ARN=$(aws stepfunctions create-state-machine \
  --name hello-flow \
  --definition file://hello.asl.json \
  --role-arn arn:aws:iam::123456789012:role/stepfn-role \
  --type STANDARD \
  --query stateMachineArn --output text)

# execute
aws stepfunctions start-execution \
  --state-machine-arn "$SM_ARN" \
  --input '{"name":"world"}'

In the console, the state machine is drawn as a graph, and for each execution the nodes are colored as the run progresses.

The four core states #

1) Task — actual work #

The most-used state. A Task calls an external resource: a Lambda function, an ECS task, SNS, SQS, or any AWS SDK action.

Task — Lambda call
{
  "Type": "Task",
  "Resource": "arn:aws:lambda:ap-northeast-2:123456789012:function:fetch-user",
  "InputPath": "$",
  "OutputPath": "$",
  "ResultPath": "$.user",
  "Next": "BranchByPlan"
}
  • InputPath: which part of the state input is sent to the function
  • OutputPath: which part of the final document is passed to the next state
  • ResultPath: where the function’s result is placed in the state input (here, under $.user)
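How the three paths interact is easier to see in a small simulation. This is a rough sketch, not the service's implementation: get_path, set_path and run_task are hypothetical helpers, and only simple dotted paths such as "$.user" are supported.

```python
import copy


def get_path(data, path):
    """Resolve a simple path such as "$" or "$.a.b" (dots only)."""
    for key in path.split(".")[1:]:
        data = data[key]
    return data


def set_path(data, path, value):
    """Return a copy of data with value placed at path ("$" replaces everything)."""
    keys = path.split(".")[1:]
    if not keys:
        return value
    result = copy.deepcopy(data)
    node = result
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = value
    return result


def run_task(state_input, task, input_path="$", result_path="$", output_path="$"):
    task_input = get_path(state_input, input_path)       # InputPath
    result = task(task_input)                            # the work itself
    merged = set_path(state_input, result_path, result)  # ResultPath
    return get_path(merged, output_path)                 # OutputPath


def fake_fetch_user(task_input):
    # Stand-in for the fetch-user Lambda.
    return {"id": task_input["userId"], "plan": "pro"}


state_input = {"userId": "u-1"}
print(run_task(state_input, fake_fetch_user, result_path="$.user"))
```

With ResultPath set to "$.user", the original userId survives next to the new user object. With the default "$", the result would replace the input entirely.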

Service Integration — direct #

Beyond Lambda, a Task can call AWS service APIs directly through the AWS SDK integration. What used to be a boto3 call inside a Lambda becomes a native ASL state:

put directly to DynamoDB
{
  "Type": "Task",
  "Resource": "arn:aws:states:::aws-sdk:dynamodb:putItem",
  "Parameters": {
    "TableName": "users",
    "Item": {
      "id": {"S.$": "$.user.id"},
      "name": {"S.$": "$.user.name"}
    }
  },
  "Next": "Done"
}

One less Lambda: its code, deployment, and cold starts all disappear.
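The ".$" suffix on a key is what turns a literal into a lookup. The sketch below imitates that substitution for the Parameters block above. resolve_parameters and lookup are hypothetical helpers, and context-object paths ("$$.") and intrinsic functions are deliberately left out.

```python
def lookup(data, path):
    """Resolve a dotted path such as "$.user.id" (dots only)."""
    for key in path.split(".")[1:]:
        data = data[key]
    return data


def resolve_parameters(template, state_input):
    """Replace every "key.$": "<path>" pair with "key": <value at path>."""
    if isinstance(template, dict):
        resolved = {}
        for key, value in template.items():
            if key.endswith(".$"):
                resolved[key[:-2]] = lookup(state_input, value)
            else:
                resolved[key] = resolve_parameters(value, state_input)
        return resolved
    if isinstance(template, list):
        return [resolve_parameters(item, state_input) for item in template]
    return template  # plain literals pass through unchanged


parameters = {
    "TableName": "users",
    "Item": {
        "id": {"S.$": "$.user.id"},
        "name": {"S.$": "$.user.name"},
    },
}
state_input = {"user": {"id": "u-1", "name": "Ada"}}
print(resolve_parameters(parameters, state_input))
```

The resolved document is what the DynamoDB PutItem call actually receives: "TableName" stays literal, while each "S.$" becomes an "S" holding the value from the state input.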

Optimized Integration #

Frequently used services have optimized integrations with shorter ARNs. The .sync suffix makes the state wait for the job to finish:

ECS Task synchronous run
{
  "Type": "Task",
  "Resource": "arn:aws:states:::ecs:runTask.sync",
  "Parameters": {
    "Cluster": "prod-cluster",
    "TaskDefinition": "myapp:42",
    "LaunchType": "FARGATE",
    ...
  },
  "Next": "AfterEcs"
}

The state waits until the ECS task completes (or fails); Step Functions takes care of the polling.

2) Choice — branching #

Decide the next state based on a data value.

Choice
{
  "Type": "Choice",
  "Choices": [
    {
      "Variable": "$.user.plan",
      "StringEquals": "pro",
      "Next": "SendProEmail"
    },
    {
      "Variable": "$.user.plan",
      "StringEquals": "basic",
      "Next": "SendBasicEmail"
    }
  ],
  "Default": "SendDefaultEmail"
}
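Rule order matters in a Choice state: rules are tried top to bottom and the first match wins. A minimal sketch of that evaluation, assuming hypothetical helpers lookup and choose_next and supporting only StringEquals:

```python
def lookup(data, path):
    """Resolve a dotted path such as "$.user.plan" (dots only)."""
    for key in path.split(".")[1:]:
        data = data[key]
    return data


def choose_next(choice_state, state_input):
    """Return the name of the next state for the given input."""
    for rule in choice_state["Choices"]:
        # Rules are tried in order; the first match wins.
        if "StringEquals" in rule and lookup(state_input, rule["Variable"]) == rule["StringEquals"]:
            return rule["Next"]
    if "Default" in choice_state:
        return choice_state["Default"]
    # Step Functions reports this situation as States.NoChoiceMatched.
    raise RuntimeError("States.NoChoiceMatched")


branch_by_plan = {
    "Type": "Choice",
    "Choices": [
        {"Variable": "$.user.plan", "StringEquals": "pro", "Next": "SendProEmail"},
        {"Variable": "$.user.plan", "StringEquals": "basic", "Next": "SendBasicEmail"},
    ],
    "Default": "SendDefaultEmail",
}
for plan in ("pro", "basic", "trial"):
    print(plan, "->", choose_next(branch_by_plan, {"user": {"plan": plan}}))
```

The "trial" plan matches neither rule and falls through to Default, which is why leaving Default out is risky.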

3) Parallel — parallel branches #

Run multiple branches in parallel, then merge results.

Parallel
{
  "Type": "Parallel",
  "Branches": [
    {
      "StartAt": "NotifySlack",
      "States": {
        "NotifySlack": { "Type": "Task", "Resource": "...", "End": true }
      }
    },
    {
      "StartAt": "UpdateDashboard",
      "States": {
        "UpdateDashboard": { "Type": "Task", "Resource": "...", "End": true }
      }
    }
  ],
  "Next": "Done"
}

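Each branch runs independently, with its own retries, and the state's output is an array holding each branch's output in branch order. If any branch fails with an uncaught error, the whole Parallel state fails.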
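The same fan-out and merge can be imitated with a thread pool. This is only an analogy under simplifying assumptions: run_parallel and the two branch functions are hypothetical, and unlike the real state this sketch lets the other branches finish before the failure surfaces.

```python
from concurrent.futures import ThreadPoolExecutor


def run_parallel(branches, state_input):
    """Run every branch with the same input; return outputs in branch order."""
    with ThreadPoolExecutor(max_workers=len(branches)) as pool:
        futures = [pool.submit(branch, state_input) for branch in branches]
        # result() re-raises a branch's exception, which fails the whole call.
        return [future.result() for future in futures]


def notify_slack(data):
    # Stand-in for the NotifySlack task.
    return {"slack": f"notified about {data['userId']}"}


def update_dashboard(data):
    # Stand-in for the UpdateDashboard task.
    return {"dashboard": f"updated for {data['userId']}"}


print(run_parallel([notify_slack, update_dashboard], {"userId": "u-1"}))
```

Every branch receives the same input, and the next state has to index into the resulting array (or reshape it) to reach a particular branch's output.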

4) Map — collection processing #

Apply the same flow to every item in an array: a for-each with built-in concurrency control.

Map
{
  "Type": "Map",
  "ItemsPath": "$.orders",
  "MaxConcurrency": 10,
  "ItemProcessor": {
    "ProcessorConfig": { "Mode": "INLINE" },
    "StartAt": "ProcessOrder",
    "States": {
      "ProcessOrder": { "Type": "Task", "Resource": "...", "End": true }
    }
  },
  "End": true
}

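This processes 100 orders, at most 10 at a time, and the state finishes when all are done. For datasets in the 10K to 1M item range (S3 bulk processing, ETL, etc.), Distributed Map mode ("Mode": "DISTRIBUTED") runs the iterations as child workflow executions.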
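What MaxConcurrency buys you can be illustrated locally. In this sketch run_map and process_order are hypothetical stand-ins, and the stats dictionary exists only to show that no more than 10 items are ever in flight.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_map(items, process_item, max_concurrency):
    """Apply process_item to every item, at most max_concurrency at a time."""
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        # pool.map returns results in input order, like the Map state's output array.
        return list(pool.map(process_item, items))


lock = threading.Lock()
stats = {"in_flight": 0, "peak": 0}


def process_order(order):
    """Stand-in for the ProcessOrder task; also records peak concurrency."""
    with lock:
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
    time.sleep(0.01)  # pretend to do some work
    with lock:
        stats["in_flight"] -= 1
    return {"orderId": order["orderId"], "status": "processed"}


orders = [{"orderId": n} for n in range(100)]
results = run_map(orders, process_order, max_concurrency=10)
print(len(results), "orders processed, peak concurrency", stats["peak"])
```

The results come back in the same order as the input array even though the items finish out of order.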

Auxiliary states #

  • Pass — data shaping only, no external call
  • Wait — wait until a duration / specific time
  • Succeed / Fail — explicit termination

Error handling — Retry / Catch #

A core value of the workflow tool. Per-step, declarative.

Retry #

Retry — up to 3 attempts with backoff
{
  "Type": "Task",
  "Resource": "arn:aws:lambda:...:run-billing",
  "Retry": [
    {
      "ErrorEquals": ["States.TaskFailed", "BillingRateLimitError"],
      "IntervalSeconds": 5,
      "MaxAttempts": 3,
      "BackoffRate": 2.0
    },
    {
      "ErrorEquals": ["States.Timeout"],
      "IntervalSeconds": 30,
      "MaxAttempts": 1
    }
  ],
  "Next": "AfterBilling"
}

States.TaskFailed matches generic task failures, States.Timeout matches timeouts, and you can also list a custom error name that your Lambda raises. With the first rule, the waits between retries are 5s → 10s → 20s.
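To see which rule applies to a given error and what the waits come to, here is a small sketch. find_retrier and retry_delays are hypothetical helpers. The defaults used (IntervalSeconds 1, MaxAttempts 3, BackoffRate 2.0) and the rule that States.TaskFailed matches any error except States.Timeout follow my reading of the ASL documentation, so treat them as assumptions to verify.

```python
def find_retrier(retriers, error_name):
    """Return the first Retry rule whose ErrorEquals matches error_name."""
    for retrier in retriers:
        names = retrier["ErrorEquals"]
        if error_name in names or "States.ALL" in names:
            return retrier
        # Assumption: States.TaskFailed matches any error except States.Timeout.
        if "States.TaskFailed" in names and error_name != "States.Timeout":
            return retrier
    return None


def retry_delays(retrier):
    """Seconds waited before each retry attempt under this rule."""
    interval = retrier.get("IntervalSeconds", 1)
    attempts = retrier.get("MaxAttempts", 3)
    rate = retrier.get("BackoffRate", 2.0)
    return [interval * rate ** n for n in range(attempts)]


retriers = [
    {
        "ErrorEquals": ["States.TaskFailed", "BillingRateLimitError"],
        "IntervalSeconds": 5,
        "MaxAttempts": 3,
        "BackoffRate": 2.0,
    },
    {"ErrorEquals": ["States.Timeout"], "IntervalSeconds": 30, "MaxAttempts": 1},
]
for error in ("BillingRateLimitError", "States.Timeout"):
    print(error, retry_delays(find_retrier(retriers, error)))
```

Rules are checked in order, so a narrow rule placed after a broad one may never fire.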

Catch #

Catch — alternative flow on failure
{
  "Type": "Task",
  "Resource": "...",
  "Catch": [
    {
      "ErrorEquals": ["States.ALL"],
      "ResultPath": "$.error",
      "Next": "HandleFailure"
    }
  ],
  "Next": "AfterTask"
}

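If the error is not retried, or all retries are exhausted, the flow moves to HandleFailure: compensating transactions (rollbacks), notifications, or a human-intervention queue. ResultPath keeps the original input and adds the error details under $.error.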
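A rough picture of what HandleFailure receives, as a simplified sketch. run_with_catch, charge and CardDeclined are hypothetical, only a top-level ResultPath like "$.error" is handled, and the real Cause for a Lambda error is a longer JSON string than the plain message used here.

```python
def run_with_catch(task, state_input, catchers, next_state):
    """Run task; on an exception, route to the first matching catcher."""
    try:
        return next_state, task(state_input)
    except Exception as exc:
        error_name = type(exc).__name__
        for catcher in catchers:
            names = catcher["ErrorEquals"]
            if "States.ALL" in names or error_name in names:
                # Errors are reported as an object with Error and Cause fields.
                error_info = {"Error": error_name, "Cause": str(exc)}
                field = catcher["ResultPath"].split(".", 1)[1]
                return catcher["Next"], {**state_input, field: error_info}
        raise  # no catcher matched: the execution fails


class CardDeclined(Exception):
    """Hypothetical error raised by the billing task."""


def charge(order):
    raise CardDeclined("card was declined")


catchers = [{"ErrorEquals": ["States.ALL"], "ResultPath": "$.error", "Next": "HandleFailure"}]
print(run_with_catch(charge, {"orderId": "o-1"}, catchers, next_state="AfterTask"))
```

Because the error lands under a separate key, the handler still has orderId and everything else it needs to compensate.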

Common patterns #

1) Saga — compensating transactions #

For workflows where a single transaction cannot span multiple systems. Each step has a forward action and a compensating action that undoes it on failure.

Saga
CreateOrder → Charge → DeductInventory → ScheduleShip
   ↓            ↓           ↓                ↓ (failure!)
   CancelOrder  Refund      RestoreInventory  X

Add Catch to each Task, and on failure run the compensating steps in reverse.
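The control flow is the same whether it lives in ASL or in code. A minimal sketch of the reverse-order compensation, where run_saga and step are hypothetical helpers and each action just records its name in a log:

```python
def run_saga(steps, context):
    """Run (action, compensation) pairs; on failure, undo completed steps in reverse."""
    completed = []
    for action, compensate in steps:
        try:
            action(context)
        except Exception:
            for undo in reversed(completed):
                undo(context)
            raise
        completed.append(compensate)


def step(name, fail=False):
    """Build a fake action that logs its name and optionally fails."""
    def run(context):
        context["log"].append(name)
        if fail:
            raise RuntimeError(f"{name} failed")
    return run


steps = [
    (step("CreateOrder"), step("CancelOrder")),
    (step("Charge"), step("Refund")),
    (step("DeductInventory"), step("RestoreInventory")),
    (step("ScheduleShip", fail=True), step("UnscheduleShip")),
]
context = {"log": []}
try:
    run_saga(steps, context)
except RuntimeError as error:
    print("saga aborted:", error)
print(context["log"])
```

The failing step's own compensation never runs, because that step never completed. Only the three earlier steps are undone, newest first.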

2) Human-in-the-loop #

Wait until a human approves / rejects:

Wait for callback
{
  "Type": "Task",
  "Resource": "arn:aws:states:::sns:publish.waitForTaskToken",
  "Parameters": {
    "TopicArn": "...",
    "Message": {
      "TaskToken.$": "$$.Task.Token",
      "OrderId.$": "$.orderId"
    }
  },
  "Next": "AfterApproval"
}

With waitForTaskToken, the token is handed off (email, Slack bot, etc.) and the state waits until someone calls SendTaskSuccess / SendTaskFailure with it. In a Standard workflow that wait can last up to 1 year.
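The other half of the pattern is whatever code runs when the human clicks approve or reject. A sketch of that callback, assuming a hypothetical complete_approval function and a client that exposes boto3's send_task_success / send_task_failure methods. A recording fake is used here so the example runs without AWS credentials.

```python
import json


def complete_approval(sfn_client, task_token, approved, reviewer):
    """Resume the waiting execution with the reviewer's decision."""
    if approved:
        sfn_client.send_task_success(
            taskToken=task_token,
            output=json.dumps({"approved": True, "reviewer": reviewer}),
        )
    else:
        sfn_client.send_task_failure(
            taskToken=task_token,
            error="ApprovalRejected",  # hypothetical error name, catchable in ASL
            cause=f"Rejected by {reviewer}",
        )


class RecordingClient:
    """Fake client that records calls instead of contacting AWS."""

    def __init__(self):
        self.calls = []

    def send_task_success(self, **kwargs):
        self.calls.append(("send_task_success", kwargs))

    def send_task_failure(self, **kwargs):
        self.calls.append(("send_task_failure", kwargs))


client = RecordingClient()
complete_approval(client, "token-from-sns-message", approved=True, reviewer="kim")
complete_approval(client, "token-from-sns-message", approved=False, reviewer="lee")
for call in client.calls:
    print(call)
```

In a real handler the client would come from boto3.client("stepfunctions"), and a rejection surfaces in the workflow as an error that a Catch can route on.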

3) Polling pattern #

Wait for a long-running external job to complete:

StartJob → WaitState(30s) → CheckJob → Choice
                         (still running) → back to WaitState
                         (done) → Next
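Spelled out as a definition, the loop might look like the sketch below, built as a Python dictionary and dumped to JSON. The state names follow the diagram above. The $.job.status field, the SUCCEEDED / FAILED values, the IsJobDone, JobFailed and Done states, and the elided Resource values are all assumptions for illustration.

```python
import json

polling_definition = {
    "StartAt": "StartJob",
    "States": {
        "StartJob": {"Type": "Task", "Resource": "...", "ResultPath": "$.job", "Next": "WaitState"},
        "WaitState": {"Type": "Wait", "Seconds": 30, "Next": "CheckJob"},
        "CheckJob": {"Type": "Task", "Resource": "...", "ResultPath": "$.job", "Next": "IsJobDone"},
        "IsJobDone": {
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.job.status", "StringEquals": "SUCCEEDED", "Next": "Done"},
                {"Variable": "$.job.status", "StringEquals": "FAILED", "Next": "JobFailed"},
            ],
            # Anything else means "still running": loop back and wait again.
            "Default": "WaitState",
        },
        "JobFailed": {"Type": "Fail", "Error": "JobFailed", "Cause": "External job reported failure"},
        "Done": {"Type": "Succeed"},
    },
}
print(json.dumps(polling_definition, indent=2))
```

Each trip around the loop costs three state transitions in Standard mode, so the wait interval is worth choosing deliberately.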

4) Express workflow — event processing #

EventBridge / SQS triggers → short processing (1–3 Lambdas) → result to DynamoDB / S3.

Express’s high throughput and short time limit fit naturally.

Compared to Lambda — when to use what #

When one Lambda is enough #

  • 1–2 step short work
  • No need for visualization / human tracing
  • Very high frequency + very short runs (per-transition cost starts to matter)

When Step Functions fits #

  • 3+ steps + branching / retry / parallel
  • Failures / progress need to be human-visible
  • Long-running interaction with external systems (human approval, external APIs)
  • The workflow itself is a business asset (with change history)

Together #

In most cases they go together. Step Functions controls flow; each step is a Lambda / ECS / SDK call.

Common pitfalls #

1) JSONPath typo #

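A single misplaced dot or $ in "Variable": "$.user.plan" means the path no longer points at the data you meant, so the execution either fails with a runtime error or falls through to Default. Use the console’s per-step input/output inspector to check what each state actually received.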

2) Lambda output too big #

The payload cap for each state's input and output is 256 KB. For larger data, store it in S3 and pass the bucket and key instead.

good pattern
{
  "s3Bucket": "myapp-pipeline",
  "s3Key": "jobs/abc123/input.json"
}
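Inside a Lambda, the decision can be made at return time. A minimal sketch, assuming a hypothetical to_state_output helper and an injected upload callable. In a real function that callable might wrap boto3's s3.put_object(Bucket=..., Key=..., Body=...).

```python
import json

MAX_PAYLOAD_BYTES = 256 * 1024  # the per-state payload cap


def to_state_output(result, bucket, key, upload):
    """Return result inline if it fits, otherwise upload it and return a pointer."""
    body = json.dumps(result).encode("utf-8")
    if len(body) <= MAX_PAYLOAD_BYTES:
        return result
    upload(bucket, key, body)
    return {"s3Bucket": bucket, "s3Key": key}


uploaded = {}


def fake_upload(bucket, key, body):
    # Stand-in for an S3 upload; stores the bytes in a dictionary instead.
    uploaded[(bucket, key)] = body


small = {"rows": [1, 2, 3]}
large = {"rows": ["x" * 100] * 5000}  # roughly 500 KB once serialized

print(to_state_output(small, "myapp-pipeline", "jobs/abc123/input.json", fake_upload))
print(to_state_output(large, "myapp-pipeline", "jobs/abc123/input.json", fake_upload))
```

The next state then needs to know that it may receive a pointer instead of the data, so it is often simpler to always return the pointer for steps that can produce large results.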

3) Cost per state transition #

Standard mode charges $0.025 per 1,000 state transitions, so a workflow's cost is roughly transitions per execution × executions. Workflows with many small states (lots of Pass states, for example) can add up unexpectedly.
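A back-of-the-envelope estimate using the price quoted above. The workload numbers are made up, monthly_cost is a hypothetical helper, and the sketch ignores any free tier and regional price differences.

```python
PRICE_PER_TRANSITION = 0.025 / 1000  # USD, the Standard rate quoted above


def monthly_cost(transitions_per_execution, executions_per_month):
    """Estimated Standard workflow cost in USD for one month."""
    return transitions_per_execution * executions_per_month * PRICE_PER_TRANSITION


# Hypothetical workload: 12 transitions per run, one million runs a month.
print(f"${monthly_cost(12, 1_000_000):,.2f}")  # prints $300.00
```

Removing two Pass states from that workflow would cut the bill by a sixth, which is why trimming purely cosmetic states can be worthwhile at high volume.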

4) Lambda cold start at every step #

If each Task invokes a separate Lambda, each of those functions can cold-start. Mitigate with Express workflows plus Provisioned Concurrency, or by merging several steps into one Lambda.

5) Retry BackoffRate explosion #

With IntervalSeconds: 1, MaxAttempts: 10, BackoffRate: 3.0, the waits go 1 → 3 → 9 → 27 → 81 sec and keep tripling; users won’t wait that long. Work out the total wait time before deciding on values.
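Adding up the waits makes the problem concrete. In this sketch total_wait is a hypothetical helper. It assumes IntervalSeconds of 1 and models a per-retry cap the way I understand the Retry field MaxDelaySeconds to work, which is worth checking against the ASL documentation before relying on it.

```python
def total_wait(interval_seconds, max_attempts, backoff_rate, max_delay_seconds=None):
    """Total seconds spent waiting if every retry attempt is used."""
    total = 0.0
    for attempt in range(max_attempts):
        delay = interval_seconds * backoff_rate ** attempt
        if max_delay_seconds is not None:
            delay = min(delay, max_delay_seconds)  # cap each individual wait
        total += delay
    return total


uncapped = total_wait(1, 10, 3.0)
capped = total_wait(1, 10, 3.0, max_delay_seconds=60)
print(f"uncapped: {uncapped:.0f}s (about {uncapped / 3600:.1f} hours)")
print(f"capped at 60s per retry: {capped:.0f}s")
```

Uncapped, the ten retries add up to 29,524 seconds, a little over eight hours. Capping each wait at 60 seconds brings the total down to 400 seconds.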

6) Catch only runs once #

Catch fires once per failure: when no Retry rule matches the error, or after all retries are exhausted. If the state that Catch routes to fails itself, the workflow fails. Consider adding Retry to that handler state too.

7) External calls invisible to visualization #

boto3 calls inside Lambda don’t show up in visualization / tracing. When possible, lift them to ASL Tasks via Service Integration.

Wrap-up #

What you took home this time:

  • Where Step Functions fits — function / SDK calls across multiple steps as a JSON workflow. Visualization / retry / branching / parallel are declarative
  • Standard vs Express — Standard for long, audit-worthy business / Express for short, high-throughput events
  • ASL — JSON definition. StartAt + States + per-state Type / Next
  • Four core states — Task (work) / Choice (branch) / Parallel (parallel) / Map (collection)
  • Service Integration — SDK directly without Lambda. ECS .sync, DynamoDB, SNS, etc.
  • Retry / Catch — declarative per step. Backoff and catch flows
  • Common patterns — Saga (compensating transactions), Human-in-the-loop (waitForTaskToken), Polling, Express event processing
  • Lambda vs Step Functions — 1–2 steps → one Lambda. 3+ steps + branching / retry / visualization → Step Functions
  • Pitfalls — JSONPath typo, 256 KB payload (S3 workaround), state-transition cost, multi-step Lambda cold start, BackoffRate explosion, Catch fires once, external calls outside Service Integration

Wrapping up the series #

From #1 ECS / Fargate through 7 posts — containers (ECS / ECR), serverless (Lambda / API Gateway), messaging (EventBridge / SQS / SNS), secrets (Secrets Manager / Parameter Store), workflow (Step Functions) — the AWS ops toolbox is in place.

Layered on top of the Basics 7 posts (IAM / cost / security) and Intermediate 7 posts (EC2 / VPC / S3 / RDS / ALB / CloudFront), these 7 cover about 90% of what you need to run a backend on AWS.

Up next — AWS in Practice #

That’s the theory covered. Time to build a real backend project.

AWS in Practice #1 — deploy a backend on ECS Fargate takes the API from Modern Python in Practice (FastAPI) / Django in Practice (DRF) and runs it on ECS Fargate, in operable form. RDS, ALB, ACM, Route 53, Secrets Manager all come together — the start of a 6-post practice track.
