
🔄step-functions

Plugin: aws-dev-toolkit

Description

Design and build AWS Step Functions workflows. Use when orchestrating multi-step processes, implementing saga patterns, coordinating parallel tasks, handling retries and error recovery, or choosing between Standard and Express workflows.

Use Cases

✓ Orchestrate multi-step processes
✓ Implement saga patterns
✓ Coordinate parallel tasks
✓ Handle retries and error recovery

Decision Framework: Standard vs Express

Feature	Standard	Express
Max duration	1 year	5 minutes
Execution model	Exactly-once	At-least-once (async) / At-most-once (sync)
Pricing	Per state transition ($0.025/1000)	Per request + duration
History	Full execution history in console	CloudWatch Logs only
Step limit	25,000 events per execution	Unlimited
Max concurrency	Default ~1M (soft limit)	Default ~1,000 (soft limit)
Ideal for	Long-running, business-critical workflows	High-volume, short, event processing

Opinionated recommendation:

Default to Standard for business workflows, orchestration, and anything requiring auditability.
Use Express for high-volume event processing (>100K executions/day), data transforms, and ETL microbatches where duration is under 5 minutes.
Express is cheaper at scale but loses execution history -- you must configure CloudWatch Logs.
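
To make the "cheaper at scale" claim concrete, here is a rough cost sketch. The Standard rate is the one from the table above; the Express request and duration rates, the 64 MB memory figure, and the workload numbers are assumptions for illustration, so check the current pricing page for your region before relying on them.

```python
# Rough monthly cost comparison between Standard and Express workflows.
# The Standard rate comes from the table above. The Express rates below are
# ASSUMED values for illustration only.

STANDARD_PER_TRANSITION = 0.025 / 1000   # USD per state transition
EXPRESS_PER_REQUEST = 1.00 / 1_000_000   # USD per request (assumed)
EXPRESS_PER_GB_SECOND = 0.00001667       # USD per GB-second (assumed)


def standard_cost(executions: int, transitions_per_execution: int) -> float:
    """Standard workflows are billed per state transition."""
    return executions * transitions_per_execution * STANDARD_PER_TRANSITION


def express_cost(executions: int, avg_seconds: float, memory_gb: float) -> float:
    """Express workflows are billed per request plus duration."""
    request_cost = executions * EXPRESS_PER_REQUEST
    duration_cost = executions * avg_seconds * memory_gb * EXPRESS_PER_GB_SECOND
    return request_cost + duration_cost


if __name__ == "__main__":
    monthly_executions = 100_000 * 30  # 100K executions/day (hypothetical workload)
    print(f"Standard: ${standard_cost(monthly_executions, 10):,.2f}")
    print(f"Express:  ${express_cost(monthly_executions, 2.0, 0.064):,.2f}")
```

Under these assumptions, a 10-transition workflow that runs for about two seconds costs far less on Express, which is why volume and duration drive the choice.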

State Types

Task State (does work)

Opinionated: Always add Retry and Catch to every Task state. Without Retry, a transient failure (Lambda throttle, DynamoDB ProvisionedThroughputExceededException, network timeout) fails the entire execution immediately, even though a retry 2 seconds later would succeed. Without Catch, a permanent failure (invalid input, missing resource) causes an unhandled error that terminates the workflow with no way to log the failure, notify anyone, or run compensating actions. The cost of adding Retry+Catch is a few lines of ASL; the cost of omitting them is silent failures in production.

Direct Service Integrations (prefer over Lambda wrappers)

Step Functions can call 200+ AWS services directly. Do NOT wrap simple API calls in Lambda. Common direct integrations to use instead of Lambda:

DynamoDB: GetItem, PutItem, UpdateItem, DeleteItem, Query
SQS: SendMessage
SNS: Publish
EventBridge: PutEvents
ECS/Fargate: RunTask (for long-running containers)
Glue: StartJobRun
SageMaker: CreateTransformJob, CreateTrainingJob
Bedrock: InvokeModel

See references/integrations.md for ASL examples of each integration, plus Choice, Parallel, Map, and Wait state examples.
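
As a minimal sketch of a direct integration that also follows the Retry+Catch rule, the snippet below builds a DynamoDB PutItem Task state as a Python dict and prints it as ASL JSON. The table name, attribute names, state names (RecordFailure, NotifyWarehouse), and input shape are hypothetical, and the DynamoDB error name is assumed to follow the `DynamoDB.<ExceptionName>` convention.

```python
import json

# Sketch of a Task state that writes to DynamoDB directly (no Lambda wrapper).
# Hypothetical: table name, attribute names, state names, and input shape.
# The DynamoDB error name is an assumption; verify it against the
# integration's documentation before using it.
save_order_state = {
    "Type": "Task",
    "Resource": "arn:aws:states:::dynamodb:putItem",
    "Parameters": {
        "TableName": "Orders",
        "Item": {
            "orderId": {"S.$": "$.orderId"},  # value taken from the state input
            "status": {"S": "RECEIVED"},
        },
    },
    "ResultPath": "$.saveResult",  # keep the original input next to the result
    "Retry": [
        {
            "ErrorEquals": ["DynamoDB.ProvisionedThroughputExceededException"],
            "IntervalSeconds": 2,
            "MaxAttempts": 5,
            "BackoffRate": 2.0,
            "JitterStrategy": "FULL",
        }
    ],
    "Catch": [
        {
            "ErrorEquals": ["States.ALL"],
            "Next": "RecordFailure",
            "ResultPath": "$.error",
        }
    ],
    "Next": "NotifyWarehouse",
}

if __name__ == "__main__":
    print(json.dumps({"SaveOrder": save_order_state}, indent=2))
```

Generating ASL from code like this is only one option; the same structure can be written by hand in definition.json.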

Other State Types

Choice: Branch based on input values (string, numeric, boolean comparisons)
Parallel: Run multiple branches concurrently; if any branch fails, the whole Parallel state fails and its Catch can handle the error
Map (Inline): Iterate over a collection with configurable MaxConcurrency
Map (Distributed): Process millions of items from S3 with Express child executions
Wait: Pause for a duration or until a timestamp

Error Handling: Retry and Catch

Retry Strategy

"Retry": [
  {
    "ErrorEquals": ["States.Timeout"],
    "IntervalSeconds": 5,
    "MaxAttempts": 2,
    "BackoffRate": 2.0
  },
  {
    "ErrorEquals": ["TransientError", "Lambda.ServiceException"],
    "IntervalSeconds": 1,
    "MaxAttempts": 5,
    "BackoffRate": 2.0,
    "JitterStrategy": "FULL"
  },
  {
    "ErrorEquals": ["States.ALL"],
    "MaxAttempts": 0
  }
]

Opinionated: Order retriers from specific to general, because Step Functions uses the first retrier whose ErrorEquals matches. Use JitterStrategy: FULL to prevent thundering herd. Put States.ALL with MaxAttempts: 0 last so that unexpected errors fail immediately instead of being retried.
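
To see what these settings mean in wall-clock terms, the sketch below computes the wait before each retry from IntervalSeconds, MaxAttempts, and BackoffRate. The function name is hypothetical, and the jitter branch assumes that FULL jitter picks a random delay between 0 and the computed backoff value.

```python
import random


def retry_delays(interval_seconds, max_attempts, backoff_rate, jitter=False, rng=None):
    """Return the wait in seconds before each retry attempt.

    The first retry waits interval_seconds; each later retry multiplies the
    previous wait by backoff_rate. With jitter=True the wait is drawn
    uniformly from [0, computed wait] (assumed behaviour of FULL jitter).
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_attempts):
        ceiling = interval_seconds * backoff_rate ** attempt
        delays.append(rng.uniform(0, ceiling) if jitter else ceiling)
    return delays


if __name__ == "__main__":
    # The States.Timeout retrier above: 2 retries starting at 5 seconds.
    print(retry_delays(5, 2, 2.0))
    # The transient-error retrier above, first without and then with jitter.
    print(retry_delays(1, 5, 2.0))
    print(retry_delays(1, 5, 2.0, jitter=True, rng=random.Random(42)))
```

Without jitter, every execution that failed at the same moment retries at the same moments; spreading the waits is what breaks up the herd.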

Catch and Error Recovery

"Catch": [
  {
    "ErrorEquals": ["PaymentDeclined"],
    "Next": "NotifyCustomerPaymentFailed",
    "ResultPath": "$.error"
  },
  {
    "ErrorEquals": ["States.ALL"],
    "Next": "GenericErrorHandler",
    "ResultPath": "$.error"
  }
]

Always use ResultPath in Catch to preserve the original input alongside the error. Without it, the error replaces your entire state input.
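
The difference is easy to miss, so here is a small simulation of it. It is a sketch that handles only "$" and the single-level "$.<field>" form, and the function name and sample values are hypothetical.

```python
import copy


def apply_catch(state_input: dict, error_output: dict, result_path: str = "$") -> dict:
    """Mimic how a Catch clause builds the input for its Next state.

    Only "$" (the default when ResultPath is omitted) and "$.<field>" are
    supported in this sketch.
    """
    if result_path == "$":
        # The error output replaces the whole input.
        return copy.deepcopy(error_output)
    if not result_path.startswith("$.") or "." in result_path[2:]:
        raise ValueError("this sketch supports only '$' and '$.<field>'")
    merged = copy.deepcopy(state_input)
    merged[result_path[2:]] = copy.deepcopy(error_output)
    return merged


if __name__ == "__main__":
    order = {"orderId": "12345", "amount": 42}
    error = {"Error": "PaymentDeclined", "Cause": "card expired"}
    print(apply_catch(order, error))             # orderId is lost
    print(apply_catch(order, error, "$.error"))  # orderId kept, error added
```

With "$.error", the handler state still sees orderId and can decide which customer to notify.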

Pattern: Saga (Compensating Transactions)

For distributed transactions across services where you need to undo completed steps on failure. Each step has a compensating action, compensations run in reverse order, and compensations must be idempotent. See references/patterns.md for the full ASL example with compensating transaction flow.
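
The control flow can be sketched outside ASL as well. The following is a plain-Python illustration of the saga rule (compensate completed steps in reverse order), not the state machine from references/patterns.md; all step names are hypothetical.

```python
from typing import Callable, List, Tuple

# (name, action, compensation); every compensation should be idempotent.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> List[str]:
    """Run steps in order; on failure, undo completed steps in reverse order."""
    log: List[str] = []
    completed: List[Tuple[str, Callable[[], None]]] = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"compensated:{done_name}")
            break
        log.append(f"done:{name}")
        completed.append((name, compensate))
    return log


def _ok() -> None:
    pass


def _fail() -> None:
    raise RuntimeError("carrier unavailable")


if __name__ == "__main__":
    steps = [
        ("reserve_inventory", _ok, _ok),
        ("charge_payment", _ok, _ok),
        ("create_shipment", _fail, _ok),
    ]
    for line in run_saga(steps):
        print(line)
```

In a state machine, each `except` branch corresponds to a Catch that routes to the compensation states, chained from the most recent completed step back to the first.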

Pattern: Human Approval (Callback)

Use .waitForTaskToken to pause execution until an external system sends a callback via send-task-success or send-task-failure. Always set TimeoutSeconds on callback tasks. Without it, a callback that never arrives leaves the task waiting until the execution itself ends, which can take up to 1 year for Standard workflows. See references/patterns.md for the full ASL and CLI examples.
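
A callback task with a timeout might look like the sketch below, again built as a Python dict and printed as ASL JSON. The queue URL, state names, message fields, and the 24-hour timeout are hypothetical choices; the approver is assumed to read the token from the message and pass it back with send-task-success or send-task-failure.

```python
import json

# Sketch of a human-approval Task state using the callback pattern.
# Hypothetical: queue URL, state names, message fields, and timeout value.
approval_state = {
    "Type": "Task",
    "Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
    "Parameters": {
        "QueueUrl": "https://sqs.us-east-1.amazonaws.com/123456789012/approval-requests",
        "MessageBody": {
            "orderId.$": "$.orderId",
            # The task token comes from the context object ($$).
            "taskToken.$": "$$.Task.Token",
        },
    },
    "TimeoutSeconds": 86400,  # give up after 24 hours instead of waiting indefinitely
    "Catch": [
        {
            "ErrorEquals": ["States.Timeout"],
            "Next": "EscalateApproval",
            "ResultPath": "$.error",
        }
    ],
    "Next": "ProcessApprovedOrder",
}

if __name__ == "__main__":
    print(json.dumps({"WaitForApproval": approval_state}, indent=2))
```

Catching States.Timeout turns a missing approval into an explicit escalation path rather than a stuck execution.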

Pattern: Distributed Map

Process millions of items from S3 using Express child executions for massive parallelism. See references/patterns.md for the ASL example with S3 CSV reader configuration.
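
For orientation, a Distributed Map state has roughly the shape sketched below. This is an illustration, not the example from references/patterns.md: the bucket, key, Lambda function name, state names, concurrency, and failure tolerance are all hypothetical values.

```python
import json

# Sketch of a Distributed Map state that reads rows from a CSV file in S3
# and processes each row in an Express child execution.
# Hypothetical: bucket, key, function name, state names, and numeric limits.
distributed_map_state = {
    "Type": "Map",
    "ItemReader": {
        "Resource": "arn:aws:states:::s3:getObject",
        "ReaderConfig": {"InputType": "CSV", "CSVHeaderLocation": "FIRST_ROW"},
        "Parameters": {"Bucket": "my-input-bucket", "Key": "orders/batch.csv"},
    },
    "ItemProcessor": {
        "ProcessorConfig": {"Mode": "DISTRIBUTED", "ExecutionType": "EXPRESS"},
        "StartAt": "ProcessRow",
        "States": {
            "ProcessRow": {
                "Type": "Task",
                "Resource": "arn:aws:states:::lambda:invoke",
                "Parameters": {"FunctionName": "process-order-row", "Payload.$": "$"},
                "Retry": [
                    {
                        "ErrorEquals": [
                            "Lambda.ServiceException",
                            "Lambda.TooManyRequestsException",
                        ],
                        "IntervalSeconds": 1,
                        "MaxAttempts": 5,
                        "BackoffRate": 2.0,
                        "JitterStrategy": "FULL",
                    }
                ],
                "End": True,
            }
        },
    },
    "MaxConcurrency": 500,            # cap on parallel child executions
    "ToleratedFailurePercentage": 5,  # fail the Map run above 5% failed items
    "Catch": [
        {"ErrorEquals": ["States.ALL"], "Next": "HandleBatchFailure", "ResultPath": "$.error"}
    ],
    "Next": "SummarizeBatch",
}

if __name__ == "__main__":
    print(json.dumps({"ProcessOrders": distributed_map_state}, indent=2))
```

MaxConcurrency is worth setting deliberately, since downstream services such as Lambda or DynamoDB usually become the bottleneck before Step Functions does.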

Common CLI Commands

# Create state machine (the 12-digit account ID is a placeholder)
aws stepfunctions create-state-machine \
  --name my-workflow \
  --definition file://definition.json \
  --role-arn arn:aws:iam::123456789012:role/step-functions-role

# Start execution
aws stepfunctions start-execution \
  --state-machine-arn arn:aws:states:us-east-1:123456789012:stateMachine:my-workflow \
  --input '{"orderId": "12345"}'

# List failed executions
aws stepfunctions list-executions \
  --state-machine-arn arn:aws:states:us-east-1:123456789012:stateMachine:my-workflow \
  --status-filter FAILED

# Get execution details
aws stepfunctions describe-execution \
  --execution-arn arn:aws:states:us-east-1:123456789012:execution:my-workflow:exec-123

# Get execution history, filtered to failure events (debug step-by-step)
aws stepfunctions get-execution-history \
  --execution-arn arn:aws:states:us-east-1:123456789012:execution:my-workflow:exec-123 \
  --query "events[?type=='TaskFailed' || type=='ExecutionFailed']"

# Update state machine
aws stepfunctions update-state-machine \
  --state-machine-arn arn:aws:states:us-east-1:123456789012:stateMachine:my-workflow \
  --definition file://definition.json

# Test a single state in isolation (calls the TestState API; no state machine required)
aws stepfunctions test-state \
  --definition '{"Type":"Task","Resource":"arn:aws:states:::dynamodb:getItem","Parameters":{"TableName":"Orders","Key":{"orderId":{"S":"123"}}}}' \
  --role-arn arn:aws:iam::123456789012:role/step-functions-role \
  --input '{"orderId": "123"}'

Workflow Studio

Use Workflow Studio in the AWS Console for:

Visual design and prototyping (drag-and-drop states)
Understanding existing workflows
Quick iteration on state machine logic

Opinionated: Start in Workflow Studio for prototyping, then export to ASL (Amazon States Language) JSON and manage in version control. Never rely solely on the console for production workflows.

Input/Output Processing

Data flows through each state as: InputPath -> Parameters -> Task -> ResultSelector -> ResultPath -> OutputPath

Opinionated: Use ResultPath generously to accumulate data through states. Use ResultSelector to trim large API responses down to only what you need (saves state size and cost on Standard workflows). See references/integrations.md for detailed examples of each processing stage.
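
The effect of ResultSelector followed by ResultPath can be simulated in a few lines. This is a simplified sketch that resolves only plain "$.a.b" paths and single-level ResultPath targets; the function names and the sample Lambda response are hypothetical.

```python
import copy


def get_path(doc, path):
    """Resolve a simple reference path such as "$" or "$.a.b"."""
    node = doc
    for key in [p for p in path.split(".")[1:] if p]:
        node = node[key]
    return node


def apply_result_selector(result, selector):
    """Build a new object: "name.$" keys are resolved against the raw result,
    all other keys are copied as literals."""
    selected = {}
    for key, value in selector.items():
        if key.endswith(".$"):
            selected[key[:-2]] = get_path(result, value)
        else:
            selected[key] = value
    return selected


def apply_result_path(state_input, result, result_path):
    """Place the result inside a copy of the input at "$.<field>"."""
    output = copy.deepcopy(state_input)
    output[result_path[2:]] = result
    return output


if __name__ == "__main__":
    state_input = {"orderId": "12345"}
    raw_result = {  # shape of a Lambda invoke response, values made up
        "StatusCode": 200,
        "ExecutedVersion": "$LATEST",
        "Payload": {"total": 42.5, "currency": "USD", "debug": ["a", "b"]},
    }
    selector = {"total.$": "$.Payload.total", "currency.$": "$.Payload.currency"}
    trimmed = apply_result_selector(raw_result, selector)
    print(apply_result_path(state_input, trimmed, "$.pricing"))
```

The next state receives the original orderId plus a small pricing object, and none of the response metadata travels further through the workflow.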

Anti-Patterns

Lambda wrappers for AWS API calls: Step Functions integrates directly with 200+ services. Don't write a Lambda just to call DynamoDB PutItem or SQS SendMessage.
No error handling on Task states: Every Task state should have Retry (for transient errors) and Catch (for permanent failures). No exceptions.
Ignoring state payload limits: Step Functions caps the input or output of a state at 256 KB, for both Standard and Express workflows. Store large data in S3 and pass references.
Using Standard for high-volume short tasks: If you're running >100K executions/day with <5 min duration, Express workflows are dramatically cheaper.
Missing TimeoutSeconds on callback tasks: Without a timeout, .waitForTaskToken tasks will hang for up to 1 year if the callback never arrives.
Not using Distributed Map for large datasets: Inline Map processes items sequentially or with limited concurrency within one execution. Distributed Map scales to millions of items.
Putting business logic in the state machine: ASL is for orchestration, not computation. Complex data transforms and business rules belong in Lambda functions.
Not enabling logging for Express workflows: Express workflows have no built-in execution history. You MUST configure CloudWatch Logs or you'll have zero visibility.
Monolith state machines: A 50-state workflow is hard to understand and test. Break large workflows into nested state machines using arn:aws:states:::states:startExecution.sync:2.
Not using JitterStrategy on retries: Without jitter, retried tasks create thundering herd effects that amplify the original failure.
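
For the payload-limit item in particular, a guard like the following can decide whether to pass data inline or as an S3 reference. It is a sketch: the function names, the 80% headroom, and the pointer shape are hypothetical, the size check is an approximation based on the serialized JSON, and the actual S3 upload is left to the caller.

```python
import json

PAYLOAD_LIMIT_BYTES = 256 * 1024  # the 256 KB limit mentioned in the list above


def payload_size(payload) -> int:
    """Approximate payload size as the UTF-8 length of its JSON encoding."""
    return len(json.dumps(payload).encode("utf-8"))


def to_state_input(payload, s3_bucket: str, s3_key: str, headroom: float = 0.8):
    """Return the payload itself when it is comfortably small, else an S3 pointer.

    The caller is expected to upload the payload to s3_bucket/s3_key when a
    pointer is returned; this sketch only decides what to pass along.
    """
    if payload_size(payload) <= PAYLOAD_LIMIT_BYTES * headroom:
        return payload
    return {"payloadLocation": {"bucket": s3_bucket, "key": s3_key}}


if __name__ == "__main__":
    small = {"orderId": "12345"}
    large = {"blob": "x" * 300_000}
    print(to_state_input(small, "my-bucket", "payloads/12345.json"))
    print(to_state_input(large, "my-bucket", "payloads/12345.json"))
```

Leaving headroom matters because states that accumulate data with ResultPath grow the payload as the execution progresses.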

Cost Optimization

Standard: $0.025 per 1,000 state transitions. Minimize states. Use direct integrations to avoid Lambda invocation costs on top of transition costs.
Express: Priced by number of requests and duration. Cheaper for high-volume, short workflows.
Pass states are not free in Standard (they count as transitions). Eliminate unnecessary Pass states.
Combine simple sequential tasks where possible to reduce transition count.
Use ResultSelector to trim response payloads -- smaller payloads mean faster processing.

Reference Files

references/patterns.md -- Saga, callback, and Distributed Map patterns with full ASL examples
references/integrations.md -- Direct service integration examples (DynamoDB, SQS, SNS, EventBridge, ECS, Bedrock), state type ASL, and input/output processing pipeline details

Related Skills

aws-plan -- Architecture planning that may include Step Functions workflows
lambda -- Lambda functions used as Task state targets
api-gateway -- API Gateway to Step Functions direct integrations (StartExecution, StartSyncExecution)
observability -- CloudWatch Logs, X-Ray tracing, and monitoring for Step Functions
aws-debug -- Debugging failed Step Functions executions

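The original text and its copyright belong to Anthropic and the respective plugin authors. The Japanese translation of this page was machine-generated using the Claude API.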
