
output-eval-error-analysis

Plugin: outputai

Description


Systematically review workflow traces to identify failure modes before building evaluators. Use when starting an eval project, after significant pipeline changes, or when production quality drops.

Use cases

✓ Starting an eval project
✓ After significant pipeline changes
✓ When production quality drops


Error Analysis for Workflow Evaluation

Overview

Review real workflow traces and categorize how your workflow fails before writing any evaluators. Evaluators built without error analysis target generic qualities ("is this good?") instead of the specific ways your workflow actually breaks. This skill walks you through the process.

When to Use

Starting a new eval project for an existing workflow
Production quality has dropped and you need to understand why
After significant prompt, model, or pipeline changes
Before building your first evaluator for a workflow

Step 1: Collect Traces

Gather 50-100 representative workflow executions. The more traces you review, the more reliable your failure categories will be.

From recent runs

List recent workflow executions and pull their traces:

# List recent runs for a workflow
npx output workflow runs list <workflowName>

# Pull a specific trace as JSON
npx output workflow debug <workflowId> --json

From production (bulk download)

Download production traces directly into dataset YAML files:

# Download up to 20 recent traces as dataset files
npx output workflow dataset generate <workflowName> --download --limit 20

This creates YAML files in tests/datasets/ with the input and last_output fields populated from real executions.

From scenario-driven generation

If production traces are sparse, generate traces from scenario inputs:

# Generate a dataset from a scenario file
npx output workflow dataset generate <workflowName> basic --name basic_trace

# Generate from inline JSON
npx output workflow dataset generate <workflowName> --input '{"topic": "AI safety"}' --name ai_safety_trace

Run enough inputs to get 50+ traces. Prioritize diversity over volume — vary inputs across the dimensions you expect to matter.
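
One way to vary inputs across dimensions is to enumerate a small grid and emit one `dataset generate` command per combination. The sketch below is only an illustration. It assumes a hypothetical workflow named `blog_generator` whose input accepts `topic`, `tone`, and `min_length` (field names borrowed from the dataset example in Step 4); the dimension values and the dataset naming scheme are invented. It prints commands that use the `--input` and `--name` flags shown above and does not run them.

```python
import itertools
import json
import re
import shlex

# Hypothetical dimensions; replace with the axes that matter for your workflow.
DIMENSIONS = {
    "topic": ["AI safety", "database indexing", "remote work"],
    "tone": ["formal", "casual"],
    "min_length": [200, 500],
}


def build_commands(workflow_name, dimensions):
    """Return one `dataset generate` command per combination of dimension values."""
    keys = list(dimensions)
    commands = []
    for values in itertools.product(*(dimensions[k] for k in keys)):
        payload = dict(zip(keys, values))
        # Derive a filesystem-friendly dataset name from the values.
        joined = "_".join(str(v) for v in values).lower()
        slug = re.sub(r"[^a-z0-9]+", "_", joined).strip("_")
        commands.append(
            "npx output workflow dataset generate "
            f"{shlex.quote(workflow_name)} "
            f"--input {shlex.quote(json.dumps(payload))} "
            f"--name {slug}_trace"
        )
    return commands


commands = build_commands("blog_generator", DIMENSIONS)
for command in commands:
    print(command)
```

This grid yields 12 commands (3 topics x 2 tones x 2 lengths), so reaching 50+ traces means adding values or further dimensions rather than repeating the same inputs.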

Step 2: Review Traces Individually

Review each trace one at a time. For each trace, record:

| Field | What to write |
|-------|---------------|
| Trace ID | The workflow execution ID |
| Verdict | Pass or Fail (binary; no "partial" at this stage) |
| Root cause | If Fail: what specifically went wrong and why |
| Notes | Anything surprising or worth remembering |

Review template

Create a file to track your reviews. A simple markdown table works:

# Error Analysis: <workflow_name>
# Date: YYYY-MM-DD
# Traces reviewed: 0 / 50

| # | Trace ID | Verdict | Root Cause | Notes |
|---|----------|---------|------------|-------|
| 1 | abc-123  | Fail    | Hallucinated a URL that doesn't exist | Common with technical topics |
| 2 | def-456  | Pass    | — | Clean output |
| 3 | ghi-789  | Fail    | Ignored the "formal tone" requirement | Input had conflicting signals |
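
If you would rather keep reviews in a script than edit the table by hand, a small record type can enforce the two rules above: the verdict is binary, and every Fail carries a root cause. This is a minimal sketch, not part of the Output tooling; `TraceReview` and `to_markdown` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class TraceReview:
    """One row of the review table."""

    trace_id: str
    verdict: str          # "Pass" or "Fail" only
    root_cause: str = ""  # required when verdict is "Fail"
    notes: str = ""

    def __post_init__(self):
        if self.verdict not in ("Pass", "Fail"):
            raise ValueError(f"verdict must be Pass or Fail, got {self.verdict!r}")
        if self.verdict == "Fail" and not self.root_cause.strip():
            raise ValueError("a Fail verdict needs a specific root cause")


def to_markdown(reviews):
    """Render reviews as a markdown table with the template's columns."""
    lines = [
        "| # | Trace ID | Verdict | Root Cause | Notes |",
        "|---|----------|---------|------------|-------|",
    ]
    for index, review in enumerate(reviews, start=1):
        lines.append(
            f"| {index} | {review.trace_id} | {review.verdict} | "
            f"{review.root_cause or '-'} | {review.notes} |"
        )
    return "\n".join(lines)


reviews = [
    TraceReview("abc-123", "Fail", "Hallucinated a URL that doesn't exist",
                "Common with technical topics"),
    TraceReview("def-456", "Pass", notes="Clean output"),
]
print(to_markdown(reviews))
```

Constructing `TraceReview("ghi-789", "Partial")` raises `ValueError`, which keeps "partial" verdicts out of the log at this stage.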

What to look for in each trace

Open the JSON trace and examine:

Final output — Does it meet the user's intent? Is it correct?
Step-by-step data flow — Did each step receive the right input and produce reasonable output?
LLM responses — Did the model follow instructions? Did it hallucinate?
Error states — Did any step fail, retry, or produce unexpected errors?

Critical rule: read first, categorize second

Review at least 30 traces before naming any failure categories. Premature categorization causes you to see patterns that aren't there and miss patterns that are. Just record what you observe.

Step 3: Group Into Failure Categories

After reviewing 30+ traces, patterns will emerge. Group your failures into 5-10 categories based on root cause, not surface symptoms.

Good categories (root cause)

"Hallucinated URLs" — model invents links that don't exist
"Tone mismatch" — output tone doesn't match the requested persona
"Missing required section" — output omits a section the input explicitly requested
"Factual error" — output contains verifiably wrong claims
"Prompt injection leak" — user input manipulates the system prompt

Bad categories (surface symptoms)

"Bad output" — too vague, not actionable
"LLM error" — doesn't identify the specific failure
"Quality issue" — could mean anything

Splitting and merging

If a category has fewer than 3 examples, merge it into a broader category or note it as rare
If a category has 15+ examples and contains distinct sub-patterns, split it
Categories should be mutually exclusive — each failure belongs to exactly one category
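
The size thresholds above can be checked mechanically once each failed trace has exactly one category label. The sketch below is a standalone helper with hypothetical names and sample labels. It only flags candidates: whether a large category really contains distinct sub-patterns still takes a human read.

```python
from collections import Counter

MERGE_BELOW = 3   # fewer than 3 examples: merge or note as rare
SPLIT_AT = 15     # 15 or more examples: check for distinct sub-patterns


def review_category_sizes(labels):
    """Map each category to 'merge_or_note_rare', 'consider_split', or 'keep'."""
    advice = {}
    for category, count in Counter(labels).items():
        if count < MERGE_BELOW:
            advice[category] = "merge_or_note_rare"
        elif count >= SPLIT_AT:
            advice[category] = "consider_split"
        else:
            advice[category] = "keep"
    return advice


# Hypothetical labels, one per failed trace.
labels = ["hallucinated_url"] * 16 + ["tone_mismatch"] * 6 + ["encoding_glitch"] * 2
advice = review_category_sizes(labels)
print(advice)
```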

Example categorization

For a blog generation workflow after reviewing 60 traces:

| Category | Count | Rate | Example |
|----------|-------|------|---------|
| Hallucinated URLs | 8 | 13% | Invented links to non-existent pages |
| Tone mismatch | 6 | 10% | Casual tone when formal was requested |
| Off-topic drift | 5 | 8% | Blog about "AI" drifted to unrelated ML history |
| Missing sections | 4 | 7% | Skipped "conclusion" when explicitly requested |
| Too short | 3 | 5% | Under 200 words when 500+ requested |
| Total failures | 26 | 43% | |
| Passes | 34 | 57% | |
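
The rates in this table are each category's count divided by the 60 reviewed traces, rounded to the nearest percent. A short sketch of that arithmetic, using the counts from the table (the function name is hypothetical):

```python
def failure_rates(category_counts, total_traces):
    """Return (category, count, rounded percent) rows, highest count first."""
    rows = [
        (category, count, round(100 * count / total_traces))
        for category, count in category_counts.items()
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


counts = {
    "Hallucinated URLs": 8,
    "Tone mismatch": 6,
    "Off-topic drift": 5,
    "Missing sections": 4,
    "Too short": 3,
}
total_traces = 60
rows = failure_rates(counts, total_traces)
total_failures = sum(counts.values())

for category, count, percent in rows:
    print(f"{category}: {count} ({percent}%)")
print(f"Total failures: {total_failures} "
      f"({round(100 * total_failures / total_traces)}%)")
```

Because each failure belongs to exactly one category, the category counts sum to the total number of failed traces (26), which leaves 34 passes.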

Step 4: Label Datasets

Add ground_truth labels to your dataset YAML files so evaluators can validate against them. Each failure category maps to a future evaluator name.

YAML structure

name: ai_safety_trace
input:
  topic: "AI safety"
  tone: "formal"
  min_length: 500
last_output:
  output:
    title: "Understanding AI Safety"
    blog_post: "AI safety is super important and stuff..."
  executionTimeMs: 3200
  date: '2026-03-25T00:00:00.000Z'
ground_truth:
  # Global ground truth (available to all evaluators)
  human_verdict: fail
  failure_categories:
    - tone_mismatch
  notes: "Used casual language despite formal tone request"
  # Per-evaluator ground truth
  evals:
    check_tone:
      expected_tone: formal
      verdict: fail
    check_length:
      min_length: 500
      verdict: pass
    check_hallucinated_urls:
      verdict: pass

The ground_truth.evals.<evaluator_name> fields map directly to the evaluator names you'll use in verify(). Each evaluator receives its own ground truth merged with the top-level ground truth via context.ground_truth.
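
To make the merge concrete, the sketch below shows what a merged view could look like for the dataset above, using plain dictionaries. The framework defines the real merge behavior and it is not spelled out here, so two details are assumptions made for illustration: that the `evals` key is dropped from the merged result, and that per-evaluator keys take precedence over top-level keys of the same name. `ground_truth_for` is a hypothetical helper, not an Output API.

```python
def ground_truth_for(dataset_ground_truth, evaluator_name):
    """Merge top-level ground truth with one evaluator's entry under `evals`.

    Assumptions: the `evals` key itself is dropped from the result, and
    per-evaluator keys win when a key appears at both levels.
    """
    top_level = {k: v for k, v in dataset_ground_truth.items() if k != "evals"}
    per_evaluator = dataset_ground_truth.get("evals", {}).get(evaluator_name, {})
    return {**top_level, **per_evaluator}


ground_truth = {
    "human_verdict": "fail",
    "failure_categories": ["tone_mismatch"],
    "notes": "Used casual language despite formal tone request",
    "evals": {
        "check_tone": {"expected_tone": "formal", "verdict": "fail"},
        "check_length": {"min_length": 500, "verdict": "pass"},
        "check_hallucinated_urls": {"verdict": "pass"},
    },
}

merged = ground_truth_for(ground_truth, "check_tone")
print(merged)
```

Under these assumptions, `check_tone` would see both the global `human_verdict` and its own `expected_tone` and `verdict`.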

Labeling efficiently

You don't need to label every dataset for every category. Work in this order:

Label all datasets with the global human_verdict (pass/fail)
Label datasets for the top 3 failure categories by rate
Add per-evaluator labels as you build each evaluator

Step 5: Decide What to Fix vs. Evaluate

Not every failure category needs an evaluator. Use this decision tree:

Is this failure caused by a fixable prompt/tool gap?
├─ YES → Fix the prompt or add the missing tool first
│        Re-run error analysis after the fix
└─ NO  → Will this failure recur and need ongoing monitoring?
         ├─ YES → Build an evaluator
         │        Can it be checked with deterministic code?
         │        ├─ YES → Use Verdict.* helpers (contains, matches, gte, etc.)
         │        └─ NO  → Use judgeVerdict() with an LLM judge prompt
         └─ NO  → Document it and move on (rare edge case)
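
The same tree can be written as a small function, which is handy when triaging a list of categories in a script. This is a sketch of the decision logic only; the three yes/no answers still come from your own judgment, and `next_action` is a hypothetical name.

```python
def next_action(fixable_gap, will_recur, deterministic_check):
    """Walk the decision tree and return the recommended action."""
    if fixable_gap:
        return "fix the prompt or add the tool, then re-run error analysis"
    if not will_recur:
        return "document it and move on"
    if deterministic_check:
        return "build a code-based evaluator (Verdict.* helpers)"
    return "build an LLM-judge evaluator (judgeVerdict())"


# Hypothetical answers for three failure categories.
print(next_action(fixable_gap=True, will_recur=True, deterministic_check=True))
print(next_action(fixable_gap=False, will_recur=True, deterministic_check=False))
print(next_action(fixable_gap=False, will_recur=False, deterministic_check=True))
```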

Prioritize by failure rate

Build evaluators for the highest-rate failure categories first. A failure at 13% matters more than one at 2%.

Code-based checks first

Many failures that seem subjective have objective proxies:

| Failure | Seems like... | But you can check with... |
|---------|---------------|---------------------------|
| "Too short" | Subjective | `Verdict.gte(output.length, threshold)` |
| "Missing section" | Needs LLM | `Verdict.contains(output, "## Conclusion")` |
| "Hallucinated URLs" | Needs LLM | Extract URLs with regex, verify with HTTP HEAD |
| "Wrong format" | Needs LLM | `Verdict.matches(output, expectedPattern)` |

Reserve LLM judges for genuinely subjective criteria: tone, relevance, faithfulness, coherence.
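
To show how little logic these objective proxies need, here is a framework-independent sketch of three of them in plain Python. These functions are stand-ins written for illustration, not the `Verdict.*` helpers themselves, and the URL regex is a deliberately simple assumption that will miss some URL shapes. `url_resolves` makes a network call, so the example defines it without calling it.

```python
import re
import urllib.request

URL_PATTERN = re.compile(r"https?://[^\s)\]>]+")


def long_enough(text, min_words):
    """Objective proxy for 'too short': count whitespace-separated words."""
    return len(text.split()) >= min_words


def has_section(text, heading):
    """Objective proxy for 'missing section': look for an exact heading line."""
    return any(line.strip() == heading for line in text.splitlines())


def extract_urls(text):
    """Pull candidate URLs out of the text, trimming trailing punctuation."""
    return [match.rstrip(".,;:") for match in URL_PATTERN.findall(text)]


def url_resolves(url, timeout=5.0):
    """Send an HTTP HEAD request; treat errors and 4xx/5xx as unresolved."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (OSError, ValueError):
        return False


post = """# Understanding AI Safety

See https://example.com/alignment, which covers the basics.

## Conclusion

Safety work matters."""

print(long_enough(post, 500))
print(has_section(post, "## Conclusion"))
print(extract_urls(post))
```

Some servers reject HEAD requests, so a failed HEAD is a signal to look closer rather than proof that a link was hallucinated.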

Step 6: Map Categories to Evaluators

Create a mapping document that connects your failure categories to planned evaluators:

# Evaluator Plan: blog_generator

| Category | Rate | Evaluator Type | Evaluator Name | Criticality |
|----------|------|----------------|----------------|-------------|
| Hallucinated URLs | 13% | Code (URL extraction + HTTP check) | check_urls | required |
| Tone mismatch | 10% | LLM judge | check_tone | required |
| Off-topic drift | 8% | LLM judge | check_topic | required |
| Missing sections | 7% | Code (string contains) | check_sections | required |
| Too short | 5% | Code (length check) | check_length | informational |

This becomes your implementation roadmap. Use criticality: 'required' for failure categories that should block a passing verdict. Use 'informational' for nice-to-have checks.
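
The intent behind the two criticality levels can be sketched as follows: a failing `required` evaluator blocks the overall verdict, while a failing `informational` evaluator is reported but does not. How the framework actually aggregates results is not described here, so treat this as an illustration of the intent with hypothetical names, not as its implementation.

```python
def overall_verdict(results):
    """Return ('fail', blocking_names) if any required evaluator failed.

    `results` is a list of (evaluator_name, criticality, passed) tuples.
    """
    blocking = [
        name for name, criticality, passed in results
        if criticality == "required" and not passed
    ]
    return ("fail", blocking) if blocking else ("pass", [])


# Hypothetical results for one dataset.
results = [
    ("check_urls", "required", True),
    ("check_tone", "required", False),
    ("check_length", "informational", False),
]
print(overall_verdict(results))
```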

Next Steps

Build evaluators — Follow output-dev-eval-testing to implement each evaluator with verify() and wire them into evalWorkflow()
Design judge prompts — For LLM-based evaluators, follow output-eval-judge-prompt to write effective .prompt files
Expand datasets — If your traces don't cover enough failure regions, follow output-eval-dataset-design to generate diverse test cases
Re-run after changes — After fixing prompts, switching models, or modifying pipeline logic, repeat this error analysis to find new failure modes

Anti-Patterns

Building evaluators without error analysis — You'll evaluate the wrong things
Categorizing before reviewing 30+ traces — Premature categories cause confirmation bias
Surface-level categories ("bad output", "LLM error") — Split by root cause
One giant evaluator — One evaluator per failure mode, not one evaluator for everything
Skipping code-based checks — Don't use an LLM judge when Verdict.contains() works
Never re-running — Error analysis is not a one-time activity; repeat after significant changes

Related Skills

output-dev-eval-testing — Implement evaluators with verify(), Verdict, and evalWorkflow()
output-eval-judge-prompt — Design LLM judge prompts for subjective failure modes
output-eval-dataset-design — Generate diverse datasets when real traces are sparse
output-eval-validate-judge — Validate LLM judges against human labels
output-eval-audit — Audit an existing eval suite for trustworthiness
output-workflow-trace — Retrieve and analyze workflow execution traces

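The original text and its copyright belong to Anthropic and the respective plugin authors. The Japanese translation of this page was generated automatically with the Claude API.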