
🧪output-eval-dataset-design

Plugin: outputai

Description

Design diverse eval datasets using dimension-based variation. Use when bootstrapping eval datasets, when real traces are sparse, or when existing datasets miss edge cases.

Use cases

✓ Bootstrapping an eval dataset from scratch
✓ Real traces are sparse
✓ Existing datasets miss edge cases

Designing Eval Datasets

Overview

Diverse datasets catch more failures. This skill teaches dimension-based dataset design: systematically varying inputs along axes that target the failure-prone regions of your workflow. The result is a set of YAML dataset files that are ready to run with the output workflow test command.

When to Use

Bootstrapping an eval dataset for a new workflow
Existing datasets only cover happy paths
Real production traces are sparse (fewer than 50)
Stress-testing specific failure hypotheses from error analysis

When NOT to Use

You already have 100+ representative real traces — use stratified sampling from those instead of generating synthetic data
You haven't done error analysis yet — do that first (output-eval-error-analysis) so your dimensions target real failure modes, not guesses

Step 1: Define Dimensions

Identify 3+ axes of input variation that target anticipated failure modes. Each dimension should vary one aspect of the input that you expect to influence output quality.

Deriving dimensions from error analysis

Map failure categories to input properties that trigger them:

Failure Category	Triggering Input Property	Dimension
Off-topic drift	Ambiguous or broad topics	Topic specificity: specific / broad / ambiguous
Tone mismatch	Conflicting tone signals	Tone difficulty: simple / nuanced / contradictory
Too short	Short or vague input	Input detail: minimal / moderate / comprehensive
Missing sections	Many explicit requirements	Requirement count: 0 / 1-2 / 5+
Hallucinated URLs	Technical topics with real entities	Entity density: none / few / many

Example dimensions for a blog generation workflow

Dimension	Values	Why
Topic complexity	simple, technical, ambiguous	Technical and ambiguous topics trigger more hallucination and drift
Tone request	none, formal, casual, contradictory	Explicit tone requests reveal tone-matching failures
Length constraint	none, short (100 words), long (1,000 words)	Extreme length constraints trigger truncation and padding
Required sections	none, 1 section, 3+ sections	Multiple required sections stress structural compliance

Aim for 3-5 dimensions. More than 5 creates an unmanageable combinatorial space.
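
To make the combinatorial growth concrete, here is a minimal Python sketch that counts the full cross product of the four example dimensions. The dictionary only restates the example table above. The fifth dimension (audience) and its values are hypothetical and appear only to show how quickly the space grows.

```python
from math import prod

# Values copied from the example dimension table above.
dimensions = {
    "topic_complexity": ["simple", "technical", "ambiguous"],
    "tone_request": ["none", "formal", "casual", "contradictory"],
    "length_constraint": ["none", "short", "long"],
    "required_sections": ["none", "1 section", "3+ sections"],
}


def combination_count(dims):
    """Number of tuples in the full cross product of all dimension values."""
    return prod(len(values) for values in dims.values())


four_dims = combination_count(dimensions)

# Hypothetical fifth dimension, used only to illustrate growth.
five_dims = combination_count(
    {**dimensions, "audience": ["novice", "expert", "mixed"]}
)

print(four_dims)  # 3 * 4 * 3 * 3 = 108
print(five_dims)  # 108 * 3 = 324
```

Four dimensions already yield 108 possible tuples, so a dataset of about 20 is a deliberate sample of that space rather than an enumeration.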

Step 2: Draft Tuples

Create ~20 combinations of dimension values. Cover the extremes and the combinations most likely to cause failures.

Tuple selection strategy

Cover every dimension value at least twice — ensures no blind spots
Pair difficult values together — "ambiguous topic + contradictory tone + 3+ required sections" is where failures cluster
Include a few easy combinations — confirms the workflow works under normal conditions
Avoid near-duplicates — each tuple should test a distinct scenario

Example tuples

#	Topic Complexity	Tone	Length	Required Sections
1	simple	none	none	none
2	simple	formal	short	1 section
3	technical	formal	long	3+ sections
4	technical	casual	short	none
5	ambiguous	formal	long	1 section
6	ambiguous	contradictory	none	3+ sections
7	simple	contradictory	long	none
8	technical	none	none	3+ sections
9	ambiguous	casual	short	1 section
10	simple	casual	long	3+ sections
11	technical	formal	short	1 section
12	ambiguous	none	long	none
13	simple	formal	none	3+ sections
14	technical	contradictory	long	1 section
15	ambiguous	formal	short	3+ sections
16	simple	none	short	1 section
17	technical	casual	long	3+ sections
18	ambiguous	contradictory	short	none
19	technical	formal	none	none
20	ambiguous	casual	long	3+ sections

Review each tuple and ask: "Is this a realistic scenario a user might create?" Discard any that aren't.
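
The "at least twice" rule can be checked mechanically. The sketch below is one possible way to do it, not part of the output CLI: the function name undercovered_values and the tuple layout (topic, tone, length, sections) are assumptions made for illustration. It is run here on the first six rows of the example table only.

```python
from collections import Counter

DIMENSIONS = ("topic", "tone", "length", "sections")

ALLOWED = {
    "topic": ["simple", "technical", "ambiguous"],
    "tone": ["none", "formal", "casual", "contradictory"],
    "length": ["none", "short", "long"],
    "sections": ["none", "1 section", "3+ sections"],
}


def undercovered_values(tuples, allowed, min_count=2):
    """Return (dimension, value) pairs that appear fewer than min_count times."""
    gaps = []
    for index, dimension in enumerate(DIMENSIONS):
        counts = Counter(t[index] for t in tuples)
        for value in allowed[dimension]:
            if counts[value] < min_count:
                gaps.append((dimension, value))
    return gaps


# First six rows of the example tuple table.
draft = [
    ("simple", "none", "none", "none"),
    ("simple", "formal", "short", "1 section"),
    ("technical", "formal", "long", "3+ sections"),
    ("technical", "casual", "short", "none"),
    ("ambiguous", "formal", "long", "1 section"),
    ("ambiguous", "contradictory", "none", "3+ sections"),
]

gaps = undercovered_values(draft, ALLOWED)
print(gaps)
# [('tone', 'none'), ('tone', 'casual'), ('tone', 'contradictory')]
```

With only six rows, three tone values are still seen just once. The full 20-row table passes the same check.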

Step 3: Convert Tuples to Workflow Inputs

Transform each tuple into a JSON object matching the workflow's inputSchema from types.ts.

Read the schema first

# Find the input schema
cat src/workflows/<workflow_name>/types.ts

Manual conversion (small datasets)

For each tuple, write the corresponding JSON input:

Tuple 3: technical + formal + long + 3+ sections

{
  "topic": "Quantum error correction techniques in superconducting qubit architectures",
  "tone": "formal",
  "min_length": 1000,
  "required_sections": ["Introduction", "Technical Background", "Current Approaches", "Challenges", "Conclusion"]
}

Tuple 6: ambiguous + contradictory + none + 3+ sections

{
  "topic": "Things that matter",
  "tone": "Write in a formal academic style but keep it super casual and fun",
  "required_sections": ["Overview", "Deep Dive", "Takeaways"]
}

Use realistic, natural-sounding inputs. Avoid test-looking data like "Test topic 1" or "Lorem ipsum."

LLM-assisted conversion (larger datasets)

For 20+ tuples, use an LLM to batch-convert tuples into realistic inputs. Create a prompt that takes the tuple values and the inputSchema, then generates a natural JSON input. Review every generated input for realism before using it.
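
A prompt for this step might be assembled as in the sketch below. It only builds the prompt text; the call to an LLM is left out because no particular provider is assumed. The function name build_conversion_prompt and the schema summary are hypothetical. The field names come from the JSON examples above, but which fields are optional is an assumption, so check the real inputSchema in types.ts.

```python
import json


def build_conversion_prompt(tuple_values, input_schema):
    """Build one prompt asking an LLM to turn a tuple into a realistic JSON input."""
    lines = [
        "Convert the following dimension values into one realistic workflow input.",
        "Return a single JSON object that matches the schema and nothing else.",
        "Do not use placeholder text such as 'Test topic 1' or 'Lorem ipsum'.",
        "",
        "Dimension values:",
    ]
    lines += [f"- {name}: {value}" for name, value in tuple_values.items()]
    lines += ["", "Input schema:", json.dumps(input_schema, indent=2)]
    return "\n".join(lines)


# Hypothetical schema summary; the real one lives in the workflow's types.ts.
schema = {
    "topic": "string",
    "tone": "string (optional)",
    "min_length": "number (optional)",
    "required_sections": "string[] (optional)",
}

# Tuple 14 from the example table.
tuple_14 = {
    "topic_complexity": "technical",
    "tone": "contradictory",
    "length": "long",
    "required_sections": "1 section",
}

prompt = build_conversion_prompt(tuple_14, schema)
print(prompt)
```

Using one prompt per tuple makes it easy to trace each generated input back to the tuple that produced it during review.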

Step 4: Generate Dataset Files

Run each input through the workflow to capture real output:

# Generate datasets one at a time
npx output workflow dataset generate blog_generator \
  --input '{"topic": "Quantum error correction", "tone": "formal", "min_length": 1000}' \
  --name technical_formal_long

npx output workflow dataset generate blog_generator \
  --input '{"topic": "Things that matter", "tone": "Write formally but keep it casual"}' \
  --name ambiguous_contradictory

Each command creates a YAML file in tests/datasets/ with input and last_output populated.

Naming convention

Use names that encode the key dimensions:

tests/datasets/
├── simple_no_constraints.yml
├── simple_formal_short.yml
├── technical_formal_long_sections.yml
├── technical_casual_short.yml
├── ambiguous_formal_long.yml
├── ambiguous_contradictory_sections.yml
└── ...
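
Names like these can be derived from the tuples rather than typed by hand. The following sketch is one way to do that. The helper names are hypothetical, and the rule that only "3+ sections" adds a suffix is an assumption inferred from the listing above, not a documented convention.

```python
def dataset_name(topic, tone, length, sections):
    """Derive a dataset name from one tuple, following the listing above."""
    parts = [topic]
    if tone != "none":
        parts.append(tone)
    if length != "none":
        parts.append(length)
    # Assumption inferred from the listing: only "3+ sections" adds a suffix.
    if sections == "3+ sections":
        parts.append("sections")
    if len(parts) == 1:
        parts.append("no_constraints")
    return "_".join(parts)


def assert_unique(names):
    """Fail early if two tuples would map to the same dataset file."""
    duplicates = sorted({n for n in names if names.count(n) > 1})
    if duplicates:
        raise ValueError(f"duplicate dataset names: {duplicates}")


names = [
    dataset_name("simple", "none", "none", "none"),
    dataset_name("technical", "formal", "long", "3+ sections"),
    dataset_name("ambiguous", "contradictory", "none", "3+ sections"),
]
assert_unique(names)
print(names)
# ['simple_no_constraints', 'technical_formal_long_sections',
#  'ambiguous_contradictory_sections']
```

Under this rule, two tuples that differ only in "none" versus "1 section" map to the same name, which is why the uniqueness check is worth running before generation.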

Batch generation

For many datasets, create a shell script:

#!/bin/bash
# generate_datasets.sh
# Generates one dataset per entry. Each entry is '<JSON input>|<dataset name>'.

inputs=(
  '{"topic": "Solar panels", "tone": "formal"}|simple_formal'
  '{"topic": "Quantum error correction in superconducting qubits", "tone": "casual", "min_length": 1000}|technical_casual_long'
  '{"topic": "Things that matter", "required_sections": ["Overview", "Dive", "Takeaways"]}|ambiguous_sections'
)

for entry in "${inputs[@]}"; do
  # Split on the last '|' so a '|' inside the JSON cannot break the parsing.
  input="${entry%|*}"
  name="${entry##*|}"
  echo "Generating: $name"
  npx output workflow dataset generate blog_generator --input "$input" --name "$name"
done

Step 5: Add Ground Truth

After generation, edit each dataset YAML to add ground_truth labels. Review the last_output and assign verdicts per evaluator.

name: ambiguous_contradictory_sections
input:
  topic: "Things that matter"
  tone: "Write in a formal academic style but keep it super casual and fun"
  required_sections: ["Overview", "Deep Dive", "Takeaways"]
last_output:
  output:
    title: "Things That Matter: A Casual Academic Exploration"
    blog_post: "So here's the thing about stuff that matters..."
  executionTimeMs: 4200
  date: '2026-03-25T00:00:00.000Z'
ground_truth:
  human_verdict: fail
  notes: "Contradictory tone caused drift; missing 'Deep Dive' section"
  evals:
    check_tone:
      expected_tone: "formal academic style"
      verdict: fail
    check_topic:
      required_topic: "Things that matter"
      verdict: partial
    check_sections:
      required_sections: ["Overview", "Deep Dive", "Takeaways"]
      verdict: fail
    check_length:
      min_length: 100
      verdict: pass

Labeling tips

Label the global human_verdict for every dataset (pass/fail)
Label the top 3 evaluators by failure rate first
For borderline cases, decide and commit — don't leave ambiguous labels
Record notes explaining your reasoning for future reference
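
These tips can be turned into a quick consistency check. The sketch below works on a dataset that has already been parsed into a Python dict; YAML parsing is not shown because the standard library has no YAML parser. The function name label_problems is hypothetical, and accepting "partial" as a per-evaluator verdict is an assumption based on the example above.

```python
HUMAN_VERDICTS = {"pass", "fail"}
# Assumption: "partial" is allowed per evaluator because the example uses it.
EVAL_VERDICTS = {"pass", "partial", "fail"}


def label_problems(dataset):
    """Return a list of human-readable labeling problems for one dataset."""
    problems = []
    truth = dataset.get("ground_truth") or {}
    if truth.get("human_verdict") not in HUMAN_VERDICTS:
        problems.append("human_verdict must be pass or fail")
    if not truth.get("notes"):
        problems.append("notes are empty")
    for name, entry in (truth.get("evals") or {}).items():
        if entry.get("verdict") not in EVAL_VERDICTS:
            problems.append(f"{name}: missing or unknown verdict")
    return problems


labeled = {
    "name": "ambiguous_contradictory_sections",
    "ground_truth": {
        "human_verdict": "fail",
        "notes": "Contradictory tone caused drift; missing 'Deep Dive' section",
        "evals": {
            "check_tone": {"verdict": "fail"},
            "check_topic": {"verdict": "partial"},
        },
    },
}
# Hypothetical dataset that has not been labeled yet.
unlabeled = {"name": "simple_formal", "ground_truth": {"evals": {"check_tone": {}}}}

print(label_problems(labeled))    # []
print(label_problems(unlabeled))  # three problems, one per missing label
```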

Step 6: Supplement with Real Data

If you have production traces available, use them to fill coverage gaps.

# Download production traces
npx output workflow dataset generate blog_generator --download --limit 30

Stratified sampling

After downloading, check which dimensions are underrepresented:

Review the downloaded datasets
Tag each with its approximate dimension values
Identify gaps (e.g., no ambiguous topics, no contradictory tones)
Generate synthetic datasets specifically for the missing combinations

Mixing real and synthetic

A good eval dataset combines both:

Real traces (60-70%) — capture authentic user behavior and edge cases you didn't anticipate
Synthetic traces (30-40%) — fill coverage gaps and stress-test specific failure regions
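
As a worked example of the ratio, the sketch below computes how many synthetic datasets keep the synthetic share between 30% and 40% for a given number of real traces. The function name synthetic_range is hypothetical. It solves synthetic / (real + synthetic) = share using integer arithmetic to avoid rounding surprises.

```python
def synthetic_range(real_count, low_pct=30, high_pct=40):
    """Synthetic dataset counts that make up low_pct..high_pct of the total.

    Rearranging synthetic / (real + synthetic) = pct / 100 gives
    synthetic = real * pct / (100 - pct).
    """
    lowest = -(-real_count * low_pct // (100 - low_pct))  # ceiling division
    highest = real_count * high_pct // (100 - high_pct)   # floor division
    return lowest, highest


# 30 real traces, matching the --limit 30 download above.
low, high = synthetic_range(30)
print(low, high)  # 13 20
```

With 30 downloaded traces, 13 synthetic datasets give a 30.2% synthetic share (13 of 43) and 20 give exactly 40% (20 of 50).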

Step 7: Quality Filter

Review every dataset before using it in evals.

Discard datasets where

The input is unrealistic or test-looking
The input is nearly identical to another dataset (redundant)
The workflow errored out entirely (this points to a test infrastructure problem, not an output quality problem)
The ground truth labels are ambiguous or contested
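
Redundant inputs can be shortlisted automatically before the manual review. The sketch below is a rough heuristic under stated assumptions: it compares only the topic field, uses word-set (Jaccard) overlap, and the 0.6 threshold is arbitrary. The function names and the dataset name technical_formal_long_v2 are hypothetical. It produces candidates for a human to inspect, not a final verdict.

```python
from itertools import combinations


def topic_tokens(dataset_input):
    """Lowercased word set of the topic field."""
    return set(dataset_input["topic"].lower().split())


def jaccard(a, b):
    """Share of words the two sets have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def near_duplicates(inputs, threshold=0.6):
    """Return name pairs whose topics overlap at or above the threshold."""
    pairs = []
    for (name_a, in_a), (name_b, in_b) in combinations(sorted(inputs.items()), 2):
        if jaccard(topic_tokens(in_a), topic_tokens(in_b)) >= threshold:
            pairs.append((name_a, name_b))
    return pairs


inputs = {
    "simple_formal": {"topic": "Solar panels", "tone": "formal"},
    "technical_formal_long": {
        "topic": "Quantum error correction in superconducting qubits",
        "tone": "formal",
    },
    "technical_formal_long_v2": {
        "topic": "Quantum error correction in superconducting qubit architectures",
        "tone": "formal",
    },
}

print(near_duplicates(inputs))
# [('technical_formal_long', 'technical_formal_long_v2')]
```

The two quantum topics share 5 of 8 distinct words (0.625), so they are flagged. A pair with similar topics but different tone or length constraints may still be worth keeping, which is why the final decision stays manual.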

Target dataset count

Workflow Complexity	Minimum Datasets	Target
Simple (1-2 steps, narrow input)	10	20
Medium (3-5 steps, moderate variation)	20	40
Complex (more than 5 steps, wide input space)	40	80+

More datasets improve eval reliability. Aim for at least 5 datasets per failure category to get meaningful failure rates.

Verification

# List all datasets
npx output workflow dataset list blog_generator

# Run evals on all datasets with cached output
npx output workflow test blog_generator --cached

# Run evals on a subset
npx output workflow test blog_generator --cached --dataset technical_formal_long,ambiguous_contradictory

Check that:

[ ] All datasets load without errors
[ ] Dataset names are descriptive and encode key dimensions
[ ] Every dataset has a human_verdict in ground_truth
[ ] The top 3 evaluators have per-evaluator ground truth in most datasets
[ ] Datasets cover all dimension values from your tuple table
[ ] No two datasets are near-duplicates
[ ] Mix of real and synthetic traces (if production data is available)

Related Skills

output-eval-error-analysis — Identify failure modes that inform dimension selection
output-dev-eval-testing — Dataset YAML format, CLI commands, evalWorkflow() setup
output-dev-scenario-file — Scenario JSON files as seed inputs for dataset generation
output-eval-validate-judge — Split datasets into train/dev/test for judge validation
output-eval-judge-prompt — Design LLM judge prompts that consume these datasets

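The original text and its copyright belong to Anthropic and the respective plugin authors. The Japanese translation was generated automatically with the Claude API.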