🔍output-eval-audit

Plugin: outputai

Description

Audit an existing eval suite for trustworthiness. Use when inheriting evals, suspecting evals miss real failures, or after significant pipeline changes.

Use cases

✓ Inheriting existing evals
✓ Suspecting that evals miss real failures
✓ After significant pipeline changes

Auditing an Eval Suite

Overview

Audit your eval suite to determine whether it actually catches real failures. This skill provides a structured diagnostic that identifies gaps in error analysis, evaluator design, judge validation, and dataset coverage, with concrete remediation steps for each finding.

When to Use

Inheriting an eval suite from another team or developer
Suspecting that evals pass but production quality is poor
After switching models, rewriting prompts, or changing pipeline logic
Periodic health check (quarterly or after major releases)

Step 1: Gather Artifacts

Read the eval infrastructure files for the workflow being audited:

src/workflows/<workflow_name>/
├── tests/
│   ├── datasets/           # YAML dataset files
│   │   ├── *.yml
│   │   └── ...
│   └── evals/
│       ├── evaluators.ts   # Evaluator definitions
│       ├── workflow.ts      # Eval workflow definition
│       └── *.prompt         # Judge prompt files

Inventory what exists:

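| Artifact | File(s) | Count |
|----------|---------|-------|
| Evaluators | `tests/evals/evaluators.ts` | ? |
| Eval workflow | `tests/evals/workflow.ts` | ? entries in `evals` array |
| Judge prompts | `tests/evals/*.prompt` | ? |
| Datasets | `tests/datasets/*.yml` | ? |
| Datasets with `ground_truth` | subset of `tests/datasets/*.yml` | ? of above |
| Datasets with `last_output` | subset of `tests/datasets/*.yml` | ? of above |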

If any of these are missing entirely, note it and skip to "Starting From Zero" at the bottom.
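
The inventory can be filled in mechanically once you have a file listing. The sketch below is a minimal illustration, not part of any SDK: `DatasetRecord`, `ArtifactInventory`, and `inventoryArtifacts` are hypothetical names, the paths are assumed to be relative to `src/workflows/<workflow_name>/`, and the datasets are assumed to be already parsed from YAML. It only checks whether `evaluators.ts` exists, so counting the evaluators inside it is still a manual step.

```typescript
// Hypothetical shapes for illustration; they are not types from any SDK.
type DatasetRecord = {
  file: string;
  ground_truth?: Record<string, unknown>;
  last_output?: unknown;
};

type ArtifactInventory = {
  hasEvaluators: boolean;
  hasWorkflow: boolean;
  judgePrompts: number;
  datasets: number;
  withGroundTruth: number;
  withLastOutput: number;
  missing: string[];
};

// `paths` are file paths relative to src/workflows/<workflow_name>/.
function inventoryArtifacts(paths: string[], datasets: DatasetRecord[]): ArtifactInventory {
  const count = (re: RegExp): number => paths.filter((p) => re.test(p)).length;
  const hasEvaluators = count(/^tests\/evals\/evaluators\.ts$/) > 0;
  const hasWorkflow = count(/^tests\/evals\/workflow\.ts$/) > 0;
  const judgePrompts = count(/^tests\/evals\/[^\/]+\.prompt$/);
  const datasetFiles = count(/^tests\/datasets\/[^\/]+\.yml$/);

  // Collect anything that is missing entirely so the audit can note it.
  const missing: string[] = [];
  if (!hasEvaluators) missing.push("evaluators.ts");
  if (!hasWorkflow) missing.push("workflow.ts");
  if (judgePrompts === 0) missing.push("judge prompts");
  if (datasetFiles === 0) missing.push("datasets");

  return {
    hasEvaluators,
    hasWorkflow,
    judgePrompts,
    datasets: datasetFiles,
    withGroundTruth: datasets.filter((d) => d.ground_truth !== undefined).length,
    withLastOutput: datasets.filter((d) => d.last_output !== undefined).length,
    missing,
  };
}

// Example: two datasets, one with labels and cached output, and no judge prompts.
const exampleInventory = inventoryArtifacts(
  ["tests/evals/evaluators.ts", "tests/evals/workflow.ts", "tests/datasets/a.yml", "tests/datasets/b.yml"],
  [{ file: "a.yml", ground_truth: { evals: {} }, last_output: "cached" }, { file: "b.yml" }]
);
console.log(exampleInventory.missing);
```

A non-empty `missing` list corresponds to the "missing entirely" case described above.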

Step 2: Run the Diagnostic

Evaluate each of the four areas below. For each, assign a status:

Pass — Meets the standard
Warn — Partially meets the standard, improvements needed
Fail — Does not meet the standard, significant risk

Area 1: Error Analysis Grounding

Question: Were the evaluators derived from observed failure modes in real workflow traces?

Check:

Do failure categories exist (documented in a file, comments, or commit history)?
Does each evaluator map to a specific failure category?
Or are evaluators measuring generic qualities ("quality score", "overall rating")?

Pass criteria:

Each evaluator targets a named failure mode (e.g., "check_tone" targets tone mismatch, not "evaluate general quality")
Failure categories were derived from reviewing real traces (not brainstormed)

Common failures:

Evaluators named evaluate_quality, check_overall, rate_output — generic, not grounded in observed failures
Evaluators were written based on what seemed important, not what actually fails
No evidence of trace review before evaluator creation

Remediation: output-eval-error-analysis — Review 50+ traces and categorize actual failure modes before modifying evaluators
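
A quick first pass over evaluator names can surface the generic ones listed under "Common failures". The following is a rough heuristic sketch, not a substitute for checking each evaluator against documented failure categories. The `GENERIC_TERMS` word list and the `flagGenericEvaluators` helper are assumptions made for this example, and it only understands snake_case names.

```typescript
// Assumed word list: terms that describe no specific failure mode. Tune for your codebase.
const GENERIC_TERMS = ["quality", "overall", "output", "general", "rating", "score"];

// Flag names whose subject (everything after the leading verb) is entirely generic.
function flagGenericEvaluators(names: string[]): string[] {
  return names.filter((evaluatorName) => {
    const tokens = evaluatorName.toLowerCase().split("_");
    const subject = tokens.slice(1); // drop check_ / evaluate_ / rate_
    return subject.length > 0 && subject.every((t) => GENERIC_TERMS.indexOf(t) !== -1);
  });
}

const flagged = flagGenericEvaluators([
  "check_tone",
  "evaluate_quality",
  "check_overall",
  "rate_output",
  "check_citation_format",
]);
console.log(flagged);
```

Here `check_tone` and `check_citation_format` are left alone because their subjects name something concrete, while the three generic names are flagged for follow-up.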

Area 2: Evaluator Design

Question: Are the evaluators well-designed for reliable automated evaluation?

Check each evaluator in tests/evals/evaluators.ts:

| Check | What to look for |
|-------|------------------|
| One failure mode per judge | Each `judgeVerdict()` evaluator targets exactly one criterion |
| Binary verdicts | Judge prompts use pass/fail, not Likert scales (1-5) or multi-axis ratings |
| Code-based where possible | Objective checks use `Verdict.*` helpers, not LLM judges |
| Few-shot examples in judges | Judge `.prompt` files include pass, fail, and borderline examples |
| Critique before verdict | Judge prompts request critique/reasoning before the verdict in structured output |
| Appropriate criticality | `required` for blocking failures, `informational` for nice-to-have checks |
| Correct interpret type | `interpret` config matches what the evaluator returns |

Pass criteria:

All checks above are met for every evaluator

Common failures:

A single judge prompt evaluates 3+ criteria simultaneously ("Rate tone, accuracy, and completeness")
Judge prompts have no few-shot examples
Deterministic checks (length, string contains, regex) use LLM judges instead of Verdict.*
interpret type doesn't match evaluator return type (e.g., judgeVerdict() with interpret: { type: 'boolean' })
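
To make the "code-based where possible" point concrete, here is a minimal sketch of the three deterministic checks named above written as plain functions. The signatures of the `Verdict.*` helpers are not shown in this document, so `CheckResult`, `checkMaxLength`, `checkContains`, and `checkMatches` are hypothetical stand-ins rather than the library's API.

```typescript
// Stand-in result type; the real Verdict.* helpers may return something different.
type CheckResult = { verdict: "pass" | "fail"; reason: string };

// Length check: no model call needed, and the result is reproducible.
function checkMaxLength(output: string, maxChars: number): CheckResult {
  const ok = output.length <= maxChars;
  return { verdict: ok ? "pass" : "fail", reason: `${output.length} chars, limit ${maxChars}` };
}

// String containment check.
function checkContains(output: string, needle: string): CheckResult {
  const ok = output.indexOf(needle) !== -1;
  return { verdict: ok ? "pass" : "fail", reason: ok ? `contains "${needle}"` : `missing "${needle}"` };
}

// Regex check. search() always scans from the start, so a `g` flag carries no state between calls.
function checkMatches(output: string, pattern: RegExp): CheckResult {
  const ok = output.search(pattern) !== -1;
  return { verdict: ok ? "pass" : "fail", reason: ok ? `matches ${pattern}` : `no match for ${pattern}` };
}

const draft = "Order #1234 has shipped.";
console.log(checkMaxLength(draft, 280).verdict);
console.log(checkContains(draft, "refund").verdict);
console.log(checkMatches(draft, /#\d{4}/).verdict);
```

Checks like these cost nothing to run and cannot drift, which is why an LLM judge is better reserved for criteria that need judgment.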

Remediation: output-eval-judge-prompt — Redesign judge prompts following the four-component structure
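
The "binary verdicts" and "critique before verdict" checks can also be enforced on the judge's structured output. The sketch below assumes the judge returns a JSON object with `critique` and `verdict` fields. That shape, the `JudgeOutput` type, and `parseJudgeOutput` are assumptions for illustration, not the format the library requires. Key order in the parsed JSON is used as a proxy for the order in which the model generated the fields.

```typescript
// Assumed output shape: free-text critique first, then a binary verdict.
type JudgeOutput = { critique: string; verdict: "pass" | "fail" };

function parseJudgeOutput(raw: string): JudgeOutput {
  const data: unknown = JSON.parse(raw);
  if (typeof data !== "object" || data === null) {
    throw new Error("judge output must be a JSON object");
  }
  const keys = Object.keys(data);
  const { critique, verdict } = data as { critique?: unknown; verdict?: unknown };

  if (typeof critique !== "string" || critique.trim() === "") {
    throw new Error("critique must be a non-empty string");
  }
  if (verdict === "pass" || verdict === "fail") {
    // JSON.parse keeps key order, so this detects a verdict emitted before the reasoning.
    if (keys.indexOf("critique") > keys.indexOf("verdict")) {
      throw new Error("critique must come before verdict");
    }
    return { critique, verdict };
  }
  // Rejects Likert-style answers such as 3 or "4/5".
  throw new Error(`verdict must be "pass" or "fail", got ${JSON.stringify(verdict)}`);
}

const parsed = parseJudgeOutput('{"critique":"Tone is too casual for a legal notice.","verdict":"fail"}');
console.log(parsed.verdict);
```

Rejecting malformed output at parse time keeps a scale-style or reasoning-free answer from being counted as a pass or a fail.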

Area 3: Judge Validation

Question: Have LLM judges been validated against human labels?

Check for each LLM-based evaluator (those using judgeVerdict(), judgeScore(), judgeLabel()):

| Check | What to look for |
|-------|------------------|
| Human labels exist | Datasets have `ground_truth.evals.<evaluator_name>.verdict` populated |
| TPR/TNR measured | Validation results documented (file, comment, or commit) |
| Train/dev/test split | Few-shot examples in the judge prompt come from a designated train split, not from the same data used for measurement |
| Metrics meet threshold | TPR > 80% and TNR > 80% (target: > 90%) |

Pass criteria:

Every LLM judge has documented TPR/TNR metrics above 80%
Train/dev/test split was used (no data leakage)
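
A leakage check is easy to automate if each labeled case records which split it belongs to. The sketch below assumes that bookkeeping exists. `LabeledCase`, its `split` field, and `findLeakedExamples` are hypothetical names, since this document does not say how splits are stored.

```typescript
// Hypothetical bookkeeping: every labeled case is assigned to exactly one split.
type LabeledCase = { id: string; split: "train" | "dev" | "test" };

// Return the few-shot example ids that are NOT drawn from the train split.
function findLeakedExamples(fewShotIds: string[], cases: LabeledCase[]): string[] {
  const splitById: Record<string, string> = {};
  for (const c of cases) {
    splitById[c.id] = c.split;
  }
  // Unknown ids are reported too, since their provenance cannot be verified.
  return fewShotIds.filter((id) => splitById[id] !== "train");
}

const labeledCases: LabeledCase[] = [
  { id: "case-001", split: "train" },
  { id: "case-002", split: "dev" },
  { id: "case-003", split: "test" },
];
const leaked = findLeakedExamples(["case-001", "case-003"], labeledCases);
console.log(leaked);
```

Any id returned here means TPR/TNR measured on that split overstates how well the judge generalizes.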

Common failures:

No validation at all — judges were written and immediately deployed
Few-shot examples in the judge prompt are the same examples used to measure metrics (data leakage)
"It seems to work" without quantitative measurement
Only raw accuracy reported (masks class imbalance)

Remediation: output-eval-validate-judge — Calibrate each judge against human labels using TPR/TNR
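
The example below shows why raw accuracy alone is not enough. It is a minimal sketch that assumes "pass" is the positive class, so TPR is the share of human-labeled passes the judge also passes, and TNR is the share of human-labeled fails the judge also fails. `LabeledPair` and `judgeAgreement` are illustrative names, not library functions.

```typescript
type BinaryVerdict = "pass" | "fail";
type LabeledPair = { human: BinaryVerdict; judge: BinaryVerdict };

// Compare judge verdicts with human labels. Assumes "pass" is the positive class.
function judgeAgreement(pairs: LabeledPair[]) {
  let tp = 0;
  let tn = 0;
  let fp = 0;
  let fn = 0;
  for (const { human, judge } of pairs) {
    if (human === "pass") {
      if (judge === "pass") tp++;
      else fn++;
    } else {
      if (judge === "fail") tn++;
      else fp++;
    }
  }
  const tpr = tp + fn === 0 ? NaN : tp / (tp + fn);
  const tnr = tn + fp === 0 ? NaN : tn / (tn + fp);
  const accuracy = pairs.length === 0 ? NaN : (tp + tn) / pairs.length;
  // Threshold from the table above: both rates must exceed 80%.
  return { tpr, tnr, accuracy, meetsThreshold: tpr > 0.8 && tnr > 0.8 };
}

// Imbalanced set: 95 true passes, 5 true fails, and the judge catches only 1 of the fails.
const skewedPairs: LabeledPair[] = [];
for (let i = 0; i < 95; i++) skewedPairs.push({ human: "pass", judge: "pass" });
for (let i = 0; i < 4; i++) skewedPairs.push({ human: "fail", judge: "pass" });
skewedPairs.push({ human: "fail", judge: "fail" });

const skewed = judgeAgreement(skewedPairs);
console.log(skewed); // accuracy is 0.96 while tnr is only 0.2
```

With 95 passes and 5 fails, a judge that misses four of the five failures still reports 96% accuracy. Only the TNR of 20% shows that it cannot detect failures.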

Area 4: Dataset Coverage

Question: Do the datasets adequately cover the failure space?

Check:

| Check | What to look for |
|-------|------------------|
| Dataset count | Minimum 10 for simple workflows, 20+ for complex ones |
| Diversity | Datasets vary across multiple input dimensions, not just happy paths |
| Failure representation | At least 30% of datasets have `human_verdict: fail` in ground_truth |
| Ground truth populated | Most datasets have `ground_truth` with per-evaluator labels |
| Real + synthetic mix | Includes production traces alongside synthetic test cases |
| No near-duplicates | Each dataset tests a meaningfully different scenario |

Pass criteria:

20+ diverse datasets with ground truth
Both pass and fail cases represented (not 95% passes)
Datasets cover different input dimensions

Common failures:

Only 3-5 datasets, all happy-path variations
100% of datasets pass (no failure cases to validate judges against)
Datasets are synthetic-only with no real production traces
Ground truth fields are empty or missing

Remediation: output-eval-dataset-design — Design diverse datasets using dimension-based variation
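
The countable parts of this table (dataset count, failure share, labels, real/synthetic mix) can be checked in a few lines. The sketch below assumes each dataset has been summarized into a hypothetical `DatasetSummary` record, and it reads "most datasets" as more than half, which is an interpretation rather than something this document defines. Diversity and near-duplicate checks still need a human read.

```typescript
// Hypothetical per-dataset summary extracted from the YAML files.
type DatasetSummary = {
  id: string;
  humanVerdict?: "pass" | "fail"; // undefined when ground_truth is empty or missing
  source: "production" | "synthetic";
};

function coverageReport(datasets: DatasetSummary[], isComplexWorkflow: boolean) {
  const minCount = isComplexWorkflow ? 20 : 10;
  const labeled = datasets.filter((d) => d.humanVerdict !== undefined).length;
  const failures = datasets.filter((d) => d.humanVerdict === "fail").length;
  const failShare = datasets.length === 0 ? 0 : failures / datasets.length;
  return {
    enoughDatasets: datasets.length >= minCount,
    failShare,
    enoughFailures: failShare >= 0.3,
    // Assumption: "most" means more than half.
    mostlyLabeled: labeled > datasets.length / 2,
    hasProduction: datasets.some((d) => d.source === "production"),
    hasSynthetic: datasets.some((d) => d.source === "synthetic"),
  };
}

// 12 synthetic datasets, 10 labeled pass and 2 labeled fail.
const sampleDatasets: DatasetSummary[] = [];
for (let i = 0; i < 12; i++) {
  sampleDatasets.push({ id: `ds-${i}`, humanVerdict: i < 2 ? "fail" : "pass", source: "synthetic" });
}
const coverage = coverageReport(sampleDatasets, false);
console.log(coverage);
```

In this sample the count is sufficient for a simple workflow, but the failure share is about 17% and there are no production traces, so the area would not pass.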

Step 3: Compile the Report

Summarize findings in a structured format:

# Eval Audit: <workflow_name>
# Date: YYYY-MM-DD
# Auditor: <name>

## Summary

| Area | Status | Key Finding |
|------|--------|-------------|
| Error Analysis Grounding | Warn | Evaluators seem reasonable but no documented trace review |
| Evaluator Design | Fail | Single judge evaluates 3 criteria simultaneously |
| Judge Validation | Fail | No validation performed on any LLM judge |
| Dataset Coverage | Warn | 12 datasets but only 2 are failure cases |

## Findings

### 1. Error Analysis Grounding — WARN
Evaluators target reasonable criteria (tone, topic, length) but there is no evidence
that these were derived from observed failures. The eval suite may be missing the
workflow's actual top failure modes.

**Next step:** Run error analysis on 50+ production traces (`output-eval-error-analysis`)

### 2. Evaluator Design — FAIL
`evaluate_overall_quality` in evaluators.ts uses a single judgeVerdict() call that
assesses tone, accuracy, and completeness simultaneously. This makes failures
unactionable — when it fails, you don't know which criterion failed.

**Next step:** Split into three focused judges (`output-eval-judge-prompt`)

### 3. Judge Validation — FAIL
No TPR/TNR metrics exist for any LLM judge. The judge_quality@v1.prompt has no
few-shot examples.

**Next step:** Label 100 datasets, validate each judge (`output-eval-validate-judge`)

### 4. Dataset Coverage — WARN
12 datasets exist with cached output. Only 2 have ground_truth.human_verdict: fail.
All inputs are simple topics with no edge cases.

**Next step:** Design 20+ diverse datasets (`output-eval-dataset-design`)

## Priority Order
1. Error analysis (foundational — may change which evaluators are needed)
2. Split holistic judge into focused judges
3. Expand datasets to 30+ with balanced pass/fail
4. Validate all LLM judges
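
If audits are run regularly, the summary table in the report above can be generated from structured findings rather than typed by hand. This is a small sketch under that assumption. `AuditStatus`, `AreaFinding`, `renderSummary`, and `worstStatus` are names invented for the example.

```typescript
type AuditStatus = "Pass" | "Warn" | "Fail";
type AreaFinding = { area: string; status: AuditStatus; keyFinding: string };

// Render the "Summary" table of the report as markdown.
function renderSummary(findings: AreaFinding[]): string {
  const escapeCell = (s: string): string => s.replace(/\|/g, "\\|"); // keep pipes from breaking the table
  const header = ["| Area | Status | Key Finding |", "|------|--------|-------------|"];
  const rows = findings.map(
    (f) => `| ${escapeCell(f.area)} | ${f.status} | ${escapeCell(f.keyFinding)} |`
  );
  return header.concat(rows).join("\n");
}

// Overall status of the suite is the worst status across areas.
function worstStatus(findings: AreaFinding[]): AuditStatus {
  if (findings.some((f) => f.status === "Fail")) return "Fail";
  if (findings.some((f) => f.status === "Warn")) return "Warn";
  return "Pass";
}

const auditFindings: AreaFinding[] = [
  { area: "Error Analysis Grounding", status: "Warn", keyFinding: "Evaluators seem reasonable but no documented trace review" },
  { area: "Evaluator Design", status: "Fail", keyFinding: "Single judge evaluates 3 criteria simultaneously" },
];
console.log(renderSummary(auditFindings));
console.log(worstStatus(auditFindings));
```

The priority order is deliberately not derived from status: error analysis comes first even when it is only a Warn, because its results can change which evaluators are needed.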

Starting From Zero

If the workflow has no eval infrastructure at all:

1. Start with error analysis (output-eval-error-analysis): review 50+ workflow traces.
2. Build datasets (output-eval-dataset-design): create 20+ diverse datasets.
3. Implement evaluators (output-dev-eval-testing): write verify() evaluators and evalWorkflow().
4. Design judge prompts (output-eval-judge-prompt): for subjective criteria only.
5. Validate judges (output-eval-validate-judge): do this before trusting any LLM judge.

Do not skip error analysis. Building evaluators without understanding how the workflow fails wastes effort on the wrong things.

Related Skills

output-eval-error-analysis — Systematic trace review and failure categorization
output-eval-judge-prompt — Design effective LLM judge prompts
output-eval-dataset-design — Generate diverse test datasets
output-eval-validate-judge — Calibrate LLM judges against human labels
output-dev-eval-testing — Implementation reference for offline eval testing
output-dev-evaluator-function — Implementation reference for runtime evaluators

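The original text and its copyright belong to Anthropic and the respective plugin authors. The Japanese translation was generated automatically with the Claude API.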
