Skill · Official · development

📊output-eval-validate-judge

Plugin: outputai

Description

Validate LLM judges against human labels using TPR/TNR metrics and train/dev/test splits. Use after writing a judge prompt to verify it agrees with human judgment.

Use cases

✓ Verify the accuracy of a judge prompt
✓ Check how well the judge agrees with human labels
✓ Evaluate performance with TPR/TNR metrics

Validating LLM Judges

Overview

An LLM judge is only useful if it agrees with human judgment. This skill walks you through calibrating a judge against human-labeled data using True Positive Rate (TPR) and True Negative Rate (TNR) metrics. Do this before trusting any judgeVerdict(), judgeScore(), or judgeLabel() evaluator in your eval suite.

Prerequisites

A judge .prompt file — Written following output-eval-judge-prompt
~100 human-labeled traces — With binary pass/fail labels for the failure mode this judge targets. Aim for ~50 pass and ~50 fail. Minimum: 20 pass and 20 fail.
Labels stored in dataset YAML — Each dataset has ground_truth.evals.<evaluator_name>.verdict: pass or fail

This process applies only to LLM-based judges. For code-based Verdict.* evaluators, write unit tests instead.
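
Before splitting, it can help to confirm that the label counts meet the minimum stated above. The following is a minimal Python sketch, assuming the human labels have already been read from the dataset YAML files into a plain dict; the function name and the dict shape are illustrative and not part of the output CLI.

```python
from collections import Counter

MIN_PER_CLASS = 20  # minimum from the prerequisites: 20 pass and 20 fail


def check_label_counts(labels):
    """Count pass/fail labels and report whether the minimum is met.

    labels: hypothetical dict of dataset name -> "pass" or "fail".
    """
    counts = Counter(labels.values())
    unknown = set(counts) - {"pass", "fail"}
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    enough = counts["pass"] >= MIN_PER_CLASS and counts["fail"] >= MIN_PER_CLASS
    return counts["pass"], counts["fail"], enough


# Usage with made-up dataset names: odd indices pass, even indices fail
labels = {f"case_{i:03d}": ("pass" if i % 2 else "fail") for i in range(100)}
n_pass, n_fail, enough = check_label_counts(labels)
print(n_pass, n_fail, enough)
```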

Step 1: Create Data Splits

Split your labeled datasets into three groups:

Split	% of Data	Purpose	Example (100 datasets)
Train	10-20%	Source of few-shot examples in the judge prompt	15 datasets
Dev	40-45%	Iterate on judge prompt, measure TPR/TNR	42 datasets
Test	40-45%	Final held-out measurement, run once	43 datasets

Organizing splits

Use a naming convention or subdirectories to separate splits:

Option A: Name prefixes

tests/datasets/
├── train_formal_pass_01.yml
├── train_casual_fail_01.yml
├── dev_technical_pass_01.yml
├── dev_ambiguous_fail_01.yml
├── test_simple_pass_01.yml
├── test_contradictory_fail_01.yml
└── ...

Option B: Subdirectories

tests/datasets/
├── train/
│   ├── formal_pass_01.yml
│   └── casual_fail_01.yml
├── dev/
│   ├── technical_pass_01.yml
│   └── ambiguous_fail_01.yml
└── test/
    ├── simple_pass_01.yml
    └── contradictory_fail_01.yml

Splitting rules

Balance pass/fail in each split — Don't put all failures in dev and all passes in test
Randomize — Don't sort by difficulty or topic
Training examples in the prompt — Use only train-split examples as few-shot in the judge .prompt file. Never use dev or test examples — that's data leakage
Lock the test split — Once created, do not look at test data until final measurement
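
One way to satisfy the balance and randomization rules together is a stratified shuffle: shuffle the pass and fail datasets separately, then cut each group at the same fractions. This is a sketch only; the function name, the 15%/42% fractions (taken from the example column of the table above), and the input dict are assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_frac=0.15, dev_frac=0.42, seed=0):
    """Split dataset names into train/dev/test, balancing pass/fail in each.

    labels: hypothetical dict of dataset name -> "pass" or "fail".
    Whatever remains after train and dev goes to test.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    by_label = defaultdict(list)
    for name, verdict in labels.items():
        by_label[verdict].append(name)

    splits = {"train": [], "dev": [], "test": []}
    for verdict in sorted(by_label):
        names = sorted(by_label[verdict])
        rng.shuffle(names)  # randomize; do not sort by difficulty or topic
        n_train = round(len(names) * train_frac)
        n_dev = round(len(names) * dev_frac)
        splits["train"] += names[:n_train]
        splits["dev"] += names[n_train:n_train + n_dev]
        splits["test"] += names[n_train + n_dev:]
    return splits


# Usage with made-up dataset names: 50 pass, 50 fail
labels = {f"case_{i:03d}": ("pass" if i < 50 else "fail") for i in range(100)}
splits = stratified_split(labels)
print({name: len(items) for name, items in splits.items()})
```

Because each label group is cut at the same fractions, every split keeps the pass/fail ratio of the full set. Record the seed alongside the split so the test set can stay locked.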

Step 2: Run the Judge on Dev Set

Execute the eval workflow against only the dev-split datasets:

# Run with cached output on dev datasets
npx output workflow test <workflowName> --cached \
  --dataset dev_technical_pass_01,dev_ambiguous_fail_01,dev_formal_pass_02,...

Or if using subdirectories, list the dev dataset names:

npx output workflow test <workflowName> --cached \
  --dataset "$(ls tests/datasets/dev/ | sed 's/\.yml$//' | paste -sd, -)"

Save the output. You need the judge's verdict for each dataset to compare against ground truth.

Extracting results

Use --json to get machine-readable results:

npx output workflow test <workflowName> --cached --dataset <dev_datasets> --json

The output includes per-dataset, per-evaluator verdicts that you can compare against ground_truth.evals.<evaluator_name>.verdict.

Step 3: Compute TPR and TNR

For the evaluator you're validating, build a confusion matrix from the dev results.

Definitions

Using "fail" as the positive class (what you're trying to detect):

	Judge says Fail	Judge says Pass
Human says Fail	True Positive (TP)	False Negative (FN)
Human says Pass	False Positive (FP)	True Negative (TN)

TPR (True Positive Rate) = TP / (TP + FN)

"Of all the real failures, what fraction did the judge catch?"
Low TPR means the judge misses real failures (dangerous)

TNR (True Negative Rate) = TN / (TN + FP)

"Of all the real passes, what fraction did the judge correctly approve?"
Low TNR means the judge flags passing traces as failures (noisy)

Example computation

Dev set results for check_tone evaluator (42 datasets):

	Judge: Fail	Judge: Pass
Human: Fail	18 (TP)	3 (FN)
Human: Pass	2 (FP)	19 (TN)

TPR = 18 / (18 + 3) = 85.7%
TNR = 19 / (19 + 2) = 90.5%
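
The same arithmetic can be sketched in Python. This assumes the human labels and judge verdicts have already been extracted into two dicts keyed by dataset name; the function names and dict shapes are hypothetical, and no particular layout of the --json output is assumed.

```python
def confusion_counts(human, judge):
    """Build confusion-matrix counts with "fail" as the positive class."""
    counts = {"TP": 0, "FN": 0, "FP": 0, "TN": 0}
    for name, truth in human.items():
        predicted = judge[name]
        if truth == "fail":
            counts["TP" if predicted == "fail" else "FN"] += 1
        else:
            counts["FP" if predicted == "fail" else "TN"] += 1
    return counts


def tpr_tnr(counts):
    """Return (TPR, TNR); raises ZeroDivisionError if a class is empty."""
    tpr = counts["TP"] / (counts["TP"] + counts["FN"])
    tnr = counts["TN"] / (counts["TN"] + counts["FP"])
    return tpr, tnr


# Rebuild the example above: 18 TP, 3 FN, 2 FP, 19 TN
cells = [("fail", "fail", 18), ("fail", "pass", 3),
         ("pass", "fail", 2), ("pass", "pass", 19)]
human, judge = {}, {}
for truth, predicted, n in cells:
    for i in range(n):
        name = f"dev_{truth}_{predicted}_{i:02d}"
        human[name], judge[name] = truth, predicted

counts = confusion_counts(human, judge)
tpr, tnr = tpr_tnr(counts)
print(counts, f"TPR={tpr:.1%}", f"TNR={tnr:.1%}")
```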

Why not raw accuracy?

Raw accuracy = (TP + TN) / total = (18 + 19) / 42 = 88.1%

This looks fine, but raw accuracy can mask problems. If your dataset were 90% pass (class imbalance), a judge that always says "pass" would score 90% accuracy while catching zero failures (TPR = 0%). TPR and TNR measure what actually matters: catching failures and not crying wolf.
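
The imbalance scenario is easy to reproduce. The sketch below uses synthetic labels (not real results) for a judge that always answers "pass" on a 90% pass dataset.

```python
# Synthetic data: 90 passing traces, 10 failing traces
human = ["pass"] * 90 + ["fail"] * 10
judge = ["pass"] * 100  # a judge that never flags anything

accuracy = sum(h == j for h, j in zip(human, judge)) / len(human)
caught = sum(h == "fail" and j == "fail" for h, j in zip(human, judge))
tpr = caught / human.count("fail")

print(f"accuracy={accuracy:.0%} TPR={tpr:.0%}")
```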

Step 4: Inspect Disagreements

For every case where the judge disagrees with the human label, determine the root cause.

False Negatives (judge missed a real failure)

The judge said "pass" but the human said "fail." For each:

Read the trace and the judge's critique
Determine why the judge missed it:
- Criterion too narrow — The prompt defines failure too narrowly. Broaden the fail definition.
- Missing few-shot example — The failure pattern isn't represented in examples. Add a similar borderline example from the train split.
- Insufficient context — The judge doesn't have the information needed to detect this failure. Add the missing variable to the prompt.

False Positives (judge flagged a passing trace)

The judge said "fail" but the human said "pass." For each:

Read the trace and the judge's critique
Determine why the judge flagged it:
- Criterion too broad — The prompt defines failure too broadly. Tighten the fail definition.
- Misleading few-shot example — A borderline example is being overgeneralized. Clarify or replace it.
- Overly strict — The judge applies the criterion more strictly than intended. Add explicit exceptions to the prompt.

Logging disagreements

Track each disagreement to guide prompt iteration:

Dataset	Human	Judge	Root Cause	Fix
dev_technical_pass_03	pass	fail	Judge flagged "it's" as casual but context was a direct quote	Add exception: "Contractions within direct quotes are acceptable"
dev_ambiguous_fail_02	fail	pass	Judge missed subtle tone shift in paragraph 3	Add borderline few-shot example showing mid-text tone drift
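
A disagreement log like the table above can be seeded automatically, leaving the root cause and fix columns for manual review. This is a sketch under the assumption that labels and verdicts are available as dicts keyed by dataset name; the function and field names are illustrative.

```python
def list_disagreements(human, judge):
    """Return one row per dataset where the judge and the human label differ."""
    rows = []
    for name in sorted(human):
        truth, predicted = human[name], judge[name]
        if truth == predicted:
            continue
        kind = "false negative" if truth == "fail" else "false positive"
        rows.append({
            "dataset": name,
            "human": truth,
            "judge": predicted,
            "type": kind,
            "root_cause": "",  # filled in by hand after reading the trace
            "fix": "",         # filled in by hand
        })
    return rows


# Usage with the two datasets from the table plus one agreeing case
human = {"dev_technical_pass_03": "pass",
         "dev_ambiguous_fail_02": "fail",
         "dev_formal_pass_02": "pass"}
judge = {"dev_technical_pass_03": "fail",
         "dev_ambiguous_fail_02": "pass",
         "dev_formal_pass_02": "pass"}
for row in list_disagreements(human, judge):
    print(row["dataset"], row["type"])
```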

Step 5: Iterate

Apply the fixes from Step 4 to the judge .prompt file. Then re-run on the dev set:

npx output workflow test <workflowName> --cached --dataset <dev_datasets>

Recompute TPR and TNR. Repeat until both metrics meet the target.

Targets

Metric	Target	Minimum Acceptable
TPR	> 90%	> 80%
TNR	> 90%	> 80%

If you can't reach 80%/80% after 3-4 iterations:

Upgrade the model — Switch from Haiku to Sonnet in the .prompt frontmatter
Split the criterion — The failure mode may contain two distinct sub-failures that need separate judges
Revisit the labels — Some human labels may be inconsistent. Re-label disagreements with a second reviewer
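
The thresholds in the table can be encoded as a small check to run after each iteration. A minimal sketch; the function name and the status strings are made up for illustration, and the comparisons are strict to match the "> 90%" and "> 80%" wording.

```python
TARGET = 0.90   # "> 90%" target from the table
MINIMUM = 0.80  # "> 80%" minimum acceptable


def judge_status(tpr, tnr):
    """Classify a (TPR, TNR) pair; both metrics must clear a bar."""
    worst = min(tpr, tnr)
    if worst > TARGET:
        return "target met"
    if worst > MINIMUM:
        return "acceptable"
    return "below minimum"


# Usage with the Step 3 example (TPR 18/21, TNR 19/21)
print(judge_status(18 / 21, 19 / 21))
```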

Iteration checklist

Each iteration:

[ ] Identified root cause for each disagreement
[ ] Applied targeted fix to .prompt file (not random changes)
[ ] Re-ran on dev set
[ ] Recomputed TPR and TNR
[ ] Logged the iteration number, changes made, and resulting metrics

Step 6: Final Measurement on Test Set

Once dev metrics meet the target, run the judge on the held-out test set exactly once:

npx output workflow test <workflowName> --cached --dataset <test_datasets> --json

Compute TPR and TNR on the test results. Record these as the final metrics.

Interpreting final results

Test metrics close to dev metrics — The judge generalizes well. Ship it.
Test metrics significantly lower — The judge may be overfit to dev set patterns. Do not iterate on the test set. Instead, gather more labeled data, re-split, and restart from Step 2.

Recording results

Document the final validation results alongside the judge prompt:

# Validation: check_tone (judge_tone@v1.prompt)
# Date: 2026-03-25
# Model: claude-haiku-4-5-20251001

## Dev Set (42 datasets)
- TPR: 90.5% (19/21)
- TNR: 95.2% (20/21)

## Test Set (43 datasets)
- TPR: 88.0% (22/25)
- TNR: 94.4% (17/18)

## Conclusion: APPROVED — both metrics above 80% minimum

Store this in a VALIDATION.md file next to the judge prompt or in the evaluator's documentation.
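
The metric sections of such a file can be generated from the confusion-matrix counts so that percentages and fractions always agree. This is a sketch only: the function names are hypothetical, the counts are chosen to match the example above, and the conclusion line is left for the author to write.

```python
def format_rate(hits, total):
    """Format a rate as in the example, e.g. 19 of 21 -> '90.5% (19/21)'."""
    return f"{100 * hits / total:.1f}% ({hits}/{total})"


def render_validation(evaluator, prompt_file, date, model, dev, test):
    """Render the header and metric sections; dev/test hold TP/FN/FP/TN counts."""
    lines = [
        f"# Validation: {evaluator} ({prompt_file})",
        f"# Date: {date}",
        f"# Model: {model}",
    ]
    for title, c in (("Dev Set", dev), ("Test Set", test)):
        total = c["TP"] + c["FN"] + c["FP"] + c["TN"]
        lines += [
            "",
            f"## {title} ({total} datasets)",
            f"- TPR: {format_rate(c['TP'], c['TP'] + c['FN'])}",
            f"- TNR: {format_rate(c['TN'], c['TN'] + c['FP'])}",
        ]
    return "\n".join(lines) + "\n"


# Usage with counts matching the example file above
dev = {"TP": 19, "FN": 2, "FP": 1, "TN": 20}
test = {"TP": 22, "FN": 3, "FP": 1, "TN": 17}
report = render_validation("check_tone", "judge_tone@v1.prompt",
                           "2026-03-25", "claude-haiku-4-5-20251001", dev, test)
print(report)
```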

Anti-Patterns

Assuming judges work without validation — An unvalidated judge may consistently miss failures or flag passing traces
Using dev/test examples as few-shot — Data leakage inflates metrics and hides real performance
Optimizing for raw accuracy — Use TPR and TNR instead; accuracy hides class imbalance problems
Iterating on the test set — Test is held-out. If test metrics are bad, gather more data and re-split
Skipping disagreement analysis — Random prompt tweaks without understanding root causes don't converge
Too few labeled examples — Below 40 total (20 pass + 20 fail), metrics are unreliable due to small sample size
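
To get a feel for how sample size limits reliability, a confidence interval can be put around a measured rate. The sketch below uses the Wilson score interval, which is a standard statistical technique and not something this skill prescribes; the function name is illustrative.

```python
import math


def wilson_interval(hits, total, z=1.96):
    """Approximate 95% Wilson score interval for a proportion hits/total."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = hits / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    margin = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - margin, center + margin


# Usage: a TPR of 22/25 measured on 25 human-labeled failures
low, high = wilson_interval(22, 25)
print(f"TPR 88.0%, interval {low:.1%} to {high:.1%}")
```

With only 25 failures the interval is more than 20 percentage points wide, which is why small label sets make it hard to tell a good judge from a mediocre one.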

Related Skills

output-eval-judge-prompt — Design the judge prompt being validated
output-eval-error-analysis — Source of human-labeled data for validation
output-eval-dataset-design — Generate additional labeled datasets if you need more data
output-dev-eval-testing — output workflow test CLI, --cached and --dataset flags
output-eval-audit — Audit whether existing judges have been validated

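The original text and its copyright belong to Anthropic and the respective plugin authors. The Japanese translation on the source page was machine-translated with the Claude API.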