🧪output-dev-eval-testing
- プラグイン
- outputai
- ソース
- GitHub で見る ↗
説明
@outputai/evals を使用して、Output SDK ワークフロー向けのオフライン評価テストを作成します。 次のような場合に使用: - `verify()` を用いたテスト評価器の実装 - データセット YAML ファイルの作成 - 評価ワークフローの構築 - CLI 経由でのワークフローテストの実行
原文を表示
Create offline evaluation tests for Output SDK workflows using @outputai/evals. Use when implementing test evaluators with verify(), creating dataset YAML files, building eval workflows, or running workflow tests via CLI.
ユースケース
- ✓verify()を用いたテスト評価器を実装するとき
- ✓データセット YAML ファイルを作成するとき
- ✓評価ワークフローを構築するとき
- ✓CLI経由でワークフローテストを実行するとき
本文
Offline Evaluation Testing
Overview
The @outputai/evals package provides an offline evaluation framework for testing workflow quality using datasets and evaluators. This is complementary to the runtime evaluator() from @outputai/core:
| Aspect | Runtime Evaluators (@outputai/core) |
Offline Eval Tests (@outputai/evals) |
|---|---|---|
| When | During workflow execution | After execution, at test time |
| Where | evaluators.ts in workflow folder |
tests/evals/ in workflow folder |
| Purpose | Live quality scoring with confidence | Dataset-driven pass/fail verification |
| Triggered by | Workflow orchestration | output workflow test CLI command |
| Returns | EvaluationBooleanResult, etc. |
Verdict helpers (pass/partial/fail) |
Use offline eval testing when you want to validate workflow behavior against known datasets, build regression test suites, or assess subjective quality with LLM judges.
When to Use This Skill
- Creating files in
tests/evals/ortests/datasets/ - Writing evaluators that use
verify()from@outputai/evals - Creating YAML dataset files for test cases
- Building eval workflows with
evalWorkflow() - Running
output workflow testcommands - Setting up ground truth data for evaluators
Directory Structure
Add a tests/ directory inside the workflow folder:
src/workflows/{workflow_name}/
├── workflow.ts
├── steps.ts
├── evaluators.ts # Runtime evaluators (optional)
├── types.ts
└── tests/
├── datasets/
│ ├── happy_path.yml
│ └── edge_case.yml
└── evals/
├── evaluators.ts # Offline eval test evaluators
├── workflow.ts # Eval workflow definition
└── judge_topic@v1.prompt # LLM judge prompts (optional)
Creating Evaluators with verify()
Import verify and Verdict from @outputai/evals (not @outputai/core):
// tests/evals/evaluators.ts
import { verify, Verdict } from '@outputai/evals';
import { z } from '@outputai/core';
verify() Signature
verify(options, checkFn)
Options:
name— unique evaluator identifier (snake_case)input— Zod schema for the workflow input (optional, defaults toz.any())output— Zod schema for the workflow output (optional, defaults toz.any())
Check function receives:
{
input, // typed workflow input
output, // typed workflow output
context: {
ground_truth: Record<string, unknown> // from dataset YAML
}
}
Returns: any Verdict helper result.
Basic Example
import { verify, Verdict } from '@outputai/evals';
import { z } from '@outputai/core';
export const evaluateSum = verify(
{
name: 'evaluate_sum',
input: z.object({ values: z.array(z.number()) }),
output: z.object({ result: z.number() })
},
({ input, output }) =>
Verdict.equals(output.result, input.values.reduce((a, b) => a + b, 0))
);
Using Ground Truth
Ground truth values come from the dataset YAML and are available via context.ground_truth:
export const lengthCheck = verify(
{ name: 'length_check', input: blogInput, output: blogOutput },
({ output, context }) =>
Verdict.gte(output.blog_post.length, Number(context.ground_truth.min_length ?? 100))
);
Verdict Helpers
All deterministic helpers return results with confidence 1.0.
Equality & Comparison
| Method | Description |
|---|---|
Verdict.equals(actual, expected) |
Strict equality (===) |
Verdict.closeTo(actual, expected, tolerance) |
Within numeric tolerance |
Verdict.gt(actual, threshold) |
Greater than |
Verdict.gte(actual, threshold) |
Greater than or equal |
Verdict.lt(actual, threshold) |
Less than |
Verdict.lte(actual, threshold) |
Less than or equal |
Verdict.inRange(actual, min, max) |
Within inclusive range |
String & Array
| Method | Description |
|---|---|
Verdict.contains(haystack, needle) |
String includes substring |
Verdict.matches(value, pattern) |
Regex match |
Verdict.includesAll(actual, expected) |
Array contains all expected values |
Verdict.includesAny(actual, expected) |
Array contains at least one expected value |
Boolean
| Method | Description |
|---|---|
Verdict.isTrue(value) |
Value is true |
Verdict.isFalse(value) |
Value is false |
Manual Verdicts
| Method | Description |
|---|---|
Verdict.pass(reasoning?) |
Explicit pass |
Verdict.partial(confidence, reasoning?, feedback?) |
Partial pass with confidence |
Verdict.fail(reasoning, feedback?) |
Explicit fail |
LLM Judge Evaluators
Before writing a judge prompt, identify the specific failure mode via error analysis (output-eval-error-analysis). Design the judge following output-eval-judge-prompt. After writing it, validate against human labels using output-eval-validate-judge.
For subjective quality assessments, use judge functions with .prompt files:
import { verify, judgeVerdict, judgeScore, judgeLabel } from '@outputai/evals';
// Returns pass/partial/fail verdict from an LLM
export const evaluateTopic = verify(
{ name: 'evaluate_topic', input: blogInput, output: blogOutput },
async ({ input, output, context }) =>
judgeVerdict({
prompt: 'judge_topic@v1',
variables: {
blog_title: output.title,
blog_post: output.blog_post,
required_topic: String(context.ground_truth.required_topic ?? input.topic)
}
})
);
// Returns a numeric score from an LLM
export const evaluateQuality = verify(
{ name: 'evaluate_quality', input: blogInput, output: blogOutput },
async ({ input, output }) =>
judgeScore({
prompt: 'judge_quality@v1',
variables: { blog_title: output.title, blog_post: output.blog_post, topic: input.topic }
})
);
// Returns a string label from an LLM
export const evaluateTone = verify(
{ name: 'evaluate_tone', input: blogInput, output: blogOutput },
async ({ output }) =>
judgeLabel({
prompt: 'judge_tone@v1',
variables: { blog_title: output.title, blog_post: output.blog_post }
})
);
Judge .prompt File Format
Judge prompt files live alongside evaluators in tests/evals/:
# tests/evals/judge_topic@v1.prompt
---
provider: anthropic
# current as of 2026-05-04 — run output-dev-model-selection for the latest
model: claude-haiku-4-5-20251001
temperature: 0
maxTokens: 1000
---
<system>
You are an evaluation judge. Assess whether a blog post is faithfully about the required topic.
Return a JSON object with:
- verdict: "pass" if the blog clearly focuses on the topic, "partial" if it mentions the topic but lacks depth, "fail" if it is not about the topic
- reasoning: a brief explanation of your judgment
</system>
<user>
Required topic: {{ required_topic }}
Blog title: {{ blog_title }}
Blog post:
{{ blog_post }}
Judge whether this blog post is faithfully about the required topic.
</user>
Creating Eval Workflows
The eval workflow wires evaluators together and defines how to interpret results.
// tests/evals/workflow.ts
import { evalWorkflow } from '@outputai/evals';
import { evaluateSum } from './evaluators.js';
export default evalWorkflow({
name: 'simple_eval',
evals: [
{
evaluator: evaluateSum,
criticality: 'required',
interpret: { type: 'boolean' }
}
]
});
Eval Definition Fields
Each entry in the evals array has:
evaluator— the function created byverify()criticality—'required'(affects pass/fail) or'informational'(reported but doesn't block)interpret— how to convert the evaluator's return value into a verdict
Interpret Types
| Type | Evaluator Returns | Mapping |
|---|---|---|
{ type: 'boolean' } |
Verdict.equals(), Verdict.gte(), etc. |
true = pass, false = fail |
{ type: 'verdict' } |
judgeVerdict() or Verdict.pass/partial/fail() |
Direct pass-through |
{ type: 'number', pass: 0.7, partial: 0.4 } |
judgeScore() |
>=pass = pass, >=partial = partial, else fail |
{ type: 'string', pass: ['a', 'b'], partial: ['c'] } |
judgeLabel() |
Label in pass list = pass, in partial list = partial, else fail |
Full Example with Mixed Evaluators
export default evalWorkflow({
name: 'blog_generator_eval',
evals: [
{
evaluator: lengthOfOutput,
criticality: 'required',
interpret: { type: 'boolean' }
},
{
evaluator: evaluateTopic,
criticality: 'required',
interpret: { type: 'verdict' }
},
{
evaluator: evaluateQuality,
criticality: 'required',
interpret: { type: 'number', pass: 0.7, partial: 0.4 }
},
{
evaluator: evaluateContent,
criticality: 'informational',
interpret: { type: 'boolean' }
},
{
evaluator: evaluateTone,
criticality: 'informational',
interpret: { type: 'string', pass: ['professional', 'informative'], partial: ['casual'] }
}
]
});
Naming Convention
The eval workflow name must end in _eval and match the pattern {workflow_name}_eval. The CLI resolves this automatically — output workflow test blog_generator looks for blog_generator_eval.
Dataset Files
For methodology on designing diverse datasets that cover failure-prone regions, see output-eval-dataset-design.
Datasets are YAML files in tests/datasets/. Each file represents one test case.
Basic Format
name: basic_input
input:
values:
- 1
- 2
- 3
- 4
- 5
last_output:
output:
result: 15
executionTimeMs: 100
date: '2026-02-13T00:00:00.000Z'
With Ground Truth
Ground truth provides expected values for evaluators. You can set global values and per-evaluator overrides:
name: stripe_blog
input:
topic: "Stripe the payment processor"
requirements: "Include a link to https://stripe.com/en-gb/pricing"
last_output:
output:
title: "Stripe: The Modern Payment Processing Platform"
blog_post: |
Stripe has revolutionized online payment processing...
executionTimeMs: 5000
date: '2026-02-16T00:00:00.000Z'
ground_truth:
notes: "Known good case"
evals:
length_of_output:
min_length: 100
evaluate_topic:
required_topic: "Stripe the payment processor"
evaluate_content:
required_content: "https://stripe.com/en-gb/pricing"
The ground_truth.evals.<evaluator_name> values are merged with the top-level ground truth and passed to the evaluator via context.ground_truth.
CLI Commands
output workflow test <workflow_name>
Runs evaluations against all datasets for a workflow.
| Flag | Description |
|---|---|
--cached |
Use cached output from dataset files (skip workflow execution) |
--save |
Run workflow fresh and save output + eval results back to dataset files |
--dataset <names> |
Comma-separated list of dataset names to run (default: all) |
--json |
Output machine-readable JSON instead of the rendered report |
Execution flow:
- Loads all dataset YAML files from
tests/datasets/ - Without
--cached: executes the workflow for each dataset to get fresh output - Sends all datasets to the
{workflow_name}_evalworkflow - Reports per-dataset and per-evaluator verdicts
- Exits with code 1 if any required evaluator fails
output workflow dataset list <workflow_name>
Lists all datasets for a workflow with their cached status.
| Flag | Description |
|---|---|
--format <type> |
Output format: table (default) or text |
--json |
Output machine-readable JSON |
output workflow dataset generate <workflow_name> [scenario]
Generates a new dataset file by running the workflow.
| Flag | Description |
|---|---|
--input <json> |
Workflow input as a JSON string or file path |
--name <name> |
Dataset filename (defaults to scenario name) |
--trace <path> |
Generate from a local trace file instead of running the workflow |
--download |
Download traces from S3 and convert to datasets |
--limit <n> |
Max traces to download from S3 (default: 5) |
Common Usage
# Generate dataset from inline JSON input
output workflow dataset generate my_workflow --input '{"key": "value"}' --name my_test
# Generate from a scenario file
output workflow dataset generate my_workflow basic
# Run evals with cached output (fast, no re-execution)
output workflow test my_workflow --cached
# Run evals fresh and save results
output workflow test my_workflow --save
# Run specific datasets only
output workflow test my_workflow --dataset happy_path,edge_case
# List all datasets
output workflow dataset list my_workflow
Typical Workflow
# 1. Start the dev server
npm run output:dev
# 2. Generate datasets from real workflow runs
output workflow dataset generate blog_generator --input '{"topic": "AI"}' --name ai_post
# 3. Edit the dataset YAML to add ground_truth values for your evaluators
# 4. Run evals with --save to cache output and eval results
output workflow test blog_generator --save
# 5. Iterate on evaluators, re-run with cached output (fast)
output workflow test blog_generator --cached
# 6. List all datasets
output workflow dataset list blog_generator
Verification Checklist
- [ ] Evaluators import
verify,Verdictfrom@outputai/evals(not@outputai/core) - [ ] Eval workflow imports
evalWorkflowfrom@outputai/evals - [ ] All imports use
.jsextension - [ ] Eval workflow name follows
{workflow_name}_evalpattern - [ ] Dataset YAML files are in
tests/datasets/ - [ ] Evaluator files are in
tests/evals/ - [ ] Each evaluator has a unique
namein snake_case - [ ]
criticalityis set to'required'or'informational'for each eval - [ ]
interprettype matches evaluator return type - [ ] Ground truth keys in dataset match evaluator names
- [ ] Judge
.promptfiles are intests/evals/alongside evaluators - [ ]
zis imported from@outputai/core(notzod)
Related Skills
output-dev-evaluator-function— Runtime evaluators usingevaluator()from@outputai/coreoutput-dev-scenario-file— Creating scenario JSON files for workflow executionoutput-dev-folder-structure— Understanding project directory layoutoutput-dev-prompt-file— Creating.promptfiles for LLM operationsoutput-eval-error-analysis— Identify failure modes before building evaluatorsoutput-eval-judge-prompt— Design effective LLM judge promptsoutput-eval-dataset-design— Generate diverse test datasetsoutput-eval-validate-judge— Validate LLM judges against human labelsoutput-eval-audit— Audit an existing eval suite for trustworthiness
原文・著作権は Anthropic および各プラグイン作者に帰属します。日本語訳は Claude API による自動翻訳です。