📋output-dev-evaluator-function
- プラグイン
- outputai
- ソース
- GitHub で見る ↗
説明
次のような場合に使用: 品質評価、バリデーションロジック、またはコンテンツ評価を実装する場合。 Output SDK ワークフロー向けに、`evaluators.ts` 内にエバリュエーター関数を作成します。
原文を表示
Create evaluator functions in evaluators.ts for Output SDK workflows. Use when implementing quality assessment, validation logic, or content evaluation.
ユースケース
- ✓品質評価を実装するとき
- ✓バリデーションロジックを実装するとき
- ✓コンテンツ評価を実装するとき
本文
Creating Evaluator Functions
Overview
This skill documents how to create evaluator functions in evaluators.ts for Output SDK workflows. Evaluators are used to assess quality, validate outputs, and provide confidence-scored judgments about workflow results.
When to Use This Skill
- Implementing quality assessment for workflow outputs
- Adding validation logic with confidence scores
- Creating LLM-powered content evaluation
- Building reusable evaluation components
File Organization
Option 1: Flat File (Default)
For smaller workflows, use a single evaluators.ts file:
src/workflows/{workflow-name}/
├── workflow.ts
├── steps.ts
├── evaluators.ts # All evaluators in one file
├── types.ts
└── ...
Option 2: Folder-Based (Large workflows)
For larger workflows with many evaluators, use an evaluators/ folder:
src/workflows/{workflow-name}/
├── workflow.ts
├── steps.ts
├── evaluators/ # Evaluators split into individual files
│ ├── quality.ts
│ ├── accuracy.ts
│ └── completeness.ts
├── types.ts
└── ...
Component Location Rules
Important: evaluator() calls MUST be in files containing 'evaluators' in the path:
src/workflows/my_workflow/evaluators.ts✓src/workflows/my_workflow/evaluators/quality.ts✓src/shared/evaluators/common_evaluators.ts✓src/workflows/my_workflow/helpers.ts✗ (cannot contain evaluator() calls)
Activity Isolation Constraints
Evaluators are Temporal activities with strict import rules to ensure deterministic replay.
Evaluators CAN import from:
- Local workflow files:
./utils.js,./types.js,./helpers.js - Local subdirectories:
./lib/helpers.js - Shared utilities:
../../shared/utils/*.js - Shared clients:
../../shared/clients/*.js - Shared services:
../../shared/services/*.js
Evaluators CANNOT import:
- Other evaluator files (activity isolation)
- Step files
- Workflow files
Example of WRONG imports:
// WRONG - evaluators cannot import other evaluators
import { otherEvaluator } from '../../shared/evaluators/other.js'; // ✗
import { anotherEvaluator } from './other_evaluators.js'; // ✗
Critical Import Patterns
Core Imports
// CORRECT - Import from @outputai/core
import {
evaluator,
z,
EvaluationBooleanResult,
EvaluationNumberResult,
EvaluationStringResult,
EvaluationFeedback
} from '@outputai/core';
// WRONG - Never import z from zod
import { z } from 'zod';
LLM Client Import (for LLM-powered evaluators)
// CORRECT - Use @outputai/llm wrapper
import { generateText, Output } from '@outputai/llm';
// WRONG - Never call LLM providers directly
import OpenAI from 'openai';
ES Module Imports
All imports MUST use .js extension:
// CORRECT
import { BlogContent } from './types.js';
// WRONG - Missing .js extension
import { BlogContent } from './types';
Basic Structure
import { evaluator, z, EvaluationBooleanResult } from '@outputai/core';
export const myEvaluator = evaluator( {
name: 'my_evaluator',
description: 'Description of what this evaluator assesses',
inputSchema: z.object( { /* input schema */ } ),
fn: async input => {
// Evaluation logic
return new EvaluationBooleanResult( {
value: true,
confidence: 0.95
} );
}
} );
Required Properties
name (string)
Unique identifier for the evaluator. Use snake_case.
name: 'evaluate_content_quality'
description (string)
Human-readable description of what the evaluator assesses.
description: 'Evaluate the quality and completeness of generated content'
inputSchema (Zod schema)
Schema for validating evaluator input.
inputSchema: z.object( {
content: z.string(),
expectedLength: z.number()
} )
fn (async function)
The evaluator execution function. Returns an evaluation result with value and confidence.
fn: async input => {
const isValid = input.content.length >= input.expectedLength;
return new EvaluationBooleanResult( {
value: isValid,
confidence: 0.95
} );
}
Result Types
EvaluationBooleanResult
Use for pass/fail or true/false evaluations:
import { EvaluationBooleanResult } from '@outputai/core';
return new EvaluationBooleanResult( {
value: true, // boolean result
confidence: 0.95, // 0.0 to 1.0
reasoning: 'Optional explanation of the evaluation'
} );
EvaluationNumberResult
Use for numeric scores or ratings:
import { EvaluationNumberResult } from '@outputai/core';
return new EvaluationNumberResult( {
value: 85, // numeric result (e.g., 0-100 score)
confidence: 0.85, // 0.0 to 1.0
reasoning: 'Optional explanation of the score'
} );
EvaluationStringResult
Use for categorical or text-based evaluations:
import { EvaluationStringResult } from '@outputai/core';
return new EvaluationStringResult( {
value: 'positive', // string result (e.g., category, sentiment, label)
confidence: 0.9, // 0.0 to 1.0
reasoning: 'Optional explanation of the classification'
} );
Result Properties
| Property | Type | Required | Description |
|---|---|---|---|
value |
boolean, number, or string |
Yes | The evaluation result |
confidence |
number (0.0-1.0) |
Yes | Confidence in the evaluation |
reasoning |
string |
No | Explanation of the evaluation |
name |
string |
No | Name for this specific result (useful in dimensions) |
feedback |
EvaluationFeedback[] |
No | Array of feedback objects with issues and suggestions |
dimensions |
EvaluationResult[] |
No | Nested results for multi-dimensional evaluation |
Simple Evaluator Examples
Boolean Evaluator - Content Validation
import { evaluator, z, EvaluationBooleanResult } from '@outputai/core';
export const evaluateCompleteness = evaluator( {
name: 'evaluate_completeness',
description: 'Check if content meets minimum length requirements',
inputSchema: z.object( {
content: z.string(),
minLength: z.number().default( 100 )
} ),
fn: async ( { content, minLength } ) => {
const isComplete = content.length >= minLength;
return new EvaluationBooleanResult( {
value: isComplete,
confidence: 1.0,
reasoning: isComplete ?
`Content has ${content.length} characters, meets minimum of ${minLength}` :
`Content has ${content.length} characters, below minimum of ${minLength}`
} );
}
} );
Boolean Evaluator - Pattern Detection
import { evaluator, z, EvaluationBooleanResult } from '@outputai/core';
export const evaluateGibberish = evaluator( {
name: 'evaluate_gibberish',
description: 'Check if a given string is gibberish',
inputSchema: z.string(),
fn: async content => {
const gibberishPatterns = [ 'foo', 'bar', 'lorem', 'ipsum' ];
const isGibberish = gibberishPatterns.some( p => content.toLowerCase().includes( p ) );
return new EvaluationBooleanResult( {
value: !isGibberish,
confidence: 0.95
} );
}
} );
Number Evaluator - Quality Score
import { evaluator, z, EvaluationNumberResult } from '@outputai/core';
export const evaluateReadability = evaluator( {
name: 'evaluate_readability',
description: 'Calculate readability score based on sentence structure',
inputSchema: z.object( {
content: z.string()
} ),
fn: async ( { content } ) => {
const sentences = content.split( /[.!?]+/ ).filter( s => s.trim() );
const words = content.split( /\s+/ ).filter( w => w.trim() );
const avgWordsPerSentence = words.length / Math.max( sentences.length, 1 );
// Simple readability score (lower avg words = more readable)
const score = Math.max( 0, Math.min( 100, 100 - ( avgWordsPerSentence - 15 ) * 5 ) );
return new EvaluationNumberResult( {
value: Math.round( score ),
confidence: 0.8,
reasoning: `Average ${avgWordsPerSentence.toFixed( 1 )} words per sentence`
} );
}
} );
String Evaluator - Sentiment Classification
import { evaluator, z, EvaluationStringResult } from '@outputai/core';
export const evaluateSentiment = evaluator( {
name: 'evaluate_sentiment',
description: 'Classify the sentiment of content',
inputSchema: z.object( {
content: z.string()
} ),
fn: async ( { content } ) => {
const positiveWords = [ 'great', 'excellent', 'amazing', 'good', 'love' ];
const negativeWords = [ 'bad', 'terrible', 'awful', 'hate', 'poor' ];
const lowerContent = content.toLowerCase();
const positiveCount = positiveWords.filter( w => lowerContent.includes( w ) ).length;
const negativeCount = negativeWords.filter( w => lowerContent.includes( w ) ).length;
const { sentiment, confidence } = positiveCount > negativeCount ?
{ sentiment: 'positive', confidence: Math.min( 0.95, 0.6 + positiveCount * 0.1 ) } :
negativeCount > positiveCount ?
{ sentiment: 'negative', confidence: Math.min( 0.95, 0.6 + negativeCount * 0.1 ) } :
{ sentiment: 'neutral', confidence: 0.7 };
return new EvaluationStringResult( {
value: sentiment,
confidence,
reasoning: `Found ${positiveCount} positive and ${negativeCount} negative indicators`
} );
}
} );
LLM-Powered Evaluator Examples
Note: Evaluators are self-contained components that don't share schemas across steps, so defining Output.object() schemas inline is acceptable here. For workflow steps that share schemas, define them in types.ts instead.
Using generateText with Output.object() for Evaluation
import { evaluator, z, EvaluationNumberResult } from '@outputai/core';
import { generateText, Output } from '@outputai/llm';
export const evaluateSignalToNoise = evaluator( {
name: 'evaluate_signal_to_noise',
description: 'Evaluate the signal-to-noise ratio of content',
inputSchema: z.object( {
title: z.string(),
content: z.string()
} ),
fn: async ( { title, content } ) => {
const { output } = await generateText( {
prompt: 'signal_noise@v1', // References prompts/signal_noise@v1.prompt
variables: {
title,
content
},
output: Output.object( {
schema: z.object( {
score: z.number().describe( 'Signal-to-noise score 0-100' )
} )
} )
} );
return new EvaluationNumberResult( {
value: output.score,
confidence: 0.85
} );
}
} );
LLM Boolean Evaluation
import { evaluator, z, EvaluationBooleanResult } from '@outputai/core';
import { generateText, Output } from '@outputai/llm';
export const evaluateFactualAccuracy = evaluator( {
name: 'evaluate_factual_accuracy',
description: 'Check if content contains factual claims that can be verified',
inputSchema: z.object( {
content: z.string(),
topic: z.string()
} ),
fn: async ( { content, topic } ) => {
const { output } = await generateText( {
prompt: 'factual_check@v1',
variables: { content, topic },
output: Output.object( {
schema: z.object( {
isFactual: z.boolean().describe( 'Whether content appears factually accurate' ),
confidence: z.number().describe( 'Confidence in assessment 0-1' ),
issues: z.array( z.string() ).optional().describe( 'Any factual issues found' )
} )
} )
} );
return new EvaluationBooleanResult( {
value: output.isFactual,
confidence: output.confidence,
reasoning: output.issues?.length ?
`Issues found: ${output.issues.join( ', ' )}` :
'No factual issues detected'
} );
}
} );
LLM String Evaluation - Content Classification
import { evaluator, z, EvaluationStringResult } from '@outputai/core';
import { generateText, Output } from '@outputai/llm';
export const evaluateContentCategory = evaluator( {
name: 'evaluate_content_category',
description: 'Classify content into a category',
inputSchema: z.object( {
content: z.string(),
categories: z.array( z.string() )
} ),
fn: async ( { content, categories } ) => {
const { output } = await generateText( {
prompt: 'categorize_content@v1',
variables: {
content,
categories: categories.join( ', ' )
},
output: Output.object( {
schema: z.object( {
category: z.string().describe( 'The best matching category' ),
confidence: z.number().describe( 'Confidence in classification 0-1' ),
explanation: z.string().describe( 'Why this category was chosen' )
} )
} )
} );
return new EvaluationStringResult( {
value: output.category,
confidence: output.confidence,
reasoning: output.explanation
} );
}
} );
EvaluationResult with Feedback
Use the feedback field to provide actionable improvement suggestions alongside your evaluation result. Import EvaluationFeedback from @outputai/core to create feedback objects.
import { evaluator, z, EvaluationStringResult, EvaluationFeedback } from '@outputai/core';
export const evaluateWithFeedback = evaluator( {
name: 'evaluate_with_feedback',
description: 'Evaluate content quality and provide actionable feedback',
inputSchema: z.string(),
fn: async response => {
const feedback = [];
if ( response.length < 50 ) {
feedback.push( new EvaluationFeedback( {
issue: 'Response is too short',
suggestion: 'Expand the response with more detail',
priority: 'medium'
} ) );
}
return new EvaluationStringResult( {
value: feedback.length === 0 ? 'good' : 'needs_improvement',
confidence: 0.85,
feedback: feedback
} );
}
} );
EvaluationFeedback Properties
| Property | Type | Description |
|---|---|---|
issue |
string |
The problem identified |
suggestion |
string |
Recommended fix |
priority |
string |
Priority level (e.g., 'low', 'medium', 'high') |
Multi-Dimensional Evaluation
Use the dimensions field to nest EvaluationResult instances for sub-scores. Each dimension should use the name field to identify it.
import { evaluator, z, EvaluationStringResult, EvaluationNumberResult } from '@outputai/core';
export const evaluateMultiDimensional = evaluator( {
name: 'evaluate_multi_dimensional',
description: 'Evaluate content across multiple quality dimensions',
inputSchema: z.string(),
fn: async response => {
const coherenceScore = calculateCoherence( response );
const relevanceScore = calculateRelevance( response );
const overallScore = ( coherenceScore + relevanceScore ) / 2;
return new EvaluationStringResult( {
value: overallScore > 0.7 ? 'high_quality' : 'low_quality',
confidence: 0.9,
dimensions: [
new EvaluationNumberResult( {
value: coherenceScore,
confidence: 0.85,
name: 'coherence'
} ),
new EvaluationNumberResult( {
value: relevanceScore,
confidence: 0.88,
name: 'relevance'
} )
]
} );
}
} );
Complete Example
Based on a real workflow evaluator file:
import { evaluator, z, EvaluationBooleanResult, EvaluationNumberResult } from '@outputai/core';
import { generateText, Output } from '@outputai/llm';
import { blogContentSchema } from './types.js';
import type { BlogContent, QualityMetrics } from './types.js';
// Simple boolean evaluator
export const evaluateMinimumLength = evaluator( {
name: 'evaluate_minimum_length',
description: 'Check if blog content meets minimum length requirements',
inputSchema: blogContentSchema,
fn: async ( input: BlogContent ) => {
const MIN_TOKENS = 500;
const meetsRequirement = input.tokenCount >= MIN_TOKENS;
return new EvaluationBooleanResult( {
value: meetsRequirement,
confidence: 1.0,
reasoning: `Content has ${input.tokenCount} tokens (minimum: ${MIN_TOKENS})`
} );
}
} );
// LLM-powered number evaluator
export const evaluateSignalToNoise = evaluator( {
name: 'evaluate_signal_to_noise',
description: 'Evaluate the signal-to-noise ratio of blog content',
inputSchema: blogContentSchema,
fn: async ( input: BlogContent ) => {
const { output } = await generateText( {
prompt: 'signal_noise@v1',
variables: {
title: input.title,
content: input.content
},
output: Output.object( {
schema: z.object( {
score: z.number().describe( 'Signal-to-noise score 0-100' )
} )
} )
} );
return new EvaluationNumberResult( {
value: output.score,
confidence: 0.85
} );
}
} );
// LLM-powered boolean evaluator
export const evaluateRelevance = evaluator( {
name: 'evaluate_relevance',
description: 'Check if content is relevant to the stated topic',
inputSchema: z.object( {
content: z.string(),
topic: z.string(),
keywords: z.array( z.string() )
} ),
fn: async ( { content, topic, keywords } ) => {
const { output } = await generateText( {
prompt: 'relevance_check@v1',
variables: { content, topic, keywords: keywords.join( ', ' ) },
output: Output.object( {
schema: z.object( {
isRelevant: z.boolean(),
relevanceScore: z.number().describe( 'Relevance score 0-1' ),
explanation: z.string()
} )
} )
} );
return new EvaluationBooleanResult( {
value: output.isRelevant,
confidence: output.relevanceScore,
reasoning: output.explanation
} );
}
} );
Best Practices
1. Use Appropriate Result Types
// Boolean for pass/fail decisions
return new EvaluationBooleanResult( { value: true, confidence: 0.9 } );
// Number for scores and ratings
return new EvaluationNumberResult( { value: 85, confidence: 0.85 } );
// String for categories, labels, or classifications
return new EvaluationStringResult( { value: 'positive', confidence: 0.9 } );
2. Provide Meaningful Confidence Scores
// High confidence for deterministic checks
confidence: 1.0 // e.g., length checks, pattern matching
// Medium confidence for heuristic-based evaluations
confidence: 0.85 // e.g., LLM-based assessments
// Lower confidence for uncertain evaluations
confidence: 0.7 // e.g., subjective quality judgments
3. Include Reasoning for Transparency
return new EvaluationBooleanResult( {
value: false,
confidence: 0.95,
reasoning: `Content contains ${errorCount} grammatical errors, exceeding threshold of ${maxErrors}`
} );
4. Keep Evaluators Focused
// Good - single responsibility
export const evaluateGrammar = evaluator( { ... } );
export const evaluateReadability = evaluator( { ... } );
export const evaluateTone = evaluator( { ... } );
// Avoid - doing too much in one evaluator
export const evaluateEverything = evaluator( { ... } );
5. Use Descriptive Names
// Good - clear what is being evaluated
name: 'evaluate_content_originality'
name: 'evaluate_factual_accuracy'
name: 'evaluate_sentiment_alignment'
// Avoid - vague names
name: 'check'
name: 'validate'
name: 'evaluate_stuff'
6. Use Feedback for Actionable Improvements
feedback: [
new EvaluationFeedback( {
issue: 'Missing conclusion paragraph',
suggestion: 'Add a summary paragraph at the end',
priority: 'high'
} )
]
7. Use Dimensions for Multi-Criteria Evaluation
dimensions: [
new EvaluationNumberResult( { value: 8, confidence: 0.9, name: 'coherence' } ),
new EvaluationNumberResult( { value: 6, confidence: 0.85, name: 'relevance' } )
]
Verification Checklist
- [ ]
evaluator,z, result types imported from@outputai/core - [ ]
generateTextandOutputimported from@outputai/llmif using LLM (not direct provider) - [ ] LLM output schemas use
.describe()instead of.min()/.max()onz.number() - [ ] All imports use
.jsextension - [ ] Named exports used for each evaluator
- [ ] Each evaluator has
name,description,inputSchema,fn - [ ] Evaluator name uses
snake_case - [ ] Returns appropriate result type (
EvaluationBooleanResult,EvaluationNumberResult, orEvaluationStringResult) - [ ] Confidence score between 0.0 and 1.0
- [ ] Evaluators only import allowed dependencies (local files, shared code)
- [ ] No imports of other evaluators, steps, or workflows
- [ ]
EvaluationFeedbackimported from@outputai/corewhen using feedback - [ ] Feedback objects include
issue,suggestion, andpriority - [ ] Dimensions use the
namefield to identify sub-evaluations
Related Skills
output-dev-workflow-function- Orchestrating evaluators in workflow.tsoutput-dev-step-function- Creating step functionsoutput-dev-types-file- Defining evaluator input schemasoutput-dev-prompt-file- Creating prompt files for LLM-powered evaluatorsoutput-dev-folder-structure- Understanding project layoutoutput-eval-error-analysis— Identify what to evaluate before writing evaluators
原文・著作権は Anthropic および各プラグイン作者に帰属します。日本語訳は Claude API による自動翻訳です。