
📊model-evaluation

Plugin: sagemaker-ai

Description

Generates Python code that evaluates SageMaker models. Supports two evaluation types: LLM-as-Judge and Custom Scorer. Use when the user says "evaluate my model", "run a benchmark", "test model performance", "how did my model perform", "compare models", or other similar requests.

Use cases

✓ Evaluate a SageMaker model
✓ Test model performance
✓ Run a benchmark
✓ Compare multiple models
✓ Review model results

Model Evaluation

Generate code that evaluates a SageMaker model.

Prerequisites

The SDK environment has been verified (SDK version, region, execution role). If not done, activate the sdk-getting-started skill first.

Principles

One thing at a time. Each response advances exactly one decision. Never combine multiple questions in a single turn.
Confirm before proceeding. Wait for the user to agree before moving to the next step.
Don't read files until you need them. Only read reference files when you've reached the step that requires them.
Don't ask what you already know. If the answer is in conversation history, workflow_state.json, plan.md, or any file you've already read — use it. Confirm if unsure, but don't re-ask.
No narration. Share outcomes and ask questions. Keep responses short.
No repetition. If you said something before a tool call, don't repeat it after.

Scope

This skill supports the evaluation feature for SageMaker Serverless Model Customization. It can evaluate any base or fine-tuned model supported by SageMaker serverless model customization — both OSS models (Llama, Mistral, Qwen, etc.) and Nova models.

Tell the user when the skill is activated:

"I can help evaluate any base or fine-tuned model supported by SageMaker serverless model customization."

If the user requests help evaluating a model that isn't supported by SageMaker serverless model customization, explain that it is not supported by this skill.

Evaluation Types

There are two evaluation types:

LLM-as-Judge — an LLM grades your model's responses. (OSS models only — not supported for Nova.)
Custom Scorer — programmatic evaluation via Lambda function (includes built-in math and code scorers). Works with both OSS and Nova models.
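The compatibility rules above can be written as a small lookup table. This is a minimal sketch, not part of the skill itself: the names `SUPPORTED_FAMILIES` and `is_supported` are hypothetical, and the labels "oss" and "nova" are shorthand chosen for illustration.

```python
# Hypothetical lookup table mirroring the two evaluation types described above.
SUPPORTED_FAMILIES = {
    "LLM-as-Judge": {"oss"},           # OSS models only
    "Custom Scorer": {"oss", "nova"},  # both model families
}


def is_supported(evaluation_type: str, model_family: str) -> bool:
    """Return True if the evaluation type supports the given model family."""
    return model_family.lower() in SUPPORTED_FAMILIES.get(evaluation_type, set())


print(is_supported("LLM-as-Judge", "nova"))   # False
print(is_supported("Custom Scorer", "nova"))  # True
```

An unknown evaluation type returns False rather than raising, so the caller can treat it the same way as an unsupported combination.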

Workflow

Step 1: Determine evaluation type

Do you already know which evaluation type to use?

Check conversation history, plan.md, workflow_state.json, or anything else you've already read.

If yes: confirm with the user.

"It sounds like you want to run [evaluation type]. Is that right?"

⏸ Wait for confirmation. If confirmed → go to Step 2.

If no: ask.

"What kind of evaluation would you like to run? I support:

LLM-as-Judge — an LLM grades your model's responses

Custom Scorer — programmatic scoring (math, code, or your own logic)

Pick one, or say 'help me decide' if you're not sure."

⏸ Wait for user.

If the user picks one → go to Step 2.
If the user indicates uncertainty (for example, "help me decide," "whatever you think," or "I'm not sure") → read references/evaluation-type-guide.md and follow its instructions. It will guide the user to a choice and then return here. You MUST NEVER recommend an evaluation type to the user without first reading references/evaluation-type-guide.md.
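Step 1 is a small decision procedure, which the following sketch makes explicit. It is illustrative only: the function name `step_one_action` and the returned action labels are hypothetical. The source does not say what happens when the user rejects a confirmation, so the sketch does not model that case.

```python
from typing import Optional

EVALUATION_TYPES = ("LLM-as-Judge", "Custom Scorer")
GUIDE = "references/evaluation-type-guide.md"


def step_one_action(
    known_type: Optional[str] = None,   # type inferred from history or files
    picked_type: Optional[str] = None,  # type the user picked or confirmed
    uncertain: bool = False,            # user said e.g. "help me decide"
) -> str:
    """Return a (hypothetical) label for the next action in Step 1."""
    if uncertain:
        # Never recommend a type without reading the guide first.
        return f"read:{GUIDE}"
    if picked_type in EVALUATION_TYPES:
        return "go_to_step_2"
    if known_type in EVALUATION_TYPES:
        return f"confirm:{known_type}"  # ask "Is that right?" and wait
    return "ask"                        # present both types and wait


print(step_one_action())                              # ask
print(step_one_action(known_type="Custom Scorer"))    # confirm:Custom Scorer
print(step_one_action(picked_type="LLM-as-Judge"))    # go_to_step_2
```

Checking `uncertain` first means an uncertain user always goes through the guide, even when a type was inferred from earlier context.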

Step 2: Validate and hand off to evaluation workflow

Before reading the reference file, validate that the chosen evaluation type is compatible with the user's situation. You may already know these answers from conversation context — don't ask if you don't need to.

LLM-as-Judge validation

What model type are we evaluating? LLM-as-Judge is not supported for Nova models. To determine model type (if you don't already know it):
- If you have the training job name or ARN, use the AWS MCP tool list-tags on the training job ARN and look for the sagemaker-studio:jumpstart-model-id tag. Contains "nova" → Nova. Anything else → OSS.
- If you have a Model Package ARN, use the AWS MCP tool describe-model-package and check the model description or source tags.
- If neither is available, ask the user.
Does the user have an evaluation dataset? LLM-as-Judge requires one.
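The tag check described above can be sketched as a pure function. It assumes the tags arrive as a list of `{"Key": ..., "Value": ...}` dictionaries, the shape returned by the SageMaker ListTags API. The function name, the case-insensitive match, and returning `None` when the tag is missing (so that the user is asked instead) are assumptions of this sketch. The model IDs in the usage lines are made up.

```python
from typing import Dict, List, Optional

JUMPSTART_TAG = "sagemaker-studio:jumpstart-model-id"


def model_family_from_tags(tags: List[Dict[str, str]]) -> Optional[str]:
    """Classify a training job as "nova" or "oss" from its resource tags.

    Returns None when the JumpStart model ID tag is absent (an assumption:
    in that case the model type is still unknown and the user is asked).
    """
    for tag in tags:
        if tag.get("Key") == JUMPSTART_TAG:
            # Case-insensitive match is an assumption of this sketch.
            return "nova" if "nova" in tag.get("Value", "").lower() else "oss"
    return None


# Made-up tag values, for illustration only.
print(model_family_from_tags([{"Key": JUMPSTART_TAG, "Value": "example-nova-model"}]))   # nova
print(model_family_from_tags([{"Key": JUMPSTART_TAG, "Value": "example-llama-model"}]))  # oss
print(model_family_from_tags([{"Key": "team", "Value": "ml"}]))                          # None
```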

Custom Scorer validation

Does the user have an evaluation dataset? Custom Scorer requires one. (Works with both OSS and Nova models, though for Nova only custom Lambda functions are supported.)

If validation fails, tell the user which requirement(s) aren't met and offer alternatives:

"[Evaluation type] won't work because [reason]."

If validation failed because the user has no evaluation dataset, there is nothing we can do. Inform the user:

"Unfortunately all of the supported eval types require an eval dataset. I can't help you with model evaluation."

If the failure reason is something else, offer to help them pick a different evaluation type.

⏸ Wait for user.

If the user wants help choosing a different evaluation type → read references/evaluation-type-guide.md.
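The Step 2 checks and the two failure messages can be sketched as follows. This is a minimal illustration under stated assumptions: `validate`, `failure_message`, and the reason strings are hypothetical names, and the model family is assumed to be already known as "oss" or "nova".

```python
from typing import List

NO_DATASET = "no evaluation dataset is available"
NOVA_UNSUPPORTED = "it is not supported for Nova models"


def validate(evaluation_type: str, model_family: str, has_dataset: bool) -> List[str]:
    """Return the unmet requirements; an empty list means validation passed."""
    problems = []
    if evaluation_type == "LLM-as-Judge" and model_family == "nova":
        problems.append(NOVA_UNSUPPORTED)
    if not has_dataset:
        # Both evaluation types require an evaluation dataset.
        problems.append(NO_DATASET)
    return problems


def failure_message(evaluation_type: str, problems: List[str]) -> str:
    """Build the message shown to the user when validation fails."""
    if NO_DATASET in problems:
        # No alternative exists: every supported type needs a dataset.
        return ("Unfortunately all of the supported eval types require an "
                "eval dataset. I can't help you with model evaluation.")
    return f"{evaluation_type} won't work because {' and '.join(problems)}."


print(validate("Custom Scorer", "nova", True))  # []
print(failure_message("LLM-as-Judge", validate("LLM-as-Judge", "nova", True)))
# LLM-as-Judge won't work because it is not supported for Nova models.
```

The missing-dataset case is checked first in `failure_message` because it rules out every evaluation type, so offering a different type would not help.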

If validation passes, read the corresponding reference file:

User chose	Read
LLM-as-Judge	`references/llmaaj-evaluation.md`
Custom Scorer	`references/custom-scorer-evaluation.md`

Follow the reference file's instructions from the beginning.
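The hand-off table above amounts to a dictionary lookup. The sketch below uses the two file paths from the table; the names `REFERENCE_FILES` and `reference_file_for` are hypothetical.

```python
# Paths taken from the table above; the surrounding names are illustrative.
REFERENCE_FILES = {
    "LLM-as-Judge": "references/llmaaj-evaluation.md",
    "Custom Scorer": "references/custom-scorer-evaluation.md",
}


def reference_file_for(evaluation_type: str) -> str:
    """Return the reference file to read once validation has passed."""
    try:
        return REFERENCE_FILES[evaluation_type]
    except KeyError:
        raise ValueError(f"Unsupported evaluation type: {evaluation_type!r}") from None


print(reference_file_for("Custom Scorer"))  # references/custom-scorer-evaluation.md
```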

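The original text and its copyright belong to Anthropic and the respective plugin authors. The Japanese translation was generated automatically with the Claude API.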