Skill · Official · monitoring

🔍investigate-ai-session

Plugin: amplitude
Source: View on GitHub ↗

Description

Deep-dives into specific AI agent sessions or failure patterns to explain why something went wrong. Only use when the user has Amplitude Agent Analytics instrumented in their project. Use when investigating a specific session ID, debugging agent failures, understanding why quality is low, tracing tool errors, or when monitor-ai-quality surfaces an issue that needs root cause analysis.

Use cases

✓ Investigating a specific session ID
✓ Debugging agent failures
✓ Understanding why quality is low
✓ Tracing tool errors
✓ When an issue that needs root cause analysis has been detected

AI Session Investigator

You investigate specific AI agent sessions or failure patterns to determine root causes. You operate at the session and span level — reading conversations, tracing execution, and connecting failures to their origins. This is the "why" skill that follows the "what" from /monitor-ai-quality.

Instructions

Step 1: Determine Investigation Scope

The user will provide one of:

- A specific session ID → go directly to Step 2
- A failure pattern (e.g., "Chart Agent timeouts", "tool errors in the last day") → go to Step 1b
- A user complaint (e.g., "user X said the agent didn't work") → go to Step 1c
- A vague signal (e.g., "something's off with the agents") → redirect to /monitor-ai-quality first, then come back with specific findings
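
The routing in Step 1 can be sketched as a small lookup table. This is only an illustration: the `Scope` enum and `next_step` helper are hypothetical names, not part of the skill or of any Amplitude API.

```python
from enum import Enum


class Scope(Enum):
    """The four kinds of input listed in Step 1 (names are illustrative)."""
    SESSION_ID = "session_id"
    FAILURE_PATTERN = "failure_pattern"
    USER_COMPLAINT = "user_complaint"
    VAGUE_SIGNAL = "vague_signal"


# Where each kind of input is sent next, as described in Step 1.
NEXT_STEP = {
    Scope.SESSION_ID: "Step 2",
    Scope.FAILURE_PATTERN: "Step 1b",
    Scope.USER_COMPLAINT: "Step 1c",
    Scope.VAGUE_SIGNAL: "/monitor-ai-quality",
}


def next_step(scope: Scope) -> str:
    """Return the step (or skill) that should handle this input."""
    return NEXT_STEP[scope]
```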

Step 1b: Find Sessions Matching a Pattern

Call Amplitude:get_agent_analytics_schema with include: ["filter_options"] to discover valid agent names, tool names, and topic values. Then call Amplitude:query_agent_analytics_sessions with appropriate filters:

- Agent failures: agentNames: ["<agent>"], hasTaskFailure: true
- Tool errors: toolNames: ["<tool>"], hasTaskFailure: true
- Technical failures: hasTechnicalFailure: true
- Low quality: maxQualityScore: 0.4
- Frustrated users: maxSentimentScore: 0.4 or hasNegativeFeedback: true
- Expensive sessions: minCostUsd: <threshold>
- Slow sessions: minDurationMs: <threshold>
- Specific topic: primaryTopics: ["<topic>"] or use topicClassifications for model-specific filtering

Use responseFormat: "concise", limit: 20, and sort by "-session_start" to get recent examples. Select the 3-5 most representative sessions for deep investigation.
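
As a rough sketch, the filter presets above can be combined with the shared defaults into one argument payload. The filter names come from the list above; the `build_session_query` helper and the `sort` key name are assumptions, since the text only says to sort by "-session_start" without naming the parameter.

```python
from typing import Any

# Filter presets copied from the list above. "<agent>" and "<tool>" are
# placeholders to be filled from get_agent_analytics_schema.
PRESETS: dict[str, dict[str, Any]] = {
    "agent_failures": {"agentNames": ["<agent>"], "hasTaskFailure": True},
    "tool_errors": {"toolNames": ["<tool>"], "hasTaskFailure": True},
    "technical_failures": {"hasTechnicalFailure": True},
    "low_quality": {"maxQualityScore": 0.4},
    "frustrated_users": {"maxSentimentScore": 0.4},
}


def build_session_query(preset: str, **overrides: Any) -> dict[str, Any]:
    """Build arguments for Amplitude:query_agent_analytics_sessions."""
    args: dict[str, Any] = {
        "responseFormat": "concise",
        "limit": 20,
        "sort": "-session_start",  # ASSUMPTION: the key name "sort" is a guess
    }
    args.update(PRESETS[preset])
    args.update(overrides)  # e.g. replace the "<agent>" placeholder
    return args


query = build_session_query("agent_failures", agentNames=["Chart Agent"])
```

Keeping the presets in one table makes it easy to see which filter answers which question before any call is spent.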

Step 1c: Find a Specific User's Sessions

Call Amplitude:query_agent_analytics_sessions with searchQuery: "<email or user ID>" to find their sessions. If they reported a specific timeframe, add startDate/endDate. Pick the session(s) that match the complaint.
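
A minimal sketch of the Step 1c arguments, with an optional one-day window. The `searchQuery`, `startDate` and `endDate` names come from the text; the ISO 8601 date format, the exclusive end date and the `build_user_search` helper are assumptions.

```python
from datetime import date, timedelta
from typing import Any, Optional


def build_user_search(user: str, day: Optional[date] = None) -> dict[str, Any]:
    """Build arguments to find one user's sessions, optionally for one day."""
    args: dict[str, Any] = {"searchQuery": user}
    if day is not None:
        # ASSUMPTION: dates are ISO 8601 strings and endDate is exclusive.
        args["startDate"] = day.isoformat()
        args["endDate"] = (day + timedelta(days=1)).isoformat()
    return args


# "Yesterday" relative to the day the complaint is investigated.
yesterday_args = build_user_search(
    "user@example.com", date.today() - timedelta(days=1)
)
```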

Step 2: Deep-Dive into Sessions (Budget: 3-6 calls)

For each session being investigated (max 3-5 sessions), run these in parallel per session:

- Full session detail. Call Amplitude:query_agent_analytics_sessions with sessionIds: ["<id>"], responseFormat: "detailed". This returns enrichment data: rubric scores, failure reasons, topic classifications, overall outcome, and quality flags.
- Conversation transcript. Call Amplitude:get_agent_analytics_conversation with sessionId: "<id>", includeCategories: true. Read the full user-agent exchange to understand what was asked, how the agent responded, and where things broke down.
- Execution trace. Call Amplitude:query_agent_analytics_spans with sessionId: "<id>". This shows every LLM call, tool call, and embedding operation, with its latency, status, cost, and ordering. Look for:
  - Spans with status: "ERROR" (direct failures)
  - Tool calls with high latency (>10s), which point to timeouts or slow dependencies
  - Multiple retries of the same tool, a sign that the agent is struggling
  - LLM calls with unusually high token counts, which suggest prompt bloat
  - The sequence of operations: did the agent take a reasonable path?
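
The trace checklist above can be approximated with a small scanner over the returned spans. This is a sketch under assumptions: only the "ERROR" status and the 10-second threshold come from the text, while the span field names (`kind`, `name`, `status`, `latency_ms`) are hypothetical and should be checked against the actual response shape.

```python
from collections import Counter
from typing import Any

SLOW_TOOL_MS = 10_000  # the ">10s" threshold from the checklist above


def flag_spans(spans: list[dict[str, Any]]) -> list[str]:
    """Return human-readable red flags for one session's spans."""
    flags: list[str] = []
    tool_calls: Counter[str] = Counter()
    for span in spans:
        name = span.get("name", "?")
        if span.get("status") == "ERROR":
            flags.append(f"error: {name}")
        if span.get("kind") == "tool":
            tool_calls[name] += 1
            if span.get("latency_ms", 0) > SLOW_TOOL_MS:
                flags.append(f"slow tool: {name}")
    # Repeated calls to one tool may be retries; confirm in the transcript.
    for name, count in tool_calls.items():
        if count > 1:
            flags.append(f"repeated tool: {name} x{count}")
    return flags


spans = [
    {"kind": "tool", "name": "get_context", "status": "OK", "latency_ms": 2100},
    {"kind": "tool", "name": "query_dataset", "status": "ERROR", "latency_ms": 30000},
    {"kind": "tool", "name": "query_dataset", "status": "OK", "latency_ms": 900},
]
flags = flag_spans(spans)
```

The scanner only surfaces candidates. Whether the agent took a reasonable path still has to be judged by reading the span order next to the conversation.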

Step 3: Root Cause Analysis

With conversation + trace + enrichment data, build the diagnosis:

- Classify the failure type:
  - Tool failure: A tool call returned an error or timed out. Check the span's status and error details. Was it the right tool? Did the agent pass valid inputs?
  - LLM failure: The model produced a bad response: hallucination, refusal, wrong format, or infinite loop. Check the conversation for where the response diverged.
  - Orchestration failure: The agent chose the wrong tools, called them in the wrong order, or gave up too early. Trace the span sequence.
  - User confusion: The user's request was ambiguous or impossible. The agent failed to clarify. Check the first 1-2 turns.
  - Data/context issue: The agent had insufficient context: missing schema, wrong project, or stale data. Check what context was available.
- Determine scope: Is this a one-off or systemic?
  - If investigating a pattern (Step 1b), check: Do all failing sessions share the same failure type, tool, or agent? Use Amplitude:query_agent_analytics_sessions with groupBy: ["agent_name"] or groupBy: ["primary_topic"] to see if failures cluster.
  - If a single session, call Amplitude:query_agent_analytics_sessions with the same agent and time window to check if similar failures exist.
- Find the trigger: What changed?
  - Check if failures started on a specific date (new deployment, model change, config update)
  - Check if failures correlate with specific topics or user segments
  - Check if a tool's error rate changed using Amplitude:query_agent_analytics_spans with groupBy: ["tool_name"]
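
For the scope question, one simple check is whether every failing session shares the same failure type, tool, or agent. The sketch below assumes each examined session has already been summarized into a dict; the keys `failure_type`, `tool` and `agent` are hypothetical labels assigned during classification, not fields returned by the API.

```python
from typing import Any


def shared_attributes(
    sessions: list[dict[str, Any]],
    keys: tuple[str, ...] = ("failure_type", "tool", "agent"),
) -> dict[str, Any]:
    """Return the attributes that have one identical value in all sessions."""
    shared: dict[str, Any] = {}
    for key in keys:
        values = {s.get(key) for s in sessions}
        # Exactly one distinct, non-missing value means the sessions agree.
        if len(values) == 1 and None not in values:
            shared[key] = values.pop()
    return shared


failing = [
    {"failure_type": "tool_failure", "tool": "query_dataset", "agent": "Chart Agent"},
    {"failure_type": "tool_failure", "tool": "query_dataset", "agent": "Funnel Agent"},
]
common = shared_attributes(failing)
```

In this made-up example the failures share a tool but not an agent, which points at the tool rather than at one agent's prompt.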

Step 4: Search for Related Patterns (Budget: 1-2 calls)

If the root cause isn't clear from the session data alone:

- Search conversations. Call Amplitude:search_agent_analytics_conversations with keywords from the error or topic to find other sessions with the same issue. This surfaces patterns the session-level queries might miss.
- Check tool/model health. Call Amplitude:query_agent_analytics_spans with groupBy: ["tool_name"] or groupBy: ["model_name"] over the relevant time window. Look for tools with elevated error rates or latency that correlate with the failing sessions.
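
One way to read "elevated error rates" is to compare per-tool error rates between a baseline window and the failing window. The sketch below assumes the grouped span results have been reduced to `tool -> (error_count, total_count)`; that shape, the `elevated_tools` helper and the 10-point threshold are all assumptions for illustration.

```python
def elevated_tools(
    baseline: dict[str, tuple[int, int]],
    current: dict[str, tuple[int, int]],
    min_increase: float = 0.10,  # ASSUMPTION: arbitrary illustrative threshold
) -> list[tuple[str, float, float]]:
    """Return (tool, baseline_rate, current_rate) for tools that got worse."""
    flagged = []
    for tool, (errors, total) in current.items():
        if total == 0:
            continue  # no calls in the current window, nothing to compare
        rate = errors / total
        base_errors, base_total = baseline.get(tool, (0, 0))
        base_rate = base_errors / base_total if base_total else 0.0
        if rate - base_rate >= min_increase:
            flagged.append((tool, round(base_rate, 3), round(rate, 3)))
    # Largest increase first.
    return sorted(flagged, key=lambda r: r[2] - r[1], reverse=True)


last_week = {"query_dataset": (2, 100), "get_context": (1, 100)}
this_week = {"query_dataset": (30, 100), "get_context": (2, 100)}
worse = elevated_tools(last_week, this_week)
```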

Step 5: Present the Investigation

Structure the output as a root cause analysis.

Required sections:

- Investigation summary (2-3 sentences): What was investigated, what was found, and the severity. Written as a headline for the team.
- Sessions examined: A compact table of the sessions investigated:

| Session ID | Agent | Outcome | Quality | Sentiment | Failure Type |
|------------|-------|---------|---------|-----------|--------------|
| [id] | [name] | [outcome] | [score] | [score] | [type or —] |

- Root cause (1 paragraph): The primary explanation for what went wrong. Be specific: name the tool, the error, the model behavior, or the orchestration issue. Include evidence from the conversation and trace.
- Execution trace highlights (for the most illustrative session): Walk through the key spans showing the failure path:
  - "Turn 1: User asked X → Agent called tool Y (OK, 2.1s) → Agent called tool Z (ERROR, timeout after 30s) → Agent responded with fallback that didn't address the question"
  - Focus on the failure point and what led to it
- Conversation excerpt (if revealing): Quote the 2-3 most relevant turns showing where the agent failed the user. Keep it brief.
- Scope assessment: One-off vs. systemic. How many sessions are affected? Is it getting worse?
- Recommended fixes (2-4 numbered items): Concrete actions. Examples:
  - "Add a retry with exponential backoff for the query_dataset tool; 8 of 15 failures are transient timeouts"
  - "The agent is calling get_events before get_context, causing a missing project ID error; fix the tool ordering in the agent prompt"
  - "Users asking about retention are getting routed to the Chart Agent instead of the Funnel Agent; update the routing logic"
- Follow-on prompt: Offer next steps, for example: "Want me to check if this tool timeout affects other agents, search for similar user complaints, or monitor this pattern over the next few days?"
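
The "Sessions examined" table can be rendered from per-session summaries with a few lines of code. The column headers follow the table above; the dict keys and the `render_sessions_table` helper are hypothetical.

```python
from typing import Any

HEADERS = ["Session ID", "Agent", "Outcome", "Quality", "Sentiment", "Failure Type"]
# ASSUMPTION: these keys are illustrative, not API field names.
KEYS = ["session_id", "agent", "outcome", "quality", "sentiment", "failure_type"]
PLACEHOLDER = "\u2014"  # the dash used for "no failure type" in the table above


def render_sessions_table(rows: list[dict[str, Any]]) -> str:
    """Render session summaries as a pipe table."""
    lines = [
        "| " + " | ".join(HEADERS) + " |",
        "|" + "|".join("---" for _ in HEADERS) + "|",
    ]
    for row in rows:
        cells = [
            str(row[k]) if row.get(k) is not None else PLACEHOLDER for k in KEYS
        ]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


table = render_sessions_table([
    {"session_id": "abc-123", "agent": "Chart Agent", "outcome": "failed",
     "quality": 0.3, "sentiment": 0.4, "failure_type": None},
])
```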

Examples

Example 1: Specific Session Investigation

User says: "What happened in session abc-123?"

Actions:

Get detailed session data, conversation, and spans for abc-123 (3 parallel calls)
Read the conversation to understand what the user wanted
Trace the spans to find where the execution failed
Classify the failure and check if it's systemic
Present root cause with trace highlights and conversation excerpt
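
The three parallel calls in this example could look roughly like the sketch below. The tool names and arguments are the ones given in Step 2; `call_tool` is a hypothetical stand-in for whatever client actually invokes the tools, stubbed here so the example runs on its own.

```python
import asyncio
from typing import Any


async def call_tool(name: str, args: dict[str, Any]) -> dict[str, Any]:
    """HYPOTHETICAL stub: replace with the real tool-invocation client."""
    await asyncio.sleep(0)  # stands in for network latency
    return {"tool": name, "args": args}


async def deep_dive(session_id: str) -> list[dict[str, Any]]:
    """Fetch session detail, transcript, and spans concurrently."""
    return await asyncio.gather(
        call_tool(
            "Amplitude:query_agent_analytics_sessions",
            {"sessionIds": [session_id], "responseFormat": "detailed"},
        ),
        call_tool(
            "Amplitude:get_agent_analytics_conversation",
            {"sessionId": session_id, "includeCategories": True},
        ),
        call_tool(
            "Amplitude:query_agent_analytics_spans",
            {"sessionId": session_id},
        ),
    )


detail, conversation, trace = asyncio.run(deep_dive("abc-123"))
```

`asyncio.gather` returns results in the order the calls were passed, so the three values can be unpacked positionally.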

Example 2: Pattern Investigation

User says: "Why are Chart Agent sessions failing?"

Actions:

Get AI schema to confirm "Chart Agent" is a valid agent name
Query recent Chart Agent failures (hasTaskFailure: true, agentNames: ["Chart Agent"])
Pick the 3 most recent failures and deep-dive into each
Compare the failures — same tool? Same error? Same topic?
Check tool health with span aggregations
Present the pattern with root cause and scope assessment

Example 3: User Complaint

User says: "A customer said our AI gave them wrong data yesterday"

Actions:

Ask for the customer's email or user ID
Search for their sessions from yesterday
Deep-dive into the relevant session(s)
Read the conversation to find what data was wrong
Trace the spans to see what tools provided the data
Present findings with the specific conversation excerpt showing the error

Troubleshooting

Session ID not found

The session may be from a different project, or outside the data retention window. Ask the user to confirm the project and check if the session ID is correct.

Spans not available for a session

Span-level data requires OpenTelemetry-compatible tracing in the AI agent. Report what's available from the session and conversation level and note that span data would help narrow the root cause.

Too many failing sessions to investigate

Don't try to investigate more than 5 sessions in detail. Instead, use groupBy on query_agent_analytics_sessions to find the common pattern, then deep-dive into 2-3 representative examples.
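
The same idea can be sketched client-side once a list of failing sessions is in hand: group by the suspected common attribute and take a few sessions from the largest group. The session dict shape and the `pick_representatives` helper are hypothetical; the server-side groupBy described above remains the primary route.

```python
from collections import defaultdict
from typing import Any


def pick_representatives(
    sessions: list[dict[str, Any]], key: str = "tool", per_group: int = 3
) -> list[dict[str, Any]]:
    """Return up to per_group sessions from the largest group by key."""
    if not sessions:
        return []
    groups: dict[Any, list[dict[str, Any]]] = defaultdict(list)
    for session in sessions:
        groups[session.get(key)].append(session)
    largest = max(groups.values(), key=len)
    return largest[:per_group]


failing = [
    {"id": f"s{i}", "tool": "query_dataset" if i < 5 else "get_events"}
    for i in range(7)
]
picks = pick_representatives(failing)
```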

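The original text and its copyright belong to Anthropic and the respective plugin authors. The Japanese translation is a machine translation produced with the Claude API.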