Skill · Official · monitoring

🔍 diagnose-errors

Plugin: amplitude

Description

Investigates errors across network failures, JavaScript errors, and error clicks to identify what's broken, where, and why. Use when the user says "what's broken", "errors are up", "why are users seeing errors", "JS errors", "network failures", "5xx spike", "something is broken", or wants to triage product reliability issues.

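Use cases

✓ When errors are increasing
✓ When investigating network failures
✓ When identifying the cause of JavaScript errors
✓ When triaging product reliability issues
✓ When you want to understand why users are seeing errors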

Error Diagnosis & Triage

Investigate product errors by triaging across three auto-captured event types — [Amplitude] Network Request, [Amplitude] Error Logged, and [Amplitude] Error Click — to identify what's broken, which users are affected, and what's causing it. This skill cross-references all three signals to surface causal chains (failed request → JS error → user frustration) rather than treating each in isolation.

This is a reactive investigation skill — the user has a signal (spike, complaint, experiment regression, gut feeling) and wants to understand what's happening. For proactive monitoring, use the monitor-reliability skill instead.

CRITICAL: Event Reference

These are the three auto-captured events this skill operates on. Never guess property names — use exactly these.

[Amplitude] Network Request — Browser network requests. Key properties: [Amplitude] URL, [Amplitude] Status Code, [Amplitude] Duration, [Amplitude] Request Method, [Amplitude] Request Type, [Amplitude] Request Body Size, [Amplitude] Response Body Size, [Amplitude] Start Time, [Amplitude] Completion Time, [Amplitude] Page Path.

[Amplitude] Error Logged — JavaScript errors. Key properties: Error Message, Error Type, Error URL, File Name, Error Lineno, Error Colno, Error Stack Trace, [Amplitude] Error Detection Source.

[Amplitude] Error Click — Clicks on error-associated UI elements. Key properties: [Amplitude] Message, [Amplitude] Kind, [Amplitude] Filename, [Amplitude] Line Number, [Amplitude] Column Number, [Amplitude] Element Text, [Amplitude] Element Tag, [Amplitude] Element Hierarchy.

All three share: [Amplitude] Page Path, [Amplitude] Page URL, [Amplitude] Session Replay ID.

Instructions

Step 1: Context & Scope

Call Amplitude:get_context. If there are multiple projects, ask which one to investigate.
Determine the investigation scope from the user's request:
- Broad triage: "What's broken?" → scan all three event types for the biggest problems
- Targeted: "Network errors are up" → start with [Amplitude] Network Request, then check if they cascade into JS errors
- Specific error: "Users are seeing TypeError" → start with [Amplitude] Error Logged, filtered to that error
Determine the time window. Default to the last 7 days with daily granularity unless the user specifies otherwise. If they mention a deploy or date, anchor to that.

Step 2: Quantify the Error Landscape

Run these in parallel where possible. Budget: 4-6 calls for this step.

2a. Network Failures

Use Amplitude:query_dataset to query [Amplitude] Network Request:

Failure rate trend. Filter [Amplitude] Status Code to 4xx and 5xx ranges. Measure daily event counts and unique users. Compare to total network request volume for a failure rate percentage.
Top failing endpoints. Group by [Amplitude] URL to rank which APIs fail most. Include [Amplitude] Status Code as a secondary grouping to distinguish 401s (auth) from 500s (server errors) from 404s (missing).
Slow endpoints (if relevant). If the user mentions performance or slowness, measure [Amplitude] Duration by [Amplitude] URL. Flag P95 > 3s or mean > 1s.
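
The failure-rate and slow-endpoint checks above come down to simple arithmetic once the query results are in hand. The following is a minimal sketch, not part of the skill itself: the function names are hypothetical, the inputs are plain lists standing in for rows returned by Amplitude:query_dataset, and it assumes [Amplitude] Duration is reported in milliseconds, which should be verified against the project's data.

```python
from __future__ import annotations

from statistics import mean, quantiles


def is_failure(status_code: int) -> bool:
    """True for responses in the 4xx and 5xx ranges."""
    return 400 <= status_code <= 599


def failure_rate(status_codes: list[int]) -> float:
    """Failed requests as a percentage of all requests (0.0 for no data)."""
    if not status_codes:
        return 0.0
    failed = sum(1 for code in status_codes if is_failure(code))
    return 100.0 * failed / len(status_codes)


def is_slow(durations_ms: list[float]) -> bool:
    """Flag an endpoint when P95 > 3 s or mean > 1 s.

    Assumption: durations are in milliseconds.
    """
    if not durations_ms:
        return False
    if len(durations_ms) == 1:
        # With one sample, P95 and mean are the same value,
        # so the 1 s mean threshold decides.
        return durations_ms[0] > 1000
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    p95 = quantiles(durations_ms, n=100, method="inclusive")[94]
    return p95 > 3000 or mean(durations_ms) > 1000
```

For instance, `failure_rate([200, 200, 500, 404])` gives 50.0, because two of the four requests fall in the 4xx/5xx ranges.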

2b. JavaScript Errors

Use Amplitude:query_dataset to query [Amplitude] Error Logged:

Error volume trend. Daily error count and unique users affected over the time window. Flag day-over-day spikes >25%.
Top errors. Group by Error Message to find the highest-volume errors. Include Error Type and File Name for context.
New vs. chronic. Compare errors in the recent window to the prior period. Errors that appear only in the recent window are likely regressions. Errors present in both are chronic tech debt.
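
As an illustration of the spike and new-vs-chronic checks, here is a small sketch under stated assumptions: daily counts arrive as an ordered list, error messages as sets of strings, and both helper names are hypothetical. Treating a jump from zero to any positive count as a spike is a choice made here, not something the skill specifies.

```python
from __future__ import annotations


def flag_spikes(daily_counts: list[int], threshold: float = 0.25) -> list[int]:
    """Return indexes of days whose count rose more than `threshold`
    (25% by default) over the previous day."""
    spikes = []
    for i in range(1, len(daily_counts)):
        prev, cur = daily_counts[i - 1], daily_counts[i]
        if prev == 0:
            # Assumption: any errors after an error-free day count as a spike.
            if cur > 0:
                spikes.append(i)
        elif (cur - prev) / prev > threshold:
            spikes.append(i)
    return spikes


def split_new_vs_chronic(
    recent: set[str], prior: set[str]
) -> tuple[set[str], set[str]]:
    """Split recent error messages into (new, chronic).

    New: seen only in the recent window (likely regressions).
    Chronic: seen in both windows (tech debt).
    """
    return recent - prior, recent & prior
```

With `daily_counts = [100, 120, 160, 150]`, only index 2 is flagged: 100 to 120 is a 20% rise, while 120 to 160 is about 33%.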

2c. Error Clicks (Frustration Signal)

Use Amplitude:query_dataset to query [Amplitude] Error Click:

Volume trend. Daily error click count. Spikes indicate users are actively encountering and engaging with error states.
What users are clicking. Group by [Amplitude] Element Text or [Amplitude] Message to see which error UI elements get the most interaction.

Step 3: Cross-Event Correlation

This is where the skill adds value beyond looking at each event in isolation.

Failed request → JS error chain. Compare the timing and pages of network failures (Step 2a) with JS errors (Step 2b). If the same pages have both 5xx network failures AND JS errors, the network failure is likely the root cause. Use [Amplitude] Page Path as the join dimension.
Error → frustration chain. Compare JS error pages with error click pages. High error click volume on pages with high JS error rates confirms users are seeing and interacting with the broken experience.
Page-level triage. Use Amplitude:query_dataset to group all three events by [Amplitude] Page Path. Produce a page-level error heatmap:
- Pages with network failures + JS errors + error clicks = critical (full causal chain)
- Pages with JS errors + error clicks but no network failures = frontend bug
- Pages with network failures but no JS errors = backend issue, gracefully handled
- Pages with JS errors but no error clicks = silent errors (may not affect UX)
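
The four heatmap rules can be expressed as a small classifier. This is a sketch only: the per-page counts are assumed to have been collected already (one dictionary per event type, keyed by [Amplitude] Page Path), the function names are hypothetical, and the "unclassified" and "no errors observed" labels are additions for combinations the rules above do not cover.

```python
from __future__ import annotations


def classify_page(network_failures: int, js_errors: int, error_clicks: int) -> str:
    """Apply the page-level triage rules in order of severity."""
    has_net = network_failures > 0
    has_js = js_errors > 0
    has_clicks = error_clicks > 0

    if has_net and has_js and has_clicks:
        return "critical (full causal chain)"
    if has_js and has_clicks:
        return "frontend bug"
    if has_net and not has_js:
        return "backend issue, gracefully handled"
    if has_js and not has_clicks:
        return "silent errors (may not affect UX)"
    if not (has_net or has_js or has_clicks):
        return "no errors observed"
    return "unclassified"  # e.g. error clicks with no other signal


def build_heatmap(
    net: dict[str, int], js: dict[str, int], clicks: dict[str, int]
) -> dict[str, str]:
    """Classify every page that appears in any of the three signals."""
    pages = sorted(set(net) | set(js) | set(clicks))
    return {
        page: classify_page(net.get(page, 0), js.get(page, 0), clicks.get(page, 0))
        for page in pages
    }
```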

Step 4: Identify Affected Users & Segments

For the top 2-3 error patterns from Step 3:

User scope. Use Amplitude:query_dataset to count unique users affected. Compare to total active users for an impact percentage.
Segment breakdown. Group by available user properties (platform, browser, country, plan tier, org) to determine if errors concentrate in a specific segment. Call Amplitude:get_event_properties if you need to discover available properties.
Session Replays. For the most impactful error pattern, call Amplitude:get_session_replays filtered to sessions containing the error event. Provide 2-3 replay links so the user can see exactly what happened.
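
One way to judge whether errors "concentrate" in a segment is to compare each segment's share of affected users with its share of all active users. The sketch below is an illustrative approach, not something the skill prescribes: the function names are hypothetical and the inputs are plain user counts per segment.

```python
from __future__ import annotations


def impact_percentage(affected_users: int, active_users: int) -> float:
    """Affected users as a percentage of total active users."""
    if active_users <= 0:
        return 0.0
    return 100.0 * affected_users / active_users


def segment_lift(affected: dict[str, int], active: dict[str, int]) -> dict[str, float]:
    """Ratio of a segment's share among affected users to its share among
    all active users. Values well above 1.0 suggest concentration."""
    total_affected = sum(affected.values())
    total_active = sum(active.values())
    lift = {}
    for segment, active_count in active.items():
        if active_count == 0 or total_affected == 0:
            continue  # no baseline or no affected users to compare
        share_affected = affected.get(segment, 0) / total_affected
        share_active = active_count / total_active
        lift[segment] = share_affected / share_active
    return lift
```

For example, if Chrome accounts for 90% of affected users but only 50% of active users, its lift is 1.8, which points to a browser-specific problem.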

Step 5: Root Cause Hypothesis

Build a root cause hypothesis using evidence from the prior steps:

Deployment correlation. Call Amplitude:get_deployments once. Check if error spikes align with recent deploys. If a deployment shipped within 24 hours of the error spike, it's the leading hypothesis.
Experiment correlation. If the user mentions an experiment or if errors concentrate in a segment that maps to an experiment variant, call Amplitude:get_experiments and Amplitude:query_experiment to check.
Temporal pattern. Is the error constant, intermittent, or growing? Constant suggests a code bug. Intermittent suggests infrastructure. Growing suggests a progressive failure (memory leak, queue backlog).
Feedback correlation. Call Amplitude:get_feedback_sources then Amplitude:get_feedback_insights with keywords from the top error messages. If users are reporting the same issue, it validates the impact and may provide additional context the data can't.
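
The deployment check can be sketched as a timestamp comparison. This example assumes the spike time and deploy times are already available as `datetime` values (the shape of the Amplitude:get_deployments response is not shown here), and it interprets "within 24 hours" as the 24 hours before the spike, since a deploy after the spike cannot have caused it. The function name is hypothetical.

```python
from __future__ import annotations

from datetime import datetime, timedelta


def deploys_before_spike(
    spike_at: datetime,
    deploy_times: list[datetime],
    window: timedelta = timedelta(hours=24),
) -> list[datetime]:
    """Return deploys that shipped in the `window` leading up to the spike,
    oldest first."""
    return sorted(
        deployed_at
        for deployed_at in deploy_times
        if timedelta(0) <= spike_at - deployed_at <= window
    )
```

With daily granularity the spike timestamp is only accurate to the day, so a wider window may be appropriate in practice.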

Step 6: Present the Diagnosis

Structure the output as a triage report. Lead with what's most broken and actionable.

Required sections:

Diagnosis summary (2-3 sentences): The single most important finding. Written as a headline you'd send to the engineering lead. Include scope: how many users, which pages, since when.
Error landscape — A table summarizing the state across all three signal types:

| Signal | Volume (7d) | Trend | Top Source | Severity |
|--------|-------------|-------|------------|----------|
| Network failures (4xx/5xx) | [N] requests | [↑/↓/→] | [endpoint] | [Critical/High/Medium/Low] |
| JS errors | [N] errors, [N] users | [↑/↓/→] | [Error Message] | ... |
| Error clicks | [N] clicks | [↑/↓/→] | [Element Text] | ... |

Top errors (3-5 max): Each as a narrative paragraph:
- [Error headline — ≤10 words] — What's happening (the error), where (page/endpoint), who's affected (user count/segment), since when (deployment or date), and what to do (specific fix action). Include chart links and replay links inline.
Causal chains (if found): Describe the cross-event chain, for example: "POST to /api/query is returning 500 → this triggers an unhandled TypeError in ChartRenderer.tsx:142 → users see and click the error state. ~1,200 users affected in the last 7 days."
Recommended actions (2-4 numbered items): Concrete, copy-paste-ready. Start each with a verb. Bias toward fixing, investigating further with a specific breakdown, or setting up monitoring.
Follow-on prompt: Ask what to dig into next — e.g., "Want me to segment the API failures by org tier, watch a few session replays, or build a monitoring dashboard for these errors?"
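
To show how the error landscape table might be filled in, here is a minimal sketch. The helper names are hypothetical, and the 5% tolerance for calling a trend flat is an assumption; the skill does not define one.

```python
def trend_arrow(current: int, previous: int, tolerance: float = 0.05) -> str:
    """Map a period-over-period change to ↑, ↓, or → (flat within tolerance)."""
    if previous == 0:
        return "↑" if current > 0 else "→"
    change = (current - previous) / previous
    if change > tolerance:
        return "↑"
    if change < -tolerance:
        return "↓"
    return "→"


def landscape_row(
    signal: str, volume: str, trend: str, top_source: str, severity: str
) -> str:
    """Format one row of the error landscape table."""
    return f"| {signal} | {volume} | {trend} | {top_source} | {severity} |"
```

A row for JS errors could then be built with `landscape_row("JS errors", "340 errors, 95 users", trend_arrow(340, 200), "TypeError", "High")`, where all values are made up for illustration.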

Severity classification:

| Severity | Criteria |
|----------|----------|
| Critical | >5% of users affected, full causal chain (network → error → frustration), or blocking a core flow |
| High | 1-5% of users, JS errors on key pages, or a clear regression from a deploy |
| Medium | <1% of users, chronic errors, or errors on non-critical pages |
| Low | Silent errors with no user-facing impact, or errors isolated to a single edge-case segment |
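
The severity criteria overlap (a silent error affecting under 1% of users matches both Medium and Low), so any implementation has to pick an evaluation order. The sketch below checks Critical first, then High, then Low, and falls back to Medium. That ordering, the `Finding` fields, and the function name are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    """Hypothetical summary of one error pattern."""
    pct_users_affected: float
    full_causal_chain: bool = False
    blocks_core_flow: bool = False
    js_errors_on_key_page: bool = False
    deploy_regression: bool = False
    silent: bool = False
    single_edge_segment: bool = False


def classify_severity(f: Finding) -> str:
    """Return Critical, High, Medium, or Low for a finding."""
    if f.pct_users_affected > 5 or f.full_causal_chain or f.blocks_core_flow:
        return "Critical"
    if f.pct_users_affected >= 1 or f.js_errors_on_key_page or f.deploy_regression:
        return "High"
    # Assumed tie-break: silent or edge-case findings drop to Low
    # even though "<1% of users" alone would be Medium.
    if f.silent or f.single_edge_segment:
        return "Low"
    return "Medium"
```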

Edge Cases

No auto-captured error events. The project may not have Session Replay or autocapture enabled. Report this clearly: "This project doesn't appear to have [Amplitude] Network Request, [Amplitude] Error Logged, or [Amplitude] Error Click events. These require Session Replay or the autocapture plugin to be enabled." Suggest the user check their SDK configuration.
Very high error volume. If >100K errors in the window, focus on unique error messages and affected user counts, not raw event counts. Group aggressively.
All errors are chronic. If nothing is new, frame findings as tech debt priorities rather than regressions. Measure the error-free session rate to establish a baseline.
Error data is sparse. If only one of the three events has data, work with what's available. Note which signals are missing and what they would add.
User asks about a specific error message. Skip the broad landscape scan (Step 2) and go directly to filtering [Amplitude] Error Logged by Error Message. Then check for correlated network failures and error clicks.
User asks about a specific user or org. Scope all queries to that user/org. Provide a session-level timeline of errors rather than aggregate trends. Prioritize Session Replay links.
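
For the chronic-errors case, the error-free session rate can be computed from two sets of session identifiers. This is a sketch with a hypothetical function name; how the session IDs are obtained from the query results is left out.

```python
from __future__ import annotations

from typing import Optional


def error_free_session_rate(
    all_sessions: set[str], sessions_with_errors: set[str]
) -> Optional[float]:
    """Percentage of sessions with no error events, or None when there
    are no sessions to measure."""
    if not all_sessions:
        return None
    # Intersect so stray error sessions outside the window are ignored.
    errored = all_sessions & sessions_with_errors
    return 100.0 * (len(all_sessions) - len(errored)) / len(all_sessions)
```

Tracking this rate week over week gives a baseline against which tech debt fixes can be measured.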

Examples

Example 1: Broad Error Triage

User says: "What's broken right now?"

Actions:

Get context and project
Query all three error events for the last 7 days — volume, trend, top sources
Cross-reference by page to find causal chains
Check deployments for correlation
Surface the 3-5 biggest issues ranked by user impact
Provide replay links for the worst pattern

Example 2: Regression Investigation

User says: "Errors seem up since yesterday's deploy"

Actions:

Get context and check Amplitude:get_deployments for what shipped
Query [Amplitude] Error Logged comparing pre-deploy (7d before) vs post-deploy (last 24h)
Identify new error messages that didn't exist before the deploy
Check if new errors correlate with failing network requests
Segment by page and feature to isolate the blast radius
Present findings anchored to the specific deployment
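
Because this example compares a 7-day pre-deploy window with a 24-hour post-deploy window, raw counts are not directly comparable. One reasonable approach, sketched here with hypothetical helper names, is to normalize both windows to a per-day rate before comparing them.

```python
from __future__ import annotations

from typing import Optional


def per_day_rate(event_count: int, window_hours: float) -> float:
    """Normalize an event count to events per 24 hours."""
    return event_count / (window_hours / 24.0)


def relative_change(
    pre_count: int, pre_hours: float, post_count: int, post_hours: float
) -> Optional[float]:
    """Relative change in the daily error rate after the deploy.

    Returns None when the pre-deploy rate is zero, in which case any
    post-deploy errors are new rather than increased."""
    pre_rate = per_day_rate(pre_count, pre_hours)
    if pre_rate == 0:
        return None
    post_rate = per_day_rate(post_count, post_hours)
    return (post_rate - pre_rate) / pre_rate
```

For example, 700 errors over the 168 hours before the deploy is 100 per day; 180 errors in the 24 hours after is 180 per day, a relative increase of 0.8 (80%).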

Example 3: Specific Error Deep-Dive

User says: "We're seeing a lot of TypeErrors in the chart builder"

Actions:

Filter [Amplitude] Error Logged to Error Type = TypeError and [Amplitude] Page Path containing the chart builder
Group by Error Message and File Name to find the specific errors
Check [Amplitude] Network Request on the same pages for failing API calls
Pull session replays of users who hit the TypeError
Present the error with reproduction steps derived from replay patterns

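The original text and its copyright belong to Anthropic and the respective plugin authors. The Japanese translation on the source page was machine-generated with the Claude API.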