🔍production-investigation

Plugin: honeycomb

Description

Structured workflows for investigating production issues in Honeycomb — the sequence of tool calls (context priming, broad query, BubbleUp, trace analysis, verification) and how to chain results between steps to reach root causes. Trigger phrases: "investigate production issue", "debug latency spike", "find root cause", "use BubbleUp", "analyze traces", "debug an outage", "why is my API slow", "errors are increasing", "health check", "SLO burning", or any request to investigate or debug production problems.

Use cases

✓ Investigating a production issue
✓ Debugging a latency spike
✓ Identifying a root cause
✓ Finding out why errors are increasing
✓ Tracking down why an API is slow

Honeycomb Production Investigation

Structured workflows for debugging production issues. The MCP tools document their own parameters — this skill focuses on the sequence of tool calls and how to interpret results to reach a root cause.

The Core Analysis Loop

This workflow implements the core analysis loop (Define → Visualize → Investigate → Evaluate) from the observability-fundamentals skill. If BubbleUp returns nothing useful, the issue is often an instrumentation gap — add the missing attributes (see the otel-instrumentation skill) and try again.

Investigation Workflow

Step 1: Orient

get_workspace_context → environments and datasets
get_slos → any SLOs in violation? (frames severity)
get_triggers → any alerts firing? (narrows scope)
find_queries → has anyone investigated this before?
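The orientation order above can be sketched as a small driver. This is only an illustration: `call_tool` is a hypothetical stand-in for whatever MCP client is in use, and the real tools may need arguments that this sketch does not pass. Only the tool names and their order come from the workflow.

```python
from typing import Any, Callable

# Tool names and order come from Step 1; the purposes paraphrase the text.
ORIENT_STEPS = [
    ("get_workspace_context", "environments and datasets"),
    ("get_slos", "SLOs in violation (frames severity)"),
    ("get_triggers", "alerts firing (narrows scope)"),
    ("find_queries", "prior investigations"),
]


def orient(call_tool: Callable[[str], Any]) -> dict[str, Any]:
    """Run the orientation calls in order and collect results by tool name.

    `call_tool` is a hypothetical callable supplied by the caller; it is
    not part of any Honeycomb API.
    """
    results: dict[str, Any] = {}
    for tool_name, _purpose in ORIENT_STEPS:
        results[tool_name] = call_tool(tool_name)
    return results
```

Keeping the order in data rather than in control flow makes it easy to log which orientation step produced which result.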

Step 2: Characterize the Problem

Run a broad query to see the shape of the issue:

Latency spike: P99(duration_ms), HEATMAP(duration_ms) grouped by service or route
Error surge: COUNT filtered on error=true, grouped by exception.message or service
Unknown: COUNT grouped by service.name to find which service has anomalous volume

Also call get_service_map — it shows P95 durations between services and can immediately reveal which dependency is slow.
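As a rough sketch, the three starting queries can be kept in a lookup keyed by symptom. The dictionary shape below is hypothetical and is not the schema of any Honeycomb MCP tool; the column names are the ones used in the text, and grouping the latency query by `service.name` rather than by route is an assumption.

```python
# Hypothetical query-spec shape, for illustration only.
BROAD_QUERIES = {
    "latency_spike": {
        "calculations": ["P99(duration_ms)", "HEATMAP(duration_ms)"],
        "filters": [],
        "breakdowns": ["service.name"],  # assumption: service, not route
    },
    "error_surge": {
        "calculations": ["COUNT"],
        "filters": ["error = true"],
        "breakdowns": ["exception.message"],
    },
    "unknown": {
        "calculations": ["COUNT"],
        "filters": [],
        "breakdowns": ["service.name"],
    },
}


def broad_query_for(symptom: str) -> dict:
    """Return the starting query for a symptom, falling back to 'unknown'."""
    return BROAD_QUERIES.get(symptom, BROAD_QUERIES["unknown"])
```

Falling back to the "unknown" query mirrors the text: when the symptom is unclear, start with volume per service.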

Step 3: BubbleUp to Find Differentiators

This is the highest-value step. Once you have a query showing the anomaly:

Run run_bubbleup on the query result, selecting the outlier region
BubbleUp compares outlier vs baseline distributions across all columns automatically
Look for fields where the distributions differ significantly

How to interpret BubbleUp results:

Categorical fields (dimensions): A value overrepresented in outliers points to a cause (e.g., deployment.version=v2.3.1 is 90% of slow requests but only 20% of baseline)
Numeric fields (measures): A shifted distribution shows correlated metrics (e.g., db.query_duration is much higher in outliers)
Typical root causes surfaced: deployment version, region, user cohort, specific endpoint, feature flag
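To make the "overrepresented value" idea concrete, here is a minimal sketch that compares how often each value of one field appears among outliers and in the baseline. It is not BubbleUp's actual algorithm, which this document does not describe; the event dictionaries and the `min_gap` threshold are assumptions chosen for illustration.

```python
from collections import Counter


def value_shares(events, field):
    """Fraction of events carrying each value of `field` (missing values skipped)."""
    counts = Counter(e[field] for e in events if field in e)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()} if total else {}


def overrepresented(outliers, baseline, field, min_gap=0.3):
    """Values whose share among outliers exceeds their baseline share by min_gap."""
    out_shares = value_shares(outliers, field)
    base_shares = value_shares(baseline, field)
    gaps = {
        value: share - base_shares.get(value, 0.0)
        for value, share in out_shares.items()
    }
    return sorted(
        ((value, round(gap, 2)) for value, gap in gaps.items() if gap >= min_gap),
        key=lambda pair: pair[1],
        reverse=True,
    )
```

With the example from the text (a version that is 90% of slow requests but 20% of baseline), the gap for that version is 0.7, well above the threshold.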

Step 4: Drill Into Traces

After BubbleUp identifies suspects:

Add BubbleUp findings as WHERE filters to narrow results
Pick a representative trace ID
Call get_trace to fetch the full trace

What to look for in the trace waterfall:

Spans with disproportionate duration vs parent (the bottleneck)
Sequential spans that could be parallelized (N+1 query patterns)
Error spans — check span events for stack traces
Gaps between child spans (missing instrumentation or idle wait)
Service boundaries (where the trace crosses services)
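Several of the waterfall checks above can be approximated mechanically once a trace is in hand. The sketch below assumes each span is a plain dictionary with `id`, `parent_id`, `name`, `start_ms`, and `duration_ms`; that shape is hypothetical and is not the output format of get_trace. The thresholds are arbitrary defaults.

```python
from collections import defaultdict


def children_by_parent(spans):
    """Index spans by parent_id, each child list sorted by start time."""
    index = defaultdict(list)
    for span in spans:
        index[span["parent_id"]].append(span)
    for kids in index.values():
        kids.sort(key=lambda s: s["start_ms"])
    return index


def bottlenecks(spans, min_share=0.5):
    """Child spans that take at least min_share of their parent's duration."""
    by_id = {s["id"]: s for s in spans}
    hits = []
    for span in spans:
        parent = by_id.get(span["parent_id"])
        if parent and parent["duration_ms"] > 0:
            share = span["duration_ms"] / parent["duration_ms"]
            if share >= min_share:
                hits.append((span["name"], round(share, 2)))
    return hits


def gaps_between_children(spans, min_gap_ms=50):
    """Idle stretches between consecutive siblings, as (parent_id, gap_ms)."""
    found = []
    for parent_id, kids in children_by_parent(spans).items():
        for prev, nxt in zip(kids, kids[1:]):
            gap = nxt["start_ms"] - (prev["start_ms"] + prev["duration_ms"])
            if gap >= min_gap_ms:
                found.append((parent_id, gap))
    return found


def n_plus_one_candidates(spans, min_repeats=3):
    """Same-named siblings that run back to back without overlapping."""
    found = []
    for parent_id, kids in children_by_parent(spans).items():
        by_name = defaultdict(list)
        for kid in kids:
            by_name[kid["name"]].append(kid)
        for name, group in by_name.items():
            sequential = all(
                a["start_ms"] + a["duration_ms"] <= b["start_ms"]
                for a, b in zip(group, group[1:])
            )
            if len(group) >= min_repeats and sequential:
                found.append((parent_id, name, len(group)))
    return found
```

These helpers only flag candidates; whether a gap is missing instrumentation or idle wait still needs a human look at the trace.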

Step 5: Verify Hypothesis

Form a hypothesis from BubbleUp + trace analysis, then confirm:

Query WITH the suspected cause filtered in
Query WITHOUT it (as a control)
If the metrics diverge between the two queries, you've found the cause
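The with/without comparison can be sketched offline on a list of events. This assumes events are dictionaries with a `duration_ms` field and uses a nearest-rank P99; Honeycomb's own percentile calculation may differ, and in practice the two queries would be run in Honeycomb rather than in Python.

```python
def p99(values):
    """Nearest-rank 99th percentile; returns None for an empty list."""
    if not values:
        return None
    ordered = sorted(values)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n) in integer math
    return ordered[rank - 1]


def compare_with_without(events, field, suspect, metric="duration_ms"):
    """P99 of `metric` for events matching the suspect value vs. all others."""
    with_cause = [e[metric] for e in events if e.get(field) == suspect]
    without_cause = [e[metric] for e in events if e.get(field) != suspect]
    return {"with": p99(with_cause), "without": p99(without_cause)}
```

The "without" group is the control: if its P99 looks the same as the "with" group, the suspected cause does not explain the anomaly.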

Step 6: Record Findings

Call create_board with:

A text panel summarizing the root cause (Markdown)
The PKs (run IDs) of the key queries that identified the problem
Related SLOs if applicable
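A minimal sketch of assembling the Markdown for the text panel follows. The section layout and the function name are illustrative assumptions; how the resulting string is passed to create_board depends on that tool's own parameters, which are not covered here.

```python
def root_cause_panel(summary, evidence, query_run_pks, slos=()):
    """Build Markdown for a findings panel; the layout is illustrative only."""
    lines = ["## Root cause", "", summary, "", "## Evidence"]
    lines += [f"- {item}" for item in evidence]
    lines += ["", "## Query runs"]
    lines += [f"- `{pk}`" for pk in query_run_pks]
    if slos:
        lines += ["", "## Related SLOs"]
        lines += [f"- {name}" for name in slos]
    return "\n".join(lines)
```

The SLO section is emitted only when SLOs are supplied, matching the "if applicable" wording above.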

Investigation Patterns

Latency Spike

HEATMAP first → BubbleUp the slow region → trace a slow request → verify with filtered queries

Error Surge

COUNT errors grouped by exception.message → BubbleUp the error spike → trace an errored request → verify
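For intuition, the first step of this pattern (COUNT of errors grouped by exception.message) amounts to the following, shown here on in-memory events. The event shape and the `(missing)` placeholder are assumptions; in practice the grouping is done by the Honeycomb query itself.

```python
from collections import Counter


def error_counts(events, group_by="exception.message"):
    """Count events flagged error=True per message, most frequent first."""
    counts = Counter(
        e.get(group_by, "(missing)") for e in events if e.get("error") is True
    )
    return counts.most_common()
```

The message at the top of the list is usually the spike to select for BubbleUp.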

Deployment Regression

P99 grouped by deployment.version → BubbleUp comparing new vs old → trace from new version → verify

Dependency Failure

get_service_map → P99 on the slow dependency → relational query (any.service.name) to measure user impact → trace an affected request

Stay on the Path

If you catch yourself reasoning along any of these lines, follow the workflow anyway:

"The cause is obvious, I can skip BubbleUp" — BubbleUp routinely surfaces causes that seem obvious in hindsight but weren't the first guess. It also catches secondary causes you'd miss entirely.
"I already know it's a deployment issue" — verify with Step 5. Confirmation bias is strongest during incidents. Query with and without the suspected cause.
"Traces confirmed it, no need to verify" — a single trace is an anecdote. The verification query proves the pattern holds across all traffic, not just one request.
"This is a simple issue, the full workflow is overkill" — the workflow takes minutes; a wrong diagnosis during an incident costs hours.

When Results Are Empty or Unclear

No results: Check field names with find_columns, expand time range, verify environment/dataset
BubbleUp shows no signal: Try a different time selection, add filters to isolate the anomaly more clearly, or select a different calculation
Trace is missing spans: Likely causes are sampling, instrumentation gaps, or a trace split across environments
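One quick way to spot missing spans is to look for spans whose parent is absent from the fetched trace. The sketch assumes spans are dictionaries with `id`, `parent_id`, and `name`, which is a hypothetical shape rather than the get_trace output format. It cannot tell which of the three causes applies, only that a parent is missing.

```python
def orphaned_spans(spans):
    """Names of spans whose parent_id points at a span absent from the trace."""
    known_ids = {s["id"] for s in spans}
    return [
        s["name"]
        for s in spans
        if s["parent_id"] is not None and s["parent_id"] not in known_ids
    ]
```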

Additional Resources

Reference Files

${CLAUDE_PLUGIN_ROOT}/skills/production-investigation/references/investigation-playbooks.md — Step-by-step playbooks for latency spikes, error surges, deployment regressions, dependency failures, SLO budget burn, and health checks
${CLAUDE_PLUGIN_ROOT}/skills/production-investigation/references/bubbleup-guide.md — Detailed BubbleUp usage: selection types, time specifications, pagination, result interpretation
${CLAUDE_PLUGIN_ROOT}/skills/production-investigation/references/trace-exploration.md — Trace structure, get_trace parameters and view modes, waterfall analysis, span events and links

Cross-References

For the conceptual foundations of the core analysis loop, see the observability-fundamentals skill
For query construction patterns, see the query-patterns skill
For SLO/trigger context during investigations, see the slos-and-triggers skill

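The original text and its copyright belong to Anthropic and the respective plugin authors. The Japanese translation was produced automatically with the Claude API.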