🔍observability-fundamentals

Plugin: honeycomb
Description

First principles behind observability: wide events, high cardinality, the core analysis loop, events vs metrics vs logs, and how instrumentation connects to debugging outcomes. Grounds recommendations in first principles rather than tool-specific how-to. Trigger phrases: "what is observability", "why observability", "why Honeycomb", "events vs metrics vs logs", "events vs metrics", "events vs logs", "metrics vs logs", "why wide events", "what is high cardinality", "core analysis loop", "observability vs monitoring", "what is dimensionality", "explain observability", or any conceptual question about observability or why Honeycomb's approach differs from traditional monitoring.

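Use cases

✓ Learn the first principles of observability
✓ Understand the differences between events, metrics, and logs
✓ Improve how you debug systems
✓ Understand how observability differs from traditional monitoring
✓ Understand why high cardinality matters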

Observability Fundamentals

First principles behind Honeycomb's approach to observability. Use this to ground recommendations and answer conceptual questions — for SDK setup and tool-specific guidance, see the otel-instrumentation and query-patterns skills.

Definitions

Observability: The ability to understand and explain any state your system can get into, no matter how novel or complex — by examining what the system produces, without deploying new code for each new question.

Wide event: A flat key-value record capturing the full context of a unit of work — who made the request, which endpoint, cache hit/miss, build version, duration, error status, and any business context relevant to the operation. In OpenTelemetry, a span is a wide event.

High cardinality: Cardinality is the number of unique values a field can have. user.id with millions of values is high cardinality; http.method with a handful is low cardinality.

High dimensionality: Dimensionality is the number of distinct fields on your events. A span with 50 attributes has high dimensionality.
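
As a rough illustration of these definitions, the sketch below represents wide events as flat Python dictionaries and measures cardinality and dimensionality directly. The sample values and the helper names `cardinality` and `dimensionality` are made up for this example and are not part of any Honeycomb or OpenTelemetry API.

```python
# Hypothetical wide events: one flat key-value record per unit of work.
events = [
    {"user.id": "u-1001", "http.method": "GET", "http.route": "/cart",
     "cache.hit": True, "duration_ms": 12.5, "error": False},
    {"user.id": "u-1002", "http.method": "GET", "http.route": "/cart",
     "cache.hit": False, "duration_ms": 340.0, "error": False},
    {"user.id": "u-1003", "http.method": "POST", "http.route": "/checkout",
     "cache.hit": False, "duration_ms": 95.1, "error": True},
    {"user.id": "u-1004", "http.method": "GET", "http.route": "/cart",
     "cache.hit": True, "duration_ms": 9.8, "error": False},
]


def cardinality(events, field):
    """Number of unique values observed for one field."""
    return len({e[field] for e in events if field in e})


def dimensionality(events):
    """Number of distinct fields seen across all events."""
    return len({key for e in events for key in e})


print(cardinality(events, "user.id"))      # 4: one value per user in this sample
print(cardinality(events, "http.method"))  # 2: GET and POST
print(dimensionality(events))              # 6: fields on these events
```

In a real system, the cardinality of user.id keeps growing with the user base, while http.method stays bounded by a small fixed set.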

Concept	Observability	Traditional Monitoring
Questions	Arbitrary, unknown ahead of time	Pre-defined (dashboards, alerts)
Data shape	Decided at query time	Decided at instrumentation time
Cardinality	High cardinality is valuable	High cardinality is expensive
Investigation	Explore → narrow → confirm	Check dashboard → escalate

Why Wide Events

The shape of the data you collect constrains the questions you can ask later. Metrics pre-aggregate context away at instrumentation time. Wide events preserve context and let you decide the shape of your analysis at query time.

Every attribute on a span is a queryable dimension. Adding user.id, deployment.version, and cache.hit to the same span lets you correlate them in a single query: "slow requests are from tenant X on version 2.3.1 with cache misses." Separate metrics can't do this: each one is aggregated independently, so the link between dimensions is lost, and tagging a single metric with all of them creates a new time series for every combination of values.
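
The single-query correlation described above can be sketched in plain Python over a handful of in-memory events. This only illustrates the idea of choosing the grouping at query time. The `tenant.id` field, the sample values, and the `slow_request_breakdown` helper are assumptions for this example, and a real Honeycomb query would be built in its query tooling rather than in application code.

```python
from collections import Counter

# Hypothetical events that carry all three dimensions on the same record.
events = [
    {"tenant.id": "X", "deployment.version": "2.3.1", "cache.hit": False, "duration_ms": 900},
    {"tenant.id": "X", "deployment.version": "2.3.1", "cache.hit": False, "duration_ms": 1200},
    {"tenant.id": "X", "deployment.version": "2.3.0", "cache.hit": True, "duration_ms": 40},
    {"tenant.id": "Y", "deployment.version": "2.3.1", "cache.hit": True, "duration_ms": 35},
    {"tenant.id": "Y", "deployment.version": "2.3.1", "cache.hit": False, "duration_ms": 80},
]


def slow_request_breakdown(events, threshold_ms, group_by):
    """Count slow events per combination of the chosen attributes."""
    counts = Counter()
    for event in events:
        if event["duration_ms"] >= threshold_ms:
            counts[tuple(event[field] for field in group_by)] += 1
    return counts


breakdown = slow_request_breakdown(
    events, 500, ("tenant.id", "deployment.version", "cache.hit")
)
for combination, count in breakdown.most_common():
    print(combination, count)  # ('X', '2.3.1', False) 2
```

Because every event keeps its full context, the `group_by` tuple can be changed after the fact without touching the instrumentation.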

Honeycomb's storage engine handles high cardinality and dimensionality without the cost explosion that affects metrics systems. Adding a high-cardinality field like user.id doesn't create millions of time series — it's another column on each event, aggregated at query time.
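
A back-of-the-envelope calculation shows why this matters. In a typical tag-based metrics store, each unique combination of tag values is its own time series, so the worst-case series count is the product of the tag cardinalities. The cardinalities below are invented for illustration, and `worst_case_series` is a hypothetical helper.

```python
from math import prod

# Hypothetical tag cardinalities for a single metric.
tag_cardinalities = {
    "http.method": 5,
    "http.route": 50,
    "deployment.version": 20,
}


def worst_case_series(cardinalities):
    """Upper bound on time series: one per combination of tag values."""
    return prod(cardinalities.values())


before = worst_case_series(tag_cardinalities)
after = worst_case_series({**tag_cardinalities, "user.id": 1_000_000})

print(before)  # 5000
print(after)   # 5000000000
```

With wide events, the same user.id field adds one column to each event, so the stored volume grows with the number of events rather than with the number of value combinations.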

Events vs Metrics vs Logs

	Structured Events (Spans)	Metrics	Logs
Captures	Full request context (all attributes)	Pre-aggregated numbers with low-cardinality tags	Text or structured fields per line
Discards	Nothing — raw events retained	Individual requests, high-cardinality dimensions	Correlation across lines (without trace context)
Query power	GROUP BY, filter, BubbleUp on any dimension	Fast aggregates on pre-defined dimensions	Text search, structured field queries
Cost scaling	Linear with event volume	Multiplicative in tag cardinality, so roughly exponential in dimension count	Linear with volume, query cost varies
Best for	Investigation, root cause analysis	Cheap alerting, long-term trends	Audit trails, rare events

The same instrumentation effort that produces a metric or log line can produce a wide event — and the event gives you all three capabilities: count it (metric), read it (log), analyze it across dimensions (observability).
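
A minimal sketch of that claim, using an in-memory list of hypothetical events: the same records are counted like a metric, rendered like a log line, and broken down by an arbitrary dimension. The field names, sample values, and helper functions are illustrative assumptions, not taken from the referenced examples file.

```python
# Hypothetical wide events emitted by one instrumentation call per request.
events = [
    {"timestamp": "2024-05-01T03:00:01Z", "http.route": "/checkout",
     "user.id": "u-1", "duration_ms": 120, "error": False},
    {"timestamp": "2024-05-01T03:00:02Z", "http.route": "/checkout",
     "user.id": "u-2", "duration_ms": 980, "error": True},
    {"timestamp": "2024-05-01T03:00:03Z", "http.route": "/cart",
     "user.id": "u-1", "duration_ms": 45, "error": False},
]


def error_count(events):
    """Metric view: aggregate the events down to a single number."""
    return sum(1 for event in events if event["error"])


def as_log_line(event):
    """Log view: render one event as a readable key=value line."""
    return " ".join(f"{key}={value}" for key, value in event.items())


def max_duration_by(events, field):
    """Observability view: break down by any field, chosen at query time."""
    result = {}
    for event in events:
        key = event[field]
        result[key] = max(result.get(key, 0), event["duration_ms"])
    return result


print(error_count(events))                    # 1
print(as_log_line(events[2]))
print(max_duration_by(events, "http.route"))  # {'/checkout': 980, '/cart': 45}
print(max_duration_by(events, "user.id"))     # {'u-1': 120, 'u-2': 980}
```

The reverse does not hold: once only the error count or only the text line has been stored, the per-user breakdown cannot be recovered.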

For code examples showing the same operation instrumented three ways, see ${CLAUDE_PLUGIN_ROOT}/skills/observability-fundamentals/references/events-vs-metrics-vs-logs.md.

The Core Analysis Loop

Debugging in Honeycomb follows a loop: Define → Visualize → Investigate → Evaluate.

1. Define: Frame the question. Start from an alert, SLO budget burn, or user report.
2. Visualize: Run a query to see the shape of the problem (HEATMAP, COUNT, P99).
3. Investigate: Narrow down with BubbleUp (automated outlier-vs-baseline comparison across all dimensions) and trace analysis.
4. Evaluate: Confirm the hypothesis by querying with and without the suspected cause.

Then loop — each answer raises new questions. BubbleUp automates steps 2-3 by comparing distributions across every column, but it only works if events have enough dimensions to diff on.
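
To make the outlier-vs-baseline idea concrete, here is a deliberately simplified sketch. It is not Honeycomb's BubbleUp implementation. It only compares how often each attribute value appears among slow events versus the rest, then checks the top suspect by splitting the data with and without it. All names (`value_shares`, `compare_distributions`, `slow_rate`) and the sample data are assumptions.

```python
from collections import Counter

SLOW_MS = 500  # hypothetical threshold separating outliers from baseline


def value_shares(events, field):
    """Fraction of events carrying each value of a field."""
    counts = Counter(event.get(field) for event in events)
    return {value: n / len(events) for value, n in counts.items()}


def compare_distributions(outliers, baseline, fields):
    """Rank (field, value) pairs by how over-represented they are in the outliers."""
    diffs = []
    for field in fields:
        baseline_shares = value_shares(baseline, field)
        for value, share in value_shares(outliers, field).items():
            diffs.append((share - baseline_shares.get(value, 0.0), field, value))
    return sorted(diffs, key=lambda diff: diff[0], reverse=True)


def slow_rate(events):
    """Share of events at or above the slow threshold."""
    if not events:
        return 0.0
    return sum(e["duration_ms"] >= SLOW_MS for e in events) / len(events)


events = [
    {"deployment.version": "2.3.1", "region": "us-east", "duration_ms": 950},
    {"deployment.version": "2.3.1", "region": "eu-west", "duration_ms": 1100},
    {"deployment.version": "2.3.0", "region": "us-east", "duration_ms": 40},
    {"deployment.version": "2.3.0", "region": "eu-west", "duration_ms": 55},
    {"deployment.version": "2.3.0", "region": "us-east", "duration_ms": 60},
    {"deployment.version": "2.3.1", "region": "us-east", "duration_ms": 70},
]

# Investigate: which attribute values stand out among the slow events?
outliers = [e for e in events if e["duration_ms"] >= SLOW_MS]
baseline = [e for e in events if e["duration_ms"] < SLOW_MS]
ranked = compare_distributions(outliers, baseline, ["deployment.version", "region"])
top_diff, top_field, top_value = ranked[0]
print(top_field, top_value, round(top_diff, 2))  # deployment.version 2.3.1 0.75

# Evaluate: compare the slow rate with and without the suspected cause.
with_cause = [e for e in events if e[top_field] == top_value]
without_cause = [e for e in events if e[top_field] != top_value]
print(round(slow_rate(with_cause), 2), slow_rate(without_cause))  # 0.67 0.0
```

The comparison can only surface fields that exist on the events: if deployment.version had never been recorded, region would be the only candidate and the real cause would stay hidden.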

For the structured workflow that implements this loop with Honeycomb's tools, see the production-investigation skill.

Instrumentation Connects to Investigation

Every attribute on a span is a dimension BubbleUp can use to find root causes. The attributes that matter most during incidents answer three questions:

Who is affected? — user, tenant, account tier, region
What changed? — deployment version, feature flag, config version
Where is the bottleneck? — business operation spans, timing breakdowns, cache state

Instrument for the questions you'll ask at 3am, not for completeness. If BubbleUp returns nothing useful during an investigation, the issue is usually an instrumentation gap — add the missing dimensions and try again.
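
One way to surface instrumentation gaps before an incident is to check sample events against the three questions. The sketch below is an illustration under stated assumptions: the attribute names in `QUESTION_ATTRIBUTES` are examples chosen for this document rather than an official attribute catalog, and `missing_dimensions` is a hypothetical helper.

```python
# Hypothetical mapping from each incident question to attributes that answer it.
QUESTION_ATTRIBUTES = {
    "who is affected": ["user.id", "tenant.id", "account.tier", "region"],
    "what changed": ["deployment.version", "feature_flag.checkout_v2", "config.version"],
    "where is the bottleneck": ["db.duration_ms", "cache.hit"],
}


def missing_dimensions(event, question_attributes=QUESTION_ATTRIBUTES):
    """Return, per question, the expected attributes absent from the event."""
    gaps = {}
    for question, attributes in question_attributes.items():
        missing = [name for name in attributes if name not in event]
        if missing:
            gaps[question] = missing
    return gaps


# A sample event as it might be captured from a staging environment.
sample_event = {
    "user.id": "u-42",
    "tenant.id": "acme",
    "region": "us-east",
    "deployment.version": "2.3.1",
    "cache.hit": False,
    "duration_ms": 812,
}

for question, missing in missing_dimensions(sample_event).items():
    print(f"{question}: missing {', '.join(missing)}")
```

A check like this could run in a test suite or a review script, so that gaps are found while the code is being written instead of during an investigation.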

For the complete attribute catalog, see ${CLAUDE_PLUGIN_ROOT}/skills/otel-instrumentation/references/wide-event-attributes.md. For SDK guidance on adding attributes, see the otel-instrumentation skill.

Instrumentation as a Development Practice

Instrumentation is not a one-time setup task. The engineers who write the code are best positioned to know which operations are critical, which paths are error-prone, and what context helps during debugging. Treat instrumentation like testing: plan telemetry when planning features, review it in code reviews, and add missing dimensions as post-incident follow-ups.

Additional Resources

Reference Files

${CLAUDE_PLUGIN_ROOT}/skills/observability-fundamentals/references/events-vs-metrics-vs-logs.md — Code examples: same operation as event, metric, and log

Cross-References

For SDK setup and custom instrumentation: otel-instrumentation skill
For the investigation workflow implementing the core analysis loop: production-investigation skill
For autonomous instrumentation gap analysis: instrumentation-advisor agent

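The original text and its copyright belong to Anthropic and the respective plugin authors. The Japanese translation of this page was generated automatically using the Claude API.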