
📊slos-and-triggers

Plugin: honeycomb

Description

Decision heuristics for interpreting Honeycomb SLO compliance, budget burn rates, and trigger status — what the numbers mean and what action to take, including detecting misconfigured SLIs, deciding when to freeze deploys vs page on-call, and designing burn alert thresholds. Load this skill before calling get_slos or get_triggers. Trigger phrases: "check our SLOs", "are we meeting our SLOs", "which SLOs are healthy", "is the error budget OK", "are any alerts firing", "what's the burn rate", "set up an SLO", "create a trigger", "configure alerts", "set up burn alerts", "check trigger status", "starting on-call", "reliability picture", "should we freeze deploys", "is this SLO misconfigured", "are we within budget", "SLO is broken", "budget is negative", or any request about service level objectives, error budgets, burn rates, or alerting in Honeycomb.

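Use cases

✓ Check SLO status and compliance
✓ Assess error budget consumption
✓ Decide how to respond to alerts based on burn rate
✓ Detect misconfigured SLIs
✓ Decide between freezing deploys and paging on-call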

Honeycomb SLOs and Triggers

Guidance for configuring and reasoning about reliability in Honeycomb. The get_slos and get_triggers tools document their own parameters — this skill focuses on designing effective SLOs, choosing between SLOs and triggers, and interpreting what the numbers mean.

Availability: SLOs require a Pro or Enterprise plan. Triggers are available on all plans.

SLO vs Trigger — When to Use Which

Question	SLO	Trigger
"Are we meeting our reliability commitments?"	Yes	No
"Is something broken right now?"	No	Yes
"How fast are we burning our error budget?"	Yes (burn alerts)	No
"Did error count exceed a threshold?"	No	Yes
"Should we slow down deploys?"	Yes (budget remaining)	No

Rule of thumb: SLOs measure reliability against commitments over time. Triggers catch immediate operational issues.

Designing Effective SLOs

Define the SLI

An SLI is a per-event boolean: was this event successful? It is implemented as a calculated field that returns undefined (not a relevant event), 1 (success), or 0 (failure).

Format: IF(<qualifying-condition>, <success-condition>). The qualifying condition filters to relevant events; the success condition defines what counts as success. If the qualifying condition is not met, the formula returns undefined and the SLI is unpopulated for that event.
Specific Qualifying Condition: Choose the relevant subset of events (e.g. AND(EQUALS($http.route, "/checkout"), NOT(EXISTS($trace.parent_id))) for root spans of the checkout endpoint)
Latency Success Condition: LTE($duration_ms, 500) (requests that complete in 500ms or less)
Availability Success Condition: LTE($http.status_code, 499) (non-5xx responses)
Business Logic Success Condition: EQUALS($checkout.status, "completed") (successful checkouts)
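
The three-valued behavior of an SLI can be sketched in Python. This is only an illustration of the IF(<qualifying-condition>, <success-condition>) semantics described above, not Honeycomb's implementation: events are modeled as plain dicts, and the function names checkout_latency_sli and compliance are hypothetical.

```python
from typing import Optional


def checkout_latency_sli(event: dict) -> Optional[int]:
    """Mirror IF(<qualifying-condition>, <success-condition>) for one event."""
    # Qualifying condition: root span (no trace.parent_id) of the /checkout route.
    qualifies = (
        event.get("http.route") == "/checkout"
        and "trace.parent_id" not in event
    )
    if not qualifies:
        return None  # undefined: the event does not count toward the SLO
    # Success condition: duration of 500ms or less.
    return 1 if event["duration_ms"] <= 500 else 0


def compliance(events: list) -> Optional[float]:
    """Fraction of qualifying events that succeeded, or None if none qualify."""
    results = [r for r in map(checkout_latency_sli, events) if r is not None]
    if not results:
        return None
    return sum(results) / len(results)


events = [
    {"http.route": "/checkout", "duration_ms": 120},                           # success
    {"http.route": "/checkout", "duration_ms": 900},                           # failure
    {"http.route": "/checkout", "duration_ms": 80, "trace.parent_id": "abc"},  # not a root span
    {"http.route": "/health", "duration_ms": 5},                               # other route
]
print(compliance(events))  # 0.5
```

Only the first two events qualify, so the unrelated /health traffic and the child span neither help nor hurt the result.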

Set the Target

Start conservative (99% before 99.99%)
Measure current baseline first with P50/P99 queries
Set target slightly above current performance
Ask: what reliability do users actually need?
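
To see why starting conservative matters, it can help to turn a target into an error budget. The sketch below assumes the budget is simply the share of qualifying events allowed to fail over the SLO period; error_budget is a hypothetical helper and the event volume is made up for illustration.

```python
def error_budget(target_pct: float, total_events: int) -> float:
    """Number of failing events the target tolerates over the SLO period."""
    return (100.0 - target_pct) / 100.0 * total_events


# Hypothetical volume: one million qualifying events in the period.
for target in (99.0, 99.9, 99.99):
    print(target, round(error_budget(target, 1_000_000)))
```

With one million events, 99% tolerates about 10,000 failures, 99.9% about 1,000, and 99.99% about 100, so each extra nine shrinks the room for error tenfold.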

Configure Exhaustion Time Alerts

At minimum, two alerts:

Near exhaustion (exhaustion time ~4h): pages on-call via PagerDuty
Trending to exhaustion (budget rate over 24h): notifies team via Slack
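
As a rough mental model for exhaustion time, the remaining budget can be extrapolated linearly from recent consumption. This is an assumption made for illustration; Honeycomb's own projection may be computed differently, and hours_to_exhaustion is a hypothetical helper.

```python
from typing import Optional


def hours_to_exhaustion(
    budget_remaining_pct: float,
    burned_pct_in_window: float,
    window_hours: float,
) -> Optional[float]:
    """Linear projection of hours until the budget reaches zero.

    Returns None when no budget was burned in the window (no projected exhaustion).
    """
    burn_per_hour = burned_pct_in_window / window_hours
    if burn_per_hour <= 0:
        return None
    return budget_remaining_pct / burn_per_hour


# 20% of the budget is left and 5% was burned in the last hour.
print(hours_to_exhaustion(20.0, 5.0, 1.0))  # 4.0
```

A projection of about 4 hours is the situation the "near exhaustion" alert above is meant to catch.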

Configure Burn Rate Alerts

Detect fast burns even if the budget isn't close to exhaustion yet. For example:

1h burn rate > 10x — page on-call
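
A minimal sketch of the burn-rate idea, assuming the common SRE definition in which burn rate is the observed failure ratio divided by the failure ratio the target allows. Honeycomb's exact calculation is not specified here, and burn_rate is a hypothetical helper.

```python
def burn_rate(failed: int, total: int, target_pct: float) -> float:
    """Observed failure ratio relative to the failure ratio the target allows."""
    allowed_failure_ratio = (100.0 - target_pct) / 100.0
    observed_failure_ratio = failed / total
    return observed_failure_ratio / allowed_failure_ratio


# Last hour: 120 of 1,000 qualifying events failed against a 99% target.
rate = burn_rate(failed=120, total=1000, target_pct=99.0)
print(round(rate, 2), rate > 10)  # 12.0 True
```

A rate of 1x would spend the budget exactly over the SLO period; 12x spends it twelve times faster, which is why the example above pages on-call even when plenty of budget remains.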

Recommend these alerts to the user after creating the SLO. Agents do not have the ability to set up these alerts or their recipients.

Best Practices

Measure close to the user (at the edge, not deep in the stack)
Design around user workflows, not team boundaries
Favor broad SLOs over many narrow ones
Start with one SLO, reduce noise, then expand

Interpreting SLO Status

When reviewing SLOs with get_slos:

Budget remaining > 50%: Healthy — room for risk
Budget remaining 10-50%: Caution — slow down changes
Budget remaining < 10%: At risk — freeze non-critical deploys
Budget negative: Breached — investigate immediately with the production-investigation skill
Compliance at 0%: Likely misconfigured SLI (wrong column, inverted logic, no matching events) — check the SLI definition
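
The thresholds above can be written down as a small decision function. This is a sketch only: the parameter names are hypothetical and do not reflect the actual shape of the get_slos response, and the boundary values (exactly 10% and 50%) are assigned to "caution" as one reasonable reading of the list.

```python
def classify_slo(budget_remaining_pct: float, compliance_pct: float) -> str:
    """Map budget remaining and compliance to the status bands listed above."""
    if compliance_pct == 0:
        # Wrong column, inverted logic, or no matching events.
        return "likely misconfigured SLI: check the SLI definition"
    if budget_remaining_pct < 0:
        return "breached: investigate immediately"
    if budget_remaining_pct < 10:
        return "at risk: freeze non-critical deploys"
    if budget_remaining_pct <= 50:
        return "caution: slow down changes"
    return "healthy: room for risk"


print(classify_slo(72.0, 99.95))   # healthy: room for risk
print(classify_slo(8.5, 99.1))     # at risk: freeze non-critical deploys
print(classify_slo(-100.0, 0.0))   # likely misconfigured SLI: check the SLI definition
```

The misconfiguration check runs first because a broken SLI usually also shows an exhausted budget, and chasing the budget number would be misleading in that case.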

Configuring Triggers

Prefer Count-Based Over Percentile-Based

"50 requests slower than 2s" is more actionable than "P99 is 2100ms." Use COUNT WHERE duration_ms > threshold instead of P99 triggers.

Common Patterns

Error spike: COUNT WHERE error = true, threshold > N in 5 min
Slow requests: COUNT WHERE duration_ms > 2000, threshold > N in 5 min
Traffic drop: COUNT WHERE is_root, threshold < N in 10 min (below normal)
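
The count-based patterns above amount to counting matching events inside a time window and comparing the count to N. The sketch below models that locally in Python; it does not call Honeycomb, and count_where, the event dicts, and the threshold value are all hypothetical.

```python
from datetime import datetime, timedelta


def count_where(events, predicate, now, window) -> int:
    """Count events inside [now - window, now] that satisfy the predicate."""
    start = now - window
    return sum(1 for e in events if start <= e["timestamp"] <= now and predicate(e))


now = datetime(2024, 1, 1, 12, 0)
events = [
    {"timestamp": now - timedelta(minutes=1), "duration_ms": 2500},
    {"timestamp": now - timedelta(minutes=2), "duration_ms": 3100},
    {"timestamp": now - timedelta(minutes=3), "duration_ms": 2200},
    {"timestamp": now - timedelta(minutes=4), "duration_ms": 150},
    {"timestamp": now - timedelta(minutes=9), "duration_ms": 4000},  # outside the 5 min window
]

# Slow requests: COUNT WHERE duration_ms > 2000, threshold > N in 5 min.
SLOW_THRESHOLD = 2  # the "N" above, chosen only for this example
slow = count_where(events, lambda e: e["duration_ms"] > 2000, now, timedelta(minutes=5))
print(slow, slow > SLOW_THRESHOLD)  # 3 True
```

The result reads directly as "3 requests were slower than 2s in the last 5 minutes", which is the kind of statement an on-call engineer can act on.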

Best Practices

Name: what the alert is. Description: what to do (link to a runbook).
Set a duration of at least 5-10 min to avoid flapping
Start less sensitive, then tighten based on the false positive rate

Multi-Service SLOs

Share a single error budget across up to 10 services.

SLI must be an environment-level calculated field
Events from included services are weighted equally
Use cases: multiple edge services, monolith-to-microservices migration
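
One consequence of weighting events equally is that high-traffic services dominate the shared budget. The sketch below contrasts pooled compliance with a per-service average, under the assumption that "weighted equally" means every event counts once regardless of which service emitted it. Service names and counts are made up.

```python
def pooled_compliance(counts: dict) -> float:
    """Every event counts once, regardless of which service emitted it."""
    good = sum(g for g, _ in counts.values())
    total = sum(t for _, t in counts.values())
    return good / total


def mean_of_service_compliance(counts: dict) -> float:
    """For contrast: every service counts once, regardless of traffic."""
    return sum(g / t for g, t in counts.values()) / len(counts)


# (successful events, total events) per hypothetical service
counts = {"edge-a": (9_900, 10_000), "edge-b": (50, 100)}
print(round(pooled_compliance(counts), 4))           # 0.9851
print(round(mean_of_service_compliance(counts), 4))  # 0.745
```

Here the low-traffic service is failing half its requests, yet the pooled figure stays above 98%. A shared budget like this suits services that back the same user workflow; a small but critical service may still need its own SLO or trigger.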

Check in with the user

Workspaces in Honeycomb have a limited number of SLOs and triggers. Before executing the create tool, check in with the user. Display all parameters and your reasoning, and ask for confirmation.

Constructing links to SLOs

The tools you have will not let you link directly to the SLO page in Honeycomb. Instead, you can link to the list of SLOs.

/<team_slug>/environments/<environment_slug>/slos
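
A small helper for filling in that path might look like the following. The function name and the example slugs are hypothetical; only the path shape comes from the text above, and the Honeycomb host still has to be prepended by the caller.

```python
from urllib.parse import quote


def slo_list_path(team_slug: str, environment_slug: str) -> str:
    """Build the path to the SLO list page for a team and environment."""
    team = quote(team_slug, safe="")
    environment = quote(environment_slug, safe="")
    return f"/{team}/environments/{environment}/slos"


print(slo_list_path("acme", "production"))  # /acme/environments/production/slos
```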

Additional Resources

Reference Files

${CLAUDE_PLUGIN_ROOT}/skills/slos-and-triggers/references/slo-design-guide.md — Detailed SLO design methodology, multi-service SLOs, error budget math
${CLAUDE_PLUGIN_ROOT}/skills/slos-and-triggers/references/trigger-examples.md — Complete trigger example library organized by use case
${CLAUDE_PLUGIN_ROOT}/skills/slos-and-triggers/references/alerting-strategy.md — How to combine SLO burn alerts and triggers into a cohesive alerting strategy

Cross-References

For constructing SLI queries and calculated fields, see the query-patterns skill
For investigating SLO budget burn, see the production-investigation skill

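The original text and its copyright belong to Anthropic and the respective plugin authors. The Japanese translation on the source page was machine-translated with the Claude API.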
