
🔍diagnose

Plugin: zilliz

Description

Use when the user reports that a Zilliz Cloud cluster or Milvus collection is unhealthy, slow, stuck, returning errors, hitting quotas, or otherwise misbehaving — or when they ask "what's wrong with...", "why is ... slow", "diagnose ...", "troubleshoot ...".

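Use cases

✓ Reporting a cluster malfunction
✓ Diagnosing errors in a collection
✓ Investigating the cause of slow processing
✓ Resolving problems caused by hitting a quota limit
✓ Troubleshooting abnormal behavior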


Prerequisites

CLI installed and logged in (see setup skill).
For cluster diagnosis: a cluster context, or pass --cluster-id explicitly.
For collection diagnosis: the cluster context must point at the cluster that owns the collection (see setup skill).

Scope

This skill performs read-only diagnosis using existing zilliz commands. It does NOT mutate clusters, collections, indexes, or data. Every recommendation is presented as a suggestion plus the exact next command the user can run themselves.

Two entry points:

| User says | Use section |
| --- | --- |
| "diagnose my cluster", "cluster is slow / stuck / unhealthy" | Cluster Diagnosis |
| "diagnose this collection", "search is slow on X", "X won't load" | Collection Diagnosis |

When the user's intent spans both (e.g. "everything is slow"), run cluster diagnosis first — a collection-level symptom often has a cluster-level cause.

Cluster Diagnosis

Follow this checklist in order. Stop early only if you find a P0 problem (cluster not RUNNING, quota hard-stop) — surface it before running the rest.

1. Collect

Run these in parallel where possible and prefer -o json for machine parsing:

zilliz context current
zilliz cluster describe --cluster-id <id> -o json
zilliz cluster list --all -o json                       # peer clusters for context
zilliz billing usage -o json                            # quota / spend headroom
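
The four collection commands above are independent of one another, so they can be issued concurrently. The following is a minimal Python sketch of that idea, not part of the skill itself. The function names `run_cli` and `collect`, the dictionary keys, and the injectable `runner` parameter are all hypothetical; only the command lines are taken from the list above.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_cli(argv):
    """Run one read-only CLI command and return its raw stdout."""
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return result.stdout


def collect(cluster_id, runner=run_cli):
    """Run the independent collection commands in parallel.

    `runner` is injectable so the function can be exercised without the CLI.
    """
    # Command lines copied from the checklist; none of them mutates state.
    commands = {
        "context": ["zilliz", "context", "current"],
        "describe": ["zilliz", "cluster", "describe",
                     "--cluster-id", cluster_id, "-o", "json"],
        "peers": ["zilliz", "cluster", "list", "--all", "-o", "json"],
        "billing": ["zilliz", "billing", "usage", "-o", "json"],
    }
    with ThreadPoolExecutor(max_workers=len(commands)) as pool:
        futures = {name: pool.submit(runner, argv)
                   for name, argv in commands.items()}
        # result() re-raises any exception from the worker thread.
        return {name: future.result() for name, future in futures.items()}
```

Threads are sufficient here because each call spends its time waiting on a subprocess rather than on Python computation.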

Then pull time-series metrics covering the last hour and last 24h. Use the chart output for human review and -o json if you need to compute thresholds:

zilliz cluster metrics --cluster-id <id> \
  -m CU_COMPUTATION -m CU_CAPACITY -m CU_SIZE -m REPLICA_COUNT \
  -m STORAGE -m SLOW_QUERIES \
  -m SEARCH_QPS -m SEARCH_LATENCY_P99 \
  -m SEARCH_FAIL_RATE -m INSERT_FAIL_RATE \
  --period 1h

Repeat with --period 24h for trend context.

2. Analyze

Walk each rule. Each finding must cite the evidence (command + observed value).

| Check | Evidence | If true, suggest |
| --- | --- | --- |
| Cluster status ≠ RUNNING | `cluster describe` `.status` | Pause diagnosis; explain status; if SUSPENDED suggest `cluster resume` |
| Plan vs. observed QPS mismatch (e.g. Serverless under sustained high QPS) | plan + `SEARCH_QPS` | Discuss plan upgrade tradeoff |
| `CU_COMPUTATION` p95 > 80% of `CU_CAPACITY` for sustained windows | metric series | Recommend scaling CU or adding replicas |
| `SEARCH_LATENCY_P99` rising while QPS flat | latency vs qps | Likely index/CU pressure; drill into top collections |
| Non-zero `*_FAIL_RATE` | fail-rate series | Cross-check with recent user-reported errors |
| `SLOW_QUERIES` non-zero | metric | Drill into per-collection diagnosis for the offenders |
| Storage trending toward quota | `STORAGE` + billing | Suggest cleanup / plan change before hard cap |
| Region likely far from client | endpoint + user-reported RTT | Note possible region mismatch; RTT cannot be measured from the CLI |
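
The CU saturation rule in the table can be made concrete once the metric series has been parsed into plain numbers. The sketch below is one possible reading of "p95 > 80% of capacity for sustained windows", under stated assumptions: the series is a list of samples in the same unit as the capacity, capacity is treated as a single number, and "sustained" means the rule holds in each of the last few consecutive windows. The names `p95` and `cu_saturated` and the defaults `window=20` and `min_windows=3` are hypothetical, not values taken from the skill.

```python
def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    # Integer form of ceil(0.95 * n), which avoids float rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def cu_saturated(computation, capacity, window=20, min_windows=3, ratio=0.8):
    """True if p95(computation) exceeds ratio * capacity in each of the
    last `min_windows` windows of `window` samples."""
    needed = window * min_windows
    if len(computation) < needed:
        return False  # too little data to call the pressure sustained
    recent = computation[-needed:]
    chunks = [recent[i:i + window] for i in range(0, needed, window)]
    return all(p95(chunk) > ratio * capacity for chunk in chunks)
```

Whatever thresholds are used, the finding should still cite the command and the observed p95 value as evidence.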

3. Present

Render a single report with three sections, in this order:

Summary — one-line health verdict (healthy / degraded / critical) + 1-3 sentence rationale.
Findings — table with columns Severity | Finding | Evidence | Suggested Next Step.
Suggested commands — copy-pasteable zilliz ... commands matching each suggestion. Never run mutating commands yourself — let the user run them.
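
The three-section report described above maps naturally onto a small data structure. The following Python sketch assumes a hypothetical `Finding` record and `render_report` helper; the severity labels and the plain-text layout are illustrative choices, while the section names, column names, and verdict values come from the list above.

```python
from dataclasses import dataclass

VERDICTS = {"healthy", "degraded", "critical"}


@dataclass
class Finding:
    severity: str       # e.g. "P0"; the label scheme is an assumption
    finding: str
    evidence: str       # command that produced it plus the observed value
    next_step: str
    command: str = ""   # copy-pasteable suggestion; never executed here


def render_report(verdict, rationale, findings):
    """Render Summary, Findings, and Suggested commands, in that order."""
    if verdict not in VERDICTS:
        raise ValueError(f"unknown verdict: {verdict}")
    lines = [
        "Summary",
        f"{verdict}: {rationale}",
        "",
        "Findings",
        "Severity | Finding | Evidence | Suggested Next Step",
    ]
    for f in findings:
        lines.append(f"{f.severity} | {f.finding} | {f.evidence} | {f.next_step}")
    lines += ["", "Suggested commands"]
    lines += [f.command for f in findings if f.command]
    return "\n".join(lines)
```

Keeping the suggested command as a string on each finding makes it explicit that the report only prints commands for the user to run.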

Collection Diagnosis

1. Collect

zilliz collection describe --name <coll> -o json
zilliz collection get-stats --name <coll> -o json
zilliz collection get-load-state --name <coll> -o json
zilliz index list --collection-name <coll> -o json
# For each vector index reported above:
zilliz index describe --collection-name <coll> --field-name <field> -o json
zilliz partition list --collection-name <coll> -o json

Pull per-collection metrics for the last hour and 24h:

zilliz collection metrics -c <coll> \
  -m SEARCH_QPS -m SEARCH_LATENCY_P99 -m SEARCH_FAIL_RATE \
  -m QUERY_QPS -m QUERY_LATENCY_P99 \
  -m INSERT_QPS -m INSERT_LATENCY_P99 \
  -m ENTITIES -m ENTITIES_LOADED -m ENTITIES_INDEXED \
  --period 1h

If the cluster context is wrong or the collection lives in a non-default database, pass --database <db> on every command.

2. Analyze

| Check | Evidence | If true, suggest |
| --- | --- | --- |
| Load state ≠ Loaded (or partially loaded) | `get-load-state` | `collection load --name <coll>`; explain that unloaded collections cannot serve search |
| `ENTITIES_LOADED` ≪ `ENTITIES` | metrics | Load incomplete or recently grew; wait or reload |
| `ENTITIES_INDEXED` ≪ `ENTITIES` | metrics | Indexing lag; investigate before tuning |
| Vector field present but no vector index, or FLAT on large row count | `index list` + `get-stats` | Recommend HNSW / IVF_* with a parameter range appropriate for row count; mark as recommendation, not absolute |
| Index params clearly off (e.g. `nlist` ≪ √N for IVF) | `index describe` + row count | Suggest revised values; note that benchmarking is needed for the final number |
| Replica count = 1 with sustained high QPS | `collection describe` replicas + `SEARCH_QPS` | Suggest more replicas, conditioned on CU headroom from cluster diagnosis |
| Schema issues: very wide varchar, many un-indexed scalar fields used in filters | `collection describe` schema | Note impact; suggest scalar indexes where applicable |
| No partition key but row count is large and queries are naturally partitionable | schema | Mention partition-key option (cannot be added in place; design-time decision) |
| Non-zero `SEARCH_FAIL_RATE` | metric | Cross-check with cluster-level fail-rate metrics |
| Latency rising while QPS flat | latency vs qps | Index pressure or growing segment count; consider compaction; note that segment-level state is not exposed via CLI |
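
Two of the checks in the table are simple numeric comparisons: `nlist` against √N, and loaded or indexed entities against total entities. The sketch below shows one way to express them, assuming the values have already been extracted from the CLI output. The table does not define how far below counts as "≪", so the cutoffs `factor=0.25` and `min_ratio=0.9` are assumptions for illustration only, as are the function names.

```python
import math


def nlist_looks_low(nlist, row_count, factor=0.25):
    """Flag an IVF nlist far below sqrt(N).

    `factor` is an assumed cutoff for "far below"; treat a hit as a
    starting point to benchmark, not as a verdict.
    """
    return nlist < factor * math.sqrt(row_count)


def coverage_gap(part, total, min_ratio=0.9):
    """Flag when loaded (or indexed) entities fall well short of the total."""
    if total == 0:
        return False  # empty collection: nothing to load or index
    return part / total < min_ratio
```

For example, `nlist_looks_low(128, 100_000_000)` compares 128 against 0.25 × 10,000 and flags the index, while `coverage_gap(990, 1000)` does not flag a collection that is 99% loaded.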

3. Present

Same three-section format as cluster diagnosis. When a collection finding's true root cause is at the cluster level (e.g. CU saturation), say so explicitly and reference the cluster report.

Guidance

Read-only. Never run delete, drop, release, resume, suspend, create, update, index create/drop, or any mutating data-plane command as part of diagnosis. Present them as suggestions only.
Cite evidence. Every finding must reference the command that produced it and the observed value. No unsourced claims.
Mark uncertainty. Index parameter recommendations, replica counts, and CU sizing depend on workload specifics the CLI cannot observe. Phrase as "starting point, benchmark to confirm."
Know the limits. The CLI cannot see per-query plans, segment-level state, compaction backlog, GC, server-internal queues, or alert-rule state. If a question requires those, say so plainly rather than guessing.
Prefer parallel collection. When multiple read-only commands are independent, run them in parallel to keep the diagnosis fast.
JSON for parsing, charts for humans. Use -o json (optionally with --query) when comparing values against thresholds; use the default chart output when surfacing trends back to the user.
Honor context. If the user has multiple clusters, confirm which one before collecting. If the collection is in a non-default database, thread --database through every command.
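
The read-only rule can be enforced mechanically before any command is executed. The following Python sketch checks a command line against the verbs named in the guidance; `MUTATING_VERBS` and `is_mutating` are hypothetical names, and the verb set is limited to those the guidance lists, so it should be treated as a minimum rather than a complete list.

```python
# Verbs taken from the read-only rule; the set may not be exhaustive.
MUTATING_VERBS = {"delete", "drop", "release", "resume",
                  "suspend", "create", "update"}


def is_mutating(argv):
    """True if the subcommand part of argv contains a mutating verb.

    Only tokens before the first flag are inspected, so a flag value
    that happens to equal a verb does not trigger a false positive.
    """
    subcommand = []
    for token in argv:
        if token.startswith("-"):
            break
        subcommand.append(token)
    return any(token in MUTATING_VERBS for token in subcommand)
```

A caller would run a command only when `is_mutating` returns `False`, and otherwise surface the command line as a suggestion for the user.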

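The original text and its copyright belong to Anthropic and the respective plugin authors. The Japanese translation was generated automatically with the Claude API.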