スキルOfficialdevelopment

🔍aidp-spark-debugging

プラグイン: oracle-ai-data-platform-workbench-engineer-agent
ソース: GitHub で見る ↗

説明

AIDPのSparkジョブが遅延または失敗した場合の診断を、クラスターログ・メトリクス・Spark UI REST APIを使って行います。次のような場合に使用: - ジョブ/クエリが遅い、または失敗している - ユーザーが「なぜ失敗したのか / なぜ遅いのか」を尋ねている - ステージ/タスクのタイミング、データスキュー、シャッフル/スピル、Executor、またはSQL実行の詳細が必要軽量なトリアージ用途向けです。深いパフォーマンスチューニング（スキュー/スピル/シャッフル/ジョイン/AQE/Delta）には、`aidp-spark-optimization` スキルを使用してください。

原文を表示

Diagnose slow or failed AIDP Spark work using cluster logs, metrics, and the Spark UI REST API. Use when a job/query is slow or failed, the user asks "why did this fail / why is it slow", or you need stage/task timings, skew, shuffle/spill, executor, or SQL-execution details. Lightweight triage — for deep performance tuning (skew/spill/shuffle/joins/AQE/Delta) use the `aidp-spark-optimization` skill.

ユースケース

✓ジョブ/クエリが遅い、または失敗している
✓失敗理由や遅延原因を調査する
✓ステージ/タスクの詳細を確認する
✓データスキューやシャッフルを診断する

本文（日本語訳）

`aidp-spark-debugging` — ログ・メトリクス・Spark UI トリアージ

実際の実行データに基づいて障害/低速の診断を行います（推測は不要）。MCPは不要で、2つのエンジンを使用します:

クラスターログ＆メトリクス → クラスター（コントロールプレーン）に対して oci raw-request POST アクションを実行
Spark UI のジョブ/ステージ/タスク/SQL 詳細 → scripts/aidp_sql.py がカーネル側セルを実行し、Spark REST API（spark.sparkContext.uiWebUrl + /api/v1/applications/...）を叩く。Spark UI は実行中のカーネル内部からしか到達できないため、このヘルパーが MCP を使わない経路となる。

次のような場合に使用:

Spark ジョブ/クエリが遅い、または失敗している場合
「なぜ実行 X が失敗したのか / なぜ遅いのか」を調査したい場合
ステージ/タスク/スキュー/シャッフルの詳細が必要な場合

エンジン

1. ログ＆メトリクス — `oci raw-request`（コントロールプレーン）

ベース設定と認証手順は references/oci-raw-request.md を参照。 20240831/dataLakes にて動作確認済み。 <WS> = …/dataLakes/<OCID>/workspaces/<workspace>

# ログ検索（logContentTypeContains: driver | executor | events; ISO ミリ秒タイムスタンプ; ≤24h ウィンドウ）
oci raw-request --http-method POST \
  --target-uri "https://aidp.<region>.oci.oraclecloud.com/20240831/dataLakes/<OCID>/workspaces/<WS>/clusters/<KEY>/actions/searchLogs" \
  --request-body '{"timeBegin":"2026-06-09T00:00:00.000Z","timeEnd":"2026-06-09T01:00:00.000Z","logContentTypeContains":"executor","messageContains":"OutOfMemory"}' \
  --request-headers '{"content-type":"application/json"}' --profile DEFAULT
# また: …/actions/downloadLogs（同じリクエストボディ）でダウンロード可能なアーカイブを取得できる。

# 指定期間のメトリクス（POST summarizeMetricsData）
oci raw-request --http-method POST \
  --target-uri "https://aidp.<region>.oci.oraclecloud.com/20240831/dataLakes/<OCID>/workspaces/<WS>/clusters/<KEY>/actions/summarizeMetricsData" \
  --request-body '{"metricName":"MemoryUtilization","timeBegin":"…Z","timeEnd":"…Z","interval":"1m","aggregationType":"MEAN"}' \
  --request-headers '{"content-type":"application/json"}' --profile DEFAULT

オプションのログフィルター: logLevel、subjectContains、eventType、thread、executionContextId、advancedFilter

メトリクス名: CpuUtilization、MemoryUtilization、GcCpuUtilization、JvmHeapUsed、 DiskReadBytes/DiskWriteBytes、NetworkReceiveBytes/NetworkTransmitBytes、ActiveTasks、 TotalFailedTasks/TotalCompletedTasks/TotalTasks、 shuffleTotalBytesRead/TotalShuffleWriteBytes

aggregationType: MEAN|SUM|MAX|MIN

初回使用時は oci-raw-request.md の「fabrication 禁止ゲート」に従い、クラスターに対して実際のアクションパスをプローブして確認すること。

2. Spark UI 詳細 — `scripts/aidp_sql.py`（カーネル側 Spark REST）

対象クラスター上でセルを実行し、ライブの Spark REST API を読み取ります。このヘルパーは api_key の DEFAULT プロファイルから UPST を生成します（MCP 不要、AIDP_SESSION 不要）:

python "$PLUGIN_DIR/scripts/aidp_sql.py" \
  --region <region> --datalake <OCID> --workspace <ws> --cluster <KEY> \
  --code "import json,ssl,urllib.request as u; ctx=ssl._create_unverified_context(); base=spark.sparkContext.uiWebUrl+'/api/v1'; opn=lambda p: json.load(u.urlopen(p,context=ctx)); apps=opn(base+'/applications'); app=apps[0]['id']; print(json.dumps(opn(base+'/applications/'+app+'/jobs'), default=str))"

SSL に関する注意（2026-06-10 本番確認済み）: クラスターの Spark UI は自己署名証明書付きの HTTPS を使用しているため、通常の urlopen は SSLCertVerificationError を発生させます。上記のように未検証コンテキスト（ssl._create_unverified_context()）を渡してください。これは同一クラスター内部のカーネル間通信であり、外部への通信ではありません。

コントロールプレーン経由の代替手段（2026-06-12 本番確認済み） — カーネルセル不要: 同じ Spark UI REST が AIDP ゲートウェイ経由でプロキシされています:
https://gateway.aidp.<region>.oci.oraclecloud.com/sparkui/<clusterKey>/api/v1/applications[/<app>/{jobs,stages,stages/<id>/<attempt>/taskSummary,executors,sql,storage/rdd,environment}]
同じ oci raw-request --profile DEFAULT 署名で到達可能です。 GET …/sparkui/<clusterKey>/api/v1/applications → 200（実行中の app とバージョンを返す）。

カーネルセルを実行せずにステージ/タスク/SQL メトリクスを取得したい場合（例: クラスターに空きカーネルがない場合）に使用してください。カーネル側の uiWebUrl 経由が引き続きデフォルトです。

本番確認済み: ステージごとの taskSummary 分位数（[p50, p75, p95, p100]）によりタスクスキューを直接把握できます。

完了したジョブの実行結果はテスト済みクラスターに保持されない（Spark History Server は非公開） — 2026-06-12 本番確認済み: ゲートウェイはライブのクラスター app のみを返し、完了した実行の app に対しては 404 を返します。 Spark UI はライブ専用であり、メモリ内保持は有限です（spark.ui.retainedJobs/Stages）。そのため、完了後のジョブのステージ/タスク/SQL 詳細は事後に取得できません（ジョブ/タスクの所要時間とクラスターのメトリクスは summarizeMetricsData で引き続き取得可能 — §1 参照）。

ジョブの完全な Spark UI メトリクスを後から取得できるようにするには、ジョブの最後のタスクとしてライブ REST の内容をキャプチャして Volume / Object Storage に保存するスナップショットセルを追加してください:
import json, ssl, urllib.request as u
ctx=ssl._create_unverified_context(); base=spark.sparkContext.uiWebUrl+'/api/v1'
app=json.load(u.urlopen(base+'/applications',context=ctx))[0]['id']
get=lambda p: json.load(u.urlopen(f"{base}/applications/{app}{p}",context=ctx))
snap={'executors':get('/allexecutors'),'jobs':get('/jobs'),'stages':get('/stages'),'sql':get('/sql')}
spark.createDataFrame([(json.dumps(snap),)],['m']).coalesce(1).write.mode('overwrite').json('oci://<bkt>@<ns>/spark_metrics/<run>.json')
spark.read.json(...) でいつでも読み返せます。（または、テナンシーで利用可能であれば、クラスターレベルで Spark イベントログの永続化や History Server を有効にしてください。）

REST サブパスを差し替えてドリルダウンできます（すべて /applications/<app>/… 配下）:

/jobs · /jobs/<id> · /stages · /stages/<id>/<attempt> · /stages/<id>/<attempt>/taskSummary（p0〜p100 分位数）· /stages/<id>/<attempt>/taskList · /allexecutors · /sql · /sql/<execId> · /storage/rdd · /environment

戻り値は JSON {status, outputs, spark_job_ids, error} — outputs テキストをパースしてください。詳細は scripts/aidp_sql.py を参照。

トリアージワークフロー

スコープの確認: ジョブ実行（aidp-pipelines のタスク実行出力から開始）か、インタラクティブクエリか？
対象の特定: カーネル側 Spark REST の /applications/<app>/jobs（?status=failed 付き）または /sql で失敗/低速の ID を特定 → /jobs/<id> または /sql/<execId> で実行計画とステージ ID を確認。
ステージ/タスク詳細: /stages/<id>/<attempt>（シャッフル、スピル、GC）、taskSummary（p0〜p100 — 最大値が中央値を大幅に上回る場合はスキュー）、taskList で外れ値を確認。
Executor / キャッシュ: /allexecutors（メモリ、GC）、/storage/rdd でキャッシュ状況を確認。
エラー: searchLogs（driver/executor/events）で例外を検索、summarizeMetricsData でリソース圧迫を確認（ミリ秒精度の ISO タイムスタンプを使用、≤24h ウィンドウ）。
報告: 根本原因のシグナル（スキュー / スピル / OOM / クラスター不適合 / データエラー）を証拠とともに提示し、具体的な次のステップを示す。

スコープに関する注意

これはトリアージです。深いクエリの書き直しやチューニング（広範な集約、スキューのある JOIN、ドライバー側ループなど）については、上流の spark-performance-optimization / spark-ui-investigation スキルに委ねてください — ここで重複して扱わないようにしてください。

参考資料

references/oci-raw-request.md
references/no-mcp-rest-map.md
scripts/aidp_sql.py
連携スキル: aidp-pipelines、aidp-cluster-ops

原文（English）を表示

`aidp-spark-debugging` — logs + metrics + Spark UI triage

Ground failure/slowness diagnosis in real execution data (never guess). Two engines, no MCP required:

Cluster logs & metrics → oci raw-request POST actions on the cluster (control-plane).
Spark UI job/stage/task/SQL detail → scripts/aidp_sql.py runs a kernel-side cell that hits the Spark REST API (spark.sparkContext.uiWebUrl + /api/v1/applications/...). The Spark UI is only reachable from inside the running kernel, so the helper is the no-MCP path for it.

When to use

A Spark job/query is slow or failed; "why did run X fail / why is it slow"; need stage/task/skew/shuffle detail.

Engines

1. Logs & metrics — `oci raw-request` (control-plane)

Base + auth ladder in references/oci-raw-request.md. Verified on 20240831/dataLakes. <WS> = …/dataLakes/<OCID>/workspaces/<workspace>.

# Search logs (logContentTypeContains: driver | executor | events; ISO ms timestamps; ≤24h window)
oci raw-request --http-method POST \
  --target-uri "https://aidp.<region>.oci.oraclecloud.com/20240831/dataLakes/<OCID>/workspaces/<WS>/clusters/<KEY>/actions/searchLogs" \
  --request-body '{"timeBegin":"2026-06-09T00:00:00.000Z","timeEnd":"2026-06-09T01:00:00.000Z","logContentTypeContains":"executor","messageContains":"OutOfMemory"}' \
  --request-headers '{"content-type":"application/json"}' --profile DEFAULT
# also: …/actions/downloadLogs (same body) for a downloadable archive.

# Metrics over a range (POST summarizeMetricsData)
oci raw-request --http-method POST \
  --target-uri "https://aidp.<region>.oci.oraclecloud.com/20240831/dataLakes/<OCID>/workspaces/<WS>/clusters/<KEY>/actions/summarizeMetricsData" \
  --request-body '{"metricName":"MemoryUtilization","timeBegin":"…Z","timeEnd":"…Z","interval":"1m","aggregationType":"MEAN"}' \
  --request-headers '{"content-type":"application/json"}' --profile DEFAULT

Optional log filters: logLevel, subjectContains, eventType, thread, executionContextId, advancedFilter. Metric names: CpuUtilization, MemoryUtilization, GcCpuUtilization, JvmHeapUsed, DiskReadBytes/DiskWriteBytes, NetworkReceiveBytes/NetworkTransmitBytes, ActiveTasks, TotalFailedTasks/TotalCompletedTasks/TotalTasks, shuffleTotalBytesRead/TotalShuffleWriteBytes. aggregationType: MEAN|SUM|MAX|MIN. Confirm the exact action path against the cluster on first use (probe), per the no-fabrication gate in oci-raw-request.md.

2. Spark UI detail — `scripts/aidp_sql.py` (kernel-side Spark REST)

Run a cell on the target cluster that reads the live Spark REST API. The helper mints a UPST from the api_key DEFAULT profile (no MCP, no AIDP_SESSION):

python "$PLUGIN_DIR/scripts/aidp_sql.py" \
  --region <region> --datalake <OCID> --workspace <ws> --cluster <KEY> \
  --code "import json,ssl,urllib.request as u; ctx=ssl._create_unverified_context(); base=spark.sparkContext.uiWebUrl+'/api/v1'; opn=lambda p: json.load(u.urlopen(p,context=ctx)); apps=opn(base+'/applications'); app=apps[0]['id']; print(json.dumps(opn(base+'/applications/'+app+'/jobs'), default=str))"

SSL note (LIVE-VERIFIED 2026-06-10): the cluster's Spark UI is HTTPS with a self-signed cert, so a bare urlopen raises SSLCertVerificationError. Pass an unverified context (ssl._create_unverified_context(), as above) — this is kernel-internal traffic to the same cluster, not an external call.

Control-plane alternative (LIVE-VERIFIED 2026-06-12) — no kernel cell needed: the same Spark UI REST is proxied through the AIDP gateway at https://gateway.aidp.<region>.oci.oraclecloud.com/sparkui/<clusterKey>/api/v1/applications[/<app>/{jobs,stages,stages/<id>/<attempt>/taskSummary,executors,sql,storage/rdd,environment}], reachable with the same oci raw-request --profile DEFAULT signing. GET …/sparkui/<clusterKey>/api/v1/applications → 200 (returns the running app + version). Use this when you want stage/task/SQL metrics without executing a cell (e.g. the cluster has no free kernel); the kernel-side uiWebUrl path above remains the default. Verified live: per-stage taskSummary quantiles ([p50,p75,p95,p100]) expose task skew directly.

Finished job runs are not retained on the tested cluster (no Spark History Server exposed) — LIVE-VERIFIED 2026-06-12: the gateway lists only the LIVE cluster app and 404s on a completed run's app; the Spark UI is live-only with bounded in-memory retention (spark.ui.retainedJobs/Stages). So a finished Job's stage/task/SQL detail isn't retrievable after the fact (job/task durations and cluster metrics via summarizeMetricsData still are — see §1). To make a Job's full Spark-UI metrics fetchable later, append a snapshot cell as the job's last task that captures the live REST and persists it to a Volume / Object Storage:
import json, ssl, urllib.request as u
ctx=ssl._create_unverified_context(); base=spark.sparkContext.uiWebUrl+'/api/v1'
app=json.load(u.urlopen(base+'/applications',context=ctx))[0]['id']
get=lambda p: json.load(u.urlopen(f"{base}/applications/{app}{p}",context=ctx))
snap={'executors':get('/allexecutors'),'jobs':get('/jobs'),'stages':get('/stages'),'sql':get('/sql')}
spark.createDataFrame([(json.dumps(snap),)],['m']).coalesce(1).write.mode('overwrite').json('oci://<bkt>@<ns>/spark_metrics/<run>.json')
Read it back anytime with spark.read.json(...). (Or enable Spark event-log persistence / a History Server at the cluster level if your tenancy exposes one.) Swap the final REST sub-path to drill in (all under /applications/<app>/…): /jobs · /jobs/<id> · /stages · /stages/<id>/<attempt> · /stages/<id>/<attempt>/taskSummary (p0–p100 quantiles) · /stages/<id>/<attempt>/taskList · /allexecutors · /sql · /sql/<execId> · /storage/rdd · /environment. Returns JSON {status, outputs, spark_job_ids, error} — parse the outputs text. See scripts/aidp_sql.py.

Triage workflow

Scope: a Job run (start with aidp-pipelines task-run output) or an interactive query?
Find the work: kernel-side Spark REST /applications/<app>/jobs (+ ?status=failed) / /sql → the failing/slow id; then /jobs/<id> / /sql/<execId> for the plan + stage ids.
Stage/task detail: /stages/<id>/<attempt> (shuffle, spills, GC); taskSummary (p0–p100 — skew if max ≫ median); taskList for outliers.
Executors / cache: /allexecutors (memory, GC); /storage/rdd for caching.
Errors: searchLogs (driver/executor/events) for the exception; summarizeMetricsData for resource pressure (use ms-precision ISO timestamps; ≤24h window).
Report the root-cause signal (skew / spill / OOM / wrong cluster / data error) with the evidence, and a concrete next step.

Scope note

This is triage. For deep query rewrites/tuning (wide aggregations, skewed joins, driver-side loops), defer to the upstream spark-performance-optimization / spark-ui-investigation skills — don't duplicate them here.

References

references/oci-raw-request.md · references/no-mcp-rest-map.md · scripts/aidp_sql.py · pairs with aidp-pipelines, aidp-cluster-ops

原文・著作権は Anthropic および各プラグイン作者に帰属します。日本語訳は Claude API による自動翻訳です。