Skill | Official | development

🔍aidp-check-data

Plugin: oracle-ai-data-platform-workbench-databricks-migrator

Description

Pre-migration data-availability scan. Reads every notebook in a migration manifest, extracts every spark.read.table / spark.read.parquet / saveAsTable reference, and probes whether each target schema/table/path exists on the AIDP cluster BEFORE you spend Pass-2 cluster time. Use after aidp-build-dag and before aidp-migrate-job, especially the first time you migrate against a target environment.

Use cases

✓ Verify that data exists before migrating
✓ Extract the references to be scanned
✓ Validate up front, before spending cluster time

`aidp-check-data` — pre-migration data-availability scan

Pass-2 of the migrator is expensive (live cluster time + Claude-with-tool-use tokens per cell). Running this scan first catches the "no source table" and "wrong bucket" failure modes in seconds instead of hours.

When to use

After aidp-build-dag, before aidp-migrate-job.
After aidp-migrate-catalog (verify schemas + tables actually landed).
Any time the user wonders "is the data ready".

Invocation

python3 ${CLAUDE_PLUGIN_ROOT}/engine/scripts/check_data_availability.py \
  --root "<databricks-workspace-path>" \
  --cluster <CLUSTER_ID> \
  --aidp-base <AIDP_BASE> \
  --datalake-ocid <DATALAKE_OCID> \
  --workspace-id <WORKSPACE_UUID> \
  --oci-profile <profile>

Or for the workflow-shape input (matches aidp-build-dag's workflow path):

python3 ${CLAUDE_PLUGIN_ROOT}/engine/scripts/check_data_availability_for_workflow.py \
  --job-id <databricks-job-id> \
  --cluster <CLUSTER_ID> \
  --aidp-base <AIDP_BASE> \
  --datalake-ocid <DATALAKE_OCID> \
  --workspace-id <WORKSPACE_UUID> \
  --oci-profile <profile>

What it does

Walks every notebook in the manifest.
Extracts every reference to:
- spark.read.table("...") / spark.table("...")
- spark.read.parquet/csv/json/delta("...")
- .saveAsTable("...") (write targets)
- 3-part name references in %sql / spark.sql(...) strings
For each unique reference, opens a Spark session on the cluster and runs a probe:
- tables → DESCRIBE TABLE <fq> (and SHOW TABLES IN <schema> to differentiate "schema missing" from "table missing")
- paths → dbutils.fs.ls(path) via the migrator's helper
Emits a report with three statuses:
- OK — table/path exists, accessible
- MISSING — does not exist on the cluster
- EMPTY — exists but has 0 rows / 0 files (often a sign that the catalog migration succeeded but data wasn't replicated)
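
The extraction step above can be sketched roughly as follows. The regular expressions are hypothetical stand-ins written for illustration (the real extractor's patterns are not shown in this document), and `extract_references` is an assumed name, not a function from the migrator.

```python
import re

# Hypothetical patterns for illustration only; the migrator's actual regexes are not documented here.
TABLE_READ = re.compile(r"""spark\.(?:read\.)?table\(\s*["']([^"']+)["']\s*\)""")
PATH_READ = re.compile(r"""spark\.read\.(?:parquet|csv|json|delta)\(\s*["']([^"']+)["']""")
TABLE_WRITE = re.compile(r"""\.saveAsTable\(\s*["']([^"']+)["']\s*\)""")
THREE_PART = re.compile(
    r"\b(?:FROM|JOIN|INTO|TABLE)\s+([A-Za-z_]\w*\.[A-Za-z_]\w*\.[A-Za-z_]\w*)",
    re.IGNORECASE,
)


def extract_references(source: str) -> dict[str, set[str]]:
    """Return the unique table and path references found in one notebook's source."""
    tables = set(TABLE_READ.findall(source))
    tables |= set(TABLE_WRITE.findall(source))
    tables |= set(THREE_PART.findall(source))
    paths = set(PATH_READ.findall(source))
    return {"tables": tables, "paths": paths}


notebook = '''
df = spark.read.table("main.sales.orders")
raw = spark.read.parquet("s3://landing/orders/")
df.write.mode("overwrite").saveAsTable("main.sales.orders_clean")
spark.sql("SELECT * FROM main.ref.countries c JOIN main.sales.orders o ON c.id = o.cid")
'''
refs = extract_references(notebook)
print(sorted(refs["tables"]))
print(sorted(refs["paths"]))
```

Collecting into sets means each reference is probed once, no matter how many notebooks mention it.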

How to read the output

Sample shape:

== check_data_availability_for_workflow report ==
TABLES
   OK      <catalog>.<schema>.<table_a>             1234567 rows
   MISSING <catalog>.<schema>.<table_b>             -- DESCRIBE failed: SCHEMA_OR_TABLE_NOT_FOUND
   EMPTY   <catalog>.<schema>.<table_c>             0 rows

PATHS
   OK      oci://<bucket>@<ns>/path/to/file         52 objects
   MISSING oci://<bucket>@<ns>/missing/path         -- listObjects 404

MISSING rows → Pass-2 will definitely fail at those cells. Options:

Run aidp-migrate-catalog if the underlying schema is missing.
Configure aidp-bucket-mapping if s3:// → oci:// rewrites haven't been done.
Mark the table as "out of scope" in the manifest and migrate the consuming notebook with a stub upstream.

EMPTY rows → Pass-2 may pass (no error) but produce empty downstream tables. This is the silent failure mode. Decide whether to:

Backfill the source.
Use the synthetic-data path (if your team has one).
Accept and document.
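
If the report needs to be triaged by a script rather than by eye, a small parser along these lines may help. It assumes the output matches the sample shape shown above exactly, which this document does not guarantee; `Row` and `parse_report` are hypothetical names.

```python
from typing import NamedTuple


class Row(NamedTuple):
    kind: str    # "TABLES" or "PATHS"
    status: str  # "OK", "MISSING", or "EMPTY"
    target: str  # table name or object-storage path
    detail: str  # row count, object count, or failure reason


def parse_report(text: str) -> list[Row]:
    """Parse a report in the sample shape into structured rows."""
    rows, kind = [], ""
    for line in text.splitlines():
        stripped = line.strip()
        if stripped in ("TABLES", "PATHS"):
            kind = stripped  # section header: remember which block we are in
            continue
        parts = stripped.split(None, 2)
        if len(parts) >= 2 and parts[0] in ("OK", "MISSING", "EMPTY"):
            detail = parts[2].lstrip("- ").strip() if len(parts) == 3 else ""
            rows.append(Row(kind, parts[0], parts[1], detail))
    return rows


report = """\
== check_data_availability_for_workflow report ==
TABLES
   OK      cat.sch.table_a             1234567 rows
   MISSING cat.sch.table_b             -- DESCRIBE failed: SCHEMA_OR_TABLE_NOT_FOUND
   EMPTY   cat.sch.table_c             0 rows

PATHS
   OK      oci://bucket@ns/path/to/file         52 objects
   MISSING oci://bucket@ns/missing/path         -- listObjects 404
"""
rows = parse_report(report)
blockers = [r for r in rows if r.status != "OK"]
for r in blockers:
    print(f"{r.status:8} {r.target}  ({r.detail})")
```

Treating EMPTY as a blocker alongside MISSING keeps the silent failure mode from slipping past an automated gate.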

Reusing the bucket-mapping config

If the manifest references s3:// paths, the scanner also consults <migrator-repo>/config/oci_bucket_tenancy_mapping.json (or whatever your bucket mapping helper resolves) to translate them before probing. If the bucket is missing from the mapping, the scanner reports a clear error: S3 bucket X not found in OCI bucket mapping. Fix via aidp-bucket-mapping and re-run.
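
As a rough illustration of the translation step, the sketch below rewrites an s3:// URI into the oci://<bucket>@<ns>/... form seen in the sample report. The mapping layout (S3 bucket name to OCI bucket and namespace) is an assumption made for this example; the real schema of oci_bucket_tenancy_mapping.json is not shown in this document, and `to_oci_uri` is a hypothetical helper.

```python
from urllib.parse import urlparse

# Assumed mapping shape for illustration: S3 bucket name -> OCI bucket and namespace.
BUCKET_MAPPING = {
    "landing": {"bucket": "landing-oci", "namespace": "mytenancy"},
}


def to_oci_uri(s3_uri: str, mapping: dict) -> str:
    """Translate an s3:// URI to oci://<bucket>@<namespace>/<path> using the mapping."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3":
        return s3_uri  # already oci:// or another scheme; leave untouched
    entry = mapping.get(parsed.netloc)
    if entry is None:
        raise ValueError(f"S3 bucket {parsed.netloc} not found in OCI bucket mapping")
    return f"oci://{entry['bucket']}@{entry['namespace']}{parsed.path}"


print(to_oci_uri("s3://landing/orders/2024/", BUCKET_MAPPING))
```

Failing loudly on an unmapped bucket, instead of probing the untranslated path, is what lets the mapping gap surface as a configuration problem and not as a misleading MISSING row.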

Performance + cost

Each table probe is a small DESCRIBE — sub-second on a warm cluster.
Each path probe is a listObjects against OCI Object Storage — also fast.
Total scan time scales linearly with unique references; expect <2 min for a workflow with 50 notebooks.
No Claude tokens spent — this is pure REST + Spark.
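
To make the cost of a table probe concrete, here is a minimal sketch of what one could look like. It is not the migrator's code: `probe_table` is a hypothetical function, the emptiness check uses `limit(1)` in place of the exact row count that the sample report prints, and `FakeSpark` is only a stand-in so the example runs without a cluster.

```python
def probe_table(spark, fq_name: str) -> tuple[str, str]:
    """Classify one table as OK, MISSING, or EMPTY using a Spark session."""
    try:
        spark.sql(f"DESCRIBE TABLE {fq_name}")
    except Exception as exc:  # real Spark raises AnalysisException for unknown tables
        return "MISSING", f"DESCRIBE failed: {exc}"
    # limit(1) keeps the check cheap; an exact count would scan the whole table.
    has_rows = spark.table(fq_name).limit(1).count() > 0
    return ("OK", "has rows") if has_rows else ("EMPTY", "0 rows")


class _FakeFrame:
    """Tiny DataFrame stand-in that only knows its row count."""

    def __init__(self, n: int):
        self._n = n

    def limit(self, k: int) -> "_FakeFrame":
        return _FakeFrame(min(self._n, k))

    def count(self) -> int:
        return self._n


class FakeSpark:
    """Stand-in for a SparkSession so the sketch runs locally."""

    def __init__(self, tables: dict[str, int]):
        self._tables = tables  # fully qualified name -> row count

    def sql(self, query: str) -> None:
        name = query.split()[-1]
        if name not in self._tables:
            raise RuntimeError("SCHEMA_OR_TABLE_NOT_FOUND")

    def table(self, name: str) -> _FakeFrame:
        return _FakeFrame(self._tables[name])


spark = FakeSpark({"c.s.full": 10, "c.s.empty": 0})
for name in ("c.s.full", "c.s.empty", "c.s.nope"):
    print(name, probe_table(spark, name))
```

Against a real session, the same function would take the cluster's `SparkSession` in place of `FakeSpark`.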

Gotchas

2-part vs 3-part name resolution — if the source code uses schema.table (no catalog), the scanner resolves against the cluster's current catalog (default). If the user expects a non-default catalog, surface that mismatch.
Cluster must be Active. A Stopped cluster will make every probe fail with a connection error — instruct the user to start the cluster first.
CTEs and computed names: the regex extractor catches static names, not names built at runtime via f-strings. False negatives are possible — review notebooks that consume dynamic table names manually.
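
For the manual review of dynamic names, a quick heuristic pass can shortlist the lines worth reading. The sketch below flags table calls whose argument is not a plain string literal; the pattern and the name `find_dynamic_table_calls` are assumptions for illustration, and the heuristic will miss names assembled inside SQL strings.

```python
import re

# Hypothetical heuristic: a table call whose first argument does not start with a quote.
DYNAMIC_CALL = re.compile(
    r"""(?:spark\.(?:read\.)?table|\.saveAsTable)\(\s*(?!["'])([^)]*)\)"""
)


def find_dynamic_table_calls(source: str) -> list[tuple[int, str]]:
    """Return (line number, argument text) for table calls with non-literal names."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in DYNAMIC_CALL.finditer(line):
            hits.append((lineno, match.group(1).strip()))
    return hits


source = '''\
env = "prod"
a = spark.table("main.sales.orders")
b = spark.table(f"{env}.sales.orders")
c.write.saveAsTable(target_name)
'''
for lineno, arg in find_dynamic_table_calls(source):
    print(f"line {lineno}: review dynamic table name {arg}")
```

Line 2 is skipped because its argument is a static literal that a regex extractor can already capture; lines 3 and 4 are the ones a reviewer would need to resolve by hand.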

After this

If everything is OK: proceed to aidp-migrate-job. If anything is MISSING: resolve via aidp-migrate-catalog or aidp-bucket-mapping and re-run this skill.

The original text and copyright belong to Anthropic and the respective plugin authors. The Japanese translation was generated automatically with the Claude API.