Skill · Official · development

✅aidp-data-quality

Plugin: oracle-ai-data-platform-workbench-engineer-agent
Source: GitHub

Description

Run data-quality rule checks on AIDP tables — not-null, uniqueness, allowed ranges/sets, referential integrity, and freshness. Use when the user wants to validate data, check for nulls/duplicates/orphans, assert a column's domain, or gate a pipeline on quality. Expresses each rule as bounded Spark SQL and reports pass/fail with offending counts.

Use cases

✓ Validate data
✓ Check for nulls, duplicates, and orphan records
✓ Assert a column's domain
✓ Gate a pipeline on quality

`aidp-data-quality` — rule checks via Spark SQL

Validate AIDP tables against explicit data-quality rules, each compiled to bounded Spark SQL and executed with the bundled helper — no MCP and no ai-data-engineer-agent repo required.

When to use

"Check <table> for nulls/duplicates", "validate <column>", "are there orphan rows", "is the data fresh", or gating a pipeline on quality.

Rule types (each → a counting SQL that should return 0 violations)

| Rule | Check (violations) |
| --- | --- |
| not-null | `COUNT(*) WHERE col IS NULL` |
| unique | `COUNT(*) - COUNT(DISTINCT key)` (or `GROUP BY key HAVING COUNT(*) > 1`) |
| range / set | `COUNT(*) WHERE col NOT BETWEEN lo AND hi` / `col NOT IN (...)` |
| referential | `COUNT(*) child LEFT JOIN parent ... WHERE parent.key IS NULL` |
| freshness | `MAX(ts)` vs SLA (e.g. passes when `datediff(current_date, MAX(ts)) <= N`) |
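The shorthand in the table can be expanded into full counting queries. The following is a minimal sketch, assuming simple string templates; the function names are hypothetical and not part of the plugin, and the freshness check is wrapped in a `CASE` so that it also returns 0 on success, like the other rules.

```python
# Hypothetical builders: none of these function names come from the plugin.
# Table and column names are interpolated as-is, so they should come from a
# trusted source such as the catalog, never from free-form user input.
from typing import Sequence


def not_null_sql(table: str, col: str) -> str:
    return f"SELECT COUNT(*) AS v FROM {table} WHERE {col} IS NULL"


def unique_sql(table: str, key: str) -> str:
    # Counts rows beyond the first occurrence of each key. COUNT(DISTINCT key)
    # skips NULLs, so rows with a NULL key also add to the count.
    return f"SELECT COUNT(*) - COUNT(DISTINCT {key}) AS v FROM {table}"


def range_sql(table: str, col: str, lo: float, hi: float) -> str:
    return (f"SELECT COUNT(*) AS v FROM {table} "
            f"WHERE {col} NOT BETWEEN {lo} AND {hi}")


def set_sql(table: str, col: str, allowed: Sequence[str]) -> str:
    # Quote each allowed value as a SQL string literal, doubling single quotes.
    quoted = ", ".join("'" + a.replace("'", "''") + "'" for a in allowed)
    return f"SELECT COUNT(*) AS v FROM {table} WHERE {col} NOT IN ({quoted})"


def referential_sql(child: str, parent: str, child_key: str, parent_key: str) -> str:
    # Child rows with no matching parent row.
    return (f"SELECT COUNT(*) AS v FROM {child} c LEFT JOIN {parent} p "
            f"ON c.{child_key} = p.{parent_key} WHERE p.{parent_key} IS NULL")


def freshness_sql(table: str, ts_col: str, max_age_days: int) -> str:
    # Returns 0 when the newest row is within the SLA, 1 otherwise
    # (an empty table yields NULL in the comparison and therefore 1).
    return (f"SELECT CASE WHEN datediff(current_date(), MAX({ts_col})) "
            f"<= {max_age_days} THEN 0 ELSE 1 END AS v FROM {table}")
```

Note that `NOT BETWEEN` and `NOT IN` evaluate to NULL for NULL inputs, so NULL values are not counted by the range / set checks; pair them with a not-null rule if NULLs should fail.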

Workflow

1. Resolve table(s)/columns; use join keys from .aidp/catalog.md for referential checks (don't guess). Pull rule definitions from .aidp/semantic.md value dictionaries where available.
2. Ensure the cluster is RUNNING (aidp-cluster-ops / oci raw-request), then for each rule run the violation-count SQL with the bundled helper (PASS if 0, else FAIL):
```
python "$PLUGIN_DIR/scripts/aidp_sql.py" --region <region> --datalake <DATALAKE_OCID> --workspace <ws> \
  --cluster <cluster-key> \
  --code "spark.sql('''SELECT COUNT(*) AS v FROM cat.sch.t WHERE col IS NULL''').show()"
```
   It mints a UPST from the api_key DEFAULT profile, auto-creates a scratch notebook, and returns JSON with status / outputs / spark_job_ids. No AIDP_SESSION required (--session-profile optional).
3. On a non-zero count, FAIL and pull a few example offending rows with a separate bounded LIMIT query.
4. Report a summary table: rule · target · result · violation count.
5. Offer to (a) persist the rule set for re-runs (see below), and (b) wire checks into a Job (aidp-pipelines) as a gating task.
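The run-and-report part of this workflow can be sketched in Python as follows. This is an illustration under stated assumptions, not the plugin's own code: the function names and the `UNVERIFIED` label are hypothetical, and because the layout of the helper's `outputs` field is not described here, the sketch takes the already-extracted `status` and count as inputs. Only the command-line flags shown above are used.

```python
from typing import List, Optional, Tuple


def build_helper_argv(plugin_dir: str, region: str, datalake: str,
                      workspace: str, cluster: str, sql: str) -> List[str]:
    """Build the argv for scripts/aidp_sql.py using the flags shown above."""
    if "'''" in sql:
        # The SQL is wrapped in a triple-quoted Python string below.
        raise ValueError("SQL must not contain triple single quotes")
    code = f"spark.sql('''{sql}''').show()"
    return ["python", f"{plugin_dir}/scripts/aidp_sql.py",
            "--region", region, "--datalake", datalake,
            "--workspace", workspace, "--cluster", cluster,
            "--code", code]


def verdict(status: str, count: Optional[int]) -> str:
    """PASS only for a status: ok run returning 0; FAIL for a non-zero count."""
    if status != "ok" or count is None:
        return "UNVERIFIED"  # hypothetical label: the rule was not actually checked
    return "PASS" if count == 0 else "FAIL"


def summary(rows: List[Tuple[str, str, str, Optional[int]]]) -> str:
    """Render rule, target, result, and violation count as a plain-text table."""
    lines = ["rule | target | result | violations"]
    for rule, target, status, count in rows:
        shown = "-" if count is None else str(count)
        lines.append(f"{rule} | {target} | {verdict(status, count)} | {shown}")
    return "\n".join(lines)
```

Passing the argv list to `subprocess.run` avoids shell quoting problems with the embedded SQL, which is one reason to build a list rather than a single command string.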

Persisting a re-runnable rule set

Register validated rules in .aidp/dq-rules.md so they can be re-run later (the quality analogue of .aidp/verified-queries.md). One entry per rule records the target table/column, rule-type (the five types above), the violation-SQL (counts violations → PASS when 0), and last-result / last-checked. To re-run, execute each entry's stored violation-SQL via scripts/aidp_sql.py, set the result to PASS (0) or FAIL (<count>), and record the cluster + date — never mark PASS without a status: ok run returning 0. Format and re-run rules: references/dq-rules.md.
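As an in-memory illustration of what one entry carries and how a re-run updates it, consider the sketch below. The actual on-disk format is defined in references/dq-rules.md and is not reproduced here; the class name, field names, and the way cluster and date are combined are assumptions made for this example only.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class DqRule:
    """Hypothetical in-memory model of one .aidp/dq-rules.md entry."""
    target: str                      # table or table.column
    rule_type: str                   # one of the five rule types
    violation_sql: str               # counts violations; 0 means PASS
    last_result: Optional[str] = None
    last_checked: Optional[str] = None


def record_rerun(rule: DqRule, status: str, count: Optional[int],
                 cluster: str, checked_on: date) -> DqRule:
    """Update an entry after a re-run of its stored violation SQL."""
    if status != "ok" or count is None:
        # Never mark PASS (or change anything) without a status: ok run.
        return rule
    rule.last_result = "PASS (0)" if count == 0 else f"FAIL ({count})"
    rule.last_checked = f"{checked_on.isoformat()} on {cluster}"
    return rule
```

Leaving the entry untouched on a failed run keeps the last verified result visible instead of overwriting it with an unverified one.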

Reliability rules

Run real SQL via scripts/aidp_sql.py; never assert a rule passed without a status: ok result.
Keep checks bounded; sample example offenders rather than dumping full result sets.
If a cell returns status: error, read the error, fix the SQL grounded in the catalog, and retry.
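The bounded-sampling rule can be made concrete with a small guard. This is a minimal sketch, assuming a hypothetical helper name and an arbitrary cap of 100 rows; neither comes from the plugin.

```python
MAX_SAMPLE_ROWS = 100  # arbitrary cap chosen for this sketch


def sample_offenders_sql(table: str, predicate: str, limit: int = 5) -> str:
    """Build a LIMIT-bounded query that returns a few offending rows.

    `predicate` is the same violation condition used in the counting SQL,
    for example "col IS NULL".
    """
    if not 1 <= limit <= MAX_SAMPLE_ROWS:
        raise ValueError(f"limit must be between 1 and {MAX_SAMPLE_ROWS}")
    return f"SELECT * FROM {table} WHERE {predicate} LIMIT {limit}"
```

For example, `sample_offenders_sql("cat.sch.t", "col IS NULL")` yields a query that returns at most five rows, so a failing rule with millions of violations never dumps its full result set.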

References

references/dq-rules.md (.aidp/dq-rules.md rule-set format + re-run)
scripts/aidp_sql.py · references/no-mcp-rest-map.md · references/oci-raw-request.md · references/semantic-model.md

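The original text and its copyright belong to Anthropic and the respective plugin authors. The Japanese edition of this page is a machine translation produced with the Claude API.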