aidp-excel

Plugin: oracle-ai-data-platform-workbench-spark-connectors

Description

Read Excel (.xlsx, .xls) files into a Spark DataFrame from an AIDP notebook. Use when the user mentions Excel, .xlsx, or .xls, or has spreadsheet files in a Volume or Object Storage bucket. Two paths: the `com.crealytics.spark.excel` Spark format (cluster JAR required) and a `pandas → CSV → spark.read.csv` fallback that needs no JARs.

Use cases

✓ Read Excel files as a Spark DataFrame
✓ Work with spreadsheet files from a Volume in Spark
✓ Use .xlsx files stored in an Object Storage bucket

`aidp-excel` — Excel (.xlsx) ingestion

Three ways to land Excel data in Spark: the native Spark Excel format (Option A: faster, parallel), a pandas-mediated CSV path (Option B: no cluster setup), or the plugin's stdlib-only .xlsx reader (Option C: no JARs, no openpyxl).

When to use

User has .xlsx / .xls files in a Volume or Object Storage bucket.
The user mentions "Excel", ".xlsx", or "spreadsheet ingestion".

When NOT to use

For CSV files → just use aidp-object-storage. Spark reads CSV natively.

Option C — Pure-stdlib parser (no openpyxl, no JARs)

The plugin ships a stdlib-only .xlsx reader. No openpyxl. No Crealytics JAR. Works on AIDP clusters that have neither PyPI access nor Maven access for the Crealytics dependency closure.

```python
import os
from oracle_ai_data_platform_connectors.excel import read_xlsx_stdlib

xlsx_path = os.environ["EXCEL_PATH"]

# The first row is treated as the header; the remaining rows become the data.
header, *body = read_xlsx_stdlib(xlsx_path)
df = spark.createDataFrame(body, schema=header)
df.show()
```

Limitations: read-only (no stdlib path to write .xlsx), first sheet only by default (pass sheet_path="xl/worksheets/sheet2.xml" for others), best-effort cell type coercion. Good for ingestion of small-to-medium workbooks; for big files (>50 MB) prefer Option A's com.crealytics.spark.excel JAR for parallel reads.

The implementation is at scripts/oracle_ai_data_platform_connectors/excel.py.
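
To give a feel for what a stdlib-only reader involves, here is a minimal sketch built on `zipfile` and `xml.etree.ElementTree`. It is not the plugin's `read_xlsx_stdlib` implementation: the function name `read_first_sheet` and its treatment of cell types are assumptions made for illustration, and it ignores dates, styles, inline strings, and gaps left by empty cells.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by the SpreadsheetML parts inside an .xlsx archive.
NS = {"m": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}
T_TAG = "{%s}t" % NS["m"]
ROW_TAG = "{%s}row" % NS["m"]


def read_first_sheet(xlsx_path, sheet_path="xl/worksheets/sheet1.xml"):
    """Hypothetical helper: return the sheet's rows as lists of strings."""
    with zipfile.ZipFile(xlsx_path) as zf:
        # Shared strings are optional; text cells reference them by index.
        shared = []
        if "xl/sharedStrings.xml" in zf.namelist():
            root = ET.fromstring(zf.read("xl/sharedStrings.xml"))
            for si in root.findall("m:si", NS):
                # Concatenate every text run of the string item.
                shared.append("".join(t.text or "" for t in si.iter(T_TAG)))

        sheet = ET.fromstring(zf.read(sheet_path))
        rows = []
        for row in sheet.iter(ROW_TAG):
            values = []
            for cell in row.findall("m:c", NS):
                v = cell.find("m:v", NS)
                text = v.text if v is not None and v.text is not None else ""
                if cell.get("t") == "s" and text:
                    # Shared-string cell: the value is an index into `shared`.
                    text = shared[int(text)]
                values.append(text)
            rows.append(values)
        return rows
```

The rows it returns can be unpacked into a header and body and passed to `spark.createDataFrame` in the same way as the snippet above. Every value comes back as a string, so numeric columns would still need casting.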

Option A — `com.crealytics.spark.excel` format

Cluster prerequisite

Upload the Crealytics Spark Excel jar (and its Apache POI dependencies) to a Volume and attach it via the cluster Library tab:

| JAR | Maven coordinates |
| --- | --- |
| spark-excel | `com.crealytics:spark-excel_2.12:3.5.0_0.20.4` (matches Spark 3.5; pick the `<spark-ver>_<release>` version matching your cluster) |
| poi | bundled with spark-excel; if missing, add `org.apache.poi:poi-ooxml:5.2.5` and its transitive dependencies |
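
As a quick sanity check before attaching the JAR, the Spark prefix of the artifact version can be compared with the cluster's `spark.version`. The helper below is a hypothetical sketch: the name `jar_matches_spark` is not part of the plugin or of spark-excel. It assumes the `<spark-ver>_<release>` version layout described here and compares major.minor only. The version strings in the example calls are illustrative.

```python
def jar_matches_spark(jar_version: str, spark_version: str) -> bool:
    """Hypothetical check: same Spark major.minor for the JAR and the cluster."""
    # "3.5.0_0.20.4" splits into the Spark part "3.5.0" and the release "0.20.4".
    jar_spark_part = jar_version.split("_", 1)[0]
    jar_major_minor = jar_spark_part.split(".")[:2]
    cluster_major_minor = spark_version.split(".")[:2]
    return jar_major_minor == cluster_major_minor


# In a notebook, spark.version holds the cluster's Spark version string.
print(jar_matches_spark("3.5.0_0.20.4", "3.5.1"))  # True
print(jar_matches_spark("3.4.1_0.20.4", "3.5.1"))  # False
```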

```python
import os

excel_path = os.environ["EXCEL_PATH"]   # e.g. /Volumes/default/default/uploads/data.xlsx

df = (spark.read
      .format("com.crealytics.spark.excel")
      .option("header", "true")
      .option("inferSchema", "true")
      .option("dataAddress", "'Sheet1'!A1")    # optional; defaults to the first sheet, cell A1
      .load(excel_path))
df.show()
```

Strengths: parallel reads on large workbooks; predicate pushdown.

Option B — pandas → CSV → Spark (no jars)

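```python
import os
import pandas as pd

excel_path = os.environ["EXCEL_PATH"]
# Swap the extension with splitext so that .xls inputs also get a .csv path.
# str.replace(".xlsx", ".csv") would leave a .xls path unchanged, and to_csv
# would then overwrite the source workbook.
csv_path = os.path.splitext(excel_path)[0] + ".csv"

# Read with pandas (single-threaded, in-driver); .xlsx needs an engine such as openpyxl
pdf = pd.read_excel(excel_path)

# Convert to CSV in the same Volume / Object Storage path
pdf.to_csv(csv_path, index=False)
print(pdf.head())

# Re-read as Spark for distributed downstream work
df = spark.read.csv(csv_path, header=True, inferSchema=True)
df.show()
```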

Strengths: no cluster JAR install. Tradeoff: driver-side single-threaded read; OOM risk for files >500 MB.
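
The size guidance for Options A, B, and C can be condensed into a small decision helper. This is only a sketch of the thresholds quoted in this document (50 MB for the stdlib parser, 500 MB for the pandas path). The function name `choose_excel_option` and its flags are hypothetical, 1 MB is taken as 1024 * 1024 bytes, and preferring B over C when both apply is an assumption rather than something the plugin prescribes.

```python
MB = 1024 * 1024


def choose_excel_option(size_bytes: int, jar_attached: bool, pandas_available: bool):
    """Hypothetical helper: map file size and cluster setup to Option A, B, or C."""
    if jar_attached:
        # The Spark Excel format is the recommended route for large workbooks.
        return "A"
    if pandas_available and size_bytes <= 500 * MB:
        # Driver-side read; above roughly 500 MB the OOM risk grows.
        return "B"
    if size_bytes <= 50 * MB:
        # Stdlib parser, suited to small-to-medium workbooks.
        return "C"
    # Outside the size guidance; attaching the JAR is the safer route.
    return None
```

In a notebook, `os.path.getsize(excel_path)` could supply `size_bytes` for files staged under /Volumes/....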

Gotchas

The com.crealytics.spark.excel jar version must match the cluster's Spark version. A 3.4 jar on a 3.5 cluster errors out at format registration time.
dataAddress for multi-sheet files: wrap a sheet name that contains spaces in single quotes, as in "'Sheet 2'!A1".
inferSchema=true is slow for big files: pre-declare the schema with .schema(...) for production jobs.
Encoding / merged cells: pandas handles most quirks; the Spark Excel jar can choke on merged-cell headers. If you see misaligned columns, prefer Option B.
Excel files in oci://: Options A and B both work; pass oci://bucket@ns/path/file.xlsx directly, or pre-stage to /Volumes/... for repeated reads.
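
The dataAddress and schema points above can be combined into one read. The following is a minimal sketch, assuming the Option A JAR is attached. The helper names `sheet_address` and `read_excel_sheet` are hypothetical, the column names in the DDL string are placeholders, and sheet names that themselves contain an apostrophe are not handled.

```python
def sheet_address(sheet_name: str, cell: str = "A1") -> str:
    """Build a dataAddress value, quoting the sheet name so that spaces are safe."""
    return f"'{sheet_name}'!{cell}"


def read_excel_sheet(spark, path, sheet_name, schema_ddl):
    """Read one sheet with a pre-declared schema instead of inferSchema."""
    return (spark.read
            .format("com.crealytics.spark.excel")
            .option("header", "true")
            .option("dataAddress", sheet_address(sheet_name))
            .schema(schema_ddl)   # DDL string, e.g. "id INT, name STRING"
            .load(path))
```

A call such as `read_excel_sheet(spark, excel_path, "Sheet 2", "id INT, name STRING")` would then skip schema inference entirely.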

References

Official sample: oracle-samples/oracle-aidp-samples → data-engineering/ingestion/Read_excel_data/read_excel.ipynb
Crealytics Spark Excel: https://github.com/crealytics/spark-excel

The original text and its copyright belong to Anthropic and the respective plugin authors.