Skill · Official · development

💾aidp-azure-adls

Plugin: oracle-ai-data-platform-workbench-spark-connectors

Description

Read and write Azure Data Lake Storage Gen2 (`abfss://`) from an AIDP notebook. Use when the user mentions ADLS, Azure Data Lake, abfss, or wants to ingest from a multi-cloud Azure source. Auth is OAuth client-credentials (Service Principal client_id + secret + tenant).

Use cases

✓ Read and write data in Azure Data Lake Storage Gen2
✓ Run operations that involve ADLS or abfss paths
✓ Ingest data from a multi-cloud Azure source

`aidp-azure-adls` — Azure ADLS Gen2 via OAuth client-credentials

Read or write abfss://<container>@<storage_account>.dfs.core.windows.net/... paths from AIDP Spark using a Service Principal.

When to use

AIDP needs to read data from, or land data in, ADLS Gen2.
The user mentions "ADLS", "abfss", or "Azure Data Lake".

When NOT to use

For OCI Object Storage → aidp-object-storage.
For AWS S3 → aidp-aws-s3.

One-time auth setup

Configure the Spark Hadoop connector with Service-Principal credentials. Do this once per session/job:

import os

storage_account = os.environ["ADLS_STORAGE_ACCOUNT"]   # account name only, no .dfs...
client_id       = os.environ["ADLS_CLIENT_ID"]          # SP application (client) id
client_secret   = os.environ["ADLS_CLIENT_SECRET"]      # SP secret value
tenant          = os.environ["ADLS_TENANT"]             # Azure AD tenant id (GUID)

base = "fs.azure.account"
host = f"{storage_account}.dfs.core.windows.net"

spark.conf.set(f"{base}.auth.type.{host}",                 "OAuth")
spark.conf.set(f"{base}.oauth.provider.type.{host}",       "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"{base}.oauth2.client.id.{host}",          client_id)
spark.conf.set(f"{base}.oauth2.client.secret.{host}",      client_secret)
spark.conf.set(f"{base}.oauth2.client.endpoint.{host}",    f"https://login.microsoftonline.com/{tenant}/oauth2/token")
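
The five settings above can also be collected into a plain dictionary first, which makes them easy to inspect or unit-test before touching the Spark session. The following is a minimal sketch under that assumption; `build_adls_oauth_conf`, `apply_conf`, and `redacted` are hypothetical helper names for illustration and are not part of the plugin. The property keys are the same ones used above.

```python
from typing import Dict


def build_adls_oauth_conf(storage_account: str, client_id: str,
                          client_secret: str, tenant: str) -> Dict[str, str]:
    """Return the per-account ABFS OAuth settings as a plain dict.

    Hypothetical helper for illustration; the keys mirror the
    spark.conf.set calls shown above.
    """
    base = "fs.azure.account"
    host = f"{storage_account}.dfs.core.windows.net"
    return {
        f"{base}.auth.type.{host}": "OAuth",
        f"{base}.oauth.provider.type.{host}":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        f"{base}.oauth2.client.id.{host}": client_id,
        f"{base}.oauth2.client.secret.{host}": client_secret,
        # v1 token endpoint (see the endpoint note under Gotchas)
        f"{base}.oauth2.client.endpoint.{host}":
            f"https://login.microsoftonline.com/{tenant}/oauth2/token",
    }


def apply_conf(spark, conf: Dict[str, str]) -> None:
    """Apply every setting to an existing SparkSession."""
    for key, value in conf.items():
        spark.conf.set(key, value)


def redacted(conf: Dict[str, str]) -> Dict[str, str]:
    """Copy of conf that is safe to print: the client secret is masked."""
    return {k: ("***" if ".client.secret." in k else v)
            for k, v in conf.items()}
```

In a notebook this would be used as `apply_conf(spark, build_adls_oauth_conf(storage_account, client_id, client_secret, tenant))`, with `redacted(...)` for any debugging output so the secret never appears in cell results.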

Read

container = os.environ["ADLS_CONTAINER"]
data_file = os.environ["ADLS_DATA_FILE"]   # e.g. "data/2025/january/orders.csv"

df = (spark.read
      .format("csv")
      .option("header", True)
      .load(f"abfss://{container}@{storage_account}.dfs.core.windows.net/{data_file}"))
df.show()
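
Because the abfss:// URI is assembled from three separate values, a small builder can catch the most common mistake (passing the full hostname as the account name) before Spark reports a less obvious error. This is a sketch only; `abfss_uri` is a hypothetical helper, not something the plugin provides.

```python
def abfss_uri(container: str, storage_account: str, path: str = "") -> str:
    """Build an abfss:// URI from its parts (hypothetical helper)."""
    if "." in storage_account:
        # Storage account names never contain dots, so a dot means the
        # caller passed the hostname instead of the bare account name.
        raise ValueError("storage_account must be the bare account name, "
                         "without .dfs.core.windows.net")
    if not container:
        raise ValueError("container must not be empty")
    host = f"{storage_account}.dfs.core.windows.net"
    # Drop leading slashes so the result never contains a double slash.
    return f"abfss://{container}@{host}/{path.lstrip('/')}"
```

The read call would then become `.load(abfss_uri(container, storage_account, data_file))`.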

Write (e.g. land into a managed Delta table)

(df.write
   .mode("overwrite")
   .format("delta")
   .saveAsTable("default.default.data_from_adls"))
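
The example above lands the data in a managed table. Writing back to an abfss:// path uses the same DataFrameWriter, with `save` in place of `saveAsTable`. The sketch below assumes a hypothetical helper `write_to_adls` and Parquet as the default output format (the source does not prescribe one). It needs the same OAuth configuration as the read, and a role that allows writes.

```python
def write_to_adls(df, target_uri: str, fmt: str = "parquet",
                  mode: str = "overwrite") -> None:
    """Write a Spark DataFrame to an abfss:// location (hypothetical helper)."""
    if not target_uri.startswith("abfss://"):
        # Refuse abfs:// (no TLS) and any other scheme.
        raise ValueError(f"expected an abfss:// URI, got: {target_uri}")
    df.write.mode(mode).format(fmt).save(target_uri)
```

For example, `write_to_adls(df, f"abfss://{container}@{storage_account}.dfs.core.windows.net/out/orders")` would overwrite the `out/orders` directory with Parquet files.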

Gotchas

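The Service Principal must have a data-plane RBAC role on the storage account. Assign Storage Blob Data Contributor (or Storage Blob Data Reader for read-only access) on the container or the account; the generic Reader role does not grant access to blob data.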
Hierarchical Namespace must be enabled on the storage account for abfss:// to work (ADLS Gen2 = HNS-on storage account).
Keep secrets in environment variables and never hard-code them in notebooks. Load them from a gitignored .env file, or from OCI Vault via oracle_ai_data_platform_connectors.auth.secrets.get(name).
Endpoint URL — login.microsoftonline.com/<tenant>/oauth2/token is the v1 endpoint and is what the ClientCredsTokenProvider expects. Don't use the v2 endpoint here.
abfss:// not abfs:// — always use the TLS variant.
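
Some of these gotchas can be checked locally before any Spark call is made. The sketch below assumes a hypothetical `preflight` helper that inspects only the environment variables used in this skill. It cannot verify RBAC assignments or the Hierarchical Namespace setting, which live on the Azure side.

```python
import uuid
from typing import List, Mapping

REQUIRED_VARS = ("ADLS_STORAGE_ACCOUNT", "ADLS_CLIENT_ID",
                 "ADLS_CLIENT_SECRET", "ADLS_TENANT")


def preflight(env: Mapping[str, str]) -> List[str]:
    """Return human-readable problems; an empty list means all checks passed."""
    problems = [f"missing environment variable: {name}"
                for name in REQUIRED_VARS if not env.get(name)]

    # The account variable should hold the bare name, not the hostname.
    account = env.get("ADLS_STORAGE_ACCOUNT", "")
    if "." in account:
        problems.append("ADLS_STORAGE_ACCOUNT should be the account name only")

    # The tenant is expected to be a GUID.
    tenant = env.get("ADLS_TENANT", "")
    if tenant:
        try:
            uuid.UUID(tenant)
        except ValueError:
            problems.append("ADLS_TENANT is not a GUID")
    return problems
```

Calling `preflight(os.environ)` at the top of a notebook and stopping when the returned list is non-empty gives a clearer failure than an authentication error raised later by the connector.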

References

Official sample: oracle-samples/oracle-aidp-samples → data-engineering/ingestion/Ingest_from_Multi_Cloud.ipynb
Hadoop Azure docs: https://hadoop.apache.org/docs/stable/hadoop-azure/abfs.html

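The original text and its copyright belong to Anthropic and the respective plugin authors. The Japanese translation was generated automatically with the Claude API.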