
📄databricks-unstructured-pdf-generation

Plugin: databricks

Description

Build RAG / unstructured-document evaluation datasets and demo documents (e.g. for Knowledge Assistant) on Databricks: generate synthetic PDFs locally, upload them to Unity Catalog volumes, and pair each document with test questions for retrieval evaluation.

Use cases

✓ Prepare an evaluation dataset for a RAG pipeline
✓ Create sample documents for demos
✓ Build a test question set for validating retrieval accuracy

Unstructured-Document Generation for Demos and Eval Datasets on Databricks

Workflow for producing synthetic PDF documents + paired test questions as a Unity Catalog-resident dataset for Demos and RAG / unstructured-document retrieval evaluation on Databricks. The PDF-generation step uses standard local HTML → PDF tooling; the Databricks-specific value is the workflow shape — UC volume layout, paired question files, and integration with downstream Databricks retrieval / ai_extract / ai_parse_document evaluation.

Workflow

1. Write HTML files to ./raw_data/html/ (write multiple files in parallel for speed), domain-shaped to match the documents your retrieval pipeline will see in production.
2. Convert HTML → PDF using <SKILL_ROOT>/scripts/pdf_generator.py (parallel conversion, wraps plutoprint).
3. Upload PDFs to a Unity Catalog volume via databricks fs cp, using the same volume shape your production pipeline will read from.
4. Generate doc_questions.json pairing each document with retrieval-eval questions; this becomes the gold dataset for mlflow.genai.evaluate() or comparable retrieval-quality scorers.

If you only need ad-hoc PDFs (no Databricks workflow), any HTML → PDF tool (weasyprint, wkhtmltopdf, playwright pdf, plutoprint) works directly — this skill exists for the synthetic-dataset-on-UC end-to-end shape, not as a general PDF generator.

Path convention: <SKILL_ROOT> below = the directory containing this SKILL.md. Resolve to the absolute install path (e.g. ~/.claude/skills/databricks-unstructured-pdf-generation). ./raw_data/... paths are relative to your own project cwd.
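The path convention can be sketched in Python as follows. This is only an illustration: it assumes the skill is installed at the example location given above, and the constant names (`SKILL_ROOT`, `PDF_GENERATOR`, `HTML_DIR`, `PDF_DIR`) are hypothetical.

```python
from pathlib import Path

# Assumption: the skill lives at the example install path from the text above.
SKILL_ROOT = Path("~/.claude/skills/databricks-unstructured-pdf-generation").expanduser()

# Absolute path to the converter script referenced throughout this document.
PDF_GENERATOR = SKILL_ROOT / "scripts" / "pdf_generator.py"

# ./raw_data/... paths stay relative to the project working directory.
HTML_DIR = Path("raw_data") / "html"
PDF_DIR = Path("raw_data") / "pdf"
```

Keeping the skill path absolute and the data paths relative means the same commands work from any project directory.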

Dependencies

uv pip install plutoprint

Step 1: Write HTML Files

mkdir -p ./raw_data/html

Write HTML documents to ./raw_data/html/filename.html. Use subdirectories to organize (structure is preserved).
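One way to write several HTML files in parallel while preserving subdirectories is sketched below. The helper name `write_html_files` and the worker count are assumptions made for illustration; the skill itself does not prescribe how the HTML is produced.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def write_html_files(docs: dict[str, str], html_dir: Path) -> list[Path]:
    """Write each {relative_path: html_text} entry under html_dir, in parallel."""

    def write_one(item: tuple[str, str]) -> Path:
        rel_path, html_text = item
        target = html_dir / rel_path
        # Create subdirectories such as quarterly/ or legal/ on demand.
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(html_text, encoding="utf-8")
        return target

    # Threads are sufficient here because the work is I/O-bound.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(write_one, docs.items()))
```

A call such as `write_html_files({"report.html": "...", "quarterly/q1.html": "..."}, Path("./raw_data/html"))` would produce the layout that Step 2 expects.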

Step 2: Convert to PDF

# Convert entire folder (parallel, 4 workers)
python <SKILL_ROOT>/scripts/pdf_generator.py convert --input ./raw_data/html --output ./raw_data/pdf

Files whose PDF already exists and is newer than the HTML source are skipped. Use --force to reconvert everything.
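The skip rule can be illustrated with a timestamp comparison. This is a sketch of the described behavior, not the actual code in pdf_generator.py; the function name `needs_conversion` is hypothetical, and treating equal timestamps as stale is an assumption.

```python
from pathlib import Path


def needs_conversion(html_path: Path, pdf_path: Path, force: bool = False) -> bool:
    """Return True when the PDF should be (re)generated from the HTML file."""
    if force or not pdf_path.exists():
        return True
    # Skip only when the existing PDF is strictly newer than its HTML source.
    return pdf_path.stat().st_mtime <= html_path.stat().st_mtime
```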

Step 3: Upload to Volume

databricks fs requires the dbfs: scheme prefix even for UC volume paths. The -r flag copies the contents of the source directory into the target (the source directory name is not preserved), so the PDFs land directly under the raw_data volume rather than under raw_data/pdf/.

databricks fs cp -r --overwrite ./raw_data/pdf dbfs:/Volumes/my_catalog/my_schema/raw_data
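If the upload is driven from Python, the command can be assembled as in the sketch below. The function names are hypothetical; the CLI arguments are the ones shown in this document (the `volumes create` form comes from the Troubleshooting section). Running `upload_pdfs` assumes an installed and authenticated Databricks CLI.

```python
import subprocess
from pathlib import Path


def build_volume_create_command(catalog: str, schema: str, volume: str) -> list[str]:
    """Four separate positional args, not a dotted catalog.schema.volume name."""
    return ["databricks", "volumes", "create", catalog, schema, volume, "MANAGED"]


def build_upload_command(local_dir: Path, catalog: str, schema: str, volume: str) -> list[str]:
    """Build the `databricks fs cp` argument list for a UC volume target."""
    # The dbfs: scheme prefix is required even though this is a UC volume path.
    target = f"dbfs:/Volumes/{catalog}/{schema}/{volume}"
    return ["databricks", "fs", "cp", "-r", "--overwrite", str(local_dir), target]


def upload_pdfs(local_dir: Path, catalog: str, schema: str, volume: str) -> None:
    """Run the upload and raise if the CLI exits with a non-zero status."""
    subprocess.run(build_upload_command(local_dir, catalog, schema, volume), check=True)
```

Building the argument list separately from running it keeps the command easy to log or inspect before anything is uploaded.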

Step 4: Generate Test Questions

Create ./raw_data/pdf/pdf_eval_questions.json with questions for Knowledge Assistant (KA) or MAS evaluation:

{
  "api_errors_guide.pdf": {
    "question": "What is the solution for error ERR-4521?",
    "expected_fact": "Call /api/v2/auth/refresh with refresh_token before the 3600s TTL expires"
  },
  "installation_manual.pdf": {
    "question": "What port does the service use by default?",
    "expected_fact": "Port 8443 for HTTPS, configurable via CONFIG_PORT environment variable"
  }
}

This JSON can be used to build KA test cases and validate retrieval accuracy.
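Before using the file downstream, it may help to check that every entry has both fields and points at a PDF that actually exists. The sketch below assumes the JSON shape shown above; the function name `load_eval_questions` and the flat record layout are illustrative choices, not part of the skill.

```python
import json
from pathlib import Path

REQUIRED_KEYS = {"question", "expected_fact"}


def load_eval_questions(questions_path: Path, pdf_dir: Path) -> list[dict[str, str]]:
    """Load the question file and return one flat record per document."""
    data = json.loads(questions_path.read_text(encoding="utf-8"))
    records = []
    for pdf_name, entry in data.items():
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            raise ValueError(f"{pdf_name}: missing keys {sorted(missing)}")
        # Every question must be answerable from a PDF that was generated.
        if not (pdf_dir / pdf_name).is_file():
            raise FileNotFoundError(f"{pdf_name}: no matching PDF in {pdf_dir}")
        records.append(
            {
                "document": pdf_name,
                "question": entry["question"],
                "expected_fact": entry["expected_fact"],
            }
        )
    return records
```

The resulting list of records can then be reshaped into whatever input format the chosen evaluation harness expects.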

Document Content Guidelines

When generating documents for Knowledge Assistant testing or demos:

Multi-page documents: Each PDF should be several pages with substantial content
Specific error codes and solutions: Include product-specific error codes, causes, and resolution steps
Technical details: API endpoints, configuration parameters, version numbers, specific commands
Simple CSS: Keep styling minimal for fast HTML creation and reliable PDF conversion
Queryable facts: Include details a KA must read the document to answer (not general knowledge)
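As an illustration of these guidelines, the sketch below renders product-specific error entries into an HTML page with deliberately minimal CSS. The template, the function names, and the field layout are all assumptions made for this example; any HTML that plutoprint can render would do.

```python
from html import escape

# Minimal styling only: simple CSS keeps PDF conversion predictable.
PAGE_TEMPLATE = """<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>{title}</title>
<style>body {{ font-family: sans-serif; margin: 2cm; }}</style>
</head>
<body>
<h1>{title}</h1>
{sections}
</body>
</html>
"""


def render_error_entry(code: str, summary: str, cause: str, solution: str) -> str:
    """Render one error code with its cause and resolution as an HTML fragment."""
    return (
        f"<h2>Error {escape(code)}: {escape(summary)}</h2>\n"
        f"<p><strong>Cause:</strong> {escape(cause)}</p>\n"
        f"<p><strong>Solution:</strong> {escape(solution)}</p>"
    )


def render_page(title: str, entries: list[tuple[str, str, str, str]]) -> str:
    """Assemble a full HTML document from (code, summary, cause, solution) tuples."""
    sections = "\n".join(render_error_entry(*entry) for entry in entries)
    return PAGE_TEMPLATE.format(title=escape(title), sections=sections)
```

Each rendered entry carries a queryable fact (a code, a cause, a concrete fix) that a question in the eval file can target.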

Good document types:

Product user manuals with troubleshooting sections
API error reference guides (error codes, causes, solutions)
Installation/configuration guides with specific steps
Technical specifications with version-specific details

Example content: Instead of generic "Connection failed" errors, write:

"Error ERR-4521: OAuth token expired. Cause: Token TTL exceeded 3600s default. Solution: Call /api/v2/auth/refresh with your refresh_token before expiration. See Section 4.2 for token lifecycle management."

CLI Reference

python <SKILL_ROOT>/scripts/pdf_generator.py convert [OPTIONS]

  --input, -i     Input HTML file or folder (required)
  --output, -o    Output folder for PDFs (required)
  --force, -f     Force reconvert (ignore timestamps)
  --workers, -w   Parallel workers (default: 4)
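For readers who want to see how an interface like this might be declared, here is an argparse sketch that mirrors the options listed above. It is a reconstruction from this reference only; the real pdf_generator.py may define its CLI differently, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare a `convert` subcommand with the four documented options."""
    parser = argparse.ArgumentParser(prog="pdf_generator.py")
    subparsers = parser.add_subparsers(dest="command", required=True)

    convert = subparsers.add_parser("convert", help="Convert HTML files to PDF")
    convert.add_argument("--input", "-i", required=True, help="Input HTML file or folder")
    convert.add_argument("--output", "-o", required=True, help="Output folder for PDFs")
    convert.add_argument("--force", "-f", action="store_true", help="Ignore timestamps")
    convert.add_argument("--workers", "-w", type=int, default=4, help="Parallel workers")
    return parser
```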

Folder Structure

Subfolder structure is preserved:

./raw_data/html/                    ./raw_data/pdf/
├── report.html             →       ├── report.pdf
├── quarterly/                      ├── quarterly/
│   └── q1.html             →       │   └── q1.pdf
└── legal/                          └── legal/
    └── terms.html          →           └── terms.pdf
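The mapping in the tree above amounts to keeping each file's path relative to the input root and swapping the extension. A minimal sketch, with hypothetical helper names and no claim about how pdf_generator.py implements it:

```python
from pathlib import Path


def pdf_path_for(html_path: Path, html_root: Path, pdf_root: Path) -> Path:
    """Mirror an HTML file's relative location under pdf_root with a .pdf suffix."""
    relative = html_path.relative_to(html_root)
    return pdf_root / relative.with_suffix(".pdf")


def plan_conversions(html_root: Path, pdf_root: Path) -> list[tuple[Path, Path]]:
    """Pair every HTML file under html_root (recursively) with its target PDF path."""
    return [
        (html_path, pdf_path_for(html_path, html_root, pdf_root))
        for html_path in sorted(html_root.rglob("*.html"))
    ]
```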

Troubleshooting

Issue	Solution
"plutoprint not installed"	`uv pip install plutoprint`
PDF looks wrong	Check HTML/CSS syntax
"Volume does not exist"	`databricks volumes create CATALOG SCHEMA VOLUME_NAME MANAGED` (four separate positional args, not `catalog.schema.volume`)

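The original text and its copyright belong to Anthropic and the respective plugin authors. The Japanese translation on the source page was machine-translated with the Claude API.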