スキルOfficialdatabase

🗄️databricks-dbsql

プラグイン: databricks
ソース: GitHub で見る ↗

説明

Databricks SQL（DBSQL）の高度な機能およびSQLウェアハウスの機能に関するSkill。ユーザーが以下のいずれかに言及した場合、このSkillを**必ず**呼び出すこと: 「DBSQL」「Databricks SQL」「SQL warehouse」「SQLスクリプティング」「ストアドプロシージャ」「CALL procedure」「マテリアライズドビュー」「CREATE MATERIALIZED VIEW」「パイプ構文」「`|>`」「ジオスペーシャル」「H3」「`ST_`」「spatial SQL」「コレーション」「COLLATE」「`ai_query`」「`ai_classify`」「`ai_extract`」「`ai_gen`」「AI関数」「`http_request`」「`remote_query`」「`read_files`」「Lakehouse Federation」「再帰CTE」「WITH RECURSIVE」「マルチステートメントトランザクション」「一時テーブル」「テンポラリービュー」「パイプ演算子」次のような場合にも使用: ユーザーがDatabricks上でのSQLのベストプラクティス、データモデリングパターン、または高度なSQL機能について質問している場合。

原文を表示

Databricks SQL (DBSQL) advanced features and SQL warehouse capabilities. This skill MUST be invoked when the user mentions: "DBSQL", "Databricks SQL", "SQL warehouse", "SQL scripting", "stored procedure", "CALL procedure", "materialized view", "CREATE MATERIALIZED VIEW", "pipe syntax", "|>", "geospatial", "H3", "ST_", "spatial SQL", "collation", "COLLATE", "ai_query", "ai_classify", "ai_extract", "ai_gen", "AI function", "http_request", "remote_query", "read_files", "Lakehouse Federation", "recursive CTE", "WITH RECURSIVE", "multi-statement transaction", "temp table", "temporary view", "pipe operator". SHOULD also invoke when the user asks about SQL best practices, data modeling patterns, or advanced SQL features on Databricks.

ユースケース

✓Databricks SQLの高度な機能を使用するとき
✓SQLウェアハウスを構築・最適化するとき
✓SQLのベストプラクティスを確認するとき
✓データモデリングパターンを検討するとき

本文（日本語訳）

Databricks SQL (DBSQL) - 高度な機能

クイックリファレンス

機能	主な構文	対応バージョン	参考資料
SQL スクリプティング	`BEGIN...END`, `DECLARE`, `IF/WHILE/FOR`	DBR 16.3+	references/sql-scripting.md
ストアドプロシージャ	`CREATE PROCEDURE`, `CALL`	DBR 17.0+	references/sql-scripting.md
再帰 CTE	`WITH RECURSIVE`	DBR 17.0+	references/sql-scripting.md
トランザクション	`BEGIN ATOMIC...END`	プレビュー	references/sql-scripting.md
マテリアライズドビュー	`CREATE MATERIALIZED VIEW`	Pro/Serverless	references/materialized-views-pipes.md
一時テーブル	`CREATE TEMPORARY TABLE`	全バージョン	references/materialized-views-pipes.md
パイプ構文	`\|>` 演算子	DBR 16.1+	references/materialized-views-pipes.md
地理空間 (H3)	`h3_longlatash3()`, `h3_polyfillash3()`	DBR 11.2+	references/geospatial-collations.md
地理空間 (ST)	`ST_Point()`, `ST_Contains()` ほか 80+ 関数	DBR 16.0+	references/geospatial-collations.md
コレーション	`COLLATE`, `UTF8_LCASE`, ロケール対応	DBR 16.1+	references/geospatial-collations.md
AI 関数	`ai_query()`, `ai_classify()` ほか 11+ 関数	DBR 15.1+	references/ai-functions.md
http_request	`http_request(conn, ...)`	Pro/Serverless	references/ai-functions.md
remote_query	`SELECT * FROM remote_query(...)`	Pro/Serverless	references/ai-functions.md
read_files	`SELECT * FROM read_files(...)`	全バージョン	references/ai-functions.md
データモデリング	スタースキーマ、Liquid Clustering	全バージョン	references/best-practices.md

よく使うパターン

SQL スクリプティング - 手続き型 ETL

BEGIN
  DECLARE v_count INT;
  DECLARE v_status STRING DEFAULT 'pending';

  SET v_count = (SELECT COUNT(*) FROM catalog.schema.raw_orders WHERE status = 'new');

  IF v_count > 0 THEN
    INSERT INTO catalog.schema.processed_orders
    SELECT *, current_timestamp() AS processed_at
    FROM catalog.schema.raw_orders
    WHERE status = 'new';

    SET v_status = 'completed';
  ELSE
    SET v_status = 'skipped';
  END IF;

  SELECT v_status AS result, v_count AS rows_processed;
END

エラーハンドリング付きストアドプロシージャ

CREATE OR REPLACE PROCEDURE catalog.schema.upsert_customers(
  IN p_source STRING,
  OUT p_rows_affected INT
)
LANGUAGE SQL
SQL SECURITY INVOKER
BEGIN
  DECLARE EXIT HANDLER FOR SQLEXCEPTION
  BEGIN
    SET p_rows_affected = -1;
    SIGNAL SQLSTATE '45000'
      SET MESSAGE_TEXT = concat('Upsert failed for source: ', p_source);
  END;

  MERGE INTO catalog.schema.dim_customer AS t
  USING (SELECT * FROM identifier(p_source)) AS s
  ON t.customer_id = s.customer_id
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *;

  SET p_rows_affected = (SELECT COUNT(*) FROM identifier(p_source));
END;

-- 呼び出し方:
CALL catalog.schema.upsert_customers('catalog.schema.staging_customers', ?);

スケジュール更新付きマテリアライズドビュー

CREATE OR REPLACE MATERIALIZED VIEW catalog.schema.daily_revenue
  CLUSTER BY (order_date)
  SCHEDULE EVERY 1 HOUR
  COMMENT 'Hourly-refreshed daily revenue by region'
AS SELECT
    order_date,
    region,
    SUM(amount) AS total_revenue,
    COUNT(DISTINCT customer_id) AS unique_customers
FROM catalog.schema.fact_orders
JOIN catalog.schema.dim_store USING (store_id)
GROUP BY order_date, region;

パイプ構文 - 可読性の高い変換処理

-- 従来の SQL をパイプ構文で書き直した例
FROM catalog.schema.fact_orders
  |> WHERE order_date >= current_date() - INTERVAL 30 DAYS
  |> AGGREGATE SUM(amount) AS total, COUNT(*) AS cnt GROUP BY region, product_category
  |> WHERE total > 10000
  |> ORDER BY total DESC
  |> LIMIT 20;

AI 関数 - LLM によるデータエンリッチメント

-- サポートチケットの分類
SELECT
  ticket_id,
  description,
  ai_classify(description, ARRAY('billing', 'technical', 'account', 'feature_request')) AS category,
  ai_analyze_sentiment(description) AS sentiment
FROM catalog.schema.support_tickets
LIMIT 100;

-- テキストからエンティティを抽出
SELECT
  doc_id,
  ai_extract(content, ARRAY('person_name', 'company', 'dollar_amount')) AS entities
FROM catalog.schema.contracts;

-- 構造化出力を伴う汎用 AI クエリ
SELECT ai_query(
  'databricks-meta-llama-3-3-70b-instruct',
  concat('Summarize this customer feedback in JSON with keys: topic, sentiment, action_items. Feedback: ', feedback),
  returnType => 'STRUCT<topic STRING, sentiment STRING, action_items ARRAY<STRING>>'
) AS analysis
FROM catalog.schema.customer_feedback
LIMIT 50;

地理空間 - H3 を使った近接検索

-- H3 インデックスを用いて各顧客から 5km 以内の店舗を検索
WITH customer_h3 AS (
  SELECT *, h3_longlatash3(longitude, latitude, 7) AS h3_cell
  FROM catalog.schema.customers
),
store_h3 AS (
  SELECT *, h3_longlatash3(longitude, latitude, 7) AS h3_cell
  FROM catalog.schema.stores
)
SELECT
  c.customer_id,
  s.store_id,
  ST_Distance(
    ST_Point(c.longitude, c.latitude),
    ST_Point(s.longitude, s.latitude)
  ) AS distance_m
FROM customer_h3 c
JOIN store_h3 s ON h3_ischildof(c.h3_cell, h3_toparent(s.h3_cell, 5))
WHERE ST_Distance(
  ST_Point(c.longitude, c.latitude),
  ST_Point(s.longitude, s.latitude)
) < 5000;

コレーション - 大文字・小文字を区別しない検索

-- 大文字・小文字を区別しないコレーションでテーブルを作成
CREATE TABLE catalog.schema.products (
  product_id BIGINT GENERATED ALWAYS AS IDENTITY,
  name STRING COLLATE UTF8_LCASE,
  category STRING COLLATE UTF8_LCASE,
  price DECIMAL(10, 2)
);

-- クエリは自動的に大文字・小文字を区別しない（LOWER() 不要）
SELECT * FROM catalog.schema.products
WHERE name = 'MacBook Pro';  -- 'macbook pro', 'MACBOOK PRO' なども一致

http_request - 外部 API の呼び出し

-- 最初にコネクションを作成（初回のみ）
CREATE CONNECTION my_api_conn
  TYPE HTTP
  OPTIONS (host 'https://api.example.com', bearer_token secret('scope', 'token'));

-- SQL から API を呼び出す
SELECT
  order_id,
  http_request(
    conn => 'my_api_conn',
    method => 'POST',
    path => '/v1/validate',
    json => to_json(named_struct('order_id', order_id, 'amount', amount))
  ).text AS api_response
FROM catalog.schema.orders
WHERE needs_validation = true;

read_files - 生ファイルの取り込み

-- Volume 内の JSON ファイルをスキーマヒント付きで読み込む
SELECT *
FROM read_files(
  '/Volumes/catalog/schema/raw/events/',
  format => 'json',
  schemaHints => 'event_id STRING, timestamp TIMESTAMP, payload MAP<STRING, STRING>',
  pathGlobFilter => '*.json',
  recursiveFileLookup => true
);

-- オプション付きで CSV を読み込む
SELECT *
FROM read_files(
  '/Volumes/catalog/schema/raw/sales/',
  format => 'csv',
  header => true,
  delimiter => '|',
  dateFormat => 'yyyy-MM-dd',
  schema => 'sale_id INT, sale_date DATE, amount DECIMAL(10,2), store STRING'
);

再帰 CTE - 階層構造のトラバーサル

WITH RECURSIVE org_chart AS (
  -- アンカー: 最上位マネージャー
  SELECT employee_id, name, manager_id, 0 AS depth, ARRAY(name) AS path
  FROM catalog.schema.employees
  WHERE manager_id IS NULL

  UNION ALL

  -- 再帰: 直属の部下
  SELECT e.employee_id, e.name, e.manager_id, o.depth + 1, array_append(o.path, e.name)
  FROM catalog.schema.employees e
  JOIN org_chart o ON e.manager_id = o.employee_id
  WHERE o.depth < 10  -- 安全上限
)
SELECT * FROM org_chart ORDER BY depth, name;

remote_query - フェデレーテッドクエリ

-- Lakehouse Federation 経由で PostgreSQL をクエリ
SELECT *
FROM remote_query(
  'my_postgres_connection',
  database => 'my_database',
  query    => 'SELECT customer_id, email, created_at FROM customers WHERE active = true'
);

リファレンスファイル

詳細な構文・完全なパラメータ一覧・応用パターンについては、以下のファイルを参照してください。

ファイル	内容	次のような場合に使用
references/sql-scripting.md	SQL スクリプティング、ストアドプロシージャ、再帰 CTE、トランザクション	手続き型 SQL、エラーハンドリング、ループ、動的 SQL が必要な場合
references/materialized-views-pipes.md	マテリアライズドビュー、一時テーブル/ビュー、パイプ構文	MV、更新スケジュール、一時オブジェクト、パイプ演算子が必要な場合
references/geospatial-collations.md	H3 関数 39 個、ST 関数 80+ 個、コレーションの種類と階層	空間分析、H3 インデックス、大文字/アクセント処理が必要な場合
references/ai-functions.md	AI 関数 13 個、http_request、remote_query、read_files（全オプション）	AI エンリッチメント、API 呼び出し、フェデレーション、ファイル取り込みが必要な場合
references/best-practices.md	データモデリング、パフォーマンス、Liquid Clustering、アンチパターン	アーキテクチャ指針、最適化、またはモデリングのアドバイスが必要な場合

主なガイドライン

AI 関数・MV・http_request には Serverless SQL ウェアハウスを使用すること
AI 関数の開発中はコスト管理のために LIMIT を付けること
新規テーブルではパーティショニングより Liquid Clustering を優先すること（クラスタリングキーは最大 1〜4 個）
クラスタリングキーが不明な場合は CLUSTER BY AUTO を使用すること
BI 向け Gold レイヤーではスタースキーマを採用すること（Silver レイヤーでは OBT も許容）
ディメンショナルモデルにはクエリ最適化のために PK/FK 制約を定義すること
大文字・小文字を区別しない検索が必要なユーザー向け文字列カラムには COLLATE UTF8_LCASE を使用すること
デプロイ前に CLI（databricks experimental aitools tools query）またはノートブックで SQL をテストすること。 CLI バージョンによって --warehouse オプションが拒否される場合は、代わりに環境変数 DATABRICKS_WAREHOUSE_ID を設定すること。

原文（English）を表示

Databricks SQL (DBSQL) - Advanced Features

Quick Reference

Feature	Key Syntax	Since	Reference
SQL Scripting	`BEGIN...END`, `DECLARE`, `IF/WHILE/FOR`	DBR 16.3+	references/sql-scripting.md
Stored Procedures	`CREATE PROCEDURE`, `CALL`	DBR 17.0+	references/sql-scripting.md
Recursive CTEs	`WITH RECURSIVE`	DBR 17.0+	references/sql-scripting.md
Transactions	`BEGIN ATOMIC...END`	Preview	references/sql-scripting.md
Materialized Views	`CREATE MATERIALIZED VIEW`	Pro/Serverless	references/materialized-views-pipes.md
Temp Tables	`CREATE TEMPORARY TABLE`	All	references/materialized-views-pipes.md
Pipe Syntax	`\|>` operator	DBR 16.1+	references/materialized-views-pipes.md
Geospatial (H3)	`h3_longlatash3()`, `h3_polyfillash3()`	DBR 11.2+	references/geospatial-collations.md
Geospatial (ST)	`ST_Point()`, `ST_Contains()`, 80+ funcs	DBR 16.0+	references/geospatial-collations.md
Collations	`COLLATE`, `UTF8_LCASE`, locale-aware	DBR 16.1+	references/geospatial-collations.md
AI Functions	`ai_query()`, `ai_classify()`, 11+ funcs	DBR 15.1+	references/ai-functions.md
http_request	`http_request(conn, ...)`	Pro/Serverless	references/ai-functions.md
remote_query	`SELECT * FROM remote_query(...)`	Pro/Serverless	references/ai-functions.md
read_files	`SELECT * FROM read_files(...)`	All	references/ai-functions.md
Data Modeling	Star schema, Liquid Clustering	All	references/best-practices.md

Common Patterns

SQL Scripting - Procedural ETL

BEGIN
  DECLARE v_count INT;
  DECLARE v_status STRING DEFAULT 'pending';

  SET v_count = (SELECT COUNT(*) FROM catalog.schema.raw_orders WHERE status = 'new');

  IF v_count > 0 THEN
    INSERT INTO catalog.schema.processed_orders
    SELECT *, current_timestamp() AS processed_at
    FROM catalog.schema.raw_orders
    WHERE status = 'new';

    SET v_status = 'completed';
  ELSE
    SET v_status = 'skipped';
  END IF;

  SELECT v_status AS result, v_count AS rows_processed;
END

Stored Procedure with Error Handling

CREATE OR REPLACE PROCEDURE catalog.schema.upsert_customers(
  IN p_source STRING,
  OUT p_rows_affected INT
)
LANGUAGE SQL
SQL SECURITY INVOKER
BEGIN
  DECLARE EXIT HANDLER FOR SQLEXCEPTION
  BEGIN
    SET p_rows_affected = -1;
    SIGNAL SQLSTATE '45000'
      SET MESSAGE_TEXT = concat('Upsert failed for source: ', p_source);
  END;

  MERGE INTO catalog.schema.dim_customer AS t
  USING (SELECT * FROM identifier(p_source)) AS s
  ON t.customer_id = s.customer_id
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *;

  SET p_rows_affected = (SELECT COUNT(*) FROM identifier(p_source));
END;

-- Invoke:
CALL catalog.schema.upsert_customers('catalog.schema.staging_customers', ?);

Materialized View with Scheduled Refresh

CREATE OR REPLACE MATERIALIZED VIEW catalog.schema.daily_revenue
  CLUSTER BY (order_date)
  SCHEDULE EVERY 1 HOUR
  COMMENT 'Hourly-refreshed daily revenue by region'
AS SELECT
    order_date,
    region,
    SUM(amount) AS total_revenue,
    COUNT(DISTINCT customer_id) AS unique_customers
FROM catalog.schema.fact_orders
JOIN catalog.schema.dim_store USING (store_id)
GROUP BY order_date, region;

Pipe Syntax - Readable Transformations

-- Traditional SQL rewritten with pipe syntax
FROM catalog.schema.fact_orders
  |> WHERE order_date >= current_date() - INTERVAL 30 DAYS
  |> AGGREGATE SUM(amount) AS total, COUNT(*) AS cnt GROUP BY region, product_category
  |> WHERE total > 10000
  |> ORDER BY total DESC
  |> LIMIT 20;

AI Functions - Enrich Data with LLMs

-- Classify support tickets
SELECT
  ticket_id,
  description,
  ai_classify(description, ARRAY('billing', 'technical', 'account', 'feature_request')) AS category,
  ai_analyze_sentiment(description) AS sentiment
FROM catalog.schema.support_tickets
LIMIT 100;

-- Extract entities from text
SELECT
  doc_id,
  ai_extract(content, ARRAY('person_name', 'company', 'dollar_amount')) AS entities
FROM catalog.schema.contracts;

-- General-purpose AI query with structured output
SELECT ai_query(
  'databricks-meta-llama-3-3-70b-instruct',
  concat('Summarize this customer feedback in JSON with keys: topic, sentiment, action_items. Feedback: ', feedback),
  returnType => 'STRUCT<topic STRING, sentiment STRING, action_items ARRAY<STRING>>'
) AS analysis
FROM catalog.schema.customer_feedback
LIMIT 50;

Geospatial - Proximity Search with H3

-- Find stores within 5km of each customer using H3 indexing
WITH customer_h3 AS (
  SELECT *, h3_longlatash3(longitude, latitude, 7) AS h3_cell
  FROM catalog.schema.customers
),
store_h3 AS (
  SELECT *, h3_longlatash3(longitude, latitude, 7) AS h3_cell
  FROM catalog.schema.stores
)
SELECT
  c.customer_id,
  s.store_id,
  ST_Distance(
    ST_Point(c.longitude, c.latitude),
    ST_Point(s.longitude, s.latitude)
  ) AS distance_m
FROM customer_h3 c
JOIN store_h3 s ON h3_ischildof(c.h3_cell, h3_toparent(s.h3_cell, 5))
WHERE ST_Distance(
  ST_Point(c.longitude, c.latitude),
  ST_Point(s.longitude, s.latitude)
) < 5000;

Collation - Case-Insensitive Search

-- Create table with case-insensitive collation
CREATE TABLE catalog.schema.products (
  product_id BIGINT GENERATED ALWAYS AS IDENTITY,
  name STRING COLLATE UTF8_LCASE,
  category STRING COLLATE UTF8_LCASE,
  price DECIMAL(10, 2)
);

-- Queries automatically case-insensitive (no LOWER() needed)
SELECT * FROM catalog.schema.products
WHERE name = 'MacBook Pro';  -- matches 'macbook pro', 'MACBOOK PRO', etc.

http_request - Call External APIs

-- Set up connection first (one-time)
CREATE CONNECTION my_api_conn
  TYPE HTTP
  OPTIONS (host 'https://api.example.com', bearer_token secret('scope', 'token'));

-- Call API from SQL
SELECT
  order_id,
  http_request(
    conn => 'my_api_conn',
    method => 'POST',
    path => '/v1/validate',
    json => to_json(named_struct('order_id', order_id, 'amount', amount))
  ).text AS api_response
FROM catalog.schema.orders
WHERE needs_validation = true;

read_files - Ingest Raw Files

-- Read JSON files from a Volume with schema hints
SELECT *
FROM read_files(
  '/Volumes/catalog/schema/raw/events/',
  format => 'json',
  schemaHints => 'event_id STRING, timestamp TIMESTAMP, payload MAP<STRING, STRING>',
  pathGlobFilter => '*.json',
  recursiveFileLookup => true
);

-- Read CSV with options
SELECT *
FROM read_files(
  '/Volumes/catalog/schema/raw/sales/',
  format => 'csv',
  header => true,
  delimiter => '|',
  dateFormat => 'yyyy-MM-dd',
  schema => 'sale_id INT, sale_date DATE, amount DECIMAL(10,2), store STRING'
);

Recursive CTE - Hierarchy Traversal

WITH RECURSIVE org_chart AS (
  -- Anchor: top-level managers
  SELECT employee_id, name, manager_id, 0 AS depth, ARRAY(name) AS path
  FROM catalog.schema.employees
  WHERE manager_id IS NULL

  UNION ALL

  -- Recursive: direct reports
  SELECT e.employee_id, e.name, e.manager_id, o.depth + 1, array_append(o.path, e.name)
  FROM catalog.schema.employees e
  JOIN org_chart o ON e.manager_id = o.employee_id
  WHERE o.depth < 10  -- safety limit
)
SELECT * FROM org_chart ORDER BY depth, name;

remote_query - Federated Queries

-- Query PostgreSQL via Lakehouse Federation
SELECT *
FROM remote_query(
  'my_postgres_connection',
  database => 'my_database',
  query    => 'SELECT customer_id, email, created_at FROM customers WHERE active = true'
);

Reference Files

Load these for detailed syntax, full parameter lists, and advanced patterns:

File	Contents	When to Read
references/sql-scripting.md	SQL Scripting, Stored Procedures, Recursive CTEs, Transactions	User needs procedural SQL, error handling, loops, dynamic SQL
references/materialized-views-pipes.md	Materialized Views, Temp Tables/Views, Pipe Syntax	User needs MVs, refresh scheduling, temp objects, pipe operator
references/geospatial-collations.md	39 H3 functions, 80+ ST functions, Collation types and hierarchy	User needs spatial analysis, H3 indexing, case/accent handling
references/ai-functions.md	13 AI functions, http_request, remote_query, read_files (all options)	User needs AI enrichment, API calls, federation, file ingestion
references/best-practices.md	Data modeling, performance, Liquid Clustering, anti-patterns	User needs architecture guidance, optimization, or modeling advice

Key Guidelines

Always use Serverless SQL warehouses for AI functions, MVs, and http_request
Use LIMIT during development with AI functions to control costs
Prefer Liquid Clustering over partitioning for new tables (1-4 keys max)
Use CLUSTER BY AUTO when unsure about clustering keys
Star schema in Gold layer for BI; OBT acceptable in Silver
Define PK/FK constraints on dimensional models for query optimization
Use COLLATE UTF8_LCASE for user-facing string columns that need case-insensitive search
Test SQL via CLI (databricks experimental aitools tools query) or notebooks before deploying. If --warehouse is rejected on your CLI version, set DATABRICKS_WAREHOUSE_ID in the environment instead.

原文・著作権は Anthropic および各プラグイン作者に帰属します。日本語訳は Claude API による自動翻訳です。