🔍explore-data

Plugin: Data
Arguments: <table or file>

Description

Profile and explore a dataset to understand its shape, quality, and patterns. Use when encountering a new table or file, checking null rates and column distributions, spotting data quality issues like duplicates or suspicious values, or deciding which dimensions and metrics to analyze.

Use cases

✓ Encountering a new table or file for the first time
✓ Checking null rates and column distributions
✓ Spotting data quality issues
✓ Deciding which dimensions and metrics to analyze

/explore-data - Profile and Explore a Dataset

If you see unfamiliar placeholders or need to check which tools are connected, see CONNECTORS.md.

Generate a comprehensive data profile for a table or uploaded file. Understand its shape, quality, and patterns before diving into analysis.

Usage

/explore-data <table_name or file>

Workflow

1. Access the Data

If a data warehouse MCP server is connected:

Resolve the table name (handle schema prefixes, suggest matches if ambiguous)
Query table metadata: column names, types, descriptions if available
Run profiling queries against the live data

If a file is provided (CSV, Excel, Parquet, JSON):

Read the file and load into a working dataset
Infer column types from the data

If neither:

Ask the user to provide a table name (with their warehouse connected) or upload a file
If they describe a table schema, provide guidance on what profiling queries to run
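
For file inputs, the "infer column types from the data" step could be sketched as below. This is a minimal illustration using only the Python standard library; `infer_type` is a hypothetical helper name, and the order in which types are tried (boolean, integer, float, ISO timestamp, string) is an assumption rather than something the skill prescribes.

```python
from datetime import datetime


def infer_type(values):
    """Guess a column type from raw string values (hypothetical helper)."""
    # Ignore nulls and empty cells; they carry no type information.
    non_empty = [v.strip() for v in values if v is not None and v.strip() != ""]
    if not non_empty:
        return "unknown"

    def all_parse(parser):
        # True only if every non-empty value parses without a ValueError.
        try:
            for v in non_empty:
                parser(v)
            return True
        except ValueError:
            return False

    if all(v.lower() in {"true", "false"} for v in non_empty):
        return "boolean"
    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "float"
    if all_parse(datetime.fromisoformat):
        return "timestamp"
    return "string"
```

Real files often mix date formats or use locale-specific number separators, so a result of "string" is a prompt to look closer, not a final answer.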

2. Understand Structure

Before analyzing any data, understand its structure:

Table-level questions:

How many rows and columns?
What is the grain (one row per what)?
What is the primary key? Is it unique?
When was the data last updated?
How far back does the data go?

Column classification — categorize each column as one of:

Identifier: Unique keys, foreign keys, entity IDs
Dimension: Categorical attributes for grouping/filtering (status, type, region, category)
Metric: Quantitative values for measurement (revenue, count, duration, score)
Temporal: Dates and timestamps (created_at, updated_at, event_date)
Text: Free-form text fields (description, notes, name)
Boolean: True/false flags
Structural: JSON, arrays, nested structures

3. Generate Data Profile

Run the following profiling checks:

Table-level metrics:

Total row count
Column count and types breakdown
Approximate table size (if available from metadata)
Date range coverage (min/max of date columns)

All columns:

Null count and null rate
Distinct count and cardinality ratio (distinct / total)
Most common values (top 5-10 with frequencies)
Least common values (bottom 5 to spot anomalies)
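
As a rough illustration of the all-column checks above, the following sketch computes them for one column held as a Python list, with `None` standing in for null. The function name `profile_column` and the returned dictionary keys are assumptions made for this example.

```python
from collections import Counter


def profile_column(values, top_n=5):
    """Null, cardinality, and frequency checks for one column (sketch)."""
    total = len(values)
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    null_count = total - len(non_null)
    return {
        "null_count": null_count,
        "null_rate": null_count / total if total else 0.0,
        "distinct_count": len(counts),
        # Cardinality ratio is distinct / total, as defined in the list above.
        "cardinality_ratio": len(counts) / total if total else 0.0,
        "most_common": counts.most_common(top_n),
        # Rarest values first; useful for spotting typos and one-off anomalies.
        "least_common": sorted(counts.items(), key=lambda kv: kv[1])[:top_n],
    }
```

For example, `profile_column(["a", "a", "b", None])` reports one null, a null rate of 0.25, and two distinct values.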

Numeric columns (metrics):

min, max, mean, median (p50)
standard deviation
percentiles: p1, p5, p25, p75, p95, p99
zero count
negative count (if unexpected)
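
The numeric checks above might be computed along these lines. This is a sketch that assumes the column fits in memory and has at least two non-null values; `numeric_summary` is a hypothetical name, and the "inclusive" interpolation method of `statistics.quantiles` is one of several reasonable percentile definitions, so results can differ slightly from a warehouse's percentile functions.

```python
import statistics


def numeric_summary(values):
    """Summary statistics for a numeric column (needs >= 2 non-null values)."""
    nums = sorted(v for v in values if v is not None)
    # n=100 yields 99 cut points; cuts[p - 1] is the p-th percentile.
    cuts = statistics.quantiles(nums, n=100, method="inclusive")
    percentiles = {f"p{p}": cuts[p - 1] for p in (1, 5, 25, 50, 75, 95, 99)}
    return {
        "min": nums[0],
        "max": nums[-1],
        "mean": statistics.fmean(nums),
        "stdev": statistics.stdev(nums),  # sample standard deviation
        "zero_count": sum(1 for v in nums if v == 0),
        "negative_count": sum(1 for v in nums if v < 0),
        **percentiles,
    }
```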

String columns (dimensions, text):

min length, max length, avg length
empty string count
pattern analysis (do values follow a format?)
case consistency (all upper, all lower, mixed?)
leading/trailing whitespace count
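
A possible shape for the string checks above is shown here. `string_profile` and its output keys are illustrative names, and classifying any value that is neither all-upper nor all-lower as "mixed" (which includes purely numeric strings) is a simplifying assumption.

```python
from collections import Counter


def string_profile(values):
    """Length, emptiness, whitespace, and case checks for a string column."""
    strs = [v for v in values if v is not None]
    lengths = [len(s) for s in strs]
    non_empty = [s for s in strs if s]

    def case_of(s):
        if s.isupper():
            return "upper"
        if s.islower():
            return "lower"
        return "mixed"  # also covers strings with no letters at all

    return {
        "min_length": min(lengths, default=0),
        "max_length": max(lengths, default=0),
        "avg_length": sum(lengths) / len(lengths) if lengths else 0.0,
        "empty_count": len(strs) - len(non_empty),
        # Values that change when stripped have leading or trailing whitespace.
        "whitespace_count": sum(1 for s in strs if s != s.strip()),
        "case_styles": Counter(case_of(s) for s in non_empty),
    }
```

More than one key in `case_styles` for a categorical column is a hint that the same category may be stored under several spellings.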

Date/timestamp columns:

min date, max date
null dates
future dates (if unexpected)
distribution by month/week
gaps in time series
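
The "gaps in time series" check can be illustrated for daily data as follows. The helper name `find_date_gaps` and the `max_gap_days` parameter are assumptions; the sketch treats any jump of more than `max_gap_days` between consecutive distinct dates as a gap.

```python
from datetime import date, timedelta


def find_date_gaps(dates, max_gap_days=1):
    """Return (first_missing, last_missing) ranges between observed dates."""
    unique = sorted({d for d in dates if d is not None})
    gaps = []
    for prev, cur in zip(unique, unique[1:]):
        if (cur - prev).days > max_gap_days:
            gaps.append((prev + timedelta(days=1), cur - timedelta(days=1)))
    return gaps


observed = [date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 5)]
missing = find_date_gaps(observed)  # one gap covering Jan 3 to Jan 4
```

For weekly or monthly data the same idea applies with a larger `max_gap_days`.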

Boolean columns:

true count, false count, null count
true rate

Present the profile as a clean summary table, grouped by column type (dimensions, metrics, dates, IDs).

4. Identify Data Quality Issues

Apply the quality assessment framework below. Flag potential problems:

High null rates: Columns with >5% nulls (warn), >20% nulls (alert)
Low cardinality surprises: Columns that should be high-cardinality but aren't (e.g., a "user_id" with only 50 distinct values)
High cardinality surprises: Columns that should be categorical but have too many distinct values
Suspicious values: Negative amounts where only positive expected, future dates in historical data, obviously placeholder values (e.g., "N/A", "TBD", "test", "999999")
Duplicate detection: Check if there's a natural key and whether it has duplicates
Distribution skew: Extremely skewed numeric distributions that could affect averages
Encoding issues: Mixed case in categorical fields, trailing whitespace, inconsistent formats
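
Two of the checks above, the null-rate thresholds and duplicate detection on a natural key, are simple enough to sketch directly. The function names are hypothetical, rows are assumed to be dictionaries keyed by column name, and the thresholds are the 5% and 20% values stated in the list.

```python
from collections import Counter


def null_rate_flag(null_rate):
    """Map a null rate (0.0 to 1.0) to the warn/alert levels listed above."""
    if null_rate > 0.20:
        return "alert"
    if null_rate > 0.05:
        return "warn"
    return "ok"


def duplicate_keys(rows, key_columns):
    """Return each natural-key tuple that appears more than once, with its count."""
    counts = Counter(tuple(row[c] for c in key_columns) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}
```

An empty result from `duplicate_keys` supports, but does not prove, that the chosen columns form a unique key; a sample can miss duplicates that exist in the full table.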

5. Discover Relationships and Patterns

After profiling individual columns:

Foreign key candidates: ID columns that might link to other tables
Hierarchies: Columns that form natural drill-down paths (country > state > city)
Correlations: Numeric columns that move together
Derived columns: Columns that appear to be computed from others
Redundant columns: Columns with identical or near-identical information

6. Suggest Interesting Dimensions and Metrics

Based on the column profile, recommend:

Best dimension columns for slicing data (categorical columns with reasonable cardinality, 3-50 values)
Key metric columns for measurement (numeric columns with meaningful distributions)
Time columns suitable for trend analysis
Natural groupings or hierarchies apparent in the data
Potential join keys linking to other tables (ID columns, foreign keys)

7. Recommend Follow-Up Analyses

Suggest 3-5 specific analyses the user could run next:

"Trend analysis on [metric] by [time_column] grouped by [dimension]"
"Distribution deep-dive on [skewed_column] to understand outliers"
"Data quality investigation on [problematic_column]"
"Correlation analysis between [metric_a] and [metric_b]"
"Cohort analysis using [date_column] and [status_column]"

Output Format

## Data Profile: [table_name]

### Overview
- Rows: 2,340,891
- Columns: 23 (8 dimensions, 6 metrics, 4 dates, 5 IDs)
- Date range: 2021-03-15 to 2024-01-22

### Column Details
[summary table]

### Data Quality Issues
[flagged issues with severity]

### Recommended Explorations
[numbered list of suggested follow-up analyses]

Quality Assessment Framework

Completeness Score

Rate each column:

Complete (>99% non-null): Green
Mostly complete (95-99%): Yellow -- investigate the nulls
Incomplete (80-95%): Orange -- understand why and whether it matters
Sparse (<80%): Red -- may not be usable without imputation
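
The completeness bands translate into a small lookup. The bands as written share their boundary values, so this sketch assumes that exactly 95% counts as Yellow and exactly 80% counts as Orange; `completeness_rating` is an illustrative name.

```python
def completeness_rating(non_null_rate):
    """Rate a column by its non-null share (0.0 to 1.0)."""
    if non_null_rate > 0.99:
        return "green"
    if non_null_rate >= 0.95:  # assumption: the 95% boundary falls in Yellow
        return "yellow"
    if non_null_rate >= 0.80:  # assumption: the 80% boundary falls in Orange
        return "orange"
    return "red"
```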

Consistency Checks

Look for:

Value format inconsistency: Same concept represented differently ("USA", "US", "United States", "us")
Type inconsistency: Numbers stored as strings, dates in various formats
Referential integrity: Foreign keys that don't match any parent record
Business rule violations: Negative quantities, end dates before start dates, percentages > 100
Cross-column consistency: Status = "completed" but completed_at is null

Accuracy Indicators

Red flags that suggest accuracy issues:

Placeholder values: 0, -1, 999999, "N/A", "TBD", "test", "xxx"
Default values: Suspiciously high frequency of a single value
Stale data: updated_at shows no recent changes in an active system
Impossible values: Ages > 150, dates in the far future, negative durations
Round number bias: All values ending in 0 or 5 (suggests estimation, not measurement)
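
Two of these red flags lend themselves to quick counts. The sketch below uses the placeholder list given above and measures the share of integer values ending in 0 or 5; both function names are assumptions. Values such as 0 and -1 are often legitimate, so a non-zero count is a reason to look at the column, not proof of a problem.

```python
PLACEHOLDERS = {"0", "-1", "999999", "n/a", "tbd", "test", "xxx"}


def placeholder_count(values):
    """Count values whose trimmed, lowercased text matches a known placeholder."""
    return sum(
        1 for v in values
        if v is not None and str(v).strip().lower() in PLACEHOLDERS
    )


def round_number_share(values):
    """Share of integer values ending in 0 or 5 (roughly 0.2 if digits are even)."""
    ints = [v for v in values if isinstance(v, int) and not isinstance(v, bool)]
    if not ints:
        return 0.0
    return sum(1 for v in ints if v % 5 == 0) / len(ints)
```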

Timeliness Assessment

When was the table last updated?
What is the expected update frequency?
Is there a lag between event time and load time?
Are there gaps in the time series?

Pattern Discovery Techniques

Distribution Analysis

For numeric columns, characterize the distribution:

Normal: Mean and median are close, bell-shaped
Skewed right: Long tail of high values (common for revenue, session duration)
Skewed left: Long tail of low values (less common)
Bimodal: Two peaks (suggests two distinct populations)
Power law: Few very large values, many small ones (common for user activity)
Uniform: Roughly equal frequency across range (often synthetic or random)
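
A first pass at separating right-skewed, left-skewed, and roughly symmetric columns can use the moment-based skewness coefficient. This is only a sketch: the 0.5 cut-off is an arbitrary assumption, the function names are hypothetical, and skewness alone cannot tell normal, uniform, and bimodal shapes apart, so a histogram is still needed for those.

```python
import statistics


def skewness(values):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    nums = [v for v in values if v is not None]
    if len(nums) < 3:
        return 0.0
    mean = statistics.fmean(nums)
    m2 = sum((v - mean) ** 2 for v in nums) / len(nums)
    m3 = sum((v - mean) ** 3 for v in nums) / len(nums)
    if m2 == 0:
        return 0.0  # constant column: no spread, so no skew
    return m3 / m2 ** 1.5


def describe_shape(values, threshold=0.5):
    """Label a column by the sign and size of its skewness."""
    g1 = skewness(values)
    if g1 > threshold:
        return "skewed right"
    if g1 < -threshold:
        return "skewed left"
    return "roughly symmetric"
```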

Temporal Patterns

For time series data, look for:

Trend: Sustained upward or downward movement
Seasonality: Repeating patterns (weekly, monthly, quarterly, annual)
Day-of-week effects: Weekday vs. weekend differences
Holiday effects: Drops or spikes around known holidays
Change points: Sudden shifts in level or trend
Anomalies: Individual data points that break the pattern

Segmentation Discovery

Identify natural segments by:

Finding categorical columns with 3-20 distinct values
Comparing metric distributions across segment values
Looking for segments with significantly different behavior
Testing whether segments are homogeneous or contain sub-segments
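
The first two steps above could look like the following, assuming rows are dictionaries keyed by column name. `is_segment_candidate` and `segment_means` are hypothetical helpers, and comparing means is only one way to compare distributions across segments.

```python
import statistics
from collections import defaultdict


def is_segment_candidate(values, low=3, high=20):
    """True if the column has between 3 and 20 distinct non-null values."""
    distinct = {v for v in values if v is not None}
    return low <= len(distinct) <= high


def segment_means(rows, segment_col, metric_col):
    """Mean of a metric per segment value, skipping rows with nulls."""
    groups = defaultdict(list)
    for row in rows:
        if row[segment_col] is not None and row[metric_col] is not None:
            groups[row[segment_col]].append(row[metric_col])
    return {seg: statistics.fmean(vals) for seg, vals in groups.items()}
```

Segments whose mean sits far from the others are the ones worth checking for sub-segments or data issues.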

Correlation Exploration

Between numeric columns:

Compute correlation matrix for all metric pairs
Flag strong correlations (|r| > 0.7) for investigation
Note: Correlation does not imply causation -- flag this explicitly
Check for non-linear relationships (e.g., quadratic, logarithmic)
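
As a sketch of the correlation steps, the code below computes Pearson's r for every pair of metric columns and returns the pairs with |r| above 0.7. It assumes the columns are equal-length lists of numbers with nulls already removed; `pearson` and `strong_correlations` are illustrative names.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column has no defined correlation
    return cov / (sx * sy)


def strong_correlations(columns, threshold=0.7):
    """Return (col_a, col_b, r) for every pair with |r| above the threshold."""
    flagged = []
    for a, b in combinations(sorted(columns), 2):
        r = pearson(columns[a], columns[b])
        if abs(r) > threshold:
            flagged.append((a, b, round(r, 3)))
    return flagged
```

Pearson's r only captures linear association, which is why the non-linear check in the list above remains a separate step.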

Schema Understanding and Documentation

Schema Documentation Template

When documenting a dataset for team use:

## Table: [schema.table_name]

**Description**: [What this table represents]
**Grain**: [One row per...]
**Primary Key**: [column(s)]
**Row Count**: [approximate, with date]
**Update Frequency**: [real-time / hourly / daily / weekly]
**Owner**: [team or person responsible]

### Key Columns

| Column | Type | Description | Example Values | Notes |
|--------|------|-------------|----------------|-------|
| user_id | STRING | Unique user identifier | "usr_abc123" | FK to users.id |
| event_type | STRING | Type of event | "click", "view", "purchase" | 15 distinct values |
| revenue | DECIMAL | Transaction revenue in USD | 29.99, 149.00 | Null for non-purchase events |
| created_at | TIMESTAMP | When the event occurred | 2024-01-15 14:23:01 | Partitioned on this column |

### Relationships
- Joins to `users` on `user_id`
- Joins to `products` on `product_id`
- Parent of `event_details` (1:many on event_id)

### Known Issues
- [List any known data quality issues]
- [Note any gotchas for analysts]

### Common Query Patterns
- [Typical use cases for this table]

Schema Exploration Queries

When connected to a data warehouse, use these patterns to discover schema:

-- List all tables in a schema (PostgreSQL)
SELECT table_name, table_type
FROM information_schema.tables
WHERE table_schema = 'public'
ORDER BY table_name;

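-- Column details (PostgreSQL)
-- Filter on schema as well, so same-named tables in other schemas are excluded
SELECT column_name, data_type, is_nullable, column_default
FROM information_schema.columns
WHERE table_schema = 'public'
  AND table_name = 'my_table'
ORDER BY ordinal_position;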

-- Table sizes (PostgreSQL)
SELECT relname, pg_size_pretty(pg_total_relation_size(relid))
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC;

-- Row counts for all tables (general pattern)
-- Run per-table: SELECT COUNT(*) FROM table_name

Lineage and Dependencies

When exploring an unfamiliar data environment:

Start with the "output" tables (what reports or dashboards consume)
Trace upstream: What tables feed into them?
Identify raw/staging/mart layers
Map the transformation chain from raw data to analytical tables
Note where data is enriched, filtered, or aggregated

Tips

For very large tables (100M+ rows), profiling queries use sampling by default -- mention if you need exact counts
If exploring a new dataset for the first time, this command gives you the lay of the land before writing specific queries
The quality flags are heuristic -- not every flag is a real problem, but each is worth a quick look

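The original text and its copyright belong to Anthropic and the respective plugin authors. The Japanese translation on the source page was generated automatically with the Claude API.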