Skill · Official · development

observability

Plugin: aws-dev-toolkit

Description

Design and implement AWS observability solutions. Use when configuring CloudWatch metrics, logs, alarms, dashboards, Logs Insights queries, X-Ray tracing, anomaly detection, or debugging monitoring gaps.

Use cases

✓ Configuring CloudWatch metrics
✓ Implementing log and alarm configurations
✓ Setting up X-Ray tracing
✓ Configuring anomaly detection rules
✓ Debugging monitoring gaps

Body


You are an AWS observability specialist. Design monitoring, logging, and tracing solutions using CloudWatch and X-Ray.

CloudWatch Metrics

Key Concepts

Namespace: Grouping for metrics (e.g., AWS/EC2, AWS/Lambda, custom)
Metric: Time-ordered set of data points (e.g., CPUUtilization)
Dimension: Key-value pair that identifies a metric (e.g., InstanceId=i-xxx)
Period: Aggregation interval (60s, 300s, etc.)
Statistic: Aggregation function (Average, Sum, Min, Max, p99, etc.)

Critical Metrics by Service

Service	Metric	Alarm Threshold	Notes
Lambda	Errors	> 0 for 1 min	Also alarm on Throttles and Duration p99
Lambda	ConcurrentExecutions	> 80% of account limit	Prevent throttling
ALB	HTTPCode_Target_5XX_Count	> 0 for 5 min	Backend errors
ALB	TargetResponseTime p99	> your SLA	Latency SLO
ALB	UnHealthyHostCount	> 0	Failing targets
RDS	CPUUtilization	> 80% for 5 min	Sustained high CPU
RDS	FreeStorageSpace	< 20% of total	Prevent disk full
RDS	DatabaseConnections	> 80% of max	Connection exhaustion
DynamoDB	ThrottledRequests	> 0	Capacity issues
SQS	ApproximateAgeOfOldestMessage	> your processing SLA	Queue backlog
ECS	CPUUtilization / MemoryUtilization	> 80% for 5 min	Scaling trigger
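
Several thresholds in the table are expressed as a percentage of a limit, while an alarm threshold is an absolute number (FreeStorageSpace, for example, is reported in bytes). The following is a minimal sketch of that conversion. The function names and the sample limits (100 GiB of allocated storage, a concurrency limit of 1000) are hypothetical and should be replaced with the values for your own account.

```python
GIB = 1024 ** 3  # bytes per GiB


def free_storage_threshold_bytes(allocated_gib: int, min_free_percent: int = 20) -> int:
    """Absolute byte value for a 'less than N% free' storage alarm."""
    return allocated_gib * GIB * min_free_percent // 100


def usage_threshold(limit: int, max_used_percent: int = 80) -> int:
    """Absolute value for a 'more than N% of the limit' alarm."""
    return limit * max_used_percent // 100


# Hypothetical limits, for illustration only.
print(free_storage_threshold_bytes(100))  # 21474836480
print(usage_threshold(1000))              # 800
```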

Custom Metrics

Use PutMetricData API or the CloudWatch Agent
Embedded Metric Format (EMF) for Lambda: log structured JSON that CloudWatch automatically extracts as metrics. No PutMetricData calls are needed, so there are no per-request API charges (log ingestion and custom metric charges still apply).
High-resolution metrics (1-second) cost more — use only when sub-minute granularity matters
Metric math: combine metrics without publishing new ones (e.g., error rate = Errors / Invocations * 100)
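
As an illustration of the Embedded Metric Format bullet above, here is a small sketch that builds one EMF log line using only the standard library. The helper name `emf_log_line` and the sample namespace, metric, and dimension values are assumptions made for this example. In a Lambda function, printing the resulting line to stdout is typically enough for it to reach CloudWatch Logs.

```python
import json
import time


def emf_log_line(namespace, metric_name, value, unit, dimensions, timestamp_ms=None):
    """Build one Embedded Metric Format log line as a JSON string."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    record = {
        "_aws": {
            "Timestamp": timestamp_ms,
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    # One dimension set containing every dimension key.
                    "Dimensions": [sorted(dimensions)],
                    "Metrics": [{"Name": metric_name, "Unit": unit}],
                }
            ],
        },
        # Metric values and dimension values live at the top level.
        metric_name: value,
    }
    record.update(dimensions)
    return json.dumps(record)


line = emf_log_line(
    "MyApp", "RequestLatency", 42, "Milliseconds",
    {"Environment": "prod"}, timestamp_ms=1704067200000,
)
print(line)
```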

CloudWatch Logs

Log Groups and Retention

Set retention on every log group. By default log events never expire, so storage costs keep growing.
Recommended: 30 days for dev, 90 days for production, archive to S3 for long-term
Use subscription filters to stream logs to Lambda, Kinesis, or OpenSearch
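
One way to act on the retention advice is to audit the output of `describe-log-groups` for groups that have no `retentionInDays` value, which is how a never-expire group appears in the response. The sketch below works on an already-fetched response dictionary. The sample data and the function name are hypothetical.

```python
def groups_without_retention(response):
    """Return names of log groups that have no retention policy set."""
    return [
        group["logGroupName"]
        for group in response.get("logGroups", [])
        if "retentionInDays" not in group
    ]


# Hypothetical response, shaped like the describe-log-groups output.
sample_response = {
    "logGroups": [
        {"logGroupName": "/aws/lambda/my-function", "retentionInDays": 30, "storedBytes": 1024},
        {"logGroupName": "/aws/lambda/legacy-job", "storedBytes": 987654321},
    ]
}

print(groups_without_retention(sample_response))  # ['/aws/lambda/legacy-job']
```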

Structured Logging

Always log in JSON format. This enables Logs Insights queries on fields.

{"level": "ERROR", "message": "Payment failed", "orderId": "123", "errorCode": "DECLINED", "duration_ms": 45}
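
A log line like the one above can be produced with Python's standard `logging` module. This is a minimal sketch: the `JsonFormatter` class, the `make_logger` helper, and the convention of passing extra fields under a `fields` key are assumptions made for the example, not part of any AWS library.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        payload = {"level": record.levelname, "message": record.getMessage()}
        # Structured fields are passed via extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def make_logger(name, stream=None):
    """Create a logger that writes JSON lines to the given stream (stderr by default)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


logger = make_logger("payments")
logger.error(
    "Payment failed",
    extra={"fields": {"orderId": "123", "errorCode": "DECLINED", "duration_ms": 45}},
)
```

Keeping numeric values such as `duration_ms` as JSON numbers rather than strings lets Logs Insights aggregate them directly.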

CloudWatch Logs Insights Queries

# Find errors in Lambda functions
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100

# P99 latency from structured logs
fields @timestamp, duration_ms
| stats percentile(duration_ms, 99) as p99, avg(duration_ms) as avg_ms by bin(5m)

# Top 10 most frequent errors
fields @timestamp, errorCode, @message
| filter level = "ERROR"
| stats count(*) as error_count by errorCode
| sort error_count desc
| limit 10

# Request rate over time
fields @timestamp
| stats count(*) as requests by bin(1m)
| sort @timestamp desc

# Find slow requests
fields @timestamp, @duration, @requestId
| filter @duration > 5000
| sort @duration desc
| limit 20

# Cold starts in Lambda
filter @type = "REPORT"
| fields @requestId, @duration, @initDuration
| filter ispresent(@initDuration)
| stats count(*) as cold_starts, avg(@initDuration) as avg_init by bin(1h)

# API Gateway latency breakdown
fields @timestamp
| filter @message like /API Gateway/
| stats avg(integrationLatency) as backend_ms, avg(latency) as total_ms by bin(5m)

CloudWatch Alarms

Alarm Types

Static threshold: Fixed value (e.g., CPU > 80%)
Anomaly detection: ML-based band. Good for metrics with patterns (traffic, latency).
Composite alarm: Combine multiple alarms with AND/OR logic. Reduces noise.
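
A composite alarm is defined by a rule expression over the states of other alarms. The sketch below builds such an expression as a string, which could then be supplied as the alarm rule when creating the composite alarm. The helper name and the alarm names are hypothetical.

```python
def composite_rule(alarm_names, operator="AND"):
    """Join child alarms into a composite alarm rule expression."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    if not alarm_names:
        raise ValueError("at least one alarm name is required")
    return f" {operator} ".join(f'ALARM("{name}")' for name in alarm_names)


# Hypothetical child alarms: page only when both symptoms are present.
print(composite_rule(["api-5xx-rate", "api-p99-latency"]))
# ALARM("api-5xx-rate") AND ALARM("api-p99-latency")
```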

Alarm Best Practices

Use 3 out of 5 datapoints evaluation to avoid flapping on transient spikes
Set TreatMissingData to notBreaching for low-traffic services (avoids false alarms when no data)
Set TreatMissingData to breaching for critical health checks (missing data = something is down)
Use composite alarms to create "alarm hierarchies": a top-level alarm that fires only when multiple sub-alarms are in ALARM state
Always send alarms to SNS. Connect SNS to PagerDuty, Slack, or email.
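
To make the "3 out of 5" and TreatMissingData guidance concrete, the following sketch simulates the decision on a list of datapoints. It is a deliberately simplified model, not CloudWatch's actual evaluation algorithm: it only handles the `breaching` and `notBreaching` settings, uses a greater-than comparison, and never returns INSUFFICIENT_DATA. Missing datapoints are represented as `None`.

```python
def alarm_state(datapoints, threshold, datapoints_to_alarm=3,
                evaluation_periods=5, treat_missing_data="notBreaching"):
    """Return 'ALARM' if enough of the most recent datapoints breach the threshold."""
    window = datapoints[-evaluation_periods:]
    breaching = 0
    for value in window:
        if value is None:
            # Missing data counts as a breach only when configured to do so.
            if treat_missing_data == "breaching":
                breaching += 1
        elif value > threshold:
            breaching += 1
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"


# A single transient spike does not trip a 3-out-of-5 alarm.
print(alarm_state([10, 95, 20, 30, 40], threshold=80))  # OK
# Three breaches within the window do.
print(alarm_state([90, 95, 20, 85, 40], threshold=80))  # ALARM
```

With `treat_missing_data="breaching"`, a window such as `[None, None, None, 10, 10]` evaluates to ALARM, which matches the health-check guidance that missing data means something is down.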

Anomaly Detection

The model trains on up to 2 weeks of recent metric data, so do not enable it during a known-bad period.
Adjust the band width (number of standard deviations). Start with 2, widen if too noisy.
Best for: request count, latency, error rate — metrics with daily/weekly patterns.
Not good for: binary metrics, metrics that are normally zero.
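
The effect of the band width setting can be illustrated with a plain mean and standard deviation band. This is only a simplified stand-in: CloudWatch's anomaly detection model is ML-based and accounts for daily and weekly patterns, which this sketch does not attempt. The sample history values are made up.

```python
from statistics import mean, stdev


def band(history, width=2.0):
    """Return (low, high) bounds: mean plus or minus `width` standard deviations."""
    centre = mean(history)
    spread = stdev(history)
    return centre - width * spread, centre + width * spread


def is_anomalous(value, history, width=2.0):
    """True if the value falls outside the band derived from the history."""
    low, high = band(history, width)
    return value < low or value > high


history = [100, 102, 98, 101, 99]  # made-up request counts

print(is_anomalous(104, history, width=2.0))  # True: outside the narrow band
print(is_anomalous(104, history, width=3.0))  # False: the wider band absorbs it
```

The same sketch hints at why normally-zero metrics are a poor fit: a history of all zeros has zero spread, so any non-zero value falls outside the band.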

CloudWatch Dashboards

Dashboard Design

One dashboard per service or domain (not one giant dashboard)
Top row: key business metrics (request rate, error rate, latency p99)
Second row: infrastructure health (CPU, memory, connections)
Third row: dependencies (downstream API latency, queue depth)
Use metric math to show rates and percentages, not raw counts
Add text widgets to document what each section monitors and what to do when values are abnormal
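
The layout guidance above can be sketched as a dashboard body document. The example below builds a small body with a text widget and two metric widgets, one of which uses a metric math expression to show an error rate instead of raw counts. The function name, the Lambda function name, and the region are hypothetical, and the widget selection is only illustrative.

```python
import json


def service_dashboard(function_name, region):
    """Build a dashboard body (JSON string) for one Lambda-backed service."""
    lambda_dims = ["FunctionName", function_name]
    widgets = [
        {   # Text widget documenting the section and what to do.
            "type": "text", "x": 0, "y": 0, "width": 24, "height": 2,
            "properties": {"markdown": "## Business metrics\nIf error rate rises, check recent deploys."},
        },
        {   # Error rate as a percentage, computed with metric math.
            "type": "metric", "x": 0, "y": 2, "width": 12, "height": 6,
            "properties": {
                "title": "Error rate (%)", "region": region, "period": 300, "stat": "Sum",
                "metrics": [
                    [{"expression": "100 * errors / invocations", "label": "Error rate", "id": "rate"}],
                    ["AWS/Lambda", "Errors", *lambda_dims, {"id": "errors", "visible": False}],
                    ["AWS/Lambda", "Invocations", *lambda_dims, {"id": "invocations", "visible": False}],
                ],
            },
        },
        {   # Tail latency rather than the average.
            "type": "metric", "x": 12, "y": 2, "width": 12, "height": 6,
            "properties": {
                "title": "Duration p99 (ms)", "region": region, "period": 300, "stat": "p99",
                "metrics": [["AWS/Lambda", "Duration", *lambda_dims]],
            },
        },
    ]
    return json.dumps({"widgets": widgets}, indent=2)


print(service_dashboard("my-function", "us-east-1"))
```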

Automatic Dashboards

CloudWatch provides automatic dashboards per service — start there before building custom
ServiceLens provides an application-centric view combining metrics, logs, and traces

X-Ray Tracing

When to Use X-Ray

Distributed applications with multiple services
Debugging latency issues across service boundaries
Understanding request flow and dependencies

Instrumentation

The X-Ray SDK can wrap AWS SDK clients so that calls to AWS services are recorded automatically
Use X-Ray SDK or OpenTelemetry to instrument your application code
Set sampling rules to control trace volume (default: 1 req/sec + 5% of additional)
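
The default sampling behaviour described above (the first request each second, plus 5% of the rest) can be sketched as a small decision class. This is an illustration of the idea rather than the X-Ray SDK's implementation. The class name is hypothetical, and the random source is injectable so the behaviour can be checked deterministically.

```python
import random


class ReservoirSampler:
    """Sample a fixed number of requests per second plus a percentage of the rest."""

    def __init__(self, reservoir_per_second=1, fixed_rate=0.05, rng=random.random):
        self.reservoir_per_second = reservoir_per_second
        self.fixed_rate = fixed_rate
        self.rng = rng
        self._current_second = None
        self._used = 0

    def should_sample(self, now):
        """Decide whether to trace a request arriving at `now` (epoch seconds)."""
        second = int(now)
        if second != self._current_second:
            # A new second starts with a fresh reservoir.
            self._current_second = second
            self._used = 0
        if self._used < self.reservoir_per_second:
            self._used += 1
            return True
        # Reservoir exhausted: fall back to the fixed sampling rate.
        return self.rng() < self.fixed_rate


sampler = ReservoirSampler()
decisions = [sampler.should_sample(1704067200 + i / 100) for i in range(100)]
print(sum(decisions), "of", len(decisions), "requests sampled in one second")
```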

Key X-Ray Concepts

Trace: End-to-end request path
Segment: A single service's processing of the request
Subsegment: Detailed breakdown within a segment (DB call, HTTP call)
Service Map: Visual representation of your architecture based on trace data
Annotations: Indexed key-value pairs for filtering traces (e.g., customerId=123)
Metadata: Non-indexed data attached to segments
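
These concepts map onto the fields of a segment document. The sketch below assembles one by hand to show where annotations, metadata, and subsegments sit. In practice the X-Ray SDK or an OpenTelemetry exporter builds these documents for you. The helper names and all sample values are hypothetical.

```python
import json
import secrets
import time


def new_trace_id(now=None):
    """Trace ID: version, 8 hex digits of epoch seconds, 24 random hex digits."""
    now = time.time() if now is None else now
    return f"1-{int(now):08x}-{secrets.token_hex(12)}"


def segment_document(name, start, end, annotations, metadata, subsegments=()):
    """Assemble a segment document as a plain dictionary."""
    return {
        "name": name,
        "id": secrets.token_hex(8),        # 16 hex digits
        "trace_id": new_trace_id(start),
        "start_time": start,
        "end_time": end,
        "annotations": annotations,        # indexed, usable in filter expressions
        "metadata": metadata,              # not indexed, for debugging detail
        "subsegments": [
            {"id": secrets.token_hex(8), "name": sub_name,
             "start_time": sub_start, "end_time": sub_end}
            for sub_name, sub_start, sub_end in subsegments
        ],
    }


doc = segment_document(
    "checkout-service", 1704067200.0, 1704067200.25,
    annotations={"customerId": "123"},
    metadata={"cart": {"items": 3}},
    subsegments=[("orders-db-query", 1704067200.05, 1704067200.20)],
)
print(json.dumps(doc, indent=2))
```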

X-Ray Best Practices

Add annotations for business-relevant fields (user ID, order ID) so you can filter traces
Use groups to define filter expressions for specific trace sets
Active tracing on API Gateway and Lambda captures the full request lifecycle
X-Ray daemon runs as a sidecar in ECS or as a DaemonSet in EKS

Contributor Insights

Identifies top contributors to a metric (e.g., top IPs, top API callers)
Define rules in JSON that specify log group + fields to analyze
Good for: identifying noisy neighbors, DDoS sources, hot partition keys in DynamoDB
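
A Contributor Insights rule for JSON logs might look like the following sketch, which counts log events per value of one field. The log group name and the `sourceIp` field are hypothetical, and the rule body reflects my understanding of the rule schema, so verify it against the current rule syntax documentation before relying on it.

```python
import json


def insight_rule(log_group, key_field):
    """Build a rule definition that ranks contributors by event count."""
    rule = {
        "Schema": {"Name": "CloudWatchLogRule", "Version": 1},
        "LogGroupNames": [log_group],
        "LogFormat": "JSON",
        "Contribution": {
            # JSON path of the field whose values identify contributors.
            "Keys": [f"$.{key_field}"],
            "Filters": [],
        },
        "AggregateOn": "Count",
    }
    return json.dumps(rule, indent=2)


# Hypothetical log group and field name.
print(insight_rule("/my-app/access-logs", "sourceIp"))
```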

Common CLI Commands

# Query Logs Insights (date -d is GNU date syntax; on macOS/BSD use $(date -v-1H +%s))
aws logs start-query --log-group-name /aws/lambda/my-function \
  --start-time $(date -d '1 hour ago' +%s) --end-time $(date +%s) \
  --query-string 'fields @timestamp, @message | filter @message like /ERROR/ | limit 20'

# Get query results
aws logs get-query-results --query-id "query-id-here"

# Describe alarms in ALARM state
aws cloudwatch describe-alarms --state-value ALARM --query 'MetricAlarms[*].{Name:AlarmName,Metric:MetricName,State:StateValue}'

# Get metric statistics
aws cloudwatch get-metric-statistics --namespace AWS/Lambda --metric-name Errors \
  --start-time 2024-01-01T00:00:00Z --end-time 2024-01-01T01:00:00Z \
  --period 300 --statistics Sum --dimensions Name=FunctionName,Value=my-function

# Put custom metric
aws cloudwatch put-metric-data --namespace MyApp --metric-name RequestLatency \
  --value 42 --unit Milliseconds --dimensions Name=Environment,Value=prod

# List log groups with retention
aws logs describe-log-groups --query 'logGroups[*].{Name:logGroupName,RetentionDays:retentionInDays,StoredBytes:storedBytes}'

# Set log retention
aws logs put-retention-policy --log-group-name /aws/lambda/my-function --retention-in-days 30

# List X-Ray traces
aws xray get-trace-summaries --start-time $(date -d '1 hour ago' +%s) --end-time $(date +%s)

# Get X-Ray service map
aws xray get-service-graph --start-time $(date -d '1 hour ago' +%s) --end-time $(date +%s)

# List CloudWatch dashboards
aws cloudwatch list-dashboards

Output Format

Field	Details
Metrics	Critical alarms with thresholds, evaluation periods, and actions
Logs	Log groups, retention policy, structured format (JSON), subscription filters
Traces	X-Ray or OpenTelemetry, sampling rules, annotations for filtering
Dashboards	Dashboard names, key widgets, layout (business/infra/dependencies)
Anomaly detection	Metrics with anomaly detection bands, standard deviation config
Cost	Estimated monthly cost for logs ingestion, metrics, dashboards, and traces

Reference Files

references/logs-insights-queries.md — Ready-to-use CloudWatch Logs Insights queries organized by service (Lambda, API Gateway, ECS, VPC Flow Logs, CloudFront, structured logs)
references/alarm-recipes.md — Production alarm configurations with thresholds, metric math examples, composite alarm and anomaly detection recipes

Related Skills

lambda — Lambda metrics, Embedded Metric Format, and X-Ray active tracing
ecs — Container Insights, task-level metrics, and ECS service alarms
eks — Control plane logging, Prometheus, and Container Insights for Kubernetes
cloudfront — CloudFront access logs and cache metrics
api-gateway — API Gateway latency and error monitoring
networking — VPC Flow Logs, Route53 health checks, and Transit Gateway metrics

Anti-Patterns

No log retention policy: CloudWatch log groups default to never expire, so costs grow silently. Set retention on every log group.
Alarming on every metric: Too many alarms lead to alert fatigue. Alarm on symptoms (error rate, latency), not causes (CPU). Use composite alarms to reduce noise.
Average-based latency alarms: Averages hide tail latency. Use p99 or p95 for latency alarms.
Missing structured logging: Unstructured logs cannot be queried efficiently with Logs Insights. Always log JSON.
No tracing in distributed systems: Without X-Ray or OpenTelemetry, debugging cross-service issues requires correlating timestamps across log groups. Enable tracing.
Sampling rate of 100%: Full tracing in production generates enormous data volume and cost. Use sampling — 1 req/sec + 5% is usually sufficient.
Not using Embedded Metric Format in Lambda: EMF turns log lines into metrics with zero PutMetricData API calls. It's cheaper and simpler than the alternatives.
Dashboard without runbook links: A dashboard that shows a problem without explaining what to do about it is only half useful. Add text widgets with runbook links.
Ignoring CloudWatch anomaly detection: Static thresholds don't work for metrics with daily patterns. Use anomaly detection for request count and latency.
CloudWatch Agent not installed on EC2: Without the agent, you only get basic metrics (CPU, network, disk I/O). Install the agent for memory utilization, disk space, and custom metrics.
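
The point about average-based latency alarms can be seen with a few lines of arithmetic. In the made-up sample below, 2% of requests take 5 seconds. The average stays low while p99 exposes the slow tail. The nearest-rank percentile helper is a simple illustration and may differ slightly from how CloudWatch computes percentile statistics.

```python
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct% of n
    return ordered[max(rank, 1) - 1]


# Made-up durations in milliseconds: 98 fast requests and 2 very slow ones.
durations_ms = [40] * 98 + [5000] * 2

print("avg:", mean(durations_ms))            # 139.2
print("p50:", percentile(durations_ms, 50))  # 40
print("p99:", percentile(durations_ms, 99))  # 5000
```

An alarm on the average with a 1000 ms threshold would stay silent here, while a p99 alarm with the same threshold would fire.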

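The original text and its copyright belong to Anthropic and the respective plugin authors. The Japanese translation on the source page was generated automatically with the Claude API.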