スキルOfficialdevelopment

🔍hyperpod-nccl

プラグイン: sagemaker-ai
ソース: GitHub で見る ↗

説明

HyperPod GPU クラスター（EKS または Slurm）上の NCCL 障害および関連するトレーニング Pod 障害を診断します。対象となる障害の例: - トレーニングのハング - AllReduce / コレクティブ操作のタイムアウト - EFA または libfabric のエラー - ランデブー（rendezvous）失敗 - EFA TCP フォールバック - `/dev/shm` またはメモリロック（memlock）の問題 - Pod 間での NCCL バージョン不一致 - コンテナの OOM / exit-137 / OOMKilled - GPU の OOM（CUDA out of memory） - CrashLoopBackOff / Pending 状態の Pod - `MASTER_ADDR` の DNS 解決失敗 - NetworkPolicy によるブロック次のような場合には使用しない: 単一ノードのハードウェア障害（→ `hyperpod-node-debugger § G` を参照）、またはクラスター作成時の EFA / SSM 障害（→ `hyperpod-cluster-debugger § A / § F` を参照）。

原文を表示

Diagnose NCCL failures and adjacent training-pod failures on HyperPod GPU clusters (EKS or Slurm) — training hangs, AllReduce / collective-op timeouts, EFA or libfabric errors, rendezvous failures, EFA TCP fallback, /dev/shm or memlock issues, NCCL version mismatch across pods, container OOM / exit-137 / OOMKilled, GPU OOM (CUDA out of memory), CrashLoopBackOff / Pending pods, MASTER_ADDR DNS, NetworkPolicy blocking. Not for single-node hardware faults (→ hyperpod-node-debugger § G) or cluster-creation EFA / SSM failures (→ hyperpod-cluster-debugger § A / § F).

ユースケース

✓トレーニングがハングするとき
✓AllReduce操作がタイムアウトするとき
✓Pod間のNCCLバージョン不一致を調査するとき
✓コンテナがOOM/CrashLoopBackOff状態になるとき
✓GPU間の通信エラーを診断するとき

本文

HyperPod NCCL Debugger

Operating policy. Run read-only diagnostics yourself. Never run a command that changes cluster, node, or workload state — present each one as a Suggested command (run this yourself) block and wait for the customer. Destructive order: investigate → reboot → replace (replace destroys root + secondary volumes; not supported on Slurm controller nodes). Never discard training state on speculation.

Diagnose NCCL failures on SageMaker HyperPod (EKS and Slurm). scripts/nccl-diagnose.sh reads state via AWS APIs, kubectl, and SSM, then prints each issue as [FAIL] ... → references/<file>.md § <section>. Read-only.

Signal sourcing: list-cluster-events carries infrastructure-level state only (lifecycle, bootstrap, EFA health check, capacity, replacement, reboot, AMI rollback). It does not carry NCCL timeouts, GPU XID/ECC, or per-pod training signals — those come from pod logs, CloudWatch training streams, on-node SSM probes, and NCCL env audit. "No events" on a training-time NCCL issue is expected, not a clean bill of health.

Workflow

Collect cluster name, region, namespace/job (EKS), exact NCCL error string.
Run the diagnostic (always — the output drives everything else).
For every [FAIL] line, Read the referenced section.
Present finding, root cause, and the Suggested-command block with concrete values (instance IDs, SG IDs, namespaces) filled in from the script output. Wait for customer approval.
Re-run the diagnostic to confirm.

If a finding has no matching section, report it as a bug — do not invent a fix.

Step 1: Authenticate kubectl (EKS)

EKS_ARN=$(aws sagemaker describe-cluster --cluster-name <HYPERPOD-NAME> --region <REGION> \
  --query 'Orchestrator.Eks.ClusterArn' --output text)
EKS_NAME=$(echo "$EKS_ARN" | awk -F'/' '{print $NF}')
aws eks update-kubeconfig --name "$EKS_NAME" --region <REGION>
kubectl get nodes

Step 2: Run the diagnostic

# Basic:
bash scripts/nccl-diagnose.sh --cluster <HYPERPOD-NAME> --region <REGION>

# Scope to an EKS job/namespace:
bash scripts/nccl-diagnose.sh --cluster <NAME> --region <REGION> --namespace <NS> --job <JOB>

# Force orchestrator:
bash scripts/nccl-diagnose.sh --cluster <NAME> --region <REGION> --orchestrator slurm

# Larger hardware sample (default 3):
bash scripts/nccl-diagnose.sh --cluster <NAME> --region <REGION> --sample-nodes 10

# Specific node only:
bash scripts/nccl-diagnose.sh --cluster <NAME> --region <REGION> --node i-0abc123def456

Tags: [PASS] · [FAIL] (counted in Issues Found, has reference pointer) · [WARN] · [INFO]. Priorities: P0 blocks training · P1 degraded · P2 informational.

Remediation index

Each [FAIL] line in the script already points directly at the right section. This table is a lookup for manual triage.

Finding	Section
SG missing inbound/outbound self-reference	operations.md § 8
Blocking NetworkPolicy / allow-all missing	operations.md § 8
Slurm node DOWN / DRAINING / RemoveIPC	operations.md § 7
GPU XID / SYSTEM_ERROR / hardware fault	hyperpod-node-debugger § F / § G
GPU row-remap / DCGM Fail / silent NaNs	hyperpod-node-debugger § G.1.a/b
NCCL timeout / rendezvous / straggler	debugging-guide.md § 1
EFA configuration / not used	debugging-guide.md § 6
EFA TCP fallback (`NET/OFI Using TCP`)	debugging-guide.md § 13
NCCL version mismatch across pods	debugging-guide.md § 10
Container OOM (pod killed, exit 137)	debugging-guide.md § 4
GPU OOM (`CUDA out of memory`)	debugging-guide.md § 11
RDMA memlock / `/dev/shm` too small	debugging-guide.md § 17
MASTER_ADDR DNS / headless Service	debugging-guide.md § 12
NVLS / PXN / topology tuning	debugging-guide.md § 19
Any NCCL / EFA / rendezvous log pattern	error-patterns-quick-ref.md
Performance / nccl-tests / bandwidth	performance-testing.md

Prerequisites

aws CLI v2.13+ authenticated (aws sts get-caller-identity)
jq, python3, bash 4.2+
unbuffer (from the expect package: yum install expect / apt install expect)
kubectl authenticated to the EKS cluster (K8s checks skipped if absent)
session-manager-plugin for on-node hardware checks

Defaults

Region — required: pass --region or set $AWS_DEFAULT_REGION.
Orchestrator — auto-detected; override with --orchestrator eks|slurm.
Namespace / job (EKS) — all namespaces; scope with --namespace <NS> --job <JOB>.
Hardware sampling — 3 nodes over SSM (capped at 50). --node <ID> for a specific node. Node probes run serially (180 s per node): --sample-nodes 10 can take ~30 min.
CloudWatch window — last 2 hours.
Colors — auto-disabled on non-TTY or TERM=dumb.

Error handling

Failure	Script	Tell the customer
`aws sts get-caller-identity` fails	Exit 1 with the AWS error	"Fix AWS credentials and rerun."
`describe-cluster` AccessDenied	Warn, add `Missing IAM for sagemaker:DescribeCluster`	"Grant `sagemaker:DescribeCluster` (operations.md § 2)."
Cluster not found	Exit 1 after listing region's clusters	"Confirm HyperPod cluster name and region."
`kubectl` absent / unauthenticated	Warn, skip K8s checks	"`aws eks update-kubeconfig --name <EKS> --region <R>`."
SSM plugin absent	Warn, skip on-node hardware checks	"Install session-manager-plugin."
SSM times out (180s)	Partial output, mark node unreachable	"Rerun with `--node <ID> --sample-nodes 1`; check SSM agent on the node."
CloudWatch log group not found	Skip CloudWatch scan	"Enable CloudWatch on the cluster (operations.md § 4)."
Cluster events API throttled	Warn, continue with partial data	"Rerun later — script is idempotent."

Exit codes: 0 diagnostic complete · 1 fatal prerequisite missing or cluster unreachable.

IAM permissions

Full policy + RBAC in operations.md § 2. SSM on HyperPod uses start-session against sagemaker-cluster:<cluster-id>_<group>-<iid> targets — grant ssm:StartSession / ssm:TerminateSession, not ssm:SendCommand.

Scale strategy

Scope	Method	Coverage
All nodes	`sagemaker:ListClusterNodes` (paginated)	100% nodes
All K8s objects	`kubectl`	100% pods/nodes/policies
Hardware	SSM `--sample-nodes N` (default 3)	Sampled
Node logs	CloudWatch	100% nodes

Large clusters: the PyTorch NCCL backend defaults to a 10-minute collective-op timeout (per the PyTorch distributed docs). Large clusters routinely exceed that on first rendezvous; raise it via torch.distributed.init_process_group(timeout=timedelta(seconds=<N>)). HyperPod support has also observed NCCL topology-graph-search hangs on 256+ node clusters when memlock is unlimited; using a large fixed memlock (e.g. 8388608) in pod securityContext or /etc/security/limits.conf has cleared these in field cases. This memlock pattern is a field observation, not AWS- or NCCL-documented behavior.

For FSDP, DeepSpeed, or Megatron-LM tuning: debugging-guide.md § 18.

Skill delegation

Need	Use
Cluster creation / deployment failures	`hyperpod-cluster-debugger` (§ A / B / C / H + `--validate`)
Post-deployment cluster-wide management	`hyperpod-cluster-debugger`
Per-node issues (disk, lifecycle, hardware)	`hyperpod-node-debugger`
Trainium/Inferentia collective-comm (AWS Neuron Collectives, not NCCL)	`hyperpod-node-debugger` § G.2
Shell on nodes	`hyperpod-ssm`
Version comparison across nodes	`hyperpod-version-checker`
Diagnostic bundle for AWS Support	`hyperpod-issue-report`
MFU / performance degradation	`hyperpod-mfu-debugger`

Escalate to AWS Support

Escalate when:

All SG rules correct, EFA verified on-node, but NCCL still times out.
Hardware checks pass on all nodes but AllReduce still hangs.
Issues Found: 0 but training still fails.
GPU XID errors persist after node replacement.
Collective-op timeout raised and memlock workaround applied but large-cluster rendezvous still hangs.

Before opening the case

# 1. Cluster identity + status
aws sagemaker describe-cluster --cluster-name <C> --region <R>

# 2. Full NCCL diagnostic (sample more nodes for escalation)
bash scripts/nccl-diagnose.sh --cluster <C> --region <R> --sample-nodes 10 > nccl-diag.txt

# 3. Per-node log/config bundle to S3 (delegates to hyperpod-issue-report)
#    See skills/hyperpod-issue-report/SKILL.md for the exact invocation.

Include in the case

Cluster name + ARN and AWS region
Orchestrator (EKS or Slurm) and EKS cluster name / Slurm controller node
Timestamp window (UTC start / end) of the failure
Exact NCCL / libfabric error strings (copy verbatim from pod logs or journalctl)
Affected instance IDs / node names / pod names / namespace / job name
nccl-diag.txt from step 2 above
S3 URI of the hyperpod-issue-report bundle from step 3
NCCL env vars in effect (printenv | grep -E '^NCCL|^FI_|^TORCH_' from one pod)

References

error-patterns-quick-ref.md — log pattern → code → fix table
debugging-guide.md — per-scenario procedures (21 sections incl. NVLS/PXN/topology)
performance-testing.md — nccl-tests, bandwidth thresholds, straggler detection
operations.md — IAM, SSM format, CloudWatch, env-var reference, node labels, Slurm ops, remediations

原文・著作権は Anthropic および各プラグイン作者に帰属します。日本語訳は Claude API による自動翻訳です。