🔍hyperpod-nccl
- Plugin: sagemaker-ai
Description
Diagnose NCCL failures and adjacent training-pod failures on HyperPod GPU clusters (EKS or Slurm): training hangs, AllReduce / collective-op timeouts, EFA or libfabric errors, rendezvous failures, EFA TCP fallback, /dev/shm or memlock issues, NCCL version mismatch across pods, container OOM / exit-137 / OOMKilled, GPU OOM (CUDA out of memory), CrashLoopBackOff / Pending pods, MASTER_ADDR DNS resolution failures, and NetworkPolicy blocking. Not for single-node hardware faults (→ hyperpod-node-debugger § G) or cluster-creation EFA / SSM failures (→ hyperpod-cluster-debugger § A / § F).
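Use cases
- ✓ When training hangs
- ✓ When AllReduce operations time out
- ✓ When investigating an NCCL version mismatch across pods
- ✓ When a container is OOM-killed or stuck in CrashLoopBackOff
- ✓ When diagnosing inter-GPU communication errors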
Content
HyperPod NCCL Debugger
Operating policy. Run read-only diagnostics yourself. Never run a command that changes cluster, node, or workload state — present each one as a Suggested command (run this yourself) block and wait for the customer. Destructive order: investigate → reboot → replace (replace destroys root + secondary volumes; not supported on Slurm controller nodes). Never discard training state on speculation.
Diagnose NCCL failures on SageMaker HyperPod (EKS and Slurm). scripts/nccl-diagnose.sh reads state via AWS APIs, kubectl, and SSM, then prints each issue as [FAIL] ... → references/<file>.md § <section>. Read-only.
Signal sourcing: list-cluster-events carries infrastructure-level state only (lifecycle, bootstrap, EFA health check, capacity, replacement, reboot, AMI rollback). It does not carry NCCL timeouts, GPU XID/ECC, or per-pod training signals — those come from pod logs, CloudWatch training streams, on-node SSM probes, and NCCL env audit. "No events" on a training-time NCCL issue is expected, not a clean bill of health.
Workflow
- Collect cluster name, region, namespace/job (EKS), exact NCCL error string.
- Run the diagnostic (always — the output drives everything else).
- For every `[FAIL]` line, `Read` the referenced section.
- Present finding, root cause, and the Suggested-command block with concrete values (instance IDs, SG IDs, namespaces) filled in from the script output. Wait for customer approval.
- Re-run the diagnostic to confirm.
If a finding has no matching section, report it as a bug — do not invent a fix.
Step 1: Authenticate kubectl (EKS)
EKS_ARN=$(aws sagemaker describe-cluster --cluster-name <HYPERPOD-NAME> --region <REGION> \
--query 'Orchestrator.Eks.ClusterArn' --output text)
EKS_NAME=$(echo "$EKS_ARN" | awk -F'/' '{print $NF}')
aws eks update-kubeconfig --name "$EKS_NAME" --region <REGION>
kubectl get nodes
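The `awk` step above takes the last `/`-separated field of the EKS cluster ARN. A minimal sketch of the same extraction with a guard for an empty result follows. `eks_name_from_arn` is a hypothetical helper name, not part of the skill, and the guard assumes the AWS CLI prints `None` in text output when the `--query` result is null (as could happen on a cluster with no EKS orchestrator).

```shell
# Hypothetical helper: derive the EKS cluster name from its ARN.
# Returns non-zero when the ARN is empty or "None" (assumed to be what
# `--output text` prints when the --query result is null).
eks_name_from_arn() {
  arn=$1
  case "$arn" in
    ""|None)
      echo "no EKS orchestrator ARN returned" >&2
      return 1
      ;;
  esac
  # Strip everything up to and including the last "/".
  printf '%s\n' "${arn##*/}"
}
```

With this in place, `EKS_NAME=$(eks_name_from_arn "$EKS_ARN")` can replace the `awk` line, and the `update-kubeconfig` call can be skipped when the helper fails.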
Step 2: Run the diagnostic
# Basic:
bash scripts/nccl-diagnose.sh --cluster <HYPERPOD-NAME> --region <REGION>
# Scope to an EKS job/namespace:
bash scripts/nccl-diagnose.sh --cluster <NAME> --region <REGION> --namespace <NS> --job <JOB>
# Force orchestrator:
bash scripts/nccl-diagnose.sh --cluster <NAME> --region <REGION> --orchestrator slurm
# Larger hardware sample (default 3):
bash scripts/nccl-diagnose.sh --cluster <NAME> --region <REGION> --sample-nodes 10
# Specific node only:
bash scripts/nccl-diagnose.sh --cluster <NAME> --region <REGION> --node i-0abc123def456
Tags: [PASS] · [FAIL] (counted in Issues Found, has reference pointer) · [WARN] · [INFO]. Priorities: P0 blocks training · P1 degraded · P2 informational.
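Because each `[FAIL]` line ends with a pointer of the form `→ references/<file>.md § <section>`, the sections to read can be listed from saved output. The sketch below assumes the output was redirected to a file (so colors are off) and that the arrow is the literal `→` character; `list_fail_refs` is a hypothetical helper name.

```shell
# Hypothetical helper: print the unique reference pointers from [FAIL] lines.
# Assumes the line format "[FAIL] ... → references/<file>.md § <section>".
list_fail_refs() {
  # -F treats "[FAIL]" as a literal string, not a bracket expression.
  grep -F '[FAIL]' "$1" |
    sed -n 's/.*→ *//p' |   # keep only the text after the last arrow
    sort -u                 # one line per distinct section
}
```

For example, after `bash scripts/nccl-diagnose.sh --cluster <NAME> --region <REGION> > nccl-diag.txt`, running `list_fail_refs nccl-diag.txt` would print one line per distinct section to read.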
Remediation index
Each [FAIL] line in the script already points directly at the right section. This table is a lookup for manual triage.
| Finding | Section |
|---|---|
| SG missing inbound/outbound self-reference | operations.md § 8 |
| Blocking NetworkPolicy / allow-all missing | operations.md § 8 |
| Slurm node DOWN / DRAINING / RemoveIPC | operations.md § 7 |
| GPU XID / SYSTEM_ERROR / hardware fault | hyperpod-node-debugger § F / § G |
| GPU row-remap / DCGM Fail / silent NaNs | hyperpod-node-debugger § G.1.a/b |
| NCCL timeout / rendezvous / straggler | debugging-guide.md § 1 |
| EFA configuration / not used | debugging-guide.md § 6 |
| EFA TCP fallback (`NET/OFI Using TCP`) | debugging-guide.md § 13 |
| NCCL version mismatch across pods | debugging-guide.md § 10 |
| Container OOM (pod killed, exit 137) | debugging-guide.md § 4 |
| GPU OOM (`CUDA out of memory`) | debugging-guide.md § 11 |
| RDMA memlock / `/dev/shm` too small | debugging-guide.md § 17 |
| MASTER_ADDR DNS / headless Service | debugging-guide.md § 12 |
| NVLS / PXN / topology tuning | debugging-guide.md § 19 |
| Any NCCL / EFA / rendezvous log pattern | error-patterns-quick-ref.md |
| Performance / nccl-tests / bandwidth | performance-testing.md |
Prerequisites
- `aws` CLI v2.13+ authenticated (`aws sts get-caller-identity`)
- `jq`, `python3`, `bash` 4.2+
- `unbuffer` (from the `expect` package: `yum install expect` / `apt install expect`)
- `kubectl` authenticated to the EKS cluster (K8s checks skipped if absent)
- `session-manager-plugin` for on-node hardware checks
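A quick way to see which of these tools are missing before running the diagnostic is sketched below. `missing_tools` is a hypothetical helper; it only checks that each command is on `PATH` and does not verify versions or authentication.

```shell
# Hypothetical helper: print each command from the argument list that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Tool names are taken from the prerequisites list above.
missing_tools aws jq python3 bash unbuffer kubectl session-manager-plugin
```

Empty output means every listed command was found; a missing `kubectl` or `session-manager-plugin` only causes the corresponding checks to be skipped.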
Defaults
- Region (required): pass `--region` or set `$AWS_DEFAULT_REGION`.
- Orchestrator: auto-detected; override with `--orchestrator eks|slurm`.
- Namespace / job (EKS): all namespaces; scope with `--namespace <NS> --job <JOB>`.
- Hardware sampling: 3 nodes over SSM (capped at 50). Use `--node <ID>` for a specific node. Node probes run serially (180 s per node): `--sample-nodes 10` can take ~30 min.
- CloudWatch window: last 2 hours.
- Colors: auto-disabled on non-TTY or `TERM=dumb`.
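The ~30 min figure follows from serial probing at up to 180 s per node. A small sketch of that arithmetic, assuming every probe runs to the full per-node limit (`probe_minutes` is a hypothetical helper name):

```shell
# Hypothetical helper: worst-case wall time, in minutes, for serial node probes.
# Assumes every probe runs to the 180-second per-node limit stated above.
probe_minutes() {
  nodes=$1
  per_node_seconds=180
  echo $(( nodes * per_node_seconds / 60 ))
}

probe_minutes 10   # 10 nodes * 180 s = 1800 s, prints 30
```

At the 50-node cap the same worst case comes to 150 minutes, so larger samples are better reserved for escalation runs.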
Error handling
| Failure | Script | Tell the customer |
|---|---|---|
| `aws sts get-caller-identity` fails | Exit 1 with the AWS error | "Fix AWS credentials and rerun." |
| `describe-cluster` AccessDenied | Warn, add Missing IAM for `sagemaker:DescribeCluster` | "Grant `sagemaker:DescribeCluster` (operations.md § 2)." |
| Cluster not found | Exit 1 after listing region's clusters | "Confirm HyperPod cluster name and region." |
| `kubectl` absent / unauthenticated | Warn, skip K8s checks | "`aws eks update-kubeconfig --name <EKS> --region <R>`." |
| SSM plugin absent | Warn, skip on-node hardware checks | "Install session-manager-plugin." |
| SSM times out (180s) | Partial output, mark node unreachable | "Rerun with --node <ID> --sample-nodes 1; check SSM agent on the node." |
| CloudWatch log group not found | Skip CloudWatch scan | "Enable CloudWatch on the cluster (operations.md § 4)." |
| Cluster events API throttled | Warn, continue with partial data | "Rerun later — script is idempotent." |
Exit codes: 0 diagnostic complete · 1 fatal prerequisite missing or cluster unreachable.
IAM permissions
Full policy + RBAC in operations.md § 2. SSM on HyperPod uses start-session against sagemaker-cluster:<cluster-id>_<group>-<iid> targets — grant ssm:StartSession / ssm:TerminateSession, not ssm:SendCommand.
Scale strategy
| Scope | Method | Coverage |
|---|---|---|
| All nodes | `sagemaker:ListClusterNodes` (paginated) | 100% nodes |
| All K8s objects | `kubectl` | 100% pods/nodes/policies |
| Hardware | SSM `--sample-nodes N` (default 3) | Sampled |
| Node logs | CloudWatch | 100% nodes |
Large clusters: the PyTorch NCCL backend defaults to a 10-minute collective-op timeout (per the PyTorch distributed docs). Large clusters routinely exceed that on first rendezvous; raise it via torch.distributed.init_process_group(timeout=timedelta(seconds=<N>)). HyperPod support has also observed NCCL topology-graph-search hangs on 256+ node clusters when memlock is unlimited; using a large fixed memlock (e.g. 8388608) in pod securityContext or /etc/security/limits.conf has cleared these in field cases. This memlock pattern is a field observation, not AWS- or NCCL-documented behavior.
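To see whether a process is running with an unlimited memlock limit, the value reported by `ulimit -l` can be inspected from inside the training container. The sketch below is only a read-only check under the assumption that the shell supports `ulimit -l` (bash does); `classify_memlock` is a hypothetical helper, and it makes no claim about which value is appropriate for a given cluster.

```shell
# Hypothetical helper: classify a memlock limit as reported by `ulimit -l`.
classify_memlock() {
  case "$1" in
    unlimited)   echo "unlimited" ;;  # the case tied to the field observation above
    ''|*[!0-9]*) echo "unknown" ;;    # empty or non-numeric value
    *)           echo "fixed" ;;      # a numeric limit is in effect
  esac
}

# Report the limit for the current shell; run this inside the training
# container to see what the training process inherits.
echo "memlock: $(ulimit -l) ($(classify_memlock "$(ulimit -l)"))"
```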
For FSDP, DeepSpeed, or Megatron-LM tuning: debugging-guide.md § 18.
Skill delegation
| Need | Use |
|---|---|
| Cluster creation / deployment failures | hyperpod-cluster-debugger (§ A / B / C / H + --validate) |
| Post-deployment cluster-wide management | hyperpod-cluster-debugger |
| Per-node issues (disk, lifecycle, hardware) | hyperpod-node-debugger |
| Trainium/Inferentia collective-comm (AWS Neuron Collectives, not NCCL) | hyperpod-node-debugger § G.2 |
| Shell on nodes | hyperpod-ssm |
| Version comparison across nodes | hyperpod-version-checker |
| Diagnostic bundle for AWS Support | hyperpod-issue-report |
| MFU / performance degradation | hyperpod-mfu-debugger |
Escalate to AWS Support
Escalate when:
- All SG rules correct, EFA verified on-node, but NCCL still times out.
- Hardware checks pass on all nodes but AllReduce still hangs.
- `Issues Found: 0` but training still fails.
- GPU XID errors persist after node replacement.
- Collective-op timeout raised and memlock workaround applied but large-cluster rendezvous still hangs.
Before opening the case
# 1. Cluster identity + status
aws sagemaker describe-cluster --cluster-name <C> --region <R>
# 2. Full NCCL diagnostic (sample more nodes for escalation)
bash scripts/nccl-diagnose.sh --cluster <C> --region <R> --sample-nodes 10 > nccl-diag.txt
# 3. Per-node log/config bundle to S3 (delegates to hyperpod-issue-report)
# See skills/hyperpod-issue-report/SKILL.md for the exact invocation.
Include in the case
- Cluster name + ARN and AWS region
- Orchestrator (EKS or Slurm) and EKS cluster name / Slurm controller node
- Timestamp window (UTC start / end) of the failure
- Exact NCCL / libfabric error strings (copy verbatim from pod logs or journalctl)
- Affected instance IDs / node names / pod names / namespace / job name
- `nccl-diag.txt` from step 2 above
- S3 URI of the `hyperpod-issue-report` bundle from step 3
- NCCL env vars in effect (`printenv | grep -E '^NCCL|^FI_|^TORCH_'` from one pod)
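The environment capture in the last item can be wrapped so that the output is sorted and easy to compare between pods. This is a sketch only: `filter_comm_env` is a hypothetical helper that applies the same pattern as above to `printenv`-style input on stdin.

```shell
# Hypothetical helper: keep only NCCL, libfabric (FI_), and PyTorch (TORCH_)
# variables from `printenv`-style input on stdin, sorted for easy diffing.
filter_comm_env() {
  grep -E '^NCCL|^FI_|^TORCH_' | sort
}
```

On EKS, one possible invocation is `kubectl exec -n <NS> <POD> -- printenv | filter_comm_env > nccl-env.txt`, with the namespace and pod name filled in from the affected job.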
References
- error-patterns-quick-ref.md — log pattern → code → fix table
- debugging-guide.md — per-scenario procedures (21 sections incl. NVLS/PXN/topology)
- performance-testing.md — nccl-tests, bandwidth thresholds, straggler detection
- operations.md — IAM, SSM format, CloudWatch, env-var reference, node labels, Slurm ops, remediations
The original text and copyright belong to Anthropic and the respective plugin authors.