Encountering NCCL error when running PyTorch example code #1504

Noblezhong · 2024-11-03T07:56:02Z

Hi! When I try to run a Python script for LLM inference with pipeline parallelism on a single server with multiple GPUs, it fails with NCCL-related errors.

Here is my development environment:

(pippy) root@678c7278cb2d:/zt/code/my_dev# pip list
Package                  Version
------------------------ -----------
accelerate               1.0.1
aiohappyeyeballs         2.4.0
aiohttp                  3.10.5
aiosignal                1.2.0
async-timeout            4.0.3
attrs                    24.2.0
autocommand              2.2.2
backports.tarfile        1.2.0
Bottleneck               1.3.7
Brotli                   1.0.9
certifi                  2024.8.30
charset-normalizer       3.3.2
datasets                 2.19.1
dill                     0.3.8
filelock                 3.13.1
frozenlist               1.4.0
fsspec                   2024.10.0
huggingface_hub          0.24.6
idna                     3.7
importlib_metadata       8.0.0
importlib_resources      6.4.0
inflect                  7.3.1
jaraco.context           5.3.0
jaraco.functools         4.0.1
jaraco.text              3.12.1
Jinja2                   3.1.4
MarkupSafe               3.0.2
mkl_fft                  1.3.10
mkl_random               1.2.7
mkl-service              2.4.0
more-itertools           10.3.0
mpmath                   1.3.0
multidict                6.0.4
multiprocess             0.70.15
networkx                 3.4.2
numexpr                  2.8.7
numpy                    1.26.4
nvidia-cublas-cu12       12.4.5.8
nvidia-cuda-cupti-cu12   12.4.127
nvidia-cuda-nvrtc-cu12   12.4.127
nvidia-cuda-runtime-cu12 12.4.127
nvidia-cudnn-cu12        9.1.0.70
nvidia-cufft-cu12        11.2.1.3
nvidia-curand-cu12       10.3.5.147
nvidia-cusolver-cu12     11.6.1.9
nvidia-cusparse-cu12     12.3.1.170
nvidia-nccl-cu12         2.21.5
nvidia-nvjitlink-cu12    12.4.127
nvidia-nvtx-cu12         12.4.127
ordered-set              4.1.0
packaging                24.1
pandas                   2.2.2
pip                      24.2
platformdirs             4.2.2
psutil                   6.1.0
pyarrow                  16.1.0
PySocks                  1.7.1
python-dateutil          2.9.0.post0
pytz                     2024.1
PyYAML                   6.0.2
regex                    2024.9.11
requests                 2.32.3
safetensors              0.4.5
setuptools               72.1.0
six                      1.16.0
sympy                    1.13.1
tokenizers               0.15.2
tomli                    2.0.1
torch                    2.5.0
torchpippy               0.2.0
tqdm                     4.66.5
transformers             4.36.2
triton                   3.1.0
typeguard                4.3.0
typing_extensions        4.11.0
tzdata                   2023.3
urllib3                  2.2.3
wheel                    0.44.0
xxhash                   2.0.2
yarl                     1.11.0
zipp                     3.19.2

Here is the command I use to run the script:

torchrun --nproc-per-node 4 pippy_llama.py

Here is the error output:

W1103 07:04:16.199000 21556 site-packages/torch/distributed/run.py:793] 
W1103 07:04:16.199000 21556 site-packages/torch/distributed/run.py:793] *****************************************
W1103 07:04:16.199000 21556 site-packages/torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1103 07:04:16.199000 21556 site-packages/torch/distributed/run.py:793] *****************************************
/root/miniconda3/envs/pippy/lib/python3.10/site-packages/transformers/utils/generic.py:441: FutureWarning: `torch.utils._pytree._register_pytree_node` is deprecated. Please use `torch.utils._pytree.register_pytree_node` instead.
  _torch_pytree._register_pytree_node(
/root/miniconda3/envs/pippy/lib/python3.10/site-packages/transformers/utils/generic.py:441: FutureWarning: `torch.utils._pytree._register_pytree_node` is deprecated. Please use `torch.utils._pytree.register_pytree_node` instead.
  _torch_pytree._register_pytree_node(
/root/miniconda3/envs/pippy/lib/python3.10/site-packages/transformers/utils/generic.py:441: FutureWarning: `torch.utils._pytree._register_pytree_node` is deprecated. Please use `torch.utils._pytree.register_pytree_node` instead.
  _torch_pytree._register_pytree_node(
/root/miniconda3/envs/pippy/lib/python3.10/site-packages/transformers/utils/generic.py:441: FutureWarning: `torch.utils._pytree._register_pytree_node` is deprecated. Please use `torch.utils._pytree.register_pytree_node` instead.
  _torch_pytree._register_pytree_node(
/root/miniconda3/envs/pippy/lib/python3.10/site-packages/transformers/utils/generic.py:309: FutureWarning: `torch.utils._pytree._register_pytree_node` is deprecated. Please use `torch.utils._pytree.register_pytree_node` instead.
  _torch_pytree._register_pytree_node(
/root/miniconda3/envs/pippy/lib/python3.10/site-packages/transformers/utils/generic.py:309: FutureWarning: `torch.utils._pytree._register_pytree_node` is deprecated. Please use `torch.utils._pytree.register_pytree_node` instead.
  _torch_pytree._register_pytree_node(
/root/miniconda3/envs/pippy/lib/python3.10/site-packages/transformers/utils/generic.py:309: FutureWarning: `torch.utils._pytree._register_pytree_node` is deprecated. Please use `torch.utils._pytree.register_pytree_node` instead.
  _torch_pytree._register_pytree_node(
/root/miniconda3/envs/pippy/lib/python3.10/site-packages/transformers/utils/generic.py:309: FutureWarning: `torch.utils._pytree._register_pytree_node` is deprecated. Please use `torch.utils._pytree.register_pytree_node` instead.
  _torch_pytree._register_pytree_node(
/root/miniconda3/envs/pippy/lib/python3.10/site-packages/transformers/utils/generic.py:309: FutureWarning: `torch.utils._pytree._register_pytree_node` is deprecated. Please use `torch.utils._pytree.register_pytree_node` instead.
  _torch_pytree._register_pytree_node(
/root/miniconda3/envs/pippy/lib/python3.10/site-packages/transformers/utils/generic.py:309: FutureWarning: `torch.utils._pytree._register_pytree_node` is deprecated. Please use `torch.utils._pytree.register_pytree_node` instead.
  _torch_pytree._register_pytree_node(
/root/miniconda3/envs/pippy/lib/python3.10/site-packages/transformers/utils/generic.py:309: FutureWarning: `torch.utils._pytree._register_pytree_node` is deprecated. Please use `torch.utils._pytree.register_pytree_node` instead.
  _torch_pytree._register_pytree_node(
/root/miniconda3/envs/pippy/lib/python3.10/site-packages/transformers/utils/generic.py:309: FutureWarning: `torch.utils._pytree._register_pytree_node` is deprecated. Please use `torch.utils._pytree.register_pytree_node` instead.
  _torch_pytree._register_pytree_node(
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:14<00:00,  7.35s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:14<00:00,  7.39s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:14<00:00,  7.39s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:19<00:00,  9.69s/it]
layers_per_rank = 8
layers_per_rank = 8
layers_per_rank = 8
layers_per_rank = 8
[rank0]:[W1103 07:04:58.102756460 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())
[rank3]:[E1103 07:34:57.813334593 ProcessGroupNCCL.cpp:616] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=COALESCED, NumelIn=18446744073709551615, NumelOut=18446744073709551615, Timeout(ms)=1800000) ran for 1800007 milliseconds before timing out.
[rank3]:[E1103 07:34:57.815173839 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 3] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank1]:[E1103 07:34:57.823033599 ProcessGroupNCCL.cpp:616] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=COALESCED, NumelIn=18446744073709551615, NumelOut=18446744073709551615, Timeout(ms)=1800000) ran for 1800014 milliseconds before timing out.
[rank3]: Traceback (most recent call last):
[rank3]:   File "/zt/code/my_dev/pippy_llama.py", line 59, in <module>
[rank3]:     output = schedule.step(args)
[rank3]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/pipelining/schedules.py", line 615, in step
[rank3]:     self._step_microbatches(args_split, kwargs_split, targets_split, losses)
[rank3]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/pipelining/schedules.py", line 702, in _step_microbatches
[rank3]:     works = _sorted_batch_p2p(ops, desc="fwd_recv")
[rank3]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/pipelining/schedules.py", line 549, in _sorted_batch_p2p
[rank3]:     work_by_peer[peer] = _batch_p2p(ops, desc=desc)
[rank3]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/pipelining/schedules.py", line 524, in _batch_p2p
[rank3]:     return dist.batch_isend_irecv(p2p_ops).pop()
[rank3]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2370, in batch_isend_irecv
[rank3]:     with _coalescing_manager(group, device, async_ops=True) as cm:
[rank3]:   File "/root/miniconda3/envs/pippy/lib/python3.10/contextlib.py", line 142, in __exit__
[rank3]:     next(self.gen)
[rank3]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2317, in _coalescing_manager
[rank3]:     work = group._end_coalescing(device)
[rank3]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:4409, internal error - please report this issue to the NCCL developers, NCCL version 2.21.5
[rank3]: ncclInternalError: Internal check failed.
[rank3]: Last error:

[rank1]:[E1103 07:34:57.823718519 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank2]:[E1103 07:34:57.825270404 ProcessGroupNCCL.cpp:616] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=COALESCED, NumelIn=18446744073709551615, NumelOut=18446744073709551615, Timeout(ms)=1800000) ran for 1800015 milliseconds before timing out.
[rank2]:[E1103 07:34:57.826580065 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 2] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank2]: Traceback (most recent call last):
[rank2]:   File "/zt/code/my_dev/pippy_llama.py", line 59, in <module>
[rank2]:     output = schedule.step(args)
[rank2]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/pipelining/schedules.py", line 615, in step
[rank2]:     self._step_microbatches(args_split, kwargs_split, targets_split, losses)
[rank2]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/pipelining/schedules.py", line 702, in _step_microbatches
[rank2]:     works = _sorted_batch_p2p(ops, desc="fwd_recv")
[rank2]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/pipelining/schedules.py", line 549, in _sorted_batch_p2p
[rank2]:     work_by_peer[peer] = _batch_p2p(ops, desc=desc)
[rank2]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/pipelining/schedules.py", line 524, in _batch_p2p
[rank2]:     return dist.batch_isend_irecv(p2p_ops).pop()
[rank2]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2370, in batch_isend_irecv
[rank2]:     with _coalescing_manager(group, device, async_ops=True) as cm:
[rank2]:   File "/root/miniconda3/envs/pippy/lib/python3.10/contextlib.py", line 142, in __exit__
[rank2]:     next(self.gen)
[rank2]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2317, in _coalescing_manager
[rank2]:     work = group._end_coalescing(device)
[rank2]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:4409, internal error - please report this issue to the NCCL developers, NCCL version 2.21.5
[rank2]: ncclInternalError: Internal check failed.
[rank2]: Last error:

[rank2]:[E1103 07:34:57.096569447 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 2] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank2]:[E1103 07:34:57.096590567 ProcessGroupNCCL.cpp:630] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E1103 07:34:57.096597157 ProcessGroupNCCL.cpp:636] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E1103 07:34:57.097694314 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=COALESCED, NumelIn=18446744073709551615, NumelOut=18446744073709551615, Timeout(ms)=1800000) ran for 1800015 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7e974fc45446 in /root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7e970542a762 in /root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7e9705431ba3 in /root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7e970543360d in /root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7e974fdac5c0 in /root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x7e9750669ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7e97506fabf4 in /usr/lib/x86_64-linux-gnu/libc.so.6)

../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [64,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [65,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [66,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [67,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [68,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [69,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [70,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [71,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [72,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [73,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [74,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [75,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [76,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [77,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [78,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [79,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [80,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [81,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [82,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [83,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [84,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [85,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [86,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [87,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [88,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [89,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [90,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [91,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [92,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [93,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [94,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [95,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [96,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [97,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [98,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [99,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [100,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [101,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [102,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [103,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [104,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [105,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [106,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [107,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [108,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [109,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [110,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [111,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [112,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [113,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [114,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [115,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [116,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [117,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [118,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [119,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [120,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [121,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [122,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [123,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [124,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [125,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [126,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [127,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [0,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [1,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [2,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [3,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [4,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [5,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [6,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [7,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [8,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [9,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [10,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [11,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [12,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [13,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [14,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [15,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [16,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [17,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [18,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [19,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [20,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [21,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [22,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [23,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [24,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [25,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [26,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [27,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [28,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [29,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [30,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [31,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [32,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [33,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [34,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [35,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [36,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [37,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [38,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [39,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [40,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [41,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [42,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [43,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [44,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [45,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [46,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [47,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [48,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [49,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [50,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [51,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [52,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [53,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [54,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [55,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [56,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [57,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [58,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [59,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [60,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [61,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [62,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [63,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
[rank1]: Traceback (most recent call last):
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/pipelining/stage.py", line 580, in forward_one_chunk
[rank1]:     output = self.forward_maybe_with_nosync(*composite_args, **composite_kwargs)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/pipelining/stage.py", line 468, in forward_maybe_with_nosync
[rank1]:     out_val = self.submod(*args, **kwargs)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/fx/graph_module.py", line 784, in call_wrapped
[rank1]:     return self._wrapped_call(self, *args, **kwargs)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/fx/graph_module.py", line 361, in __call__
[rank1]:     raise e
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/fx/graph_module.py", line 348, in __call__
[rank1]:     return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:   File "<eval_with_key>.263", line 5, in forward
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/export/unflatten.py", line 136, in forward
[rank1]:     return torch.fx.Interpreter(self, graph=self.graph).run(
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/fx/interpreter.py", line 146, in run
[rank1]:     self.env[node] = self.run_node(node)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/fx/interpreter.py", line 203, in run_node
[rank1]:     return getattr(self, n.op)(n.target, args, kwargs)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/fx/interpreter.py", line 320, in call_module
[rank1]:     return submod(*args, **kwargs)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/export/unflatten.py", line 136, in forward
[rank1]:     return torch.fx.Interpreter(self, graph=self.graph).run(
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/fx/interpreter.py", line 146, in run
[rank1]:     self.env[node] = self.run_node(node)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/fx/interpreter.py", line 203, in run_node
[rank1]:     return getattr(self, n.op)(n.target, args, kwargs)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/fx/interpreter.py", line 320, in call_module
[rank1]:     return submod(*args, **kwargs)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/export/unflatten.py", line 136, in forward
[rank1]:     return torch.fx.Interpreter(self, graph=self.graph).run(
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/fx/interpreter.py", line 146, in run
[rank1]:     self.env[node] = self.run_node(node)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/fx/interpreter.py", line 203, in run_node
[rank1]:     return getattr(self, n.op)(n.target, args, kwargs)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/fx/interpreter.py", line 275, in call_function
[rank1]:     return target(*args, **kwargs)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/_ops.py", line 716, in __call__
[rank1]:     return self._op(*args, **kwargs)
[rank1]: RuntimeError: CUDA error: device-side assert triggered
[rank1]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank1]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1
[rank1]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


[rank1]: While executing %neg_default : [num_users=1] = call_function[target=torch.ops.aten.neg.default](args = (%slice_tensor_3,), kwargs = {})
[rank1]: Original traceback:
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1181, in forward
[rank1]:     outputs = self.model(
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1068, in forward
[rank1]:     layer_outputs = decoder_layer(
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/pipelining/_IR.py", line 1160, in _split_before_forward
[rank1]:     return self._orig_forward(*args, **kwargs)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 796, in forward
[rank1]:     hidden_states, self_attn_weights, present_key_value = self.self_attn(
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 704, in forward
[rank1]:     query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 234, in apply_rotary_pos_emb
[rank1]:     q_embed = (q * cos) + (rotate_half(q) * sin)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 208, in rotate_half
[rank1]:     return torch.cat((-x2, x1), dim=-1)


[rank1]: While executing %self_attn : [num_users=3] = call_module[target=self_attn](args = (%unsqueeze, %input_layernorm), kwargs = {})
[rank1]: Original traceback:
[rank1]: None

[rank1]: While executing %layers_8 : [num_users=3] = call_module[target=layers.8](args = (%unsqueeze, %add_47), kwargs = {})
[rank1]: Original traceback:
[rank1]: None

[rank1]: The above exception was the direct cause of the following exception:

[rank1]: Traceback (most recent call last):
[rank1]:   File "/zt/code/my_dev/pippy_llama.py", line 59, in <module>
[rank1]:     output = schedule.step(args)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/pipelining/schedules.py", line 615, in step
[rank1]:     self._step_microbatches(args_split, kwargs_split, targets_split, losses)
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/pipelining/schedules.py", line 706, in _step_microbatches
[rank1]:     output = self._stage.forward_one_chunk(i, arg_mbs[i], kwarg_mbs[i])  # type: ignore[index]
[rank1]:   File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/pipelining/stage.py", line 588, in forward_one_chunk
[rank1]:     raise RuntimeError(exc_msg) from e
[rank1]: RuntimeError: 
[rank1]:             [Stage 1] failed to run forward:
[rank1]:             args: ('Tensor(torch.Size([2, 4, 4096]), grad=False, dtype=torch.float32)', 'Tensor(torch.Size([1, 4]), grad=False, dtype=torch.int64)')
[rank1]:             kwargs: {}
[rank1]:             
[rank1]:[E1103 07:34:57.460801783 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 1] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank1]:[E1103 07:34:57.460843251 ProcessGroupNCCL.cpp:630] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E1103 07:34:57.460851231 ProcessGroupNCCL.cpp:636] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E1103 07:34:57.462234381 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=COALESCED, NumelIn=18446744073709551615, NumelOut=18446744073709551615, Timeout(ms)=1800000) ran for 1800014 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x79993284a446 in /root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7998e802a762 in /root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7998e8031ba3 in /root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7998e803360d in /root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7999329b15c0 in /root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x79993326eac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7999332ffbf4 in /usr/lib/x86_64-linux-gnu/libc.so.6)

W1103 07:34:58.055000 21556 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 21559 closing signal SIGTERM
W1103 07:34:58.056000 21556 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 21561 closing signal SIGTERM
E1103 07:34:58.688000 21556 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -6) local_rank: 2 (pid: 21560) of binary: /root/miniconda3/envs/pippy/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/pippy/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.5.0', 'console_scripts', 'torchrun')())
  File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main
    run(args)
  File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/pippy/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
======================================================
pippy_llama.py FAILED
------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-11-03_07:34:58
  host      : 678c7278cb2d
  rank      : 2 (local_rank: 2)
  exitcode  : -6 (pid: 21560)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 21560
======================================================

I have no experience with CUDA programming or the implementation of NCCL, so it's hard for me to fix this bug. Can anyone help? Thanks!

kiskra-nvidia · 2024-11-04T16:26:20Z

It looks like PyTorch times out for some reason. Rerunning the job with NCCL_DEBUG=INFO might provide more information.
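
For reference, a minimal sketch of one way to rerun with that setting. It also sets `CUDA_LAUNCH_BLOCKING=1`, which the traceback above suggests for locating the device-side assert. The helper names (`build_debug_env`, `build_command`, `rerun`) are hypothetical, and the process count of 4 is an assumption, since the launch command is not shown in the logs.

```python
import os
import subprocess


def build_debug_env(base=None):
    """Return a copy of the environment with the two debug switches set."""
    env = dict(os.environ if base is None else base)
    env["NCCL_DEBUG"] = "INFO"          # verbose NCCL logging, as suggested above
    env["CUDA_LAUNCH_BLOCKING"] = "1"   # report CUDA errors at the failing call
    return env


def build_command(script="pippy_llama.py", nproc=4):
    """Build the torchrun command line (nproc=4 is an assumption)."""
    return ["torchrun", "--nproc-per-node", str(nproc), script]


def rerun(script="pippy_llama.py", nproc=4):
    """Launch the job with the debug environment and return its exit code."""
    result = subprocess.run(build_command(script, nproc), env=build_debug_env())
    return result.returncode
```

Calling `rerun()` from the directory that contains the script would start the job. Exporting the same two variables in the shell before invoking `torchrun` should have the same effect.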
