RuntimeError: NCCL communicator was aborted on rank 0. Original reason for failure was: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=29544, OpType=_ALLGATHER_BASE, Timeout(ms)=600000) ran for 601762 milliseconds before timing out. ... ... composer.core.engine: ...
[80756] [rank3]:[E321 02:07:12.442249455 ProcessGroupNCCL.cpp:1515] [PG 1 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1580, OpType=ALLREDUCE, NumelIn=466119168, NumelOut=466119168, Timeout(ms)=6000...
[rank29]:[E222 08:31:44.059609287 ProcessGroupNCCL.cpp:616] [Rank 29] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600023 milliseconds before timing out. [rank29]:[E222 08:31:44.060357685 Process...
2024/05/01 00:00:57 [E ProcessGroupNCCL.cpp:916] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800257 milliseconds before timing...
out. [E ProcessGroupNCCL.cpp:587] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800419 milliseconds before timing out. [E ProcessGroupNCCL.cpp:587] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALL...
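The watchdog messages above share one shape: a rank, a `WorkNCCL(...)` field list, and a "ran for N milliseconds" duration. As a rough sketch (not part of PyTorch), the snippet below pulls those fields out of a single log line so the overrun past the configured timeout can be compared across ranks. The regular expression is inferred only from the samples quoted here, and the names `WatchdogTimeout` and `parse_watchdog_timeout` are hypothetical. Other PyTorch versions may format the message differently.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Pattern inferred from the sample messages above (an assumption, not a
# format guaranteed by PyTorch). The field list is matched lazily because
# "Timeout(ms)" itself contains parentheses.
_TIMEOUT_RE = re.compile(
    r"\[Rank (?P<rank>\d+)\] Watchdog caught collective operation timeout: "
    r"WorkNCCL\((?P<fields>.*?)\) ran for (?P<elapsed_ms>\d+) milliseconds"
)


@dataclass
class WatchdogTimeout:
    """Hypothetical container for the fields of one watchdog timeout message."""
    rank: int
    op_type: str
    seq_num: Optional[int]  # older messages omit SeqNum
    timeout_ms: int
    elapsed_ms: int

    @property
    def overrun_ms(self) -> int:
        # How far past the configured timeout the watchdog fired.
        return self.elapsed_ms - self.timeout_ms


def parse_watchdog_timeout(line: str) -> Optional[WatchdogTimeout]:
    """Return the parsed timeout, or None if the line is truncated or unrelated."""
    match = _TIMEOUT_RE.search(line)
    if match is None:
        return None
    # Fields look like "SeqNum=2, OpType=ALLREDUCE, ..., Timeout(ms)=600000".
    fields = dict(item.split("=", 1) for item in match["fields"].split(", "))
    seq = fields.get("SeqNum")
    return WatchdogTimeout(
        rank=int(match["rank"]),
        op_type=fields["OpType"],
        seq_num=int(seq) if seq is not None else None,
        timeout_ms=int(fields["Timeout(ms)"]),
        elapsed_ms=int(match["elapsed_ms"]),
    )


if __name__ == "__main__":
    sample = (
        "[Rank 29] Watchdog caught collective operation timeout: "
        "WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, "
        "Timeout(ms)=600000) ran for 600023 milliseconds before timing out."
    )
    print(parse_watchdog_timeout(sample))
```

Lines that are cut off before the "ran for" part, as several of the snippets here are, simply yield `None` rather than a guess.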
that it found a collective timeout: ``` [rank1]:[E1104 14:02:18.767594328 ProcessGroupNCCL.cpp:688] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=200, NumelOut=200, Timeout(ms)=5000) ran for 5096 milliseconds before timing out. ...
currentTimepoint - work->workStartTime_) > work->opTimeout_) { std::exception_ptr exception_ptr = std::make_exception_ptr(std::runtime_error("NCCL Operation Timed Out")); work->setException(exception_ptr); for (const auto& ncclComm : work->ncclComms_) { ...
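The fragment above compares the time elapsed since a work item started against that item's timeout and, when the limit is exceeded, records a runtime error on it. A minimal Python sketch of that comparison follows. It is an illustration only, not PyTorch's implementation: `Work` and `check_timeout` are hypothetical names, and the loop over `ncclComms_` that the fragment elides is deliberately not modeled.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class Work:
    """Hypothetical stand-in for one pending collective operation."""
    start_time: float  # seconds, as returned by time.monotonic()
    timeout: float     # seconds the operation may run before it counts as hung
    exception: Optional[Exception] = None


def check_timeout(work: Work, now: Optional[float] = None) -> bool:
    """Record a timeout error on `work` if it has run longer than its timeout.

    Returns True when the timeout fired. `now` can be injected for testing;
    by default the monotonic clock is read.
    """
    if now is None:
        now = time.monotonic()
    if now - work.start_time > work.timeout:
        # Mirrors the message used in the C++ fragment above.
        work.exception = RuntimeError("NCCL Operation Timed Out")
        return True
    return False


if __name__ == "__main__":
    # A 5-second timeout checked 5.096 seconds after the start.
    work = Work(start_time=0.0, timeout=5.0)
    print(check_timeout(work, now=5.096), work.exception)
```

Because the check only runs when the watchdog polls, the reported duration is always somewhat longer than the configured timeout, which matches the small overruns visible in the log lines.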