[c10d][nccl] job hanging with CUDA_LAUNCH_BLOCKING=1 and...
🐛 Describe the bug

NCCL_SHM_DISABLE=1 CUDA_LAUNCH_BLOCKING=1 NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=COLL torchrun --standalone --nproc_per_node=6 run_nccl_debug.py

When tensor numel = 1064 with a subgroup PG of 3 GPUs, it got stuck when numel =...
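
For readers without access to `run_nccl_debug.py`, here is a minimal sketch of what a script exercising this setup might look like. It is not the original script. The consecutive rank split into groups of 3, the helper name `subgroup_ranks`, and the choice of `all_reduce` as the collective are all assumptions made for illustration; the report only states the launch command, the tensor numel, and the subgroup size.

```python
import os


def subgroup_ranks(world_size: int, group_size: int) -> list[list[int]]:
    """Split ranks 0..world_size-1 into consecutive groups (hypothetical layout)."""
    if world_size % group_size != 0:
        raise ValueError("world_size must be divisible by group_size")
    return [list(range(s, s + group_size)) for s in range(0, world_size, group_size)]


def main(numel: int = 1064, group_size: int = 3) -> None:
    # Imported lazily so the helper above can be used without PyTorch installed.
    import torch
    import torch.distributed as dist

    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    # torchrun sets LOCAL_RANK for each worker process.
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # Every rank must call new_group for every subgroup, in the same order,
    # even for subgroups it does not belong to.
    my_group = None
    for ranks in subgroup_ranks(world_size, group_size):
        group = dist.new_group(ranks=ranks)
        if rank in ranks:
            my_group = group

    tensor = torch.ones(numel, device="cuda")
    # Assumption: all_reduce stands in for whichever collective hangs in the report.
    dist.all_reduce(tensor, group=my_group)
    torch.cuda.synchronize()
    print(f"rank {rank}: first element after all_reduce = {tensor[0].item()}")

    dist.destroy_process_group()


# Only run the distributed part when launched by torchrun.
if __name__ == "__main__" and "LOCAL_RANK" in os.environ:
    main()
```

Saved under the script name from the command above, this would be launched with the same environment variables and `torchrun` flags. Whether it reproduces the hang depends on details the report does not show.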