NCCL hangs with NCCL_P2P_USE_CUDA_MEMCPY=1 when used by PyTorch...
Hi, NCCL version: v2.21.5. When I set NCCL_P2P_USE_CUDA_MEMCPY=1 and train a ResNet model using PyTorch with two GPUs on the same NUMA node, NCCL hangs and PyTorch crashes with a watchdog timeout. PyTorch error: `[rank1]:[E ProcessGroupNCCL.cpp:
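For reference, a minimal sketch of how the run is launched under the conditions described above. The script name `train_resnet.py` is a placeholder for the actual DDP training script, and `torchrun` is assumed as the launcher; the only essential part is the environment variable.

```shell
# Hypothetical repro setup (script name is a placeholder).
# Setting this variable is what triggers the hang described above.
export NCCL_P2P_USE_CUDA_MEMCPY=1

# Two GPUs on the same NUMA node, single machine.
torchrun --nproc_per_node=2 train_resnet.py
```

Without `NCCL_P2P_USE_CUDA_MEMCPY=1`, the same command is expected to train normally.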