The training job's status is Failed. The training job logs contain NCCL errors, for example NCCL timeout, RuntimeError: NCCL communicator was aborted on rank 7, NCCL WARN Bootstrap : no socket interface found, or NCCL INFO Call to con
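Check environment variables: make sure variables such as NCCL_SOCKET_IFNAME and NCCL_IB_TIMEOUT are set correctly, especially for multi-node, multi-GPU training. Operating system support: NCCL may not be supported on Windows, where Gloo is usually needed as the backend. 3. Look for a possible solution based on the error message. For the "Distributed package doesn't have NCCL built in" error: if you are running on Windows...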
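The backend advice above can be sketched as a small selection helper. This is a minimal sketch, assuming PyTorch as the framework; the function name `choose_backend` is hypothetical, and the fallback to Gloo when PyTorch or CUDA is missing is an assumption made for illustration.

```python
import sys


def choose_backend(platform=sys.platform):
    """Return "nccl" when it looks usable, otherwise fall back to "gloo"."""
    # NCCL is generally not available on Windows builds, so use Gloo there.
    if platform.startswith("win"):
        return "gloo"
    try:
        import torch
        import torch.distributed as dist
    except ImportError:
        # Assumption: without PyTorch there is nothing to check, default to Gloo.
        return "gloo"
    # NCCL needs a distributed-enabled build, NCCL support, and visible GPUs.
    if dist.is_available() and dist.is_nccl_available() and torch.cuda.is_available():
        return "nccl"
    return "gloo"


if __name__ == "__main__":
    print("selected backend:", choose_backend())
```

The returned string could then be passed as the `backend` argument of `torch.distributed.init_process_group`.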
1. Error encountered: This may indicate a possible application crash on rank 0 or a network set up issue.[4] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Socket Timeout ...
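Each node uses the tag provided by init_method as its node identifier. vLLM uses ip+port by default, with a default port of 0. Note: this is incorrect. vLLM uses socket.bind(0), which actually returns a random available port. 2.2.2 Creating NCCL-related resources. NCCL holds the following resources: UniqueId: rank 0 of each communication group creates a UniqueId that identifies the group. Usually each device is bound to one ncclComm...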
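The point about port 0 can be illustrated with the standard library alone. This is a minimal sketch of the general technique (bind to port 0 and read back the port the OS assigned), not vLLM's actual code; the helper name `get_free_port` is hypothetical.

```python
import socket


def get_free_port(host="127.0.0.1"):
    """Ask the OS for a currently unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # Port 0 means "pick any free port"; it is never the port actually used.
        s.bind((host, 0))
        # getsockname() returns (address, port) with the port the OS selected.
        return s.getsockname()[1]


if __name__ == "__main__":
    print("OS-assigned port:", get_free_port())
```

Because the socket is closed before the port is returned, another process could claim the port in the meantime, so callers should be prepared to retry.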
self -mca btl_tcp_if_include enp218s0 -mca plm_rsh_args "-p 38888" --host 192.168.0.37,192.168.0.130 -x NCCL_DEBUG=INFO -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_TC=128 -x NCCL_ALGO=Tree -x NCCL_IB_HCA=mlx5 -x NCCL_IB_TIMEOUT=18 -x NCCL_SOCKET_IFNAME=enp218s0 -x LD_LIBRARY_...
NCCL_DEBUG=INFO
# If NCCL timeouts occur, these values can be increased moderately
NCCL_IB_TIMEOUT=18
NCCL_IB_RETRY_CNT=16
# Do not modify; ModelArts presets this in advance
NCCL_IB_HCA=^mlx5_bond_0
NCCL_SOCKET_IFNAME="=bond0,eth0,enp218s0,enp219s0,enp220s0,enp221s0"
# Do not modify; ModelArts presets this in advance
...
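The NCCL_SOCKET_IFNAME value above uses the filter syntax described in the NCCL documentation: a comma-separated list of interface name prefixes, with a leading `=` for exact matching and a leading `^` for exclusion. The following is a rough sketch of that matching rule for checking a setting against a list of interface names. The function `match_ifnames` is hypothetical, and NCCL's own parser may differ in edge cases (for example in how it treats loopback and docker interfaces by default).

```python
def match_ifnames(spec, interfaces):
    """Filter interface names with an NCCL_SOCKET_IFNAME-style spec."""
    # A leading "^" turns the list into an exclusion list.
    exclude = spec.startswith("^")
    if exclude:
        spec = spec[1:]
    # A leading "=" requests exact name matching instead of prefix matching.
    exact = spec.startswith("=")
    if exact:
        spec = spec[1:]
    patterns = [p for p in spec.split(",") if p]

    def hit(name):
        if exact:
            return name in patterns
        return any(name.startswith(p) for p in patterns)

    # Keep matching names, or non-matching names when excluding.
    return [name for name in interfaces if hit(name) != exclude]


if __name__ == "__main__":
    names = ["lo", "eth0", "eth01", "bond0", "docker0"]
    print(match_ifnames("=bond0,eth0", names))
    print(match_ifnames("^docker,lo", names))
```

On a real host, the list of names could come from `socket.if_nameindex()`; an empty result suggests the setting would leave NCCL with no usable socket interface.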
The training job fails to be executed. The training job logs contain NCCL-related errors, such as "NCCL timeout", "RuntimeError: NCCL communicator was aborted on rank 7",
NCCL_SOCKET_NTHREADS (since 2.4.8) The NCCL_SOCKET_NTHREADS variable specifies the number of CPU helper threads used per network connection for socket transport. Increasing this value may increase the socket transport performance, at the cost of higher CPU usage. ...
22.3 NCCL_IB_TIMEOUT=11 \ NCCL_DEBUG=INFO \ NCCL_DEBUG_FILE=/data1/nccl_debug_%h.%p \ NCCL_IB_CUDA_SUPPORT=1 \ NCCL_IBEXT_DISABLE=1 \ NCCL_DEBUG_SUBSYS=ALL \ NCCL_IB_DISABLE=0 \ NCCL_NVLS_ENABLE=0 \ NCCL_IB_RETRY_CNT=7 \ GLOO_SOCKET_IFNAME=eth1x \ NCCL_SOCKET_IFNAME=et...
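To judge what a given NCCL_IB_TIMEOUT value means in practice, it helps to convert it to seconds. The NCCL documentation gives the InfiniBand timeout as 4.096 µs × 2^timeout; the sketch below applies only that formula. The helper name `ib_timeout_seconds` is hypothetical, and the effective behavior on a given cluster also depends on the retry count and the network hardware.

```python
def ib_timeout_seconds(timeout):
    """Convert an NCCL_IB_TIMEOUT exponent to seconds: 4.096 us * 2**timeout."""
    return 4.096e-6 * (2 ** timeout)


if __name__ == "__main__":
    # The two values that appear in the snippets above.
    for value in (11, 18):
        print(f"NCCL_IB_TIMEOUT={value}: {ib_timeout_seconds(value):.6f} s per attempt")
```

Each step of 1 in the setting doubles the per-attempt timeout, so going from 11 to 18 is a 128-fold increase.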
rkeys[0] == 0) { char line[SOCKET_NAME_MAXLEN + 1]; union ncclSocketAddress addr; ncclSocketGetAddr(&comm->base.sock, &addr); WARN("NET/IB : req %d/%d tag %x peer %s posted incorrect receive info: size %ld addr %lx rkeys[0]=%x", r, nreqs, tag, ncclSocketToString(&addr...