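NCCL_SOCKET_IFNAME is set to eth0, which tells NCCL to use the network interface named eth0 for communication. NCCL INFO NET/Socket : Using [0]eth0:10.233.90.231<0> NCCL INFO Using network Socket Because no IB device was found, NCCL falls back to TCP/IP (Socket) networking and communicates over the eth0 interface. NCCL INFO Setting affinity for GPU 2 to 0fffff,ff00000...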
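To check that the NCCL_SOCKET_IFNAME setting took effect, the interface reported in the "NET/Socket : Using" log line can be compared with the expected name. The following is a minimal sketch in C; `parse_socket_iface` is a hypothetical helper, not part of NCCL, and it assumes the log format shown in the excerpt above with an IPv4 address (an IPv6 address containing colons would not parse correctly).

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not an NCCL API): extract the first interface name and
 * address from a log line such as
 *   "NCCL INFO NET/Socket : Using [0]eth0:10.233.90.231<0>".
 * Returns 1 on success, 0 if the line does not match the assumed format. */
int parse_socket_iface(const char *line, char ifname[64], char addr[64]) {
    const char *p = strstr(line, "NET/Socket : Using");
    if (p == NULL) return 0;
    p = strchr(p, '[');
    if (p == NULL) return 0;
    /* "[%*d]" skips the device index; the name ends at ':' and the
     * address ends at '<'. */
    return sscanf(p, "[%*d]%63[^:]:%63[^<]", ifname, addr) == 2;
}
```

A caller could pass each log line to `parse_socket_iface` and compare `ifname` with the value of NCCL_SOCKET_IFNAME (eth0 in the excerpt above).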
nccl4:685547:685563 [0] NCCL INFO NET/Socket : Using [0]eno2:10.112.205.39<0> nccl4:685547:685563 [0] NCCL INFO Using network Socket nccl4:685547:685564 [1] NCCL INFO Using network Socket nccl5:1728006:1728006 [0] NCCL INFO cudaDriverVersion 12020 nccl5:1728006:1728006 [0] NCCL INFO Bootstrap : Using...
<cluster_name>:921:1146 [2] NCCL INFO Using network Socket <cluster_name>:926:1151 [7] NCCL INFO NET/Socket : Using [0]ibp12s0:10.149.0.32<0> [1]ibp75s0:10.149.1.32<0> [2]ibp141s0:10.149.2.32<0> [3]ibp186s0:10.149.3.32<0> <cluster_name>:926:1151 [7] NCCL INFO Using net...
hp-1:28603:28603 [0] NCCL INFO NET/Socket : Using [0]ens3:10.0.0.27<0> hp-1:28603:28603 [0] NCCL INFO Using network Socket NCCL version 2.10.3+cuda10.2 hp-1:28608:28608 [0] NCCL INFO Bootstrap : Using ens3:10.0.0.27<0> hp-1:28608:28608 [0] NCCL INFO NET/Plugin : No pl...
"IB" : "RoCE"); } line[1023] = '\0'; char addrline[1024]; INFO(NCCL_INIT|NCCL_NET, "NET/IB : Using%s ; OOB %s:%s", line, ncclIbIfName, socketToString(&ncclIbIfAddr.sa, addrline)); } pthread_mutex_unlock(&ncclIbLock); } return ncclSuccess; } First, the third...
rkeys[0] == 0) { char line[SOCKET_NAME_MAXLEN + 1]; union ncclSocketAddress addr; ncclSocketGetAddr(&comm->base.sock, &addr); WARN("NET/IB : req %d/%d tag %x peer %s posted incorrect receive info: size %ld addr %lx rkeys[0]=%x", r, nreqs, tag, ncclSocketToString(&addr...
INFO(NCCL_INIT|NCCL_NET, "NET/IB : Using%s ; OOB %s:%s", line, ncclIbIfName, socketToString(&ncclIbIfAddr.sa, addrline)); } pthread_mutex_unlock(&ncclIbLock); } return ncclSuccess; } First, the third line calls wrap_ibv_symbols to load the dynamic library libibverbs.so and then resolves the functions it exports.
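The load-then-resolve pattern described here can be sketched with `dlopen` and `dlsym`. This is only an illustration of the general technique, not the actual body of wrap_ibv_symbols: the `struct ibv_syms` table and the `load_ibv_symbols` function are hypothetical names, and only two libibverbs entry points (`ibv_get_device_list`, `ibv_free_device_list`) are resolved as an example.

```c
#include <dlfcn.h>
#include <stddef.h>

/* Hypothetical table of resolved entry points (illustration only). */
struct ibv_syms {
    void *handle;            /* handle returned by dlopen */
    void *get_device_list;   /* address of ibv_get_device_list */
    void *free_device_list;  /* address of ibv_free_device_list */
};

/* Load the named shared library and resolve two symbols from it.
 * Returns 0 on success, -1 if the library or a symbol is missing. */
int load_ibv_symbols(const char *libname, struct ibv_syms *s) {
    s->get_device_list = NULL;
    s->free_device_list = NULL;
    s->handle = dlopen(libname, RTLD_NOW);
    if (s->handle == NULL) return -1;   /* library not installed */

    s->get_device_list = dlsym(s->handle, "ibv_get_device_list");
    s->free_device_list = dlsym(s->handle, "ibv_free_device_list");
    if (s->get_device_list == NULL || s->free_device_list == NULL) {
        dlclose(s->handle);             /* incomplete library: give up */
        s->handle = NULL;
        return -1;
    }
    return 0;
}
```

A caller might try `load_ibv_symbols("libibverbs.so", &syms)` and treat a return value of -1 as "IB not available". On older glibc versions the program may need to be linked with `-ldl`.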
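The ncclNet_t struct is a set of function pointers for operations such as initialization, send, and receive. Each transport (socket, IB, and so on) provides its own ncclNet_t implementation, such as ncclNetSocket and ncclNetIb. Initializing the communication network means checking each transport in turn to see which one is usable and assigning it to the global ncclNet. First, initNetPlugin is called to check whether libnccl-net.so is present; the test environment does not have this .so, so the function returns immediately.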
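The function-pointer table and the "try each transport in turn" selection can be sketched as follows. This is a simplified stand-in under stated assumptions, not NCCL's code: `demo_net_t`, the stub functions, and `demo_select_net` are hypothetical, the real ncclNet_t has many more members, and the IB-before-socket order and the "no IB device" stub are assumptions chosen to mirror the fallback seen in the logs.

```c
#include <stddef.h>

/* Simplified stand-in for ncclNet_t: a named table of function pointers. */
typedef struct {
    const char *name;
    int (*init)(void);          /* returns 0 if the transport initialized */
    int (*devices)(int *ndev);  /* reports how many devices were found */
} demo_net_t;

/* Stub transports: IB pretends no device exists, socket finds one. */
int demo_ib_init(void) { return 0; }
int demo_ib_devices(int *ndev) { *ndev = 0; return 0; }
int demo_sock_init(void) { return 0; }
int demo_sock_devices(int *ndev) { *ndev = 1; return 0; }

demo_net_t demo_net_ib = { "IB", demo_ib_init, demo_ib_devices };
demo_net_t demo_net_socket = { "Socket", demo_sock_init, demo_sock_devices };

/* Stand-in for the global ncclNet. */
demo_net_t *demo_net = NULL;

/* Try each candidate in order; keep the first with at least one device.
 * Returns 0 if a transport was selected, -1 otherwise. */
int demo_select_net(void) {
    demo_net_t *candidates[] = { &demo_net_ib, &demo_net_socket };
    for (size_t i = 0; i < sizeof(candidates) / sizeof(candidates[0]); i++) {
        int ndev = 0;
        if (candidates[i]->init() != 0) continue;
        if (candidates[i]->devices(&ndev) != 0 || ndev <= 0) continue;
        demo_net = candidates[i];
        return 0;
    }
    return -1;
}
```

With these stubs, `demo_select_net()` skips the IB entry because it reports zero devices and leaves `demo_net` pointing at the socket entry, which corresponds to the "Using network Socket" log line.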
‣ Retry in case of socket connection failure (unreachable host).
‣ Retry in case of IB QP connection failure.
‣ Improved support for external network plugins: allow plugins to force a flush, indicate when completion is not needed, allow for full offload of allgather operations when ...