|| (ibDev->portAttr.flags & IBV_QPF_GRH_REQUIRED) @@ -1612,9 +1669,10 @@ ncclResult_t ncclIbTest(void* request, int* done, int* sizes) { } char line[SOCKET_NAME_MAXLEN+1]; WARN("NET/IB : Got completion from peer %s with status=%d opcode=%d len=%d vendor err %d (%s)%s...
(gid index shown from 'show_gids') node01: VM-0-14-centos:106:196 [0] transport/net_ib.cc:73 NCCL WARN NET/IB : Got async event : GID table change node01: node01: VM-0-14-centos:107:194 [0] transport/net_ib.cc:73 NCCL WARN NET/IB : Got async event : GID table change ...
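Without the parameter -mca btl_tcp_if_include eno2, the following error is reported: Open MPI accepted a TCP connection from what appears to be a another Open MPI process but cannot find a corresponding process entry for that peer. Replace eno2 with the name of your own network interface, which you can look up with ifconfig. The result of the run is as follows: As you can see, the same operation, the same...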
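ib_modify_qp_is_ok has also been updated to take the link layer into account. Some parameters are mandatory for the Ethernet link layer but irrelevant for IB. Modify the vendor drivers to support the new function signature rdma_lag_get_ah_roce_slave rdma_read_gid_attr_ndev_rcu rdma_get_xmit_slave_udp rdma_build_skb netdev_get_xmit_slave RDMA_LAG_FLAGS_HASH_ALL_SLAVES ...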
ncclResult_t ncclIbConnect(int dev, void* opaqueHandle, void** sendComm) if (stage->state == ncclIbCommStateSend) goto ib_send; NCCLCHECK(ncclIbInitVerbs(dev, ctx, &comm->verbs)) ... ncclIbCreateQp rdma: ncclResult_t ncclIbCreateQp qpInitAttr.qp_type = IBV_QPT_RC qpInitAttr.cap.max_send_wr = 2*MAX_REQUESTS->12...
dw-2-2:35593:35649 [0] transport/net_ib.cc:1295 NCCL WARN NET/IB : Got completion from peer 192.168.205.2<34566> with error 12, opcode 0, len 0, vendor err 129 (Recv) dw-2-2:35593:35649 [0] NCCL INFO transport/net.cc:1134 -> 6 ...
[0] transport/net_ib.cc:1192 NCCL WARN NET/IB : Got completion from peer 192.168.0.19<54698> with error 4, opcode 32533, len 32535, vendor err 81 vm1:58934:58946 [0] NCCL INFO include/net.h:32 -> 2 vm1:58934:58946 [0] NCCL INFO transport/net.cc:870 -> 2 vm1:58934:58946 ...
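
The numeric "error" (or "status=") field in these NET/IB warnings appears to be the work-completion status from libibverbs, so a small script can turn it into a readable name when scanning logs. The following is a minimal sketch, not part of NCCL: the function name `decode_completions` and the regular expression are hypothetical helpers written for this example, the status table is assumed to follow the order of the `ibv_wc_status` enum in `<infiniband/verbs.h>`, and the pattern only covers the two message layouts quoted above.

```python
import re

# Assumed to mirror the order of enum ibv_wc_status in <infiniband/verbs.h>.
IBV_WC_STATUS = [
    "IBV_WC_SUCCESS", "IBV_WC_LOC_LEN_ERR", "IBV_WC_LOC_QP_OP_ERR",
    "IBV_WC_LOC_EEC_OP_ERR", "IBV_WC_LOC_PROT_ERR", "IBV_WC_WR_FLUSH_ERR",
    "IBV_WC_MW_BIND_ERR", "IBV_WC_BAD_RESP_ERR", "IBV_WC_LOC_ACCESS_ERR",
    "IBV_WC_REM_INV_REQ_ERR", "IBV_WC_REM_ACCESS_ERR", "IBV_WC_REM_OP_ERR",
    "IBV_WC_RETRY_EXC_ERR", "IBV_WC_RNR_RETRY_EXC_ERR",
    "IBV_WC_LOC_RDD_VIOL_ERR", "IBV_WC_REM_INV_RD_REQ_ERR",
    "IBV_WC_REM_ABORT_ERR", "IBV_WC_INV_EECN_ERR", "IBV_WC_INV_EEC_STATE_ERR",
    "IBV_WC_FATAL_ERR", "IBV_WC_RESP_TIMEOUT_ERR", "IBV_WC_GENERAL_ERR",
]

# Hypothetical pattern: accepts both "with error 12, opcode 0, ..." and
# "with status=12 opcode=0 ..." layouts of the NET/IB completion warning.
COMPLETION_RE = re.compile(
    r"Got completion from peer (?P<peer>\S+) with "
    r"(?:error |status=)(?P<status>\d+),? opcode[ =](?P<opcode>\d+),? "
    r"len[ =](?P<len>\d+),? vendor err (?P<vendor_err>\d+)"
)


def decode_completions(log_text):
    """Return one dict per completion warning found in log_text."""
    results = []
    for m in COMPLETION_RE.finditer(log_text):
        status = int(m.group("status"))
        # Fall back to a placeholder for values outside the known table.
        name = IBV_WC_STATUS[status] if status < len(IBV_WC_STATUS) else "UNKNOWN"
        results.append({
            "peer": m.group("peer"),
            "status": status,
            "status_name": name,
            "opcode": int(m.group("opcode")),
            "len": int(m.group("len")),
            "vendor_err": int(m.group("vendor_err")),
        })
    return results
```

Under these assumptions, error 12 in the first log decodes to `IBV_WC_RETRY_EXC_ERR` and error 4 in the second to `IBV_WC_LOC_PROT_ERR`. The opcode and len fields are reported as-is; on a failed completion they may not hold meaningful values, which would be consistent with the implausible `opcode 32533, len 32535` in the second log.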