This article explores how Megatron implements computation-communication overlap. Specifically, Megatron's dp, tp, and pp parts each have places where overlap is possible; this article focuses on the tp part (more precisely, Megatron sp-tp).
The algorithm requires P2P communication between GPUs. Modern NVIDIA GPUs already support this within a node, whether interconnected over NVLink or PCIe, and NVSHMEM (NVSHMEM | NVIDIA Developer [7]) further extends P2P communication between NVIDIA GPUs across nodes. Algorithm 1, shown in the figure below, gives the concrete procedure.

7.2.3 AllGather Overlap

Unlike ReduceScatter, AllGather...
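The AllGather discussion is truncated above, but the general ring-based overlap idea can be illustrated with a minimal PyTorch sketch. This assumes an initialized torch.distributed process group; all_gather_gemm_overlap and its chunk schedule are hypothetical illustrations of the technique, not Megatron's actual implementation (which fuses the exchange into the GEMM via Transformer Engine userbuffers):

```python
import torch
import torch.distributed as dist

def all_gather_gemm_overlap(a_local, b, rank, world):
    """Hypothetical sketch: overlap a ring all-gather with chunked GEMMs.

    Each rank multiplies the chunk it already holds while ring-passing
    chunks to its neighbor, so every P2P transfer hides behind a GEMM.
    """
    send_to = (rank + 1) % world
    recv_from = (rank - 1) % world

    outputs = [None] * world
    chunk, idx = a_local, rank            # ranks visit chunks rank, rank-1, ...
    for step in range(world):
        reqs = []
        if step < world - 1:              # post P2P for the next chunk first
            recv_buf = torch.empty_like(chunk)
            reqs.append(dist.isend(chunk, dst=send_to))
            reqs.append(dist.irecv(recv_buf, src=recv_from))
        outputs[idx] = torch.matmul(chunk, b)   # GEMM overlaps the transfer
        for r in reqs:
            r.wait()
        if step < world - 1:
            chunk, idx = recv_buf, (idx - 1) % world
    return torch.cat(outputs)             # == matmul(all_gather(a_local), b)
```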
Similar to TP communication overlap, PP communication overlap configurations are added via the callback MegatronCommOverlapCallback. The PP communication overlap is enabled when setting overlap_p2p_comm=True. Also, setting batch_p2p_comm=False uses separate kernels for the send and the receive, which further improves communication efficiency and GPU resource utilization. NeMo supports PP communication overlap only with virtual...
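As a minimal configuration sketch (assuming a NeMo 2.x-style import path, which may differ across versions; the trainer wiring is hypothetical):

```python
from nemo.lightning.pytorch.callbacks import MegatronCommOverlapCallback

comm_overlap = MegatronCommOverlapCallback(
    overlap_p2p_comm=True,   # overlap PP send/recv with computation
    batch_p2p_comm=False,    # separate kernels for the send and the receive
)
# Passed to the trainer's callback list, e.g. (hypothetical wiring):
# trainer = nl.Trainer(..., callbacks=[comm_overlap])
```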
A related diff fragment from Megatron-LM's argument validation:

```
            'schedule does not support overlapping p2p communication')
    if args.overlap_param_gather:
        assert args.use_distributed_optimizer, \
            '--overlap-param-gather only supported with distributed optimizer'

    # Parameters dtype.
    args.params_dtype = torch.float
    if args.fp16:
@@ -1093,8 +1097,12 @@ def...
```
NeMo: in P2P communication, the K and V data are placed in a single buffer, so each step triggers only two communications on different streams: one send and one receive (sketched below).

Ring-Flash-Attention: during adaptation, K and V are communicated separately, so each step triggers four communications: two sends and two receives.

2) Since the way communications are submitted is not the root cause of the performance difference, we turned to another possibility: whether the two frameworks differ in their Torch CUDA...
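To make the packed-buffer pattern concrete, here is a minimal sketch assuming an initialized torch.distributed process group (NCCL backend); ring_exchange_packed and its neighbor-rank arguments are hypothetical, not NeMo's actual code:

```python
import torch
import torch.distributed as dist

def ring_exchange_packed(k, v, send_rank, recv_rank):
    """Ship this rank's K/V block around the ring in ONE buffer.

    Packing K and V halves the number of P2P calls per ring step:
    one send + one recv, instead of two of each when K and V are
    exchanged separately (the Ring-Flash-Attention adaptation).
    """
    send_buf = torch.cat([k.reshape(-1), v.reshape(-1)])  # pack K and V
    recv_buf = torch.empty_like(send_buf)
    send_req = dist.isend(send_buf, dst=send_rank)        # P2P call 1
    recv_req = dist.irecv(recv_buf, src=recv_rank)        # P2P call 2
    send_req.wait()
    recv_req.wait()
    k_next = recv_buf[:k.numel()].view_as(k)              # unpack K
    v_next = recv_buf[k.numel():].view_as(v)              # unpack V
    return k_next, v_next
```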