Outline: GradBuffer class construction · Bucket class construction · III. DP Overlap call entry point

I. How DP Overlap Works

The model gradients produced by the backward pass are split into small buckets, and each bucket's all_reduce / reduce_scatter communication runs asynchronously with the remaining backprop computation. In Megatron-LM's DP overlap implementation: with use_distributed_optimizer=True, the per-bucket reduce_scatter overlaps with each layer's backprop operators; with use...
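The bucketing scheme can be sketched in pure Python. This is a minimal stand-in, not Megatron-LM's actual `GradBuffer`/`Bucket` classes: a thread plays the role of the asynchronous all_reduce / reduce_scatter, and `comm_fn`, `on_grad_ready`, and `finish_grad_sync` are hypothetical names for illustration.

```python
import threading

class Bucket:
    """One group of parameters; launches its 'communication' once every grad arrives."""
    def __init__(self, params, comm_fn):
        self.params = set(params)
        self.ready = set()
        self.comm_fn = comm_fn     # stands in for dist.all_reduce / reduce_scatter
        self.handle = None

    def mark_ready(self, param):
        self.ready.add(param)
        if self.ready == self.params:          # bucket complete -> async comm starts,
            self.handle = threading.Thread(target=self.comm_fn)
            self.handle.start()                # overlapping with remaining backprop

class GradBuffer:
    """Partitions parameters into buckets in reverse order, since backprop
    produces gradients last-layer-first."""
    def __init__(self, params, bucket_size, comm_fn):
        self.buckets, self.param_to_bucket = [], {}
        rev = list(reversed(params))
        for i in range(0, len(rev), bucket_size):
            bucket = Bucket(rev[i:i + bucket_size], comm_fn)
            self.buckets.append(bucket)
            for p in rev[i:i + bucket_size]:
                self.param_to_bucket[p] = bucket

    def on_grad_ready(self, param):            # called from a per-param backward hook
        self.param_to_bucket[param].mark_ready(param)

    def finish_grad_sync(self):                # drain all in-flight communications
        for b in self.buckets:
            if b.handle:
                b.handle.join()

calls = []
gb = GradBuffer(["w0", "w1", "w2", "w3"], bucket_size=2,
                comm_fn=lambda: calls.append("comm"))
for p in ["w3", "w2", "w1", "w0"]:             # grads arrive in reverse layer order
    gb.on_grad_ready(p)
gb.finish_grad_sync()
```

The key property is that the first bucket's communication launches as soon as its last gradient is ready, while later layers are still running backprop.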
(Note: tp_comm_overlap_rs_dgrad — overlapping fc1_dgrad in the backward pass on the right with the RS in the next yellow box — works on the same principle, so it is not covered separately below.)

3.1 Naive reduce-scatter

Assume we have 2 GPUs (tp_size = 2). B0 and B1 are fc2, i.e. the row-partitioned model weights; A0 and A1 are fc2's inputs. Here A0 = [A00, A10], A1 = [A10, A1...
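For a row-parallel fc2, each rank computes a full-size partial output from its weight shard, and reduce-scatter both sums the partials and leaves each rank with only its own shard of the result. A minimal pure-Python stand-in for the collective (the function name `reduce_scatter` matches the collective it simulates, but this is an illustration, not NCCL):

```python
def reduce_scatter(partials):
    """Simulate reduce-scatter across tp_size ranks.

    Input:  partials[r] is rank r's full-length partial-sum vector
            (e.g. the partial output A_r @ B_r of a row-parallel linear).
    Output: out[r] is rank r's shard of the element-wise sum, so each
            rank keeps only 1/tp_size of the reduced result.
    """
    tp_size = len(partials)
    n = len(partials[0])
    assert n % tp_size == 0, "output length must divide evenly across ranks"
    shard = n // tp_size
    summed = [sum(vals) for vals in zip(*partials)]    # the "reduce" step
    return [summed[r * shard:(r + 1) * shard]          # the "scatter" step
            for r in range(tp_size)]

# Two ranks, each holding a full-length partial output:
out = reduce_scatter([[1, 2, 3, 4], [10, 20, 30, 40]])
```

Compared with an all-reduce (which would leave every rank holding the full summed vector), each rank both sends and receives less data, which is what makes per-bucket reduce-scatter cheaper under the distributed optimizer.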
@@ -347,6 +363,22 @@
train.gpt3.345m_tp1_pp1_1node_50steps_overlap_grad_reduce:
  METADATA: overlap_grad_reduce
  ADDITIONAL_PARAMS: "--overlap-grad-reduce"
train.gpt3.345m_tp1_pp1_1node_50steps_dist_optimizer_overlap_grad_reduce:
  <<: *selene-test-launcher
  variables:
    <<: [*VARS]
    RUN...
delay_grad_reduce: True
align_grad_reduce: True
overlap_param_gather: False
delay_param_gather: False
align_param_gather: False
scatter_gather_tensors_in_pipeline: True
local_rank: null
lazy_mpu_init: null

megatron/core/optimizer/__init__.py: 135 additions, 58 deletions
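The flattened fields above read like a DDP/optimizer configuration. A minimal dataclass sketch, with field names taken from the list and defaults from the text; the class name `DDPOverlapConfig` is hypothetical, not Megatron-LM's actual config class:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DDPOverlapConfig:
    """Hypothetical mirror of the config fields listed above."""
    overlap_grad_reduce: bool = False   # overlap grad reduction with backprop
    delay_grad_reduce: bool = True      # delay reduction in all but first PP stage
    align_grad_reduce: bool = True
    overlap_param_gather: bool = False  # overlap param all-gather with forward
    delay_param_gather: bool = False
    align_param_gather: bool = False
    scatter_gather_tensors_in_pipeline: bool = True
    local_rank: Optional[int] = None
    lazy_mpu_init: Optional[bool] = None

# Enabling overlap while keeping the other defaults:
cfg = DDPOverlapConfig(overlap_grad_reduce=True)
```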
DP gradient reduce-scatter and parameter all-gather overlaps are enabled by setting overlap_grad_sync=true and overlap_param_sync=true, respectively. The precision of the gradient reduce-scatter is set by grad_sync_dtype; reduction in bf16 ensures improved performance at large-scale training co...
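A hypothetical YAML fragment showing how these three settings might sit together in a training config; the field names come from the text above, but the surrounding structure is an assumption, not verbatim from any framework:

```yaml
# Sketch only; field nesting is illustrative.
optim:
  overlap_grad_sync: true    # overlap gradient reduce-scatter with backprop
  overlap_param_sync: true   # overlap parameter all-gather with forward
  grad_sync_dtype: bf16      # reduce in bf16 for throughput at large scale
```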
experts 8 --expert-model-parallel-size 2 --use-distributed-optimizer
  --moe-router-load-balancing-type sinkhorn --moe-router-topk 1
  --overlap-grad-reduce --overlap-param-gather"'],
moe_grouped_gemm: [1],
args_meta: ["te_8experts2parallel_overlap_grad_reduce_param_gather_groupedGEMM"]}...
    default=False,
    help='If set, overlap DDP grad reduce.')
group.add_argument('--no-delay-grad-reduce', action='store_false',
    help='If not set, delay / synchronize grad reductions in all but first PP stage.',
    des...
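The two flags above can be wired up as a small self-contained argparse sketch. The `dest` names and the `add_ddp_args` helper are assumptions for illustration; the help strings are taken from the text:

```python
import argparse

def add_ddp_args(parser):
    """Sketch of the DDP-overlap flags; dest names are assumed, not from the source."""
    group = parser.add_argument_group('distributed')
    group.add_argument('--overlap-grad-reduce', action='store_true',
                       default=False, dest='overlap_grad_reduce',
                       help='If set, overlap DDP grad reduce.')
    # store_false: the default is True, and passing the flag turns delaying OFF.
    group.add_argument('--no-delay-grad-reduce', action='store_false',
                       default=True, dest='delay_grad_reduce',
                       help='If not set, delay / synchronize grad reductions '
                            'in all but first PP stage.')
    return parser

args = add_ddp_args(argparse.ArgumentParser()).parse_args(
    ['--overlap-grad-reduce', '--no-delay-grad-reduce'])
```

Note the double negative in `--no-delay-grad-reduce`: it exists so that delaying is the default behavior and only opting out requires a flag.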
deepspeed.runtime.zero.stage_1_and_2.DeepSpeedZeroOptimizer.average_tensor only sets the reduction stream to wait for the default stream. This is ok in cases where the computation time is longer than the c...
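The hazard being described is a one-way synchronization: the reduction stream waits for the compute to produce a gradient, but nothing makes the compute stream wait for the reduction to finish reading the buffer before overwriting it. A pure-Python threading analogy of the *correct* two-way pattern, with one event per direction (this is an illustration of the synchronization shape, not CUDA stream semantics):

```python
import threading

buffer = [0]
grad_ready = threading.Event()   # compute ("default stream") -> reducer
reduce_done = threading.Event()  # reducer -> compute: the back-edge average_tensor lacks
reduce_done.set()                # nothing in flight before the first step
out = []

def compute_stream():
    for step in (1, 2):
        reduce_done.wait()       # without this wait, step 2 could overwrite the
        reduce_done.clear()      # buffer while the reducer is still reading step 1
        buffer[0] = step         # "backward computes the gradient"
        grad_ready.set()

def reduction_stream():
    for _ in (1, 2):
        grad_ready.wait()        # the one wait that average_tensor does insert
        grad_ready.clear()
        out.append(buffer[0])    # "reduction reads the gradient buffer"
        reduce_done.set()

t1 = threading.Thread(target=compute_stream)
t2 = threading.Thread(target=reduction_stream)
t1.start(); t2.start(); t1.join(); t2.join()
```

With both events the reducer always observes each step's value exactly once; delete the `reduce_done` wait and the correctness would depend on the computation happening to take longer than the communication, which is the issue the text describes.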