# Define the forward step
def forward_step(data_iterator, model: GPTModel):
    ...

# Build the train/valid/test datasets
def train_valid_test_datasets_provider(train_val_test_num_samples):
    ...

# Training entry point: calls the pretrain function defined in Megatron-LM/megatron/training.py
if __name__ == "__main__":
    pretrain(train_valid_test_datasets_provider, ...
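To make the skeleton concrete, here is a hedged sketch of what `forward_step` typically does in pretrain_gpt.py. The helpers `get_batch` and `loss_func` shown here are simplified stand-ins for the ones defined in that file, and exact signatures vary between Megatron-LM versions, so treat this as an illustration rather than the verbatim source.

```python
from functools import partial
import torch

# Hypothetical stand-ins for the helpers defined in pretrain_gpt.py
# (get_batch, loss_func); real signatures differ across Megatron-LM versions.
def get_batch(data_iterator):
    # tokens, labels, loss_mask, attention_mask, position_ids
    return next(data_iterator)

def loss_func(loss_mask, output_tensor):
    losses = output_tensor.float()
    loss_mask = loss_mask.view(-1).float()
    return torch.sum(losses.view(-1) * loss_mask) / loss_mask.sum()

def forward_step(data_iterator, model):
    """One step: fetch a micro-batch, run the model, return output + loss callback."""
    tokens, labels, loss_mask, attention_mask, position_ids = get_batch(data_iterator)
    output_tensor = model(tokens, position_ids, attention_mask, labels=labels)
    # pretrain()'s training loop calls this partial to reduce the output to a scalar loss.
    return output_tensor, partial(loss_func, loss_mask)
```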
# num_query_groups_per_partition: the heads are sharded across tensor-parallel ranks,
# i.e. num_attention_heads / world_size
new_tensor_shape = mixed_x_layer.size()[:-1] + (
    # note: the head count has already been sharded by the tensor-parallel size
    self.num_query_groups_per_partition,
    (
        (self.num_attention_heads_per_partition // self.num_query_groups_per_partition + 2)
        * self.hidden_size_per_attention_head
    ),
)
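As a quick sanity check of this reshape, here is a hedged sketch with made-up sizes (32 attention heads, 8 query groups, head dimension 128, tensor-parallel size 2; none of these values come from the text above). It traces the same shape arithmetic: each KV group on a rank carries its share of query heads plus one key head and one value head.

```python
import torch

# Hypothetical sizes, just to trace the shape arithmetic of the reshape above.
num_attention_heads, num_query_groups, head_dim, tp_size = 32, 8, 128, 2
heads_per_partition = num_attention_heads // tp_size    # 16 heads on this rank
groups_per_partition = num_query_groups // tp_size      # 4 KV groups on this rank

# The fused QKV projection on one rank produces, per group:
# (heads_per_group) query heads + 1 key head + 1 value head.
qkv_dim = groups_per_partition * (heads_per_partition // groups_per_partition + 2) * head_dim

seq_len, batch = 4096, 1
mixed_x_layer = torch.randn(seq_len, batch, qkv_dim)

new_tensor_shape = mixed_x_layer.size()[:-1] + (
    groups_per_partition,
    (heads_per_partition // groups_per_partition + 2) * head_dim,
)
print(mixed_x_layer.view(*new_tensor_shape).shape)
# torch.Size([4096, 1, 4, 768])  -> (16/4 + 2) * 128 = 768 per group
```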
--num-query-groups: 8
--seq-length: 4096
--max-position-embeddings: 4096
--make-vocab-size-divisible-by: 128

# Add regularization args
--attention-dropout: 0.0
--hidden-dropout: 0.0
--clip-grad: 1.0
--weight-decay: 0.1

# Add learning rate args
--lr-decay-samples: 1949218748
--lr-...
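These key/value pairs are the YAML form of the flags that pretrain_gpt.py eventually receives on its command line. A small hedged sketch of that conversion is below; the file name and helper function are hypothetical and not part of Megatron-LM.

```python
import yaml  # pip install pyyaml

def yaml_args_to_argv(path: str) -> list[str]:
    """Turn a '--flag: value' mapping like the one above into CLI arguments."""
    with open(path) as f:
        cfg = yaml.safe_load(f)          # e.g. {'--num-query-groups': 8, ...}
    argv = []
    for flag, value in cfg.items():
        if value is True:                # boolean switches take no value
            argv.append(flag)
        else:
            argv.extend([flag, str(value)])
    return argv

# Hypothetical usage: splice the flags into a torchrun launch of pretrain_gpt.py.
# print(" ".join(yaml_args_to_argv("gpt_config.yaml")))
```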
num_layers ... 24
num_layers_per_virtual_pipeline_stage ... None
num_query_groups ... 1
num_workers ... 2
onnx_safe ... None
openai_gelu ... False
optimizer ...
The uniform method uniformly divides the transformer layers into groups of layers (each group of size --recompute-num-layers) and stores the input activations of each group in memory. The baseline group size is 1 and, in this case, the input activation of each transformer layer is stored. When the GPU memory is insufficient, increasing the number of layers per group reduces the memory usage, enabling a bigger model to be trained.
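To illustrate what this uniform grouping means in practice, here is a small hedged sketch using torch.utils.checkpoint directly. It is not Megatron-LM's implementation (which lives inside the transformer block's forward), but the grouping logic is the same: only the input of each group of recompute_num_layers layers is kept, and the group is re-run during the backward pass. All class and variable names here are made up for the example.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class UniformRecomputeStack(nn.Module):
    """Toy transformer-layer stack with uniform activation recomputation."""

    def __init__(self, num_layers: int = 8, hidden: int = 64, recompute_num_layers: int = 4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True)
            for _ in range(num_layers)
        )
        self.recompute_num_layers = recompute_num_layers

    def _run_group(self, start: int, end: int, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers[start:end]:
            x = layer(x)
        return x

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Only the input activation of each group is stored; the group's
        # intermediate activations are recomputed during backward.
        for start in range(0, len(self.layers), self.recompute_num_layers):
            end = min(start + self.recompute_num_layers, len(self.layers))
            x = checkpoint(self._run_group, start, end, x, use_reentrant=False)
        return x

model = UniformRecomputeStack()
out = model(torch.randn(2, 16, 64, requires_grad=True))
out.sum().backward()
```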
--recompute-num-layers: for the uniform method, this sets the number of layers in each recomputed group of transformer layers; the default of 1 checkpoints every transformer layer individually. For the block method, setting it to N means the first N layers of each pipeline stage cache their input activations.

2. Source Code Walkthrough
# data_parallel_size = world_size // (tensor_model_parallel_size * pipeline_model_parallel_size), e.g. 16 // (2*4) = 2
num_tensor_model_parallel_groups = world_size // tensor_model_parallel_size      # e.g. 16 // 2 = 8 tensor-parallel groups, 2 ranks each
num_pipeline_model_parallel_groups = world_size // pipeline_model_parallel_size  # 16 // 4 = 4 pipeline-parallel groups, 4 ranks each
num_data_parallel_groups = world_size // data_parallel_size
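To see what these counts mean concretely, here is a hedged, self-contained sketch that enumerates the rank groups for the example above (world_size=16, TP=2, PP=4, hence DP=2). The grouping order follows the usual Megatron-LM convention in initialize_model_parallel (tensor-parallel ranks consecutive, pipeline-parallel ranks strided across the world), but treat it as an illustration rather than the exact source.

```python
world_size, tp, pp = 16, 2, 4
dp = world_size // (tp * pp)          # 2 data-parallel replicas

# Tensor-parallel groups: consecutive ranks, tp ranks per group.
tensor_groups = [list(range(i * tp, (i + 1) * tp)) for i in range(world_size // tp)]

# Pipeline-parallel groups: ranks with a stride of world_size // pp.
pipeline_groups = [list(range(i, world_size, world_size // pp)) for i in range(world_size // pp)]

# Data-parallel groups: ranks in the same pipeline stage that share the same
# tensor-parallel rank, i.e. a stride of tp within each pipeline stage.
data_groups = []
for stage_start in range(0, world_size, tp * dp):
    for j in range(tp):
        data_groups.append(list(range(stage_start + j, stage_start + tp * dp, tp)))

print(len(tensor_groups), tensor_groups[:2])     # 8 [[0, 1], [2, 3]]
print(len(pipeline_groups), pipeline_groups[0])  # 4 [0, 4, 8, 12]
print(len(data_groups), data_groups[:2])         # 8 [[0, 2], [1, 3]]
```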