...se, pipeline_parallel_size=2, tensor_parallel_size=4, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, ...
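For reference, a minimal sketch of how a few of the arguments in that log line could be set through vLLM's offline `LLM` entrypoint; the model name is a placeholder and only a subset of the logged arguments is reproduced:

```python
# A minimal sketch, assuming vLLM's offline LLM entrypoint; the model name is
# a placeholder and only some of the logged arguments are shown.
from vllm import LLM

llm = LLM(
    model="my-org/my-model",       # placeholder, not taken from the log
    pipeline_parallel_size=2,      # 2 pipeline stages (logged value)
    tensor_parallel_size=4,        # 4-way tensor parallelism per stage
    block_size=16,                 # KV-cache block size, in tokens
    enable_prefix_caching=False,
    swap_space=4,                  # GiB of CPU swap space per GPU
    seed=0,
)
```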
pipeline_model_parallel_size (required, default 1): the number of GPUs in one pipeline model-parallel communication group. Pipeline parallelism slices the model's layers vertically into N stages, with each stage assigned to one GPU, so this value equals the number of stages. For example, pipeline_model_parallel_size=2 with tensor_model_parallel_size=4 means the model is split vertically into 2 stages for pipeline parallelism...
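As a worked illustration of how the two sizes combine (the 16-GPU cluster below is a hypothetical, not taken from the text):

```python
# Toy arithmetic for Megatron-style model parallelism: one model replica
# occupies pipeline_stages x tensor_shards GPUs.
def model_parallel_gpus(pp: int, tp: int) -> int:
    """GPUs used by one model replica = pipeline stages * tensor shards."""
    return pp * tp

pp, tp = 2, 4
print(model_parallel_gpus(pp, tp))   # -> 8 GPUs per replica

world_size = 16                      # hypothetical total GPU count
dp = world_size // model_parallel_gpus(pp, tp)
print(dp)                            # -> 2 data-parallel replicas
```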
Your current environment
The output of `python collect_env.py`:
(RayWorkerWrapper pid=5057, ip=10.121.129.5) Cache shape torch.Size([163840, 64]) [repeated 30x across cluster]
(RayWorkerWrapper pid=5849, ip=10.121.129.12) INFO 01-21 00:46...
Why are these changes needed?
Allow pipeline-parallel-size to be configurable in the vLLM example.

Related issue number
Related to #2354

Checks
I've made sure the tests are passing.

Testing Strategy...
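A hedged sketch of what such a configurable flag could look like in an example script; the flag names mirror vLLM's CLI, but the script body and default model are assumptions, not the actual PR diff:

```python
# Hedged sketch of a configurable --pipeline-parallel-size flag in an example
# script; this is an illustration, not the code from the PR itself.
import argparse

from vllm import LLM, SamplingParams

parser = argparse.ArgumentParser()
parser.add_argument("--model", default="facebook/opt-125m")  # placeholder model
parser.add_argument("--pipeline-parallel-size", type=int, default=1,
                    help="Number of pipeline stages to split the model into.")
parser.add_argument("--tensor-parallel-size", type=int, default=1,
                    help="Number of GPUs each stage is sharded across.")
args = parser.parse_args()

llm = LLM(
    model=args.model,
    pipeline_parallel_size=args.pipeline_parallel_size,
    tensor_parallel_size=args.tensor_parallel_size,
)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```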