Without tensor parallelism, it succeeds. (Not an option for 8x7B, since it doesn't fit on one GPU.) noamgat changed the title from "Mixtral 8x7b instruct fails to load on GCP" to "--tensor-parallel-size 2 fails to load on GCP" Feb 18, 2024 thisissum commented Feb 18, 2024 I meet the...
Your current environment vllm version: '0.5.0.post1' 🐛 Describe the bug When I set tensor_parallel_size=1, it works well. But if I set tensor_parallel_size>1, the following error occurs: RuntimeError: Cannot re-initialize CUDA in forked subproc...
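This fork-related CUDA error usually means the parent process initialized CUDA (e.g., via an earlier torch call) before vLLM forked its tensor-parallel workers. A minimal sketch of the commonly suggested workaround, assuming a vLLM 0.5.x build that honors the VLLM_WORKER_MULTIPROC_METHOD environment variable; the model name is just a placeholder:

```python
# Sketch: force spawned (not forked) worker processes so CUDA state in the
# parent cannot leak into the workers. Must be set before importing vllm.
import os
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

from vllm import LLM

# The __main__ guard is required once workers are spawned rather than forked.
if __name__ == "__main__":
    llm = LLM(
        model="mistralai/Mixtral-8x7B-Instruct-v0.1",  # placeholder model
        tensor_parallel_size=2,
    )
    print(llm.generate("Hello")[0].outputs[0].text)
```

Alternatively, avoid any CUDA-touching call (torch.cuda.*, moving tensors to GPU) before constructing the LLM.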
When tensor_parallel_size=2 is used, the output is:
Try adding --privileged to the docker run command.
If I set it to 2, how is the parallelism executed when an inference request comes in? Does it split the input tensors across the 2 different GPUs, or does it distribute the model's weights? When I tested the time consumption of the server with "tensor-parallel-size" = 1 or 2, I didn't ...
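To the question above: vLLM uses Megatron-style tensor parallelism, which shards the model's weight matrices across GPUs rather than splitting the input batch; every rank sees the same input and the partial results are combined with a collective. A toy single-process sketch of the idea (plain torch on CPU, not vLLM's actual kernels):

```python
# Conceptual sketch: tensor parallelism shards *weights*, not inputs.
# Two "ranks" are simulated here as two column slices of one linear layer.
import torch

torch.manual_seed(0)
x = torch.randn(1, 8)      # one token's hidden state (hidden size 8)
W = torch.randn(8, 16)     # full weight of a linear layer (8 -> 16)

# Column-parallel split: rank 0 owns columns 0..7, rank 1 owns columns 8..15.
W0, W1 = W[:, :8], W[:, 8:]

# Both ranks receive the *same* input and compute partial outputs.
y0 = x @ W0                # would run on GPU 0 in a real deployment
y1 = x @ W1                # would run on GPU 1

# An all-gather concatenates the partial outputs into the full result.
y_parallel = torch.cat([y0, y1], dim=-1)
assert torch.allclose(y_parallel, x @ W)   # identical to single-GPU math
```

This is also why TP=2 does not automatically halve latency: each request still traverses every layer, just with smaller per-GPU matrix multiplies plus communication overhead.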
Assign: when using the vllm+cpu backend (no GPU hardware), tensor_parallel_size should default to 1 instead of cuda_count (which equals 0) #3207 Triggered via issue November 14, 2024 08:07 qinxuye commented on #2552 042eb5b ...
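A sketch of the defaulting logic the issue asks for — a hypothetical helper for illustration, not the project's actual code:

```python
# Hypothetical helper: never default tensor_parallel_size to 0 on a
# CPU-only host; fall back to 1 when no CUDA device is visible.
import torch

def default_tensor_parallel_size() -> int:
    n_gpus = torch.cuda.device_count()
    return n_gpus if n_gpus > 0 else 1
```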
I just upgraded my drivers to 545.29.02, and it has broken my ability to run models larger than a single GPU's RAM with vLLM. If I pass in --tensor-parallel-size 2, things just hang when trying to create the engine. Without it, the m...
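When engine creation hangs only with --tensor-parallel-size > 1 after a driver change, the stall is often in NCCL's peer-to-peer setup. A hedged diagnostic sketch using standard NCCL environment variables (NCCL_DEBUG, NCCL_P2P_DISABLE); whether P2P is actually the culprit on this machine is an assumption:

```python
# Sketch: surface NCCL's setup logs and fall back from P2P/NVLink to
# shared-memory transport. Set env vars before vLLM initializes NCCL.
import os
os.environ["NCCL_DEBUG"] = "INFO"     # log NCCL init to see where it stalls
os.environ["NCCL_P2P_DISABLE"] = "1"  # bypass GPU peer-to-peer transfers

from vllm import LLM

llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",  # placeholder model
    tensor_parallel_size=2,
)
```

If this unblocks the hang, the regression is in the driver's P2P path rather than in vLLM itself.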
tensor_parallel_size=4, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=meta-llama/Meta-Llama-3-8B-Instruct...
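For reference, a hedged reconstruction of that logged configuration through vLLM's offline LLM API; the keyword names mirror vllm.EngineArgs and may differ across versions:

```python
# Sketch: recreate the engine arguments from the log above.
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    tensor_parallel_size=4,
    disable_custom_all_reduce=True,  # skip vLLM's custom all-reduce kernel
    enforce_eager=True,              # disable CUDA graph capture
    kv_cache_dtype="auto",
    seed=0,
)
```

disable_custom_all_reduce and enforce_eager are both common knobs when debugging multi-GPU startup problems, since they remove the custom collective kernel and CUDA graph capture from the failure surface.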