cu12:$LD_LIBRARY_PATH
export MODEL=facebook/opt-125m

# start the OpenAI-compatible server
# https://docs.vllm.ai/en/latest/models/engine_args.html
python -m vllm.entrypoints.openai.api_server \
    --model $MODEL \
    --dtype $DTYPE \
    --tensor-parallel-size $NUM_GPUS \
    --quantization $...
If I set it to 2, how is the parallelism executed when an inference request comes in? Does it split the input tensors across the 2 GPUs, or does it distribute the model's weights? When I measured the server's latency with "tensor-parallel-size" = 1 or 2, I didn't ...
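For context on the question above: vLLM's tensor parallelism follows the Megatron-LM scheme, so it is the weight matrices that are partitioned across the GPUs; each GPU sees the full input, and the shards synchronize activations with all-reduces. A minimal sketch using the offline API (the model name is just an example; assumes 2 visible GPUs):

from vllm import LLM, SamplingParams

# Each worker holds a shard of every weight matrix; the input is
# broadcast, and per-layer activations are all-reduced across shards.
llm = LLM(model="facebook/opt-125m", tensor_parallel_size=2)
params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["Hello, my name is"], params)
print(outputs[0].outputs[0].text)

For a model as small as opt-125m, the per-layer communication overhead can cancel out the compute speedup, which is one reason latency can look nearly identical at tensor-parallel-size 1 and 2.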
When tensor_parallel_size=2 is used, the output is:
Try adding --privileged to the docker run command.
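For completeness, the same tip expressed with the Docker Python SDK (docker-py); the image name, server flags, and port mapping here are assumptions, and --privileged is a blunt instrument that grants the container full device access:

import docker

client = docker.from_env()
# privileged=True is the SDK equivalent of `docker run --privileged`;
# DeviceRequest(count=-1, ...) exposes all GPUs to the container.
container = client.containers.run(
    "vllm/vllm-openai",
    command=["--model", "facebook/opt-125m", "--tensor-parallel-size", "2"],
    privileged=True,
    device_requests=[docker.types.DeviceRequest(count=-1, capabilities=[["gpu"]])],
    ports={"8000/tcp": 8000},
    shm_size="8g",  # tensor-parallel workers communicate over /dev/shm
    detach=True,
)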
Your current environment
vllm version: '0.5.0.post1'

🐛 Describe the bug
When I set tensor_parallel_size=1, it works well. But if I set tensor_parallel_size>1, the error below occurs:
RuntimeError: Cannot re-initialize CUDA in forked subproc...
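This error usually means the parent process touched CUDA before vLLM forked its tensor-parallel workers. A minimal sketch of the common workaround: switch vLLM's worker start method to spawn via the VLLM_WORKER_MULTIPROC_METHOD environment variable and keep engine construction under a __main__ guard (the model name is just an example):

import os

# Must be set before the engine starts its workers.
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

from vllm import LLM

if __name__ == "__main__":
    # spawn re-imports this module, so construction must be guarded.
    llm = LLM(model="facebook/opt-125m", tensor_parallel_size=2)
    print(llm.generate(["Hello"])[0].outputs[0].text)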
I just upgraded my drivers to 545.29.02, and that has broken running models larger than a single GPU's RAM with vLLM. If I pass --tensor-parallel-size 2, things just hang while creating the engine. Without it, the m...
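When engine creation hangs only with tensor parallelism enabled, a common first step is to turn on NCCL logging and, if the logs stall during peer-to-peer setup, disable P2P; driver regressions have produced exactly this symptom. A sketch using standard NCCL environment variables, set before the engine is built:

import os

# Verbose NCCL logs show where collective setup stalls.
os.environ["NCCL_DEBUG"] = "INFO"
# Workaround if the hang is in peer-to-peer init (costs bandwidth).
os.environ["NCCL_P2P_DISABLE"] = "1"

from vllm import LLM

if __name__ == "__main__":
    llm = LLM(model="facebook/opt-125m", tensor_parallel_size=2)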
Change the tensor_parallel_size parameter to 2, using 2 GPUs; 2. call the API from multiple threads:

import concurrent.futures

def send_request(prompt):
    response = simple_chat(prompt)
    return response

with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    executor.map(send_request, test_list)

Here simple_chat is the function from openai_api_client ...
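The excerpt does not show simple_chat itself; a minimal sketch of what such a helper might look like against vLLM's OpenAI-compatible server, where the base URL, placeholder model name, and message shape are all assumptions:

import concurrent.futures
from openai import OpenAI

# vLLM's OpenAI-compatible server accepts any API key by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def simple_chat(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="your-served-model",  # assumed: whatever --model the server runs
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

test_list = ["Hello", "How are you?"]
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    results = list(executor.map(simple_chat, test_list))

Because each thread blocks on its own HTTP request, ten workers keep up to ten requests in flight, which lets the server batch across them.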
However, when I run it with --tensor-parallel-size 4, the model does not finish loading and the server crashes after about 10 minutes:

$ python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
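One way to tell whether a crash like this is in the serving layer or in the engine itself is to load the same checkpoint with the same tensor-parallel setting through the offline entrypoint. A sketch, assuming 4 visible GPUs and an HF token with Llama 3 access:

from vllm import LLM

if __name__ == "__main__":
    # If this also fails, the problem is engine/NCCL setup,
    # not the OpenAI server wrapper.
    llm = LLM(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        tensor_parallel_size=4,
    )
    print(llm.generate(["Hello"])[0].outputs[0].text)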
When the vllm+cpu backend is assigned (no GPU hardware), tensor_parallel_size should default to 1 instead of cuda_count (which equals 0) #3207
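A sketch of the requested behavior (a hypothetical helper, not the project's actual code): fall back to 1 whenever no CUDA device is visible:

import torch

def default_tensor_parallel_size() -> int:
    # Hypothetical helper: torch.cuda.device_count() is 0 on
    # CPU-only hosts, and tensor_parallel_size=0 is invalid.
    count = torch.cuda.device_count()
    return count if count > 0 else 1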