Expected behavior: identical to the tensor_parallel_size=1 behavior. It doesn't seem to be caused by the model files, nor by the glm-4 model itself; I hit the same problem with a Qwen model. Whenever the 2-GPU vLLM run emits the insufficient-KV-cache-space warning, the output degenerates into exclamation marks. I found a similar issue in the vLLM repository: Qwen1.5-14B-Chat with vllm==0.3.3 on a Tesla V100-PCIE-32GB GPU...
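The garbage-token symptom above is tied to the insufficient-KV-cache warning. One common mitigation (a sketch under assumptions, not a fix confirmed in this thread) is to give the KV cache more headroom or cap the sequence length via vLLM's standard launch flags; the model name below is a placeholder:

```shell
# Sketch: reduce KV-cache pressure on a 2-GPU run. The model name is a
# placeholder; tune the two values to your hardware and workload.
vllm serve Qwen/Qwen1.5-14B-Chat \
    --tensor-parallel-size 2 \
    --gpu-memory-utilization 0.95 \
    --max-model-len 4096
```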
Your current environment: vllm version '0.5.0.post1'. 🐛 Describe the bug: when I set tensor_parallel_size=1, it works well; but if I set tensor_parallel_size>1, the following error occurs: RuntimeError: Cannot re-initialize CUDA in forked subproc...
v0.7.3 officially supports DeepSeek-AI's multi-token prediction (MTP) module, with measured inference speedups of up to 69%. Just add --num-speculative-tokens=1 to the launch arguments to enable it, optionally with --draft-tensor-parallel-size=1 for further tuning. Even more striking, in tests on the ShareGPT dataset the feature achieved an 81%–82.3% draft acceptance rate, meaning it cuts inference time substantially while preserving accuracy. Generative AI...
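Wired into a launch command, the flags quoted above would look roughly like this (a sketch: the model name is a placeholder, and the exact flag spellings should be checked against your vLLM version's `--help`, since speculative-decoding options have changed across releases):

```shell
# Sketch: enable the MTP speculative-decoding path with one draft token.
# Model name is a placeholder; verify flag names for your vLLM version.
vllm serve deepseek-ai/DeepSeek-V3 \
    --num-speculative-tokens=1 \
    --draft-tensor-parallel-size=1
```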
I finally got --tensor-parallel-size 2 working. After testing it against many models, it is reliable. @chrisbraddock Could...
vllm: when I set tensor_parallel_size=2, a timing error occurred; when tensor_parallel_size=2 is used, the output...
[Bug]: WSL2 (also applies to Docker) can handle a 1-GPU workload, but not 2 (--tensor-parallel-size 2)...
[Bug] deepseek-R1 671b cannot set tensor_parallel_size=32 (sgl-project/sglang#3345, closed). lytlwhl commented Feb 11, 2025: The communication volume of TP is large, and you should avoid cross-machine communication. Thanks, very valuable insight; we hadn't realized this before....
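The insight above suggests keeping tensor parallelism (which is all-reduce heavy) inside a single node and spanning nodes with pipeline parallelism instead. A hedged sketch for 32 GPUs as 4 nodes of 8 GPUs each, using vLLM's standard flags (the model name is a placeholder):

```shell
# Keep TP within one 8-GPU node; use PP (lighter, point-to-point traffic)
# across the 4 nodes: 8 * 4 = 32 GPUs total.
vllm serve deepseek-ai/DeepSeek-R1 \
    --tensor-parallel-size 8 \
    --pipeline-parallel-size 4
```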
tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/scratch/pbs.5401450.kman.restech....
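For reference, the engine options in that log map onto keyword arguments of vLLM's Python entry point. A sketch only: the served model path in the log is truncated, so a placeholder stands in, and nothing here is executed against a GPU:

```python
# Engine options from the log above, collected as LLM() keyword arguments.
engine_kwargs = dict(
    model="/path/to/served-model",   # placeholder; the logged path is truncated
    tensor_parallel_size=2,
    disable_custom_all_reduce=False,
    quantization=None,
    enforce_eager=False,
    kv_cache_dtype="auto",
    seed=0,
)

# Usage (requires GPUs, so left commented here):
# from vllm import LLM
# llm = LLM(**engine_kwargs)
```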