Your current environment

My model is Llama3-8B, which takes about 14 GB of GPU memory, and the machine has 2 × 40 GB GPUs (NVIDIA L40S).

How would you like to use vllm

Hey, recently I tried to use AsyncLLMEngine to
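Since a 14 GB model already fits on a single 40 GB GPU, spreading it across both GPUs with tensor parallelism is optional rather than required. As a minimal sketch of one common way to use both GPUs — the `vllm serve` CLI rather than driving AsyncLLMEngine directly — assuming the Hugging Face checkpoint name below (the issue does not name the exact checkpoint):

```shell
# Serve the model across both GPUs with tensor parallelism.
# --tensor-parallel-size is a real vLLM flag; the checkpoint name
# meta-llama/Meta-Llama-3-8B-Instruct is an assumption, not from the issue.
vllm serve meta-llama/Meta-Llama-3-8B-Instruct --tensor-parallel-size 2
```

With only 14 GB of weights, `--tensor-parallel-size 1` on a single GPU would also work and leaves the second GPU free for another replica.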
The new vLLM release delivers a major performance jump: v0.7.3 officially supports DeepSeek-AI's multi-token prediction (MTP) module, with measured inference speedups of up to 69%. Enabling it takes only adding --num-speculative-tokens=1 to the launch arguments, with an optional --draft-tensor-parallel-size=1 for further tuning. More striking still, in tests on the ShareGPT dataset the feature reached an 81%–82.3% acceptance rate for predicted tokens. This means that, while preserving accuracy, it greatly shortens...
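The reported numbers are roughly consistent with each other. As a back-of-envelope sketch (my own arithmetic, not from the article): with one speculative token per step and acceptance rate α, each verification step emits on average 1 + α tokens, so 1 + α is an upper bound on the speedup before accounting for draft-model overhead.

```python
def max_speedup(alpha: float, num_speculative_tokens: int = 1) -> float:
    """Upper bound on tokens emitted per verification step, relative to
    plain autoregressive decoding (one token per step).

    With k speculative tokens accepted independently at rate alpha, the
    expected number of accepted draft tokens is sum_{i=1..k} alpha**i,
    plus the one token the verifier always emits itself.
    """
    return 1.0 + sum(alpha ** i for i in range(1, num_speculative_tokens + 1))

# At the article's 81%-82.3% acceptance rate with 1 speculative token,
# the ceiling is ~1.81x-1.82x; the measured ~1.69x (69%) sits just
# below it, which is plausible once draft overhead is subtracted.
print(max_speedup(0.81))   # 1.81
print(max_speedup(0.823))  # 1.823
```

This is only a sanity check under an independence assumption on token acceptance; real acceptance rates vary per position and workload.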