When launching the deepseek-r1 model through the UI with the vLLM engine and enable_prefix_cache: True configured, the xinference service errors out, reporting that this parameter is not supported. Expected behavior: enable_prefix_cache should be supported; for concurrent workloads that share the same prompt prefix, it can noticeably improve throughput. The optional parameter listed for the vLLM engine is enable_prefix_cache...
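For reference, vLLM itself exposes this switch as enable_prefix_caching on its Python engine API (and --enable-prefix-caching on the OpenAI-compatible server). A minimal sketch of the engine-level setting the UI option would need to map to; the model id is only a placeholder:

```python
from vllm import LLM, SamplingParams

# Automatic prefix caching: KV blocks computed for a shared prompt prefix
# are reused by later requests that start with the same prefix.
llm = LLM(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",  # placeholder model id
    enable_prefix_caching=True,                        # vLLM's name for this option
)

prompts = [
    "You are a helpful assistant. Summarize: document A ...",
    "You are a helpful assistant. Summarize: document B ...",  # shares the long prefix
]
for out in llm.generate(prompts, SamplingParams(max_tokens=64)):
    print(out.outputs[0].text)
```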
- Prefix caching support
- Multi-LoRA support

vLLM seamlessly supports most popular open-source models on HuggingFace, including:
- Transformer-like LLMs (e.g., Llama)
- Mixture-of-Expert LLMs (e.g., Mixtral, Deepseek-V2 and V3)
- Embedding Models (e.g., E5-Mistral)
...
Your current environment: vLLM version 0.5.0.post1
🐛 Describe the bug: Hi, it seems there is a dirty-cache issue with --enable-prefix-caching. We noticed it as we saw internal eval scores significantly degrade when running with --enable-...
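A minimal sketch of the kind of sanity check that surfaces such a regression, assuming greedy decoding and a small placeholder model: with deterministic sampling, a cache hit should reproduce the cold-cache output exactly.

```python
from vllm import LLM, SamplingParams

# Cache-hit correctness check: the second run of an identical prompt is served
# from the cached prefix and should match the first run, which filled the cache.
llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # small placeholder model
    enable_prefix_caching=True,
)
greedy = SamplingParams(temperature=0.0, max_tokens=32)
prompt = ["Shared system prompt. Answer concisely: what is prefix caching?"]

first = llm.generate(prompt, greedy)[0].outputs[0].text   # cold: fills the prefix cache
second = llm.generate(prompt, greedy)[0].outputs[0].text  # warm: reuses cached KV blocks
assert first == second, f"cache-hit output diverged:\n{first!r}\nvs\n{second!r}"
print("prefix-cache hit reproduced the baseline output")
```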
Your current environment: vLLM 0.4.3, RTX 4090 24GB (reproduces also on A100)
🐛 Describe the bug: Hi, when the server is started with:
python -m vllm.entrypoints.openai.api_server --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --enable-prefix-caching ...
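For context, a minimal client call against a server started that way; the host, port, and completion endpoint are vLLM's OpenAI-compatible defaults, and the prompt text is illustrative:

```python
import requests

# Query the OpenAI-compatible completions endpoint exposed by the server above.
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        "prompt": "Shared prefix. Question: what does --enable-prefix-caching do?",
        "max_tokens": 32,
        "temperature": 0.0,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["text"])
```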
- Sparse KV cache framework ([RFC]: Support sparse KV cache framework #5751)
- Long context optimizations: context parallelism, etc.

Production Features
- KV cache offload to CPU and disk
- Disaggregated Prefill
- More control in prefix caching, and scheduler policies
...
Using os.environ["VLLM_ATTENTION_BACKEND"] = "XFORMERS" leads to: The Python process exited with exit code 139 (SIGSEGV: Segmentation fault). I have seen quite a few different issues with enable_prefix_caching; could anyone comment on whether the feature actually worked for them? We have a lot of 80-...
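For reference, the backend override in that report is an environment variable that vLLM reads when the engine is initialized, so it must be set before the engine is constructed. A minimal sketch of the pattern being described; the model id is a placeholder:

```python
import os

# Select the attention backend *before* creating the engine;
# vLLM reads VLLM_ATTENTION_BACKEND at initialization time.
os.environ["VLLM_ATTENTION_BACKEND"] = "XFORMERS"  # the value the report above crashes with

from vllm import LLM, SamplingParams

llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder model
    enable_prefix_caching=True,
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=8))[0].outputs[0].text)
```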
Thanks so much for the work on this repo so far. I think prefix caching could be very useful and I see that vLLM is also starting to support it for some architectures. It looks like the BaseBackend.prefix_cache method still needs to be i...
- Quantization: GPTQ, AWQ, SqueezeLLM, FP8 KV Cache
- Optimized CUDA kernels

vLLM is flexible and easy to use with:
- Seamless integration with popular Hugging Face models
- High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
- Tensor parallelism support...
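A short sketch of how a few of those serving features surface in the Python API; the model id, quantization choice, and GPU count are placeholders, not a recommended setup:

```python
from vllm import LLM, SamplingParams

# Illustrative only: an AWQ-quantized checkpoint served across 2 GPUs,
# with parallel sampling (n completions per prompt).
llm = LLM(
    model="TheBloke/Llama-2-7B-Chat-AWQ",  # placeholder AWQ checkpoint
    quantization="awq",
    tensor_parallel_size=2,                # split weights across 2 GPUs
)
params = SamplingParams(n=4, temperature=0.8, max_tokens=64)  # parallel sampling
for out in llm.generate(["Write a haiku about KV caches."], params):
    for i, completion in enumerate(out.outputs):
        print(f"[{i}] {completion.text.strip()}")
```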