When launching the deepseek-r1 model through the UI with the vLLM engine and the setting enable_prefix_cache: True, the xinference service reports an error saying this parameter is not supported. Expected behavior / 期待表现: support enable_prefix_cache. In concurrent scenarios where requests share the same prompt prefix, it can noticeably improve throughput. The vLLM engine's optional-parameter list shows it as enable_prefix_cache.
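For reference, vLLM itself exposes this option as enable_prefix_caching (CLI flag --enable-prefix-caching). Below is a minimal sketch of what the issue is asking for through the xinference Python client, assuming extra keyword arguments to launch_model are forwarded to the underlying vLLM engine; the endpoint URL is a placeholder and the forwarding behavior is exactly what the issue requests, not something confirmed here.

```python
# Hypothetical sketch: pass vLLM's prefix-caching option through xinference.
# Assumes extra kwargs to launch_model reach the vLLM engine config, which is
# the behavior this issue is requesting.
from xinference.client import Client

client = Client("http://localhost:9997")  # placeholder endpoint

model_uid = client.launch_model(
    model_name="deepseek-r1",      # model from the report
    model_engine="vllm",           # use the vLLM backend
    enable_prefix_caching=True,    # vLLM's actual argument name ("caching", not "cache")
)
print(model_uid)
```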
LLM Primer 2: What does "Prefix cache hit rate" mean? | The "Prefix cache hit rate" metric is a performance statistic in vLLM closely tied to the Automatic Prefix Caching (APC) feature. APC is an optimization technique designed to speed up inference by caching the key-value pairs (KV cache) of previous requests, which is especially effective when processing sequences that share a common prefix. Starting from which version? According to vLLM ...
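As a concrete illustration of the shared-prefix scenario APC targets, here is a small sketch using vLLM's offline LLM API with enable_prefix_caching=True: the second prompt reuses the KV cache computed for the long common prefix of the first, and such reuse is what the prefix cache hit rate measures. The model name and prompt text are placeholders.

```python
# Minimal sketch of Automatic Prefix Caching (APC) in vLLM's offline API.
# Both prompts share a long common prefix, so the KV cache computed for the
# first request can be reused by the second instead of being recomputed.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder model
    enable_prefix_caching=True,                   # turn on APC
)

shared_prefix = "You are a helpful assistant. Here is a long shared document: ..."
prompts = [
    shared_prefix + " Question 1: summarize the document.",
    shared_prefix + " Question 2: list the key entities.",
]

outputs = llm.generate(prompts, SamplingParams(max_tokens=64, temperature=0.0))
for out in outputs:
    print(out.outputs[0].text)
```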
Your current environment vLLM version 0.5.0.post1 🐛 Describe the bug Hi, it seems there is a dirty-cache issue with --enable-prefix-caching. We noticed it because our internal eval scores degraded significantly when running with --enable-...
- Prefix caching support
- Multi-lora support

vLLM seamlessly supports most popular open-source models on HuggingFace, including:
- Transformer-like LLMs (e.g., Llama)
- Mixture-of-Expert LLMs (e.g., Mixtral)
- Embedding Models (e.g., E5-Mistral)
- Multi-modal LLMs (e.g., LLaVA)

Find the full ...
Your current environment vLLM 0.4.3, RTX 4090 24GB (also reproduces on an A100) 🐛 Describe the bug Hi, when the server is started with: python -m vllm.entrypoints.openai.api_server --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --enable-prefix-caching ...
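For context, requests against a server started this way go through vLLM's OpenAI-compatible API; a minimal sketch with the openai Python client follows, where the base URL and API key are the usual local-server placeholders and prefix caching happens server-side, transparently to the client.

```python
# Minimal sketch of a chat request against the OpenAI-compatible vLLM server
# started with --enable-prefix-caching above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=32,
)
print(resp.choices[0].message.content)
```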
- Sparse KV cache framework ([RFC]: Support sparse KV cache framework #5751)
- Long context optimizations: context parallelism, etc.

Production Features
- KV cache offload to CPU and disk
- Disaggregated Prefill
- More control in prefix caching, and scheduler policies
...
vLLM is fast with:
- State-of-the-art serving throughput
- Efficient management of attention key and value memory with PagedAttention
- Continuous batching of incoming requests
- Fast model execution with CUDA/HIP graph
- Quantization: GPTQ, AWQ, SqueezeLLM, FP8 KV Cache
- Optimized CUDA kernels

vLLM is fle...