27 changes: 18 additions & 9 deletions in `model/kv_cache.py`:

```diff
@@ -1,5 +1,5 @@
 import torch
 import torch.nn as nn

 class KVCache:
     """
@@ -14,7 +14,7 @@ class KVCache:
...
```
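The truncated diff above touches a `KVCache` class. The repository's actual implementation is not shown, but a minimal sketch of what such a per-layer key/value cache typically does can illustrate the idea. All names and structure here are assumptions; a real implementation would store torch tensors and concatenate with `torch.cat` along the sequence dimension, whereas this sketch uses plain lists so it is self-contained:

```python
class KVCache:
    """Minimal sketch of a per-layer key/value cache for autoregressive
    decoding. Names and structure are illustrative assumptions, not the
    repository's actual model/kv_cache.py; a real cache holds torch
    tensors and extends them with torch.cat along the sequence dim.
    """

    def __init__(self):
        self.key_cache = []    # key_cache[layer_idx] -> cached keys for that layer
        self.value_cache = []  # value_cache[layer_idx] -> cached values for that layer

    def update(self, keys, values, layer_idx):
        """Append new keys/values for one layer; return the full cached pair."""
        if layer_idx == len(self.key_cache):
            # First decoding step for this layer: start a fresh entry.
            self.key_cache.append(list(keys))
            self.value_cache.append(list(values))
        else:
            # Subsequent steps: extend the existing entry.
            self.key_cache[layer_idx].extend(keys)
            self.value_cache[layer_idx].extend(values)
        return self.key_cache[layer_idx], self.value_cache[layer_idx]
```

Calling `update` once per decoding step grows the cached sequence, so attention at step *t* can reuse the keys and values of steps 1..*t-1* instead of recomputing them.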
Two parameters, `use_cache_quantization` and `use_cache_kernel`, control this behavior on the model; when both are enabled, KV-cache quantization is activated. Usage is as follows:

```python
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="auto",
    trust_remote_code=True,
    use_cache_quantization=True,
    use_cache_kernel=True,
)
```
Summary

- Currently it is not obvious that we do not support running quantized kv_cache inference.
- Make it obvious that vLLM should be used for this case.
```
    k, v, new_cache = self._update_kv_and_cache(k, v, cache)
  File "wenet/transformer/attention.py", line 207, in _update_kv_and_cache
    key_cache, value_cache = cache
ValueError: not enough values to unpack (expected 2, got 0)
```
MindSpore is a new open-source deep learning training/inference framework that can be used in mobile, edge, and cloud scenarios. - Add the KVCacheScatterUpdate primitive, an inference-only operator with no backward pass in GE mode, adapted for parallel sharding · ju-tian-712/mindspore@0cba843
```
    k, v, new_cache = self._update_kv_and_cache(k, v, cache)
  File "/media/hulk2/BigData2/wenet_23_jan_2024/examples/reverie/v5/s0/wenet/transformer/attention.py", line 209, in _update_kv_and_cache
    key_cache, value_cache = cache
ValueError: too many values to unpack (expected 2)
```
...
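The two tracebacks above are mirror images of the same failure: `key_cache, value_cache = cache` raises when the cache tuple is empty (typically the first decoding step) or when it carries more than two entries. A small hypothetical helper (not wenet's actual API) shows the defensive unpacking that avoids both errors:

```python
def split_cache(cache):
    """Defensively unpack a (key_cache, value_cache) pair.

    Hypothetical helper, not wenet's API: normalizes the two failure
    modes seen in the tracebacks above -- an empty cache on the first
    decoding step, and a cache with unexpected extra entries.
    """
    if cache is None or len(cache) == 0:
        # First step: nothing has been cached yet.
        return None, None
    if len(cache) >= 2:
        # Take only the key/value pair; ignore any trailing entries.
        return cache[0], cache[1]
    raise ValueError(f"cache must hold key and value, got {len(cache)} item(s)")
```

The caller can then branch on `None` to skip concatenation on the first step instead of letting tuple unpacking raise.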
[Docs] Update FP8 KV Cache documentation (vllm-project#12238) · c3d6140 — abmfy pushed a commit (09c9898) to abmfy/vllm-flashinfer that referenced this pull request on Jan 24, 2025.
```diff
@@ -323,7 +323,7 @@ def llama_flash_attn2_forward_PyramidKV(
     # print(f"after self.key_cache[layer_idx] {past_key_value.key_cache[self.layer_idx].device}")
     # print(f"after self.value_states[layer_idx] {past_key_value.value_cache[self.layer_idx].device}")
     print(f"debug key_...
-    print(f"debug layer_idx {layer_idx} past_seen_tokens {past_seen_tokens}")
+    # print(f"debug layer_idx {layer_idx} past_seen_tokens {past_seen_tokens}")
@@ -741,4 +741,4 @@ def llama_model_forward(
     past_key_values=next_cache,
     hidden_states=all_hidden_states,
     attentions=all_self_...
```
* `method`: Supports "PyramidKV", "SnapKV", "StreamingLLM", "H2O" (previously "PyramidKV" only).
* `max_capacity_prompts`: Selected KV size in each layer (e.g. 128 or 2048 in the paper). When method is "PyramidKV", given that the total number of KV remains unchanged, the speci...
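The two options above can be collected into a small configuration sketch. The parameter names follow the bullet list; the dict layout and the `validate` helper are illustrative assumptions, not the project's actual config schema:

```python
# Hypothetical configuration sketch for the options described above.
# Keys mirror the documented parameters; values are example choices.
config = {
    "method": "PyramidKV",        # or "SnapKV", "StreamingLLM", "H2O"
    "max_capacity_prompts": 128,  # selected KV size per layer (e.g. 128 or 2048)
}

SUPPORTED_METHODS = {"PyramidKV", "SnapKV", "StreamingLLM", "H2O"}

def validate(config):
    """Reject unsupported compression methods and non-positive KV budgets."""
    if config["method"] not in SUPPORTED_METHODS:
        raise ValueError(f"unsupported method: {config['method']}")
    if config["max_capacity_prompts"] <= 0:
        raise ValueError("max_capacity_prompts must be positive")
    return config
```

Validating up front keeps a typo in the method name from silently falling through to a default code path deep in the attention forward.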