Such outliers ultimately give rise to the attention sink, and we call the tokens that exhibit these extreme outliers pivot tokens. By guaranteeing that the quantized model runs inference on lossless representations of the pivot tokens, IntactKV works as a plug-in that effectively improves the accuracy of mainstream quantization methods such as GPTQ, AWQ, and QuaRot without adding any inference overhead. It is compatible with different quantization methods and settings (weight-only / weight-activation / KV cache quantization), making it truly plug-and-play.
To address this, the authors propose a new quantization method, IntactKV, which pre-caches a lossless KV cache of the pivot tokens so that their representations stay intact in the quantized model; a theoretical derivation shows that this effectively lowers the upper bound of the model's quantization error. Moreover, the cached lossless KV can be treated as extra model parameters and further calibrated to compensate for the remaining quantization error. IntactKV is simple to implement and orthogonal to mainstream LLM quantization methods such as GPTQ, AWQ, and QuaRot.
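As a rough illustration of this calibration idea, here is a hedged PyTorch sketch, not the repository's actual code: it assumes a HuggingFace-style model interface with the legacy tuple-of-(key, value) `past_key_values` format, and the handles `fp_model` (full precision) and `quant_model` (quantized, with simulated float weights) are hypothetical placeholders. It matches output logits for simplicity, whereas the paper's actual calibration objective may be defined differently.

```python
import torch
import torch.nn.functional as F

def calibrate_intactkv(fp_model, quant_model, pivot_ids, calib_ids,
                       steps=50, lr=1e-4):
    # Cache the lossless pivot-token KV with the full-precision model.
    with torch.no_grad():
        intactkv = fp_model(pivot_ids, use_cache=True).past_key_values

    # Lift every cached key/value tensor into a trainable parameter;
    # the model weights themselves stay frozen.
    quant_model.requires_grad_(False)
    params = [torch.nn.Parameter(t.detach().clone())
              for layer in intactkv for t in layer]
    opt = torch.optim.AdamW(params, lr=lr)

    # Full-precision teacher logits over pivot + calibration tokens.
    with torch.no_grad():
        full_ids = torch.cat([pivot_ids, calib_ids], dim=-1)
        target = fp_model(full_ids).logits[:, pivot_ids.shape[-1]:]

    for _ in range(steps):
        # Rebuild the per-layer (key, value) tuples from the parameters.
        kv = tuple((params[2 * i], params[2 * i + 1])
                   for i in range(len(intactkv)))
        # The quantized model conditions on the trainable IntactKV.
        pred = quant_model(calib_ids, past_key_values=kv).logits
        loss = F.mse_loss(pred, target)  # close the gap to full precision
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Return the calibrated, now-frozen IntactKV.
    return tuple((params[2 * i].detach(), params[2 * i + 1].detach())
                 for i in range(len(intactkv)))
```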
The core idea of IntactKV is to generate the key-value (KV) cache of the input sequence's initial tokens (the pivot tokens) and keep it fully intact. Concretely, the solution consists of the following key steps, sketched in code after this list:

1. Generate the KV cache: run the full-precision model over the pivot tokens to produce their KV cache, and save it as IntactKV. This ensures these critical KV entries are untouched by quantization error.
2. Combine with existing quantization methods: at inference time, the quantized model resumes from the saved IntactKV, so the pivot tokens' representations stay lossless.
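A minimal PyTorch sketch of these two steps follows, under the same assumptions as above (HuggingFace-style models, legacy `past_key_values` format); `fp_model`, `quant_model`, and `num_pivot` are hypothetical names, not identifiers from the repository.

```python
import torch

@torch.no_grad()
def run_with_intactkv(fp_model, quant_model, input_ids, num_pivot=1):
    # Split the sequence into pivot tokens and the remaining tokens.
    pivot_ids = input_ids[:, :num_pivot]
    rest_ids = input_ids[:, num_pivot:]

    # Step 1: a single full-precision forward pass over the pivot tokens;
    # their KV cache is exact and is saved as IntactKV.
    intactkv = fp_model(pivot_ids, use_cache=True).past_key_values

    # Step 2: the quantized model resumes from the cached lossless KV, so
    # the pivot tokens' representations are never distorted by quantization.
    out = quant_model(rest_ids, past_key_values=intactkv, use_cache=True)
    return out.logits, out.past_key_values
```

Because the pivot tokens sit at the very start of the sequence (e.g., the [BOS] token, optionally followed by a fixed system prompt), the IntactKV is input-independent and can be computed once offline, which is why the method adds no inference overhead.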
IntactKV is a simple method, orthogonal to existing approaches, for enhancing quantized LLMs. It can be readily combined with various existing quantization approaches (e.g., AWQ, OmniQuant, GPTQ, QuaRot) with no inference overhead on various LLMs (LLaMA, Vicuna, OPT, Mistral, etc.). IntactKV is built...
IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact. This repository contains the official PyTorch implementation of the paper.