Sensitivity-Based Non-uniform Quantization. LLM weight distributions are markedly non-uniform, and the previously common uniform quantization has two problems. First, uniform quantization spaces its quantization levels evenly, which is a poor match for the actual LLM weight distribution. Second, although uniform levels enable efficient integer arithmetic, LLM inference is memory-bandwidth-bound, so that efficiency yields little end-to-end speedup. The paper therefore adopts non-uniform quantization, placing the quantization levels ...
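A common way to realize sensitivity-based non-uniform quantization is weighted k-means over the flattened weights, so that levels cluster where important weights are dense. The sketch below is illustrative only: the function name, the plain Lloyd iteration, and the uniform centroid initialization are assumptions, not the paper's actual code.

```python
import numpy as np

def nonuniform_quantize(weights, sensitivity, n_levels=8, n_iter=25):
    """Place quantization levels by sensitivity-weighted 1-D k-means.

    `weights` and `sensitivity` have the same shape; sensitivity acts as a
    per-weight importance score when updating centroids. Illustrative sketch.
    """
    w = weights.ravel().astype(np.float64)
    s = sensitivity.ravel().astype(np.float64)
    # Initialize centroids uniformly over the weight range (an assumption).
    centroids = np.linspace(w.min(), w.max(), n_levels)
    for _ in range(n_iter):
        # Assign each weight to its nearest centroid.
        assign = np.abs(w[:, None] - centroids[None, :]).argmin(axis=1)
        # Move each centroid to the sensitivity-weighted mean of its cluster.
        for k in range(n_levels):
            mask = assign == k
            if mask.any():
                centroids[k] = np.average(w[mask], weights=s[mask] + 1e-12)
    return centroids[assign].reshape(weights.shape), centroids
```

With uniform sensitivity this reduces to ordinary 1-D k-means; a non-uniform sensitivity (e.g. a Fisher-information-style score) pulls levels toward the weights that matter most for the loss.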
SqueezeLLM is a post-training quantization framework that incorporates a new method called Dense-and-Sparse Quantization to enable efficient LLM serving. TL;DR: deploying LLMs is difficult due to their large memory footprint; this can be addressed with reduced-precision quantization, but a naive method ...
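The Dense-and-Sparse idea can be sketched in a few lines: keep a small fraction of large-magnitude outlier weights exactly in a sparse matrix, so the remaining dense part has a much tighter range and quantizes well at low bit-width. The threshold fraction, names, and COO-style return format below are assumptions for illustration, not SqueezeLLM's implementation.

```python
import numpy as np

def dense_and_sparse_split(W, outlier_frac=0.005):
    """Split W into a dense low-range part and exact sparse outliers.

    The top `outlier_frac` fraction of entries by magnitude are zeroed out
    of the dense matrix and kept exactly as (rows, cols, vals) triplets;
    only the dense remainder would then be low-bit quantized.
    """
    k = max(1, int(outlier_frac * W.size))
    # Magnitude threshold for the k largest entries.
    thresh = np.partition(np.abs(W).ravel(), -k)[-k]
    mask = np.abs(W) >= thresh
    rows, cols = np.nonzero(mask)
    vals = W[mask]                      # outliers stored at full precision
    dense_part = np.where(mask, 0.0, W)  # tighter range -> easier to quantize
    return dense_part, (rows, cols, vals)
```

Adding the sparse triplets back onto the dense part reconstructs W exactly; at inference the two parts are multiplied separately (dense low-bit kernel plus a sparse matvec) and summed.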
Keywords: Quantization, Pruning, Matrix multiplication acceleration, Convolution, LSTM. In this paper, we present hardware accelerators created with high-level synthesis techniques for sparse and dense matrix multiplication operations. The cores can operate at different precisions and are designed to be integrated into a ...
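A sparse matrix multiplication core only touches nonzero entries; the access pattern it implements in hardware is the same one a CSR (compressed sparse row) matrix-vector product uses in software. A minimal sketch, with illustrative names (not the paper's cores):

```python
import numpy as np

def csr_matvec(data, indices, indptr, x):
    """y = A @ x for A in CSR form.

    `data` holds the nonzeros row by row, `indices` their column positions,
    and `indptr[i]:indptr[i+1]` delimits row i. Only nonzeros are multiplied,
    which is the work-skipping a sparse core exploits.
    """
    n_rows = len(indptr) - 1
    y = np.zeros(n_rows, dtype=np.result_type(data, x))
    for i in range(n_rows):
        start, end = indptr[i], indptr[i + 1]
        y[i] = np.dot(data[start:end], x[indices[start:end]])
    return y
```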
(multiplier and adder) and registers for preloaded weights and temporarily latched partial sums and inputs. One should note that 8-bit integer formats are widely used in DNN inference engines due to the prevalence of quantization methods [19]. For systolic arrays, we used 128 × 128 and 256...
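The multiply-accumulate-plus-register structure of one processing element can be mimicked in software: 8-bit operands are multiplied and the partial sum is latched in a wider 32-bit register so the int8 products never overflow. A hedged sketch (the explicit loop nest stands in for the systolic dataflow, which actually pipelines operands between neighboring PEs):

```python
import numpy as np

def pe_matmul(A, B):
    """Int8 matmul with int32 accumulation, one MAC at a time.

    Mirrors the PE structure: an 8-bit multiplier feeding an adder whose
    partial sum lives in a wider (32-bit) register.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and A.dtype == np.int8 and B.dtype == np.int8
    C = np.zeros((M, N), dtype=np.int32)  # wide partial-sum registers
    for i in range(M):
        for j in range(N):
            acc = np.int32(0)
            for k in range(K):
                # One MAC: int8 x int8 product accumulated in int32.
                acc += np.int32(A[i, k]) * np.int32(B[k, j])
            C[i, j] = acc
    return C
```

Widening before multiplying matters: accumulating in int8 would overflow after a handful of MACs, which is why inference engines pair int8 operands with 32-bit accumulators.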