After reading an expert's write-up on quantization, I understood the W8A8 computation in TensorRT and why quantization yields a speedup. But for GPTQ's W8A16 or W4A16, I couldn't tell whether it counts as dynamic quant or static quant, and I was stuck on this for a long time. Only later, after reading the GPTQ source code, did I understand that the whole process actually dequantizes the quantized weights back to fp16 first, and then multiplies them with ...
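To make that dequantize-then-multiply point concrete, here is a minimal PyTorch sketch of the W4A16 compute path, assuming int4-valued weights (stored unpacked as int8 for clarity) with per-group fp16 scales. The function name, shapes, and group size are all illustrative — this is not GPTQ's actual packed format or its fused kernels:

```python
import torch

def w4a16_linear(x_fp16: torch.Tensor,    # activations, fp16, shape (M, K)
                 q_weight: torch.Tensor,  # int4-valued weights in [-8, 7], stored as int8, shape (K, N)
                 scales: torch.Tensor,    # fp16 scales per (group, output channel), shape (K // group_size, N)
                 group_size: int = 128) -> torch.Tensor:
    # Dequantize: broadcast each group's scale over its group_size rows.
    scales_full = scales.repeat_interleave(group_size, dim=0)    # (K, N)
    w_fp16 = q_weight.to(torch.float16) * scales_full
    # The actual multiply-accumulate runs in fp16 -- the "A16" part.
    # (fp16 matmul needs a GPU or a recent PyTorch build on CPU.)
    return x_fp16 @ w_fp16

# Toy usage with made-up shapes
x = torch.randn(2, 256, dtype=torch.float16)
qw = torch.randint(-8, 8, (256, 64), dtype=torch.int8)
s = torch.rand(256 // 128, 64, dtype=torch.float16) * 0.01
print(w4a16_linear(x, qw, s).shape)  # torch.Size([2, 64])
```

So W4A16 saves memory bandwidth by loading 4-bit weights, but the arithmetic itself is still fp16, which is why it is neither classic dynamic nor static activation quantization — activations are never quantized at all.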
brian-dellabetta commented Jan 30, 2025:

Based on a request by @mgoin, with @kylesayrs we have added an example doc for int4 w4a16 quantization, following the pre-existing int8 w8a8 quantization example and the example available in [`llm-compressor`](https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization...).

FIX #n...
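For context, the llm-compressor side of that example boils down to a one-shot GPTQ run with the W4A16 scheme. The sketch below follows the pattern in the llm-compressor README; the model ID, dataset, and output directory are placeholders, and argument names may differ between llm-compressor versions:

```python
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

# GPTQ with the W4A16 scheme: 4-bit weights, activations left in 16-bit,
# so at runtime weights are dequantized and the matmul runs in fp16/bf16.
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder model ID
    dataset="open_platypus",                     # calibration dataset
    recipe=recipe,
    output_dir="TinyLlama-1.1B-Chat-v1.0-W4A16",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```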
W8A8 quantization: symmetric quantization. Weight quantization supports per-channel scaling and also supports asymmetric quantization. W8A8 quantization of the DeepSeek-V2 model family requires the llm-compressor tool.

SmoothQuant quantized models: this section describes how to use the SmoothQuant quantization tool to quantize a model for inference.

W8A16 quantization ...
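As a rough illustration of the llm-compressor route mentioned above, a SmoothQuant + W8A8 recipe typically looks like the following. This is a sketch based on llm-compressor's published int8 examples; the model ID is a placeholder (DeepSeek-V2 models additionally need `trust_remote_code` and per-model handling), and exact arguments may vary by version:

```python
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot

# SmoothQuant first migrates activation outliers into the weights, then GPTQ
# quantizes weights and activations to int8 (the W8A8 scheme).
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]

oneshot(
    model="deepseek-ai/DeepSeek-V2-Lite-Chat",  # placeholder model ID
    dataset="open_platypus",                    # calibration dataset
    recipe=recipe,
    output_dir="DeepSeek-V2-Lite-Chat-W8A8",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```

Unlike the W4A16 path above, W8A8 quantizes activations too, so the matmuls themselves can run on int8 hardware paths — which is where the compute speedup comes from.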