Users can load 8-bit or 4-bit (FP4/NF4) models such as Whisper, ViT, and BLIP-2 out of the box. If you train on top of a quantized base model with the PEFT library using LoRA, you can merge the trained adapter into the base model for deployment without degrading inference performance. You can even merge the adapter on top of a dequantized model! Below is an example of loading a 4-bit model with NF4 quantization: ...
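To make the NF4 idea concrete, here is a minimal pure-Python sketch of block-wise NF4 quantization: scale a block of weights by its absolute maximum, round each scaled value to the nearest entry of a fixed 16-level codebook, and store only the 4-bit code plus the scale. The codebook values are reproduced from the QLoRA paper / bitsandbytes source and should be treated as illustrative; real libraries do this in fused CUDA kernels.

```python
# Sketch of NF4 (NormalFloat4) block quantization, assuming the
# 16-level codebook published with QLoRA. Illustrative only.

NF4_LEVELS = [
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
]

def quantize_block(weights):
    """Scale a block by its absmax, then round each scaled value to the
    nearest NF4 level; returns the 4-bit codes plus the scale."""
    absmax = max(abs(w) for w in weights) or 1.0  # guard all-zero blocks
    codes = [
        min(range(16), key=lambda i: abs(w / absmax - NF4_LEVELS[i]))
        for w in weights
    ]
    return codes, absmax

def dequantize_block(codes, absmax):
    """Map each code back through the codebook and rescale."""
    return [NF4_LEVELS[c] * absmax for c in codes]
```

In practice you never write this yourself: with transformers you pass a `BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")` as `quantization_config` to `from_pretrained`, and the quantization happens at load time.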
For the 3B model the performance drop is bigger, but 7B can fit in 24 GiB with bs=1 and seq_len=256, and NF4 helps increase seq_len further. 13B-30B models may also see a lower performance drop with QLoRA at bs=1.
SOTA low-bit LLM quantization (INT8/FP8/INT4/FP4/NF4) & sparsity; leading model compression techniques on TensorFlow, PyTorch, and ONNX Runtime - intel/neural-compressor