```python
from transformers import BitsAndBytesConfig

# Create a BitsAndBytesConfig object
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,               # quantize the model to 4-bit while loading
    bnb_4bit_quant_type="nf4",       # use the NF4 quantization type
    bnb_4bit_use_double_quant=True,  # nested (double) quantization for extra memory savings
)
```
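For context, this is what load-time quantization looks like with BnB: the config is passed straight to `from_pretrained`, and the weights are quantized as they are loaded. A minimal sketch; the checkpoint name is just a placeholder, any model works:

```python
from transformers import AutoModelForCausalLM

# from_pretrained picks up quantization_config and quantizes the
# weights to 4-bit on the fly, during load itself.
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",  # placeholder checkpoint
    quantization_config=quantization_config,
)
```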
The ask is to also support Hugging Face's own Optimum Quanto. Right now it is possible to use it, but only as post-load, on-demand quantization; there is no option to use it like BnB or TorchAO, where the quantization is applied automatically during load itself.

@yiyixuxu @sayakpaul @DN6 @asomoza ...
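To make the gap concrete, here is a minimal sketch of both flows. The post-load path uses the existing `optimum.quanto` API (`quantize` / `freeze`); the load-time path is hypothetical: `QuantoConfig` and its `weights` argument are assumed names that mirror how `BitsAndBytesConfig` plugs into `from_pretrained`, not an existing API.

```python
from diffusers import FluxTransformer2DModel
from optimum.quanto import freeze, qint8, quantize

# Today: the full-precision checkpoint is materialized first,
# then quantized on demand. Peak memory is paid for the
# unquantized weights before quantize() ever runs.
model = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev", subfolder="transformer"
)
quantize(model, weights=qint8)
freeze(model)

# Desired: quantize during load itself, like BnB / TorchAO.
# NOTE: QuantoConfig is a hypothetical config used for illustration;
# it mirrors BitsAndBytesConfig and does not exist today.
#
# quant_config = QuantoConfig(weights="int8")
# model = FluxTransformer2DModel.from_pretrained(
#     "black-forest-labs/FLUX.1-dev",
#     subfolder="transformer",
#     quantization_config=quant_config,
# )
```

The load-time path would avoid ever materializing the full-precision weights, which is exactly what the BnB and TorchAO backends already do.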