If `quantization_config` is passed through code, you need to add the `disable_exllama` attribute when creating or modifying that config object:

1. Set `disable_exllama` to `True`: make sure the value is explicitly set to `True` so the Exllama backend is disabled (see the sketch below).
2. Save the change to the quantization config: save the edit to the `config.json` file, or save the change to the `quantization_config` object in code.
3. Test the modification...
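For reference, a minimal sketch of the code path, using the `GPTQConfig` class from transformers (the checkpoint name is just a placeholder; substitute your own GPTQ model):

```python
from transformers import AutoModelForCausalLM, GPTQConfig

# Explicitly disable the Exllama backend; this lifts the requirement that
# every module live on a GPU, so CPU/disk offload is allowed.
gptq_config = GPTQConfig(bits=4, disable_exllama=True)

model = AutoModelForCausalLM.from_pretrained(
    "your-org/your-gptq-model",   # placeholder checkpoint; substitute yours
    device_map="auto",
    quantization_config=gptq_config,  # overrides the setting in config.json
)
```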
Qwen is the SOTA open-source LLM in China, and its 72b-chat model will be released this month. Qwen-int4 is supported by AutoGPTQ, but it becomes very slow when run on multiple GPUs. So if Exllama supported models like Qwen-72b-chat-gptq, that would be very exciting!
ValueError: Found modules on cpu/disk. Using Exllama backend requires all the modules to be on GPU. You can deactivate exllama backend by setting `disable_exllama=True` in the quantization config object

Hi, may I ask how you load the model? In my case, with a single GPU, I also had that probl...
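If you want to keep the Exllama backend instead of disabling it, one way to avoid this error (a sketch, assuming a single GPU with enough memory for the whole model) is to pin every module to one device so nothing is offloaded to CPU or disk:

```python
from transformers import AutoModelForCausalLM

# Map every module to GPU 0 so the Exllama requirement
# ("all the modules to be on GPU") is satisfied.
model = AutoModelForCausalLM.from_pretrained(
    "your-org/your-gptq-model",  # hypothetical checkpoint name
    device_map={"": 0},
)
```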