enforce eager mode with bnb quantization temporarily (#6846) · bong-furiosa/vllm-bong@bb54946
if self.quantization == "gptq" and not self.enforce_eager:
    # Related issue: https://github.com/vllm-project/vllm/issues/2147
    logger.warning("GPTQ does not support CUDA graph yet. Disabling "
                   "CUDA graph.")
    self.enforce_eager = True

def verify_with_parallel_config(
    self,
    ...
[Bugfix] Set enforce_eager automatically for mllama
enforce eager mode with bnb quantization temporarily
👋 Hi! Thank you for contributing to the vLLM project. Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which consists of a small and essential subset of CI tests to quickly catch errors.
Temporarily enforce eager mode for bitsandbytes quantization until the known issue (#5569) is fixed.
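Below is a minimal sketch of what this workaround could look like, assuming the bitsandbytes check mirrors the gptq check shown above. The helper name _verify_bnb_eager, the config attribute access, and the warning text are illustrative assumptions, not the committed code.

import logging

logger = logging.getLogger(__name__)

def _verify_bnb_eager(config) -> None:
    # Hypothetical helper: force eager mode (i.e. skip CUDA graph capture)
    # whenever bitsandbytes quantization is requested, mirroring the gptq
    # check above. `config` is assumed to expose `.quantization` and
    # `.enforce_eager`, like vLLM's ModelConfig.
    if config.quantization == "bitsandbytes" and not config.enforce_eager:
        # Related issue: https://github.com/vllm-project/vllm/issues/5569
        logger.warning("bitsandbytes quantization does not support CUDA "
                       "graph yet. Disabling CUDA graph.")
        config.enforce_eager = True

With a check like this in place, a user who requests bitsandbytes quantization falls back to eager execution automatically instead of hitting a CUDA graph failure, and the check can be removed once #5569 is resolved.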