GEMM (quantized): Best for larger context, up to batch size 8; faster than GEMV at batch size > 1, slower than GEMV at batch size = 1.
FP16 (non-quantized): Best for large batch sizes of 8 or larger; highest throughput. We recommend TGI or vLLM.
Examples Quantization Expect th...
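For reference, a minimal AutoAWQ quantization sketch in which the kernel is chosen via the `version` field of the quant config ("GEMM" or "GEMV"); the model paths are placeholders and the exact API may differ between AutoAWQ releases:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "path/to/fp16-model"   # placeholder
quant_path = "path/to/awq-model"    # placeholder
# version="GEMM" favors batch sizes > 1; "GEMV" is faster at batch size = 1
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```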
WARNING 01-02 20:21:59 config.py:179] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models. INFO 01-02 20:21:59 llm_engine.py:73] Initializing an LLM engine with config: model='/Yi/quantized_model', tokenizer='/Yi/quantized_model', tokenizer...
The speed can be slower than non-quantized models. 2024-04-28 20:25:31,039 INFO worker.py:1724 -- Started a local Ray instance. INFO 04-28 20:25:33 llm_engine.py:72] Initializing an LLM engine with config: model='Qwen1.5-32B-Chat-AWQ', tokenizer='Qwen1.5-32B-Chat-AWQ', ...
The speed can be slower than non-quantized models. INFO 01-14 20:09:03 llm_engine.py:73] Initializing an LLM engine with config: model='/data/sda/models/vicuna-7b-v1.5-awq', tokenizer='/data/sda/models/vicuna-7b-v1.5-awq', tokenizer_mode=auto, revision=None, tokenizer_revision=...
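The logs above come from loading AWQ checkpoints with vLLM's offline engine; a minimal sketch of that setup (path taken from the log above, flags assumed to match a recent vLLM release):

```python
from vllm import LLM, SamplingParams

# quantization="awq" is what triggers the "not fully optimized yet" warning above
llm = LLM(model="/Yi/quantized_model", quantization="awq", dtype="half")

outputs = llm.generate(["Hello, my name is"],
                       SamplingParams(temperature=0.8, max_tokens=64))
print(outputs[0].outputs[0].text)
```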
() with the `input_ids` being on a device type different than your model's device. `input_ids` is on cpu, whereas the model is on cuda. You may experience unexpected behaviors or slower generation. Please make sure that you have put `input_ids` to the correct device by calling for example...
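The usual fix for that warning is to move the tokenized inputs onto the model's device before calling generate; a sketch with a placeholder model id:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/model"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

# .to(model.device) puts input_ids (and attention_mask) on the same device as the model
inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```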
WARNING 04-11 18:00:30 config.py:211] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models. INFO 04-11 18:00:30 llm_engine.py:74] Initializing an LLM engine (v0.4.0.post1) with config: model='./data/models/Qwen1.5-32B-Chat-GPTQ-Int4...
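The GPTQ case follows the same pattern as the AWQ sketch above; only the quantization flag changes (checkpoint path abbreviated from the log above):

```python
from vllm import LLM

# Same offline-engine setup as the AWQ example; only the quantization flag differs.
llm = LLM(model="./data/models/Qwen1.5-32B-Chat-GPTQ-Int4", quantization="gptq", dtype="half")
```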
The speed can be slower than non-quantized models. 02. Warning: AWQ quantization is not fully optimized yet; it can be slower than the non-quantized model. 03. `assert linear_method is None` fails with an AssertionError. Reference: github.com/vllm-project. vLLM v0.2.6 supports Mixtral + AWQ. Thanks! Looks like the culprit is a vLLM version that is too old...
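Since that assertion error was traced to an outdated vLLM, a quick way to confirm the installed version (assuming the package exposes `__version__`, as recent releases do):

```python
import vllm

# Mixtral + AWQ needs vLLM >= 0.2.6 according to the thread above
print(vllm.__version__)
```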
I tried integrating the fused MoE Triton kernel with the AutoGPTQ Triton kernel yesterday, but it turned out to be a lot slower than the old vLLM implementation: end-to-end latency is over 30% worse at all batch sizes I tested. The AutoGPTQ kernel is already pretty slow as-is. ...
AutoGPTQ:
--triton               Use Triton.
--no_inject_fused_mlp  Triton mode only: disable the use of fused MLP, which will use less VRAM at the cost of slower inference.
--no_use_cuda_fp16     This can make models faster on some systems.
--desc_act             For models that do not have a quantize_...
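Those switches map roughly onto AutoGPTQ's Python loader; the keyword names below are assumptions based on `AutoGPTQForCausalLM.from_quantized` and may differ across AutoGPTQ versions, and the model directory is a placeholder:

```python
from auto_gptq import AutoGPTQForCausalLM

model = AutoGPTQForCausalLM.from_quantized(
    "path/to/gptq-model",      # placeholder
    device="cuda:0",
    use_triton=True,           # --triton
    inject_fused_mlp=False,    # --no_inject_fused_mlp: less VRAM, slower inference
    use_cuda_fp16=False,       # --no_use_cuda_fp16
)
# --desc_act likely corresponds to desc_act=True in BaseQuantizeConfig
# for checkpoints that do not ship their own quantize config.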