To maximize performance and reduce memory footprint, TensorRT-LLM allows the models to be executed using different quantization modes (see examples/gpt for concrete examples). TensorRT-LLM supports INT4 or INT8 weights (and FP16 activations; a.k.a. INT4/INT8 weight-only) as well as a complete...
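Weight-only quantization keeps the weights in INT4/INT8 (halving or quartering their memory footprint versus FP16) and dequantizes them back to the activation precision at matmul time. A minimal NumPy sketch of the INT8 per-channel variant — an illustration of the general idea, not the TensorRT-LLM implementation:

```python
import numpy as np

def quantize_weights_int8(w: np.ndarray):
    """Per-output-channel symmetric quantization: map [-max, max] to [-127, 127]."""
    scales = np.abs(w).max(axis=0) / 127.0
    q = np.round(w / scales).astype(np.int8)
    return q, scales

def weight_only_matmul(x: np.ndarray, q: np.ndarray, scales: np.ndarray):
    """Dequantize INT8 weights to the activation dtype, then run the matmul."""
    w_deq = q.astype(x.dtype) * scales.astype(x.dtype)
    return x @ w_deq

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 32)).astype(np.float16)   # FP16 weights
x = rng.standard_normal((4, 64)).astype(np.float16)    # FP16 activations

q, s = quantize_weights_int8(w)            # q is INT8: half the bytes of w
out = weight_only_matmul(x, q, s)
err = np.abs(out.astype(np.float32) - (x @ w).astype(np.float32)).max()
```

In a real kernel the dequantization is fused into the GEMM so the INT8 weights are expanded on the fly, trading a little extra compute for the reduced memory traffic.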