This document describes the different quantization methods implemented in TensorRT-LLM and contains a support matrix for the different models.

In-flight Batching

TensorRT-LLM supports in-flight batching of requests (also known as continuous batching or iteration-level batching). It's a technique that aims ...
To maximize performance and reduce memory footprint, TensorRT-LLM allows models to be executed using different quantization modes (see examples/gpt for concrete examples). TensorRT-LLM supports INT4 or INT8 weights with FP16 activations (a.k.a. INT4/INT8 weight-only) as well as a ...
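To illustrate the weight-only idea described above, here is a minimal sketch (not TensorRT-LLM's actual implementation) of symmetric per-channel INT8 weight-only quantization: weights are stored as INT8 with one floating-point scale per output channel, while activations remain in higher precision. All function and variable names are illustrative.

```python
# Hypothetical sketch of symmetric per-row INT8 weight-only quantization.
# Each row of the weight matrix gets its own scale so that the largest
# absolute weight maps to 127; activations would stay FP16/FP32.

def quantize_int8_weight_only(weights):
    """Quantize a 2D weight matrix (list of rows) to INT8, one scale per row."""
    q_rows, scales = [], []
    for row in weights:
        # Scale chosen so max |w| maps to 127; guard against all-zero rows.
        scale = max(abs(w) for w in row) / 127.0 or 1.0
        q_rows.append([max(-128, min(127, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales

def dequantize(q_rows, scales):
    """Recover approximate FP weights from INT8 values and per-row scales."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]

weights = [[0.5, -1.0, 0.25], [2.0, 0.0, -2.0]]
q, scales = quantize_int8_weight_only(weights)
recovered = dequantize(q, scales)
```

Because the scale is per row, the reconstruction error for any element is bounded by half a quantization step for that row, which is why weight-only modes can cut memory roughly 2x (INT8) or 4x (INT4) with little accuracy loss.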