TensorRT-LLM is a library for optimizing Large Language Model (LLM) inference. It provides state-of-the-art optimizations, including custom attention kernels, in-flight batching, paged KV caching, quantization (FP8, FP4, INT4 AWQ, INT8 SmoothQuant, and more), speculative decoding, and much more, to perform inference efficiently on NVIDIA GPUs. Recently re-architected with a ... TensorRT-LLM provides a Python API to build LLMs into ...
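The description above mentions a Python API for building and running LLMs. As a minimal sketch of what a generation script can look like with the high-level `LLM` API from the `tensorrt_llm` package, the example below is illustrative: the model identifier and sampling settings are placeholders, and the exact API surface may differ between releases.

```python
# Minimal text-generation sketch using TensorRT-LLM's high-level Python API.
# The model name and sampling parameters below are illustrative placeholders.
from tensorrt_llm import LLM, SamplingParams


def main():
    prompts = [
        "Hello, my name is",
        "The capital of France is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

    # Builds (or loads) an optimized engine for the given Hugging Face model
    # and runs batched generation on the local NVIDIA GPU(s).
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
    outputs = llm.generate(prompts, sampling_params)

    for output in outputs:
        print(f"Prompt:    {output.prompt!r}")
        print(f"Generated: {output.outputs[0].text!r}")


if __name__ == "__main__":
    main()
```

Under the hood, optimizations such as in-flight batching and paged KV caching apply automatically to the batched requests; the script only expresses prompts and sampling settings.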