TensorRT-LLM is an open-source library developed by NVIDIA to optimize the inference performance of large language models (LLMs) on NVIDIA GPUs. It incorporates numerous LLM-specific optimizations, such as custom attention kernels, in-flight batching, paged key-value caching, and various quantization techniques (e.g., FP8, INT4 AWQ, INT8 SmoothQuant), to enhance inference efficiency.
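For context, recent TensorRT-LLM releases expose a high-level Python `LLM` API that wraps engine building and in-flight batching behind a single object. The snippet below is a minimal offline-inference sketch under that assumption; the model name is a placeholder for whichever Hugging Face checkpoint is being benchmarked, and the sampling parameters are illustrative, not the benchmark's actual settings.

```python
# Minimal sketch using TensorRT-LLM's high-level LLM API (recent releases).
# The model name is a placeholder; sampling values are illustrative only.
from tensorrt_llm import LLM, SamplingParams

def main():
    # The TensorRT engine is built automatically the first time the model loads.
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

    prompts = ["What is in-flight batching?"]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

    # generate() batches the requests and returns one result per prompt.
    for output in llm.generate(prompts, sampling_params):
        print(output.outputs[0].text)

if __name__ == "__main__":
    main()
```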
Platform-Specific Instructions and Scripts for LLM-Inference-Bench