
Commit 587d063

kaiyux and Shixiaowei02 authored
Update TensorRT-LLM (#506)
* Update TensorRT-LLM --------- Co-authored-by: Shixiaowei02 <[email protected]>
1 parent a21e2f8 commit 587d063

File tree

465 files changed (+38430, -18870 lines)


.gitignore

Lines changed: 3 additions & 0 deletions
@@ -18,6 +18,9 @@ venv/
 .hypothesis/
 .idea/
 cpp/cmake-build-*
+cpp/.ccache/
+tensorrt_llm/libs
+tensorrt_llm/bindings.pyi
 
 # Testing
 .coverage.*

.gitmodules

Lines changed: 0 additions & 1 deletion
@@ -1,7 +1,6 @@
 [submodule "3rdparty/cutlass"]
   path = 3rdparty/cutlass
   url = https://github.com/NVIDIA/cutlass.git
-  branch = v2.10.0
 [submodule "3rdparty/json"]
   path = 3rdparty/json
   url = https://github.com/nlohmann/json.git

.pre-commit-config.yaml

Lines changed: 9 additions & 4 deletions
@@ -15,7 +15,7 @@ repos:
     rev: v4.1.0
     hooks:
       - id: check-added-large-files
-        exclude: 'cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/cubin'
+        exclude: 'cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/'
       - id: check-merge-conflict
       - id: check-symlinks
       - id: detect-private-key
@@ -33,10 +33,15 @@ repos:
       - id: clang-format
         types_or: [c++, c, cuda]
         exclude: |
-          (?x)^(
-            cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/cubin/.*
-          )$
+          (?x)^(.*cubin.cpp$ | .*fmha_cubin.h)$
   - repo: https://github.com/cheshirekow/cmake-format-precommit
     rev: v0.6.10
     hooks:
       - id: cmake-format
+  - repo: https://github.com/codespell-project/codespell
+    rev: v2.2.4
+    hooks:
+      - id: codespell
+        args:
+          - --skip=".git,3rdparty"
+          - --ignore-words-list=rouge,inout,atleast,strat

3rdparty/cutlass

Submodule cutlass updated 2041 files

3rdparty/json

Submodule json updated 165 files

README.md

Lines changed: 58 additions & 19 deletions
@@ -43,17 +43,22 @@ H200 FP8 achieves 11,819 tok/s on Llama2-13B on a single GPU, and is up to 1.9x
 - [Installation](#installation)
 - [Quick Start](#quick-start)
 - [Support Matrix](#support-matrix)
+  - [Devices](#devices)
+  - [Precision](#precision)
+  - [Key Features](#key-features)
+  - [Models](#models)
 - [Performance](#performance)
 - [Advanced Topics](#advanced-topics)
   - [Quantization](#quantization)
   - [In-flight Batching](#in-flight-batching)
   - [Attention](#attention)
   - [Graph Rewriting](#graph-rewriting)
-- [Benchmarking](#benchmarking)
+- [Benchmark](#benchmark)
 - [Troubleshooting](#troubleshooting)
-- [Release Notes](#release-notes)
-  - [Changelog](#changelog)
-  - [Known issues](#known-issues)
+- [Release notes](#release-notes)
+  - [Change Log](#change-log)
+  - [Known Issues](#known-issues)
+  - [Report Issues](#report-issues)
 
 ## TensorRT-LLM Overview
 
@@ -99,7 +104,7 @@ concepts used in TensorRT-LLM, we recommend you to read the following
 
 ## Installation
 
-*For Windows installation, see [`Windows/`](windows/).*
+*For Windows installation, see [`Windows`](windows/README.md).*
 
 TensorRT-LLM must be built from source, instructions can be found
 [here](./docs/source/installation.md). An image of a Docker container with
@@ -154,14 +159,14 @@ See the BLOOM [example](examples/bloom) for more details and options regarding t
 
 ***3. Run***
 
-The `summarize.py` script can be used to perform the summarization of articles
+The `../summarize.py` script can be used to perform the summarization of articles
 from the CNN Daily dataset:
 
 ```python
-python summarize.py --test_trt_llm \
-    --hf_model_location ./bloom/560M/ \
-    --data_type fp16 \
-    --engine_dir ./bloom/560M/trt_engines/fp16/1-gpu/
+python ../summarize.py --test_trt_llm \
+    --hf_model_dir ./bloom/560M/ \
+    --data_type fp16 \
+    --engine_dir ./bloom/560M/trt_engines/fp16/1-gpu/
 ```
 
 More details about the script and how to run the BLOOM model can be found in
@@ -193,13 +198,13 @@ Lovelace architectures. Certain limitations may, however, apply.
 Various numerical precisions are supported in TensorRT-LLM. The support for
 some of those numerical features require specific architectures:
 
-|                               | FP32  | FP16  | BF16  | FP8  | INT8 | INT4 |
-| :---------------------------- | :---- | :---- | :---- | :--- | :--- | :--- |
-| Volta (SM70)                  | Y     | Y     | N     | N    | Y    | Y    |
-| Turing (SM75)                 | Y     | Y     | N     | N    | Y    | Y    |
-| Ampere (SM80, SM86)           | Y     | Y     | Y     | N    | Y    | Y    |
-| Ada-Lovelace (SM89)           | Y     | Y     | Y     | Y    | Y    | Y    |
-| Hopper (SM90)                 | Y     | Y     | Y     | Y    | Y    | Y    |
+|                     | FP32 | FP16 | BF16 | FP8  | INT8 | INT4 |
+| :------------------ | :--- | :--- | :--- | :--- | :--- | :--- |
+| Volta (SM70)        | Y    | Y    | N    | N    | Y    | Y    |
+| Turing (SM75)       | Y    | Y    | N    | N    | Y    | Y    |
+| Ampere (SM80, SM86) | Y    | Y    | Y    | N    | Y    | Y    |
+| Ada-Lovelace (SM89) | Y    | Y    | Y    | Y    | Y    | Y    |
+| Hopper (SM90)       | Y    | Y    | Y    | Y    | Y    | Y    |
 
 In this release of TensorRT-LLM, the support for FP8 and quantized data types
 (INT8 or INT4) is not implemented for all the models. See the
@@ -237,19 +242,26 @@ The list of supported models is:
 * [Bert](examples/bert)
 * [Blip2](examples/blip2)
 * [BLOOM](examples/bloom)
-* [ChatGLM-6B](examples/chatglm6b)
-* [ChatGLM2-6B](examples/chatglm2-6b/)
+* [ChatGLM](examples/chatglm)
 * [Falcon](examples/falcon)
+* [Flan-T5](examples/enc_dec)
 * [GPT](examples/gpt)
 * [GPT-J](examples/gptj)
 * [GPT-Nemo](examples/gpt)
 * [GPT-NeoX](examples/gptneox)
+* [InternLM](examples/internlm)
 * [LLaMA](examples/llama)
 * [LLaMA-v2](examples/llama)
+* [Mistral](examples/llama)
 * [MPT](examples/mpt)
 * [OPT](examples/opt)
+* [Qwen](examples/qwen)
+* [Replit Code](examples/mpt)
 * [SantaCoder](examples/gpt)
 * [StarCoder](examples/gpt)
+* [T5](examples/enc_dec)
+
+Note: [Encoder-Decoder](examples/enc_dec/) provides general encoder-decoder support that contains many encoder-decoder models such as T5, Flan-T5, etc. We unroll the exact model names in the list above to let users find specific models easier.
 
 ## Performance
 
@@ -311,6 +323,33 @@ may happen. One possible solution is to reduce the amount of memory needed by
 reducing the maximum batch size, input and output lengths. Another option is to
 enable plugins, for example: `--use_gpt_attention_plugin`.
 
+* MPI + Slurm
+
+TensorRT-LLM is a [MPI](https://en.wikipedia.org/wiki/Message_Passing_Interface)-aware package that uses [`mpi4py`](https://mpi4py.readthedocs.io/en/stable/). If you are running scripts in a [Slurm](https://slurm.schedmd.com/) environment, you might encounter interferences:
+```
+--------------------------------------------------------------------------
+PMI2_Init failed to initialize. Return code: 14
+--------------------------------------------------------------------------
+--------------------------------------------------------------------------
+The application appears to have been direct launched using "srun",
+but OMPI was not built with SLURM's PMI support and therefore cannot
+execute. There are several options for building PMI support under
+SLURM, depending upon the SLURM version you are using:
+
+  version 16.05 or later: you can use SLURM's PMIx support. This
+  requires that you configure and build SLURM --with-pmix.
+
+  Versions earlier than 16.05: you must use either SLURM's PMI-1 or
+  PMI-2 support. SLURM builds PMI-1 by default, or you can manually
+  install PMI-2. You must then build Open MPI using --with-pmi pointing
+  to the SLURM PMI library location.
+
+Please configure as appropriate and try again.
+--------------------------------------------------------------------------
+```
+As a rule of thumb, if you are running TensorRT-LLM interactively on a Slurm node, prefix your commands with `mpirun -n 1` to run TensorRT-LLM in a dedicated MPI environment, not the one provided by your Slurm allocation.
+For example: `mpirun -n 1 python3 examples/gpt/build.py ...`
+
 ## Release notes
 
 * TensorRT-LLM requires TensorRT 9.1.0.4 and 23.08 containers.

benchmarks/cpp/README.md

Lines changed: 9 additions & 11 deletions
@@ -7,18 +7,14 @@ multiple GPUs or multiple nodes with multiple GPUs.
 
 ### 1. Build TensorRT-LLM and benchmarking source code
 
-Please follow the [`installation document`](../../../README.md) to build TensorRT-LLM.
+Please follow the [`installation document`](../../docs/source/installation.md) to build TensorRT-LLM.
+
+Note that the benchmarking source code for C++ runtime is not built by default, you can use the argument `--benchmarks` in [`build_wheel.py`](../../scripts/build_wheel.py) to build that.
 
 Windows users: Follow the
-[`Windows installation document`](../../../windows/README.md)
+[`Windows installation document`](../../windows/README.md)
 instead, and be sure to set DLL paths as specified in
-[Extra Steps for C++ Runtime Usage](../../../windows/README.md#extra-steps-for-c-runtime-usage).
-
-After that, you can build benchmarking source code for C++ runtime
-```
-cd cpp/build
-make -j benchmarks
-```
+[Extra Steps for C++ Runtime Usage](../../windows/README.md#extra-steps-for-c-runtime-usage).
 
 ### 2. Launch C++ benchmarking (Fixed BatchSize/InputLen/OutputLen)
 
@@ -44,7 +40,7 @@ Take GPT-350M as an example for single GPU
     --batch_size "1" \
    --input_output_len "60,20"
 
-# Expected ouput:
+# Expected output:
 # [BENCHMARK] batch_size 1 input_length 60 output_length 20 latency(ms) 40.81
 ```
 Take GPT-175B as an example for multiple GPUs
@@ -55,10 +51,12 @@ mpirun -n 8 ./benchmarks/gptSessionBenchmark \
     --batch_size "1" \
    --input_output_len "60,20"
 
-# Expected ouput:
+# Expected output:
 # [BENCHMARK] batch_size 1 input_length 60 output_length 20 latency(ms) 792.14
 ```
 
+If you want to obtain context and generation logits, you could build an enigne with `--gather_all_token_logits` and run gptSessionBenchmark with `--print_all_logits`. This will print a large number of logit values and has a certain impact on performance.
+
 *Please note that the expected outputs in that document are only for reference, specific performance numbers depend on the GPU you're using.*
 
 ### 3. Launch Batch Manager benchmarking (Inflight/V1 batching)

benchmarks/cpp/gptManagerBenchmark.cpp

Lines changed: 36 additions & 38 deletions
@@ -273,13 +273,9 @@ class GptServer
 {
 public:
     GptServer(std::filesystem::path const& trtEnginePath, TrtGptModelType modelType, int32_t maxBeamWidth,
-        batch_scheduler::SchedulerPolicy schedulerPolicy, std::optional<int32_t> maxNumSequences,
-        std::optional<int32_t> maxTokensInPagedKvCache, std::optional<float> kvCacheFreeGpuMemFraction,
-        std::optional<bool> enableTrtOverlap, std::shared_ptr<Recorder> recorder,
-        std::optional<uint64_t> terminateReqId)
+        batch_scheduler::SchedulerPolicy schedulerPolicy, TrtGptModelOptionalParams const& optionalParams,
+        std::shared_ptr<Recorder> recorder, std::optional<uint64_t> terminateReqId)
     {
-        const TrtGptModelOptionalParams& optionalParams = TrtGptModelOptionalParams(
-            maxNumSequences, maxTokensInPagedKvCache, kvCacheFreeGpuMemFraction, enableTrtOverlap);
         mBatchManager = std::make_shared<GptManager>(
             trtEnginePath, modelType, maxBeamWidth, schedulerPolicy,
             [this](int max_num_requests) { return getInferenceRequests(max_num_requests); },
@@ -460,10 +456,8 @@ std::pair<std::vector<std::vector<int32_t>>, std::vector<int32_t>> parseDataset(
 }
 
 void benchmarkGptManager(std::string const& modelName, std::filesystem::path const& engineDir, std::string const& type,
-    std::string const& datasetPath, std::shared_ptr<nvinfer1::ILogger> const& logger,
-    std::optional<int32_t> maxNumSequences, std::optional<int32_t> maxTokensInPagedKvCache,
-    std::optional<float> kvCacheFreeGpuMemFraction, std::optional<bool> enableTrtOverlap,
-    batch_scheduler::SchedulerPolicy schedulerPolicy)
+    std::string const& datasetPath, int beamWidth, std::shared_ptr<nvinfer1::ILogger> const& logger,
+    TrtGptModelOptionalParams const& optionalParams, batch_scheduler::SchedulerPolicy schedulerPolicy)
 {
     auto const worldConfig = WorldConfig::mpi(*logger);
 
@@ -482,6 +476,11 @@ void benchmarkGptManager(std::string const& modelName, std::filesystem::path con
         TLLM_LOG_ERROR(errStr);
     }
 
+    ITensor::SharedPtr beamWidthBuffer = BufferManager::cpu(ITensor::makeShape({1}), nvinfer1::DataType::kINT32);
+    auto beamWidthBufferPtr = bufferCast<SizeType>(*beamWidthBuffer);
+    *beamWidthBufferPtr = beamWidth;
+    auto beamWidthTensor = NamedTensor(beamWidthBuffer, "beam_width");
+
     // Load dataset
     auto dataset = parseDataset(datasetPath);
     std::vector<std::vector<NamedTensor>> tensors_list;
@@ -494,15 +493,16 @@ void benchmarkGptManager(std::string const& modelName, std::filesystem::path con
         auto input_ids_tensor = NamedTensor(nvinfer1::DataType::kINT32, input_ids_shape, "input_ids", input_ids.data());
         auto request_output_len_tensor
             = NamedTensor(nvinfer1::DataType::kINT32, {1, 1}, "request_output_len", &request_output_len);
-        std::vector<NamedTensor> tensors = {input_ids_tensor, request_output_len_tensor};
-        tensors_list.push_back(tensors);
+        std::vector<NamedTensor> tensors
+            = {std::move(input_ids_tensor), std::move(request_output_len_tensor), beamWidthTensor};
+        tensors_list.emplace_back(std::move(tensors));
     }
 
-    const int maxBeamWidth = 1;
+    const int maxBeamWidth = beamWidth;
     auto recorder = std::make_shared<Recorder>();
     uint64_t terminateReqId = num_samples + 1;
-    auto gptServer = std::make_shared<GptServer>(engineDir, modelType, maxBeamWidth, schedulerPolicy, maxNumSequences,
-        maxTokensInPagedKvCache, kvCacheFreeGpuMemFraction, enableTrtOverlap, recorder, terminateReqId);
+    auto gptServer = std::make_shared<GptServer>(
+        engineDir, modelType, maxBeamWidth, schedulerPolicy, optionalParams, recorder, terminateReqId);
 
     if (worldConfig.getRank() == 0)
     {
@@ -537,16 +537,18 @@ int main(int argc, char* argv[])
         "type", "Batching type: IFB or V1(non-IFB) batching.", cxxopts::value<std::string>()->default_value("IFB"));
     options.add_options()("dataset", "Dataset that is used for benchmarking BatchManager.",
        cxxopts::value<std::string>()->default_value(""));
+    options.add_options()(
+        "beam_width", "Specify beam width you want to benchmark.", cxxopts::value<int>()->default_value("1"));
 
-    options.add_options()("max_num_sequences", "Max number of Sequences.", cxxopts::value<int>()->default_value("-1"));
+    options.add_options()("max_num_sequences", "Max number of Sequences.", cxxopts::value<int>());
+    options.add_options()("max_tokens_in_paged_kvcache", "Max tokens in paged K-V Cache.", cxxopts::value<int>());
     options.add_options()(
-        "max_tokens_in_paged_kvcache", "Max tokens in paged K-V Cache.", cxxopts::value<int>()->default_value("-1"));
-    options.add_options()("kv_cache_free_gpu_mem_fraction", "K-V Cache Free Gpu Mem Fraction.",
-        cxxopts::value<float>()->default_value("-1"));
+        "kv_cache_free_gpu_mem_fraction", "K-V Cache Free Gpu Mem Fraction.", cxxopts::value<float>());
+    options.add_options()(
+        "enable_trt_overlap", "Overlap TRT context preparation and execution", cxxopts::value<bool>());
+
     options.add_options()("scheduler_policy", "Choose scheduler policy between max_utilization/guaranteed_no_evict.",
         cxxopts::value<std::string>()->default_value("guaranteed_no_evict"));
-    options.add_options()("enable_trt_overlap", "Overlap TRT context preparation and execution",
-        cxxopts::value<bool>()->default_value("false"));
 
     options.add_options()("log_level", "Choose log level between verbose/info/warning/error/internal_error.",
         cxxopts::value<std::string>()->default_value("error"));
@@ -573,32 +575,29 @@ int main(int argc, char* argv[])
     // Argument: Dataset
     auto const datasetPath = result["dataset"].as<std::string>();
 
+    // Argument: beam width
+    auto const beamWidth = result["beam_width"].as<int>();
+
+    TrtGptModelOptionalParams optionalParams;
     // Argument: Max Num Sequences
-    std::optional<int32_t> maxNumSequences = std::nullopt;
-    if (result["max_num_sequences"].as<int>() != -1)
+    if (result.count("max_num_sequences"))
     {
-        maxNumSequences = result["max_num_sequences"].as<int>();
+        optionalParams.maxNumSequences = result["max_num_sequences"].as<int>();
     }
-
     // Argument: Max tokens in paged K-V Cache
-    std::optional<int32_t> maxTokensInPagedKvCache = std::nullopt;
-    if (result["max_tokens_in_paged_kvcache"].as<int>() != -1)
+    if (result.count("max_tokens_in_paged_kvcache"))
    {
-        maxTokensInPagedKvCache = result["max_tokens_in_paged_kvcache"].as<int>();
+        optionalParams.kvCacheConfig.maxTokens = result["max_tokens_in_paged_kvcache"].as<int>();
     }
-
     // Argument: K-V Cache Free Gpu Mem Fraction
-    std::optional<float> kvCacheFreeGpuMemFraction = std::nullopt;
-    if (result["kv_cache_free_gpu_mem_fraction"].as<float>() != -1)
+    if (result.count("kv_cache_free_gpu_mem_fraction"))
     {
-        kvCacheFreeGpuMemFraction = result["kv_cache_free_gpu_mem_fraction"].as<float>();
+        optionalParams.kvCacheConfig.freeGpuMemoryFraction = result["kv_cache_free_gpu_mem_fraction"].as<float>();
     }
-
     // Argument: Enable TRT overlap
-    std::optional<bool> enableTrtOverlap = std::nullopt;
-    if (result["enable_trt_overlap"].as<bool>() != -1)
+    if (result.count("enable_trt_overlap"))
     {
-        enableTrtOverlap = result["enable_trt_overlap"].as<bool>();
+        optionalParams.enableTrtOverlap = result["enable_trt_overlap"].as<bool>();
     }
 
     // Argument: Scheduler policy
@@ -652,8 +651,7 @@ int main(int argc, char* argv[])
     try
     {
        benchmarkGptManager(result["model"].as<std::string>(), result["engine_dir"].as<std::string>(), type,
-            datasetPath, logger, maxNumSequences, maxTokensInPagedKvCache, kvCacheFreeGpuMemFraction, enableTrtOverlap,
-            schedulerPolicy);
+            datasetPath, beamWidth, logger, optionalParams, schedulerPolicy);
     }
     catch (const std::exception& e)
     {
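A note on the command-line refactor above: the benchmark now declares the tuning flags without sentinel defaults and checks `result.count(...)` before reading them, collecting set values into a single `TrtGptModelOptionalParams`. The snippet below is a minimal, self-contained sketch of that cxxopts pattern only; `OptionalParamsSketch` and the program name `benchmark_sketch` are hypothetical stand-ins and not TensorRT-LLM types, while the option names are the ones visible in the diff.

```cpp
// Sketch of optional-flag handling with cxxopts, assuming only the cxxopts
// library. OptionalParamsSketch is a hypothetical stand-in for
// TrtGptModelOptionalParams.
#include <cxxopts.hpp>
#include <iostream>
#include <optional>

struct OptionalParamsSketch
{
    std::optional<int> maxNumSequences;          // stays unset unless the flag is passed
    std::optional<float> freeGpuMemoryFraction;
    std::optional<bool> enableTrtOverlap;
};

int main(int argc, char* argv[])
{
    cxxopts::Options options("benchmark_sketch", "Optional-flag handling sketch");
    options.add_options()
        ("beam_width", "Beam width to benchmark.", cxxopts::value<int>()->default_value("1"))
        ("max_num_sequences", "Max number of sequences.", cxxopts::value<int>())
        ("kv_cache_free_gpu_mem_fraction", "Free GPU memory fraction for the K-V cache.", cxxopts::value<float>())
        ("enable_trt_overlap", "Overlap TRT context preparation and execution.", cxxopts::value<bool>());

    auto result = options.parse(argc, argv);

    OptionalParamsSketch params;
    // result.count(name) is non-zero only when the flag was actually given,
    // so no sentinel default (such as -1) is needed to detect "not set".
    if (result.count("max_num_sequences"))
    {
        params.maxNumSequences = result["max_num_sequences"].as<int>();
    }
    if (result.count("kv_cache_free_gpu_mem_fraction"))
    {
        params.freeGpuMemoryFraction = result["kv_cache_free_gpu_mem_fraction"].as<float>();
    }
    if (result.count("enable_trt_overlap"))
    {
        params.enableTrtOverlap = result["enable_trt_overlap"].as<bool>();
    }

    // beam_width has a default value, so it can always be read directly.
    auto const beamWidth = result["beam_width"].as<int>();
    std::cout << "beam_width=" << beamWidth
              << " max_num_sequences set=" << params.maxNumSequences.has_value() << std::endl;
    return 0;
}
```

The benefit of this pattern is that an unset flag remains `std::nullopt` in the parameter struct, letting downstream code fall back to its own defaults instead of interpreting magic values like -1.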
