Weekly release: 0.19.0rc0 #3588
kaiyux announced in Announcements
Hi,
The TensorRT-LLM team is pleased to announce that we have pushed a weekly release, 0.19.0rc0, and an update to the Triton backend on April 15, 2025. The 0.19.0rc0 dev release includes:

- Supported gemma-3-1b-it; see `examples/gemma/README.md` (feat: Support gemma-3-1b-it #3247)
- Registered `ENABLE_MULTI_DEVICE` and `ENABLE_UCX` as CMake options (feat: register ENABLE_MULTI_DEVICE and ENABLE_UCX as CMake options #3343)
- Ran PyExecutor's inference flow to estimate `max_num_tokens` for `kv_cache_manager` (feat: Run PyExecutor's inference flow to estimate max_num_tokens for kv_cache_manager #3092)
- Supported the `TLLM_OVERRIDE_LAYER_NUM` and `TLLM_TRACE_MODEL_FORWARD` environment variables for debugging; see the sketch after this list (feat: Support TLLM_OVERRIDE_LAYER_NUM and TLLM_TRACE_MODEL_FORWARD for debugging #3417)
- Applied the new torch-flow compatible `AutoTuner` to both the Fused MoE and NVFP4 Linear operators (feat: Apply the new torch-flow compatible AutoTuner to both Fused MoE and NVFP4 Linear operators. #3151)
- Introduced a `UserBuffers` allocator for the PyTorch flow (feat: Introduce UB allocator for pytorch flow #3257)
- Enhanced the integrated robustness of scaffolding with `__init__.py` (feat: Enhance the integrated robustness of scaffolding with __init__.… #3312)
- Added `numNodes` to `ParallelConfig` (feat: Add numNodes to ParallelConfig #3346)
- Added Qwen2 MoE to the torch flow and fixed the wrongly imported `KvCacheConfig` in `examples/gpqa_llmapi.py`; see the sketch after this list (feat: add qwen2 moe to torch flow; fix wrong imported KvCacheConfig in gpqa… #3369)
- Fixed `max_seq_len` in `executor_config` (fix: fix max_seq_len in executor_config #3487)
- Allowed the `context_and_generation` request type in disaggregated overlap (fix: Allow context_and_generation request type in disagg overlap #3489)
- Fixed the `py_decoding_iter` update in the decoder (fix: fix the py_decoding_iter update in decoder #3297)
- Fixed the missing bias add for `FP4Linear` (fix [NVBUG 5208255] Fix missing bias add for FP4Linear. #3361)
- Fixed a runtime error in `test_deepseek_allreduce.py` (fix: runtime error in test_deepseek_allreduce.py #3226)
- Fixed torch nvsmall through `PyExecutor` and improved its TP support (Fix torch nvsmall through pyexecutor and fix its TP support #3238)

The cut-off commit for this release is 258ae9c. The code changes can be seen here: 5aeef6d...258ae9c.
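For the new debugging environment variables (#3417), here is a minimal usage sketch in Python. The value formats shown are assumptions on our part rather than documented syntax, and the model name is illustrative; the point shown is that the variables must be set before TensorRT-LLM loads the model.

```python
import os

# Assumed semantics: override the number of decoder layers that are built,
# shrinking the model for quick debugging iterations. Value format assumed.
os.environ["TLLM_OVERRIDE_LAYER_NUM"] = "2"

# Assumed semantics: enable tracing of the model's forward pass.
os.environ["TLLM_TRACE_MODEL_FORWARD"] = "1"

# Import after the environment is set so the variables take effect.
from tensorrt_llm import LLM

llm = LLM(model="google/gemma-3-1b-it")  # illustrative model choice
print(llm.generate(["Hello"])[0].outputs[0].text)
```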
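Similarly, for the `KvCacheConfig` import fix in `examples/gpqa_llmapi.py` (#3369), here is a sketch of wiring a `KvCacheConfig` into the LLM API. The `tensorrt_llm.llmapi` import path and the `free_gpu_memory_fraction` field match our understanding of the LLM API, but treat them as assumptions for your exact version.

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig  # assumed import path

# Reserve 90% of free GPU memory for the KV cache (field name assumed).
kv_cache_config = KvCacheConfig(free_gpu_memory_fraction=0.9)

llm = LLM(
    model="Qwen/Qwen1.5-MoE-A2.7B-Chat",  # illustrative MoE model choice
    kv_cache_config=kv_cache_config,
)
print(llm.generate(["What does GPQA evaluate?"])[0].outputs[0].text)
```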
Thanks,
The TensorRT-LLM Engineering Team