55 commits
5fa1914
[None][chore] Bump version to 1.1.0rc0 (#6651)
yiqingy0 Aug 7, 2025
85af621
[TRTLLM-6683][feat] Support LoRA reload CPU cache evicted adapter (#6…
amitz-nv Aug 7, 2025
6c1f7d8
[None][test] correct test-db context for perf yaml file (#6686)
ruodil Aug 7, 2025
8207d5f
[None] [feat] Add model gpt-oss (#6645)
hlu1 Aug 7, 2025
0a467b0
[https://nvbugs/5409414][fix] fix Not registered specs (#6660)
xinhe-nv Aug 7, 2025
8ec3b1d
[None][feat] : Add FP8 context MLA support for SM120 (#6059)
peaceh-nv Aug 7, 2025
c23e8e7
[TRTLLM-6092][doc] Add LoRA feature usage doc (#6603)
shaharmor98 Aug 7, 2025
1b9781e
[TRTLLM-6409][feat] Enable guided decoding with speculative decoding …
syuoni Aug 7, 2025
453a06e
[TRTLLM-6881][feat] Include attention dp rank info with KV cache even…
pcastonguay Aug 7, 2025
3c44b44
[None][infra] Fix guardwords (#6711)
EmmaQiaoCh Aug 7, 2025
46357e7
[None][package] Pin cuda-python version to >=12,<13 (#6702)
yiqingy0 Aug 7, 2025
0223de0
[None][doc] Add deployment guide section for VDR task (#6669)
nv-guomingz Aug 7, 2025
4055b76
[None][fix] disagg ctx pp4 + gen pp4 integ test (#6489)
raayandhar Aug 7, 2025
e968f98
[None][feat] Clean up ngram auto mode, add max_concurrency to configs…
mikeiovine Aug 7, 2025
3b2dd40
[None][chore] Remove py_executor from disagg gh team (#6716)
pcastonguay Aug 7, 2025
4ecda91
[https://nvbugs/5423962][fix] Address broken links (#6531)
chenopis Aug 7, 2025
db8dc97
[None][fix] Migrate to new cuda binding package name (#6700)
tongyuantongyu Aug 7, 2025
980929e
[https://nvbugs/5410687][fix] Hopper w4a8 groupwise MoE interleave (#…
symphonylyh Aug 7, 2025
8227616
[None][feat] Add NCCL Symmetric Integration for All Reduce (#4500)
Tabrizian Aug 8, 2025
efca359
[TRTLLM-6785][feat] BREAKING CHANGE Enable TRTLLM sampler by default …
dcampora Aug 8, 2025
88ced50
[TRTQA-2920][fix] Add failed cases into waives.txt (#6719)
xinhe-nv Aug 8, 2025
22f45a0
[TRTLLM-5252][test] add for mistral_small_3.1_24b perf test (#6685)
ruodil Aug 8, 2025
2f2f5cc
[TRTLLM-6744][feat] Remove input_sf swizzle for module WideEPMoE (#6231)
StudyingShao Aug 8, 2025
1cf6694
[None][fix] Fix unnecessary GPU synchronization in torch sampler caus…
zhanghaotong Aug 8, 2025
aee828d
[TRTLLM-6854][feat] Enable guided decoding with disagg serving (#6704)
syuoni Aug 8, 2025
064eb7a
[TRTLLM-5252][fix] Propagate mapping to intermediate layers (#6611)
2ez4bz Aug 8, 2025
b15d6fb
[None][test] fix yml condition error under qa folder (#6734)
ruodil Aug 8, 2025
9687bb4
[None][doc] Add doc for multimodal feature support matrix (#6619)
chang-l Aug 8, 2025
d913955
[TRTLLM-6898][feat] make fused_moe_cute_dsl work on blackwell (#6616)
limin2021 Aug 8, 2025
294e0d3
[https://nvbugs/5436461][infra] Adjust free_gpu_memory_fraction of te…
leslie-fang25 Aug 8, 2025
9ff4e75
[None][refactor] Combine resmooth_to_fp8_e8m0 and transform_sf_into_r…
yuxianq Aug 8, 2025
5f45227
[https://nvbugs/5437106][fix] Fix llama4 scout TRTLLM attn_backend (#…
JunyiXu-nv Aug 8, 2025
32ad7f3
[None][fix] Remove lock related typo in py_executor (#6653)
lancelly Aug 8, 2025
ebdc43e
[None][feat] move kv cache measure into transfer session (#6633)
zhengd-nv Aug 8, 2025
e251f7c
[None][fix]revert kvcache transfer (#6709)
chuangz0 Aug 8, 2025
b8f036f
[TRTLLM-6650][fix] Enhance CUDA graph + Beam search to correctly hand…
stnie Aug 8, 2025
d45236b
[TRTLLM-6308][feat] Support Aggregate mode for phi4-mm (#6184)
Wanli-Jiang Aug 8, 2025
90145cf
[None][feat] Optimize CUDA graph memory usage for spec decode cases (…
mikeiovine Aug 8, 2025
efcb8f7
[TRTLLM-7025] [infra] Reorganize CODEOWNERS to rectify `examples` map…
venkywonka Aug 8, 2025
cc0f4c8
[None][doc] Move AutoDeploy README.md to torch docs (#6528)
Fridah-nv Aug 8, 2025
d066750
[None][fix] WAR GPT OSS on H20 with Triton MOE (#6721)
dongfengy Aug 8, 2025
9778788
[TRTLLM-6420][feat] add support for Eclairv2 model - cherry-pick chan…
yibinl-nvidia Aug 9, 2025
bcf5ec0
[None][feat] Core Metrics Implementation (#5785)
hcyezhang Aug 9, 2025
d643aef
[Perf] Improve Llama4 performance for small max_seqlen cases (#6306)
nv-yilinf Aug 9, 2025
de47282
[TRTLLM-6637][feat] Resolve KV cache divergence issue (#6628)
ziyixiong-nv Aug 9, 2025
ee19ca5
[None][infra] Waive test main 0808 (#6751)
EmmaQiaoCh Aug 10, 2025
3c5aec1
[#5048][enhance] AutoDeploy: Optimize prepare_inputs (#6634)
galagam Aug 10, 2025
199f306
[None][chore][kv cache manager] Dead code elimination, we no longer r…
eopXD Aug 10, 2025
14b36e0
[TRTLLM-6174][feat] Enable FP32 mamba ssm cache (#6574)
shaharmor98 Aug 10, 2025
4142320
[https://nvbugs/5444937][fix] Fixing kv_cache_event unit test (#6753)
pcastonguay Aug 10, 2025
b6baa9e
[TRTLLM-6823][doc] Add checkpoint refactor docs (#6592)
shaharmor98 Aug 10, 2025
60073a7
[None][feat] Support SharedTensor on MultimodalParams (#6254)
yechank-nvidia Aug 11, 2025
4b4b91a
[None][feat] improve dataloading for benchmark_dataset by using batch…
zerollzeng Aug 11, 2025
767879e
[https://nvbugs/5431127][fix] Run test_disaggregated_deepseek_v3_lite…
bo-nv Aug 11, 2025
2cf31b5
relax tensor device type check to fix wideEP loading and fix argument
dongxuy04 Aug 11, 2025
52 changes: 28 additions & 24 deletions .github/CODEOWNERS
@@ -6,13 +6,39 @@
# Without approval from a member of this team, PRs cannot be merged to release branches.
# * @NVIDIA/trt-llm-release-branch-approval

## TensorRT-LLM Infra
### CI
/jenkins @NVIDIA/trt-llm-ci-infra-devs @NVIDIA/trt-llm-infra-devs
### Setup
/docker @NVIDIA/trt-llm-setup-infra-devs @NVIDIA/trt-llm-infra-devs
### Github workflows
/.github @NVIDIA/trt-llm-gh-workflows-infra-devs @NVIDIA/trt-llm-infra-devs
/.coderabbit.yaml @NVIDIA/trt-llm-gh-workflows-infra-devs @NVIDIA/trt-llm-infra-devs

## TensorRT-LLM - Docs
/docs @NVIDIA/trt-llm-doc-owners

## Examples
/examples @NVIDIA/trt-llm-doc-owners

## TensorRT-LLM - Triton backend
/triton_backend @NVIDIA/trt-llm-triton-backend-devs

# TensorRT-LLM Pytorch backend
/tensorrt_llm/_torch @NVIDIA/trt-llm-torch-devs

## TensorRT-LLM Pytorch - Modules
/tensorrt_llm/_torch/modules @NVIDIA/trt-llm-torch-modules

## TensorRT-LLM Pytorch Models
/tensorrt_llm/_torch/models @NVIDIA/trt-llm-torch-models-devs
/examples/models @NVIDIA/trt-llm-torch-models-devs @NVIDIA/trt-llm-doc-owners

## TensorRT-LLM Pytorch backend - runtime
/tensorrt_llm/_torch/pyexecutor @NVIDIA/trt-llm-torch-runtime-devs
## TensorRT-LLM Pytorch backend - AutoDeploy flow
/tensorrt_llm/_torch/auto_deploy @NVIDIA/trt-llm-torch-autodeploy-devs
/tensorrt_llm/examples/auto_deploy @NVIDIA/trt-llm-torch-autodeploy-devs
/examples/auto_deploy @NVIDIA/trt-llm-torch-autodeploy-devs @NVIDIA/trt-llm-doc-owners

## TensorRT-LLM Pytorch - Speculative Decoding
/tensorrt_llm/_torch/speculative @NVIDIA/trt-llm-torch-spec-decoding
@@ -31,12 +57,6 @@
/tensorrt_llm/_torch/attention_backend @NVIDIA/trt-llm-torch-attention-devs
/tensorrt_llm/_torch/modules/attention.py @NVIDIA/trt-llm-torch-attention-devs

## TensorRT-LLM Pytorch - Modules
/tensorrt_llm/_torch/modules @NVIDIA/trt-llm-torch-modules


## TensorRT-LLM Pytorch Models
/tensorrt_llm/_torch/models @NVIDIA/trt-llm-torch-models-devs

### TensorRT-LLM Pytorch - Models - Gemma
/tensorrt_llm/_torch/models/modeling_gemma3.py @NVIDIA/trt-llm-torch-models-gemma-devs @NVIDIA/trt-llm-torch-models-devs
@@ -108,8 +128,6 @@
/cpp/tensorrt_llm/runtime/loraUtils.cpp @NVIDIA/trt-llm-torch-peft
/cpp/tensorrt_llm/runtime/loraUtils.h @NVIDIA/trt-llm-torch-peft

## TensorRT-LLM - Triton backend
/triton_backend @NVIDIA/trt-llm-triton-backend-devs

## TensorRT-LLM trtllm-bench Reviewers
/tensorrt_llm/bench @NVIDIA/trtllm-bench-reviewers
@@ -121,10 +139,9 @@ docs/source/performance/perf-benchmarking.md @NVIDIA/trtllm-bench-reviewers
/tensorrt_llm/executor @NVIDIA/trt-llm-llmapi-devs

## TensorRT-LLM LLM Disaggregated
/examples/disaggregated @NVIDIA/trt-llm-disagg-devs
/examples/disaggregated @NVIDIA/trt-llm-disagg-devs @NVIDIA/trt-llm-doc-owners
/tensorrt_llm/disaggregated_params.py @NVIDIA/trt-llm-disagg-devs
/tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py @NVIDIA/trt-llm-disagg-devs
/tensorrt_llm/_torch/pyexecutor/py_executor.py @NVIDIA/trt-llm-disagg-devs
/cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp @NVIDIA/trt-llm-disagg-devs
/cpp/tensorrt_llm/batch_manager/cacheFormatter.h @NVIDIA/trt-llm-disagg-devs
/cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp @NVIDIA/trt-llm-disagg-devs
@@ -135,19 +152,6 @@ docs/source/performance/perf-benchmarking.md @NVIDIA/trtllm-bench-reviewers
/cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp @NVIDIA/trt-llm-disagg-devs
/cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.h @NVIDIA/trt-llm-disagg-devs

## TensorRT-LLM Infra

### CI
/jenkins @NVIDIA/trt-llm-ci-infra-devs @NVIDIA/trt-llm-infra-devs
### Setup
/docker @NVIDIA/trt-llm-setup-infra-devs @NVIDIA/trt-llm-infra-devs
### Github workflows
/tensorrt_llm/.github @NVIDIA/trt-llm-gh-workflows-infra-devs @NVIDIA/trt-llm-infra-devs
/tensorrt_llm/.coderabbit.yaml @NVIDIA/trt-llm-gh-workflows-infra-devs @NVIDIA/trt-llm-infra-devs

## TensorRT-LLM - Docs
/docs @NVIDIA/trt-llm-doc-owners
/examples @NVIDIA/trt-llm-doc-owners

# The rule below requires that any PR modifying public APIs must be approved by at least one member
# of the NVIDIA/trt-llm-committed-api-review-committee or NVIDIA/trt-llm-noncommitted-api-review-committee team.
2 changes: 1 addition & 1 deletion README.md
@@ -9,7 +9,7 @@ TensorRT-LLM
[![python](https://img.shields.io/badge/python-3.10-green)](https://www.python.org/downloads/release/python-31012/)
[![cuda](https://img.shields.io/badge/cuda-12.9.1-green)](https://developer.nvidia.com/cuda-downloads)
[![trt](https://img.shields.io/badge/TRT-10.11.0-green)](https://developer.nvidia.com/tensorrt)
[![version](https://img.shields.io/badge/release-1.0.0rc6-green)](./tensorrt_llm/version.py)
[![version](https://img.shields.io/badge/release-1.1.0rc0-green)](./tensorrt_llm/version.py)
[![license](https://img.shields.io/badge/license-Apache%202-blue)](./LICENSE)

[Architecture](./docs/source/torch/arch_overview.md)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Performance](./docs/source/performance/perf-overview.md)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Examples](https://nvidia.github.io/TensorRT-LLM/quick-start-guide.html)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Documentation](./docs/source/)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Roadmap](https://github.com/NVIDIA/TensorRT-LLM/issues?q=is%3Aissue%20state%3Aopen%20label%3Aroadmap)
22 changes: 20 additions & 2 deletions cpp/include/tensorrt_llm/batch_manager/kvCacheEventManager.h
@@ -18,6 +18,7 @@

#include "tensorrt_llm/executor/executor.h"

#include <atomic>
#include <chrono>
#include <condition_variable>
#include <deque>
@@ -36,7 +37,8 @@ using BlockPtr = std::shared_ptr<KVCacheBlock>;
class KVCacheEventManager
{
public:
explicit KVCacheEventManager(size_t maxKVEventEntries);
explicit KVCacheEventManager(size_t maxKVEventEntries, std::optional<SizeType32> attentionDpRank = std::nullopt,
std::optional<SizeType32> attentionDpSize = std::nullopt, SizeType32 attentionDpEventsGatherPeriodMs = 5);

~KVCacheEventManager();
KVCacheEventManager(KVCacheEventManager& other) = delete;
@@ -61,14 +63,19 @@ class KVCacheEventManager
// Worker thread which adds events to mEvents.
void worker();

    // Thread which exchanges events if attention DP is enabled
void exchangeAttentionDpThread();

private:
// Add an event to mEventQueue
void enqueueEvent(executor::KVCacheEvent&& event);

/// @brief Flag to terminate the worker
bool mRun;
std::atomic<bool> mRun;
/// @brief Worker thread
std::thread mWorkerThread;
/// @brief Exchange thread for attention DP events
std::thread mExchangeAttentionDpThread;

/// @brief The deque of events
std::deque<executor::KVCacheEvent> mEvents;
@@ -91,6 +98,17 @@ class KVCacheEventManager
size_t mMaxSize;
/// @brief An auto-incrementing event id counter
size_t mEventId;

    /// @brief Attention DP rank and size
/// If set, we will exchange KV cache events and accumulate on rank 0
std::optional<SizeType32> mAttentionDpRank;
std::optional<SizeType32> mAttentionDpSize;

    /// @brief The period in milliseconds to gather attention DP events across ranks
SizeType32 mAttentionDpEventsGatherPeriodMs;

/// @brief MPI communicator for attention DP
std::unique_ptr<tensorrt_llm::mpi::MpiComm> mMpiComm;
};

} // namespace tensorrt_llm::batch_manager::kv_cache_manager
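
For orientation only, here is a minimal construction sketch against the extended constructor shown above. It assumes the header and namespace from this diff; the capacity, rank, group size, and gather period are illustrative values (and an attention-DP build would additionally expect an initialized MPI environment), not behavior asserted by this PR.

```cpp
// Sketch only: exercises the extended KVCacheEventManager constructor from this diff.
#include "tensorrt_llm/batch_manager/kvCacheEventManager.h"

#include <memory>

using tensorrt_llm::batch_manager::kv_cache_manager::KVCacheEventManager;

int main()
{
    // Existing single-argument usage keeps working: the attention-DP parameters default to nullopt.
    auto localOnly = std::make_unique<KVCacheEventManager>(/*maxKVEventEntries=*/1024);

    // Hypothetical attention-DP usage: rank 2 of a 4-way group; per the new members,
    // events are gathered and accumulated on rank 0 every attentionDpEventsGatherPeriodMs.
    auto withAttentionDp = std::make_unique<KVCacheEventManager>(
        /*maxKVEventEntries=*/1024,
        /*attentionDpRank=*/2,
        /*attentionDpSize=*/4,
        /*attentionDpEventsGatherPeriodMs=*/5);

    return 0;
}
```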
45 changes: 7 additions & 38 deletions cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
@@ -536,8 +536,7 @@ class WindowBlockManager
SizeType32 sizePerHead, SizeType32 tokensPerBlock, SizeType32 blocksInPrimaryPool,
SizeType32 blocksInSecondaryPool, SizeType32 maxNumSequences, std::shared_ptr<runtime::CudaStream> stream,
bool onboardBlocks, CacheType cacheType, std::optional<executor::RetentionPriority> secondaryOffloadMinPriority,
std::shared_ptr<KVCacheEventManager> eventManager, bool enableHashKey, bool enablePartialReuse,
bool copyOnPartialReuse);
std::shared_ptr<KVCacheEventManager> eventManager, bool enablePartialReuse, bool copyOnPartialReuse);

~WindowBlockManager();

@@ -633,11 +632,6 @@
return mAllBlocksById.at(blockId);
}

[[nodiscard]] BlockMapIterRange getBlocksByHash(size_t hash) const
{
return mContextBlocksByHash.equal_range(hash);
}

[[nodiscard]] SizeType32 getTokensPerBlock() const noexcept
{
return mTokensPerBlock;
@@ -723,10 +717,6 @@
//! \param blockIds Id of each block.
void storeBlocks(std::vector<BlockKey> const& blockKeys, std::vector<KVCacheBlock::IdType> const& blockIds);

void addBlockToHashMap(BlockPtr const& block);

void removeBlockFromHashMap(BlockPtr const& block);

[[nodiscard]] bool verifyQueueIntegrity();

// Only needed when sliding window attention + paged context fmha are used together.
@@ -808,8 +798,6 @@
SizeType32 mTokensPerBlock;
// List of all blocks by idx
std::vector<BlockPtr> mAllBlocksById;
// List of all context blocks by hash
BlockMap mContextBlocksByHash;
// Dummy block acting as root for BlockToken searches
BlockPtr mCachedBlocksRoot;
// KV cache type (self or cross)
@@ -841,8 +829,6 @@
double mReusedTokens;
// Total number of input tokens
double mTotalInputTokens;
// Whether or not to maintain a hashmap of blocks.
bool mEnableHashKey;
// Whether blocks that are partially matched should be reused.
bool mEnablePartialReuse;
// Whether partially matched blocks that are already in use should be copied and reused.
@@ -863,8 +849,8 @@ class BlockManager
std::optional<TempAttentionWindowInputs> const& tempAttentionWindowInputs, nvinfer1::DataType dtype,
SizeType32 sinkBubbleLength, bool onboardBlocks, CacheType cacheType = CacheType::kSELF,
std::optional<executor::RetentionPriority> secondaryOffloadMinPriority = std::nullopt,
std::shared_ptr<KVCacheEventManager> eventManager = nullptr, bool enableHashKey = false,
bool enablePartialReuse = true, bool copyOnPartialReuse = true);
std::shared_ptr<KVCacheEventManager> eventManager = nullptr, bool enablePartialReuse = true,
bool copyOnPartialReuse = true);

BlockManager(BlockManager const&) = delete;
BlockManager& operator=(BlockManager const&) = delete;
@@ -1081,11 +1067,6 @@
return mWindowBlockManagers.at(windowSize).getBlockById(blockId);
}

[[nodiscard]] WindowBlockManager::BlockMapIterRange getBlocksByHash(size_t hash, SizeType32 windowSize) const
{
return mWindowBlockManagers.at(windowSize).getBlocksByHash(hash);
}

[[nodiscard]] SizeType32 getNumPrimaryBlocks() const
{
return sumWindows([](auto const& manager) { return manager.getNumPrimaryBlocks(); });
@@ -1096,16 +1077,6 @@
return getPool(poolIdx).containsBlockScales;
}

void addBlockToHashMap(BlockPtr const& block, SizeType32 windowSize)
{
mWindowBlockManagers.at(windowSize).addBlockToHashMap(block);
}

void removeBlockFromHashMap(BlockPtr const& block, SizeType32 windowSize)
{
mWindowBlockManagers.at(windowSize).removeBlockFromHashMap(block);
}

//! \brief Store context blocks
void storeContextBlocks(GenerationRequest& sequence, LlmRequest const& llmRequest);

@@ -1385,8 +1356,8 @@ class KVCacheManager : public BaseKVCacheManager
SizeType32 sinkTokenLength, CudaStreamPtr stream, std::optional<SizeType32> maxSequenceLength,
bool enableBlockReuse = false, bool onboardBlocks = true, CacheType cacheType = CacheType::kSELF,
std::optional<executor::RetentionPriority> secondaryOffloadMinPriority = std::nullopt,
std::shared_ptr<KVCacheEventManager> eventManager = nullptr, bool enableHashKey = false,
bool enablePartialReuse = true, bool copyOnpartialReuse = true);
std::shared_ptr<KVCacheEventManager> eventManager = nullptr, bool enablePartialReuse = true,
bool copyOnpartialReuse = true);

KVCacheManager(std::vector<SizeType32> const& numKvHeadsPerLayer, SizeType32 sizePerHead, SizeType32 tokensPerBlock,
BlocksPerWindow const& blocksPerWindow, SizeType32 maxNumSequences, SizeType32 maxBeamWidth,
@@ -1405,8 +1376,8 @@
SizeType32 sinkTokenLength, CudaStreamPtr stream, std::optional<SizeType32> maxSequenceLength,
bool enableBlockReuse = true, bool onboardBlocks = true, CacheType cacheType = CacheType::kSELF,
std::optional<executor::RetentionPriority> secondaryOffloadMinPriority = std::nullopt,
std::shared_ptr<KVCacheEventManager> eventManager = nullptr, bool enableHashKey = false,
bool enablePartialReuse = true, bool copyOnpartialReuse = true);
std::shared_ptr<KVCacheEventManager> eventManager = nullptr, bool enablePartialReuse = true,
bool copyOnpartialReuse = true);

KVCacheManager(SizeType32 numLayers, SizeType32 numKvHeads, SizeType32 sizePerHead, SizeType32 tokensPerBlock,
BlocksPerWindow const& blocksPerWindow, SizeType32 maxNumSequences, SizeType32 maxBeamWidth,
@@ -1692,8 +1663,6 @@
std::unordered_map<LlmRequest::RequestIdType, GenerationRequest> mSequences;
// Whether to cache KV pages for reuse
bool mEnableBlockReuse;
// Whether enable finding blocks by their hash, ignored when reuse enabled
bool mEnableHashKey;
// Mutex to protect access to mSequences
mutable std::mutex mSequencesMtx;
// buffers for static tensors, will be created after allocating pools
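
For readers scanning the removals above: the deleted hash-lookup path (mContextBlocksByHash, getBlocksByHash, addBlockToHashMap, removeBlockFromHashMap, mEnableHashKey) was essentially an unordered multimap keyed by block hash and queried with equal_range. A generic, self-contained sketch of that pattern follows; the Block type and hash values are hypothetical stand-ins, not the real KVCacheBlock API.

```cpp
#include <cstddef>
#include <iostream>
#include <memory>
#include <unordered_map>

// Hypothetical stand-in for a KV cache block; the real KVCacheBlock carries far more state.
struct Block
{
    int id;
};

using BlockPtr = std::shared_ptr<Block>;
using BlockMap = std::unordered_multimap<std::size_t, BlockPtr>;

int main()
{
    BlockMap blocksByHash;

    // Equivalent of the removed addBlockToHashMap: index a block under its content hash.
    blocksByHash.emplace(/*hash=*/0xabc123, std::make_shared<Block>(Block{7}));
    blocksByHash.emplace(/*hash=*/0xabc123, std::make_shared<Block>(Block{9}));

    // Equivalent of the removed getBlocksByHash: all blocks sharing a hash, via equal_range.
    auto [first, last] = blocksByHash.equal_range(0xabc123);
    for (auto it = first; it != last; ++it)
    {
        std::cout << "block id " << it->second->id << '\n';
    }
    return 0;
}
```

Dropping that map is also why the enableHashKey parameters disappear from the WindowBlockManager, BlockManager, and KVCacheManager constructors earlier in this file's diff.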