GPU-accelerated, efficient large language model inference solution
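TensorRT-LLM is NVIDIA's high-performance inference toolkit. It provides an easy-to-use Python API for defining large language models, integrates state-of-the-art optimizations for efficient inference on NVIDIA GPUs, and includes both Python and C++ runtime components for high-performance scheduling and execution.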
v1.3.0rc13
April 29, 2026
Highlights
Model Support
- Support and initial optimizations for Nemotron 3 Nano Omni; known issues with audio-from-video and chunked prefill for video are being actively worked on
- Add audio extraction from video, optimize ViT attention, and reduce initialization memory for Nemotron and Nemotron Nano VL models (#12921, #12911, #13283)
- Add per-model VisualGen example scripts, shared configs, per-model defaults, and metadata updates (#12992, #12862)
- Add GLM-4.7 and GLM-5 tool parser support (#13150)
- Optimize Nemotron-H execution from the Python layer and preserve Nemotron HF mamba cache dtype during bench tuning (#13032, #12826)
- Improve DeepSeek-V3.2 and DeepSeek-V3-Lite support with targeted perf and chunked-prefill fixes on Blackwell and SM100-class GPUs (#13142, #13257)
API
- Fix the chunked prefill API contract for Nemotron Nano VL (#13025)
- Add abort and resume support for Async RL in verl (#12272)
- Add a modular logger with automatic module detection and per-module filtering (#13202)
- Improve prompt handling by accounting for existing multimodal placeholder tokens in text prompts (#12827)
- Propagate real server-side failures to disaggregated serving clients and improve empty-file handling in trtllm-bench (#13119, #12552)
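The modular logger entry (#13202) mentions per-module filtering. As a rough illustration of that idea only, the sketch below uses Python's standard `logging` module, where logger names follow the module hierarchy and each name can carry its own level. The logger names (`demo.scheduler`, `demo.kv_cache`) and the helper `configure_levels` are hypothetical; this is not TensorRT-LLM's logger API.

```python
import io
import logging


def configure_levels(levels: dict[str, int]) -> None:
    """Set a level per logger name; names mirror module paths (hypothetical)."""
    for name, level in levels.items():
        logging.getLogger(name).setLevel(level)


# Capture output in memory so the example is self-contained.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(name)s:%(levelname)s:%(message)s"))

root = logging.getLogger("demo")
root.addHandler(handler)
root.setLevel(logging.WARNING)  # default level for every "demo.*" module
root.propagate = False          # keep the demo isolated from the real root logger

# Turn on verbose output for one module only.
configure_levels({"demo.kv_cache": logging.DEBUG})

logging.getLogger("demo.scheduler").debug("hidden: scheduler inherits WARNING")
logging.getLogger("demo.kv_cache").debug("shown: kv_cache overrides to DEBUG")
logging.getLogger("demo.scheduler").warning("shown: warnings pass everywhere")

captured = stream.getvalue().splitlines()
```

Only the `demo.kv_cache` debug line and the `demo.scheduler` warning end up in `captured`, because child loggers without an explicit level inherit the level of their nearest configured ancestor.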
Feature
- Add VisualGen Cache-DiT and a unified cache accelerator (#12548)
- Expand kernel support with broader RMSNorm coverage, optimized causal-conv1d prefill and decode, FP4 residual quantization, and refreshed SageAttention kernels (#13033, #13103, #13117, #12937)
- Add batched addSequence with two-phase claim and unified VSWA and non-reuse support (#13029)
- Add sparse MQA and GQA attention support and introduce new sharding infrastructure (#12470, #12419)
- Improve serving performance with async media loading, faster video frame decoding, cached text computation reuse, lower custom-op overhead, padding-aware CUDA graph tuning, and reduced single-rank broadcast overhead (#13034, #12677, #13149, #12895, #13412, #13259, #11640)
- Optimize runtime internals with Minimax RMSNorm tuning, consolidated prefix-reuse analysis, gen-only sync transfer v2, DWDP contention config cleanup, and round-robin CP cache transmission (#12163, #13095, #12882, #12974, #13180)
- Restore EAGLE3 dynamic-tree speculative decoding support and centralize perfect-router integration and validation (#13081, #13250)
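One of the runtime items above switches CP cache transmission from contiguous to round-robin (#13180). The release notes do not describe the actual layout, so the following is only a generic sketch of how the two assignment schemes differ when a sequence of blocks is split across ranks. The function names and the block and rank counts are illustrative assumptions, not TensorRT-LLM code.

```python
def contiguous_assignment(num_blocks: int, num_ranks: int) -> list[list[int]]:
    """Give each rank one consecutive run of block indices."""
    base, extra = divmod(num_blocks, num_ranks)
    out, start = [], 0
    for rank in range(num_ranks):
        # The first `extra` ranks take one additional block.
        size = base + (1 if rank < extra else 0)
        out.append(list(range(start, start + size)))
        start += size
    return out


def round_robin_assignment(num_blocks: int, num_ranks: int) -> list[list[int]]:
    """Deal block indices to ranks one at a time, like dealing cards."""
    return [list(range(rank, num_blocks, num_ranks)) for rank in range(num_ranks)]


contiguous = contiguous_assignment(10, 4)
# [[0, 1, 2], [3, 4, 5], [6, 7], [8, 9]]
round_robin = round_robin_assignment(10, 4)
# [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]
```

With round-robin dealing, every rank holds blocks from the start, middle, and end of the sequence, whereas the contiguous scheme gives each rank a single slice.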
Fix
- Fix KV cache and scheduler correctness issues, including SWA compatibility, token accounting with context chunking, over-allocation in VSWA plus EAGLE flows, KVCacheManagerV2 bugs, and multimodal and disaggregated cache reuse problems (#12968, #12976, #12855, #12306, #13104, #12472)
- Fix runtime stability issues by preventing benchmark fill-loop hangs, tightening warmup reservation behavior, and making host-memory-based prefetch decisions consistent across ranks (#13065, #13078, #13161)
- Fix EAGLE3 LoRA speculative decoding and preserve speculative layer weights to avoid MTP plus PP hangs (#13005, #12555)
- Fix FMHA and attention runtime issues, including SM90 full-mask skip-softmax dispatch, misleading generation warnings, stale CUDA graphs on beam-width changes, and FlashInfer KV layout handling (#13120, #13157, #13255, #13190)
- Fix vision and multimodal correctness issues, including KV-cache quantization leaks into the vision encoder, FLUX high-resolution scheduler off-by-one behavior, and Super V3 multi-stream MoE instability (#13181, #13091, #13122)
- Fix packaging and environment issues by restoring the missing aarch64 library, enforcing NCCL >= 2.28 at configure time, and using weights_only=True in LoRA manager loads (#13206, #13108, #13391)
- Fix operational reliability issues in CI and perf pipelines, including OpenSearch upload failures, hanging AIPerf metrics, SLURM host name propagation, and SLURM submission retry behavior (#13215, #13314, #13367, #12778)
- Fix additional model and runtime issues for Qwen3 mrope cache handling, DSA illegal memory access with CUDA graph plus host KV offload, stale tokenizer alias imports, and WAN example timing conflicts (#13269, #13124, #13086, #13193, #12128)
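The runtime stability entry above makes host-memory-based prefetch decisions consistent across ranks (#13161); the corresponding PR title describes this as min-reducing available host memory. The sketch below is a single-process simulation of why that helps. The memory readings and the threshold are made up, and `min()` stands in for what would be an all-reduce with a MIN operator in a real multi-rank job.

```python
def local_decisions(free_bytes_per_rank: list[int], required_bytes: int) -> list[bool]:
    """Each rank decides from its own reading, so ranks can disagree."""
    return [free >= required_bytes for free in free_bytes_per_rank]


def min_reduced_decisions(free_bytes_per_rank: list[int], required_bytes: int) -> list[bool]:
    """Every rank decides from the global minimum, so all ranks agree.

    min() is a stand-in for an all-reduce with a MIN operator.
    """
    global_min = min(free_bytes_per_rank)
    return [global_min >= required_bytes for _ in free_bytes_per_rank]


GIB = 1024 ** 3
free = [64 * GIB, 48 * GIB, 30 * GIB, 70 * GIB]  # made-up per-rank readings
need = 32 * GIB                                   # made-up prefetch requirement

before = local_decisions(free, need)        # [True, True, False, True]
after = min_reduced_decisions(free, need)   # [False, False, False, False]
```

When ranks take different code paths (some prefetching, some not), collective operations can fall out of step; deciding from a shared minimum removes that divergence at the cost of being conservative.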
Documentation
Test & Infra
- Add Dynamo API compatibility tests, VisualGen regression coverage, and refactor MoE communication tests (#12970, #13372, #12841)
- Expand CI coverage for disaggregated serving and weekly performance suites, including K2.5 EPLB coverage, refreshed Nemotron datasets, and additional weekly perf models (#13185, #12982, #13325)
- Improve CI signal quality by splitting multimodal DGX_B200 jobs, removing obsolete or low-priority cases, dropping non-key-model L0 coverage, and moving bf16 and auto precision variants to post-merge (#12978, #13262, #13374, #13315, #13366)
- Improve CI tooling with PR-aware failure analysis, SwiftStack upload support, wildcard bot stage commands, a sync_qa_tests Jenkins script, doc tests, and markdown-only doc-build rules (#12849, #13291, #12881, #13028, #13152, #13358, #13441)
- Refresh repository ownership and security plumbing with CODEOWNERS updates, HMAC key enforcement, and container vulnerability fixes (#13110, #13213, #9850, #13447)
What's Changed
- [https://nvbugs/5997092][fix] Remove waives for DS-V3.2/R1 FP4 Blackwell perf tests by @peihu-nv in #13042
- [None][infra] Waive 2 failed cases for main in post-merge by @xinhe-nv in #13105
- [TRTLLM-9132][infra] Update to ignore failure for release check and building images by @EmmaQiaoCh in #9871
- [https://nvbugs/5626259][fix] Enable nemotron-h chunk prefill test by @Wanli-Jiang in #12980
- [None][feat] Add the invocation path for mamba2 mtp custom op by @JadoTu in #12787
- [None][infra] Waive 4 failed cases for main in post-merge 2654 by @ZhanruiSunCh in #13113
- [None][infra] Waive 3 failed cases for main in post-merge 2658 by @ZhanruiSunCh in #13141
- [None][chore] Add CODEOWNERS mappings for @NVIDIA/trt-llm-multimodal-devs by @venkywonka in #13110
- [None][chore] Add disaggregated tests that timeout to waives.txt by @2ez4bz in #13136
- [https://nvbugs/5844149][fix] Fix issues with DSV3.2 perf tests by @chenfeiz0326 in #13142
- [None][fix] Fix a capacity issue in KVCacheManagerV2 for SWA compatibility by @heyuhhh in #12968
- [https://nvbugs/6044213][chore] unwaive and reduce free mem ratio in AutoDeploy's perf test: deepseek_r1_distill_qwen_32b by @MrGeva in #12965
- [None][fix] Fix chunked prefill API contract for nemotron nano VL by @2ez4bz in #13025
- [TRTLLM-11794][feat] Optimize ViT Attention kernel on Nemotron by @yechank-nvidia in #12911
- [TRTLLMINF-38][feat] Pass PR number to CI failure analysis agent by @dpitman-nvda in #12849
- [https://nvbugs/6074784][chore] Temp waive dis-agg transformers failed tests by @Shixiaowei02 in #13145
- [None][fix] Fix scheduler off-by-one in FLUX pipelines at high resolutions by @karljang in #13091
- [None][infra] Add 5 users to blossom-ci allowlist by @yuanjingx87 in #13146
- [TRTLLM-11403][feat] VisualGen Cache-DiT + unified cache accelerator by @o-stoner in #12548
- [None][fix] Enable LoRA in EAGLE3 speculative decoding by @Funatiq in #13005
- [TRTLLM-11903][test] Add API compatibility tests for dynamo by @brb-nv in #12970
- [None][feat] Update rms_norm + fp4_quant kernel supporting more dim by @Wanli-Jiang in #13033
- [None][chore] Bump version to 1.3.0rc13 by @VALLIS-NERIA in #13159
- [None][fix] Fix compute token accounting for KV cache reuse with context chunking by @lancelly in #12976
- [None][feat] Batch addSequence with two-phase claim and unified VSWA/non-reuse support by @liji-nv in #13029
- [None][bug] fix SM90 full-mask skip-softmax dispatch by @bobboli in #13120
- [None][test] Refactor MoE comm tests: unified dispatch+combine pipeline by @xxi-nv in #12841
- [https://nvbugs/5983320][fix] Use encoder_max_batch_size of 1 for LLaVa in test_multi_request_batch_chat by @moraxu in #12647
- [TRTLLM-11771][feat] Add audio extraction from video for Nemotron Nano VL by @2ez4bz in #12921
- [None][fix] Update stale TOKENIZER_ALIASES import path in serve and bench modules by @cascade812 in #13086
- [TRTLLM-11695][feat] Add per-model VisualGen example scripts, shared configs, and per-model defaults by @zhenhuaw-me in #12992
- [https://nvbugs/6060119][chore] Unwaive DSR1 FP4 128k8k disagg perf tests by @peihu-nv in #13088
- [None][feat] Support sparse mqa/gqa attention by @heyuhhh in #12470
- [None][fix] Support custom_tokenizer in KvCacheAwareRouter for disagg serving by @lishicheng1996-nv in #12990
- [https://nvbugs/6013562][fix] fix kv cache allocation is double the budget for vswa + eagle by @dongfengy in #12855
- [None][test] Waive 1 failed cases for main in QA CI by @xinhe-nv in #13147
- [https://nvbugs/6013562][fix] Unwaive tests since the fix has been merged by @dongfengy in #13183
- [None][chore] Add Dynamo configs to TRTLLM CI - Disagg - Part 1 by @brb-nv in #13167
- [None][feat] Minimax RMS norm optimization by @jmydurant in #12163
- [TRTLLM-11878][feat] Gen-only sync transfer v2 and manager v2 by @Shixiaowei02 in #12882
- [None][test] Remove triton_server test_opt by @Tabrizian in #13173
- [None][infra] Waive 3 failed cases for main in post-merge by @xinhe-nv in #13194
- [None][feat] Optimize nemotron-h from python level by @Wanli-Jiang in #13032
- [https://nvbugs/6026676][fix] Only waive the tests for H20 so that H100 still covered by @dongfengy in #12961
- [TRTLLM-11272][fix] Account for the existing multimodal placeholder tokens in a text prompt by @moraxu in #12827
- [None][infra] Waive 20 failed cases for main in post-merge by @xinhe-nv in #13203
- [None][fix] Fix GPQA Diamond filter_type mismatch in disagg accuracy … by @yingguo-trt in #13210
- [None][infra] Waive 9 failed cases for main in post-merge by @xinhe-nv in #13204
- [None][infra] Waive 6 failed cases for main in post-merge by @xinhe-nv in #13195
- [None][test] Add doc test by @StanleySun639 in #13152
- [https://nvbugs/6071070][fix] Add K2.5 DISAGG Gen Only EPLB Cases into CI by @chenfeiz0326 in #13185
- [None][infra] Waive 2 failed cases for main in post-merge 2663 by @ZhanruiSunCh in #13216
- [TRTLLM-12291][feat] New sharding infrastructure by @greg-kwasniewski1 in #12419
- [None] [chore] Update .github/CODEOWNERS by @kaiyux in #13213
- [None][test] Fix DGX_B200 CI timeout by splitting multimodal tests an… by @nv-guomingz in #12978
- [None][infra] Waive 2 failed cases for main in pre-merge 34569 by @ZhanruiSunCh in #13192
- [None][fix] Test time conflict in WAN T2V example by @2ez4bz in #13193
- [None][infra] Reenable GB300-4_GPUs-PyTorch-Post-Merge-1 by @mlefeb01 in #13097
- [TRTLLM-11872][perf] Multi-threading async media loading and optimizing video frame decoding in trtllm-serve by @yechank-nvidia in #13034
- [None][fix] Do not leak KV cache quantization into vision encoder by @2ez4bz in #13181
- [https://nvbugs/5783876][chore] Enforce HMAC key requirement in the codebase by @yibinl-nvidia in #9850
- [https://nvbugs/5981122][fix] Unwaive DeepSeekV3Lite python_scheduler test by @lancelly in #12972
- [None][infra] Waive 3 failed cases for main in post-merge by @xinhe-nv in #13200
- [None][infra] Waive 1 failed cases for main in pre-merge 34820 by @ZhanruiSunCh in #13252
- [None][infra] Waive 8 failed cases for main in post-merge by @xinhe-nv in #13201
- [None][test] waive hang issues by @xinhe-nv in #13212
- [https://nvbugs/6086538][fix] suppress misleading skip-softmax FMHA warning in generation by @bobboli in #13157
- [None][fix] Add missing aarch64 lib in cf9963f by @pengbowang-nv in #13206
- [None][perf] Consolidate prefix reuse queries into single analyzePrefixReuse radix tree walk by @SimengLiu-nv in #13095
- [None][test] Add sync_qa_tests Jenkins script and update coderabbit review by @xinhe-nv in #13028
- [None][feat] Refactor the routing part in trtllmgen by @ChristinaZ in #12246
- [None][infra] Waive 1 failed cases for main in post-merge 2671 by @ZhanruiSunCh in #13261
- [None][feat] Switch CP cache transmission from contiguous to round-robin by @brb-nv in #13180
- [None][feat] AutoDeploy: Onboard MiniMaxAI/MiniMax-M2.7 custom model by @suyoggupta in #12963
- [https://nvbugs/6088149][chore] Unwaive perf sanity tests for bug 6088149 by @chenfeiz0326 in #13176
- [https://nvbugs/5955765][fix] More accurate launch parameters to avoid over-reservation in warmup by @YihuiLu512 in #13078
- [https://nvbugs/5819019][fix] Remove waivers by @YihuiLu512 in #13118
- [https://nvbugs/6018043][fix] Unwaive testcase by @YihuiLu512 in #13111
- [None][infra] Waive 1 failed cases for main in pre-merge 34865 by @ZhanruiSunCh in #13258
- [TRTLLM-11999][feat] Add GLM-4.7/GLM-5 tool parser by @JunyiXu-nv in #13150
- [None][fix] Unwaive DeepSeekV3Lite test_bfloat16_4gpus_python_scheduler ep4 by @lancelly in #13084
- [None][test] amend for qa weekly core test list by @ruodil in #13153
- [None][test] Update Nemotron-3-Super-120B-A12B-NVFP4 MTP perf case with the real dataset on DGX-Spark by @JennyLiu-nv in #12982
- [None][fix] Cap TLLM_BENCHMARK_REQ_QUEUES_SIZE to avoid fill-loop hang by @reasonsolo in #13065
- [https://nvbugs/6074014][fix] Min-reduce available host memory to ensure that all ranks agree about whether prefetch is enabled by @dhansen-nvidia in #13161
- [TRTLLM-11339][fix] Wan tests refactor + small transformer fix by @o-stoner in #12128
- [None][fix] Fix kv_layout for FLASHINFER backend by @yechank-nvidia in #13190
- [None][chore] Update CI allowlist 2026-04-21 by @tburt-nv in #13289
- [TRTLLM-10703][feat] abort, resume for Async RL in verl by @hchings in #12272
- [TRTLLM-12127][fix] VisualGen metadata updates by @o-stoner in #12862
- [None][fix] Revert "Refactor the routing part in trtllmgen" (#12246) by @peihu-nv in #13294
- [TRTLLM-11759][fix] Reduce peak host memory during NemotronH_Nano_VL_V2 init by @pamelap-nvidia in #13283
- [None][fix] Fix post-merge perf data silently failing to upload to OpenSearch DB by @chenfeiz0326 in #13215
- [None][fix] Fix errors in KV cache manager V2 and scheduler V2 by @jiaganc in #13104
- [None][fix] Enforce NCCL >= 2.28 at CMake configure time by @eopXD in #13108
- [TRTLLM-11861][infra] Support wildcard in bot stage-list/extra-stage commands by @mzweilz in #12881
- [TRTLLM-12062][test] remove obsolete model tests by @xinhe-nv in #13262
- [TRTLLM-12137][chore] Drop non-key-model (starcoder2/mllama/nemotron) cases from L0 by @QiJune in #13315
- [None][feat] Optimize causal_conv1d prefill and decode kernels by @Wanli-Jiang in #13103
- [TRTLLM-11733][perf] Cache constant text computations across denoise steps in LTX2 by @luyiyun1021 in #12677
- [None][chore] Add Dynamo configs to TRTLLM CI - Disagg - Part 2 by @brb-nv in #13168
- [https://nvbugs/6050481][chore] Unwaive passing GPT-OSS ep tests by @dongfengy in #13284
- [None][chore] Waive DSV32 tests by @brb-nv in #13352
- [https://nvbugs/6052050][fix] Drop stale CUDA graphs on beam-width change by @brb-nv in #13255
- [https://nvbugs/6055847][fix] Preserve Nemotron HF mamba cache dtype in bench tuning by @hyukn in #12826
- [None][chore] Better Empty File Error Handling for trtllm-bench by @yijingl-nvidia in #12552
- [None][test] add models for weekly perf test by @ruodil in #13325
- [None][fix] Propagate init_load_balancer to DeepGemmFusedMoE in create_moe_backend by @qiaoxj07 in #13207
- [TRTLLM-12183][chore] Move bf16/auto precision variants from pre-merge to post-merge by @QiJune in #13366
- [https://nvbugs/6076560][chore] Unwaive test_nvfp4_4gpus by @hyukn in #13079
- [None][infra] Waive 2 failed cases for main in post-merge by @xinhe-nv in #13335
- [TRTLLM-11485][feat] Feature rework: Add SageAttention refreshed kernels (attentionOp only) by @xrq-phys in #12937
- [TRTLLM-11540][feat] Revert revert of EAGLE3 dynamic tree speculative decoding support by @sunnyqgg in #13081
- [None][chore] Add related trtllm-gen attention kernel files to trigger multi-gpu tests by @heyuhhh in #13260
- [https://nvbugs/6074943][fix] Disable new aiperf server metrics to stop hang. by @dominicshanshan in #13314
- [None][doc] Restructure installation documentation by @bobboli in #12402
- [None][fix] Disable multi stream moe for super v3 by @tcherckez-nvidia in #13122
- [https://nvbugs/5919796][fix] AutoDeploy: Fix TP deadlock in multistream MoE by @galagam in #13220
- [None][fix] initialize sampler state for ADP dummy requests by @bobboli in #13275
- [#13125][feat] Make auto_deploy standalone-ready and add package generator by @lucaslie in #13155
- [None][perf] Clear multimodal data upon prefill completion by @2ez4bz in #13259
- [TRTLLMINF-45][infra] Upload CI agent failure analysis to SwiftStack by @dpitman-nvda in #13291
- [https://nvbugs/6078421][fix] Commit dcb4a71 intentionally disabled `initialize_mrope_delta_cache` in qwen3 by @tensorrt-cicd in #13269
- [None][feat] Move DWDP contention optimization into DwdpConfig by @JintaoPengCS in #12974
- [None][doc] update verbose comments by @VALLIS-NERIA in #13387
- [None][chore] Remove closed bugs by @xinhe-nv in #13189
- [TRTLLM-9120][feat] centralize perfect router integration and validation by @xxi-nv in #13250
- [https://nvbugs/5916092][fix] Fix MTP+PP hang by preserving speculative layer weights on last PP rank by @xxi-nv in #12555
- [None][feat] Add modular logger with auto module detection and per-module filtering by @reasonsolo in #13202
- [#4674][feat] enabled AutoDeploy qkv and rope fusion with trtllm attention by @MrGeva in #12357
- [None][fix] Fix multimodal KV cache block reuse for disaggregated serving by @indrajit96 in #12472
- [None][fix] KVCacheManagerV2 bug fixes (V2 remains default OFF) by @yizhang-nv in #12306
- [None][test] Remove low priority QA perf test cases by @yufeiwu-nv in #13374
- [None][perf] Use +64 batch sizes for padding-enabled CUDA graphs by @yijingl-nvidia in #12895
- [https://nvbugs/5784566][fix] Isolate ray tests to avoid timeout by @shuyixiong in #13062
- [None][test] AutoDeploy: Add missing guided decoding test to CI by @govind-ramnarayan in #13350
- [https://nvbugs/5997534][fix] Fix eagle3 accuracy test - attn backend must be flashinfer by @govind-ramnarayan in #13398
- [None][infra] Waive 15 failed cases for main in post-merge by @xinhe-nv in #13363
- [TRTLLM-11958][perf] reduce @torch.library.custom_op host overhead by @luyiyun1021 in #13149
- [https://nvbugs/6084445][fix] use DEEPGEMM for DeepSeek-V3-Lite fp8 chunked prefill on SM100/SM103 by @jmydurant in #13257
- [https://nvbugs/6094118][fix] remove redundant tests by @bo-nv in #13411
- [TRTLLM-11123][fix] Propagate real errors to disagg server by @reasonsolo in #13119
- [None][infra] Waive 4 failed cases for main in post-merge 2681 by @ZhanruiSunCh in #13414
- [https://nvbugs/6098095][chore] Waive failed tests for flashinfer-python==0.6.8 upgrading by @yihwang-nv in #13254
- [https://nvbugs/6064029][test] Visual gen b64 path regression tests by @yingguo-trt in #13372
- [None][fix] Populate s_host_node_name for SLURM-based perf test runs by @hyukn in #13367
- [None][feat] Add FP4 residual quantization kernel without channel reo… by @Tracin in #13117
- [https://nvbugs/6093712][fix] skip_pre_hopper for Qwen3 disagg L40S failure by @reasonsolo in #13326
- [https://nvbugs/6018172][fix] Fix DSA illegal memory access with CUDA graph and host KV cache offload by @liji-nv in #13124
- [None][infra] Waive 5 failed cases for main in pre-merge 35493 by @ZhanruiSunCh in #13432
- [TRTLLMINF-43][feat] Update SLURM job submission logic to retry up to… by @dpitman-nvda in #12778
- [https://nvbugs/6098442][fix] WAR IMA on DS V3.2 and update trtllm-gen cubin, lib and src by @pengbowang-nv in #13379
- [https://nvbugs/6094112][fix] Use TRTLLM MoE backend on Blackwell for case TestQwen3_30B_A3B::test_dummy_load_format by @xxi-nv in #13327
- [None][chore] Remove non-exist waiver by @VALLIS-NERIA in #13437
- [None][infra] Update the doc build stage rule by treating *.md-only P… by @nv-guomingz in #13358
- [None][perf] Skip request broadcast when world_size is 1 by @yechank-nvidia in #13412
- [https://nvbugs/6071380][fix] Update the invalid dynamo urls in doc. by @nv-guomingz in #13038
- [https://nvbugs/6025330][fix] Use weights_only=True in LoRA manager torch.load by @yibinl-nvidia in #13391
- [None][perf] Remove unnecessary ToPIL() from find_mm_token_lengths by @yechank-nvidia in #11640
- [TRTLLMINF-45][infra] Pin pbss.s8k.io in /etc/hosts before SwiftStack… by @dpitman-nvda in #13441
- [https://nvbugs/6007285][fix] Unwaive test_configurable_moe_multi_gpu DEEPEP-NVFP4 case by @xxi-nv in #13371
- [None][fix] Split TRT-LLM-only rope fusion out of standalone auto_deploy by @lucaslie in #13454
- [None][infra] Container vulnerability fix by @yuanjingx87 in #13447
- [https://nvbugs/6097980][fix] unwaive Wan T2V example by @zhenhuaw-me in #13316
- [None][infra] Waive 4 failed cases for main in pre-merge 35639 by @ZhanruiSunCh in #13461
- [https://nvbugs/5973199][fix] unwaive TestNemotronSuperV3::test_accuracy[nvfp4-4-attn_dp_on-trtllm] by @tcherckez-nvidia in #13188
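Several entries in this release touch padding-enabled CUDA graphs (for example #12895, "Use +64 batch sizes for padding-enabled CUDA graphs"). For readers unfamiliar with the general technique, here is a minimal sketch: a runtime captures graphs for a fixed list of batch sizes and pads an incoming batch up to the nearest captured size. The capture list below (powers of two, then multiples of 64) and both function names are illustrative assumptions, not the list TensorRT-LLM actually uses.

```python
import bisect
from typing import Optional


def capture_sizes(max_batch: int, step: int = 64) -> list[int]:
    """Hypothetical capture list: powers of two below `step`, then multiples of `step`."""
    sizes = []
    size = 1
    while size < step and size <= max_batch:
        sizes.append(size)
        size *= 2
    sizes.extend(range(step, max_batch + 1, step))
    return sizes


def padded_batch_size(batch: int, sizes: list[int]) -> Optional[int]:
    """Return the smallest captured size >= batch, or None if no graph fits."""
    i = bisect.bisect_left(sizes, batch)
    return sizes[i] if i < len(sizes) else None


sizes = capture_sizes(256)           # [1, 2, 4, 8, 16, 32, 64, 128, 192, 256]
small = padded_batch_size(3, sizes)  # padded up to 4
large = padded_batch_size(100, sizes)  # padded up to 128
too_big = padded_batch_size(300, sizes)  # None: fall back to eager execution
```

A denser capture list wastes less compute on padding but costs more capture time and memory, which is the trade-off such tuning adjusts.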
New Contributors
- @lishicheng1996-nv made their first contribution in #12990
- @YihuiLu512 made their first contribution in #13078
- @tensorrt-cicd made their first contribution in #13269
Full Changelog: v1.3.0rc12...v1.3.0rc13
v1.3.0rc12
April 17, 2026
Highlights
Model Support
- Add LTX-2 two-stage pipeline support (#12361)
- Add CUDA graph support for LTX-2 with `torch.compile` compatibility (#12653)
- Add video temporal compression for Nemotron Nano and RADIO (#12649)
- Extend the Python cache transceiver to support Qwen-Next (#12772)
- Add CuteDSL MoE backend support for Qwen3.5 (#12799)
- Fix LoRA support for Qwen3 models (#12785)
- Support loading FP8 LoRA weight files (#12848)
- Add support for speculative decoding with LoRA (#12661)
- Fix OOM with large numbers of LoRA adapters (#12815)
- Partially fix LoRA overallocation for Nemotron NAS (#12817)
- Skip `inference_mode()` when `torch.compile=True` for Gemma3 FP8 (#12367)
- Skip NVFP4 fused norm when the dimension does not meet requirements (#12901)
- Update MoE `hidden_size` in the communicator for Nemotron-H (#12890)
- Unify image-as-tensor handling to avoid repeated conversions for nano models (#12994)
API
- Refine the VisualGen API structure (#12807)
- Convert `VisualGenParams` to Pydantic with request validation, per-model defaults, and `extra_params` support (#12922)
- Align `AttentionPlugin` with the EdgeLLM interface (#12233)
- Add shorthand `KVConnector` paths for `lmcache` and `kvbm` (#12626)
- Add the missing `allow_partial_loading` parameter to CuteDSL and ConfigurableMoE `load_weights` (#12761)
- Improve KV cache statistics monitoring (#12413)
Feature
- Add NvTelemetry/GXT-compliant usage telemetry (#12384)
- Add production-level Prometheus metrics for iteration stats, config info, token counters, and phase histograms (#12545)
- Add conversation-affinity routing for disaggregated serving (#12526)
- Enable block reuse with the overlap scheduler (#12816)
- Unify VisualGen parallelism (#12509)
- Consolidate piecewise CUDA graph VLM updates (#12852)
- Add tunable NVFP4 quantization with an additional FlashInfer backend (#12126)
- Optimize GDN prefill with indexed in-kernel state updates (#12791)
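The conversation-affinity routing entry (#12526) is about sending turns of the same conversation to the same worker. The release notes give no details of the policy, so the sketch below shows only one simple way to get that property: a stable hash of a conversation ID. The function `pick_worker`, the ID format, and the worker names are hypothetical.

```python
import hashlib


def pick_worker(conversation_id: str, workers: list[str]) -> str:
    """Map a conversation to a worker with a stable hash (hypothetical policy)."""
    # sha256 is used instead of hash() because hash() is salted per process.
    digest = hashlib.sha256(conversation_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(workers)
    return workers[index]


workers = ["ctx-0", "ctx-1", "ctx-2"]
first_turn = pick_worker("conv-42", workers)
second_turn = pick_worker("conv-42", workers)  # same worker as the first turn
```

Keeping a conversation on one worker lets later turns reuse cached state from earlier ones. Plain modulo hashing remaps most conversations when the worker list changes, so a production router would likely need something more robust.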
Fix
- Propagate `disaggregated_params` through `PostprocWorker` (#12513)
- Prebuild disaggregated context responses to avoid `ctx_request_id` races (#12466)
- Generate HMAC keys for MGMN IPC servers in disaggregated serving (#12670)
- Enable HMAC authentication in VisualGen ZMQ IPC channels (#12680)
- Fix disaggregated gen-only hangs caused by blocking KV transfers (#12640)
- Replace busy-poll sleep in `get_async_noblock` with the ZMQ async poller (#12189)
- Make `trust_remote_code` opt-in in `MultimodalModelRunner` (#12669)
- Fix VLM guided decoding startup crashes caused by missing `vocab_size_padded` (#12284)
- Eliminate double PNG encoding in visual generation serving (#12903)
- Treat whitespace-only content correctly in nano-v3 reasoning swap (#12912)
- Clamp `usedNumBlocks` to non-negative values in KV cache statistics (#11922)
- Fix `moe_chunking_tokens` handling during MoE A2A (#12929)
- Guard CUDA event `elapsed_time` in `perf_metrics_manager` to prevent executor crashes (#12868)
- Remove leftover `onboardBlocks` parameters in `kvCacheManagerTest` (#13107)
- Add CUDA device setup before `load_remote_agent` (#12619)
- Fix Mooncake transfer agent binding (#12723)
- Fix `multi_stream_moe` accuracy with MLIR and piecewise CUDA graphs (#12847)
- Fix Nano chunked prefill (#12782)
- Fix constrained decoding for GLM5 (#12869)
- Fix benchmark disaggregated deadlocks by removing a blocking fill loop (#12208)
- Update CUTLASS C++ to 4.4.2 (#12897)
- Pin Ray to 2.54.1 (#13071)
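Two of the fixes above add HMAC keys or HMAC authentication to IPC channels (#12670, #12680). As general background on what HMAC-authenticated messaging involves, and not a description of TensorRT-LLM's wire format, here is a minimal sketch using Python's standard `hmac` module. The framing (a 32-byte tag followed by the payload) and the function names are assumptions made for illustration.

```python
import hashlib
import hmac
import secrets

TAG_LEN = 32  # HMAC-SHA256 produces a 32-byte tag


def sign(key: bytes, payload: bytes) -> bytes:
    """Prefix the payload with its HMAC-SHA256 tag."""
    tag = hmac.new(key, payload, hashlib.sha256).digest()
    return tag + payload


def verify(key: bytes, message: bytes) -> bytes:
    """Return the payload if the tag matches, otherwise raise ValueError."""
    tag, payload = message[:TAG_LEN], message[TAG_LEN:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    # compare_digest avoids leaking information through comparison timing.
    if not hmac.compare_digest(tag, expected):
        raise ValueError("HMAC verification failed")
    return payload


key = secrets.token_bytes(32)  # random key, shared only with trusted peers
message = sign(key, b"request-payload")
payload = verify(key, message)
```

A receiver that verifies the tag before deserializing a message rejects anything sent by a process that does not hold the key, which matters when the payload format (for example pickle) is unsafe to parse from untrusted input.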
Documentation
Benchmark
- Optimize the Qwen3.5 decode delta kernel (#12740)
- Reduce host overhead in DSA MLA attention (#12631)
- Add a host performance regression test suite for PyExecutor (#12148)
- Add benchmark coverage for allreduce backends (#12887)
- Restore DSR1/DSV32/K2 disaggregated performance tests (#12688)
- Support NV SA benchmarks in CI performance testing (#13004)
- Add K2.5 performance tests into CI (#12931)
Test & Infra
- Update Perf Sanity System code paths (#12430)
- Bump etcd to 3.6.9 to pick up the gRPC fix (#12594)
- Fix the PLC nightly pipeline and expose more pipeline data (#12940)
- Exclude QA nodes when running TRTLLM CI (#13102)
- Add a unit test for lifecycle race condition errors in disaggregated serving (#12803)
- Add an end-to-end test for PP + disagg + block reuse + chunked prefill hangs (#12913)
- Add Nemotron-3-Super-120B-A12B-NVFP4 functional and performance cases on DGX Spark (#12830)
- Remove obsolete RTX-6000 OOM tests (#12800)
- Remove unused tests (#12625)
- Check unused fixtures (#12730)
- Fix Qwen3 skip-softmax attention CI tests (#12789)
- Fix failing KV cache transceiver tests from the perf sanity changes (#12554)
- Fix Wan unit tests (#13026)
- Remove obsolete waivers (#12979)
- Move the `PY312-UB2404` sanity check test to A100X nodes (#13077)
- Pin Ray to 2.54.1 in the Slurm CI stage (#13085)
What's Changed
- [None][test] Unwaive Nemotron H flaky case by @nv-guomingz in #11236
- [https://nvbugs/5997543][fix] unwaive test_disaggregated_overlap_transceiver_runtime_python by @chuangz0 in #12580
- [TRTLLM-11574][feat] Some updates on Perf Sanity System codes by @chenfeiz0326 in #12430
- [None][doc] add attention developer guide by @QiJune in #12693
- [https://nvbugs/5991957][fix] Propagate disaggregated_params through PostprocWorker by @peihu-nv in #12513
- [https://nvbugs/5883590][fix] Generate HMAC key for MGMN IPC server in disaggregated serving by @yibinl-nvidia in #12670
- [https://nvbugs/5941242][fix] Fix SigLIP test failure by @tijyojwad in #12717
- [None][feat] Optimize qwen3.5 decode delta kernel by @nv-guomingz in #12740
- [https://nvbugs/5961736][fix] Prebuild disagg ctx response to avoid ctx_request_id race by @peihu-nv in #12466
- [https://nvbugs/5922880][fix] Enable HMAC authentication in VisualGen ZMQ IPC channels by @yibinl-nvidia in #12680
- [None][fix] Add missing allow_partial_loading param to CuteDSL and ConfigurableMoE load_weights by @qiaoxj07 in #12761
- [None][chore] Waive hanging Nemotron Super test by @brb-nv in #12821
- [None][fix] add cuda set device before load_remote_agent by @chuangz0 in #12619
- [None][chore] Remove closed bugs by @xinhe-nv in #12766
- [None][test] Remove RTX-6000 OOM test cases by @yufeiwu-nv in #12800
- [None][fix] Fix LoRA support for Qwen3 models by @achartier in #12785
- [TRTLLM-11343][feat] LTX-2 Two Stage pipeline support by @yibinl-nvidia in #12361
- [#12808][feat] AutoDeploy: Add Gemma4 Support by @bmarimuthu-nv in #12710
- [None][feat] Add Claude Code agents and skills for kernel dev, perf analysis, and compilation by @kaiyux in #12831
- [#11879][fix] Clamp usedNumBlocks to non-negative in KV cache stats by @wojciech-wais in #11922
- [https://nvbugs/6029864][fix] Fix flaky ray test failure by @brb-nv in #12697
- [https://nvbugs/5813192][fix] Make trust_remote_code opt-in in MultimodalModelRunner by @yibinl-nvidia in #12669
- [None][infra] Bump etcd to 3.6.9 to involve grpc fix by @yuanjingx87 in #12594
- [https://nvbugs/5658258][fix] Fix OOM with large number of LoRA adapters by @brb-nv in #12815
- [None][feat] AutoDeploy: Add the Triton kernel for MLA by @nvchenghaoz in #12664
- [None][fix] replace busy-poll sleep in get_async_noblock with zmq async poller by @edenfunf in #12189
- [https://nvbugs/6018647][test] Add unit test for Lifecycle Race Condition error in disagg server by @yingguo-trt in #12803
- [None][infra] Add DSR1 DSV32 K2 Disagg Perf Tests Back by @chenfeiz0326 in #12688
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #12765
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #12814
- [None][fix] Fix VLM guided decoding startup crash due to missing vocab_size_padded property by @stefanpantic in #12284
- [None][fix] Fix Nano chunked prefill by @2ez4bz in #12782
- [https://nvbugs/6029220][fix] Disable multi-stream in maybe_execute_i… by @liji-nv in #12659
- [None][test] remove unused tests by @xinhe-nv in #12625
- [https://nvbugs/6000658][fix] Fix disagg gen-only hang where 10s sleep in can_forward blocks KV transfers and overflows CTX memory by @peihu-nv in #12640
- [#12593][feat] AutoDeploy: onboard DeepSeek-R1 by @galagam in #12601
- [#11548][feat] AutoDeploy: Optimize Qwen3.5 perf by @taylor-yb-lee in #12265
- [None][chore] Set the use_one_model flag to True by default on llm ap… by @nv-guomingz in #12836
- [https://nvbugs/5921674][fix] unwaive TestNemotronNanoV3 fp8 tests by @tcherckez-nvidia in #12792
- [None][feat] Add NvTelemetry/GXT-compliant usage telemetry by @venkywonka in #12384
- [https://nvbugs/5996776][fix] Fix test OOM by @dongfengy in #12856
- [None][feat] Support loading FP8 LoRA weight files by @achartier in #12848
- [None][test] check unused fixtures by @xinhe-nv in #12730
- [TRTLLM-11804][feat] Mechanical refactoring VisualGen API by @zhenhuaw-me in #12807
- [TRTLLM-11324][perf] Add host performance regression test suite for PyExecutor by @hyukn in #12148
- [None][chore] unwaive some dis-agg tests by @Shixiaowei02 in #12828
- [TRTLLM-11707][feat] Add CUDA graph support (torch compile compatible) for LTX-2 by @luyiyun1021 in #12653
- [https://nvbugs/6055474][test] Fix RTX-6000 with wrong moe backend by @yufeiwu-nv in #12886
- [None][chore] Waive failing pre-merge test by @brb-nv in #12916
- [None][docs] Add README for custom Claude Code skills and agents by @kaiyux in #12920
- [TRTLLM-11421][feat] Support better kv cache statistics monitoring by @eopXD in #12413
- [https://nvbugs/5448464][fix] Partially fix LoRA overallocation for Nemotron NAS by @brb-nv in #12817
- [https://nvbugs/5996776][fix] Unwaive tests after fix by @dongfengy in #12906
- [https://nvbugs/5940463][fix] remove test_cli_flow.py::TestSantacoder case by @QiJune in #12845
- [None][test] Add Nemotron-3-Super-120B-A12B-NVFP4 func and perf cases on DGX-spark by @JennyLiu-nv in #12830
- [None][infra] Fix plc nightly pipeline and show more data by @yuanjingx87 in #12940
- [TRTLLM-11268][feat] Video temporal compression to Nemotron Nano and RADIO by @2ez4bz in #12649
- [https://nvbugs/5910749][https://nvbugs/5995486][test] Fix Qwen3 skip softmax attention CI tests by @bobboli in #12789
- [https://nvbugs/6043312][fix] fix_mooncake_transfer_agent_binding by @chuangz0 in #12723
- [#12699][feat] consolidate piecewise CUDA graph VLM updates by @nvchenghaoz in #12852
- [TRTLLM-11770][feat] Skip nvfp4 fused norm if the dim doesn't meet the requirement by @pamelap-nvidia in #12901
- [None][fix] skip inference_mode() when torch.compile=True for gemma3 fp8 by @amukkara in #12367
- [None][feat] AutoDeploy: Onboard google/gemma-4-31B-it dense model, including nvfp4 by @suyoggupta in #12866
- [#12634][feat] AutoDeploy: Support rank 256 MLA in flashinfer_mla by @bmarimuthu-nv in #12519
- [https://nvbugs/5997534][fix] AutoDeploy: Skip Eagle3 One Model Test on pre-Hopper by @govind-ramnarayan in #12757
- [None][fix] Fix multi_stream_moe accuracy with MLIR and piecewise cudagraphs by @suyoggupta in #12847
- [https://nvbugs/5961739][fix] Unwaiving failing tests by @greg-kwasniewski1 in #12936
- [#12954][fix] AutoDeploy: Fix Gemma4 MoE config (disable multi_stream_moe, lower free_gpu_memory_fraction) by @suyoggupta in #12955
- [TRTLLM-11540][feat] Add EAGLE3 dynamic tree speculative decoding support by @sunnyqgg in #12062
- [https://nvbugs/6064029][fix] Eliminate double PNG encoding in visual gen serving by @karljang in #12903
- [TRTLLM-11532][refactor] Unify VisualGen parallelism by @NVShreyas in #12509
- [None][fix] Fix 'max_batch_size' conflict in AD dashboard script by @tcherckez-nvidia in #12967
- [TRTLLM-11797][feat] Add cutedsl moe backend supporting for qwen3.5. by @nv-guomingz in #12799
- [TRTLLM-11315][feat] Extend python cache transceiver to support Qwen-Next by @bo-nv in #12772
- [https://nvbugs/5991576][test] Add E2E test for PP+disagg+block_reuse+chunked_prefill hang by @yingguo-trt in #12913
- [None][feat] Align AttentionPlugin with EdgeLLM interface by @nvyocox in #12233
- [None][infra] Waive 1 failed cases for main in post-merge 2648 by @ZhanruiSunCh in #12975
- [https://nvbugs/5983390][perf] Reduce host overhead in DSA MLA attent… by @liji-nv in #12631
- [None][infra] Waive 8 failed cases for main in post-merge 2646 by @ZhanruiSunCh in #12934
- [None][fix] Unwaive phi4 accuracy tests by @Wanli-Jiang in #12832
- [None][feat] Add benchmark for all allreduce backend by @yilin-void in #12887
- [TRTLLM-11893][feat] Convert VisualGenParams to Pydantic with extra_params, per-model defaults, and request validation by @zhenhuaw-me in #12922
- [None][infra] Waive 4 failed cases for main in pre-merge 33523 by @ZhanruiSunCh in #12977
- [None][infra] Waive 4 failed cases for main in post-merge by @xinhe-nv in #12973
- [TRTLLM-11657][feat] Conversation affinity disagg router by @reasonsolo in #12526
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #12953
- [None][feat] optimize GDN prefill with indexed in-kernel state updates by @nv-guomingz in #12791
- [https://nvbugs/6061812][fix] Unblock ruff check by @VALLIS-NERIA in #12996
- [None][fix] Update moe hidden_size in communicator for nemotron-h by @Wanli-Jiang in #12890
- [TRTLLM-11492][fix] Fix benchmark disagg deadlock by eliminating blocking fill loop by @chienchunhung in #12208
- [TRTLLM-10938][feat] Enable block reuse with overlap scheduler by @chienchunhung in #12816
- [#12617][feat] Add support for speculative decoding with LoRA by @Funatiq in #12661
- [None][fix] Fix constrained decoding for GLM5 by @cascade812 in #12869
- [None][test] Waive two dsv3lite cases due to nvbug 6071081. by @nv-guomingz in #13001
- [None][feat] Add production-level Prometheus metrics (iteration stats, config info, token counters, phase histograms) by @nvyutwu in #12545
- [None][infra] Remove invalid test case in waive list by @yuanjingx87 in #13008
- [None][chore] Fix failing KV Cache Transceiver Tests from #11574 by @ekou24 in #12554
- [https://nvbugs/6060281][fix] Treat whitespace-only content in nano-v3 reasoning swap by @tijyojwad in #12912
- [None][feat] KVConnector shorthand paths for "lmcache" and "kvbm" with examples by @sammshen in #12626
- [#12712][feat] AutoDeploy Model Onboarding Sprint 03/19 - Part 1 (infra only) by @govind-ramnarayan in #12708
- [https://nvbugs/6059036][fix] AutoDeploy fix registry accuracy tests by @nvchenghaoz in #12942
- [https://nvbugs/5963665][refactor] Refactor warmup orchestration in ModelEngine by @liji-nv in #12407
- [https://nvbugs/5973214][fix] unwaive qwen3 ci test by @byshiue in #12237
- [https://nvbugs/5781383][chore] Unwaive test by @shuyixiong in #12282
- [None][chore] AutoDeploy: Added Qwen3.5 accuracy test for NVFP4 by @taylor-yb-lee in #13014
- [None][chore] Unify code path for reuse/non-reuse when adding sequence in kv cache manager by @eopXD in #10437
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #13016
- [None][chore] Waive failed tests by @yiqingy0 in #13035
- [TRTLLM-11091][feat] Add tunable nvfp4 quantize with additional FlashInfer backend by @chang-l in #12126
- [TRTLLM-11540][feat] Revert EAGLE3 dynamic tree speculative decoding support (#12062) by @brb-nv in #13006
- [None][fix] fix Wan unit tests by @zhenhuaw-me in #13026
- [None][fix] Update CUTLASS C++ to 4.4.2 by @depaulmillz in #12897
- [None][chore] Waive failing tests 04/14 by @brb-nv in #13049
- [None][chore] Unwaive broader test lists by @brb-nv in #13053
- [None][chore] Update waived test name by @brb-nv in #13058
- [None][infra] Waive 4 failed cases for main in post-merge 2652 by @ZhanruiSunCh in #13067
- [None][fix] Fix moe_chunking_tokens during MoE A2A by @Wanli-Jiang in #12929
- [None][infra] Support nv sa benchmark in CI Perf Test by @chenfeiz0326 in #13004
- [None][fix] Pin Ray version to 2.54.1 by @shuyixiong in #13071
- [None][infra] Add K2.5 Perf Tests into CI by @chenfeiz0326 in #12931
- [https://nvbugs/5846024][fix] Remove waivers by @VALLIS-NERIA in #12979
- [https://nvbugs/5838178][fix] Fix failing lora test for Llama by @brb-nv in #12950
- [None][fix] Guard CUDA event elapsed_time in perf_metrics_manager to prevent executor crash by @yifjiang in #12868
- [None][fix] Pin Ray version to 2.54.1 in slurm CI stage by @shuyixiong in #13085
- [TRTLLM-11266][feat] Unify image as tensor to avoid multiple converting for nano model by @Wanli-Jiang in #12994
- [None][fix] Update CODING_GUIDELINES.md to say Python >= 3.10 by @hnover-nv in #13094
- [None][chore] Remove onboard block switch for KV cache manager by @eopXD in #12449
- [None][infra] Waive 3 failed cases for main in post-merge 2652 by @ZhanruiSunCh in #13070
- [None][infra] Exclude QA nodes when running TRTLLM CI by @yuanjingx87 in #13102
- [None][fix] Remove leftover onboardBlocks param in kvCacheManagerTest by @eopXD in #13107
- [None][infra] Waive 1 failed cases for main in post-merge 2653 by @ZhanruiSunCh in #13109
- [TRTLLM-11990][infra] Move PY312-UB2404 sanityCheck test to A100X node by @yiqingy0 in #13077
- [None][chore] Bump version to 1.3.0rc12 by @VALLIS-NERIA in #13129
- Added opt-out usage data collection to better understand usage patterns and guide TensorRT LLM development toward real-world needs. Opt-out options and details at: https://github.com/NVIDIA/TensorRT-LLM?tab=readme-ov-file#telemetry-data-collection
New Contributors
- @stefanpantic made their first contribution in #12284
- @nvyutwu made their first contribution in #12545
- @sammshen made their first contribution in #12626
- @depaulmillz made their first contribution in #12897
Full Changelog: v1.3.0rc11...v1.3.0rc12
v1.2.1
April 20, 2026
Highlights
- Fixed Issue
- Fixed an issue that caused KV cache corruption (#12770)
- Infrastructure Changes
- Upgraded xgrammar and flashinfer (#12811)
v1.3.0rc11
April 9, 2026
Highlights
- Model Support
- API
- Support include_stop_token_in_output in gRPC request manager (#12517)
- Add deprecation warnings on TRT backend entrypoints (#11723)
- Accept strict field in tools and store field in chat requests (#12482)
- Mark TRTLLMSampler as deprecated and update documentation (#11938)
- Move VisualGen APIs to a separate directory (#12538)
- Remove some fields with redefined defaults (#11671)
- Feature
- Apply norm before FC in Eagle (#12561)
- Split MLA DSA custom op for piecewise CUDA graph capture (#12503)
- Optimize host performance for Python cache transceiver (#12273)
- Add Mamba2 MTP SSM cache CUDA kernel for tree-based speculative decoding (#12537)
- Add serve-config-guide skill for basic aggregate single-node serving configs (#12054)
- Add FORCE_CHUNK context chunking policy (#12483)
- Add dense GEMM backend for MoE (#10479)
- Implement gen-first disaggregated scheduling, part 2 (#12239)
- Support EPLB with various MoE backends for Nemotron-H models (#12280)
- Skip softmax via sparsity ratio (#11995)
- Add DWDP (distributed weight data parallelism) support for MoE inference (#12136)
- Add AutoDeploy Super V3 MTP support (#12326)
- Introduce fast path (token IDs + multimodal) for VLMs without re-tokenizing encoded prompts (#11708)
- Add global pool support for suffix automaton speculative decoding (#12130)
- Add Triton paged attention for AutoDeploy (#12642)
- Refactor VisualGen attention backend (#12663)
- Add support of linear attention state for C++ KV cache manager (#12531)
- Add temporally-correlated heuristic-guided indexer TopK for sparse attention (#12385)
- Support MLA generation in TrtllmGen attention backend (#12606)
- Extend Python cache transceiver to support Nemotron (#12150)
- Handle different chat template types (#12336)
- Add multi-turn support for trtllm-bench (#12468)
- Add fused DiT QK Norm + RoPE CUDA kernel for FLUX (#11869)
- Support cache reuse for SSM in KVCacheManagerV2 (#12644)
- Add MLIR-based auto-generated elementwise fusion for AutoDeploy (#12427)
- Add --custom_tokenizer CLI option to trtllm-bench (#12586)
- Support LoRA adapter for Nemotron-H models (#12154)
- Apply multiple host performance optimizations for DSA (#12581)
- Reuse Triton slicing kernel for GDN prefill transpose (#12737)
- Add Trtllm-gen FMHA JIT support (#12612)
- Retune causalConv1d forward dispatch for variable-length and short sequences (#12739)
- Update configuration to enable NVFP4 (#12776)
- Fuse SiLU+Mul in AutoDeploy transform (#12497)
- Fix
- Fix Triton kernels in wheel (#12569)
- Fix DSACacheManager and RocketCacheManager KV cache estimation ignoring num_layers for draft models (#12571)
- Reorder generation_logits to align with final beam search output ordering (#12268)
- Handle CUDA_ERROR_INVALID_VALUE in kv_cache_v2 _is_prop_supported (#12613)
- Fix autotuner OOM for trtllmGen MoE runners at large context length (#12523)
- Always sync sampler_event in update_requests (#12585)
- Avoid counting KV cache uses during warmup for Prometheus KV cache metrics (#12132)
- Fix lost requests (#12348)
- Fix GPTOSS CUTLASS MoE on Hopper NVLink one-sided workspace overflow (#12666)
- Fix Mooncake dynamic load in transfer_agent_binding (#12181)
- Fix disaggregated pipeline-parallel hang (#12528)
- Correct reused block counting in corner case (#12404)
- Clamp block indices to prevent out-of-bounds in DSA with MTP (#12657)
- Synchronize NCCL memory allocation error handling (#12125)
- Adjust prompt logprobs to use the correct prompt token id (#12499)
- Improve NIXL agent import error diagnostics (#12446)
- Fix disaggregated serving hang on block reuse after eviction (#12667)
- Use the first non-None result returned by Hugging Face download workers (#12259)
- Replace assertions with warnings for unsupported logits/logprobs in speculative sampler (#12547)
- Address H20 weights loading OOM for GPTOSS (#11321)
- Improve Harmony parser (delta grouping, reuse report, test coverage) (#12467)
- Fix hang issues on DGX B200 8-GPU PyTorch configurations (#12656)
- Fix disaggregated KV cache router for chat API; add disaggregated benchmark for ai_perf (#12337)
- Fix CUDA event crash with performance metrics (#12639)
- Update Nemotron-H handling for corner cases (#12620)
- Fix KV cache issue (#12673)
- Fix wrong token suppressed with ignore_eos in Torch sampler (#12358)
- Fix GPTOSS chat template for disaggregated tests (#12724)
- Fix top-K logprobs size for pipeline parallelism (#12623)
- Remove clone in FP8 quantization (#12687)
- Fix Qwen2.5 mixed precision accuracy issue (#12609)
- Fix Mamba metadata prefill bubble in chunked prefill serving (#12736)
- Fix outdated README argument for executorExampleDisaggregated.cpp (#12276)
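As an illustration of the "clamp block indices" fix listed above (#12657), the sketch below shows the general clamping pattern only. It is not the code from that PR; the function name and the list-based representation are assumptions made for this example.

```python
def clamp_block_indices(indices: list[int], num_blocks: int) -> list[int]:
    """Force every index into the valid range [0, num_blocks - 1].

    Hypothetical helper for illustration; not part of TensorRT-LLM.
    """
    if num_blocks <= 0:
        raise ValueError("num_blocks must be positive")
    last = num_blocks - 1
    # Indices below 0 become 0, indices above the last block become `last`.
    return [min(max(i, 0), last) for i in indices]


print(clamp_block_indices([-1, 0, 3, 8], num_blocks=4))
```

Clamping only makes the lookup memory-safe. Whether a clamped entry is meaningful still depends on the caller masking or ignoring it.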
- Documentation
- Add MoE developer guide for fused_moe module (#12534)
- Update supported models to include Kimi K2/K2.5 and GLM-5 (#12654)
- Publish blog post for DWDP (#12725)
- Add visual generation models to supported models page (#12464)
- Clean up latest news and blogs; update overview and highlight visual generation (#12753)
- Update C++ coding guidelines (#12577)
- Test & Infra
- Use shared utility for node labels (#9095)
- Adjust RocketKV test threshold (#12527)
- Enhance performance tests with GPU availability check in test_perf.py (#12535)
- Move AD performance regression tests to AD pre- and post-merge jobs (#12461)
- Remove Model Registry Check from workflows; check runs in pre-commit (#12590)
- Add Ubuntu 24.04 wheel image for SBSA (#12436)
- Pin mypy version due to dependency conflicts (#12650)
- Fix Pyxis error in disaggregated performance test (#12575)
- Skip already-applied patches gracefully in third-party FetchContent (#12550)
- Add container scanning to PLC nightly pipeline (#12549)
- Use JobBuilder to trigger downstream job (#7079)
- Prefer GitHub then GitLab for TOT waive list (#11063)
- Isolate single-GPU Ray orchestrator tests to avoid CI timeouts (#12616)
- Add workaround for trtllm-bench hang and improve robustness (#12655)
- Bump tornado and black in container (#12600)
- Remove OOM test case from L40S test list (#12685)
- Temporarily disable warn_unused_ignores (#12728)
- Add supplemental Ruff lint for legacy files via ruff-legacy hook (#11469)
- Add port conflict retry for disaggregated multi-process tests (#12618)
- Add CI agent failure analysis to L0 merge request pipeline (#12543)
- Fix source code scanning (#12773)
- Remove gpu-shell tool from ad-run-agent (#12418)
- Move to FlexCache in Austin for 5080 nodes (#12615)
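The "port conflict retry" item above (#12618) names a common test-harness pattern: if a port is already taken, try another one instead of failing. A minimal sketch of that pattern using only the Python standard library follows. The helper name and the "try the next port" policy are assumptions for illustration, not the approach taken in the PR.

```python
import socket


def bind_with_retry(start_port: int, max_attempts: int = 10,
                    host: str = "127.0.0.1") -> socket.socket:
    """Bind a TCP socket to the first free port at or after start_port.

    Hypothetical helper for illustration; not part of TensorRT-LLM.
    """
    last_error = None
    for offset in range(max_attempts):
        port = start_port + offset
        if port > 65535:
            break  # ran past the valid TCP port range
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
            return sock  # caller owns the socket and must close it
        except OSError as err:
            # Typically "address already in use": release and try the next port.
            sock.close()
            last_error = err
    raise RuntimeError(
        f"no free port found starting at {start_port}") from last_error
```

Returning the bound socket, rather than just the port number, avoids the race where another process grabs the port between the check and the actual use.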
What's Changed
- [https://nvbugs/5882636][fix] Fix triton_kernels in wheel by @dongfengy in #12569
- [https://nvbugs/5919796][test] AutoDeploy: unwaive Super V3 autodeploy failure by @galagam in #12556
- [None][test] Waive another flaky test case on Dis-agg serving with Ne… by @nv-guomingz in #12587
- [#11992][fix] Support include_stop_token_in_output in gRPC request manager by @CatherineSue in #12517
- [None][feat] Eagle: Norm before FC by @IzzyPutterman in #12561
- [#10607][fix] moved AD perf regression tests to AD jobs pre and post merge by @MrGeva in #12461
- [None][infra] Waive 1 failed cases for main in post-merge 2626 by @ZhanruiSunCh in #12592
- [TRTLLM-7335] [infra] Use shared utility for node labels by @niukuo in #9095
- [None][infra] Waive 1 failed cases for main in pre-merge 31714 by @ZhanruiSunCh in #12589
- [https://nvbugs/6007197][fix] Adjust RocketKV test threshold by @heyuhhh in #12527
- [None][test] Enhance performance tests by adding GPU availability check in test_perf.py by @yufeiwu-nv in #12535
- [None][infra] Waive 2 failed cases for main in post-merge 2627 by @ZhanruiSunCh in #12605
- [None][fix] Fix DSACacheManager and RocketCacheManager KV cache estimation ignoring num_layers for draft models by @lancelly in #12571
- [None][doc] Add MoE developer guide for fused_moe module by @xxi-nv in #12534
- [None][chore] Remove Model Registry Check from workflows, the check already runs in pre-commit by @tcherckez-nvidia in #12590
- [https://nvbugs/5983390][perf] Split MLA DSA custom op for piecewise CUDA graph capture by @liji-nv in #12503
- [None][fix] Reorder generation_logits to align with final beam search output ordering by @achartier in #12268
- [TRTC-351][chore] Deprecation warnings on TRT backend entrypoints by @venkywonka in #11723
- [TRTLLM-10804][infra] add ubuntu2404 wheel image for SBSA by @niukuo in #12436
- [#12288][feat] Add Mistral 4-small support to AutoDeploy by @bmarimuthu-nv in #12266
- [None][infra] waive failed case for main by @EmmaQiaoCh in #12621
- [https://nvbugs/5920751][chore] Unwaive a test that has been fixed by @longlee0622 in #12610
- [TRTLLM-9526][feat] optimize host perf for python cache transceiver by @chuangz0 in #12273
- [None][test] Fix kimi-k2 test issue by @yufeiwu-nv in #12604
- [None][feat] Add Mamba2 MTP SSM cache CUDA kernel for tree-based speculative decoding by @JadoTu in #12537
- [None][chore] Bump version to 1.3.0rc11 by @ZhanruiSunCh in #12627
- [https://nvbugs/6013692][fix] handle CUDA_ERROR_INVALID_VALUE in kv_cache_v2 _is_prop_supported by @lfr-0531 in #12613
- [https://nvbugs/6011517][fix] Fix autotuner OOM for trtllmGen MoE runners at large context length by @hyukn in #12523
- [None][feat] serve-config-guide skill for basic aggregate single-node serving configs by @venkywonka in #12054
- [None][fix] always sync sampler_event in update_requests by @Funatiq in #12585
- [None][infra] Pin the version of mypy due to dependency conflicts by @EmmaQiaoCh in #12650
- [https://nvbugs/5996645][fix] Fix Pyxis Error in Disagg Perf Test by @chenfeiz0326 in #12575
- [TRTLLM-10061][feat] Add FORCE_CHUNK context chunking policy by @VALLIS-NERIA in #12483
- [None] [feat] Add densegemm backend for MoE by @zongfeijing in #10479
- [TRTLLM-8922][feat] gen-first disagg scheduling, part 2 by @reasonsolo in #12239
- [https://nvbugs/5972362][fix] Avoid counting KV cache uses during warmup for Prometheus KV cache metrics by @yijingl-nvidia in #12132
- [https://nvbugs/6007352][fix] Accept strict field in tools and store field in chat requests by @JunyiXu-nv in #12482
- [TRTLLM-11551][feat] Support EPLB with various MoE backends for nemotron-h models by @Wanli-Jiang in #12280
- [None][infra] Skip already-applied patches gracefully in 3rdparty FetchContent by @achartier in #12550
- [TRTLLM-11385][chore] Mark TRTLLMSampler as deprecated and update documentation by @Funatiq in #11938
- [None][feat] Skip softmax via sparsity ratio by @rohansjoshi in #11995
- [None][infra] Add container scanning to plc nightly pipeline by @yuanjingx87 in #12549
- [https://nvbugs/5948878][fix] fix lost requests by @bo-nv in #12348
- [TRTLLM-7335][infra] use JobBuilder to trigger downstream job by @niukuo in #7079
- [https://nvbugs/5800672][fix] Unwaive tests fixed by Austin Lab GPU topo config resolution by @peaceh-nv in #12453
- [TRTLLM-11119][feat] Blackwell SageAttention, Integrate into AttentionOp API by @xrq-phys in #11718
- [TRTLLM-9970][infra] Get TOT waive list from github repo first and then gitlab repo by @yiqingy0 in #11063
- [https://nvbugs/5850183][fix] Re-enable passing tests by @dongfengy in #12568
- [None][feat] Add DWDP (Distributed Weight Data Parallelism) support for MoE inference by @tianyuz-nv in #12136
- [https://nvbugs/6038228][test] Add WA for trtllm-bench hang issue and improve its robustness by @yufeiwu-nv in #12655
- [https://nvbugs/5836828][fix] Fix GPTOSS CUTLASS MOE on Hopper nvlink one-sided workspace overflow by @dongfengy in #12666
- [https://nvbugs/5996656][fix] unwaive qwen3 ci test by @byshiue in #12652
- [None][fix] fix mooncake dynamic load in transfer_agent_binding by @chuangz0 in #12181
- [None][fix] Add GlmMoeDsaForCausalLM to EPLB supported model list by @qiaoxj07 in #12607
- [None][doc] update supported models to include Kimi K2/K2.5 and GLM-5 by @dc3671 in #12654
- [https://nvbugs/5911788][fix] Isolate single_gpu ray orchestrator tests to avoid CI timeouts by @shuyixiong in #12616
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #12645
- [https://nvbugs/6007967][fix] fix disagg pp hang issue by @bo-nv in #12528
- [TRTLLM-10232][feat] Support LoRA adapter for nemotron-h models by @Wanli-Jiang in #12154
- [https://nvbugs/5983390][perf] Multiple host perf optimizations for DSA part by @hyukn in #12581
- [None][fix] Correct reused block counting on corner case by @tongyuantongyu in #12404
- [https://nvbugs/6032056][fix] Clamp block indices to prevent OOB in DSA with MTP by @sunnyqgg in #12657
- [None][revert] Revert "[TRTLLM-11119][feat] Blackwell SageAttention, Integrate into … by @yunruis in #12679
- [#12332][feat] AutoDeploy: SuperV3 MTP Support by @govind-ramnarayan in #12326
- [TRTLLM-11163][feat] Introduce a fast path (token IDs + MM) for VLMs instead of de-tokenizing already encoded prompt by @moraxu in #11708
- [TRTLLM-11237][fix] [fix] Synchronize NCCL memory allocation error handling by @nv-lschneider in #12125
- [https://nvbugs/6008710][fix] Adjust prompt logprobs to use the correct prompt token id by @stnie in #12499
- [None][infra] Bump tornado and black in container by @yuanjingx87 in #12600
- [TRTLLM-11043][feat] Add global pool support for suffix automaton speculative decoding by @cascade812 in #12130
- [https://nvbugs/5979673][fix] improve NIXL agent import error diagnostics by @Shixiaowei02 in #12446
- [None][feat] Add triton paged attention for AutoDeploy by @nvchenghaoz in #12642
- [#12560][fix] Fix disaggserving hang on block reuse after eviction by @Tabrizian in #12667
- [None][refactor] VisualGen attention backend refactor by @NVShreyas in #12663
- [TRTLLM-11318][feat] move VisualGen APIs to a separate dir by @zhenhuaw-me in #12538
- [None][infra] Waive failed test by @yuanjingx87 in #12714
- [#11538][fix] Enable sliding window attention for Mistral/Mixtral by @karljang in #12597
- [None][infra] Waive failure test by @yiqingy0 in #12726
- [TRTLLM-10061][feat] Add support of linear attention state for C++ KV cache manager by @VALLIS-NERIA in #12531
- [None][feat] Temporally-Correlated Heuristic-guided Indexer TopK for Sparse Attention by @longcheng-nv in #12385
- [None][test] Remove OOM test case from L40S test list by @yufeiwu-nv in #12685
- [None][feat] Support MLA generation in TrtllmGen attention backend by @yihwang-nv in #12606
- [None][infra] No warn_unused_ignores temporarily by @EmmaQiaoCh in #12728
- [None][feat] Qwen3-Next MTP by @IzzyPutterman in #11370
- [None][doc] Blog19 for DWDP. by @wanqian-nv in #12725
- [https://nvbugs/5800591][chore] Unwaive a deepseek MTP test by @mikeiovine in #12327
- [None][doc] Add visual generation models to supported models page by @chang-l in #12464
- [TRTLLM-11146][feat] Extend python cache transceiver to support nemotron by @bo-nv in #12150
- [TRTLLM-11523][feat] Handle different chat template types by @2ez4bz in #12336
- [#12257][fix] Use the first non-None result returned by hf download workers by @kev-bi in #12259
- [None][feat] Add multi-turn support for trtllm-bench by @cascade812 in #12468
- [None][fix] Replace assertions with warnings for unsupported logits/logprobs in speculative sampler by @yifjiang in #12547
- [None][cleanup] Add supplemental ruff lint for legacy files via ruff-legacy hook by @venkywonka in #11469
- [https://nvbugs/5864187][fix] Address H20 Weights Loading OOM for GPTOSS by @dongfengy in #11321
- [None][feat] Add fused DiT QK Norm + RoPE CUDA kernel for FLUX by @karljang in #11869
- [TRTLLM-9772][feat] Support cache reuse for SSM in KVCacheManagerV2 by @lowsfer in #12644
- [None][fix] Harmony Parser Delta Grouping + Reuse Report + Better Test Coverage by @dongfengy in #12467
- [None][feat] MLIR-based auto-generated elementwise fusion for AutoDeploy by @suyoggupta in #12427
- [None][doc] Clean up latest news + blogs, update overview, highlight visual gen by @laikhtewari in #12753
- [None][feat] Add --custom_tokenizer CLI option to trtllm-bench by @qiaoxj07 in #12586
- [https://nvbugs/6027560][fix] fix hang issues on DGX_B200-8_GPUs-PyTo… by @bo-nv in #12656
- [TRTLLM-11597][fix] fix disagg kvcache router for chat API and add disagg benchmark for ai_perf by @reasonsolo in #12337
- [None][fix] Fix Cuda event crash with perf metrics by @jthomson04 in #12639
- [None][fix] Update codes to support nemotron-h corner cases by @Wanli-Jiang in #12620
- [None][infra] Waive 10 failed cases for main in post-merge 2636 by @ZhanruiSunCh in #12767
- [https://nvbugs/6018051][fix] Add port conflict retry for disaggregated MP tests by @reasonsolo in #12618
- [https://nvbugs/6025177][fix] Fix KV cache issue by @thorjohnsen in #12673
- [TRTLLMINF-37][feat] Add CI agent failure analysis to L0_MergeRequest… by @dpitman-nvda in #12543
- [None][doc] Update C++ coding guidelines. by @hnover-nv in #12577
- [#12324][fix] Fixed wrong token suppressed with ignore_eos in torch sampler by @MrGeva in #12358
- [https://nvbugs/5849648][fix] Fix GPTOSS Chat Template for Disagg Tests by @dongfengy in #12724
- [#11094][feat] AutoDeploy transform to fuse silu+mul by @MrGeva in #12497
- [None][infra] Fix source code scanning by @yuanjingx87 in #12773
- [None][chore] Remove gpu-shell tool from ad-run-agent by @govind-ramnarayan in #12418
- [#9306][cleanup] Remove some fields with redefined defaults by @2ez4bz in #11671
- [None][feat] reuse triton slicing kernel for GDN prefill transpose by @nv-guomingz in #12737
- [None][feat] fix mamba metadata prefill bubble in chunked prefill serving by @nv-guomingz in #12736
- [https://nvbugs/5781731][fix] Unwaive Ray test by @dominicshanshan in #9654
- [None][fix] Fix outdated argument of readme.md for executorExampleDisaggregated.cpp by @Fan-Yunfan in #12276
- [None][feat] Trtllm-gen FMHA JIT support by @yunruis in #12612
- [None][feat] retune causalConv1d fwd dispatch for varlen and short sequences by @nv-guomingz in #12739
- [TRTLLM-9948][infra] Move to use FlexCache in Austin for 5080 nodes by @EmmaQiaoCh in #12615
- [TRTLLM-11768][fix] Config updates to enable NVFP4 by @2ez4bz in #12776
- [https://nvbugs/6008468][fix] Fix top-K logprobs size for PP by @pengbowang-nv in #12623
- [None][fix] Remove clone in fp8 quant. by @Tracin in #12687
- [https://nvbugs/6011284][fix] Fix Qwen2.5 mixed precision accuracy issue. by @Tracin in #12609
New Contributors
- @rohansjoshi made their first contribution in #11995
- @xrq-phys made their first contribution in #11718
- @tianyuz-nv made their first contribution in #12136
- @wanqian-nv made their first contribution in #12725
- @kev-bi made their first contribution in #12259
- @yifjiang made their first contribution in #12547
Full Changelog: v1.3.0rc10...v1.3.0rc11
v1.3.0rc10
March 31, 2026
Highlights
- Model Support
- API
- Feature
- Add CuTe DSL single-pass multi-CTA cluster top-k (#12354)
- Account for reusable KV cache blocks in micro-batch scheduler capacity scheduling (#11637)
- Add raster-along-M/N support for blockscaled contiguous backbone kernels in CuteDSL MoE (#12079)
- Add stride support for conv1d and fused_sigmoid_gating_delta_rule_update (#12442)
- Add a safe allgather implementation with chunking (#12174)
- Add dynamic SMEM block routing in MoE (#12456)
- Optimize mamba_mixer2.py decode performance (#11843)
- Add PDL support to CuTe DSL top-k kernels (#12506)
- Add FlexKV support (#12512)
- Add a KV cache-aware ADP router for prefix-affinity request routing (#12315)
- Fix
- Fix KV token estimation when ADP is enabled (#12099)
- Fix Eagle MLA target with GQA draft support (#12171)
- Fix Qwen 3.5 3D position ID handling (#12114)
- Switch tests to TorchSampler and fix related bugs (#12200)
- Use ceil_div for head and size sharding (#12441)
- Remove redundant D2H synchronization to improve performance (#12445)
- Fix parallel WAN VAE when return_dict=True (#12460)
- Fix Triton resmooth kernel crashes on SM100f for large MoE grids (#12397)
- Use a model-level warmup cache key for visual generation pipelines (#12516)
- Add NVTX annotations in sampler.py (#12459)
- Use extra_visual_gen_options to improve visual generation routing (#12487)
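The ceil_div item in the list above (#12441) concerns rounding when a dimension does not divide evenly across ranks. The sketch below shows why rounding up matters for sharding. It is a generic illustration, not the TensorRT-LLM implementation, and shard_sizes is a hypothetical helper introduced only for this example.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer division rounded up, for non-negative a and positive b."""
    return -(-a // b)


def shard_sizes(total: int, world_size: int) -> list[int]:
    """Split `total` items over `world_size` ranks without dropping any.

    Hypothetical helper: each rank gets up to ceil_div(total, world_size)
    items, and trailing ranks get whatever remains (possibly zero).
    """
    per_rank = ceil_div(total, world_size)
    sizes = []
    remaining = total
    for _ in range(world_size):
        take = min(per_rank, remaining)
        sizes.append(take)
        remaining -= take
    return sizes


# 10 heads over 4 ranks: floor division (10 // 4 = 2) would cover only 8 heads.
print(shard_sizes(10, 4))
```

With floor division the last two heads in this example would belong to no rank. Rounding up covers all of them at the cost of an uneven final shard.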
- Documentation
- Test & Infra
- Save unittest subtest results periodically (#11850)
- Fix the B200 aggregated CI perf test MPI issue (#12347)
- Fix LoRA config handling when the provided config count is below requirements (#12409)
- Add a unit test for load_state_dict safetensors fallback (#12408)
- Replace the skipped TRTLLM NVFP4 test in the B300 CI list (#12454)
- Fix the ltx-2 model checkpoint issue in VBench eval tests (#12463)
- Fix the concurrent write issue in perf tests (#12484)
- Update dependencies to align with the NGC PyTorch 26.02 stack (#12102)
- Consolidate PyTransceiver code (#12342)
- Add Eagle coverage with different input/output cases on Spark (#12520)
What's Changed
- [None][infra] Waive 4 failed cases for main in post-merge 2611 by @ZhanruiSunCh in #12433
- [None][test] Fix lora config less than required config number by @yufeiwu-nv in #12409
- [https://nvbugs/5916151][fix] Unwaive test_fused_moe_w4a8_nvfp4_fp8[TRTLLM] by @xxi-nv in #12400
- [https://nvbugs/5963423][fix] Fix kv token estimation when ADP is on. by @dominicshanshan in #12099
- [TRTLLM-11229][infra] Save unittest subtest results periodically by @yiqingy0 in #11850
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #12426
- [https://nvbugs/5997090][fix] Fix B200 Aggregated CI Perf Test MPI Issue by @chenfeiz0326 in #12347
- [TRTLLM-10407][perf] Add cute dsl single pass multi cta cluster topk by @limin2021 in #12354
- [TRTLLM-11070][feat] Account for reusable KV cache blocks in micro batch scheduler capacity scheduling. by @SimengLiu-nv in #11637
- [None][chore] Fixing guardword check by @pcastonguay in #12455
- [None][infra] Waive 1 failed cases for main in post-merge 2610 by @ZhanruiSunCh in #12434
- [None][feat] CuteDSL MOE: Add raster along M/N support for blockscaled contiguous backbone kernel by @liyuhannnnn in #12079
- [None][fix] Switch tests to TorchSampler and fix bugs by @Funatiq in #12200
- [TRTLLM-10061][fix] Use ceil_div for head/size calculations by @VALLIS-NERIA in #12441
- [TRTLLM-10061][feat] Add stride support for conv1d and fused_sigmoid_gating_delta_rule_update by @VALLIS-NERIA in #12442
- [None][fix] Eagle: MLA Target + GQA Draft by @IzzyPutterman in #12171
- [None][doc] fix outdated code references in tech blogs 2, 3, 4, 8, 9, 11 by @schetlur-nv in #12338
- [TRTLLM-11471][feat] Add safe version of allgather with chunking by @chienchunhung in #12174
- [None][perf] add Dynamic SMEM block routing in MOE by @jiahanc in #12456
- [TRTLLM-11544][feat] Add Qwen 3.5 supporting(NVFP4). by @nv-guomingz in #12302
- [https://nvbugs/5997090][fix] Add Disagg Perf Test back as MPI Issue has been fixed by @chenfeiz0326 in #12458
- [https://nvbugs/5841976][fix] Remove test_fused_moe_alltoall_fp4[DeepEP] from waives by @xxi-nv in #12405
- [None][infra] Waive 2 failed cases for main in post-merge 2613 by @ZhanruiSunCh in #12473
- [https://nvbugs/5866619][test] Add unit test for load_state_dict safetensors fallback by @crazydemo in #12408
- [None][feat] Fuse all_reduce with norm for nemotron_h models by @Wanli-Jiang in #12410
- [None][infra] Update CI allowed list by @yuanjingx87 in #12488
- [https://nvbugs/6013562][test] Update waive by @xinhe-nv in #12492
- [None][feat] Small optimizations for mamba_mixer2.py decode by @hnover-nv in #11843
- [None][infra] Waive flaky DeepSeekV3Lite disagg serving test by @hyukn in #12494
- [#11526][chore] AutoDeploy accuracy tests: Use Llama3.1-8B-Instruct official checkpoints by @galagam in #12285
- [https://nvbugs/6007285][fix] Replace skipped TRTLLM NVFP4 test in B300 CI list by @xxi-nv in #12454
- [https://nvbugs/5983390][fix] Remove redundant D2H sync to optimize perf by @hyukn in #12445
- [https://nvbugs/5987470][fix] BREAKING: Do not normalize log probs by default by @achartier in #12366
- [TRTLLM-11622][fix] fix parallel WAN vae when return_dict=True by @NVShreyas in #12460
- [None][infra] Waive pre-merge failed 5090 test by @yuanjingx87 in #12486
- [None][infra] Waive flaky DeepSeekV3Lite disagg serving test by @bo-nv in #12518
- [None][chore] Fix ltx-2 Model Checkpoint Issue in VBench Eval Tests by @yibinl-nvidia in #12463
- [https://nvbugs/5962591][fix] Fix Triton resmooth kernel crash on SM100f for large MoE grids by @Barry-Delaney in #12397
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #12495
- [None][doc] Document temperature-adjusted logprobs in TRT backend by @achartier in #12514
- [None][feat] Add PDL support to CuTE DSL top-k kernels by @limin2021 in #12506
- [None][infra] Waive 4 failed cases for main in post-merge 2617 by @ZhanruiSunCh in #12536
- [None][doc] Update Python coding guidelines. by @hnover-nv in #12439
- [#12290][fix] Qwen 3.5 fix 3d position ID handling by @bmarimuthu-nv in #12114
- [TRTLLM-10820][infra] Update dependencies to align with NGC PyTorch 26.02 stack by @EmmaQiaoCh in #12102
- [https://nvbugs/6015329][fix] Use model-level warmup cache key for visual gen pipelines by @karljang in #12516
- [TRTLLM-9523][chore] PyTransceiver code consolidation by @Shixiaowei02 in #12342
- [None][test] Add different input-output of eagle cases on Spark by @JennyLiu-nv in #12520
- [https://nvbugs/6011086][fix] Fix Perf Test's Concurrent Write Issue by @chenfeiz0326 in #12484
- [None][fix] NVTX annotation in sampler.py by @ixlmar in #12459
- [https://nvbugs/5998489][feat] Adding support for request priority in LLM API by @pcastonguay in #12362
- [None][feat] Add support for FlexKV by @pcastonguay in #12512
- [None][feat] KV cache-aware ADP router for prefix-affinity request routing by @lancelly in #12315
- [https://nvbugs/6008183][fix] Use extra_visual_gen_options to help de… by @JunyiXu-nv in #12487
- [None][test] Waive a flaky test case on Dis-agg serving with Nemotron… by @nv-guomingz in #12578
- [None][chore] Bump version to 1.3.0rc10 by @yuanjingx87 in #12511
- [None][chore] Fixing guardword check by @VALLIS-NERIA in #12579
Full Changelog: v1.3.0rc9...v1.3.0rc10
v1.3.0rc9
2026-03-25
Highlights
- Model Support
- Add Qwen3-next attention DP support (#10218)
- Improve DeepSeek-V3.2 NVFP4 indexer GEMMs and routing kernels (#11989, #12055)
- Support KV cache and speculative decoding in the Trtllm-Gen attention backend (#11667, #12267)
- Add audio support and chunked-prefix enablement for Nemotron models (#12191, #12414)
- Add GLM 5 support and fix DSA MTP issues (#11990)
- Add initial Qwen3.5 text model support for the PyTorch backend with BF16/FP8 (#12242)
- API
- Add energy metrics to `trtllm-serve` and benchmarking workflows (#11855)
- Expose `video_pruning_rate` in `llmargs` and improve Nano V2 VL handling (#12194)
- Add `TLLM_PROFILE_LOG_RANKS` to control per-rank step logging (#12263)
- Improve the serve CLI with renamed flags and `mm_embedding_serve` enhancements (#12105)
- Add an `auto` option for tool and reasoning parsers (#12104)
- Support interleaved thinking in `trtllm-serve` (#12199)
- BREAKING: Set the default KV cache transfer timeout to 60 seconds (#12249)
- Feature
- Add FP8 combine support in `moe_a2a` (#11844)
- Add batch generation support to visual generation pipelines (#12121)
- Improve request management in the sampler (#11861)
- Add fused AllReduce + RMSNorm with optional residual support (#12201)
- Add constraint-based memory partitioning and a Python scheduler for `KVCacheManagerV2` (#12212, #11939)
- Add LM head sharding (#12252)
- Add an interactive recipe selector with curated configs and button-grid UI (#11917)
- Improve DSA and FlashMLA performance with new kernel fusions and cached tile-scheduler metadata (#12322, #12161)
- Improve model performance with CuteDSL `indexer_top_k`, FlashInfer MLP activation, and refined KV cache buffer sizing (#12236, #12131, #12274)
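For readers unfamiliar with the fused AllReduce + RMSNorm item above (#12201), the following is a minimal, unfused reference of the computation such an op typically combines, written in plain Python. It is a sketch under assumptions, not the TensorRT-LLM kernel: the function names, the `eps` default, and the choice to also return the pre-norm hidden state are all illustrative.

```python
import math
from typing import List, Optional, Tuple


def all_reduce_sum(per_rank: List[List[float]]) -> List[float]:
    """Element-wise sum of each rank's partial vector (ranks simulated in one process)."""
    return [sum(vals) for vals in zip(*per_rank)]


def rms_norm(x: List[float], weight: List[float], eps: float) -> List[float]:
    """RMSNorm: x / sqrt(mean(x^2) + eps) * weight."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def allreduce_rmsnorm(
    per_rank: List[List[float]],
    weight: List[float],
    residual: Optional[List[float]] = None,
    eps: float = 1e-6,  # hypothetical default, chosen for illustration
) -> Tuple[List[float], List[float]]:
    """Reference semantics: reduce across ranks, optionally add a residual, then normalize.

    Returns (normalized output, pre-norm hidden state).
    """
    hidden = all_reduce_sum(per_rank)
    if residual is not None:
        hidden = [h + r for h, r in zip(hidden, residual)]
    return rms_norm(hidden, weight, eps), hidden
```

A fused kernel would perform these steps in one pass to avoid extra reads and writes of the hidden state; the reference above only pins down what result to expect.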
- Fix
- Fix disaggregated perf test result generation, env export, and port allocation issues (#12211, #12140)
- Fix harmony and tool-calling parsers for agentic coding use cases (#12045)
- Fix torch.compile compatibility by routing DSA attention through the MLA custom op (#12186)
- Fix `min_tokens` handling for long prompts and return explicit scheduling errors when requests cannot be placed (#12166, #12206)
- Fix KV cache V2 OOMs and weight-loading OOMs in disaggregated serving (#12188, #12377)
- Fix lost requests, dummy-request crashes, and `GUIDE_TYPE_STRUCTURAL_TAG` handling in request management paths (#12197, #12403, #12330)
- Fix W4A16 AWQ bias handling on SM100 and add bias support to `WeightOnlyQuantLinearMethod` (#12190, #12317)
- Fix MiniMax model loading and multimodal loading error propagation (#12182, #12331)
- Fix MTP/DSA reliability, PARD accuracy, and NVFP4 MoE mixed-precision scales (#12010, #12360, #12240)
- Fix DGX Spark multi-node hangs, cross-node rollout issues in Verl, and `CUDA_VISIBLE_DEVICES` propagation in scripts (#12316, #11924, #12370)
- Fix build and runtime issues for SM103 context-attention kernels, L40s IB transfers, LlavaNext dtype fallback, and MnnvlMemory resource cleanup (#12248, #12152, #12169, #11979)
- Add warmups to avoid AIPerf timeouts and I2V torch.compile recompilation (#12178, #12351)
- Pre-cache aesthetic predictor weights to avoid VBench 429 failures (#12127)
- Documentation
- Test & Infra
- Limit pre-merge pre-commit checks to changed files (#11379)
- Use CPU affinity instead of raw CPU count for default build parallelism (#12167)
- Add broader performance, accuracy, and end-to-end coverage for Nemotron, DeepSeek-V3.2, disaggregated serving, FLUX, and DSA host-cache offload (#12184, #12142, #12275, #12279, #12278, #12153)
- Update multi-node and MPI-related test coverage (#12075, #12300)
- Add SSH key authentication support for SLURM clusters (#12172)
- Use the public PyTorch index as a CI fallback and update the CI allowlist (#12261, #12296)
- Enable type checking for sampler modules and improve Python KV transceiver coverage (#11678, #11574)
- Remove outdated QA coverage and refactor benchmarking and test infrastructure (#12277, #12344, #12124, #11720, #12192)
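The build-parallelism item above (#12167) rests on the difference between the number of CPUs in the machine and the number the current process may actually run on. A minimal Python sketch of that distinction follows; `default_build_jobs` is a hypothetical helper name, not the function used in the TensorRT-LLM build scripts.

```python
import os


def default_build_jobs() -> int:
    """Pick a default parallel job count from the CPUs this process may use."""
    if hasattr(os, "sched_getaffinity"):
        # Linux: the affinity mask reflects taskset and cpuset restrictions,
        # so a pinned or containerized build does not oversubscribe.
        return max(1, len(os.sched_getaffinity(0)))
    # Fallback for platforms without sched_getaffinity (e.g. macOS, Windows).
    return max(1, os.cpu_count() or 1)


if __name__ == "__main__":
    print(f"raw CPU count: {os.cpu_count()}, default jobs: {default_build_jobs()}")
```

On a host with many cores where the job is pinned to a few of them, the raw count would launch far more compile jobs than can run concurrently, which is the situation the affinity-based default avoids.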
What's Changed
- [TRTLLM-10929][feat] add fp8 combine in moe_a2a by @dc3671 in #11844
- [TRTLLM-9767][feat] Enable attention dp for qwen3-next. by @nv-guomingz in #10218
- [None][fix] Fix Disagg Perf Test No result.xml Bug by @chenfeiz0326 in #12211
- [https://nvbugs/5955188][fix] Fix harmony parsers for agentic coding use cases by @dongfengy in #12045
- [https://nvbugs/5973536][fix] Route DSA attention through MLA custom op for torch.compile compatibility by @yizhang-nv in #12186
- [https://nvbugs/5823135][fix] Fix min_tokens not respected when prompt is long by @JunyiXu-nv in #12166
- [None][doc] Blog18 for NVLinkOneSided AlltoAll. by @bobboli in #12195
- [None][chore] Remove closed bugs by @xinhe-nv in #12222
- [None][fix] Fix KV cache V2 OOM with separate draft KV cache (EAGLE3/MTP) by @yizhang-nv in #12188
- [None][doc] AutoDeploy: ad-model-onboard skill updates by @bmarimuthu-nv in #12234
- [TRTLLM-10569][infra] Only check the changed files in pre-commit in pre-merge CI by @yiqingy0 in #11379
- [https://nvbugs/5948878][fix] fix lost requests by @bo-nv in #12197
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #12218
- [None][chore] fix deepep trtllm backend MXFP4 by @leslie-fang25 in #12219
- [None][chore] Alltoall benchmark script refine (second time). by @bobboli in #12192
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #12220
- [None][fix] Fix W4A16 AWQ bias not applied on SM100 (Blackwell) by @Tracin in #12190
- [None][fix] Export computed env vars to env_vars.json and fix port allocation in disagg benchmark by @qiaoxj07 in #12140
- [TRTLLM-11288][fix] Adapt LTX2 pipeline to CompilationConfig warmup interface by @luyiyun1021 in #12232
- [https://nvbugs/5955927][fix] Add warm up before aiperf to fix timeout issue. by @dominicshanshan in #12178
- [None][refactor] Improve request management in sampler by @Funatiq in #11861
- [None][chore] Use affinity rather than CPU count for default build parallelism by @achartier in #12167
- [None][feat] Support kv cache in Trtllm-Gen attention backend by @yihwang-nv in #11667
- [None][docs] Update nemotron 3 super deployment to include tool calling and reasoning parser by @tijyojwad in #12215
- [None][fix] Add more models to increase perf test coverage by @chenfeiz0326 in #12184
- [TRTLLM-9521][feat] Unfuse indexer.wk from attention GEMM for DS-V3.2 NVFP4 by @peihu-nv in #11989
- [https://nvbugs/5879588][fix] fix MiniMax model loading bugs by @jmydurant in #12182
- [TRTLLM-10333][feat] Add energy metrics in trtllm-serve and benchmark… by @JunyiXu-nv in #11855
- [None][test] Update nemotron super test cases with official ckpt. by @nv-guomingz in #12142
- [None][fix] Reliability fixes for MTP with DSA and support host cache offload for DSA by @dmtri35 in #12010
- [None][infra] Waive 5 failed cases for main in post-merge 2599 by @ZhanruiSunCh in #12283
- [None][infra] use public torch index as CI backup by @tburt-nv in #12261
- [TRTLLM-11362][feat] Add batch generation support to visual gen pipelines by @karljang in #12121
- [https://nvbugs/5973801][fix] exclude subproc_worker_timer from thread leak checks by @MrGeva in #12286
- [#11432][feat] AutoDeploy: Enable fp8 quantization fusion part 1 by @galagam in #11910
- [#10931][feat] AutoDeploy: one-model spec dec by @lucaslie in #11701
- [https://nvbugs/5973536][fix] Add NVFP4+FP8KV+MTP accuracy specs for DeepSeek-V3.2-Exp by @yizhang-nv in #12269
- [#11368][fix] FP4 CUTLASS GEMM shared memory overflow on GB10 (SM121) by @mihai-chiorean in #12141
- [TRTLLM-11267][feat] Add audio support for nemotron by @2ez4bz in #12191
- [None][feat] GLM 5 support and DSA MTP fixes by @NVShreyas in #11990
- [None][fix] Relax MoE test tolerance for fp16 TP mode accuracy mismatch by @xxi-nv in #12244
- [None][test] update function multi nodes test by @xinhe-nv in #12075
- [TRTLLM-11285][feat] Fuse indexer wk + weights_proj into single GEMM in TF32 for DS-V3.2 by @peihu-nv in #12055
- [None][docs] Fix AGENTS.md accuracy and reduce context bloat by @kaiyux in #12258
- [None][doc] Update README. by @bobboli in #12307
- [None][test] Add E2E logprobs test for disaggregated serving via OpenAI API by @yingguo-trt in #12275
- [https://nvbugs/5981841][fix] AutoDeploy: Disable match_swiglu_pattern for Llama 3.3 70B Instruct by @govind-ramnarayan in #12299
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #12293
- [https://nvbugs/5969726][fix] exclude IB transfer on L40s by @chuangz0 in #12152
- [TRTLLM-9019][feat] Expose video_pruning_rate as llmargs and fix nano-v2-vl by @Wanli-Jiang in #12194
- [TRTLLM-11517][feat] Add TLLM_PROFILE_LOG_RANKS env var to control per-rank step logging by @longlee0622 in #12263
- [None][chore] Bump version to 1.3.0rc9 by @yuanjingx87 in #12295
- [TRTINFRA-7698][infra] - Add SSH key authentication support for SLURM clusters by @mlefeb01 in #12172
- [None][infra] Update CI allowedlist by @yuanjingx87 in #12296
- [TRTLLM-8804][chore] enable type checking for sampler modules by @ixlmar in #11678
- [None][feat] Add fused allreduce+RMSNorm op and optional residual in … by @lfr-0531 in #12201
- [https://nvbugs/5969206][fix] BREAKING: Setting default value of KV cache transfer timeout to 60s by @pcastonguay in #12249
- [None][infra] PLC nightly source code scanning by @yuanjingx87 in #12124
- [None][fix] LlavaNext dtype fallback when text_config.torch_dtype is None by @indrajit96 in #12169
- [#11694][feat] AutoDeploy: Improve the piecewise CG memory usage by @nvchenghaoz in #11993
- [https://nvbugs/5979443][chore] Refine the trtllm MoE unit test by @leslie-fang25 in #12318
- [TRTLLM-11257][fix] release GPU memory and FDs in MnnvlMemory on pidfd failure to prevent leak by @zhaoyangwang-nvidia in #11979
- [None][test] Fix mpi-type issue and add wideep acc test to dev's l0 local flow by @fredricz-20070104 in #12300
- [None][fix] Fix the issue of excluding all context attention kernels when building for sm103 by @yifeizhang-c in #12248
- [None][infra] Waive 4 failed cases for main in post-merge 2603 by @ZhanruiSunCh in #12334
- [https://nvbugs/5937478][test] Add RCCA test for DeepSeek-V3.2 multi-turn tool_call encoding by @crazydemo in #12279
- [https://nvbugs/5389100][test] Remove TensorRT integration test list and add trtllm-serve for test_perf.py by @yufeiwu-nv in #12277
- [#11526][chore] AutoDeploy accuracy tests: use nemotron-3 official checkpoints by @galagam in #12243
- [TRTLLM-10407][perf] Enable CuteDSL indexer_top_k in model by @limin2021 in #12236
- [None][test] Add DSA host cache offload tests to CI and QA test lists by @longlee0622 in #12278
- [TRTLLM-10076][feat] Serve CLI improvements: renames, new flags, and mm_embedding_serve enhancements by @JunyiXu-nv in #12105
- [None][chore] Refine kv cache buffer calculation by @yihwang-nv in #12274
- [None][feat] Constraint-based memory partitioning to KVCacheManagerV2 by @lowsfer in #12212
- [None][infra] Waive 5 failed cases for main in post-merge 2604 by @ZhanruiSunCh in #12345
- [None][feat] Enable speculative decoding in TrtllmGen attention backend by @yihwang-nv in #12267
- [https://nvbugs/5893116][fix] fix disagg llama oom by @chuangz0 in #12281
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #12328
- [https://nvbugs/5808603][fix] Add bias support to WeightOnlyQuantLinearMethod by @stnie in #12317
- [https://nvbugs/5949524][fix] Fix hang issue on DGX-Spark multinode by @JennyLiu-nv in #12316
- [None][chore] Improved test coverage for Python KV Transceiver by @ekou24 in #11574
- [#10607][feat] added AutoDeploy serving perf test with Super test by @MrGeva in #12287
- [#12183][fix] Fix TRTLLM-Gen NVFP4 MoE scales for mixed-precision che… by @tcherckez-nvidia in #12240
- [TRTLLM-11358][test] Add trtllm-serve e2e tests for FLUX by @JunyiXu-nv in #12153
- [None][perf] enable flashinfer mlp activation and fix piecewise graph for gemma3-1B by @amukkara in #12131
- [https://nvbugs/5875031][fix] Compile XQA with sm_120f by @pamelap-nvidia in #12170
- [None][fix] Properly raise errors from multimodal loading by @2ez4bz in #12331
- [#11992][fix] Handle GUIDE_TYPE_STRUCTURAL_TAG in gRPC request manager by @CatherineSue in #12330
- [TRTLLM-10688][fix] fix cross-node rollout issues in verl by @hchings in #11924
- [None][fix] Relax W8A16 MoE test tolerance for DTP mode by @xxi-nv in #12335
- [https://nvbugs/5964329][fix] fix PARD accuracy issue by @cascade812 in #12360
- [None][fix] Pass CUDA_VISIBLE_DEVICES as script arg instead of srun --export by @qiaoxj07 in #12370
- [None][fix] return an explicit error if the requests can't be schedul… by @Tabrizian in #12206
- [None][feat] Initial Qwen3.5 text model support for PyT backend (BF16/FP8) by @rosenrodt in #12242
- [https://nvbugs/5725811][test] Remove outdated llama-v4 and ministral-8b models out of QA scope by @yufeiwu-nv in #12344
- [TRTLLM-10077][feat] Add 'auto' option for tool and reasoning parsers by @JunyiXu-nv in #12104
- [https://nvbugs/5814350][fix] Fix OOM killed during weight loading in disaggregated server by @yingguo-trt in #12377
- [TRTLLM-11357][feat] Support interleaved thinking for trtllm-serve by @JunyiXu-nv in #12199
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #12363
- [None][doc] Optimize the tech blog sequence. by @nv-guomingz in #12386
- [TRTLLM-12250][feat] added lm head sharding by @greg-kwasniewski1 in #12252
- [TRTLLM-9523][chore] Refactor the transfer logic (step 6) by @Shixiaowei02 in #12231
- [https://nvbugs/5895249][fix] Update test waives by @greg-kwasniewski1 in #12247
- [TRTLLM-11497][fix] Add I2V warmup to prevent torch.compile recompilation by @luyiyun1021 in #12351
- [TRTLLM-11287][feat] Implement python based scheduler for KVCacheManagerV2 by @lancelly in #11939
- [https://nvbugs/5961414][fix] Pre-cache aesthetic predictor weights to avoid VBench 429 errors by @chang-l in #12127
- [TRTLLMINF-10][chore] move repeated apt-get installs into tritondevel Docker … by @dpitman-nvda in #11720
- [https://nvbugs/5991576][fix] fix dummy request crash with PP + ADP + disagg + block reuse by @Tabrizian in #12403
- [None][feat] Interactive recipe selector with curated configs and button-grid UI by @venkywonka in #11917
- [TRTLLM-11587][feat] Enable chunked prefix for Nemotron models on sm120 by @pamelap-nvidia in #12414
- [https://nvbugs/5983390][perf] Kernel fusions in _gather_k_cache_for_chunk of Indexer in DSA by @hyukn in #12322
- [None][perf] Cache FlashMLA tile-scheduler metadata across attention layers by @bobboli in #12161
- [None][doc] Fix invalid links in tech blogs. by @nv-guomingz in #12425
New Contributors
- @mihai-chiorean made their first contribution in #12141
- @indrajit96 made their first contribution in #12169
Full Changelog: v1.3.0rc8...v1.3.0rc9
v1.3.0rc8
2026-03-18
Highlights
-
Model Support
- Nemotron 3 Super support
- Add tool parser support for GLM-4 models (#11986)
- Implement dynamic resolution for Nemotron VL (#11894)
- Enable mixed quantization support for Nemotron-H Mamba (#11972)
- Add VisualGen FA4 attention backend support (#11697)
- VisualGen support for LTX-2, Wan and FLUX (#12009)
- Add TRTLLM-Gen kernels for GLM4.7 and support `groupsTokensHeadsQ` and `e2m1` output (#11643)
- Support attention-DP for TRTLLM-Gen NVFP4 MoE (#12156)
-
API
-
Feature
- Add basic SSM support in `KVCacheManagerV2` (#11976)
- Improve KV event batching (#11883)
- Add 2FP4 / Arcquant support (#11333)
- Adapt the transceiver to manager v2 (step 6) (#11978)
- Add shared expert LoRA support for MoE models in the PyTorch backend (#11760)
- Add dynamic draft length on the one-model speculative decoding path (#10860)
- Enable configurable warmup shapes for VisualGen (#12107)
- Add FlashInfer API support for `TRTLLMGenFusedMoE` (#10453)
- Add Python cache transceiver support for gen-first workflow (#11941)
-
Fix
- Upgrade Cutlass version (#11956)
- Fix DS v32 tool calling type and parse errors (#11935)
- Fix protobuf and `aiohttp` vulnerabilities (#11898)
- Fix NVFP4 sharding (#11618)
- Fix Kimi-K2.5 accuracy test skip condition and reference configs (#11930)
- Pass `sparse_attn_config` from `effective_draft_config` for one-model draft KV cache (#12032)
- Fix MTP advanced sampling top-k IMA (#12088)
- Revert refactor of the KV connector integration in py_executor, which caused issues with KVBM (#11872)
- Fix sharding overwrite with multiple graph modules (#12051)
- Fix various agentic flow issues (#12061)
- Split `mContextChunkSize` into per-target and per-draft fields (#12058)
- Fix `ValueError` and missing decoding statistics for MTP (#12063)
- Improve NCCL library load stability (#12015)
- Disable TRTLLM-Gen routing PDL due to NaN issues (#11994)
- Enforce a minimum `NVSHMEM_QP_DEPTH` of 128 for DeepEP low latency (#12100)
- Narrow a bare `except` clause and use identity checks for `None` (#12041)
- Fix MoE DeepEP hangs caused by non-deterministic GC (#12060)
- Fix `KVCacheManagerV2` shrink behavior for the last level and improve `init_ratio` (#12112)
- Fix Mamba cache handling for PP > 1 (#12146)
- Handle `anyOf` parameter schemas in the Qwen3Coder tool parser (#12173)
- Add explicit errors for intermediate-size misalignment with the FP8 block size (#12101)
- Fix DeepEP with the TRTLLM MoE backend for sequence length 1 (#12158)
- Improve port retry loops and exception handling (#12225)
- Add streaming support for `no </think>` on Nemotron models (#12176)
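As a small illustration of the code-hygiene fix listed above (#12041), the sketch below shows the two idioms side by side: an identity check for `None` and an `except` clause narrowed to the one failure that is expected. The `parse_port` helper and its default are hypothetical and do not come from the TensorRT-LLM code base.

```python
from typing import Optional


def parse_port(raw: Optional[str], default: int = 8000) -> int:
    """Parse a port string, falling back to `default` when it is absent or invalid."""
    # Identity check: `raw is None` cannot be fooled by objects overriding `==`.
    if raw is None:
        return default
    try:
        return int(raw)
    # Narrow clause: catch only the parse failure. A bare `except:` would also
    # swallow KeyboardInterrupt and SystemExit and hide unrelated bugs.
    except ValueError:
        return default
```

For example, `parse_port("8080")` returns the parsed value, while `parse_port(None)` and `parse_port("abc")` both fall back to the default.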
-
Documentation
-
Benchmark
- Add QA perf test cases with L0 local mode (#12022)
- Align performance benchmark output format (#12067)
- Improve sampler performance by replacing `torch.where` with `masked_fill_` (#11949)
- Add a fused `cat+fp8_quantize` CUDA kernel for the DSA indexer (#11899)
- Optimize long-sequence token-parallel prefill for the DSA indexer (#11871)
- Reduce `logprobs=0` overhead in `TorchSampler` (#11983)
- Refine AlltoAll benchmark scripts (#11649)
- Optimize the Q3N decode kernel with IO reads (#11344)
- Fix disaggregated gen-only benchmark coverage (#12091)
- Fix MPI issues and port conflicts in disaggregated performance tests (#12020)
- Add GB200 performance sanity tests to the QA test database (#11882)
- Refactor parallel VAE support (#12123)
- Optimize 6KD FP8 blockscale GEMM (#11502)
- Optimize Qwen3.5 performance (#11581)
- Restore 3 disaggregated gen-only tests (#12159)
-
Test & Infra
- Fix disaggregated SKU coverage (#12065)
- Fix upload build info branch handling and ensure it always runs in post steps (#12025)
- Fix the CI issue for Mistral Large3 (#12073)
- Enable more KV connector priority tests in CI (#11892)
- Add speculative decoding tests for `exclude_input_in_output=true` (#12080)
- Add E2E tests for the KV cache connector async loading path (#12053)
- Change the image used for the CI preparation step (#12086)
- Add the `verl` stage in CI (#11306)
- Add multi-node E2E and accuracy cases on DGX-Spark (#12110)
- Update NumPy to version 2 (#11280)
What's Changed
- [None][feat] Add Auto-Deploy dashboard failures analysis skill by @tcherckez-nvidia in #12033
- [https://nvbugs/5820511][fix] Upgrade Cutlass version by @pamelap-nvidia in #11956
- [None][feat] Add AD model list validation checks to pre-commit and PR… by @tcherckez-nvidia in #12036
- [None][chore] Clarify DCO sign-off and co-author guidelines in AGENTS.md by @kaiyux in #12034
- [TRTLLM-7784][feat] Basic SSM support in KVCacheManagerV2 by @lowsfer in #11976
- [None][test] Add QA's perf test cases with L0 local mode by @fredricz-20070104 in #12022
- [TRTLLM-11246][feat] Add tool parser support for GLM-4 models by @JunyiXu-nv in #11986
- [https://nvbugs/5937478][fix] Fix DS v32 tool calling type and parse error by @JunyiXu-nv in #11935
- [TRTLLM-11135][fix] Fix vulnerabilities protobuf and aiohttp by @yiqingy0 in #11898
- [None][chore] Align perf benchmark output format by @yingguo-trt in #12067
- [None][chore] Improve sampler performance by replacing torch.where with masked_fill_ by @stnie in #11949
- [None][infra] Waive 1 failed cases for main in post-merge 2582 by @ZhanruiSunCh in #12069
- [TRTLLM-10421][perf] Add fused cat+fp8_quantize CUDA kernel for DSA indexer by @kaiyux in #11899
- [None][test] Fix disagg sku by @fredricz-20070104 in #12065
- [https://nvbugs/5892646][perf] Long-sequence token-parallel optimization for DSA indexer prefill by @nvxuanyuc in #11871
- [TRTLLM-11265][feat] Implement dynamic resolution for Nemotron VL by @2ez4bz in #11894
- [https://nvbugs/5708901][perf] reduce logprobs=0 overhead in TorchSampler by @ixlmar in #11983
- [None][feat] NVFP4 TRTLLM-Gen MoE for AutoDeploy (Nemotron Super) by @tcherckez-nvidia in #11652
- [https://nvbugs/5963896][fix] Remove test `test_visual_gen_quickstart` on A10 by @chang-l in #12048
- [TRTLLM-11535][feat] Fixed NVFP4 sharding by @greg-kwasniewski1 in #11618
- [None][fix] Improve KV Event Batching by @jthomson04 in #11883
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #12047
- [TRTLLM-11276][fix] Fix Kimi-K2.5 accuracy test skip condition and reference configs by @lancelly in #11930
- [https://nvbugs/5919026][fix] Pass sparse_attn_config from effective_draft_config for one-model draft KV cache by @chenfeiz0326 in #12032
- [None][fix] MTP Advanced Sampling Topk IMA by @IzzyPutterman in #12088
- [None][fix] Revert "[None][chore] KV Connector Refactor (#11078)" by @jthomson04 in #11872
- [None][chore] Bump version to 1.3.0rc8 by @yuanjingx87 in #12090
- [None][chore] Refine AlltoAll benchmark scripts. by @bobboli in #11649
- [None][feat] 2FP4 / Arcquant. by @Tracin in #11333
- [None][fix] Fix Upload Build Info branch and run in post always by @mzweilz in #12025
- [TRTLLM-11366][feat] Add dedicated virtual memory tag for model weights, configurable restore mode by @tongyuantongyu in #11889
- [https://nvbugs/5961430][fix] Fix CI issue of Mistral Large3 by @byshiue in #12073
- [None][test] add Perf sanity gb200 test into QA test db by @xinhe-nv in #11882
- [None][infra] Waive 2 failed cases for main in post-merge 2584 by @ZhanruiSunCh in #12108
- [None][chore] Waive mpi hang test case by @jieli-matrix in #12077
- [None][chore] re-enable benchmark test in post merge by @zhenhuaw-me in #12035
- [None][feat] Mamba optimization and mixed quantization support for nemotron-h by @Wanli-Jiang in #11972
- [None][fix] Various fixes for agentic flow by @2ez4bz in #12061
- [https://nvbugs/5936322][fix] Fix sporadic port collision in multigpu AutoDeploy tests by @MrGeva in #11913
- [TRTLLM-9523][feat] Adapting the transceiver to manager v2 (step 6) by @Shixiaowei02 in #11978
- [TRTLLM-11928][feat] Fix sharding overwrite with multiple graph module by @greg-kwasniewski1 in #12051
- [https://nvbugs/5948539][fix] Fix disagg gen-only benchmark by @Tabrizian in #12091
- [None][fix] Split mContextChunkSize into per-target/draft fields by @Hrithvik-Alex in #12058
- [None][fix] Fix ValueError and missing decoding statistics for MTP by @cascade812 in #12063
- [None][fix] Enable more KV connector priority tests in CI by @jthomson04 in #11892
- [https://nvbugs/5923949][fix] Improve NCCL library load stability by @nv-lschneider in #12015
- [None][feat] Enable non-gated activation to the new MoE test by @IwakuraRein in #11996
- [None][infra] Update CI allow list by @yuanjingx87 in #12119
- [None][chore] Unwaiving disagg tests failing with address in use error by @pcastonguay in #12085
- [https://nvbugs/5955170][fix] Disable TRTLLM GEN Routing PDL due to nan issue by @dongfengy in #11994
- [None][fix] Enforce minimum NVSHMEM_QP_DEPTH of 128 for DeepEP low latency by @Tabrizian in #12100
- [None][refactor] parallel vae refactor by @NVShreyas in #12123
- [https://nvbugs/5826604][test] Remove test waive for Llama3.1 8B bfloat16 4gpu timeout … by @syuoni in #12092
- [TRTLLM-11257][infra] Unwaive TestDeepSeekR1::test_fp8_blockscale[throughput_mtp] test case by @zhaoyangwang-nvidia in #12059
- [None][infra] Waive 2 failed cases for main in post-merge 2586 by @ZhanruiSunCh in #12134
- [None][feat] Optimize the q3n decode kernel with IO read by @JadoTu in #11344
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #12093
- [TRTLLM-11092][feat] add support for visual gen FA4 attention backend by @o-stoner in #11697
- [https://nvbugs/5955173][fix] Add abort method for GenerationResultBase by @JunyiXu-nv in #11970
- [None][test] Add speculative decoding test with exclude_input_in_output=true by @StanleySun639 in #12080
- [None][feat] Add shared expert LoRA support for MoE models in PyTorch backend by @achartier in #11760
- [https://nvbugs/5846166][bug] Fix Disagg Perf Test's MPI Issue and Port Conflict by @chenfeiz0326 in #12020
- [TRTLLM-10244][doc] Add deployment guide for Nemotron 3 Super by @nv-guomingz in #12129
- [None][fix] Narrow bare except clause and use identity check for None by @edenfunf in #12041
- [TRTLLM-10303][feat] Deprecate trtllm-serve CLI options by @JunyiXu-nv in #12106
- [#11800][fix] Add keepalive ping tolerance and context.abort to gRPC server by @CatherineSue in #11992
- [None][test] Add e2e tests for KV cache connector async loading path by @Tabrizian in #12053
- [TRTLLMINF-11][chore] Change image used for Preparation step of CI by @dpitman-nvda in #12086
- [https://nvbugs/5973199][fix] support attn-dp TRTLLM-Gen NVFP4 MoE fu… by @tcherckez-nvidia in #12156
- [TRTLLM-10617][feat] LTX-2 Model Support by @yibinl-nvidia in #12009
- [TRTLLM-10695][ci] add verl stage in CI by @Superjomn in #11306
- [None][feat] Optimize 6KD fp8 blockscale gemm by @CarstyYou in #11502
- [https://nvbugs/5949033][fix] Add 3 Disagg gen_only tests back by @chenfeiz0326 in #12159
- [TRTLLM-11037][bug] Fix MoE DeepEP hang caused by non-deterministic GC by @xxi-nv in #12060
- [None][feat] Add flashinfer api for TRTLLMGenFusedMoE by @rosong11 in #10453
- [None][chore] Add multinode e2e and accuracy cases on DGX-Spark by @JennyLiu-nv in #12110
- [TRTLLM-11207][requirements] Update numpy version to 2 by @Funatiq in #11280
- [None][chore] Fix KVCacheManagerV2 shrink for last level and improve init_ratio by @lowsfer in #12112
- [TRTLLM-10319][feat] Dynamic draft length on spec decode one-model path by @zheyuf in #10860
- [TRTLLM-11288][feat] Configurable warmup shapes for VisualGen by @luyiyun1021 in #12107
- [None][feat] add trtllm-gen kernels for glm4.7 and support groupsTokensHeadsQ + e2m1 output by @PerkzZheng in #11643
- [None][fix] Fixed mamba cache issue for pp>1 by @Wanli-Jiang in #12146
- [None][feat] Qwen3.5 perf optimizations by @suyoggupta in #11581
- [None][feat] Add mix-precision checkpoint support in AutoDeploy by @Fridah-nv in #12175
- [https://nvbugs/5944411][fix] Handle anyOf parameter schemas in Qwen3Coder tool parser by @tijyojwad in #12173
- [None][infra] Waive failed A10-PyTorch-1 test in pre-merge by @yuanjingx87 in #12207
- [None][fix] Add streaming support to no `</think>` for nemotron model by @tijyojwad in #12176
- [None][chore] Add explicit error for intermediate size misalignment with fp8 block size by @leslie-fang25 in #12101
- [https://nvbugs/5973316][fix] fix deepep with trtllm moe backend and seqlen one by @leslie-fang25 in #12158
- [TRTLLM-8922][feat] py cache transceiver for gen-first workflow by @reasonsolo in #11941
- [None][fix] remove test_llm_api_autodeploy.py::TestNemotronSuperV3::t… by @tcherckez-nvidia in #12193
- [None][infra] Waive 9 failed cases for main in post-merge 2593 by @ZhanruiSunCh in #12224
- [None][fix] port retry loop and exception handling by @MrGeva in #12225
New Contributors
- @Hrithvik-Alex made their first contribution in #12058
- @zhaoyangwang-nvidia made their first contribution in #12059
- @edenfunf made their first contribution in #12041
Full Changelog: v1.3.0rc7...v1.3.0rc8
v1.2.0
2026-03-13
Highlights
-
Model Support
- Added beta support for K-EXAONE, Nemotron Nano V3, Qwen3-Next and Qwen3-VL.
- Improved GPT-OSS, Nemotron, EXAONE, GLM, Starcoder2, Qwen3, KimiK2, DeepSeek v3.2 and Mistral Large 3 support and validation.
- Expanded Blackwell/Hopper/Ampere enablement including B300/GB200/GB300 and SM120/SM121/SM103 paths.
- Broadened low-precision and MoE capabilities (FP8/NVFP4/MXFP4/INT4-AWQ), including routing and kernels.
-
Features
- Speculative Decoding:
- Enabled MTP>1 support for DeepSeek v3.2
- Disaggregated Serving:
- Added service discovery mechanism for dynamic scaling
- Added support for cancelling requests
- Added NIXL-LibFabric support
- Added support for Mooncake transfer engine as a cache transceiver backend
- Sampling:
- Implemented batched sampling using FlashInfer sampling
- Added support for returning logprobs incrementally with streaming mode in PyTorch backend
- Added Beam Search support to TorchSampler
- Performance:
- Improved TorchSampler performance
- Enabled PDL by default and added PDL support for indexer TopK and additional kernels.
- Improved trtllm-gen kernels
- Enabled early exit with overlap scheduler
- Added NUMA-aware CPU affinity automatic configuration
- Expert Parallelism:
- Enabled EPLB for trtllm-gen and cutlass backend
- Enabled CuteDSL MoE with large EP
- Added CUDA graph support for DeepEP
- Multiple performance improvements
- Hardware:
- DGX Spark Support (Beta)
- Others:
- Helix parallelism support
- New Ray orchestrator type
-
Documentation
- Deployment Guides:
- Added comprehensive deployment guides for KimiK2, Qwen3 and Qwen3-Next.
- Added new guide on CPU Affinity configuration.
- Updated GPT-OSS guide.
- Developer Guides:
- Added developer guide about KV Cache Transmission.
- New section on MoE Expert Load Balance Analysis (Perfect Router) in Performance Analysis guide.
- New section on API Change Principles in LLM API Change guide.
- Feature Documentation:
- Created new guides for Additional Outputs, Helix Parallelism, KV Cache Connector, Ray Orchestrator, Sparse Attention and Torch Compile & Piecewise CUDA Graph.
- Also updated the Feature Combination Matrix and Paged Attention, IFB, and Request Scheduling guide.
- Tech Blogs: Published blogs on:
- Examples:
- Added new section on disaggregated serving service discovery method.
- Added examples for K-EXAONE, Nemotron Nano V2 VL and Nemotron Nano V3.
- Added RocketKV usage documentation.
-
Infrastructure Changes
- The base Docker image for TensorRT-LLM is updated to `nvcr.io/nvidia/pytorch:25.12-py3`.
- The base Docker image for TensorRT-LLM Backend is updated to `nvcr.io/nvidia/tritonserver:25.12-py3`.
- The dependent public PyTorch version is updated to 2.9.1.
- The dependent transformers version is updated to 4.57.3.
- The dependent triton version is updated to 3.5.1.
- The dependent NIXL version is updated to 0.8.0.
-
API Changes
- Breaking Changes:
- FlashInfer sampling now used by default with PyTorch backend.
- Changes to sampling strategy in some previously undefined cases.
- OpenAI API:
- Enabled n > 1 with PyTorch backend
- Added support for GET/DELETE v1/responses
-
Fixed multiple Issues
-
Known Issues
- DGX Spark: DGX Spark support is in beta. Only single-node configurations and the models listed above have been validated in this release.
- Disaggregated Serving: A hang may occur in disaggregated serving with context pipeline parallelism and generation tensor parallelism configurations.
v1.3.0rc7
March 11, 2026
Highlights
-
Model Support
- Support tensor parallelism of TRTLLM MoE backend for Nemotron-H model (#11470)
- Add Kimi-K2.5 text model support (NVFP4) (#11777)
- Add Helix CP support for DSV3.2 (#11507)
- Support mix quantization between shared experts and routed experts for DSV3 (#11215)
- Support Cohere Command A model (#11505)
- Extract embeddings as .safetensors and support float8-quantized models (#11180)
-
API
- Add --served-model-name option to serve command (#11711)
- Add flag to trtllm serve to override KV cache dtype (#11487)
- Use string stop/bad words in gRPC proto instead of pre-tokenized TokenSequence (#11888)
- Support multimodal image input in gRPC server (#11800)
- Expose use_python_scheduler in SchedulerConfig and add associated tests (#11884)
- Add max_gpu_total_bytes to control KVCacheManagerV2 capacity (#11907)
-
Feature
- Support PARD (Parallel Draft Model) in one-model speculative decoding (#11438)
- Enable autotuner for VisualGen and compilation config support (#11660)
- Add globaltimer-based timing backend for autotuner profiling (#11657)
- Support heterogeneous tokens_per_block (#11751)
- Refactor KVCacheManagerV2 to simplify new model support (#11749)
- Support Helix CP with GQA (#11570)
- Add option to skip KV cache memory estimation (#11714)
- Implement suffix automaton on device for speculative decoding and one-model support (#11434)
- Separate radix search tree implementation (#10862)
- Add support for expert_number <= 2048 and K <= 32 (#11510)
- Add support for bidirectional sliding window attention mask to fmha_v2 (#11212)
- Avoid duplicated computation with ADP + Helix CP in GQA (#11891)
- Add explicit video encode format support (#11830)
- Refactor video encoding to use ffmpeg CLI or pure Python fallback (#11672)
- Integrate CuTe DSL top-k kernel for Blackwell (#11900)
- Integrate suffix automaton with EAGLE3 and PARD (#11878)
- Add 5D A2A for fused Ulysses (#11787)
- Add SiLU to trtllm-gen MoE (#11663)
- Optimize by fusing nvfp4_quant into layernorm_gated for mamba2_mixer (#11473)
- Wire KVCacheBlock to UnifiedBlockTree using lookup-node pointers (#11919)
- Run extra general warmup to warm up memory pool (#10340)
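The expert_number and K limits in the list above (#11510) can be made concrete with a small sketch. It assumes, without confirmation from these notes, that the two values are the number of routed MoE experts and the top-K experts chosen per token. The function name, the softmax over the selected scores, and the pure-Python implementation are illustrative only and are not the kernel added by that change.

```python
import heapq
import math

# Limits quoted in the release note; treated here as inclusive upper bounds.
MAX_EXPERTS = 2048
MAX_TOP_K = 32


def route_token(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Hypothetical helper: weights are a softmax over the selected scores only.
    """
    num_experts = len(router_logits)
    if not 1 <= num_experts <= MAX_EXPERTS:
        raise ValueError("number of experts is outside the supported range")
    if not 1 <= k <= min(MAX_TOP_K, num_experts):
        raise ValueError("k is outside the supported range")

    # Pick the k largest logits together with their expert indices.
    top = heapq.nlargest(k, enumerate(router_logits), key=lambda pair: pair[1])

    # Numerically stable softmax restricted to the selected experts.
    peak = top[0][1]
    exps = [math.exp(score - peak) for _, score in top]
    total = sum(exps)
    return [(index, e / total) for (index, _), e in zip(top, exps)]
```

For example, `route_token([0.1, 2.0, -1.0, 2.0], k=2)` selects experts 1 and 3 with equal weight, while a request with `k=33` is rejected by the bounds check.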
-
Fix
- Add async worker to MTP/EAGLE3 sampler (#11573)
- Fix disaggregated cancellation (#11730)
- Use prefer_pinned() in pard.py (#11762)
- Release KVCacheManagerV2 memory immediately on shutdown (#11746)
- Remove duplicated MoE computation with Helix CP+DP (#11167)
- Register add+norm fallback pass for torch.compile in multi-GPU mode (#11739)
- Propagate logprobs from prefill to decode in disaggregated serving (#11727)
- Propagate logits from prefill to decode in disaggregated serving (#11767)
- Enable separate draft KV cache pool for aggregated mode and KVBM (#11689)
- Fix warnings when building moe_kernels.cu (#11703)
- Fix available_blocks typo in scheduler (#11801)
- Clean up memory in rollout process (#11658)
- Warm up maybe_compiled_cat in forward_context_with_chunked_prefill (#11743)
- Fix DeepEPLowLatency with CuTe DSL MoE backend (#11769)
- Fix FP8 per-tensor torch.compile graph break in dynamic quantization (#11759)
- Fix streaming generation logits and speed up logits testcase (#10637)
- Fix overly aggressive capacity scheduler (#11731)
- Use proper tokens when exclude_input_in_output is true (#9453)
- Move launch_dependent_grids after tmem free to fix race (#11812)
- Fix E/PD disaggregated chunked prefill bug (#11805)
- Fix SM120 issue for rms_norm with nvfp4_quant_fusion (#11774)
- Remove dead code (#11813)
- Fix KVCacheManagerV2 OOM and dummy request allocation in chunked prefill / pipeline parallel (#11710)
- Fix AttributeError when DSA indexer accesses non-DSA KVCacheManager (#11858)
- Override mMaxAttentionWindow with actual largest window size (#11842)
- Update check_is_moe to support mlp_layer_types after config.json update (#11477)
- Fix incorrect GPU timing in time breakdown under overlap scheduler (#11860)
- Fix OOM hang with NCCL_SYMMETRIC fallback during long-context inference (#11870)
- Fix position IDs input for Qwen3.5 text-only usage (#11877)
- Disable preload for Llama4 Scout (#11873)
- Fix formatting issue in tensorrt_llm/serve/openai_server.py (#11920)
- Prevent RuntimeError from dict mutation during iteration in EXAONE MoE weight mapper (#11862)
- Fix Nemotron MTP crash on SM90 (#11807)
- Fix Mistral Large3 + EAGLE bug (#11942, #11885)
- Fix TeaCache broken caching for FLUX.1 and FLUX.2 (#11868)
- Fix FLUX.1 TeaCache polynomial coefficients and defaults (#12007)
- Implement workaround for ClientPayloadError (#12018)
- Fix duplicate model entry in model list (#12029)
- Fix Python string truthiness bug in FMHA cubin selection (#11909)
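Two entries in the Fix list above name general Python pitfalls: string truthiness (#11909) and dict mutation during iteration (#11862). The sketch below reproduces each class of bug in isolation and shows one common way to avoid it. The helper names and sample data are hypothetical, and this is not the code changed by those pull requests.

```python
def parse_flag(value):
    """Interpret a string flag explicitly.

    Any non-empty string is truthy in Python, so bool("0") and bool("False")
    are both True; comparing against known spellings avoids that trap.
    """
    return value.strip().lower() in {"1", "true", "yes", "on"}


def drop_prefixed_keys(weights, prefix):
    """Delete keys that start with prefix without mutating during iteration.

    Deleting from a dict while looping over it directly raises
    "RuntimeError: dictionary changed size during iteration"; iterating over
    a snapshot of the keys (list(weights)) sidesteps the problem.
    """
    for key in list(weights):
        if key.startswith(prefix):
            del weights[key]
    return weights
```

With these helpers, `parse_flag("0")` is `False` even though `bool("0")` is `True`, and `drop_prefixed_keys({"a.x": 1, "b.y": 2}, "a.")` leaves only the `"b.y"` entry.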
-
Documentation
- Fix typos, grammar, and accuracy across documentation (#11766)
- Add sparse attention tech blog (#11644)
- Add known issue for disaggregated serving hang with asymmetric PP/TP (#11789)
- Fix documentation links (#11912)
- Replace “TensorRT-LLM” with “TensorRT LLM” (#11914)
- Add CI trigger and test-failure retrieval instructions to AGENTS.md (#11803)
-
Benchmark
- Vectorize quantize_fp8_blockwise with CUDA kernel (#11724)
- Use F.rms_norm for per-head QK normalization in VisualGen (#11798)
- Short-sequence MHA optimization for DSA MLA prefill (#11677)
- Parallel VAE harness and implementation for WAN (#11875)
- Add Triton FP8 blockwise quant kernel and autotuner bucket-skip for VisualGen (#11854)
- Optimize _prepare_inputs host time (#11704)
- Improve are_stop_words performance (#11196)
- Add DeepSeek RCCA performance test case (#11736)
- Add VisualGen benchmarking script (#11651)
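For readers unfamiliar with the FP8 blockwise quantization mentioned in the list above, here is a minimal pure-Python sketch of the general idea: each block of values gets its own scale so that the block's largest magnitude maps to the FP8 E4M3 maximum of 448. The 1-D layout, the default block size of 128, and the function name are assumptions for illustration. The sketch does not round to the FP8 grid and says nothing about how the CUDA or Triton kernels in this release are implemented.

```python
FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3


def quantize_blockwise(values, block_size=128):
    """Scale a flat list of floats block by block into the FP8 E4M3 range.

    Returns (scaled_values, scales). Multiplying a scaled value by its
    block's scale recovers the original value. Rounding to actual FP8 bit
    patterns is intentionally left out of this sketch.
    """
    scaled, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block)
        # An all-zero block keeps scale 1.0 to avoid dividing by zero.
        scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
        scales.append(scale)
        for v in block:
            q = v / scale
            # Clamp to guard against tiny floating-point overshoot.
            scaled.append(max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, q)))
    return scaled, scales
```

Using per-block scales rather than a single per-tensor scale limits the damage of an outlier to its own block, which is the usual motivation for this scheme.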
-
Test & Infra
- Add tests for all database configs (#11653)
- Move B200 test stage to AIHub (#11692)
- Support local wheel installation and add GB300 demo cases (#11742)
- Remove submodule pulls from TRT-LLM git checkouts (#11693)
- Add back WAN VBench test in CI (#11804)
- Add E2E test for cancelled disaggregated generation requests with overlap scheduler (#11795)
- Pass Nsight options to ray_executor and trigger profiling through collective_rpc (#11493)
- Add B200 multi-node tests DB (#11783)
- Add sanity tests for release 1.2 version (#11738)
- Add QA test case for trust-remote-code on multi-node failure (#11905)
- Fix model_name Starcoder 15B allowed-models issue (#11981)
- Upgrade xgrammar from 0.1.25 to 0.1.32 (#12016)
- Limit TileIRAS to CUDA 13.1 (#12042)
- Remove VisualGen benchmark test from YAML (#12027)
What's Changed
- [None][feat] Support tensor parallelism for nemotron-h model by @Wanli-Jiang in #11470
- [None][test] Add tests for all database configs. by @fsaady in #11653
- [https://nvbugs/5911143][fix] add async worker to MTP/Eagle3 sampler,… by @dhansen-nvidia in #11573
- [TRTLLM-10886][feat] Support PARD(Parallel Draft Model) in one-model spec dec by @ziyixiong-nv in #11438
- [None][fix] Fix disagg cancellation by @Tabrizian in #11730
- [None][fix] Use prefer_pinned() in pard.py by @mikeiovine in #11762
- [None][fix] Make KVCacheManagerV2 release mem immediately on shutdown by @lowsfer in #11746
- [TRTLLM-11115][feat] enable autotuner for visual gen + Compilation Config by @NVShreyas in #11660
- [None][chore] Minor fix in w4a8 mxfp4 mxfp8 test. by @Tracin in #11745
- [None][infra] Move B200 test stage to AIHub by @yuanjingx87 in #11692
- [None][infra] Waive failed cases for main on 02/27 by @EmmaQiaoCh in #11770
- [TRTLLM-11064][fix] Remove duplicated MoE Computation with Helix CP+DP by @brb-nv in #11167
- [TRTLLM-10386][fix] torch.compile: register add+norm fallback pass in multi-GPU mode by @luyiyun1021 in #11739
- [None][feat] Support heterogeneous tokens_per_block by @lowsfer in #11751
- [None][chore] Remove closed bugs by @xinhe-nv in #11527
- [None][test] local wheel installation support and add gb300 cases demo by @fredricz-20070104 in #11742
- [None][feat] Refactor cache manager v2 to simplify new model support by @jiaganc in #11749
- [https://nvbugs/5879614][fix] Waive test_guided_decoding_with_eagle3 xgrammar in disaggregated serving by @ziyixiong-nv in #11773
- [https://nvbugs/5911788][test] Waive test_llm_partial_update_weights[Qwen3/Qwen3-8B] by @liji-nv in #11785
- [None][feat] add globaltimer-based timing backend for autotuner profi… by @dhansen-nvidia in #11657
- [https://nvbugs/5926823][fix] Propagate logprobs from prefill to decode in disagg by @brb-nv in #11727
- [TRTLLMINF-9][chore] Remove submodule pulls from TRT-LLM git checkouts by @dpitman-nvda in #11693
- [https://nvbugs/5685010][fix] Delete test_eagle3_output_repetition_4gpus flaky assertions. by @zheyuf in #11725
- [None][fix] enable separate draft KV cache pool for aggregated + KVBM… by @zyang-Modular in #11689
- [TRTLLM-11058][feat] Support Helix CP with GQA by @brb-nv in #11570
- [None][perf] Vectorize quantize_fp8_blockwise with CUDA kernel by @karljang in #11724
- [https://nvbugs/5868616][fix] Fix warnings when building moe_kernels.cu by @yumin066 in #11703
- [None][chore] Add CI trigger and test failure retrieval instructions to AGENTS.md by @lucaslie in #11803
- [None][fix] Fix typo: avaiable_blocks -> available_blocks in scheduler by @kaiyux in #11801
- [TRTLLM-11568][feat] Fix collective calls by @greg-kwasniewski1 in #11632
- [None][perf] Use F.rms_norm for per-head QK normalization in visual gen by @karljang in #11798
- [TRTLLM-11185][test] Add back WAN VBench test in CI by @chang-l in #11804
- [TRTLLM-9782][feat] Support to skip KV cache memory estimation by @HuiGao-NV in #11714
- [None][doc] Fix typos, grammar, and accuracy across documentation by @kaiyux in #11766
- [None][fix] cleanup mem in rollout process by @hchings in #11658
- [None][feat] Add --served-model-name option to serve command by @slin1237 in #11711
- [None][chore] Update AGENTS.md by @lucaslie in #11809
- [None][fix] AutoDeploy: Fix shape handling for singleton prefill by @galagam in #11679
- [None][infra] Waive failed cases for main on 03/01 by @EmmaQiaoCh in #11811
- [None][feat] TRT-LLM Gen MoE finalize kernel optimization by @nekorobov in #11501
- [None][test] Add E2E test for cancelled disagg gen request with overlap scheduler by @Tabrizian in #11795
- [None][chore] pass nsight options to ray_executor and trigger profiling through collective_rpc by @davidmlw in #11493
- [TRTLLM-10962][feat] Refactor video encoding to use ffmpeg CLI or pur… by @JunyiXu-nv in #11672
- [https://nvbugs/5823212][fix] Warmup maybe_compiled_cat in forward_context_with_chunked_prefill by @yuantailing in #11743
- [None][feat] Extract embeding as .savetensors and support float8 quantized model by @nvyocox in #11180
- [https://nvbugs/5885070][fix] fix deepeplowlatency with cutedsl moe backend by @leslie-fang25 in #11769
- [None][fix] Fix FP8 per-tensor torch.compile graph break in dynamic quantization by @karljang in #11759
- [TRTLLM-9687][feat] Improve are_stop_words performance by @stnie in #11196
- [https://nvbugs/5883738][fix] fix bug for illegal memory access on Qwen3-235B-A22B-Thinking-2507-NVFP4 + Eagle3 by @sunnyqgg in #11474
- [#10693][chore] AutoDeploy: Add L1 tests from coverage dashboard by @marinayanov in #11530
- [https://nvbugs/5764627][fix] Fix generation logits with streaming and improve runtime of logits testcase. Also fixes https://nvbugs/5573238 by @stnie in #10637
- [https://nvbugs/5934461][fix] Propagate logits from prefill to decode in disagg by @brb-nv in #11767
- [#11726][feat] AutoDeploy: Fuse gemms of mixed children by @taylor-yb-lee in #11793
- [None][fix] Fix overly aggressive capacity scheduler by @jthomson04 in #11731
- [https://nvbugs/5689262][fix] use proper tokens when exclude_input_in_output is true by @lazykyama in #9453
- [https://nvbugs/5863912][fix] Fix with move launch_dependent_grids after tmem free by @benzh-2025 in #11812
- [https://nvbugs/5938603][fix] Fix E/PD disagg chunked prefill bug by @2ez4bz in #11805
- [None][test] add deepseek RCCA perf test case by @ruodil in #11736
- [None][fix] remove torch compile models arg by @NVShreyas in #11836
- [None][test] add b200 multi nodes tests db by @xinhe-nv in #11783
- [None][fix] Fix SM120 issue for rms_norm with nvfp4_quant_fusion by @Wanli-Jiang in #11774
- [None][infra] Waive failed cases for main for post-merge 2564 by @ZhanruiSunCh in #11848
- [https://nvbugs/5936502][fix] remove dead codes by @bo-nv in #11813
- [None][chore] a GitHub Action to assign the PR to the author by @zhenhuaw-me in #11673
- [None][infra] Fix a typo in waives.txt by @EmmaQiaoCh in #11852
- [None][test] Fix wrong lora config by @yufeiwu-nv in #11818
- [None][test] fix flaky issues by @xinhe-nv in #11814
- [None][fix] Fix OOM issue/dummy request allocation/chunked prefill/pp for KV Cache Manager V2 by @yizhang-nv in #11710
- [None][test] update waive list by @xinhe-nv in #11815
- [TRTLLM-9939][perf] Short-sequence MHA optimization for DSA MLA prefill by @kaiyux in #11677
- [None][refactor] Revisit attention interface for AutoDeploy by @lucaslie in #11796
- [None][feat] Add a flag in trtllm serve to support overriding kv cache dtype by @cjluo-nv in #11487
- [TRTLLMINF-9][chore] Use checkoutFile in mergeWaiveList to avoid full clone by @dpitman-nvda in #11794
- [None][chore] Refresh inferenceX configs in recipes by @venkywonka in #11595
- [TRTLLM-11042][feat] Implement suffix automaton on device for spec and support one model by @cascade812 with help from @mahmoudhas in #11434
- [https://nvbugs/5941681][fix] Handle dict type for speculative_config by @ziyixiong-nv in #11828
- [None][feat] Add Kimi-K2.5 text model support (NVFP4) by @lancelly in #11777
- [None][chore] Bump version to 1.3.0rc7 by @yuanjingx87 in #11864
- [https://nvbugs/5919026][fix] Fix AttributeError when DSA indexer accesses non-DSA kv_cache_manager by @ziyixiong-nv in #11858
- [TRTLLM-11184][feat] Explicit video encode format support by @JunyiXu-nv in #11830
- [None][test] Enable DeepGemm + DeepEPLowLatency MoE test combination by @Tabrizian in #11876
- [#10009][fix] Fix json_schema response_format to support OpenAI API w… by @JunyiXu-nv in #11497
- [https://nvbugs/5927620][fix] Override mMaxAttentionWindow with the actual largest window size by @ziyixiong-nv in #11842
- [None][feat] Support mix quantization between shared experts and routed experts for dsv3 by @dmtri35 in #11215
- [#11666][fix] Fix inmemory model dir detection by @capyun007 in #11753
- [None][infra] Waive 3 failed cases for main in post-merge 2566 by @ZhanruiSunCh in #11881
- [None][doc] Add sparse attention tech blog by @heyuhhh in #11644
- [TRTLLM-9392][feat] Support MoE output to alltoall's workspace for all the quantization recipe of trtllm-gen. by @bobboli in #11449
- [TRTLLM-10852][feat] Enhance logprobs functionality to always return prompt token logprobs in prompt logprobs by @stnie in #11235
- [None][fix] Fix typos, grammar, and formatting in comments and docstrings by @kaiyux in #11826
- [None][fix] Update check_is_moe into support mlp_layer_types after config.json update by @eagle705 in #11477
- [https://nvbugs/5946303][fix] Fix incorrect GPU timing in time breakdown under overlap scheduler by @luyiyun1021 in #11860
- [None][chore] Update autotuner by @jiahanc in #11859
- [None][chore] Handle failure in auto-assign author workflow by @zhenhuaw-me in #11906
- [https://nvbugs/5930934][fix] Fix OOM hang with NCCL_SYMMETRIC fallback during long-context inference by @peihu-nv in #11870
- [None][fix] Qwen3.5 fix positions ids input for text-only usage by @bmarimuthu-nv in #11877
- [None][fix] Refactor nanoV3+superV3 accuracy tests to load example config by @galagam in #11458
- [None][chore] Deprecate eagle3 2-model by @mikeiovine in #11761
- [#11819][fix] Disable preload for Llama4 scout by @taylor-yb-lee in #11873
- [None][chore] Fix format issue in tensorrt_llm/serve/openai_server.py by @chienchunhung in #11920
- [None][feat] Separate radix search tree implementation by @thorjohnsen in #10862
- [None][feat] Add support for expert_number<=2048 and K<=32 by @ChristinaZ in #11510
- [None][infra] Waive 1 failed cases for main in pre-merge 29212 by @ZhanruiSunCh in #11929
- [None][fix] remove leak check for kimi by @xinhe-nv in #11825
- [https://nvbugs/5907477][chore] unwaive test by @reasonsolo in #11896
- [TRTLLM-10956][infra] Support build-only mode for GenPostMergeBuilds job by @mzweilz in #11895
- [#11755][feat] AutoDeploy onboarding agent + Kimi K2.5 AD modeling code by @bmarimuthu-nv in #11780
- [None][fix] Prevent RuntimeError from dict mutation during iteration in EXAONE MoE weight mapper by @Bias92 in #11862
- [TRTLLM-11101][feat] VisualGen benchmarking script by @zhenhuaw-me in #11651
- [https://nvbugs/5820734][fix] Run extra general warmup to warm up memory pool by @liji-nv in #10340
- [None][fix] Fix nemotron super MTP crash on SM90 by @sunnyqgg in #11807
- [None][chore] Use cluster service discover in disagg CI tests by @ekou24 in #11242
- [None][feat] External Drafter One Model by @IzzyPutterman in #11758
- [None][chore] Update model list by @tcherckez-nvidia in #11827
- [#11578][fix] Use string stop/bad words in gRPC proto instead of pre-tokenized TokenSequence by @CatherineSue in #11888
- [None][feat] Add support for bidirectional sliding window attention mask to fmha_v2 by @djns99 in #11212
- [TRTLLM-11036][feat] Enable new moe test and clean the legacy moe test in the CI by @xxi-nv in #11817
- [None][infra] Waive 4 failed cases for main in post-merge 2571 by @ZhanruiSunCh in #11968
- [None][test] Fix deepseek-r1 OOM issue for H100 perf test by @yufeiwu-nv in #11948
- [None][fix] Remove incorrect Python import style rule from AGENTS.md by @yuxianq in #11940
- [https://nvbugs/5896577][fix] fix bug of mistral large3 with eagle by @byshiue in #11942
- [https://nvbugs/5819048][fix] unwaive test of qwen3-235b eagle3 by @byshiue in #11969
- [None][feat] Avoid duplicated computation with ADP + Helix CP in GQA by @brb-nv in #11891
- [https://nvbugs/5624818][fix] Add unittest for GPT-OSS non-paged_context_fmha by @pengbowang-nv in #11415
- [#10245][feat] AutoDeploy: Support Finegrained FP8 quantization by @bmarimuthu-nv in #10897
- [TRTLLM-11284][infra] Move large models test to post-merge by @EmmaQiaoCh in #11933
- [TRTLLM-11155][infra] Run multi-GPU tests even single-GPU tests are failed when use --disable-fail-fast by @yiqingy0 in #11740
- [None][fix] Refine tests/unittest/_torch/flashinfer/test_trtllm_flashinfer_symbol_collision.py to reduce jit-compile time by @yihwang-nv in #11890
- [#11422][feat] AutoDeploy: Piecewise cudagraph support Prototype by @nvchenghaoz in #11515
- [TRTLLM-11189][fix] VisualGen isolated TeaCache Wan fix by @o-stoner in #11964
- [https://nvbugs/5846166][fix] Update Perf Triage Scripts to Fix gen_only issue by @chenfeiz0326 in #11802
- [TRTLLM-11057][feat] Add Helix CP support for DSV3.2 by @brb-nv in #11507
- [#2912][feat] Support Cohere Command A model by @torotoki in #11505
- [TRTLLM-11259][perf] Parallel VAE harness and implementation for WAN by @NVShreyas in #11875
- [#11578][feat] support multimodal image input in gRPC server by @CatherineSue in #11800
- [TRTLLM-11093][feat] add 5D A2A for fused ulysses by @NVShreyas in #11787
- [TRTLLM-11189][fix] Fix TeaCache broken caching for FLUX.1 and FLUX.2 by @karljang in #11868
- [None][refactor] Request management in ScheduledRequests by @Funatiq in #11784
- [None][perf] Add Triton FP8 blockwise quant kernel and autotuner bucket-skip for visual gen by @chang-l in #11854
- [TRTLLM-11290][feat] Enable trtllm-serve E2E tests by @JunyiXu-nv in #11985
- [None][feat] Optimize by fuse nvfp4_quant to layernorm_gated for mamba2_mixer by @Wanli-Jiang in #11473
- [None][chore] Autodeploy: add models for sprint by @nvchenghaoz in #11999
- [None][infra] Update CI allow list 20260305 by @yuanjingx87 in #11965
- [None][chore] Mass integration of release/1.2 weekly - 6th by @dominicshanshan in #11934
- [None][fix] Fix Collect Perf Sanity Result's import requests Error by @chenfeiz0326 in #12002
- [TRTLLM-10956][infra] Skip updating gitlab status for GenPostMergeBuilds by @mzweilz in #11954
- [None][feat] add ReLU2 NVFP4 fusion for AutoDeploy with tests by @tcherckez-nvidia in #11957
- [TRTLLM-11159][feat] Wire KVCacheBlock to UnifiedBlockTree, replacing mPrevBlock/mNextBlocks with lookup-node pointers. by @SimengLiu-nv in #11919
- [#11166][infra] AutoDeploy: improve test organization in CI and add overview doc by @lucaslie in #11291
- [None][chore] Model update 260308 by @tcherckez-nvidia in #12011
- [None][infra] Update AutoDeploy CODEOWNERS coverage by @lucaslie in #12013
- [https://nvbugs/5732958][bug] Fix TestLlama4MinLatency::test_llama_allclose_to_hf failure by @nvpohanh in #10191
- [None][chore] Unwaive some skip for trtllm moe backend by @leslie-fang25 in #11975
- [TRTLLM-11134][feat] export VisualGen API and update doc by @zhenhuaw-me in #11911
- [https://nvbugs/5823783][test] add qa test case for trust-remote-code on multinode failure by @crazydemo in #11905
- [None][feat] Use max_gpu_total_bytes to control v2's capacity by @jiaganc in #11907
- [TRTLLM-11342][fix] Fix FLUX.1 TeaCache polynomial coefficients and default t… by @karljang in #12007
- [None][fix] Use try/except fallback for Pydantic ValidatorIterator in chat message parsing by @Wanli-Jiang in #11903
- [None][infra] Unwaive 2 cases on rtx-pro-6000d by @EmmaQiaoCh in #12003
- [TRTLLM-11276][chore] Expose use_python_scheduler in SchedulerConfig and add UTs/ITs for python scheduler by @lancelly in #11884
- [None][infra] Waive 7 failed cases for main in post-merge 2576 by @ZhanruiSunCh in #12014
- [https://nvbugs/5948878][fix] Implement workaround for ClientPayloadError by @yingguo-trt in #12018
- [TRTLLM-10407][feat] Integrate CuTE DSL top-k kernel for Blackwell by @limin2021 in #11900
- [TRTLLM-11148][perf] _prepare_inputs host time optimization by @hyukn in #11704
- [None][test] Fix model_name starcoder_15b is not in allowed_models issue by @yufeiwu-nv in #11981
- [None][infra] Waive 5 failed cases for main in post-merge 2578 by @ZhanruiSunCh in #12023
- [None][chore] AutoDeploy: re-enable nvfp4 superv3 accuracy test by @galagam in #11945
- [None][chore] Remove visual_gen benchmark test from YAML by @zhenhuaw-me in #12027
- [None][fix] Fix the model list as it had a dup model by @tcherckez-nvidia in #12029
- [https://nvbugs/5863806][fix] Fix Python string truthiness bug in FMHA cubin selection by @luyiyun1021 in #11909
- [None][feat] Upgrade xgrammar from 0.1.25 to 0.1.32 by @sunnyqgg in #12016
- [https://nvbugs/5924144][test] unwaive cpp/test_unit_tests.py::test_unit_tests[kernels-80] by @Funatiq in #11902
- [None][chore] limit tileiras to CUDA13.1 by @tburt-nv in #12042
- [None][feat] Add silu to trtllm-gen MoE by @IwakuraRein in #11663
- [TRTLLM-11045][feat] Integrate SA with EAGLE3 and PARD by @cascade812 in #11878
- [None][chore] waive test_visual_gen_quickstart by @tburt-nv in #12043
- [None][feat] NIXL support for hybrid model cache transfer by @NVShreyas in #11608
New Contributors
- @zyang-Modular made their first contribution in #11689
- @slin1237 made their first contribution in #11711
- @davidmlw made their first contribution in #11493
- @marinayanov made their first contribution in #11530
- @lazykyama made their first contribution in #9453
- @capyun007 made their first contribution in #11753
- @Bias92 made their first contribution in #11862
- @ekou24 made their first contribution in #11242
- @o-stoner made their first contribution in #11964
- @torotoki made their first contribution in #11505
- @IwakuraRein made their first contribution in #11663
Full Changelog: v1.3.0rc6...v1.3.0rc7
v1.3.0rc5.post1
March 7, 2026
What's Changed
- [None][chore] bump version to 1.3.0rc5.post1 by @tburt-nv in #11788
- [None][fix] Cherry pick cancel fix by @pcastonguay in #11790
- [https://nvbugs/5926823][fix] Cherry-pick: Propagate logprobs from prefill to decode in disagg (#11727) by @pcastonguay in #11792
- [https://nvbugs/5934461][fix] Cherry-picks 11767 (logits support in disagg) by @pcastonguay in #11832
- [https://nvbugs/5935104][fix] Cherry-pick Fix overly aggressive capacity scheduler by @pcastonguay in #11834
- [https://nvbugs/5938603][fix] Cherry-pick Fix E/PD disagg chunked prefill bug (#11805) by @pcastonguay in #11847
- [https://nvbugs/5930934][fix] Cherry-pick fix NCCL OOM hang by @pcastonguay in #11916
Full Changelog: v1.3.0rc5...v1.3.0rc5.post1