Guided decoding with xgrammar #3965

windreamer · 2025-09-12T09:16:51Z

Motivation

LMDeploy’s TurboMind backend is designed for high-performance inference, yet it still lacks guided decoding, a feature that is already available in the PyTorch backend and frequently requested by the community.
This PR closes the gap by bringing token-level, C++-native guided decoding to TurboMind while keeping the API fully compatible with the existing PyTorch backend.
The implementation is built on xGrammar (Apache-2.0), a high-performance C++ library that compiles JSON / Choice / Regex grammars into token FSMs and applies them with negligible overhead.

Modification

Build-system
- Add xgrammar as a header-only dependency via CMake FetchContent (CUDA & Python bindings disabled).
- Export xgrammar::tokenizer_info and xgrammar::grammar_compiler symbols under lmdeploy::xgrammar.
Core C++ changes
- DynamicDecodeLayer pipeline extended with two new layers:
  - GuidedDecodeMaskLayer: in setup() compiles / reuses grammar → builds per-request token bitmask; in forward() launches a light CUDA kernel to mask disallowed logits to -INF.
  - GuidedDecodeUpdateLayer: in forward() calls matcher->AcceptToken(output_id) to advance the FSM.
- Grammar compiler cache (LRU, keyed by schema hash) shared across all sessions to avoid re-compilation.
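
The shared grammar cache in the last bullet can be pictured with a small Python sketch. This is not the C++ code from the PR: `schema_key`, `GrammarCache` and the `compile_fn` callback are hypothetical names chosen for illustration, and SHA-256 over canonical JSON is only one plausible way to derive a schema hash.

```python
import hashlib
import json
from collections import OrderedDict


def schema_key(schema: dict) -> str:
    """Hash a JSON schema so that key order does not affect the result."""
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class GrammarCache:
    """Tiny LRU cache mapping schema hash -> compiled grammar object."""

    def __init__(self, compile_fn, capacity: int = 128):
        self._compile = compile_fn      # hypothetical compiler callback
        self._capacity = capacity
        self._entries = OrderedDict()   # oldest entry first
        self.misses = 0                 # number of real compilations

    def get(self, schema: dict):
        key = schema_key(schema)
        if key in self._entries:
            self._entries.move_to_end(key)     # mark as most recently used
            return self._entries[key]
        self.misses += 1
        compiled = self._compile(schema)
        self._entries[key] = compiled
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict least recently used
        return compiled
```

Two requests that send the same schema with different key order hit the same entry, so the grammar is compiled once. The sketch is not thread-safe; a cache shared across sessions would also need locking, which is omitted here.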
Python frontend
- Re-use existing guided_decoding utilities from PyTorch backend; no new API surface.
- turbo.TurboMindEngine now accepts the same response_format= / guided_json= / guided_choice= arguments.
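
Taken together, the mask and update layers described under Core C++ changes form a mask, sample, accept loop. The pure-Python sketch below illustrates that loop under stated assumptions: the bitmask layout (32 tokens per word, a set bit means the token is allowed) is assumed rather than taken from the PR, `ToyChoiceMatcher` is a hypothetical stand-in for a real grammar matcher, and greedy argmax replaces the actual sampling layer.

```python
import math


def pack_bitmask(allowed_ids, vocab_size):
    """Pack allowed token ids into 32-bit words (bit i of word w = token 32*w + i)."""
    words = [0] * ((vocab_size + 31) // 32)
    for t in allowed_ids:
        words[t // 32] |= 1 << (t % 32)
    return words


def apply_token_bitmask(logits, words):
    """Return a copy of logits with every disallowed token set to -inf."""
    return [
        x if (words[i // 32] >> (i % 32)) & 1 else -math.inf
        for i, x in enumerate(logits)
    ]


class ToyChoiceMatcher:
    """Hypothetical matcher: the output must equal one of the given token sequences."""

    def __init__(self, choices):
        self.choices = [tuple(c) for c in choices]
        self.prefix = ()

    def allowed(self):
        n = len(self.prefix)
        return {c[n] for c in self.choices if c[:n] == self.prefix and len(c) > n}

    def accept_token(self, token_id):
        """Advance the state machine; reject tokens the grammar does not allow."""
        if token_id not in self.allowed():
            return False
        self.prefix += (token_id,)
        return True


def guided_greedy_decode(step_logits, matcher, vocab_size):
    """Mask the logits, pick the best remaining token, then advance the matcher."""
    out = []
    for logits in step_logits:
        allowed = matcher.allowed()
        if not allowed:          # grammar is complete
            break
        masked = apply_token_bitmask(logits, pack_bitmask(allowed, vocab_size))
        token = max(range(vocab_size), key=masked.__getitem__)
        matcher.accept_token(token)
        out.append(token)
    return out
```

Masking before sampling means the sampler never sees a disallowed token as a finite candidate, and advancing the matcher after sampling keeps the next step's mask in sync with what was actually emitted.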

Checklist

- Pre-commit hooks (clang-format, flake8, mypy) passed.
- Documentation updated.

shell-nlp · 2025-09-16T15:34:45Z

good job!

lvhan028 · 2025-09-24T13:59:44Z

src/turbomind/layers/DynamicDecodeLayer.cc

    layers_.emplace_back(new LogitsProcessorLayer<float>{param});
+    layers_.emplace_back(new GuidedDecodeMaskLayer<float>{param});
    layers_.emplace_back(new SamplingLayer<float>{param});
+    layers_.emplace_back(new GuidedDecodeUpdateLayer<float>{param});
    layers_.emplace_back(new StopCriteriaLayer<float>{param});


The sampling-related classes are declared as templates, but the template parameter T does not appear to be used in any of the following:

- Member variables
- Member functions
- Base classes

How about removing the templates? @lzhangzz @irexyc, any comments?

lvhan028 · 2025-09-24T14:25:53Z

cc @zhulinJulia24: we may want to add CI coverage for the guided decoding functions.
Here is the guide: https://lmdeploy.readthedocs.io/en/latest/advance/structed_output.html

windreamer · 2025-09-25T01:11:32Z

> What's the size of the whl file if this PR is applied?

    ls -alh lmdeploy-0.10.0-cp310-cp310-linux_x86_64.whl
    -rw-rw-r-- 1 tianzhongbo tianzhongbo 92M Sep 25 09:07 lmdeploy-0.10.0-cp310-cp310-linux_x86_64.whl

It seems the package size will grow from 79M to 92M, an increase of 13M, or roughly 16%.

windreamer · 2025-09-28T09:05:43Z

I have also replaced the outdated outlines dependency with xgrammar in the PyTorch engine. This lets us:

- drop the numpy<2 restriction
- remove the buggy pyairports package from the dependencies

windreamer changed the title ~~Guided decoding with xgrammar~~ [WIP] Guided decoding with xgrammar Sep 12, 2025

windreamer force-pushed the guided_decoding_with_xgrammar branch 3 times, most recently from 8b3e766 to 8fd6d05 on September 12, 2025 09:44

windreamer force-pushed the guided_decoding_with_xgrammar branch 25 times, most recently from 0362250 to 8bcbfff on September 22, 2025 12:41

lvhan028 reviewed Sep 24, 2025

lmdeploy/turbomind/tokenizer_info.py (resolved)

lvhan028 reviewed Sep 24, 2025

requirements/runtime_cuda.txt (outdated, resolved)

lvhan028 reviewed Sep 24, 2025

src/turbomind/kernels/apply_token_bitmask_inplace_cuda.cu (outdated, resolved)

windreamer force-pushed the guided_decoding_with_xgrammar branch from bf3a8ea to 39b4d13 on September 24, 2025 16:46

windreamer added 12 commits September 25, 2025 12:40

feat(turbomind): bring xGrammar into build

b92b821

feat(turbomind): add skeleton for guided decoding layers

24215da

feat(turbomind): add implementation for naive bitmap mask with a loop

b131558

add ModelRequest support for xgrammar

acea4c2

feat: enable grammar init in turbomind

4cfacbb

fix: fix some bug and add initial tests

0517cf9

feat: restructure the interface

7e39dff

feat: speedup with cuda inplace kernel

fa438d3

fix: fix test case

e904f70

fix: use stream from context instead of the default stream

13d83a8

test: add matrix grammar test

d214421

fix: simplify the bitmap apply kernel

2355af6

windreamer force-pushed the guided_decoding_with_xgrammar branch from b75e4c9 to 2355af6 on September 25, 2025 04:41

lzhangzz reviewed Sep 25, 2025

src/turbomind/layers/DynamicDecodeLayer.cc (resolved)

src/turbomind/layers/sampling_layers/GuidedDecodeMaskLayer.cc (outdated, resolved)

feat: move tensor allocation to ctor

3c4cbdb

windreamer force-pushed the guided_decoding_with_xgrammar branch from 8413618 to 3c4cbdb on September 26, 2025 01:18

test: temporarily disable pytorch engine tests as it is faulty

ac34675

windreamer mentioned this pull request Sep 28, 2025

[Feature] Will the turbomind backend support guided_decoding? #2771

Open

feat: replace outlines with xgrammar in pytorch engine

488399c

windreamer requested a review from grimoire September 28, 2025 09:06

windreamer linked an issue Sep 28, 2025 that may be closed by this pull request

[Feature] Will the turbomind backend support guided_decoding? #2771

Open

test: move timm to test requirements

297effb
