What is this PR about?
The initial `AgentExecutionEngine` implementation operates at the text level: newly generated tokens are decoded into text, the text is appended to the conversation history, and the history is re-tokenized for the next step.

This approach creates a critical issue due to tokenization ambiguity. For example, different token sequences such as `[53122, 316]` and `[393, 30002]` can both decode to the same text string, `' Pantom'`. Operating at the text level therefore leads to a discrepancy: the accumulated prompt and response tokens may not match the tokens the model actually saw and generated. This mismatch is a form of off-policy issue in reinforcement learning, and we observed it causing significant spikes in KL divergence.
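To make the ambiguity concrete, here is a minimal sketch with a toy vocabulary (the pieces assigned to these ids are invented for illustration; a real tokenizer's vocabulary and merge rules differ), showing that decode-then-retokenize can recover only one of several token sequences that produce the same text:

```python
# Hypothetical vocabulary: two different token-id sequences decode to the
# same string, but greedy re-tokenization recovers only one of them.
VOCAB = {
    53122: " Pant",
    316: "om",
    393: " Pan",
    30002: "tom",
}

def decode(token_ids):
    """Concatenate the vocabulary pieces for a sequence of token ids."""
    return "".join(VOCAB[t] for t in token_ids)

def retokenize(text):
    """Greedy longest-match tokenization over the toy vocabulary."""
    by_piece = sorted(VOCAB.items(), key=lambda kv: -len(kv[1]))
    ids = []
    while text:
        for tok_id, piece in by_piece:
            if text.startswith(piece):
                ids.append(tok_id)
                text = text[len(piece):]
                break
        else:
            raise ValueError(f"untokenizable text: {text!r}")
    return ids

seq_a = [53122, 316]   # " Pant" + "om"
seq_b = [393, 30002]   # " Pan"  + "tom"

# Both sequences decode to the same string...
assert decode(seq_a) == decode(seq_b) == " Pantom"
# ...but re-tokenizing that string recovers seq_a, not seq_b.
assert retokenize(" Pantom") == seq_a != seq_b
```

If the model generated `seq_b`, a text-level history would silently replace it with `seq_a` on the next step, so the trainer would score tokens the model never produced.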
How we fix it
To fix this, we introduced `rllm.rollout_assemble_mode=hybrid`. This mode bypasses text-level operations during trajectory generation: instead, it directly tracks the exact prompt and response tokens seen and generated by the model at each step, and assembles these token sequences later when forming the final trajectory.

We validate trajectories by checking that each step's token sequence is an exact prefix of the subsequent step's. If this prefix check fails, a re-tokenization discrepancy occurred, and we discard the trajectory by setting `response_masks = 0`.

Experiment Results
Experiments on Qwen/Qwen3-4B with a multi-turn search task on HotpotQA demonstrate the effectiveness of the new mode: the hybrid rollout assemble mode shows significantly more stable KL divergence and clipfrac values, while achieving on-par or better task performance compared to the original text-based method. wandb report
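For concreteness, the prefix check described above can be sketched as follows. This is a minimal illustration under assumed data shapes (per-step token-id lists and a per-token response mask); the function names and structures are hypothetical, not the actual rLLM implementation:

```python
def is_prefix(prev, curr):
    """True if token sequence `prev` is an exact prefix of `curr`."""
    return len(prev) <= len(curr) and curr[:len(prev)] == prev

def assemble_trajectory(step_token_ids, response_masks):
    """Assemble a trajectory from per-step token sequences (hypothetical sketch).

    `step_token_ids` holds the full token sequence the model saw at each
    step; `response_masks` marks which tokens of the final sequence are
    model-generated. If any step is not an exact prefix of the next, a
    re-tokenization discrepancy occurred, so the whole trajectory is
    discarded by zeroing the response mask.
    """
    for prev, curr in zip(step_token_ids, step_token_ids[1:]):
        if not is_prefix(prev, curr):
            return step_token_ids[-1], [0] * len(response_masks)
    return step_token_ids[-1], response_masks

# Consistent steps pass the check and keep their mask...
_, masks = assemble_trajectory([[1, 2], [1, 2, 3, 4]], response_masks=[0, 0, 1, 1])
assert masks == [0, 0, 1, 1]

# ...while a mismatching step (token 2 re-tokenized as 9) zeroes it.
_, masks = assemble_trajectory([[1, 2], [1, 9, 3, 4]], response_masks=[0, 0, 1, 1])
assert masks == [0, 0, 0, 0]
```

Zeroing the mask rather than repairing the tokens keeps the implementation simple: a discarded trajectory contributes no gradient, which is safer than training on tokens the model never actually produced.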