What is this PR about?
The initial `AgentExecutionEngine` implementation operates at the text level: newly generated tokens are decoded into text, the text is appended to the conversation history, and the history is re-tokenized for the next step.

This approach creates a critical issue due to tokenization ambiguity. For example, different token sequences such as `[53122, 316]` and `[393, 30002]` can both decode to the same text string, `' Pantom'`. Operating at the text level therefore leads to a discrepancy: the accumulated prompt and response tokens may not match the tokens the model actually saw and generated. This mismatch is a form of off-policy issue in reinforcement learning, and we observed it causing significant spikes in KL divergence.
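To make the ambiguity concrete, here is a minimal sketch with a toy vocabulary (the pieces assigned to these ids are invented for illustration; a real tokenizer's vocabulary and merge rules differ), showing that decode-then-retokenize can recover only one of several token sequences that produce the same text:

```python
# Hypothetical vocabulary: two different token-id sequences decode to the
# same string, but greedy re-tokenization recovers only one of them.
VOCAB = {
    53122: " Pant",
    316: "om",
    393: " Pan",
    30002: "tom",
}

def decode(token_ids):
    """Concatenate the vocabulary pieces for a sequence of token ids."""
    return "".join(VOCAB[t] for t in token_ids)

def retokenize(text):
    """Greedy longest-match tokenization over the toy vocabulary."""
    by_piece = sorted(VOCAB.items(), key=lambda kv: -len(kv[1]))
    ids = []
    while text:
        for tok_id, piece in by_piece:
            if text.startswith(piece):
                ids.append(tok_id)
                text = text[len(piece):]
                break
        else:
            raise ValueError(f"untokenizable text: {text!r}")
    return ids

seq_a = [53122, 316]   # " Pant" + "om"
seq_b = [393, 30002]   # " Pan"  + "tom"

# Both sequences decode to the same string...
assert decode(seq_a) == decode(seq_b) == " Pantom"
# ...but re-tokenizing that string recovers seq_a, not seq_b.
assert retokenize(" Pantom") == seq_a != seq_b
```

If the model generated `seq_b`, a text-level history would silently replace it with `seq_a` on the next step, so the trainer would score tokens the model never produced.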
How we fix it
To fix this, we introduced `rllm.rollout_assemble_mode=hybrid`. This mode bypasses text-level operations during trajectory generation: instead, it directly tracks the exact prompt and response tokens seen and generated by the model at each step, and assembles these token sequences later when forming the final trajectory.

We validate trajectories by checking that each step's token sequence is an exact prefix of the subsequent step's. If this prefix check fails, a re-tokenization discrepancy occurred, and we discard the trajectory by setting `response_masks = 0`.

Experiment Results
Experiments on Qwen/Qwen3-4B with a multi-turn search task on HotpotQA demonstrate the effectiveness of the new mode: the hybrid rollout assemble mode shows significantly more stable KL divergence and clipfrac values, while achieving on-par or better task performance compared to the original text-based method. wandb report
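For concreteness, the prefix check described above can be sketched as follows. This is a minimal illustration under assumed data shapes (per-step token-id lists and a per-token response mask); the function names and structures are hypothetical, not the actual rLLM implementation:

```python
def is_prefix(prev, curr):
    """True if token sequence `prev` is an exact prefix of `curr`."""
    return len(prev) <= len(curr) and curr[:len(prev)] == prev

def assemble_trajectory(step_token_ids, response_masks):
    """Assemble a trajectory from per-step token sequences (hypothetical sketch).

    `step_token_ids` holds the full token sequence the model saw at each
    step; `response_masks` marks which tokens of the final sequence are
    model-generated. If any step is not an exact prefix of the next, a
    re-tokenization discrepancy occurred, so the whole trajectory is
    discarded by zeroing the response mask.
    """
    for prev, curr in zip(step_token_ids, step_token_ids[1:]):
        if not is_prefix(prev, curr):
            return step_token_ids[-1], [0] * len(response_masks)
    return step_token_ids[-1], response_masks

# Consistent steps pass the check and keep their mask...
_, masks = assemble_trajectory([[1, 2], [1, 2, 3, 4]], response_masks=[0, 0, 1, 1])
assert masks == [0, 0, 1, 1]

# ...while a mismatching step (token 2 re-tokenized as 9) zeroes it.
_, masks = assemble_trajectory([[1, 2], [1, 9, 3, 4]], response_masks=[0, 0, 1, 1])
assert masks == [0, 0, 0, 0]
```

Zeroing the mask rather than repairing the tokens keeps the implementation simple: a discarded trajectory contributes no gradient, which is safer than training on tokens the model never actually produced.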