
Conversation

@thwu1 (Contributor) commented on Oct 26, 2025

What is this PR about?
The initial AgentExecutionEngine implementation operates at the text level, decoding newly generated tokens into text and appending this text to the conversation history. This history is then re-tokenized for the next step.

This approach creates a critical issue due to tokenization ambiguity. For example, different token sequences like [53122, 316] and [393, 30002] can both decode to the same text string, ' Pantom'. Because of this, operating at the text level leads to a discrepancy: the accumulated prompt and response tokens may not be the same as the actual tokens the model saw and generated.
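
For illustration only (this snippet is not from the PR), here is a minimal round-trip check that exposes the problem, assuming a HuggingFace tokenizer; `roundtrip_matches` is just our name for the helper:

```python
# Minimal sketch, not part of this PR: the text-level path implicitly assumes
# that decode -> encode reproduces the exact token ids the model produced.
from transformers import AutoTokenizer

def roundtrip_matches(tokenizer, token_ids: list[int]) -> bool:
    """True only if re-tokenizing the decoded text yields the original ids."""
    text = tokenizer.decode(token_ids)
    return tokenizer.encode(text, add_special_tokens=False) == token_ids

# With tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B"), two different id
# sequences can decode to the same string (e.g. ' Pantom'), but encode() returns
# a single sequence for that string, so at most one of them passes this check.
```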

This mismatch is a form of off-policy issue in Reinforcement Learning, which was observed to cause significant spikes in KL divergence.

How we fix it
To fix this, we introduced rllm.rollout_assemble_mode=hybrid. This mode bypasses text-level operations during trajectory generation. Instead, it directly tracks the exact prompt and response tokens seen and generated by the model at each step. These token sequences are assembled later when forming the final trajectory.
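
To make this concrete, here is a minimal sketch of the step-level bookkeeping, under the assumption that each step's prompt extends the previous step's prompt plus response; `StepTokens` and `assemble_trajectory` are illustrative names, not the actual rLLM API:

```python
from dataclasses import dataclass

@dataclass
class StepTokens:
    prompt_ids: list[int]      # exact prompt tokens the model saw at this step
    response_ids: list[int]    # exact tokens the model generated at this step

def assemble_trajectory(steps: list[StepTokens]) -> tuple[list[int], list[int], list[int]]:
    """Concatenate per-step tokens into (prompt_ids, response_ids, response_mask)
    without ever decoding to text and re-tokenizing."""
    prompt_ids = steps[0].prompt_ids
    response_ids: list[int] = []
    response_mask: list[int] = []
    seen = list(prompt_ids)
    for step in steps:
        # Tokens inserted by the environment/tool since the last step, i.e. the
        # part of this step's prompt that extends what we have accumulated so far.
        observation_ids = step.prompt_ids[len(seen):]
        response_ids += observation_ids + step.response_ids
        response_mask += [0] * len(observation_ids) + [1] * len(step.response_ids)
        seen = step.prompt_ids + step.response_ids
    return prompt_ids, response_ids, response_mask
```

In this sketch, environment/tool tokens get mask 0 and generated tokens mask 1, so the assembled sequence never depends on re-tokenized text.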

We validate trajectories by ensuring each step's token sequence is an exact prefix of the subsequent step. If this prefix check fails, it signifies that a re-tokenization discrepancy occurred, and we discard the trajectory by setting response_masks=0.
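
A sketch of that prefix check, reusing the illustrative `StepTokens` records from the snippet above (again, not the actual implementation):

```python
def is_consistent(steps: list[StepTokens]) -> bool:
    """Each step's prompt + response must be an exact prefix of the next step's
    prompt; a failure indicates a re-tokenization discrepancy, and the caller
    would then zero out response_masks to drop the trajectory from the loss."""
    for prev, nxt in zip(steps, steps[1:]):
        prev_full = prev.prompt_ids + prev.response_ids
        if nxt.prompt_ids[: len(prev_full)] != prev_full:
            return False
    return True
```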

Experiment Results
Experiments conducted on Qwen/Qwen3-4B using a multiturn search task on Hotpot QA demonstrate the effectiveness of this new mode. The hybrid rollout assemble mode shows significantly more stable KL divergence and clipfrac values, while achieving on-par or better task performance compared to the original text-based method. wandb report

@kylemontgomery1 (Collaborator) commented:

We have the same issue on the workflow side, particularly when using v0.1-style agents/envs with the default single-turn, multi-turn, or cumulative workflows. Currently, in this case, the training instances are built from the chat completions, which can produce different completion ids from the ones originally generated. Ideally, agent.update_from_model would accept a ModelOutput object instead of a response string and store that object in the trajectory step, but this would require refactoring all of the agents.
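
A hypothetical sketch of that refactor (the ModelOutput fields and the agent shown here are illustrative, not the current rLLM interfaces):

```python
# Hypothetical sketch of the suggestion above: update_from_model receives a
# ModelOutput carrying the raw completion ids instead of a decoded string.
from dataclasses import dataclass, field

@dataclass
class ModelOutput:
    text: str
    completion_ids: list[int]   # tokens exactly as sampled by the model

@dataclass
class ExampleAgent:
    steps: list[ModelOutput] = field(default_factory=list)

    def update_from_model(self, output: ModelOutput) -> None:
        # Store the whole object so training instances can later be built from
        # completion_ids rather than re-tokenizing output.text.
        self.steps.append(output)
```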

@jeffreysijuntan (Contributor) commented:

Let's get this merged as a temporary workaround in AgentExecutionEngine. We will eventually come up with a systematic design change for both AgentExecutionEngine and AgentWorkflowEngine.

@jeffreysijuntan merged commit 0360a25 into rllm-org:nightly on Oct 28, 2025 (1 check passed).
@listar2000 (Contributor) commented:
By the way, it seems that the current retokenization logic for workflows is managed on the user side? As far as I can tell, the agent_workflow_engine does not make any calls to get_model_response, while a typical workflow usually does.

