Now, LLMRayActor returns logprobs, and we calculate some stats about them vs the trainer logprobs in grpo_fast.py #1041
Conversation
Instead of purely using the vLLM logprobs, which have always seemed pretty unstable, can we use something like truncated importance sampling from https://fengyao.notion.site/off-policy-rl?
If you want to do it token-wise, here's a code snippet from the authors: https://github.com/yaof20/verl/blob/1e413344a2f31aefdbd05457843274f84dff9f2d/verl/trainer/ppo/core_algos.py#L893
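This isn't the linked verl snippet, just a minimal sketch of what token-wise truncated importance sampling computes; the function name and default cap are illustrative:

```python
import torch

def truncated_is_weights(
    trainer_logprobs: torch.Tensor,    # per-token logprobs from the learner policy
    generator_logprobs: torch.Tensor,  # per-token logprobs from the vLLM generator
    cap: float = 2.0,                  # truncation cap C (2.0 matches the runs below)
) -> torch.Tensor:
    # rho_t = pi_learner(a_t | s_t) / pi_generator(a_t | s_t), computed in log space
    log_ratio = trainer_logprobs - generator_logprobs
    rho = torch.exp(log_ratio)
    # Truncate from above only: min(rho_t, C) bounds the variance of the estimator
    return torch.clamp(rho, max=cap).detach()
```

The resulting per-token weights are detached and multiplied into the policy-gradient loss, so the truncation only bounds the off-policy correction; it never flips its sign.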
…aN source
- Added truncated_importance_sampling_ratio_cap parameter (default 0.0)
- Implemented importance sampling with comprehensive assertions
- Added checks for INVALID_LOGPROB values and extreme logprob differences
- Added NaN checks at each step of the calculation

This will help identify exactly where NaNs are introduced when importance sampling is enabled.
- Add torch.nan_to_num to replace NaN values with INVALID_LOGPROB after collation
- Query tokens in packed sequences are set to NaN by pack_sequences in rl_utils2.py
- Apply nan_to_num in both the training loop (line 1031) and the old_logprobs calculation (line 987)
- Implement proper importance sampling with masking:
  * Only apply IS where both logprobs are valid (not INVALID_LOGPROB)
  * Use the response mask to ensure only response tokens are affected
  * Initialize the importance ratio to 1.0 (neutral) for invalid positions
  * Clamp logprob differences to prevent numerical overflow
- Remove all debug assertions that were causing failures
- Ensure importance sampling only affects valid response token positions

This fixes the 'NaN in mb_old_logprobs before IS' assertion error.
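A minimal sketch of the masked ratio described above; the INVALID_LOGPROB sentinel value and the clamp bounds are assumptions, not the repository's actual constants:

```python
import torch

INVALID_LOGPROB = 1.0  # assumed sentinel for query/padding positions; the real value may differ

def masked_is_ratio(new_logprobs, old_logprobs, response_mask, cap):
    # Packed query tokens arrive as NaN; replace them with the sentinel first
    old_logprobs = torch.nan_to_num(old_logprobs, nan=INVALID_LOGPROB)
    valid = (
        response_mask.bool()
        & (old_logprobs != INVALID_LOGPROB)
        & (new_logprobs != INVALID_LOGPROB)
    )
    # Clamp the log-difference before exponentiating to avoid overflow
    log_diff = torch.clamp(new_logprobs - old_logprobs, min=-20.0, max=20.0)
    # Invalid positions keep a neutral ratio of 1.0 so they do not affect the loss
    ratio = torch.where(valid, torch.exp(log_diff), torch.ones_like(new_logprobs))
    return torch.clamp(ratio, max=cap)
```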
Run with truncated importance sampling: Wandb
Just do the check for not setting both args at the same time (as they conflict).
The other changes aren't necessary. Overall, I think our logic for logprobs is a bit convoluted but that's not the point of this PR so no need to fix.
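A minimal sketch of that check, assuming an argparse-style `args` namespace carrying both flags:

```python
# Hypothetical validation at startup: the two modes conflict, so reject setting both.
if args.use_vllm_logprobs and args.truncated_importance_sampling_ratio_cap > 0.0:
    raise ValueError(
        "--use_vllm_logprobs and --truncated_importance_sampling_ratio_cap conflict; "
        "set at most one of them."
    )
```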
…tch at source
This adds logging to check whether a vLLM CompletionOutput has mismatched token_ids and logprobs lengths. According to analysis of the vLLM source, all generated tokens should have logprobs, so N-1 behavior would indicate a bug.
Root cause: we were unconditionally appending the EOS token when finish_reason='stop' but not appending a corresponding logprob, creating a len(response) = N+1 vs. len(logprobs) = N mismatch.

Fix:
- Only append EOS for truly empty responses (the actual edge case)
- When we do append EOS, also append NaN to the logprobs
- Normal responses ending with </answer> no longer get EOS appended
- vLLM returns N logprobs for N tokens, so there is no mismatch

Also added assertions to verify the masks match correctly, and updated the diagnostic logging to treat length mismatches as errors rather than expected behavior.
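A minimal sketch of the alignment rule this commit describes; the function and argument names are illustrative, not the actual grpo_fast.py code:

```python
def finalize_output(token_ids, logprobs, finish_reason, eos_token_id):
    # vLLM returns one logprob per generated token, so the two lists should
    # already be aligned; only the empty-response edge case needs padding.
    if finish_reason == "stop" and len(token_ids) == 0:
        token_ids = [eos_token_id]
        logprobs = [float("nan")]  # no real logprob exists for the injected EOS
    assert len(token_ids) == len(logprobs), (len(token_ids), len(logprobs))
    return token_ids, logprobs
```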
After discussing with Hamish, one of the main issues is that we had code which adds EOS when we use a stop string (for instance, in reasoning, we end when we generate </answer>). I changed this PR to remove the EOS addition. Runs seem unaffected: https://wandb.ai/ai2-llm/open_instruct_internal?nw=ba89hjgyfp6
Added a use_vllm_logprobs flag which uses the vLLM logprobs to train on instead of the local ones from the learner model. Also added the KL divergence from the generator to the trainer, and an option to use truncated importance sampling via the truncated_importance_sampling_ratio_cap flag.

Runs with --truncated_importance_sampling_ratio_cap 2.0:

Runs with --use_vllm_logprobs:
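For reference, a minimal sketch of the kind of generator-vs-trainer logprob comparison described in the summary above; the function name, stat names, and masking convention are illustrative assumptions, not the PR's actual code:

```python
import torch

def logprob_comparison_stats(trainer_logprobs, vllm_logprobs, response_mask):
    """Summary stats comparing generator (vLLM) and trainer logprobs over response tokens."""
    mask = response_mask.bool()
    diff = (vllm_logprobs - trainer_logprobs)[mask]
    return {
        # Monte Carlo estimate of KL(generator || trainer) from the sampled tokens
        "kl_generator_to_trainer": diff.mean().item(),
        "mean_abs_logprob_diff": diff.abs().mean().item(),
        "max_abs_logprob_diff": diff.abs().max().item(),
    }
```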