[Draft] Comparing toolu with main #963
Draft: shatu wants to merge 209 commits into toolu from main (base: toolu)
Conversation
* Added scripts to run benchmarks. * Removed install script. * Added install script back.
* first pass remap verifier * make judge json parsing a little more robust * typoooooooo * typoooooooo * fix logic... * clean logging naming up
Co-authored-by: Michael Noukhovitch <[email protected]>
* add punk tokenizer * fix up command
* Made changes.
* Switched to use ray.util.queue.Queue instead of a custom RayQueue class (see the sketch after this list).
* Now, only handles new version.
* Updated benchmark_generators.py and test_grpo_fast.py.
* Cleaned up code from Claude.
* training_step defaults to None.
* Added an info dataclass to replace the tuple.
* Removes assumption that queries_prompt_Q and inference_results_Q are in sync by moving queries_prompt_Q to be a map.
* Cleaned up benchmark.
* Added code to split batch sizes.
* Removed benchmark scripts, which are now in a separate PR.
* Now, we create all Ray queues in main, and pass them in as appropriate.
* Removed changes.
* Test changes.
* Linter passes.
* Added tests.
* Now, we index with the dataset indices.
* Checks and tests pass.
* Ran linter.
* Added benchmark scripts back. Whoops.
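The switch to `ray.util.queue.Queue` boils down to a pattern like the following minimal sketch; the queue and task names here are illustrative, not the repo's actual ones:

```python
import ray
from ray.util.queue import Queue

ray.init(ignore_reinit_error=True)

# Queues are created once in the driver and passed to Ray tasks/actors as arguments.
inference_results_Q = Queue(maxsize=8)

@ray.remote
def producer(results_q: Queue) -> None:
    for dataset_index in range(4):
        results_q.put({"dataset_index": dataset_index, "result": dataset_index * 2})

ray.get(producer.remote(inference_results_Q))
while not inference_results_Q.empty():
    print(inference_results_Q.get())
```

Because the queue is backed by a Ray actor, the same handle can be shared between the main process and any number of workers without a custom wrapper class.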
* Set new default value for num_samples.
* Now run N batches at once.
* Different batch size.
* Fix pack length.
* Fix pack length.
* Fix wasted compute % (was accidentally multiplying by 100), and fix num rollouts (was referencing the wrong variable).
* Now, we save benchmark results to CSV.
* Now show a percentage for time spent generating.
* Updated benchmark saving code.
* Fixed syntax error.
* Fixed benchmark.
* Fixed timing code.
* Removed changes to vllm_utils3.py.
* Now, we actually write the data to disk.
* Bigger batch.
* Modified benchmark.
* Undid changes to benchmark script.
* Temp change.
* Undid changes to benchmark script.
It was only being installed in the regular Dockerfile.
Co-authored-by: Michael Noukhovitch <[email protected]>
Co-authored-by: Saurabh Shah <[email protected]>
Co-authored-by: Michael Noukhovitch <[email protected]>
* binary reward for code * style * binary code reward flag -> pass rate reward threshold
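As a rough illustration of the "binary code reward flag -> pass rate reward threshold" change, the idea is that a test pass rate becomes a binary reward once it clears a configurable threshold. The function and argument names below are made up for the example:

```python
def code_reward(pass_rate: float, pass_rate_reward_threshold: float = 1.0) -> float:
    """Binary reward: 1.0 once the test pass rate clears the threshold, else 0.0."""
    return 1.0 if pass_rate >= pass_rate_reward_threshold else 0.0

assert code_reward(0.8, pass_rate_reward_threshold=0.75) == 1.0
assert code_reward(0.5, pass_rate_reward_threshold=0.75) == 0.0
```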
* Now, we run individual prompts through the queue.
* Fixed issues.
* Ran linter.
* Fixed linter errors.
* Code lints.
* Test passes.
* Ran linter.
* Ensures that we send single prompts as requests.
* Now, code lints.
* Cleaned up code.
* Fixes test.
* Linter passes.
* Cleaned test up.
* Removed redundant comments.
* Adds flashinfer dep.
* Now, open_instruct builds even on mac.
* Updated install instructions to add flash-infer.
* Now, we set flashinfer as the default attention backend (see the sketch after this list).
* Added flashinfer to the base dockerfile.
* Ran linter.
* Removed extra changes to mason.py.
* Undid changes to uv.lock.
* Updated requirements.txt.
* Updated flash-attn version.
---------
Co-authored-by: Hamish Ivison <[email protected]>
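One way to make FlashInfer the default attention backend is through vLLM's `VLLM_ATTENTION_BACKEND` environment variable; this is a minimal sketch of that approach, not necessarily the exact mechanism used in the PR:

```python
import os

# Must be set before the vLLM engine is constructed; setdefault lets users override it.
os.environ.setdefault("VLLM_ATTENTION_BACKEND", "FLASHINFER")
```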
* delete function Signed-off-by: Yu Chin Fabian Lim <[email protected]> * Update open_instruct/dataset_transformation.py --------- Signed-off-by: Yu Chin Fabian Lim <[email protected]> Co-authored-by: Hamish Ivison <[email protected]>
prev-branch: padding-free-squashing-7 Co-authored-by: Hamish Ivison <[email protected]>
* Fix misnamed variables. * Ran linter.
Co-authored-by: Hamish Ivison <[email protected]>
Adds new olmo-core-compatible chat templates. Includes:
* New olmo template with support for function-calling. Includes a basic hard-coded system prompt, and appends "You do not have access to any functions" to any SFT examples that do not include functions.
* Thinker version of the above template, which has <think> included in the generation prompt.
* R1-style thinker template.
These 3 templates mirror our current Tulu templates.
Also includes some necessary changes to the --add_bos logic, to handle the new chat template, which does not have a bos token (see the sketch after this note).
Includes a few other QoL fixes:
* Fixes a bug in the olmocore tokenization script re: label mask.
* Logs dataset-level statistics during data mixing and tokenization.
* Supports easy upsampling during data mixing.
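For the --add_bos handling, one plausible shape of the check is sketched below: decide whether the chat template already emits a BOS token before prepending another one. The helper name is hypothetical and this is not the repo's actual implementation:

```python
def chat_template_adds_bos(tokenizer) -> bool:
    """Return True if applying the chat template already yields a leading BOS,
    in which case --add_bos should not prepend a second one."""
    rendered = tokenizer.apply_chat_template(
        [{"role": "user", "content": "hi"}], tokenize=True
    )
    return tokenizer.bos_token_id is not None and rendered[:1] == [tokenizer.bos_token_id]
```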
* fix up my (jacob's) slightly broken pr --------- Co-authored-by: jacob-morrison <[email protected]>
* remove moar things * create on pr * dont create on pr
* Moved init to the main thread, before the queues. * Undid changes to test.
Co-authored-by: Saurabh Shah <[email protected]>
* init branch * it works * nits * false positive regex * manual filtering * fix manual label issue * another fix * minor fixes * final tweaks * clean up * cleaning * cleanup and docs * try removing tests spec * reset pyproject
* Removed outdated files * Added tool files * Ran tests * Ran linter * Always uses the real code execution server * Reduced nesting * Removed comments * Fixed tests
…y're slow and not needed now that we don't use `flashinfer`. (#1025) * Fixed tests. * Updated deps to remove xdist.
…_loss="sum"` any longer. (#1024)
* Removed outdated files.
* Added tool files.
* Ran tests.
* Ran linter.
* Always uses the real code execution server.
* Reduced nesting.
* Removed comments.
* Removed reduce_loss argument; the default is now mean (see the sketch after this list).
* Removed -n auto from tests.
* Fixed linter errors.
* Removed reduce_loss arguments from scripts.
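For context on the `reduce_loss` removal (mean is now the only behavior), the difference is roughly the following. The function below is an illustrative sketch, not the repo's implementation:

```python
import torch

def masked_mean_loss(token_losses: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """'mean' reduction: average per-token losses over non-padding positions,
    instead of summing them (the removed 'sum' behavior)."""
    return (token_losses * mask).sum() / mask.sum().clamp(min=1)
```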
* update filtering script * address gemini comments * address comments
* Fixed issues with token metrics * Also changes training_start_time to use perf_counter * Fixed eta
* Added actor tokens_per_second * Fixed bug where one timing was using time.time() and the other was using time.perf_counter(). * Updated code * Cleaned up PR.
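The timing bug above came from mixing clocks. A small sketch of the fix, using `time.perf_counter` for both endpoints of an interval (the timed function here is just a stand-in for a training or generation step):

```python
import time

def timed(fn, *args, **kwargs):
    """Measure wall time with a single monotonic clock (perf_counter);
    mixing time.time() and time.perf_counter() gives meaningless deltas."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

_, elapsed_s = timed(sum, range(1_000_000))  # stand-in for the real work
print(f"{1_000_000 / elapsed_s:,.0f} items/s")
```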
* Add MFU/MBU calculation support to grpo_fast
  - Added MFU/MBU fields to TokenStatistics dataclass
  - Moved ModelDims and GPU_SPECS to utils.py to avoid circular imports
  - Calculate batch-wide MFU/MBU in accumulate_inference_batches
  - Thread model_dims parameter through function signatures
  - Use module-level constants for FLOP_PER_MAC and SOFTMAX_FLOPS_PER_SCORE
  🤖 Generated with [Claude Code](https://claude.ai/code)
  Co-Authored-By: Claude <[email protected]>
* Now, we calculate MFU/MBU (see the sketch after this list).
* Now, we report actor mfu/mbu to wandb.
* Cleaned up PR.
* Cleaned up how we load model dims.
* Cleaned up code.
* Fixed intermediate size bug.
* Fixed method calls.
* Added GPU memory logging around ModelDims loading to debug OOM issue.
* Fix OOM by avoiding loading model weights when getting config. The issue was that ray.get(vllm_engines[0].get_vllm_config.remote()) was loading 19GB of model weights into GPU memory on the main process. Changed to only fetch necessary model dimensions as a simple dict, avoiding serialization of the full vLLM config object which contained model references.
* Fixed eval calls.
* Cleaned up PR.
* Updated code so tests pass.
* Updated code.
* Removed debugging logs.
---------
Co-authored-by: Claude <[email protected]>
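A back-of-the-envelope sketch of the MFU/MBU idea referenced above; the function names and the hardware numbers in the example are illustrative assumptions, not values from the repo's ModelDims or GPU_SPECS:

```python
def mfu(tokens_per_second: float, flops_per_token: float, peak_flops_per_second: float) -> float:
    """Model FLOPs utilization: achieved FLOP/s divided by the hardware peak."""
    return tokens_per_second * flops_per_token / peak_flops_per_second

def mbu(bytes_per_second: float, peak_bytes_per_second: float) -> float:
    """Model bandwidth utilization: achieved memory traffic divided by peak bandwidth."""
    return bytes_per_second / peak_bytes_per_second

# Example with assumed numbers: roughly 2*N FLOPs per token for an N-parameter model,
# and an assumed ~1e15 FLOP/s peak for a modern accelerator in BF16.
print(f"MFU: {mfu(20_000, 2 * 7e9, 1e15):.1%}")
```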
* fix * gemini comment
* fix some minor niggles * pass through seed * fix * address comment * fix
* Added learner MFU * Cleaned up code. * Cleaned up PR * Fixed calculation. * Fixed tests.
* Logging oe eval to wandb when using the new oe-eval-internal; requires PR allenai/oe-eval-internal#636 from oe-eval-internal (see the sketch after this list).
* wandb run step arg.
* Undo formatting.
* Gemini's suggested leading space.
* Default log to wandb and check secret.
* wandb util test.
* Doctest string.
* Make it default.
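A minimal sketch of the "log eval results to wandb at an explicit step, only when the secret is present" behavior described above; the project and metric names are placeholders:

```python
import os
import wandb

if os.environ.get("WANDB_API_KEY"):  # only log when the secret is available
    run = wandb.init(project="oe-eval-demo", resume="allow")
    run.log({"oe_eval/gsm8k_accuracy": 0.42}, step=1000)  # explicit wandb run step
    run.finish()
```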
* Added timing for individual weight syncs * Fixed bug * Fixed bug
The Docker image doesn't have jq, so get the username without it. Also update a comment to remove the reference to the removed flag --oe_eval_log_to_wandb.
* Update cluster names for open-instruct scripts * Updated docs * Removed/renamed outdated cluster names. * Updated code to remove outdated cluster references. * Updated all scripts again. * Updated cluster references as per Hamish's suggestions.
…ace. (#1049) * updated code to set default env vars in a central place. * Inlined code.
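Setting default environment variables "in a central place" typically looks like the sketch below; the specific variables shown are examples, not the repo's actual list:

```python
import os

_DEFAULT_ENV_VARS = {
    "NCCL_DEBUG": "WARN",
    "TOKENIZERS_PARALLELISM": "false",
}

def set_default_env_vars() -> None:
    """Apply defaults once at startup; setdefault never overrides user-set values."""
    for key, value in _DEFAULT_ENV_VARS.items():
        os.environ.setdefault(key, value)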
* Cleaned up code * Added comment
* Set vllm_use_v1=1 * Updated search endpoint * Update search url
…#1053) * Loads model on device * Update open_instruct/grpo_fast.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * Set conditional device map * Updated code --------- Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* Loads model on device.
* Update open_instruct/grpo_fast.py
  Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* Set conditional device map.
* Set CUDA device before init_process_group to silence NCCL warning. Adds torch.cuda.set_device(self.local_rank) immediately before init_process_group() calls in PolicyTrainerRayProcess.setup_model_update_group() to silence the NCCL warning about unknown device mapping.
  🤖 Generated with [Claude Code](https://claude.com/claude-code)
  Co-Authored-By: Claude <[email protected]>
* Set CUDA device immediately before DeepSpeed init to silence NCCL warning. The NCCL warning was coming from deepspeed.init_distributed(), not from the model_update_group initialization. Added torch.cuda.set_device(self.local_rank) immediately before deepspeed.init_distributed() to ensure the device is properly set when DeepSpeed creates its distributed process group (see the sketch after this list).
  🤖 Generated with [Claude Code](https://claude.com/claude-code)
  Co-Authored-By: Claude <[email protected]>
* trying to fix it
* Removed unneeded code.
* Cleaned up PR.
* Cleaned PR.
* Cleaned PR.
* Undid changes to open_instruct/utils.py.
---------
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Claude <[email protected]>
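The NCCL-warning fix described above reduces to an ordering constraint, roughly as in this sketch; it assumes the usual torchrun/deepspeed launcher environment variables, and the repo's actual call sites are deepspeed.init_distributed() and the model-update process group:

```python
import os
import torch
import torch.distributed as dist

local_rank = int(os.environ.get("LOCAL_RANK", "0"))
if torch.cuda.is_available():
    # Bind this process to its GPU *before* any distributed init so NCCL
    # knows the device mapping and doesn't warn about it.
    torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")
```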
…ang forever and take up resources. (#1058) * Added timeouts. * Update scripts/train/debug/large_test_script.sh Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> --------- Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* Removed all tabs. * Formatted whitespace from mason.py.
* Updated LLMRayActor to run continuously.
* Set v1 = true.
* Added pause loop in update weights (see the sketch after this list).
* Set inflight updates false.
* Fixed condition.
* Update code.
* Another attempt at fixing deadlocks.
* Added logs.
* More logging to diagnose hang.
* Fixed busy-waiting.
* Updated the code.
* Added logging.
* Fixed prefetch.
* Updated code.
* Cleaned up PR.
* Cleaned up code.
* Fixed race condition.
* Removed logging from update_weight.
* Less logging.
* Cleaned up PR.
* Cleaned up PR.
* Cleaned up PR.
* Cleaned up PR.
* Rearranged timeout.
* Removes broken code.
* Undid changes to inflight.
* Updated code with a health check on the actors.
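A condensed sketch of the "pause generation while weights update" pattern mentioned above, using a threading.Event to block rather than busy-wait; the class and method names are illustrative, not the actor's real API:

```python
import threading

class GenerationLoop:
    def __init__(self) -> None:
        self._resume = threading.Event()
        self._resume.set()  # start unpaused

    def begin_weight_update(self) -> None:
        self._resume.clear()  # generation steps will block at wait()

    def end_weight_update(self) -> None:
        self._resume.set()

    def step(self) -> None:
        self._resume.wait()  # blocks (no busy-waiting) until weights are synced
        # ... run one generation step here ...
```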
* updated filtering script whoops remove things * add command * address comments * need list * return type
…t them vs the trainer logprobs in `grpo_fast.py`. (#1041)
* Now, llmrayactor returns logprobs.
* Updated code.
* Cleaned up PR.
* Cleaned up PR.
* Updated logprob code.
* Fixed code.
* Now uses nan.
* Now, we filter nans.
* Cleaned up code.
* Fixed tests.
* Updated code.
* Added vllm logprobs.
* Cleaned up code.
* Undo changes to script.
* Fixed bug in logprobs.
* Fixed failing tests.
* Added importance sampling ratio.
* Added back comment.
* Removed comment.
* Test config.
* Add truncated importance sampling with debug assertions to identify NaN source
  - Added truncated_importance_sampling_ratio_cap parameter (default 0.0)
  - Implemented importance sampling with comprehensive assertions
  - Added checks for INVALID_LOGPROB values and extreme logprob differences
  - Added NaN checks at each step of the calculation
  This will help identify exactly where NaNs are introduced when importance sampling is enabled.
  🤖 Generated with [Claude Code](https://claude.com/claude-code)
  Co-Authored-By: Claude <[email protected]>
* Fix NaN handling in truncated importance sampling
  - Add torch.nan_to_num to replace NaN values with INVALID_LOGPROB after collation
  - Query tokens in packed sequences are set to NaN by pack_sequences in rl_utils2.py
  - Apply nan_to_num in both training loop (line 1031) and old_logprobs calculation (line 987)
  - Implement proper importance sampling with masking:
    * Only apply IS where both logprobs are valid (not INVALID_LOGPROB)
    * Use response mask to ensure only response tokens are affected
    * Initialize importance ratio to 1.0 (neutral) for invalid positions
    * Clamp logprob differences to prevent numerical overflow
  - Remove all debug assertions that were causing failures
  - Ensure importance sampling only affects valid response token positions
  This fixes the 'NaN in mb_old_logprobs before IS' assertion error (a sketch of the masked, truncated ratio follows after this list).
  🤖 Generated with [Claude Code](https://claude.com/claude-code)
  Co-Authored-By: Claude <[email protected]>
* Added reverse kl.
* Simplified code.
* Changes to debug mask.
* More logging.
* More logging.
* Added comment describing.
* Add diagnostic logging to vllm_utils3 to detect logprobs length mismatch at source. This adds logging to check if vLLM CompletionOutput has mismatched token_ids and logprobs lengths. According to vLLM source analysis, all generated tokens should have logprobs, so N-1 behavior would indicate a bug.
  🤖 Generated with [Claude Code](https://claude.com/claude-code)
  Co-Authored-By: Claude <[email protected]>
* Fix vLLM logprobs N-1 mismatch by only appending EOS for empty responses. Root cause: we were unconditionally appending the EOS token when finish_reason='stop', but not appending a corresponding logprob. This created a len(response) = N+1 but len(logprobs) = N mismatch. Fix:
  - Only append EOS for truly empty responses (the actual edge case)
  - When we do append EOS, also append NaN to logprobs
  - Normal responses ending with </answer> no longer get EOS appended
  - vLLM returns N logprobs for N tokens, so no mismatch
  Also added assertions to verify masks match correctly. Updated diagnostic logging to treat length mismatches as errors rather than expected behavior.
  🤖 Generated with [Claude Code](https://claude.com/claude-code)
  Co-Authored-By: Claude <[email protected]>
* Address review comments.
* Updated scripts.
* Updated scripts.
* Cleaned up PR.
* Added assert.
---------
Co-authored-by: Claude <[email protected]>
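A sketch of the masked, truncated importance-sampling ratio referenced in the list above. Names such as `truncated_is_weights` and the specific clamp bounds are illustrative; only the overall recipe (cap the ratio, use neutral weights for invalid or NaN positions) comes from the commit messages:

```python
import torch

def truncated_is_weights(
    trainer_logprobs: torch.Tensor,
    sampler_logprobs: torch.Tensor,  # e.g. the logprobs returned by vLLM
    response_mask: torch.Tensor,
    ratio_cap: float = 2.0,
) -> torch.Tensor:
    log_ratio = (trainer_logprobs - sampler_logprobs).clamp(-20.0, 20.0)  # avoid overflow
    ratio = torch.exp(log_ratio).clamp(max=ratio_cap)                     # truncate the IS ratio
    # Neutral weight (1.0) for padding/query positions and any NaN logprobs.
    valid = response_mask.bool() & ~torch.isnan(ratio)
    return torch.where(valid, ratio, torch.ones_like(ratio))
```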
Draft PR for a branch -- No review needed