[Draft] Comparing toolu with main #963
Draft: shatu wants to merge 209 commits into toolu from main (base: toolu)
Conversation
* Added scripts to run benchmarks. * Removed install script. * Added install script back.
* first pass remap verifier * make judge json parsing a little more robust * typoooooooo * typoooooooo * fix logic... * clean logging naming up
Co-authored-by: Michael Noukhovitch <[email protected]>
* add punk tokenizer * fix up command
* Made changes.
* Switched to use ray.util.queue.Queue instead of a custom RayQueue class (see the sketch after this list).
* Now, only handles new version.
* Updated benchmark_generators.py and test_grpo_fast.py.
* Cleaned up code from Claude.
* training_step defaults to None.
* Added an info dataclass to replace the tuple.
* Removes assumption that queries_prompt_Q and inference_results_Q are in sync by moving queries_prompt_Q to be a map.
* Cleaned up benchmark.
* Added code to split batch sizes.
* Removed benchmark scripts, which are now in a separate PR.
* Now, we create all Ray queues in main, and pass them in as appropriate.
* Removed changes.
* Test changes.
* Linter passes.
* Added tests.
* Now, we index with the dataset indices.
* Checks and tests pass.
* Ran linter.
* Added benchmark scripts back. Whoops.
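The switch to `ray.util.queue.Queue` boils down to a pattern like the following minimal sketch; the queue and task names here are illustrative, not the repo's actual ones:

```python
import ray
from ray.util.queue import Queue

ray.init(ignore_reinit_error=True)

# Queues are created once in the driver and passed to Ray tasks/actors as arguments.
inference_results_Q = Queue(maxsize=8)

@ray.remote
def producer(results_q: Queue) -> None:
    for dataset_index in range(4):
        results_q.put({"dataset_index": dataset_index, "result": dataset_index * 2})

ray.get(producer.remote(inference_results_Q))
while not inference_results_Q.empty():
    print(inference_results_Q.get())
```

Because the queue is backed by a Ray actor, the same handle can be shared between the main process and any number of workers without a custom wrapper class.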
* Set new default value for num_samples.
* Now run N batches at once.
* Different batch size.
* Fix pack length.
* Fix pack length.
* Fix wasted compute % (was accidentally multiplying by 100), and fix num rollouts (was referencing the wrong variable).
* Now, we save benchmark results to CSV.
* Now show a percentage for time spent generating.
* Updated benchmark saving code.
* Fixed syntax error.
* Fixed benchmark.
* Fixed timing code.
* Removed changes to vllm_utils3.py.
* Now, we actually write the data to disk.
* Bigger batch.
* Modified benchmark.
* Undid changes to benchmark script.
* Temp change.
* Undid changes to benchmark script.
It was only being installed in the regular Dockerfile.
Co-authored-by: Michael Noukhovitch <[email protected]>
Co-authored-by: Saurabh Shah <[email protected]>
Co-authored-by: Michael Noukhovitch <[email protected]>
* binary reward for code * style * binary code reward flag -> pass rate reward threshold
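As a rough illustration of the "binary code reward flag -> pass rate reward threshold" change, the idea is that a test pass rate becomes a binary reward once it clears a configurable threshold. The function and argument names below are made up for the example:

```python
def code_reward(pass_rate: float, pass_rate_reward_threshold: float = 1.0) -> float:
    """Binary reward: 1.0 once the test pass rate clears the threshold, else 0.0."""
    return 1.0 if pass_rate >= pass_rate_reward_threshold else 0.0

assert code_reward(0.8, pass_rate_reward_threshold=0.75) == 1.0
assert code_reward(0.5, pass_rate_reward_threshold=0.75) == 0.0
```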
* Now, we run individual prompts through the queue.
* Fixed issues.
* Ran linter.
* Fixed linter errors.
* Code lints.
* Test passes.
* Ran linter.
* Ensures that we send single prompts as requests.
* Now, code lints.
* Cleaned up code.
* Fixes test.
* Linter passes.
* Cleaned test up.
* Removed redundant comments.
* Adds flashinfer dep.
* Now, open_instruct builds even on mac.
* Updated install instructions to add flash-infer.
* Now, we set flashinfer as the default attention backend (see the sketch after this list).
* Added flashinfer to the base dockerfile.
* Ran linter.
* Removed extra changes to mason.py.
* Undid changes to uv.lock.
* Updated requirements.txt.
* Updated flash-attn version.
---------
Co-authored-by: Hamish Ivison <[email protected]>
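One way to make FlashInfer the default attention backend is through vLLM's `VLLM_ATTENTION_BACKEND` environment variable; this is a minimal sketch of that approach, not necessarily the exact mechanism used in the PR:

```python
import os

# Must be set before the vLLM engine is constructed; setdefault lets users override it.
os.environ.setdefault("VLLM_ATTENTION_BACKEND", "FLASHINFER")
```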
* delete function Signed-off-by: Yu Chin Fabian Lim <[email protected]> * Update open_instruct/dataset_transformation.py --------- Signed-off-by: Yu Chin Fabian Lim <[email protected]> Co-authored-by: Hamish Ivison <[email protected]>
prev-branch: padding-free-squashing-7 Co-authored-by: Hamish Ivison <[email protected]>
* Fix misnamed variables. * Ran linter.
Co-authored-by: Hamish Ivison <[email protected]>
Adds new olmo-core-compatible chat templates. Includes:
* New olmo template with support for function-calling. Includes a basic hard-coded system prompt, and appends "You do not have access to any functions" to any SFT examples that do not include functions.
* Thinker version of the above template, which has <think> included in the generation prompt.
* R1-style thinker template.
These 3 templates mirror our current Tulu templates.
Also includes some necessary changes to the --add_bos logic, to handle the new chat template, which does not have a bos token (see the sketch after this note).
Includes a few other QoL fixes:
* Fixes a bug in the olmocore tokenization script re: label mask.
* Logs dataset-level statistics during data mixing and tokenization.
* Supports easy upsampling during data mixing.
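For the --add_bos handling, one plausible shape of the check is sketched below: decide whether the chat template already emits a BOS token before prepending another one. The helper name is hypothetical and this is not the repo's actual implementation:

```python
def chat_template_adds_bos(tokenizer) -> bool:
    """Return True if applying the chat template already yields a leading BOS,
    in which case --add_bos should not prepend a second one."""
    rendered = tokenizer.apply_chat_template(
        [{"role": "user", "content": "hi"}], tokenize=True
    )
    return tokenizer.bos_token_id is not None and rendered[:1] == [tokenizer.bos_token_id]
```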
* fix up my (jacob's) slightly broken pr --------- Co-authored-by: jacob-morrison <[email protected]>
* remove moar things * create on pr * dont create on pr
* Moved init to the main thread, before the queues. * Undid changes to test.
Co-authored-by: Saurabh Shah <[email protected]>
* init branch * it works * nits * false positive regex * manual filtering * fix manual label issue * another fix * minor fixes * final tweaks * clean up * cleaning * cleanup and docs * try removing tests spec * reset pyproject
* Removed outdated files * Added tool files * Ran tests * Ran linter * Always uses the real code execution server * Reduced nesting * Removed comments * Fixed tests
…y're slow and not needed now that we don't use `flashinfer`. (#1025) * Fixed tests. * Updated deps to remove xdist.
…_loss="sum"` any longer. (#1024)
* Removed outdated files.
* Added tool files.
* Ran tests.
* Ran linter.
* Always uses the real code execution server.
* Reduced nesting.
* Removed comments.
* Removed reduce_loss argument; the default is now mean (see the sketch after this list).
* Removed -n auto from tests.
* Fixed linter errors.
* Removed reduce_loss arguments from scripts.
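For context on the `reduce_loss` removal (mean is now the only behavior), the difference is roughly the following. The function below is an illustrative sketch, not the repo's implementation:

```python
import torch

def masked_mean_loss(token_losses: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """'mean' reduction: average per-token losses over non-padding positions,
    instead of summing them (the removed 'sum' behavior)."""
    return (token_losses * mask).sum() / mask.sum().clamp(min=1)
```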
* update filtering script * address gemini comments * address comments
* Fixed issues with token metrics * Also changes training_start_time to use perf_counter * Fixed eta
* Added actor tokens_per_second * Fixed bug where one timing was using time.time() and the other was using time.perf_counter(). * Updated code * Cleaned up PR.
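The timing bug above came from mixing clocks. A small sketch of the fix, using `time.perf_counter` for both endpoints of an interval (the timed function here is just a stand-in for a training or generation step):

```python
import time

def timed(fn, *args, **kwargs):
    """Measure wall time with a single monotonic clock (perf_counter);
    mixing time.time() and time.perf_counter() gives meaningless deltas."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

_, elapsed_s = timed(sum, range(1_000_000))  # stand-in for the real work
print(f"{1_000_000 / elapsed_s:,.0f} items/s")
```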
* Add MFU/MBU calculation support to grpo_fast
  - Added MFU/MBU fields to TokenStatistics dataclass
  - Moved ModelDims and GPU_SPECS to utils.py to avoid circular imports
  - Calculate batch-wide MFU/MBU in accumulate_inference_batches
  - Thread model_dims parameter through function signatures
  - Use module-level constants for FLOP_PER_MAC and SOFTMAX_FLOPS_PER_SCORE
  🤖 Generated with [Claude Code](https://claude.ai/code)
  Co-Authored-By: Claude <[email protected]>
* Now, we calculate MFU/MBU (see the sketch after this list).
* Now, we report actor mfu/mbu to wandb.
* Cleaned up PR.
* Cleaned up how we load model dims.
* Cleaned up code.
* Fixed intermediate size bug.
* Fixed method calls.
* Added GPU memory logging around ModelDims loading to debug OOM issue.
* Fix OOM by avoiding loading model weights when getting config. The issue was that ray.get(vllm_engines[0].get_vllm_config.remote()) was loading 19GB of model weights into GPU memory on the main process. Changed to only fetch necessary model dimensions as a simple dict, avoiding serialization of the full vLLM config object which contained model references.
* Fixed eval calls.
* Cleaned up PR.
* Updated code so tests pass.
* Updated code.
* Removed debugging logs.
---------
Co-authored-by: Claude <[email protected]>
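A back-of-the-envelope sketch of the MFU/MBU idea referenced above; the function names and the hardware numbers in the example are illustrative assumptions, not values from the repo's ModelDims or GPU_SPECS:

```python
def mfu(tokens_per_second: float, flops_per_token: float, peak_flops_per_second: float) -> float:
    """Model FLOPs utilization: achieved FLOP/s divided by the hardware peak."""
    return tokens_per_second * flops_per_token / peak_flops_per_second

def mbu(bytes_per_second: float, peak_bytes_per_second: float) -> float:
    """Model bandwidth utilization: achieved memory traffic divided by peak bandwidth."""
    return bytes_per_second / peak_bytes_per_second

# Example with assumed numbers: roughly 2*N FLOPs per token for an N-parameter model,
# and an assumed ~1e15 FLOP/s peak for a modern accelerator in BF16.
print(f"MFU: {mfu(20_000, 2 * 7e9, 1e15):.1%}")
```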
* fix * gemini comment
* fix some minor niggles * pass through seed * fix * address comment * fix
* Added learner MFU * Cleaned up code. * Cleaned up PR * Fixed calculation. * Fixed tests.
* Logging oe eval to wandb when using the new oe-eval-internal; requires PR allenai/oe-eval-internal#636 from oe-eval-internal (see the sketch after this list).
* wandb run step arg.
* Undo formatting.
* Gemini's suggested leading space.
* Default log to wandb and check secret.
* wandb util test.
* Doctest string.
* Make it default.
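A minimal sketch of the "log eval results to wandb at an explicit step, only when the secret is present" behavior described above; the project and metric names are placeholders:

```python
import os
import wandb

if os.environ.get("WANDB_API_KEY"):  # only log when the secret is available
    run = wandb.init(project="oe-eval-demo", resume="allow")
    run.log({"oe_eval/gsm8k_accuracy": 0.42}, step=1000)  # explicit wandb run step
    run.finish()
```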
* Added timing for individual weight syncs * Fixed bug * Fixed bug
The Docker image doesn't have jq, so get the username without it. Also update a comment to remove the reference to the removed flag --oe_eval_log_to_wandb.
* Update cluster names for open-instruct scripts * Updated docs * Removed/renamed outdated cluster names. * Updated code to remove outdated cluster references. * Updated all scripts again. * Updated cluster references as per Hamish's suggestions.
…ace. (#1049) * updated code to set default env vars in a central place. * Inlined code.
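Setting default environment variables "in a central place" typically looks like the sketch below; the specific variables shown are examples, not the repo's actual list:

```python
import os

_DEFAULT_ENV_VARS = {
    "NCCL_DEBUG": "WARN",
    "TOKENIZERS_PARALLELISM": "false",
}

def set_default_env_vars() -> None:
    """Apply defaults once at startup; setdefault never overrides user-set values."""
    for key, value in _DEFAULT_ENV_VARS.items():
        os.environ.setdefault(key, value)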
* Cleaned up code * Added comment
* Set vllm_use_v1=1 * Updated search endpoint * Update search url
…#1053) * Loads model on device * Update open_instruct/grpo_fast.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * Set conditional device map * Updated code --------- Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* Loads model on device.
* Update open_instruct/grpo_fast.py
  Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* Set conditional device map.
* Set CUDA device before init_process_group to silence NCCL warning. Adds torch.cuda.set_device(self.local_rank) immediately before init_process_group() calls in PolicyTrainerRayProcess.setup_model_update_group() to silence the NCCL warning about unknown device mapping.
  🤖 Generated with [Claude Code](https://claude.com/claude-code)
  Co-Authored-By: Claude <[email protected]>
* Set CUDA device immediately before DeepSpeed init to silence NCCL warning. The NCCL warning was coming from deepspeed.init_distributed(), not from the model_update_group initialization. Added torch.cuda.set_device(self.local_rank) immediately before deepspeed.init_distributed() to ensure the device is properly set when DeepSpeed creates its distributed process group (see the sketch after this list).
  🤖 Generated with [Claude Code](https://claude.com/claude-code)
  Co-Authored-By: Claude <[email protected]>
* trying to fix it
* Removed unneeded code.
* Cleaned up PR.
* Cleaned PR.
* Cleaned PR.
* Undid changes to open_instruct/utils.py.
---------
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Claude <[email protected]>
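The NCCL-warning fix described above reduces to an ordering constraint, roughly as in this sketch; it assumes the usual torchrun/deepspeed launcher environment variables, and the repo's actual call sites are deepspeed.init_distributed() and the model-update process group:

```python
import os
import torch
import torch.distributed as dist

local_rank = int(os.environ.get("LOCAL_RANK", "0"))
if torch.cuda.is_available():
    # Bind this process to its GPU *before* any distributed init so NCCL
    # knows the device mapping and doesn't warn about it.
    torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")
```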
…ang forever and take up resources. (#1058) * Added timeouts. * Update scripts/train/debug/large_test_script.sh Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> --------- Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* Removed all tabs. * Formatted whitespace from mason.py.
* Updated LLMRayActor to run continuously.
* Set v1 = true.
* Added pause loop in update weights (see the sketch after this list).
* Set inflight updates false.
* Fixed condition.
* Update code.
* Another attempt at fixing deadlocks.
* Added logs.
* More logging to diagnose hang.
* Fixed busy-waiting.
* Updated the code.
* Added logging.
* Fixed prefetch.
* Updated code.
* Cleaned up PR.
* Cleaned up code.
* Fixed race condition.
* Removed logging from update_weight.
* Less logging.
* Cleaned up PR.
* Cleaned up PR.
* Cleaned up PR.
* Cleaned up PR.
* Rearranged timeout.
* Removes broken code.
* Undid changes to inflight.
* Updated code with a health check on the actors.
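A condensed sketch of the "pause generation while weights update" pattern mentioned above, using a threading.Event to block rather than busy-wait; the class and method names are illustrative, not the actor's real API:

```python
import threading

class GenerationLoop:
    def __init__(self) -> None:
        self._resume = threading.Event()
        self._resume.set()  # start unpaused

    def begin_weight_update(self) -> None:
        self._resume.clear()  # generation steps will block at wait()

    def end_weight_update(self) -> None:
        self._resume.set()

    def step(self) -> None:
        self._resume.wait()  # blocks (no busy-waiting) until weights are synced
        # ... run one generation step here ...
```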
* updated filtering script whoops remove things * add command * address comments * need list * return type
…t them vs the trainer logprobs in `grpo_fast.py`. (#1041)
* Now, llmrayactor returns logprobs.
* Updated code.
* Cleaned up PR.
* Cleaned up PR.
* Updated logprob code.
* Fixed code.
* Now uses nan.
* Now, we filter nans.
* Cleaned up code.
* Fixed tests.
* Updated code.
* Added vllm logprobs.
* Cleaned up code.
* Undo changes to script.
* Fixed bug in logprobs.
* Fixed failing tests.
* Added importance sampling ratio.
* Added back comment.
* Removed comment.
* Test config.
* Add truncated importance sampling with debug assertions to identify NaN source
  - Added truncated_importance_sampling_ratio_cap parameter (default 0.0)
  - Implemented importance sampling with comprehensive assertions
  - Added checks for INVALID_LOGPROB values and extreme logprob differences
  - Added NaN checks at each step of the calculation
  This will help identify exactly where NaNs are introduced when importance sampling is enabled.
  🤖 Generated with [Claude Code](https://claude.com/claude-code)
  Co-Authored-By: Claude <[email protected]>
* Fix NaN handling in truncated importance sampling
  - Add torch.nan_to_num to replace NaN values with INVALID_LOGPROB after collation
  - Query tokens in packed sequences are set to NaN by pack_sequences in rl_utils2.py
  - Apply nan_to_num in both training loop (line 1031) and old_logprobs calculation (line 987)
  - Implement proper importance sampling with masking:
    * Only apply IS where both logprobs are valid (not INVALID_LOGPROB)
    * Use response mask to ensure only response tokens are affected
    * Initialize importance ratio to 1.0 (neutral) for invalid positions
    * Clamp logprob differences to prevent numerical overflow
  - Remove all debug assertions that were causing failures
  - Ensure importance sampling only affects valid response token positions
  This fixes the 'NaN in mb_old_logprobs before IS' assertion error (a sketch of the masked, truncated ratio follows after this list).
  🤖 Generated with [Claude Code](https://claude.com/claude-code)
  Co-Authored-By: Claude <[email protected]>
* Added reverse kl.
* Simplified code.
* Changes to debug mask.
* More logging.
* More logging.
* Added comment describing.
* Add diagnostic logging to vllm_utils3 to detect logprobs length mismatch at source. This adds logging to check if vLLM CompletionOutput has mismatched token_ids and logprobs lengths. According to vLLM source analysis, all generated tokens should have logprobs, so N-1 behavior would indicate a bug.
  🤖 Generated with [Claude Code](https://claude.com/claude-code)
  Co-Authored-By: Claude <[email protected]>
* Fix vLLM logprobs N-1 mismatch by only appending EOS for empty responses. Root cause: we were unconditionally appending the EOS token when finish_reason='stop', but not appending a corresponding logprob. This created a len(response) = N+1 but len(logprobs) = N mismatch. Fix:
  - Only append EOS for truly empty responses (the actual edge case)
  - When we do append EOS, also append NaN to logprobs
  - Normal responses ending with </answer> no longer get EOS appended
  - vLLM returns N logprobs for N tokens, so no mismatch
  Also added assertions to verify masks match correctly. Updated diagnostic logging to treat length mismatches as errors rather than expected behavior.
  🤖 Generated with [Claude Code](https://claude.com/claude-code)
  Co-Authored-By: Claude <[email protected]>
* Address review comments.
* Updated scripts.
* Updated scripts.
* Cleaned up PR.
* Added assert.
---------
Co-authored-by: Claude <[email protected]>
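A sketch of the masked, truncated importance-sampling ratio referenced in the list above. Names such as `truncated_is_weights` and the specific clamp bounds are illustrative; only the overall recipe (cap the ratio, use neutral weights for invalid or NaN positions) comes from the commit messages:

```python
import torch

def truncated_is_weights(
    trainer_logprobs: torch.Tensor,
    sampler_logprobs: torch.Tensor,  # e.g. the logprobs returned by vLLM
    response_mask: torch.Tensor,
    ratio_cap: float = 2.0,
) -> torch.Tensor:
    log_ratio = (trainer_logprobs - sampler_logprobs).clamp(-20.0, 20.0)  # avoid overflow
    ratio = torch.exp(log_ratio).clamp(max=ratio_cap)                     # truncate the IS ratio
    # Neutral weight (1.0) for padding/query positions and any NaN logprobs.
    valid = response_mask.bool() & ~torch.isnan(ratio)
    return torch.where(valid, ratio, torch.ones_like(ratio))
```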
Draft PR for a branch -- No review needed