[SFT] Support context parallelism for SFT #132
Conversation
…and make FSDP wrapping policy conditional
verl/trainer/fsdp_sft_trainer.py
Outdated
@@ -165,6 +193,14 @@ def _build_model_optimizer(self):
trust_remote_code = self.config.model.trust_remote_code
# load config first
config = AutoConfig.from_pretrained(local_model_path, trust_remote_code=trust_remote_code)
if self.use_remove_padding:
    assert self.config.ulysses_sequence_parallel_size > 1, "Remove padding is only supported with sequence parallel"
I assume it should be the opposite? Sequence parallelism is only supported when remove_padding is enabled?
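For reference, the inverted check being suggested (and applied in the follow-up commit below) would look roughly like this — a sketch of the intent, not the exact patched code:

```python
# Sketch only: sequence parallelism depends on the remove-padding path,
# so the guard should run the other way around.
if self.config.ulysses_sequence_parallel_size > 1:
    assert self.use_remove_padding, "Sequence parallel is only supported when remove_padding is enabled"
```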
verl/trainer/fsdp_sft_trainer.py
Outdated
loss.backward()
if self.use_remove_padding and self.config.ulysses_sequence_parallel_size > 1:
    # micro_batch = micro_batch.to('cuda')
    loss = self._compute_loss_and_backward_sp(batch=micro_batch) / n_micro_batches
Can we combine the two functions? There is plenty of replicated code.
Nice work!
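A rough idea of what the requested merge could look like — the helper names below are hypothetical, not the methods that ended up in the trainer:

```python
def _compute_loss_and_backward(self, batch, n_micro_batches):
    """Single entry point for both paths, so the scaling and backward logic
    is written once instead of being duplicated (illustrative sketch only)."""
    use_sp = self.use_remove_padding and self.config.ulysses_sequence_parallel_size > 1
    if use_sp:
        loss = self._loss_sp_rmpad(batch)   # hypothetical: unpad + Ulysses SP forward
    else:
        loss = self._loss_padded(batch)     # hypothetical: standard padded forward
    loss = loss / n_micro_batches
    loss.backward()
    return loss
```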
1. Fix assertion logic - sequence parallel requires remove_padding
2. Combine loss computation functions to reduce code duplication
Could you run the formatting script?
done! @vermouth1992
There are conflicts with the newly merged MR.
# Add Sequence Parallelism and Padding Removal to SFT Trainer

This PR adds sequence parallelism (SP) and padding removal optimizations to the SFT trainer, which can help improve training efficiency for large language models.

## Key Changes

### Core Features

1. **Sequence Parallelism**: Added support for sequence parallelism through the Ulysses framework
   - Configurable via `ulysses_sequence_parallel_size` parameter
   - Properly handles data distribution across SP ranks
   - Maintains consistent loss computation across the distributed setup
2. **Padding Removal**: Added support for efficient handling of variable-length sequences
   - Enabled via `use_remove_padding` flag (requires SP to be enabled)
   - Uses flash-attention's padding removal utilities
   - Handles proper re-padding and loss computation
3. **Training Improvements**:
   - Added label smoothing support to loss computation
   - Added progress bar with epoch information
   - Added RoPE scaling configuration support
   - Improved error messages for batch size validation

### Testing

- Added comprehensive test suite (`test_trainer.py`) to verify:
  - Forward pass consistency between original and SP+rmpad implementations
  - Loss computation correctness across the distributed setup
  - Proper handling of micro-batches

### Example Usage

Added example script `examples/sft/gsm8k/run_qwen_05_sp2.sh` demonstrating how to use the new features with the Qwen-2.5B model.

## Implementation Details

- Uses device mesh for proper distributed training setup
- Handles data distribution ensuring the same sequences within SP groups but different sequences across DP groups
- Carefully manages backward pass timing with gradient checkpointing
- Maintains compatibility with existing FSDP features

## Testing Instructions

1. Run the example script with sequence parallelism:
   ```bash
   bash examples/sft/gsm8k/run_qwen_05_sp2.sh <nproc_per_node> <save_path>
   ```
2. Run the test suite:
   ```bash
   bash tests/sft/run_sft_sp_loss_match.sh
   ```

^^ These are the PR description generated by [OpenHands](https://github.com/All-Hands-AI/OpenHands)

---------

Co-authored-by: Jiayi Pan <[email protected]>
Co-authored-by: openhands <[email protected]>
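The "Implementation Details" point about data distribution (same sequences within an SP group, different sequences across DP groups) is easiest to see as a rank-to-shard mapping. A minimal sketch, assuming `torch.distributed` is already initialized; the function name is illustrative and not part of the PR:

```python
import torch.distributed as dist

def sp_aware_data_shard(sp_size: int):
    """Illustrative only: map each rank to a data shard when sequence parallelism is on.
    All sp_size ranks in an SP group read the same shard (each holds a slice of the
    sequence dimension), while different DP groups read different shards."""
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    assert world_size % sp_size == 0, "world size must be divisible by sp size"
    dp_size = world_size // sp_size   # number of independent data shards
    dp_rank = rank // sp_size         # shared by every rank in the same SP group
    return dp_rank, dp_size           # use as rank / num_replicas for the sampler
```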