[SFT] Support context parallelism for SFT #132

Merged
merged 68 commits into volcengine:main on Jan 27, 2025

Conversation

xingyaoww
Contributor

@xingyaoww xingyaoww commented Jan 25, 2025

Add Sequence Parallelism and Padding Removal to SFT Trainer

This PR adds sequence parallelism (SP) and padding removal optimizations to the SFT trainer, which can help improve training efficiency for large language models.

Key Changes

Core Features

  1. Sequence Parallelism: Added support for sequence parallelism through the Ulysses framework

    • Configurable via ulysses_sequence_parallel_size parameter
    • Properly handles data distribution across SP ranks
    • Maintains consistent loss computation across the distributed setup
  2. Padding Removal: Added support for efficient handling of variable-length sequences (a minimal sketch of the unpad/re-pad round trip follows this list)

    • Enabled via use_remove_padding flag (requires SP to be enabled)
    • Uses flash-attention's padding removal utilities
    • Handles proper re-padding and loss computation
  3. Training Improvements:

    • Added label smoothing support to loss computation
    • Added progress bar with epoch information
    • Added RoPE scaling configuration support
    • Improved error messages for batch size validation
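
For reference, here is a minimal, self-contained sketch of the padding-removal round trip using flash-attention's `bert_padding` helpers (unpad before the forward pass, re-pad before loss computation). The tensors are toy data and the variable names are illustrative, not the trainer's exact code:

```python
import torch
from flash_attn.bert_padding import unpad_input, pad_input

# Toy batch: 2 sequences padded to length 6 (attention_mask: 1 = real token, 0 = pad).
input_ids = torch.tensor([[5, 6, 7, 0, 0, 0],
                          [8, 9, 10, 11, 12, 0]])
attention_mask = torch.tensor([[1, 1, 1, 0, 0, 0],
                               [1, 1, 1, 1, 1, 0]])
batch_size, seqlen = input_ids.shape

# Remove padding: pack all real tokens into a single (total_nnz, 1) tensor.
# Newer flash-attn versions return an extra value, hence the *_ catch-all.
input_ids_rmpad, indices, cu_seqlens, max_seqlen, *_ = unpad_input(
    input_ids.unsqueeze(-1), attention_mask)
print(input_ids_rmpad.squeeze(-1))  # the 8 real tokens, packed: [5, 6, 7, 8, 9, 10, 11, 12]
print(cu_seqlens)                   # [0, 3, 8] -- cumulative sequence offsets for varlen attention

# A model would run on the packed tokens here (flash-attn varlen kernels consume
# cu_seqlens / max_seqlen). We simply re-pad the packed values to show the round
# trip back to (batch, seqlen, ...) before the loss is computed.
repadded = pad_input(input_ids_rmpad, indices, batch_size, seqlen)
assert torch.equal(repadded.squeeze(-1), input_ids)
```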

Testing

  • Added comprehensive test suite (test_trainer.py) to verify:
    • Forward pass consistency between original and SP+rmpad implementations
    • Loss computation correctness across the distributed setup
    • Proper handling of micro-batches (a toy sketch of the micro-batch loss pattern follows this list)
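
As a rough illustration of what the loss-correctness and micro-batch checks are about, here is a self-contained toy of the underlying pattern: label-smoothed cross-entropy accumulated over equal-sized micro-batches, each micro-batch loss scaled by 1/n_micro_batches so the accumulated gradient matches a single full-batch step. This uses a toy model and data, not the trainer's code:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
model = torch.nn.Linear(16, 8)            # toy stand-in for the language model
batch_x = torch.randn(4, 16)
batch_y = torch.randint(0, 8, (4,))

n_micro_batches = 2
model.zero_grad()
for mb_x, mb_y in zip(batch_x.chunk(n_micro_batches), batch_y.chunk(n_micro_batches)):
    logits = model(mb_x)
    # label_smoothing is a standard F.cross_entropy option; dividing by the number
    # of (equal-sized) micro-batches makes the accumulated gradients equal to those
    # of one forward/backward over the full batch.
    loss = F.cross_entropy(logits, mb_y, label_smoothing=0.1) / n_micro_batches
    loss.backward()
# optimizer.step() would follow once all micro-batches have been accumulated.
```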

Example Usage

Added example script examples/sft/gsm8k/run_qwen_05_sp2.sh demonstrating how to use the new features with a Qwen2.5-0.5B model.

Implementation Details

  • Uses device mesh for proper distributed training setup
  • Handles data distribution so that ranks within the same SP group receive the same sequences while different DP groups receive different data (see the sketch after this list)
  • Carefully manages backward pass timing with gradient checkpointing
  • Maintains compatibility with existing FSDP features
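
To make the data-distribution rule concrete, here is a small sketch of one way to set it up: a 2-D device mesh plus a DistributedSampler keyed on the DP rank only. The mesh dimension names, the toy dataset, and the sampler wiring are illustrative assumptions, not necessarily the trainer's exact implementation:

```python
# Launch with: torchrun --nproc_per_node=8 this_sketch.py
import os
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.utils.data import DistributedSampler, TensorDataset

dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

sp_size = 2                                  # e.g. ulysses_sequence_parallel_size=2
dp_size = dist.get_world_size() // sp_size

# 2-D mesh: outer dimension = data parallel, inner dimension = sequence parallel.
# In a real trainer this mesh would be handed to the FSDP / Ulysses setup; here it
# only fixes the rank layout used below.
mesh = init_device_mesh("cuda", mesh_shape=(dp_size, sp_size), mesh_dim_names=("dp", "sp"))

# With this layout, global rank r belongs to DP group r // sp_size and holds SP rank r % sp_size.
dp_rank = dist.get_rank() // sp_size

# Shard the dataset by DP rank only: both ranks of an SP group load the same
# sequences (each later processes its own slice of every sequence), while
# different DP groups load different sequences.
train_dataset = TensorDataset(torch.arange(1024))  # toy stand-in for the SFT dataset
sampler = DistributedSampler(train_dataset, num_replicas=dp_size, rank=dp_rank, shuffle=True)
```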

Testing Instructions

  1. Run the example script with sequence parallelism:
     bash examples/sft/gsm8k/run_qwen_05_sp2.sh <nproc_per_node> <save_path>
  2. Run the test suite:
     bash tests/sft/run_sft_sp_loss_match.sh

^^ This PR description was generated by OpenHands

xingyaoww and others added 30 commits January 15, 2025 19:22
@xingyaoww
Contributor Author

OK, the training script is working now!

[screenshot]

@xingyaoww
Contributor Author

This PR should be ready for review! I just added a CI check that verifies the loss match.

[screenshot]

@xingyaoww xingyaoww marked this pull request as ready for review January 25, 2025 20:23
@xingyaoww xingyaoww changed the title [WIP, SFT] Support context parallelism for SFT [SFT] Support context parallelism for SFT Jan 25, 2025
@@ -165,6 +193,14 @@ def _build_model_optimizer(self):
        trust_remote_code = self.config.model.trust_remote_code
        # load config first
        config = AutoConfig.from_pretrained(local_model_path, trust_remote_code=trust_remote_code)
        if self.use_remove_padding:
            assert self.config.ulysses_sequence_parallel_size > 1, "Remove padding is only supported with sequence parallel"
Collaborator
I assume it should be the opposite? Sequence parallelism is only supported when remove_padding is enabled?

loss.backward()
if self.use_remove_padding and self.config.ulysses_sequence_parallel_size > 1:
    # micro_batch = micro_batch.to('cuda')
    loss = self._compute_loss_and_backward_sp(batch=micro_batch) / n_micro_batches
Collaborator

Can we combine the two functions? There is a lot of replicated code.

@vermouth1992
Collaborator

Nice work!

@vermouth1992
Collaborator

Could you run the formatting script?

@xingyaoww
Contributor Author

done! @vermouth1992

@vermouth1992
Collaborator

There are conflicts with the newly merged MR.

@vermouth1992 vermouth1992 merged commit 077173f into volcengine:main Jan 27, 2025
10 checks passed
Chendong98 pushed a commit to Chendong98/verl that referenced this pull request Feb 4, 2025