
[misc][Long Context] feat: support ulysses for long context training #109

Merged
54 commits merged into volcengine:main
Jan 18, 2025

Conversation

PeterSH6
Collaborator

  • To support Ulysses, we implemented an FSDPUlyssesShardingManager to manage the SP (sequence-parallel) states of the different models, and we use a device mesh to manage the SP process groups.
  • Long-context training is supported through a monkey patch. Currently we support the Llama and Qwen2 architectures; support for other models will follow.
  • Before the model forward pass, we pad input_ids so the sequence length is divisible by the SP size and then slice input_ids across SP ranks. The position_ids are only padded, not sliced, so that the position embeddings match the qkv_states. This can be optimized.
  • Set shuffle to False in the mini_batch_iterator.
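The pad-then-slice step described above can be sketched as follows. This is a minimal illustration of the arithmetic only, with hypothetical helper names (`pad_to_multiple`, `slice_for_rank`); the actual verl implementation operates on tensors and differs in detail.

```python
def pad_to_multiple(ids, sp_size, pad_id=0):
    """Right-pad a token list so its length is divisible by sp_size."""
    rem = len(ids) % sp_size
    if rem:
        ids = ids + [pad_id] * (sp_size - rem)
    return ids

def slice_for_rank(ids, sp_rank, sp_size):
    """Each SP rank keeps only its contiguous chunk of the padded sequence."""
    chunk = len(ids) // sp_size
    return ids[sp_rank * chunk:(sp_rank + 1) * chunk]

seq = list(range(7))                                   # 7 tokens
padded = pad_to_multiple(seq, sp_size=4)               # padded to length 8
local = slice_for_rank(padded, sp_rank=1, sp_size=4)   # rank 1 keeps [2, 3]
```

Note that in the PR the position_ids would be passed through `pad_to_multiple` but not `slice_for_rank`, so each rank's sliced hidden states still line up with full-length position embeddings.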

@PeterSH6 PeterSH6 changed the title [misc] feat: support ulysses for long context training [misc][Long Context] feat: support ulysses for long context training Jan 16, 2025
@PeterSH6 PeterSH6 marked this pull request as ready for review January 17, 2025 13:47
@PeterSH6
Collaborator Author

Almost finished.

I wonder what kind of examples we should add. We can add some scripts in a follow-up PR.

@vermouth1992 vermouth1992 merged commit e8eb9e4 into volcengine:main Jan 18, 2025
8 checks passed
@PeterSH6 PeterSH6 mentioned this pull request Jan 16, 2025
@xingyaoww
Contributor

Quick question @PeterSH6: does this Ulysses PR support gradient checkpointing?

I'm trying to use the context parallelism implemented here for SFT, but I keep running into a shape-mismatch error during .backward() (not during forward). I'm not sure whether it's because this implementation doesn't support gradient checkpointing yet.

(screenshot of the error omitted)

@xingyaoww
Contributor

xingyaoww commented Jan 23, 2025

Yes, I'm able to reproduce it: with Ulysses context parallelism enabled, setting gradient_checkpointing_enable to False makes everything work, while turning it on produces the indexing error above.

Never mind, I figured it out: it happens when you call loss.backward() outside the `FSDPUlyssesShardingManager` context. The sequence-parallel info is then unavailable, so the patched forward function won't gather sequences correctly, hence this error.
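The failure mode above can be illustrated with a toy context manager. This is not verl's actual API, just a sketch of the pattern: the patched forward only sees sequence-parallel state while the sharding-manager context is active, so both the forward pass and loss.backward() must run inside it.

```python
from contextlib import contextmanager

# Stand-in for the SP process-group state the real sharding manager registers.
_SP_STATE = {"active": False}

@contextmanager
def ulysses_sharding_manager():
    """Toy analogue of FSDPUlyssesShardingManager: exposes SP state on entry."""
    _SP_STATE["active"] = True
    try:
        yield
    finally:
        _SP_STATE["active"] = False

def patched_forward():
    """Toy patched forward: returns whether SP state is visible.

    The real patched forward would use this state to all-to-all the
    attention heads; without it, gathered shapes no longer match.
    """
    return _SP_STATE["active"]

with ulysses_sharding_manager():
    ok_inside = patched_forward()   # SP info available here

ok_outside = patched_forward()      # mirrors calling backward() too late
```

In the real trainer the fix is simply to keep loss.backward() inside the sharding-manager `with` block rather than after it.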

@PeterSH6
Collaborator Author

@xingyaoww Cool! So you implemented Ulysses in the SFT trainer?

@xingyaoww
Contributor

xingyaoww commented Jan 24, 2025

@PeterSH6 yep! Most of the changes are here (along with a lot of unrelated changes, e.g. LoRA):
https://github.com/xingyaoww/verl/commits/dev

I'm still testing it :) but so far it seems to work pretty well.

I can send a PR later.

@PeterSH6
Collaborator Author

@xingyaoww It would be really nice.

I've seen your LoRA PR. It looks great.
It would be even better if you could create a PR for Ulysses + rmpad in SFTTrainer.
We really appreciate your effort!

@xingyaoww
Contributor

@PeterSH6 definitely, a draft PR is up in #132.

I'll clean up the code there once LoRA is merged :)
