
USP latency test #416

Open

eppane opened this issue Dec 27, 2024 · 2 comments

eppane commented Dec 27, 2024

Hello,

On an 8xH100 80GB node, when running:

python benchmark/usp_latency_test.py --model_id black-forest-labs/FLUX.1-dev --script examples/flux_usp_example.py --sizes 1024 --n_gpus 8 --steps 25

I get the following results:

Running test for size 1024, ulysses_degree 1, ring_degree 8
epoch time: 155.80 sec, parameter memory: 33.76 GB, memory: 34.02 GB
Running test for size 1024, ulysses_degree 2, ring_degree 4
epoch time: 127.65 sec, parameter memory: 33.76 GB, memory: 34.02 GB
Running test for size 1024, ulysses_degree 4, ring_degree 2
epoch time: 129.35 sec, parameter memory: 33.76 GB, memory: 34.02 GB
Running test for size 1024, ulysses_degree 8, ring_degree 1
epoch time: 154.19 sec, parameter memory: 33.76 GB, memory: 34.02 GB

Meanwhile, running flux_example.py instead of flux_usp_example.py (is this even the intended usage?):

python benchmark/usp_latency_test.py --model_id black-forest-labs/FLUX.1-dev --script examples/flux_example.py --sizes 1024 --n_gpus 8 --steps 25

produces:

Running test for size 1024, ulysses_degree 1, ring_degree 8
epoch time: 4.34 sec, parameter memory: 33.76 GB, memory: 33.87 GB
Running test for size 1024, ulysses_degree 2, ring_degree 4
epoch time: 3.55 sec, parameter memory: 33.76 GB, memory: 33.90 GB
Running test for size 1024, ulysses_degree 4, ring_degree 2
epoch time: 2.81 sec, parameter memory: 33.76 GB, memory: 33.89 GB
Running test for size 1024, ulysses_degree 8, ring_degree 1
epoch time: 2.48 sec, parameter memory: 33.76 GB, memory: 33.87 GB

@feifeibear, are the scripts and timings working as expected? In the latter case, the actual wall-clock time is much closer to the former case's numbers than to a couple of seconds. Did you use the same scripts and timing points to produce the results in performance/flux.md?
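
For context on the timing points, here is a minimal sketch of how I would expect the latency to be measured, assuming a diffusers-style pipeline object named pipe already loaded on the GPUs (hypothetical names, not the actual benchmark code): a warmup call first, then CUDA synchronization around the measured call so that compilation and caching are excluded.

import time
import torch

# Hypothetical timing sketch, not the code from usp_latency_test.py.
pipe(prompt="A small cat", num_inference_steps=25)   # warmup: compiles kernels, fills caches
torch.cuda.synchronize()
start = time.time()
pipe(prompt="A small cat", num_inference_steps=25)   # measured run
torch.cuda.synchronize()
print(f"epoch time: {time.time() - start:.2f} sec")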

Another example (without and with torch.compile, default mode):

torchrun --nproc_per_node=8 examples/flux_usp_example.py --model black-forest-labs/FLUX.1-dev --prompt "A small cat" --seed 42 
--height 1024 --width 1024 --num_inference_steps 25 --max_sequence_length 256 --no_use_resolution_binning --warmup_steps 1
--ulysses_degree 8 --ring_degree 1

epoch time: 155.74 sec, parameter memory: 33.76 GB, memory: 36.35 GB

torchrun --nproc_per_node=8 examples/flux_usp_example.py --model black-forest-labs/FLUX.1-dev --prompt "A small cat" --seed 42 
--height 1024 --width 1024 --num_inference_steps 25 --max_sequence_length 256 --no_use_resolution_binning --warmup_steps 1
--ulysses_degree 8 --ring_degree 1 --use_torch_compile

epoch time: 1.77 sec, parameter memory: 33.76 GB, memory: 36.35 GB
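
For reference, a hedged sketch of how torch.compile is typically applied to this kind of pipeline (the exact integration in flux_usp_example.py may differ); the first call pays the compilation cost, which is why the warmup handling matters for the reported epoch time:

import torch
from diffusers import FluxPipeline

# Hypothetical integration point; the example script may compile a different
# module or pass different options.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")
pipe.transformer = torch.compile(pipe.transformer)

pipe("A small cat", height=1024, width=1024, num_inference_steps=25)  # slow: compiles
pipe("A small cat", height=1024, width=1024, num_inference_steps=25)  # fast: compiled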

xibosun (Collaborator) commented Dec 30, 2024

When we conduct experiments on the H100 with CUDA 12 and Torch 2.5.1, both flux_example.py and flux_usp_example.py (with and without torch.compile) show comparable performance, and inference consistently completes within a few seconds across all configurations. What are the versions of CUDA Runtime and torch on your machine?
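
One quick way to report both, assuming a standard torch install:

python -c "import torch; print(torch.__version__, torch.version.cuda)"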

eppane (Author) commented Dec 31, 2024

Hello @xibosun! Interesting, I am using 2.5.1+cu124. If you run the same commands as above, do you get different outputs?

I also have:

xfuser                    0.4.0                # from source, latest commit 57eb27f
transformers              4.47.1
diffusers                 0.33.0.dev0          # from source, latest commit 83da817
flash-attn                2.7.2.post1
