
USP latency test #416

Open

eppane opened this issue Dec 27, 2024 · 2 comments

eppane commented Dec 27, 2024

Hello,

On an 8xH100 80GB node, when running:

python benchmark/usp_latency_test.py --model_id black-forest-labs/FLUX.1-dev --script examples/flux_usp_example.py --sizes 1024 --n_gpus 8 --steps 25

I get the following results:

Running test for size 1024, ulysses_degree 1, ring_degree 8
epoch time: 155.80 sec, parameter memory: 33.76 GB, memory: 34.02 GB
Running test for size 1024, ulysses_degree 2, ring_degree 4
epoch time: 127.65 sec, parameter memory: 33.76 GB, memory: 34.02 GB
Running test for size 1024, ulysses_degree 4, ring_degree 2
epoch time: 129.35 sec, parameter memory: 33.76 GB, memory: 34.02 GB
Running test for size 1024, ulysses_degree 8, ring_degree 1
epoch time: 154.19 sec, parameter memory: 33.76 GB, memory: 34.02 GB

Meanwhile, running flux_example.py instead of flux_usp_example.py (is this even the intended usage?):

python benchmark/usp_latency_test.py --model_id black-forest-labs/FLUX.1-dev --script examples/flux_example.py --sizes 1024 --n_gpus 8 --steps 25

produces:

Running test for size 1024, ulysses_degree 1, ring_degree 8
epoch time: 4.34 sec, parameter memory: 33.76 GB, memory: 33.87 GB
Running test for size 1024, ulysses_degree 2, ring_degree 4
epoch time: 3.55 sec, parameter memory: 33.76 GB, memory: 33.90 GB
Running test for size 1024, ulysses_degree 4, ring_degree 2
epoch time: 2.81 sec, parameter memory: 33.76 GB, memory: 33.89 GB
Running test for size 1024, ulysses_degree 8, ring_degree 1
epoch time: 2.48 sec, parameter memory: 33.76 GB, memory: 33.87 GB

@feifeibear, are the scripts and timings working as expected? In the latter case, the actual wall-clock time is much closer to the former case's numbers than to a couple of seconds. Did you use the same scripts and timing points to produce the results in performance/flux.md?
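
For context on the timing points, here is a minimal sketch of how I would expect the latency to be measured, assuming a diffusers-style pipeline object named pipe already loaded on the GPUs (hypothetical names, not the actual benchmark code): a warmup call first, then CUDA synchronization around the measured call so that compilation and caching are excluded.

import time
import torch

# Hypothetical timing sketch, not the code from usp_latency_test.py.
pipe(prompt="A small cat", num_inference_steps=25)   # warmup: compiles kernels, fills caches
torch.cuda.synchronize()
start = time.time()
pipe(prompt="A small cat", num_inference_steps=25)   # measured run
torch.cuda.synchronize()
print(f"epoch time: {time.time() - start:.2f} sec")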

Another example (without and with torch.compile, default mode):

torchrun --nproc_per_node=8 examples/flux_usp_example.py --model black-forest-labs/FLUX.1-dev --prompt "A small cat" --seed 42 
--height 1024 --width 1024 --num_inference_steps 25 --max_sequence_length 256 --no_use_resolution_binning --warmup_steps 1
--ulysses_degree 8 --ring_degree 1

epoch time: 155.74 sec, parameter memory: 33.76 GB, memory: 36.35 GB

torchrun --nproc_per_node=8 examples/flux_usp_example.py --model black-forest-labs/FLUX.1-dev --prompt "A small cat" --seed 42 
--height 1024 --width 1024 --num_inference_steps 25 --max_sequence_length 256 --no_use_resolution_binning --warmup_steps 1
--ulysses_degree 8 --ring_degree 1 --use_torch_compile

epoch time: 1.77 sec, parameter memory: 33.76 GB, memory: 36.35 GB
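
For reference, a hedged sketch of how torch.compile is typically applied to this kind of pipeline (the exact integration in flux_usp_example.py may differ); the first call pays the compilation cost, which is why the warmup handling matters for the reported epoch time:

import torch
from diffusers import FluxPipeline

# Hypothetical integration point; the example script may compile a different
# module or pass different options.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")
pipe.transformer = torch.compile(pipe.transformer)

pipe("A small cat", height=1024, width=1024, num_inference_steps=25)  # slow: compiles
pipe("A small cat", height=1024, width=1024, num_inference_steps=25)  # fast: compiled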

xibosun (Collaborator) commented Dec 30, 2024

When we conduct experiments on the H100 with CUDA 12 and Torch 2.5.1, both flux_example.py and flux_usp_example.py (with and without torch.compile) show comparable performance, and inference consistently completes within a few seconds across all configurations. What are the versions of CUDA Runtime and torch on your machine?
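
One quick way to report both, assuming a standard torch install:

python -c "import torch; print(torch.__version__, torch.version.cuda)"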

eppane (Author) commented Dec 31, 2024

Hello @xibosun! Interesting, I am using 2.5.1+cu124. If you run the same commands as above, do you get different outputs?

I also have:

xfuser                    0.4.0                # from source, latest commit 57eb27f
transformers              4.47.1
diffusers                 0.33.0.dev0          # from source, latest commit 83da817
flash-attn                2.7.2.post1
