Hello everyone,
I'm benchmarking FLUX.1 [dev] with xDiT. Comparing my results with those presented here, I see significant differences: my numbers are much worse.
Here is the table with two new columns corresponding to the results of my experiments:
| Configuration | PyTorch (s) | torch.compile (s) | Mine: PyTorch (s) | Mine: torch.compile (s) |
| --- | --- | --- | --- | --- |
| 1 GPU | 6.71 | 4.30 | 6.27 | 3.88 |
| Ulysses-2 | 4.38 | 2.68 | 5.05 | 4.01 |
| Ring-2 | 5.31 | 2.60 | 5.01 | 3.70 |
| Ulysses-2 x Ring-2 | 5.19 | 1.80 | 3.23 | 2.85 |
| Ulysses-4 | 4.24 | 1.63 | 2.96 | 2.21 |
| Ring-4 | 5.11 | 1.98 | 3.70 | 3.05 |
My results are quite different, especially with `torch.compile`. Do you have any idea what could explain this?
Environment:
- I measure only the time taken by the `pipe` call itself; the table reports the average time per GPU.
- The H100s (SXM5) I use provide an intra-node NVLink bandwidth of ~900 GB/s.
- xfuser==0.3.4
- torch==2.5.0
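For context, this is roughly how I time the `pipe` call (a minimal sketch; `run_pipe` stands in for the actual pipeline call, and `sync` is where I pass `torch.cuda.synchronize` so the clock only stops once all queued GPU kernels have finished):

```python
import time

def time_pipe(run_pipe, sync=lambda: None, n_warmup=1, n_runs=5):
    """Average wall-clock seconds per call of run_pipe, excluding warmup.

    sync: barrier invoked before starting and stopping the clock; pass
    torch.cuda.synchronize when timing GPU work, since CUDA launches are
    asynchronous and the Python call can return before kernels finish.
    """
    for _ in range(n_warmup):
        run_pipe()
    sync()  # ensure warmup work has completed before starting the clock
    start = time.perf_counter()
    for _ in range(n_runs):
        run_pipe()
    sync()  # wait for all in-flight work before stopping the clock
    return (time.perf_counter() - start) / n_runs
```

Without the synchronize barrier the measured time can look artificially good, which is one thing I double-checked when comparing against the published numbers.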
Options:
- num_inference_steps 28
- height 1024
- width 1024
- no_use_resolution_binning
- warmup_steps 1
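For reference, my invocation looks roughly like this (a sketch: the script path, `--nproc_per_node`, and the `--ulysses_degree`/`--ring_degree` flags are my assumptions about the xDiT example scripts; only the option values above are from my actual setup):

```shell
# Hypothetical xDiT Flux benchmark invocation (script name and parallelism
# flags assumed); shown here for the Ulysses-2 x Ring-2 configuration.
torchrun --nproc_per_node=4 examples/flux_example.py \
    --ulysses_degree 2 \
    --ring_degree 2 \
    --num_inference_steps 28 \
    --height 1024 \
    --width 1024 \
    --no_use_resolution_binning \
    --warmup_steps 1
```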