Hello everyone,
I'm benchmarking FLUX.1 [dev] with xDiT. Comparing my results with those presented here, I see significant differences: my numbers are much worse.
Here is the table with two new columns corresponding to the results of my experiments:
| Configuration | PyTorch (s) | torch.compile (s) | Mine: PyTorch (s) | Mine: torch.compile (s) |
| --- | --- | --- | --- | --- |
| 1 GPU | 6.71 | 4.30 | 6.27 | 3.88 |
| Ulysses-2 | 4.38 | 2.68 | 5.05 | 4.01 |
| Ring-2 | 5.31 | 2.60 | 5.01 | 3.70 |
| Ulysses-2 x Ring-2 | 5.19 | 1.80 | 3.23 | 2.85 |
| Ulysses-4 | 4.24 | 1.63 | 2.96 | 2.21 |
| Ring-4 | 5.11 | 1.98 | 3.70 | 3.05 |
My results are quite different, especially with `torch.compile`. Do you have any idea what could explain this?
Environment:
- I measure only the time taken by the `pipe` call itself; the table reports the average time per GPU.
- The H100s (SXM5) I use provide an intra-node NVLink bandwidth of ~900 GB/s.
- xfuser==0.3.4
- torch==2.5.0
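For context, this is roughly how I time the `pipe` call (a minimal sketch; `run_pipe` stands in for the actual pipeline call, and `sync` is where I pass `torch.cuda.synchronize` so the clock only stops once all queued GPU kernels have finished):

```python
import time

def time_pipe(run_pipe, sync=lambda: None, n_warmup=1, n_runs=5):
    """Average wall-clock seconds per call of run_pipe, excluding warmup.

    sync: barrier invoked before starting and stopping the clock; pass
    torch.cuda.synchronize when timing GPU work, since CUDA launches are
    asynchronous and the Python call can return before kernels finish.
    """
    for _ in range(n_warmup):
        run_pipe()
    sync()  # ensure warmup work has completed before starting the clock
    start = time.perf_counter()
    for _ in range(n_runs):
        run_pipe()
    sync()  # wait for all in-flight work before stopping the clock
    return (time.perf_counter() - start) / n_runs
```

Without the synchronize barrier the measured time can look artificially good, which is one thing I double-checked when comparing against the published numbers.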
Options:
- num_inference_steps 28
- height 1024
- width 1024
- no_use_resolution_binning
- warmup_steps 1
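For reference, my invocation looks roughly like this (a sketch: the script path, `--nproc_per_node`, and the `--ulysses_degree`/`--ring_degree` flags are my assumptions about the xDiT example scripts; only the option values above are from my actual setup):

```shell
# Hypothetical xDiT Flux benchmark invocation (script name and parallelism
# flags assumed); shown here for the Ulysses-2 x Ring-2 configuration.
torchrun --nproc_per_node=4 examples/flux_example.py \
    --ulysses_degree 2 \
    --ring_degree 2 \
    --num_inference_steps 28 \
    --height 1024 \
    --width 1024 \
    --no_use_resolution_binning \
    --warmup_steps 1
```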