Running test for size 1024, ulysses_degree 1, ring_degree 8
epoch time: 4.34 sec, parameter memory: 33.76 GB, memory: 33.87 GB
Running test for size 1024, ulysses_degree 2, ring_degree 4
epoch time: 3.55 sec, parameter memory: 33.76 GB, memory: 33.90 GB
Running test for size 1024, ulysses_degree 4, ring_degree 2
epoch time: 2.81 sec, parameter memory: 33.76 GB, memory: 33.89 GB
Running test for size 1024, ulysses_degree 8, ring_degree 1
epoch time: 2.48 sec, parameter memory: 33.76 GB, memory: 33.87 GB
@feifeibear are the scripts and timings working as expected? In the latter case, the true wall-clock time is much closer to the former case, rather than a couple of seconds. Did you use the same scripts and timing points for producing the results in performance/flux.md?
Another example (without and with torch.compile, default mode):
In our experiments on an H100 with CUDA 12 and torch 2.5.1, both flux_example.py and flux_usp_example.py (with and without torch.compile) show comparable performance: inference consistently completes within a few seconds in all configurations. What are the versions of the CUDA runtime and torch on your machine?
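Since the discrepancy above may come from where the timestamps are taken rather than from the scripts themselves, here is a minimal sketch of a fair per-epoch timing loop. The helper name `timed_epochs` and the CPU stand-in workload are hypothetical; the key point is that with asynchronous GPU execution, a synchronization hook (e.g. `torch.cuda.synchronize`) must run before each timestamp, otherwise queued kernels make the timed region look far faster than it really is:

```python
import time

def timed_epochs(step, n_epochs=3, sync=None):
    """Time each epoch of `step`.

    `sync` is an optional hook (e.g. torch.cuda.synchronize) that must be
    called before reading each timestamp so that asynchronously queued GPU
    work has actually finished when the clock is read.
    """
    times = []
    for _ in range(n_epochs):
        if sync:
            sync()  # drain any previously queued GPU work
        t0 = time.perf_counter()
        step()
        if sync:
            sync()  # ensure this epoch's kernels have completed
        times.append(time.perf_counter() - t0)
    return times

# Hypothetical usage with a CPU stand-in for the diffusion step;
# on GPU one would pass sync=torch.cuda.synchronize.
times = timed_epochs(lambda: sum(i * i for i in range(100_000)))
print(len(times))  # 3
```

If the two scripts place their timestamps differently relative to synchronization, that alone could explain a "few seconds" figure in one and a much larger true time in the other.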
Hello,
On an 8xH100 80GB node, when running:
I get the following results:
Meanwhile, running flux_example.py instead of flux_usp_example.py (is this even intended usage?) produces: