
Reasons for inference speed difference with diffusers on single GPU without parallelism? #392

Open
xyyan0123 opened this issue Dec 13, 2024 · 2 comments

Comments


xyyan0123 commented Dec 13, 2024

Hi, I'm trying to run Stable Diffusion 3 with xDiT on a single 4090 via the following command. I set all parallelism-related parameters to 1 to disable parallelism, and I got epoch time: 7.96 sec, parameter memory: 17.58 GB, peak memory: 20.16 GB.

torchrun --nproc_per_node=1 examples/sd3_example.py \
--model "stabilityai/stable-diffusion-3-medium-diffusers" \
--pipefusion_parallel_degree 1 --ulysses_degree 1 \
--data_parallel_degree 1 --ring_degree 1 --tensor_parallel_degree 1 \
--num_inference_steps 50 --warmup_steps 0 \
--prompt "A sign that reads Raining Cats and Dogs with a dog smiling and wagging its tail."

However, I found that inference with diffusers alone is slower than with xDiT. I wrote the code below and measured 9.28 s. This makes me wonder whether some optimization is applied even in the single-GPU, non-parallel case. I also tried different numbers of inference steps, but there is always a gap of around 1.4 seconds. So my question is: what is the reason for this difference?

import torch
from diffusers import StableDiffusion3Pipeline
import time

# Load SD3 medium in fp16 and move it to the GPU.
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

# Time a single 50-step generation end to end.
start_time = time.time()
image = pipe(
    "A sign that reads Raining Cats and Dogs with a dog smiling and wagging its tail.",
    negative_prompt="",
    num_inference_steps=50,
    guidance_scale=7.0,
).images[0]
end_time = time.time()
print(f"inference time: {end_time - start_time:.2f} s")

image.save("sign.png")
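
A constant gap that does not grow with the number of steps usually points at a one-time startup cost being counted in the wall-clock time. Below is a minimal sketch of a warmup-and-synchronize variant of the benchmark; the short warmup call and the explicit torch.cuda.synchronize() are illustrative additions, not part of the original script:

import time

import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

prompt = "A sign that reads Raining Cats and Dogs with a dog smiling and wagging its tail."

# Untimed warmup run: absorbs one-time costs (CUDA context creation,
# kernel autotuning, allocator pool growth) that would otherwise be
# charged to the first timed generation.
pipe(prompt, negative_prompt="", num_inference_steps=2, guidance_scale=7.0)

# Make sure all queued GPU work has finished before starting the clock.
torch.cuda.synchronize()
start_time = time.time()
image = pipe(
    prompt,
    negative_prompt="",
    num_inference_steps=50,
    guidance_scale=7.0,
).images[0]
torch.cuda.synchronize()
print(f"inference time: {time.time() - start_time:.2f} s")

If the gap shrinks with the warmup in place, the two scripts were likely measuring different things (a warmed-up loop versus a cold-start wall clock).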
feifeibear (Collaborator) commented Dec 13, 2024

They should be the same; no quantization or other optimization is applied. When the parallel degree is 1, it uses the diffusers code.
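
One way to rule out configuration differences on the diffusers side (a hypothetical debugging aid, not something provided by xDiT) is to print the pipeline's scheduler config and module dtypes and compare them against whatever the xDiT example constructs:

import torch
from diffusers import StableDiffusion3Pipeline

# Build the pipeline exactly as in the script above.
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16
)

# Compare these against the xDiT example's setup: if the scheduler or
# dtypes differ, the two timings are not directly comparable.
print(pipe.scheduler.config)
print(pipe.transformer.dtype, pipe.vae.dtype, pipe.text_encoder.dtype)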

xyyan0123 (Author) commented

Yeah, I know it "should be the same". But I tested on multiple machines with different GPUs, and the gap still exists. I don't know why this happens. Is there something I did wrong? I have reformatted my script above. Could you help me?
