
Reasons for inference speed difference with diffusers on single GPU without parallelism? #392

Open
xyyan0123 opened this issue Dec 13, 2024 · 2 comments

Comments


xyyan0123 commented Dec 13, 2024

Hi, I'm trying to run Stable Diffusion 3 with xDiT on a single 4090 via the following command. I set all parallelism-related parameters to 1 to disable parallelism, and I got epoch time: 7.96 sec, parameter memory: 17.58 GB, peak memory: 20.16 GB.

torchrun --nproc_per_node=1 examples/sd3_example.py \
--model "stabilityai/stable-diffusion-3-medium-diffusers" \
--pipefusion_parallel_degree 1 --ulysses_degree 1 \
--data_parallel_degree 1 --ring_degree 1 --tensor_parallel_degree 1 \
--num_inference_steps 50 --warmup_steps 0 \
--prompt "A sign that reads Raining Cats and Dogs with a dog smiling and wagging its tail."

However, I found that inference with diffusers alone is slower than with xDiT. I wrote the code below and measured 9.28 s. This makes me wonder whether some optimization is applied even in the single-GPU, non-parallel case. I also tried different numbers of inference steps, but there is always a gap of around 1.4 seconds. So my question is: what is the reason for this difference?

import torch
from diffusers import StableDiffusion3Pipeline
import time

# Load SD3 medium in fp16 and move it to the GPU.
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

# Time a single 50-step generation end to end.
start_time = time.time()
image = pipe(
    "A sign that reads Raining Cats and Dogs with a dog smiling and wagging its tail.",
    negative_prompt="",
    num_inference_steps=50,
    guidance_scale=7.0,
).images[0]
end_time = time.time()
print(f"inference time: {end_time - start_time:.2f} s")

image.save("sign.png")
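
A constant gap that does not grow with the number of steps usually points at a one-time startup cost being counted in the wall-clock time. Below is a minimal sketch of a warmup-and-synchronize variant of the benchmark; the short warmup call and the explicit torch.cuda.synchronize() are illustrative additions, not part of the original script:

import time

import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

prompt = "A sign that reads Raining Cats and Dogs with a dog smiling and wagging its tail."

# Untimed warmup run: absorbs one-time costs (CUDA context creation,
# kernel autotuning, allocator pool growth) that would otherwise be
# charged to the first timed generation.
pipe(prompt, negative_prompt="", num_inference_steps=2, guidance_scale=7.0)

# Make sure all queued GPU work has finished before starting the clock.
torch.cuda.synchronize()
start_time = time.time()
image = pipe(
    prompt,
    negative_prompt="",
    num_inference_steps=50,
    guidance_scale=7.0,
).images[0]
torch.cuda.synchronize()
print(f"inference time: {time.time() - start_time:.2f} s")

If the gap shrinks with the warmup in place, the two scripts were likely measuring different things (a warmed-up loop versus a cold-start wall clock).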
feifeibear (Collaborator) commented Dec 13, 2024

They should be the same; no quantization or other optimization is applied. When the parallel degree is 1, it uses the diffusers code.
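
One way to rule out configuration differences on the diffusers side (a hypothetical debugging aid, not something provided by xDiT) is to print the pipeline's scheduler config and module dtypes and compare them against whatever the xDiT example constructs:

import torch
from diffusers import StableDiffusion3Pipeline

# Build the pipeline exactly as in the script above.
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16
)

# Compare these against the xDiT example's setup: if the scheduler or
# dtypes differ, the two timings are not directly comparable.
print(pipe.scheduler.config)
print(pipe.transformer.dtype, pipe.vae.dtype, pipe.text_encoder.dtype)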

xyyan0123 (Author) commented

Yeah, I know it "should be the same". But I tested on multiple machines with different GPUs, and the gap still exists. I don't know why this happens. Is there something I did wrong? I have reformatted my script above. Could you help me?
