v0.24.0: IP Adapters, Kandinsky 3.0, Stable Video Diffusion, SDXL Turbo
Stable Video Diffusion
Stable Video Diffusion is a powerful image-to-video generation model that can generate high-resolution (576x1024) videos of 2-4 seconds conditioned on an input image.
Image to Video Generation
There are two variants of SVD: SVD and SVD-XT. The SVD checkpoint is trained to generate 14 frames, and the SVD-XT checkpoint is further finetuned to generate 25 frames.
You need to condition the generation on an initial image, as follows:
```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16, variant="fp16"
)
pipe.enable_model_cpu_offload()

# Load the conditioning image
image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png?download=true")
image = image.resize((1024, 576))

generator = torch.manual_seed(42)
frames = pipe(image, decode_chunk_size=8, generator=generator).frames[0]

export_to_video(frames, "generated.mp4", fps=7)
```
Since generating videos is memory intensive, you can use the `decode_chunk_size` argument to control how many frames are decoded at once, which reduces memory usage. It's recommended to tweak this value based on your GPU memory: setting `decode_chunk_size=1` decodes one frame at a time and uses the least memory, but the video might have some flickering. Additionally, we use model CPU offloading to further reduce memory usage.
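For reference, here is a minimal low-memory variant of the call above (a sketch reusing the `pipe`, `image`, and `generator` defined earlier; the 14-frame `stabilityai/stable-video-diffusion-img2vid` checkpoint can be loaded the same way if you want shorter clips):

```python
# Low-memory sketch, reusing pipe/image/generator from the snippet above.
# decode_chunk_size=1 decodes a single frame at a time: least VRAM, slowest,
# and the video might show slight flickering.
frames = pipe(image, decode_chunk_size=1, generator=generator).frames[0]
export_to_video(frames, "generated_low_mem.mp4", fps=7)
```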
SDXL Turbo
SDXL Turbo is an adversarial time-distilled Stable Diffusion XL (SDXL) model capable of running inference in as little as 1 step. Also, it does not use classifier-free guidance, further increasing its speed. On a good consumer GPU, you can now generate an image in just 100ms.
Text-to-Image
For text-to-image, pass a text prompt. By default, SDXL Turbo generates a 512x512 image, and that resolution gives the best results. You can try setting the `height` and `width` parameters to 768x768 or 1024x1024, but you should expect quality degradations when doing so.

Make sure to set `guidance_scale` to 0.0 to disable classifier-free guidance, as the model was trained without it. A single inference step is enough to generate high-quality images, and increasing the number of steps to 2, 3, or 4 should improve image quality.
```python
from diffusers import AutoPipelineForText2Image
import torch

pipeline_text2image = AutoPipelineForText2Image.from_pretrained("stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16")
pipeline_text2image = pipeline_text2image.to("cuda")

prompt = "A cinematic shot of a baby racoon wearing an intricate italian priest robe."
image = pipeline_text2image(prompt=prompt, guidance_scale=0.0, num_inference_steps=1).images[0]
image
```
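If the single-step result looks too soft, a couple of extra steps often help; a quick sketch reusing `pipeline_text2image` and `prompt` from above (guidance stays disabled):

```python
# Sketch: same pipeline and prompt, 4 steps instead of 1 for a bit more detail.
# guidance_scale stays at 0.0 because SDXL Turbo was trained without CFG.
image_4step = pipeline_text2image(prompt=prompt, guidance_scale=0.0, num_inference_steps=4).images[0]
```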
Image-to-image
For image-to-image generation, make sure that `num_inference_steps * strength` is larger than or equal to 1. The image-to-image pipeline will run for `int(num_inference_steps * strength)` steps, e.g. `0.5 * 2.0 = 1` step in our example below.
```python
from diffusers import AutoPipelineForImage2Image
from diffusers.utils import load_image, make_image_grid

# use from_pipe to avoid consuming additional memory when loading a checkpoint
pipeline = AutoPipelineForImage2Image.from_pipe(pipeline_text2image).to("cuda")

init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png")
init_image = init_image.resize((512, 512))

prompt = "cat wizard, gandalf, lord of the rings, detailed, fantasy, cute, adorable, Pixar, Disney, 8k"

image = pipeline(prompt, image=init_image, strength=0.5, guidance_scale=0.0, num_inference_steps=2).images[0]
make_image_grid([init_image, image], rows=1, cols=2)
```
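Because Turbo runs with so few steps, it's easy to end up with zero effective denoising steps by accident. A tiny hypothetical helper (not part of diffusers) makes the `int(num_inference_steps * strength)` rule explicit:

```python
# Hypothetical helper, not a diffusers API: the image-to-image pipeline runs
# int(num_inference_steps * strength) steps, which must be at least 1.
def effective_steps(num_inference_steps: int, strength: float) -> int:
    return int(num_inference_steps * strength)

print(effective_steps(2, 0.5))  # 1 -- the example above
print(effective_steps(1, 0.5))  # 0 -- too few: raise num_inference_steps or strength
```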
IP Adapters
IP Adapters have proven to be remarkably effective at generating images conditioned on other images.
Thanks to @okotaku, we have added IP Adapters to the most important pipelines, allowing you to combine them for a variety of different workflows, e.g. they work with Img2Img, ControlNet, and LCM-LoRA out of the box.
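As a baseline, here is a minimal sketch of plain IP-Adapter conditioning on SD 1.5 (the reference image URL is borrowed from the ControlNet example below); the LCM-LoRA and ControlNet combinations that follow add their pieces on top of the same `load_ip_adapter` call:

```python
import torch
from diffusers import AutoPipelineForText2Image
from diffusers.utils import load_image

# Minimal sketch: SD 1.5 + IP-Adapter, no LCM-LoRA or ControlNet.
pipe = AutoPipelineForText2Image.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")

ip_image = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/statue.png")
image = pipe(prompt="best quality, high quality", ip_adapter_image=ip_image, num_inference_steps=50).images[0]
```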
LCM-LoRA
```python
from diffusers import DiffusionPipeline, LCMScheduler
import torch
from diffusers.utils import load_image

model_id = "sd-dreambooth-library/herge-style"
lcm_lora_id = "latent-consistency/lcm-lora-sdv1-5"

pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)

pipe.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")
pipe.load_lora_weights(lcm_lora_id)
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()

prompt = "best quality, high quality"
image = load_image("https://user-images.githubusercontent.com/24734142/266492875-2d50d223-8475-44f0-a7c6-08b51cb53572.png")
images = pipe(
    prompt=prompt,
    ip_adapter_image=image,
    num_inference_steps=4,
    guidance_scale=1,
).images[0]
```
ControlNet
```python
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
import torch
from diffusers.utils import load_image

controlnet_model_path = "lllyasviel/control_v11f1p_sd15_depth"
controlnet = ControlNetModel.from_pretrained(controlnet_model_path, torch_dtype=torch.float16)

pipeline = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
)
pipeline.to("cuda")

image = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/statue.png")
depth_map = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/depth.png")

pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")

generator = torch.Generator(device="cpu").manual_seed(33)
images = pipeline(
    prompt='best quality, high quality',
    image=depth_map,
    ip_adapter_image=image,
    negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality",
    num_inference_steps=50,
    generator=generator,
).images
images[0].save("yiyi_test_2_out.png")
```
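The influence of the image prompt can also be tuned. A one-line sketch, assuming your diffusers version exposes `set_ip_adapter_scale` on IP-Adapter-enabled pipelines (1.0 means full image conditioning; lower values favor the text prompt):

```python
# Assumption: set_ip_adapter_scale is available in your diffusers version.
pipeline.set_ip_adapter_scale(0.6)  # weaken the image prompt relative to the text prompt
```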
(The original release notes show a results table here with columns `ip_image`, `condition`, and `output`; the example images are not reproduced.)
Kandinsky 3.0
The Kandinsky team has released the third version of its model, which features much-improved text-to-image alignment thanks to using Flan-T5 as the text encoder.
Text-to-Image
```python
from diffusers import AutoPipelineForText2Image
import torch

pipe = AutoPipelineForText2Image.from_pretrained("kandinsky-community/kandinsky-3", variant="fp16", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()

prompt = "A photograph of the inside of a subway train. There are raccoons sitting on the seats. One of them is reading a newspaper. The window shows the city in the background."

generator = torch.Generator(device="cpu").manual_seed(0)
image = pipe(prompt, num_inference_steps=25, generator=generator).images[0]
```
Image-to-Image
```python
from diffusers import AutoPipelineForImage2Image
from diffusers.utils import load_image
import torch

pipe = AutoPipelineForImage2Image.from_pretrained("kandinsky-community/kandinsky-3", variant="fp16", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()

prompt = "A painting of the inside of a subway train with tiny raccoons."
image = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky3/t2i.png")

generator = torch.Generator(device="cpu").manual_seed(0)
image = pipe(prompt, image=image, strength=0.75, num_inference_steps=25, generator=generator).images[0]
```
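To eyeball the result against the input, the same `make_image_grid` helper from the SDXL Turbo example applies; note that the snippet above reuses the name `image` for both input and output, so keep a separate reference (small sketch):

```python
from diffusers.utils import make_image_grid

# Sketch: keep the input under its own name so it can be compared to the output.
init_image = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky3/t2i.png")
out = pipe(prompt, image=init_image, strength=0.75, num_inference_steps=25, generator=generator).images[0]
make_image_grid([init_image, out], rows=1, cols=2)
```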
All commits
- LCM-LoRA docs by @patil-suraj in #5782
- [Docs] Update and make improvements by @StandardAI in #5819
- [docs] Fix title by @stevhliu in #5831
- Improve setup.py and add dependency check by @patrickvonplaten in #5826
- [Docs] add: japanese sdxl as a reference by @sayakpaul in #5844
- Set `usedforsecurity=False` in hashlib methods (FIPS compliance) by @Wauplin in #5790
- fix memory consistency decoder test by @williamberman in #5828
- [PEFT] Unpin peft by @patrickvonplaten in #5850
- Speed up the peft lora unload by @pacman100 in #5741
- [Tests/LoRA/PEFT] Test also on PEFT / transformers / accelerate latest by @younesbelkada in #5820
- UnboundLocalError in SDXLInpaint.prepare_latents() by @a-r-r-o-w in #5648
- [ControlNet] fix import in single file loading by @sayakpaul in #5834
- [Styling] stylify using ruff by @kashif in #5841
- [Community] [WIP] LCM Interpolation Pipeline by @a-r-r-o-w in #5767
- [JAX] Replace uses of jax.devices("cpu") with jax.local_devices(backend="cpu") by @hvaara in #5864
- [test/peft] Fix silent behaviour on PR tests by @younesbelkada in #5852
- fix an issue that ipex occupy too much memory, it will not impact per… by @linlifan in #5625
- Update LCMScheduler Inference Timesteps to be More Evenly Spaced by @dg845 in #5836
- Revert "[
Docs
] Update and make improvements" by @StandardAI in #5858 - [docs] Loader APIs by @stevhliu in #5813
- Update README.md by @co63oc in #5855
- Add tests fetcher by @DN6 in #5848
- Addition of new callbacks to controlnets by @a-r-r-o-w in #5812
- [docs] MusicLDM by @stevhliu in #5854
- Add features to the Dreambooth LoRA SDXL training script by @linoytsaban in #5508
- [feat] IP Adapters (author @okotaku ) by @yiyixuxu in #5713
- [Lora] Seperate logic by @patrickvonplaten in #5809
- ControlNet+Adapter pipeline, and ControlNet+Adapter+Inpaint pipeline by @affromero in #5869
- Adds an advanced version of the SD-XL DreamBooth LoRA training script supporting pivotal tuning by @linoytsaban in #5883
- [bug fix] fix small bug in readme template of sdxl lora training script by @linoytsaban in #5906
- [bug fix] fix small bug in readme template of sdxl lora training script by @linoytsaban in #5914
- [Docs] add: 8bit inference with pixart alpha by @sayakpaul in #5814
- [@cene555][Kandinsky 3.0] Add Kandinsky 3.0 by @patrickvonplaten in #5913
- [Examples] Allow downloading variant model files by @patrickvonplaten in #5531
- [Fix: pixart-alpha] random 512px resolution bug by @lawrence-cj in #5842
- [Core] add support for gradient checkpointing in transformer_2d by @sayakpaul in #5943
- Deprecate KarrasVeScheduler and ScoreSdeVpScheduler by @a-r-r-o-w in #5269
- Add Custom Timesteps Support to LCMScheduler and Supported Pipelines by @dg845 in #5874
- set the model to train state before accelerator prepare by @sywangyi in #5099
- Avoid computing min() that is expensive when do_normalize is False in the image processor by @ivanprado in #5896
- Fix LCM Stable Diffusion distillation bug related to parsing unet_time_cond_proj_dim by @dg845 in #5893
- add LoRA weights load and fuse support for IPEX pipeline by @linlifan in #5920
- Replace multiple variables with one variable. by @hi-sushanta in #5715
- fix: error on device for `lpw_stable_diffusion_xl` pipeline if `pipe.enable_sequential_cpu_offload()` enabled by @VicGrygorchyk in #5885
- [Vae] Make sure all vae's work with latent diffusion models by @patrickvonplaten in #5880
- [Tests] Make sure that we don't run tests multiple times by @patrickvonplaten in #5949
- [Community Pipeline] Diffusion Posterior Sampling for General Noisy Inverse Problems by @tongdaxu in #5939
- [From_pretrained] Fix warning by @patrickvonplaten in #5948
- [load_textual_inversion]: allow multiple tokens by @yiyixuxu in #5837
- [docs] Fix space by @stevhliu in #5898
- fix: minor typo in docstring by @soumik12345 in #5961
- [ldm3d] Ldm3d upscaler to community pipeline by @estelleafl in #5870
- [docs] Update pipeline list by @stevhliu in #5952
- [Tests] Refactor `test_examples.py` for better readability by @sayakpaul in #5946
- added doc for Kandinsky3.0 by @charchit7 in #5937
- [bug fix] Inpainting for MultiAdapter by @affromero in #5922
- Rename output_dir argument by @linhqyy in #5916
- [LoRA refactor] move several state dict conversion utils out of lora.py by @sayakpaul in #5955
- Support of ip-adapter to the StableDiffusionControlNetInpaintPipeline by @juancopi81 in #5887
- [docs] LCM training by @stevhliu in #5796
- Controlnet ssd 1b support by @MarkoKostiv in #5779
- [Pipeline] Add TextToVideoZeroSDXLPipeline by @vahramtadevosyan in #4695
- [Wuerstchen] Adapt lora training example scripts to use PEFT by @kashif in #5959
- Fixed custom module importing on Windows by @PENGUINLIONG in #5891
- Add SVD by @patil-suraj in #5895
- [SDXL Turbo] Add some docs by @patrickvonplaten in #5982
- Fix SVD doc by @patil-suraj in #5983
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @a-r-r-o-w
  - UnboundLocalError in SDXLInpaint.prepare_latents() (#5648)
  - [Community] [WIP] LCM Interpolation Pipeline (#5767)
  - Addition of new callbacks to controlnets (#5812)
  - Deprecate KarrasVeScheduler and ScoreSdeVpScheduler (#5269)
- @dg845
  - Update LCMScheduler Inference Timesteps to be More Evenly Spaced (#5836)
  - Add Custom Timesteps Support to LCMScheduler and Supported Pipelines (#5874)
  - Fix LCM Stable Diffusion distillation bug related to parsing unet_time_cond_proj_dim (#5893)
- @affromero
  - ControlNet+Adapter pipeline, and ControlNet+Adapter+Inpaint pipeline (#5869)
  - [bug fix] Inpainting for MultiAdapter (#5922)
- @tongdaxu
  - [Community Pipeline] Diffusion Posterior Sampling for General Noisy Inverse Problems (#5939)
- @estelleafl
  - [ldm3d] Ldm3d upscaler to community pipeline (#5870)
- @vahramtadevosyan
  - [Pipeline] Add TextToVideoZeroSDXLPipeline (#4695)