Implement framewise encoding/decoding in LTX Video VAE #10488
base: main
Conversation
For decode, original: org_output.mp4, org_output2.mp4, org_output3.mp4, org_output4.mp4; framewise: output.mp4, output2.mp4, output3.mp4, output4.mp4
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Wow, this is awesome @rootonchair! So cool
Just some questions and asks:
- Did you verify that the expected number of frames are the same with framewise enabled vs disabled?
- Is there any numerical difference between the tensors with framewise enabled vs disabled? A small absmax difference is usually okay/expected due to the order of tensor operations changing, but since all matmul operations are on the embedding dimension, it should be very small and unaffected by how many frames are being encoded/decoded at once.
- Let's try to make vae encoding work with framewise as well
- Let's enable framewise encoding/decoding as default, by setting the values of the flags that enable them to True
This will also help reduce the memory requirements for training LTX significantly, so I really appreciate you looking into this :)
I'm sure it works as expected since the videos look great, so I will try to get back to you quickly after a sanity check. Thank you!
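For reference, a minimal sketch of the parity check being asked for here; `vae` and `latents` are stand-in names for a loaded `AutoencoderKLLTXVideo` and a latent tensor, not objects defined in this PR:

```python
import torch

# Stand-ins: `vae` is an AutoencoderKLLTXVideo, `latents` has shape
# (batch, channels, frames, height, width) in latent space.
with torch.no_grad():
    vae.use_framewise_decoding = False
    ref = vae.decode(latents).sample

    vae.use_framewise_decoding = True
    out = vae.decode(latents).sample

# Ask 1: the number of output frames must match exactly.
assert ref.shape == out.shape, (ref.shape, out.shape)
# Ask 2: any numerical difference should be tiny.
print("absmax diff:", (ref - out).abs().max().item())
```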
```
@@ -1114,6 +1116,53 @@ def encode(
        if not return_dict:
            return (posterior,)
        return AutoencoderKLOutput(latent_dist=posterior)

    def blend_t(self, a: torch.Tensor, b: torch.Tensor, blend_extent: int) -> torch.Tensor:
```
Would move this down a few methods to where `blend_h` and `blend_v` are located
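Since `blend_h` and `blend_v` linearly blend a spatial overlap, the temporal counterpart does the same along the frame axis. A sketch of the typical pattern (the exact body in this PR may differ):

```python
import torch

def blend_t(self, a: torch.Tensor, b: torch.Tensor, blend_extent: int) -> torch.Tensor:
    # Cross-fade the last `blend_extent` frames of `a` into the first
    # `blend_extent` frames of `b` along the temporal axis (dim 2 of B, C, T, H, W).
    blend_extent = min(a.shape[2], b.shape[2], blend_extent)
    for t in range(blend_extent):
        weight = t / blend_extent
        b[:, :, t] = a[:, :, -blend_extent + t] * (1 - weight) + b[:, :, t] * weight
    return b
```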
```
        )
        return b

    def _temporal_tiled_decode(self, z: torch.Tensor, temb: Optional[torch.Tensor], return_dict: bool = True) -> Union[DecoderOutput, torch.Tensor]:
```
Let's remove the debug statements from this method and move it below `tiled_decode`.
I think we would also need to implement `tiled_encode`. Happy to help with the changes if needed 🤗
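For anyone following the thread, this is the general shape such a `_temporal_tiled_decode` usually takes: decode overlapping latent chunks along the frame axis, then cross-fade the overlaps with `blend_t`. A sketch under assumed attribute names (`tile_sample_min_num_frames`, `tile_sample_stride_num_frames`, `temporal_compression_ratio`), not the PR's exact code:

```python
import torch
from diffusers.models.autoencoders.vae import DecoderOutput

def _temporal_tiled_decode(self, z, temb, return_dict=True):
    # Convert the (assumed) sample-space tile sizes into latent frames.
    tile_latent_min = self.tile_sample_min_num_frames // self.temporal_compression_ratio
    tile_latent_stride = self.tile_sample_stride_num_frames // self.temporal_compression_ratio
    blend_num_frames = self.tile_sample_min_num_frames - self.tile_sample_stride_num_frames

    # Decode overlapping latent chunks along the frame axis.
    row = []
    for i in range(0, z.shape[2], tile_latent_stride):
        tile = z[:, :, i : i + tile_latent_min + 1]
        decoded = self.decoder(tile, temb)  # spatial tiling omitted in this sketch
        if i > 0:
            decoded = decoded[:, :, 1:]
        row.append(decoded)

    # Cross-fade the overlaps, keeping one stride's worth of frames per chunk.
    result = []
    for i, tile in enumerate(row):
        if i > 0:
            tile = self.blend_t(row[i - 1], tile, blend_num_frames)
            result.append(tile[:, :, : self.tile_sample_stride_num_frames])
        else:
            result.append(tile[:, :, : self.tile_sample_stride_num_frames + 1])

    dec = torch.cat(result, dim=2)
    return DecoderOutput(sample=dec) if return_dict else (dec,)
```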
Hi @a-r-r-o-w, thank you for your questions.
Yes, the number of frames remains the same with framewise enabled and disabled.
I have just completed the encoding part; I will proceed with the sanity check and then enable framewise encoding/decoding as the default.
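For context, the usual way such a flag is wired in is a simple branch in the encode path. A hypothetical sketch (method and attribute names other than `use_framewise_encoding` are assumptions for illustration):

```python
import torch

def _encode(self, x: torch.Tensor) -> torch.Tensor:
    # Fall back to framewise (temporally tiled) encoding only when the input
    # is longer than a single tile; otherwise encode in one pass.
    if self.use_framewise_encoding and x.shape[2] > self.tile_sample_min_num_frames:
        return self._temporal_tiled_encode(x)
    return self.encoder(x)
```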
For the decoding part, the output does not change at all; for the encoding part, there is a small difference in the mean. Below is my testing script:

```python
import torch
from diffusers import LTXPipeline
from diffusers.utils import export_to_video
pipe = LTXPipeline.from_pretrained("a-r-r-o-w/LTX-Video-0.9.1-diffusers", torch_dtype=torch.bfloat16)
pipe.to("cuda")
prompt = "A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage"
#prompt = "A detailed wooden toy ship with intricately carved masts and sails is seen gliding smoothly over a plush, blue carpet that mimics the waves of the sea. The ship's hull is painted a rich brown, with tiny windows. The carpet, soft and textured, provides a perfect backdrop, resembling an oceanic expanse. Surrounding the ship are various other toys and children's items, hinting at a playful environment. The scene captures the innocence and imagination of childhood, with the toy ship's journey symbolizing endless adventures in a whimsical, indoor setting."
#prompt = "The camera pans across a cityscape of tall buildings with a circular building in the center. The camera moves from left to right, showing the tops of the buildings and the circular building in the center. The buildings are various shades of gray and white, and the circular building has a green roof. The camera angle is high, looking down at the city. The lighting is bright, with the sun shining from the upper left, casting shadows from the buildings. The scene is computer-generated imagery."
#prompt = "A clear, turquoise river flows through a rocky canyon, cascading over a small waterfall and forming a pool of water at the bottom.The river is the main focus of the scene, with its clear water reflecting the surrounding trees and rocks. The canyon walls are steep and rocky, with some vegetation growing on them. The trees are mostly pine trees, with their green needles contrasting with the brown and gray rocks. The overall tone of the scene is one of peace and tranquility."
#prompt = "Two police officers in dark blue uniforms and matching hats enter a dimly lit room through a doorway on the left side of the frame. The first officer, with short brown hair and a mustache, steps inside first, followed by his partner, who has a shaved head and a goatee. Both officers have serious expressions and maintain a steady pace as they move deeper into the room. The camera remains stationary, capturing them from a slightly low angle as they enter. The room has exposed brick walls and a corrugated metal ceiling, with a barred window visible in the background. The lighting is low-key, casting shadows on the officers' faces and emphasizing the grim atmosphere. The scene appears to be from a film or television show."
negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"
pipe.vae.use_framewise_decoding = False
video = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=768,
    height=512,
    num_frames=161,
    decode_timestep=0.03,
    decode_noise_scale=0.025,
    num_inference_steps=50,
    generator=torch.Generator(device="cuda").manual_seed(42),
    output_type='pt',
).frames[0]
pipe.vae.use_framewise_decoding = True
video2 = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=768,
    height=512,
    num_frames=161,
    decode_timestep=0.03,
    decode_noise_scale=0.025,
    num_inference_steps=50,
    generator=torch.Generator(device="cuda").manual_seed(42),
    output_type='pt',
).frames[0]
print(video2.size())
print(f"Diff: {torch.mean(video - video2)}")
print("Test encoding")
generator = torch.Generator(device="cuda").manual_seed(42)
dummy_input = torch.rand((1, 3, 161, 512//8, 768//8), device="cuda", dtype=torch.bfloat16, generator=generator)
vae = pipe.vae
vae.use_framewise_encoding = False
posterior = vae.encode(dummy_input).latent_dist
z = posterior.sample(generator=generator)
vae.use_framewise_encoding = True
posterior = vae.encode(dummy_input).latent_dist
z2 = posterior.sample(generator=generator)
print(f"Diff: {torch.mean(z-z2)}") |
Thank you for adding support for this @rootonchair! I've verified for all frame counts up to 257 that encoding/decoding has negligible difference with the changes here.
@yiyixuxu Do you want to give this a look too? We're flipping the default value of use_framewise_encoding/decoding here, but it should be okay; it is expected to always default to True when a framewise implementation is available.
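A sketch of the kind of sweep described above (assumes a loaded `vae`; the 8k + 1 frame-count pattern is LTX's usual constraint, assumed here):

```python
import torch

# Sweep frame counts of the form 8k + 1 up to 257 and compare framewise vs.
# plain encoding; `vae` is a stand-in for a loaded AutoencoderKLLTXVideo on GPU.
for num_frames in range(1, 258, 8):
    x = torch.rand((1, 3, num_frames, 64, 96), device="cuda", dtype=torch.bfloat16)
    with torch.no_grad():
        vae.use_framewise_encoding = False
        ref = vae.encode(x).latent_dist.mode()  # mode() avoids sampling noise
        vae.use_framewise_encoding = True
        out = vae.encode(x).latent_dist.mode()
    print(num_frames, (ref - out).abs().max().item())
```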
@rootonchair Could you run
@a-r-r-o-w thank you for reviewing. I have run
Ohh, I don't think we should flip the default. The other changes look good to me.
But we've been using framewise encoding/decoding by default wherever possible in past VAE integrations @yiyixuxu. I think it would be nice to maintain that consistency, no? For example, both CogVideoX and Hunyuan use this by default.
@a-r-r-o-w
Thanks @rootonchair! Just one more thing to do in accordance with YiYi's comment. We can merge after this change
Co-authored-by: Aryan <[email protected]>
What does this PR do?
Fixes #10333
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.