Implement framewise encoding/decoding in LTX Video VAE #10488

Open · wants to merge 8 commits into main

Conversation

rootonchair (Contributor)

What does this PR do?

Fixes #10333

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@rootonchair (Contributor Author)

For decode, original:

org_output.mp4
org_output2.mp4
org_output3.mp4
org_output4.mp4

framewise:

output.mp4
output2.mp4
output3.mp4
output4.mp4

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@a-r-r-o-w self-requested a review on January 7, 2025
@a-r-r-o-w (Member) left a comment

Wow, this is awesome @rootonchair! So cool

Just some questions and asks:

  • Did you verify that the expected number of frames are the same with framewise enabled vs disabled?
  • Is there any numerical difference between the tensors with framewise enabled vs. disabled? A small absmax difference is usually okay/expected because the order of tensor operations changes, but since all matmul operations are on the embedding dimension, it should be very small and unaffected by how many frames are encoded/decoded at once (a minimal check is sketched at the end of this comment)
  • Let's try to make VAE encoding work with framewise as well
  • Let's enable framewise encoding/decoding as default, by setting the values of the flags that enable them to True

This will also help reduce the memory requirements for training LTX significantly, so I really appreciate you looking into this :)

I'm sure it works as expected since the videos look great, so I will try to get back to you quickly after a sanity check. Thank you!
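For concreteness, a minimal version of the absmax check meant in the second point could look like this (report_diff and the tensor names are placeholders, not code from this PR; the inputs are the decoded outputs of two otherwise identical runs with framewise decoding disabled and enabled):

import torch

def report_diff(out_ref: torch.Tensor, out_framewise: torch.Tensor) -> None:
    # Compare decoded videos from two otherwise identical runs (framewise off vs. on).
    diff = (out_ref.float() - out_framewise.float()).abs()
    print(f"absmax: {diff.max().item():.3e}, mean: {diff.mean().item():.3e}")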

@@ -1114,6 +1116,53 @@ def encode(
if not return_dict:
return (posterior,)
return AutoencoderKLOutput(latent_dist=posterior)

def blend_t(self, a: torch.Tensor, b: torch.Tensor, blend_extent: int) -> torch.Tensor:
Member

Would move this down a few methods to where blend_h and blend_v are located
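For reference, the corresponding temporal blend helper in other diffusers video VAEs (e.g. the CogVideoX VAE) follows the same linear-crossfade pattern as blend_h/blend_v, just applied along the frame axis. A standalone sketch of that pattern, assuming tensors laid out as [B, C, F, H, W]:

import torch

def blend_t(a: torch.Tensor, b: torch.Tensor, blend_extent: int) -> torch.Tensor:
    # Crossfade the last `blend_extent` frames of tile `a` into the first
    # `blend_extent` frames of tile `b` along the temporal (frame) dimension.
    blend_extent = min(a.shape[-3], b.shape[-3], blend_extent)
    for x in range(blend_extent):
        b[:, :, x, :, :] = a[:, :, -blend_extent + x, :, :] * (1 - x / blend_extent) + b[:, :, x, :, :] * (x / blend_extent)
    return b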

)
return b

def _temporal_tiled_decode(self, z: torch.Tensor, temb: Optional[torch.Tensor], return_dict: bool = True) -> Union[DecoderOutput, torch.Tensor]:
Member

Let's remove the debug statements from this method and move it below tiled_decode.

I think we would also need to implement tiled_encode. Happy to help with the changes if needed 🤗
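For anyone following along, the general structure behind this kind of temporal tiling is: split the frames into overlapping chunks, decode (or encode) each chunk independently, then crossfade the overlapping region of neighbouring chunks with blend_t so the seams are smooth. Below is a toy sketch of the decode side that pretends the decoder preserves the frame count so the bookkeeping stays readable (the real VAE also maps latent frames to pixel frames, so the numbers are illustrative, not the code in this PR):

import torch

def temporal_tiled_decode_sketch(decode_chunk, z: torch.Tensor, chunk: int = 16, overlap: int = 4) -> torch.Tensor:
    # z: latents of shape [B, C, F, H, W]; decode_chunk decodes a slice of frames.
    stride = chunk - overlap
    tiles = [decode_chunk(z[:, :, i : i + chunk]) for i in range(0, z.shape[2], stride)]
    pieces = []
    for k, tile in enumerate(tiles):
        if k > 0:
            # Crossfade this tile's head into the previous tile's tail
            # (blend_t as sketched in the comment above).
            tile = blend_t(tiles[k - 1], tile, overlap)
        # Keep only the first `stride` frames; the dropped tail is re-covered
        # (and blended) by the next tile.
        pieces.append(tile[:, :, :stride])
    return torch.cat(pieces, dim=2)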

@rootonchair (Contributor Author)

Hi @a-r-r-o-w, thank you for your questions.

Did you verify that the expected number of frames are the same with framewise enabled vs disabled?

Yes, the number of frames remains the same with framewise both enabled and disabled.

Is there any numerical difference between the tensors with framewise enabled vs disabled? A small absmax difference is usually okay/expected due to order of tensor operations changing, but since all matmul operations are on the embedding dimension, it should be very small and not affected by how many frames are being encoded/decoded at once

I have just completed the encoding part and will proceed to the sanity check, then enable framewise encoding/decoding as the default.

@rootonchair (Contributor Author)

rootonchair commented Jan 8, 2025

For the decoding part, the output does not change at all; for the encoding part, there is a small mean difference of about -8.285045623779297e-06 between the outputs.

Below is my testing script:

import torch
from diffusers import LTXPipeline
from diffusers.utils import export_to_video

pipe = LTXPipeline.from_pretrained("a-r-r-o-w/LTX-Video-0.9.1-diffusers", torch_dtype=torch.bfloat16)
pipe.to("cuda")

prompt = "A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage"
#prompt = "A detailed wooden toy ship with intricately carved masts and sails is seen gliding smoothly over a plush, blue carpet that mimics the waves of the sea. The ship's hull is painted a rich brown, with tiny windows. The carpet, soft and textured, provides a perfect backdrop, resembling an oceanic expanse. Surrounding the ship are various other toys and children's items, hinting at a playful environment. The scene captures the innocence and imagination of childhood, with the toy ship's journey symbolizing endless adventures in a whimsical, indoor setting."
#prompt = "The camera pans across a cityscape of tall buildings with a circular building in the center. The camera moves from left to right, showing the tops of the buildings and the circular building in the center. The buildings are various shades of gray and white, and the circular building has a green roof. The camera angle is high, looking down at the city. The lighting is bright, with the sun shining from the upper left, casting shadows from the buildings. The scene is computer-generated imagery."
#prompt = "A clear, turquoise river flows through a rocky canyon, cascading over a small waterfall and forming a pool of water at the bottom.The river is the main focus of the scene, with its clear water reflecting the surrounding trees and rocks. The canyon walls are steep and rocky, with some vegetation growing on them. The trees are mostly pine trees, with their green needles contrasting with the brown and gray rocks. The overall tone of the scene is one of peace and tranquility."
#prompt = "Two police officers in dark blue uniforms and matching hats enter a dimly lit room through a doorway on the left side of the frame. The first officer, with short brown hair and a mustache, steps inside first, followed by his partner, who has a shaved head and a goatee. Both officers have serious expressions and maintain a steady pace as they move deeper into the room. The camera remains stationary, capturing them from a slightly low angle as they enter. The room has exposed brick walls and a corrugated metal ceiling, with a barred window visible in the background. The lighting is low-key, casting shadows on the officers' faces and emphasizing the grim atmosphere. The scene appears to be from a film or television show."
negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"

pipe.vae.use_framewise_decoding = False

video = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=768,
    height=512,
    num_frames=161,
    decode_timestep=0.03,
    decode_noise_scale=0.025,
    num_inference_steps=50,
    generator=torch.Generator(device="cuda").manual_seed(42),
    output_type='pt',
).frames[0]

pipe.vae.use_framewise_decoding = True
video2 = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=768,
    height=512,
    num_frames=161,
    decode_timestep=0.03,
    decode_noise_scale=0.025,
    num_inference_steps=50,
    generator=torch.Generator(device="cuda").manual_seed(42),
    output_type='pt',
).frames[0]
print(video2.size())
print(f"Diff: {torch.mean(video-video)}")

print("Test encoding")

generator = torch.Generator(device="cuda").manual_seed(42)
dummy_input = torch.rand((1, 3, 161, 512//8, 768//8), device="cuda", dtype=torch.bfloat16, generator=generator)

vae = pipe.vae
vae.use_framewise_encoding = False
posterior = vae.encode(dummy_input).latent_dist
z = posterior.sample(generator=generator)
vae.use_framewise_encoding = True
posterior = vae.encode(dummy_input).latent_dist
z2 = posterior.sample(generator=generator)
print(f"Diff: {torch.mean(z-z2)}")

@a-r-r-o-w (Member) left a comment

Thank you for adding support for this, @rootonchair! I've verified for all frame counts up to 257 that encoding/decoding has a negligible difference with the changes here.
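A sweep of roughly this shape reproduces the kind of check described (the checkpoint, the subfolder, and the use of latent_dist.mode() are illustrative choices, not necessarily the exact harness that was used):

import torch
from diffusers import AutoencoderKLLTXVideo

# Load just the VAE from the same checkpoint used in the script above.
vae = AutoencoderKLLTXVideo.from_pretrained(
    "a-r-r-o-w/LTX-Video-0.9.1-diffusers", subfolder="vae", torch_dtype=torch.bfloat16
).to("cuda")

with torch.no_grad():
    for num_frames in (9, 49, 121, 161, 257):
        x = torch.rand((1, 3, num_frames, 64, 96), device="cuda", dtype=torch.bfloat16)
        vae.use_framewise_encoding = False
        z_ref = vae.encode(x).latent_dist.mode()
        vae.use_framewise_encoding = True
        z_framewise = vae.encode(x).latent_dist.mode()
        print(num_frames, (z_ref.float() - z_framewise.float()).abs().max().item())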

@a-r-r-o-w requested a review from yiyixuxu on January 9, 2025
@a-r-r-o-w (Member)

@yiyixuxu Do you want to give this a look too? We're flipping the default value of use_framewise_encoding/decoding here, but it should be okay. The expectation is that these default to True whenever a framewise implementation is available.

@a-r-r-o-w (Member)

@rootonchair Could you run make style here so the tests pass?

@rootonchair (Contributor Author)

@a-r-r-o-w thank you for reviewing. I have run make style.

@yiyixuxu (Collaborator)

yiyixuxu commented Jan 9, 2025

ohh I don't think we should flip the default
see https://huggingface.co/docs/diffusers/en/conceptual/philosophy#usability-over-performance

other changes look good to me

@a-r-r-o-w (Member)

But we've been using framewise encoding/decoding by default wherever possible in past VAE integrations, @yiyixuxu. I think it would be nice to maintain that consistency, no? For example, both CogVideoX and Hunyuan use this by default.

@yiyixuxu (Collaborator)

yiyixuxu commented Jan 9, 2025

@a-r-r-o-w
we really should not have done that though
let's make sure not to do that moving forward

@a-r-r-o-w (Member) left a comment

Thanks @rootonchair! Just one more thing to do in accordance with YiYi's comment. We can merge after this change

src/diffusers/models/autoencoders/autoencoder_kl_ltx.py (review thread resolved)