Feature request: person-specific autoencoder for stage 3 #183

Open
Soonwang1988 opened this issue Mar 22, 2025 · 15 comments

@Soonwang1988

Hi, I found a way to improve LatentSync: I extract the aligned face frames from a video at inference time and train an autoencoder on those images, using noise and blur as augmentation, with a VGG perceptual loss and an L1 loss.

Then, at final inference, I run this autoencoder on the output of the LatentSync model and get a clear, person-specific result with good lip sync.

I found this does not work with less than about 5 minutes of data, since there are too few frames and too few examples of the different visemes.

You can also train a 256x256 lip-sync model together with a person-specific 256->512 autoencoder.

Please consider implementing this; it would help avatar technology.
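
For anyone who wants to try this, here is a minimal sketch of the training setup described above. It assumes the lpips package for the VGG perceptual loss, a generic encoder/decoder module ae, and a DataLoader of aligned face crops scaled to [-1, 1]; none of these names come from the LatentSync repo, and the hyperparameters are placeholders.

import torch
import torch.nn as nn
import torchvision.transforms.functional as TF
import lpips  # VGG-based perceptual (LPIPS) loss


def degrade(x):
    # Placeholder blur + noise degradation (illustrative values only).
    x = TF.gaussian_blur(x, kernel_size=5, sigma=1.5)
    return x + 0.1 * torch.randn_like(x)


def train_person_ae(ae, loader, epochs=50, lr=1e-4, device="cuda"):
    # Train the person-specific autoencoder to reconstruct clean aligned
    # face crops from degraded (blurred + noisy) versions of themselves.
    ae.to(device).train()
    perceptual = lpips.LPIPS(net="vgg").to(device)
    opt = torch.optim.Adam(ae.parameters(), lr=lr)
    for _ in range(epochs):
        for faces in loader:  # faces: [B, 3, H, W] aligned crops in [-1, 1]
            faces = faces.to(device)
            recon = ae(degrade(faces))
            loss = nn.functional.l1_loss(recon, faces) + perceptual(recon, faces).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
    return ae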

@mahirshahriar1

Can you share more details?

@chunyu-li
Collaborator

@Soonwang1988 Hey, I am very interested in how you trained the VAE. Can you provide more details? For example:

  1. Which codebase did you use to train the VAE, and what exactly are the noise and blur operations?
  2. Did you use a discriminator loss to train the VAE? If so, what is the specific architecture of the discriminator?
  3. When you use this VAE with the U-Net, what scaling_factor do you set? Is it 0.18215?
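
For context on the scaling_factor question: in the standard Stable Diffusion setup, latents are multiplied by scaling_factor before entering the U-Net and divided by it before decoding. A minimal sketch with the diffusers API; the 0.18215 value is the SD VAE default, and LatentSync's actual value may differ.

import torch
from diffusers import AutoencoderKL

# Standard Stable Diffusion convention for scaling_factor; not copied from LatentSync code.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
scaling_factor = vae.config.scaling_factor  # 0.18215 for the SD VAE

image = torch.randn(1, 3, 256, 256)  # dummy stand-in for a face crop in [-1, 1]
latents = vae.encode(image).latent_dist.sample() * scaling_factor  # what the U-Net sees
decoded = vae.decode(latents / scaling_factor).sample  # back to pixel space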

@Soonwang1988
Author

  1. I implemented a basic autoencoder with Gaussian noise on the input and random Gaussian blur with a random sigma (a sketch of the apply_random_blur helper referenced here follows after this list):

        if self.training:
            add_blur = True
            add_noise = True
            if add_blur:
                # blur every image in the batch with a randomly chosen sigma
                x = torch.stack([self.apply_random_blur(x[i]) for i in range(x.shape[0])])

            if add_noise:
                # additive Gaussian noise; level 0 (no noise) is sampled most often
                n_level = random.choice([0, 0, 0, 1, 2, 3, 4, 5])
                noise_x = n_level * torch.randn_like(x) / 10
                x = x + noise_x.detach()

  2. I did not use a discriminator at all, only VGG-19 for LPIPS plus an L1 loss.
  3. I used the default LatentSync inference configuration and saved the faces from affine_transform_video in latentsync/pipelines/lipsync_pipeline.py. You are loading all frames in one go, which is why the code goes out of memory for longer videos, so I had to change it to read the faces one by one with OpenCV.
  4. In restore_video, after LatentSync inference and before restoration, I added my custom model:

        width = int(x2 - x1)

        face = self.custom_ae(face.unsqueeze(0)).squeeze(0)  # run the person-specific AE on the generated face
        face = torchvision.transforms.functional.resize(face, size=(height, width), antialias=True)
        face = rearrange(face, "c h w -> h w c")

  5. For a good AE you might need 3-5 minutes of data.
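
The apply_random_blur helper called in item 1 is not shown above. One possible implementation, assuming torchvision's gaussian_blur and a uniformly sampled sigma (both the kernel-size rule and the sigma range are guesses, not the author's actual code):

import random

import torchvision.transforms.functional as TF


def apply_random_blur(self, img, max_sigma=3.0):
    # Hypothetical method matching the self.apply_random_blur(x[i]) call above:
    # blur a single CHW image tensor with a randomly chosen sigma.
    sigma = random.uniform(0.1, max_sigma)
    kernel_size = 2 * int(2 * sigma) + 1  # odd kernel roughly covering the blur radius
    return TF.gaussian_blur(img, kernel_size=kernel_size, sigma=sigma)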

@Soonwang1988
Author

Soonwang1988 commented Mar 23, 2025

Once this is implemented, the main research question would be how to train stage 3 with less data, say 1 minute or 30 seconds; for that, maybe we can start from an sd-vae-ft-mse type of model.
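
One way to start from a pretrained VAE instead of training from scratch would be to fine-tune sd-vae-ft-mse on the person-specific crops. A rough sketch with the diffusers API; freezing the encoder and the plain L1 loss are illustrative assumptions, not a tested recipe:

import torch
from diffusers import AutoencoderKL

# Load the pretrained VAE and fine-tune only the decoder on person-specific crops.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
vae.encoder.requires_grad_(False)  # keep the encoder fixed
opt = torch.optim.Adam(vae.decoder.parameters(), lr=1e-5)


def finetune_step(faces):
    # faces: [B, 3, H, W] aligned crops scaled to [-1, 1]
    latents = vae.encode(faces).latent_dist.sample()
    recon = vae.decode(latents).sample
    loss = torch.nn.functional.l1_loss(recon, faces)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()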

@chunyu-li
Collaborator

chunyu-li commented Mar 23, 2025

Thanks for providing so much useful information!
But I still don't understand why you say "you are loading all frames in one go"; I used cv2 to read and write video frames one by one.

@chunyu-li
Collaborator

And another question: you said you implemented a basic autoencoder. Is its architecture the same as https://huggingface.co/stabilityai/sd-vae-ft-mse, and are you fine-tuning from it?

@Soonwang1988
Author

Soonwang1988 commented Mar 23, 2025

In read_video_cv2 you are loading all frames into one NumPy array, which is not good for longer videos, since all the frames accumulate in memory:

def read_video_cv2(video_path: str):
    # Open the video file
    cap = cv2.VideoCapture(video_path)

    # Check if the video was opened successfully
    if not cap.isOpened():
        print("Error: Could not open video.")
        return np.array([])

    frames = []

    while True:
        # Read a frame
        ret, frame = cap.read()

        # If frame is read correctly ret is True
        if not ret:
            break

        # Convert BGR to RGB
        frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

        frames.append(frame_rgb)

    # Release the video capture object
    cap.release()

    return np.array(frames)
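
A streaming alternative would be a generator that yields frames one at a time, so only a single frame is held in memory (a sketch of the idea, not the author's actual patch):

import cv2


def iter_video_frames_cv2(video_path: str):
    # Yield RGB frames one by one instead of accumulating them in an array.
    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened():
        print("Error: Could not open video.")
        return
    try:
        while True:
            ret, frame = cap.read()
            if not ret:
                break
            yield cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    finally:
        cap.release()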

@chunyu-li
Collaborator

Actually, I don't think the problem is here; Wav2Lip uses exactly the same method to read video, see https://github.com/Rudrabha/Wav2Lip/blob/d07fc4d8431cc5378c8c0239392485b08a976f43/inference.py#L190

I think the OOM issue might be caused by a memory leak somewhere in the affine transformation, but I haven't found the root cause yet.

@Soonwang1988
Author

Soonwang1988 commented Mar 24, 2025

@pedrolabonia

pedrolabonia commented Mar 24, 2025

Hi @Soonwang1988, could you share the full implementation and a how-to for newbies like myself? Are you doing any U-Net training at all, or is it all at inference time?

Thanks!

@mahirshahriar1

@Soonwang1988 I am unable to find the username "some_random_name162534"

@Soonwang1988
Author

Soonwang1988 commented Mar 24, 2025

@pedrolabonia you can connect with me on discord at soonwang1988

@pedrolabonia

> @pedrolabonia you can connect with me on discord at soonwang1988

Sent you a request! It's the one with the parrot pic.

@chunyu-li
Collaborator

> https://github.com/user-attachments/assets/7c08f389-e0e9-4805-9c08-295669c87a85 (with ae) https://github.com/user-attachments/assets/e677d2bf-6703-4c4e-8d32-cb650405f50c (without ae)
>
> Check out these two files, one is without the extra AE and the other one is with the AE.
>
> source used: https://www.youtube.com/watch?v=r1CInfA6lV4

To be honest, I don't see a significant difference in clarity between these two videos. 😂

@Soonwang1988
Author

Soonwang1988 commented Mar 24, 2025

One has a glitch, so it is easy to tell it is AI-generated, while the other one has visemes very close to the original person. Also, to save time, I trained a 256x256 AE only.
