Feature request: person-specific autoencoder for stage 3 #183

Open
Soonwang1988 opened this issue Mar 22, 2025 · 15 comments

@Soonwang1988

Hi, I found a way to improve LatentSync: I extract the aligned face frames from a video at inference time and train an autoencoder on those images, using noise and blur as augmentation, with a VGG perceptual loss and an L1 loss.

Then, at final inference, I run this autoencoder on the output of the LatentSync model and get a clear, person-specific result with good lip sync.

I found this does not work with less than about 5 minutes of data, since there are too few frames and too few examples of the different visemes.

You can also train a 256x256 lip-sync model together with a person-specific 256->512 autoencoder.

Please consider implementing this; it would help avatar technology.
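
For anyone who wants to try this, here is a minimal sketch of the training setup described above. It assumes the lpips package for the VGG perceptual loss, a generic encoder/decoder module ae, and a DataLoader of aligned face crops scaled to [-1, 1]; none of these names come from the LatentSync repo, and the hyperparameters are placeholders.

import torch
import torch.nn as nn
import torchvision.transforms.functional as TF
import lpips  # VGG-based perceptual (LPIPS) loss


def degrade(x):
    # Placeholder blur + noise degradation (illustrative values only).
    x = TF.gaussian_blur(x, kernel_size=5, sigma=1.5)
    return x + 0.1 * torch.randn_like(x)


def train_person_ae(ae, loader, epochs=50, lr=1e-4, device="cuda"):
    # Train the person-specific autoencoder to reconstruct clean aligned
    # face crops from degraded (blurred + noisy) versions of themselves.
    ae.to(device).train()
    perceptual = lpips.LPIPS(net="vgg").to(device)
    opt = torch.optim.Adam(ae.parameters(), lr=lr)
    for _ in range(epochs):
        for faces in loader:  # faces: [B, 3, H, W] aligned crops in [-1, 1]
            faces = faces.to(device)
            recon = ae(degrade(faces))
            loss = nn.functional.l1_loss(recon, faces) + perceptual(recon, faces).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
    return ae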

@mahirshahriar1

Can you share more details?

@chunyu-li
Collaborator

@Soonwang1988 Hey, I am very interested in how you trained the VAE. Can you provide more details? For example:

  1. Which codebase did you use to train the VAE, and what exactly are the noise and blur operations?
  2. Did you use a discriminator loss to train the VAE? If so, what is the specific architecture of the discriminator?
  3. When you use this VAE with the U-Net, what scaling_factor do you set? Is it 0.18215?
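
For context on the scaling_factor question: in the standard Stable Diffusion setup, latents are multiplied by scaling_factor before entering the U-Net and divided by it before decoding. A minimal sketch with the diffusers API; the 0.18215 value is the SD VAE default, and LatentSync's actual value may differ.

import torch
from diffusers import AutoencoderKL

# Standard Stable Diffusion convention for scaling_factor; not copied from LatentSync code.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
scaling_factor = vae.config.scaling_factor  # 0.18215 for the SD VAE

image = torch.randn(1, 3, 256, 256)  # dummy stand-in for a face crop in [-1, 1]
latents = vae.encode(image).latent_dist.sample() * scaling_factor  # what the U-Net sees
decoded = vae.decode(latents / scaling_factor).sample  # back to pixel space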

@Soonwang1988
Author

  1. I implemented a basic autoencoder with Gaussian noise on the input and random Gaussian blur with a random sigma (a sketch of the apply_random_blur helper referenced here follows after this list):

        if self.training:
            add_blur = True
            add_noise = True
            if add_blur:
                # blur every image in the batch with a randomly chosen sigma
                x = torch.stack([self.apply_random_blur(x[i]) for i in range(x.shape[0])])

            if add_noise:
                # additive Gaussian noise; level 0 (no noise) is sampled most often
                n_level = random.choice([0, 0, 0, 1, 2, 3, 4, 5])
                noise_x = n_level * torch.randn_like(x) / 10
                x = x + noise_x.detach()

  2. I did not use a discriminator at all, only VGG-19 for LPIPS plus an L1 loss.
  3. I used the default LatentSync inference configuration and saved the faces from affine_transform_video in latentsync/pipelines/lipsync_pipeline.py. You are loading all frames in one go, which is why the code goes out of memory for longer videos, so I had to change it to read the faces one by one with OpenCV.
  4. In restore_video, after LatentSync inference and before restoration, I added my custom model:

        width = int(x2 - x1)

        face = self.custom_ae(face.unsqueeze(0)).squeeze(0)  # run the person-specific AE on the generated face
        face = torchvision.transforms.functional.resize(face, size=(height, width), antialias=True)
        face = rearrange(face, "c h w -> h w c")

  5. For a good AE you might need 3-5 minutes of data.
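
The apply_random_blur helper called in item 1 is not shown above. One possible implementation, assuming torchvision's gaussian_blur and a uniformly sampled sigma (both the kernel-size rule and the sigma range are guesses, not the author's actual code):

import random

import torchvision.transforms.functional as TF


def apply_random_blur(self, img, max_sigma=3.0):
    # Hypothetical method matching the self.apply_random_blur(x[i]) call above:
    # blur a single CHW image tensor with a randomly chosen sigma.
    sigma = random.uniform(0.1, max_sigma)
    kernel_size = 2 * int(2 * sigma) + 1  # odd kernel roughly covering the blur radius
    return TF.gaussian_blur(img, kernel_size=kernel_size, sigma=sigma)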

@Soonwang1988
Author

Soonwang1988 commented Mar 23, 2025

Once this is implemented, the main research question would be how to train stage 3 with less data, say 1 minute or 30 seconds; for that, maybe we can start from an sd-vae-ft-mse type of model.
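
One way to start from a pretrained VAE instead of training from scratch would be to fine-tune sd-vae-ft-mse on the person-specific crops. A rough sketch with the diffusers API; freezing the encoder and the plain L1 loss are illustrative assumptions, not a tested recipe:

import torch
from diffusers import AutoencoderKL

# Load the pretrained VAE and fine-tune only the decoder on person-specific crops.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
vae.encoder.requires_grad_(False)  # keep the encoder fixed
opt = torch.optim.Adam(vae.decoder.parameters(), lr=1e-5)


def finetune_step(faces):
    # faces: [B, 3, H, W] aligned crops scaled to [-1, 1]
    latents = vae.encode(faces).latent_dist.sample()
    recon = vae.decode(latents).sample
    loss = torch.nn.functional.l1_loss(recon, faces)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()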

@chunyu-li
Collaborator

chunyu-li commented Mar 23, 2025

Thanks for providing so much useful information!
But I still don't understand why you say "you are loading all frames in one go"; I used cv2 to read and write video frames one by one.

@chunyu-li
Collaborator

And another question: you said you implemented a basic autoencoder. Is its architecture the same as https://huggingface.co/stabilityai/sd-vae-ft-mse, and are you fine-tuning from it?

@Soonwang1988
Author

Soonwang1988 commented Mar 23, 2025

In read_video_cv2 you are loading all frames into one NumPy array, which is not good for longer videos, since all the frames accumulate in memory:

def read_video_cv2(video_path: str):
    # Open the video file
    cap = cv2.VideoCapture(video_path)

    # Check if the video was opened successfully
    if not cap.isOpened():
        print("Error: Could not open video.")
        return np.array([])

    frames = []

    while True:
        # Read a frame
        ret, frame = cap.read()

        # If frame is read correctly ret is True
        if not ret:
            break

        # Convert BGR to RGB
        frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

        frames.append(frame_rgb)

    # Release the video capture object
    cap.release()

    return np.array(frames)
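
A streaming alternative would be a generator that yields frames one at a time, so only a single frame is held in memory (a sketch of the idea, not the author's actual patch):

import cv2


def iter_video_frames_cv2(video_path: str):
    # Yield RGB frames one by one instead of accumulating them in an array.
    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened():
        print("Error: Could not open video.")
        return
    try:
        while True:
            ret, frame = cap.read()
            if not ret:
                break
            yield cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    finally:
        cap.release()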

@chunyu-li
Collaborator

Actually, I don't think the problem is here; Wav2Lip uses exactly the same method to read video, see https://github.com/Rudrabha/Wav2Lip/blob/d07fc4d8431cc5378c8c0239392485b08a976f43/inference.py#L190

I think the OOM issue might be caused by a memory leak somewhere in the affine transformation, but I haven't found the root cause yet.

@Soonwang1988
Author

Soonwang1988 commented Mar 24, 2025

@pedrolabonia

pedrolabonia commented Mar 24, 2025

Hi @Soonwang1988, could you share the full implementation and a how-to for newbies like myself? Are you doing any U-Net training at all, or is it all at inference time?

Thanks!

@mahirshahriar1

@Soonwang1988 I am unable to find the username "some_random_name162534"

@Soonwang1988
Author

Soonwang1988 commented Mar 24, 2025

@pedrolabonia you can connect with me on discord at soonwang1988

@pedrolabonia

> @pedrolabonia you can connect with me on discord at soonwang1988

Sent you a request! It's the one with the parrot pic.

@chunyu-li
Collaborator

> https://github.com/user-attachments/assets/7c08f389-e0e9-4805-9c08-295669c87a85 (with ae) https://github.com/user-attachments/assets/e677d2bf-6703-4c4e-8d32-cb650405f50c (without ae)
>
> Check out these two files, one is without the extra AE and the other one is with the AE.
>
> source used: https://www.youtube.com/watch?v=r1CInfA6lV4

To be honest, I don't see a significant difference in clarity between these two videos. 😂

@Soonwang1988
Author

Soonwang1988 commented Mar 24, 2025

One has a glitch, so it is easy to tell it is AI-generated, while the other one has visemes very close to the original person. Also, to save time, I trained a 256x256 AE only.
