Feature Request: person-specific autoencoder for stage 3 #183
Comments
Can you share more details?
@Soonwang1988 Hey, I am very interested in how you train the VAE. Can you provide more details? Such as:
```python
# Training-time augmentation (excerpt; assumes torch, random, torchvision,
# and einops.rearrange are imported, and x is a batch of face crops)
if self.training:
    add_blur = True
    add_noise = True
    if add_blur:
        # Blur each image in the batch independently
        x = torch.stack([self.apply_random_blur(x[i]) for i in range(x.shape[0])])
    if add_noise:
        # Random noise level; zero (no noise) is three times as likely as any other level
        n_level = random.choice([0, 0, 0, 1, 2, 3, 4, 5])
        noise_x = n_level * torch.randn_like(x) / 10
        x = x + noise_x.detach()

# Inference-time application of the person-specific autoencoder to the restored face crop
width = int(x2 - x1)
face = self.custom_ae(face.unsqueeze(0)).squeeze(0)
face = torchvision.transforms.functional.resize(face, size=(height, width), antialias=True)
face = rearrange(face, "c h w -> h w c")
```
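For reference, `apply_random_blur` is not shown in the thread; a plausible minimal sketch, assuming it simply applies a random-strength Gaussian blur to a single (C, H, W) tensor, could look like this (the kernel sizes and sigma range are illustrative):

```python
import random

import torch
import torchvision.transforms.functional as TF


def apply_random_blur(img: torch.Tensor) -> torch.Tensor:
    """Apply a randomly sized Gaussian blur to one (C, H, W) image tensor (assumed helper)."""
    kernel_size = random.choice([1, 3, 5, 7])
    if kernel_size == 1:
        return img  # skip blurring some of the time
    sigma = random.uniform(0.1, 2.0)
    return TF.gaussian_blur(img, kernel_size=kernel_size, sigma=sigma)
```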
Now, once this is implemented, the main research question would be how to train stage 3 with less data, say 1 minute or 30 seconds. For that, we could perhaps use an sd-vae-ft-mse type of model.
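For reference, a minimal sketch of loading the public sd-vae-ft-mse weights with the diffusers `AutoencoderKL` as a starting point for this kind of fine-tuning (the input shape and scaling here are illustrative, not from the thread):

```python
import torch
from diffusers import AutoencoderKL

# Load the pretrained VAE; fine-tuning on person-specific frames would start from here.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
vae.train()

# Round-trip a dummy batch of face crops scaled to [-1, 1]; a reconstruction loss
# (e.g. L1 + perceptual) between `recon` and the input would drive the fine-tuning.
x = torch.rand(1, 3, 256, 256) * 2 - 1
z = vae.encode(x).latent_dist.sample()
recon = vae.decode(z).sample
```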
Thanks for providing so much useful information!
And another question: you said you implemented a basic autoencoder. Is its architecture the same as https://huggingface.co/stabilityai/sd-vae-ft-mse, and are you fine-tuning from it?
In read_video_cv2 you are loading all frames into a numpy array, which is not good: for a longer video, all frames accumulate in memory:

```python
import cv2
import numpy as np


def read_video_cv2(video_path: str):
    # Open the video file
    cap = cv2.VideoCapture(video_path)
    # Check if the video was opened successfully
    if not cap.isOpened():
        print("Error: Could not open video.")
        return np.array([])
    frames = []
    while True:
        # Read a frame
        ret, frame = cap.read()
        # If the frame is read correctly, ret is True
        if not ret:
            break
        # Convert BGR to RGB
        frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        frames.append(frame_rgb)
    # Release the video capture object
    cap.release()
    return np.array(frames)
```
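As a rough sketch of the alternative being suggested (not code from the repo), frames could be yielded lazily instead of accumulated, so memory stays bounded regardless of video length:

```python
import cv2


def iter_video_frames(video_path: str):
    """Yield RGB frames one at a time instead of building a single large array."""
    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened():
        print("Error: Could not open video.")
        return
    try:
        while True:
            ret, frame = cap.read()
            if not ret:
                break
            yield cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    finally:
        cap.release()
```

Downstream code can then consume frames in small batches rather than holding the whole clip at once.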
Actually, I don't think the problem is here; Wav2Lip uses exactly the same method to read video, see https://github.com/Rudrabha/Wav2Lip/blob/d07fc4d8431cc5378c8c0239392485b08a976f43/inference.py#L190. I think the OOM issue might be caused by a memory leak somewhere in the affine transformation, but I haven't found the root cause yet.
https://github.com/user-attachments/assets/7c08f389-e0e9-4805-9c08-295669c87a85 (with AE)
Check out these two files: one is without the extra AE and the other one is with the AE. Source used
Hi @Soonwang1988, could you share the full implementation and a how-to for newbies like myself? Are you doing any UNet training at all, or is it all at inference time? Thanks!
@Soonwang1988 I am unable to find the username "some_random_name162534"
@pedrolabonia you can connect with me on discord at soonwang1988 |
Sent you a request! It's the one with the parrot pic. |
To be honest, I don't see a significant difference in clarity between these two videos. 😂 |
One has a glitch, so it is easy to judge that it is AI generated, while the other one has visemes very close to the original person. Also, to save time, I trained a 256x256 AE only.
Hi, I found a way to use LatentSync where I extract aligned frames from a video at inference time and train an autoencoder on these images, with noise and blur as augmentation, using VGG and L1 loss.
Then, in the final inference, I apply this autoencoder to the output of the LatentSync model and get a clear, person-specific result with good lip sync.
I found this does not work on data shorter than 5 minutes, as the number of frames is small and there are few examples of the different visemes.
You can also train a 256x256 lipsync model and a 256->512 person-specific autoencoder.
Please consider implementing this; it would help avatar technology.
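A minimal sketch of the kind of fine-tuning loop described above, assuming L1 plus a VGG perceptual loss on noise-augmented face crops (the autoencoder architecture, data loader, and hyperparameters are placeholders; the original code was not shared):

```python
import torch
import torch.nn.functional as F
import torchvision


def finetune_person_ae(autoencoder, face_loader, epochs=10, vgg_weight=0.1, lr=1e-4):
    """Fine-tune a person-specific autoencoder on aligned face crops with L1 + VGG loss.

    `autoencoder` and `face_loader` are placeholders: the AE architecture and the
    dataset of aligned frames extracted at inference time are not given in the thread.
    """
    # Frozen VGG16 feature extractor for the perceptual term (an assumed choice).
    vgg = torchvision.models.vgg16(weights=torchvision.models.VGG16_Weights.DEFAULT).features[:16].eval()
    for p in vgg.parameters():
        p.requires_grad_(False)

    optimizer = torch.optim.Adam(autoencoder.parameters(), lr=lr)
    for _ in range(epochs):
        for clean in face_loader:  # clean: (B, 3, H, W) aligned face crops in [0, 1]
            # Noise augmentation; a random blur could be applied here as well.
            degraded = torch.clamp(clean + 0.1 * torch.randn_like(clean), 0.0, 1.0)
            recon = autoencoder(degraded)
            loss = F.l1_loss(recon, clean) + vgg_weight * F.l1_loss(vgg(recon), vgg(clean))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return autoencoder
```

The trained AE would then be run on each face crop produced by LatentSync at the final inference step, as in the snippet quoted earlier in the thread.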