
Different Implementation of Diffusion Model #35

Open
siyag12 opened this issue Nov 24, 2023 · 1 comment

Comments


siyag12 commented Nov 24, 2023

I'm a researcher working on building a TTS model using diffusion, and I found this repo while looking for an implementation.

According to my understanding of the paper, both processes in the diffusion decoder, forward and reverse diffusion, are supposed to take place on the latent vector z produced by the U-Net encoder. However, the repo's implementation seems to differ from this understanding. Could you explain the reasoning behind it?


li1jkdaw commented Aug 23, 2024

Usually, the term "latent" in the context of diffusion modeling denotes the space where the forward and reverse diffusions are defined: if the clean image/spectrogram is x_0, then its noisy versions x_t can be called "latents". The paper you mentioned uses the term "latent" in this sense. In Grad-TTS, the score-matching network is parameterized as a U-Net, but its encoder does not produce "latents" in this sense. So diffusion in Grad-TTS does not take place in the space of the U-Net encoder's outputs; rather, the U-Net as a whole (encoder + decoder) maps a noisy object x_t in the "latent" space to the score function at x_t.
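
To illustrate the distinction, here is a minimal PyTorch sketch (with hypothetical names like `TinyUNetScore` and a toy noise schedule, not the actual Grad-TTS code): the forward diffusion perturbs x_0 itself, and the whole network maps the noisy x_t to a score estimate of the same shape; the encoder's intermediate features are never the space where diffusion runs.

```python
import torch
import torch.nn as nn

class TinyUNetScore(nn.Module):
    """Toy stand-in for the score network: input x_t, output the score at x_t."""
    def __init__(self, channels: int = 80):
        super().__init__()
        self.enc = nn.Conv1d(channels, 128, 3, padding=1)  # "encoder" features (not diffusion latents)
        self.dec = nn.Conv1d(128, channels, 3, padding=1)  # decode back to the space of x_t

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.enc(x_t) + t.view(-1, 1, 1))   # crude time conditioning
        return self.dec(h)                                 # score estimate, same shape as x_t

# Forward diffusion acts on the spectrogram x_0 directly (variance-preserving style):
# x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
x_0 = torch.randn(4, 80, 100)     # batch of mel-spectrograms
t = torch.rand(4)                 # diffusion times in (0, 1)
alpha_bar = torch.exp(-5.0 * t)   # toy noise schedule, not the Grad-TTS one
eps = torch.randn_like(x_0)
x_t = (alpha_bar.sqrt().view(-1, 1, 1) * x_0
       + (1 - alpha_bar).sqrt().view(-1, 1, 1) * eps)

score = TinyUNetScore()(x_t, t)   # the "latents" here are the x_t, not enc(x_t)
assert score.shape == x_0.shape
```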
