Two questions about DiffVC #31

Open
huangf79 opened this issue Sep 14, 2023 · 1 comment

huangf79 commented Sep 14, 2023

Hello, thank you for sharing this excellent work. After briefly browsing the code, I have two questions:
(1) What is x_ref used for? During training it seems to be a different fragment of the same mel-spectrogram as x. Which part of the paper does it correspond to?
(2) Why do we need a weighted summation of mean and x? Does this mean that the reverse diffusion during inference starts from the weighted mean_x?
I'm new to diffusion models and don't fully understand the theory in the paper, so apologies if these questions are basic.

@li1jkdaw

Hi!

  1. The speaker encoder uses x_ref (a different fragment of the same mel-spectrogram as x) as an additional input to the trainable speaker conditioning network, denoted by g_t(Y) in the paper. Different inputs to this network are compared in Table 1.
  2. Yes, reverse diffusion starts from mean_x = self.decoder.compute_diffused_mean(x, x_mask, mean, 1.0), which is in fact very close to mean (because t=1.0 in this case). Here mean is the "average voice" mel-spectrogram, denoted by \bar{X} in the paper.
    During training, the weighted summation of mean and x is necessary because it comes from the forward diffusion (see formula (3) in the paper); at the final time t=1.0, the forward diffusion ends in the prior N(\bar{X}, I).
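
To make the "weighted summation" in point 2 concrete, here is a minimal sketch of how the weights behave over time. It is not the repository's code: it assumes a linear noise schedule beta_s = beta_min + (beta_max - beta_min) * s and a forward diffusion that drifts toward mean, and the values beta_min=0.05 and beta_max=20.0, as well as all function names, are illustrative assumptions.

```python
import math


def cumulative_beta(t, beta_min=0.05, beta_max=20.0):
    """Integral of the assumed linear schedule beta_s over [0, t]."""
    return beta_min * t + 0.5 * (beta_max - beta_min) * t ** 2


def diffused_mean_weights(t, beta_min=0.05, beta_max=20.0):
    """Return (weight on x, weight on mean) for the forward-diffusion mean at time t."""
    w_x = math.exp(-0.5 * cumulative_beta(t, beta_min, beta_max))
    return w_x, 1.0 - w_x


def diffused_mean(x, mean, t, beta_min=0.05, beta_max=20.0):
    """Weighted sum of x and mean, element by element (plain lists stand in for tensors)."""
    w_x, w_mean = diffused_mean_weights(t, beta_min, beta_max)
    return [w_x * xi + w_mean * mi for xi, mi in zip(x, mean)]


def diffused_variance(t, beta_min=0.05, beta_max=20.0):
    """Variance of the noise added on top of the diffused mean at time t."""
    return 1.0 - math.exp(-cumulative_beta(t, beta_min, beta_max))


if __name__ == "__main__":
    for t in (0.0, 0.25, 0.5, 1.0):
        w_x, w_mean = diffused_mean_weights(t)
        print(f"t={t:.2f}  w_x={w_x:.4f}  w_mean={w_mean:.4f}  var={diffused_variance(t):.4f}")
```

Under these assumed values, the weight on x falls below 1% at t=1.0 and the variance approaches 1, which is consistent with mean_x being very close to mean and with the forward diffusion ending near N(\bar{X}, I).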
