MoVQ implementation question #10

Open
kebijuelun opened this issue Dec 15, 2023 · 2 comments
@kebijuelun

I have some questions regarding the implementation of MoVQ and would appreciate your clarification.

The original MoVQ paper mentions that a multi-channel VQ (vector quantization) is adopted.
[screenshot: excerpt from the MoVQ paper describing multi-channel quantization]

However, the Kandinsky 3 implementation does not involve any vector-quantization operation:

```python
class MoVQ(nn.Module):

    def __init__(self, generator_params):
        super().__init__()
        z_channels = generator_params["z_channels"]
        self.encoder = Encoder(**generator_params)
        self.quant_conv = torch.nn.Conv2d(z_channels, z_channels, 1)
        self.post_quant_conv = torch.nn.Conv2d(z_channels, z_channels, 1)
        self.decoder = Decoder(zq_ch=z_channels, **generator_params)

    @torch.no_grad()
    def encode(self, x):
        h = self.encoder(x)
        h = self.quant_conv(h)
        return h  # continuous latents returned directly, with no codebook lookup

    @torch.no_grad()
    def decode(self, quant):
        decoder_input = self.post_quant_conv(quant)
        decoded = self.decoder(decoder_input, quant)
        return decoded
```

Am I misunderstanding MoVQ, or has Kandinsky modified the original MoVQ implementation?
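For reference, the quantization step the paper describes would sit between `encode` and `decode`. A minimal sketch of such a nearest-neighbour codebook lookup (a hypothetical `VectorQuantizer` class with illustrative sizes, not code from this repository) could look like:

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Hypothetical VQGAN/MoVQ-style quantizer: snap each spatial token of
    the encoder output to its nearest codebook entry. Sizes are illustrative."""

    def __init__(self, num_codes: int = 16384, code_dim: int = 4):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)

    @torch.no_grad()
    def forward(self, h: torch.Tensor):
        # h: (B, C, H, W) continuous latents from the encoder
        b, c, hh, ww = h.shape
        flat = h.permute(0, 2, 3, 1).reshape(-1, c)  # (B*H*W, C)
        # Squared distance from every token to every codebook entry
        d = (flat.pow(2).sum(1, keepdim=True)
             - 2 * flat @ self.codebook.weight.t()
             + self.codebook.weight.pow(2).sum(1))
        idx = d.argmin(dim=1)  # index of the nearest code per token
        quant = self.codebook(idx).reshape(b, hh, ww, c).permute(0, 3, 1, 2)
        return quant, idx
```

In the `MoVQ` class quoted above, `encode` returns `h` directly, so no such lookup happens between the encoder and decoder.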

@pekinghk

I've noticed the same issue. It appears the author simply reused the encoder and decoder architecture from MoVQGAN and retrained it as a (continuous-latent) VAE model. I'm not sure my understanding is correct, so I would appreciate clarification from the author.

@FlyHighest

After training, each token of the encoder's output is very close to some vector in the codebook. I tried adding the VQ step back to refine the encoder's output and found that the MSE between the features before and after VQ is small, so decoding with and without VQ produces nearly identical results.
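This observation can be illustrated with a small numeric sketch: if encoder outputs already sit next to codebook entries, snapping them to the nearest code barely changes the features. Everything below is hypothetical stand-in data, not the actual Kandinsky 3 codebook:

```python
import torch

torch.manual_seed(0)

# Stand-ins: a small codebook, and "encoder outputs" that already lie
# close to codebook entries (the trained behaviour described above).
codebook = torch.randn(16, 4)                       # (num_codes, code_dim)
idx = torch.randint(0, 16, (100,))
tokens = codebook[idx] + 0.01 * torch.randn(100, 4)

# The VQ step: snap each token to its nearest codebook entry.
nearest = torch.cdist(tokens, codebook).argmin(dim=1)
quant = codebook[nearest]

# The MSE between pre- and post-VQ features is tiny, so the decoder sees
# nearly the same input whether or not quantization is applied.
mse = torch.mean((tokens - quant) ** 2).item()
print(f"MSE before/after VQ: {mse:.6f}")
```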
