Attention layer in GradTTS #15

Open

patrickvonplaten opened this issue Jun 28, 2022 · 2 comments

Comments

@patrickvonplaten

Hey @ivanvovk et al.

Thanks a lot for open-sourcing the model - it's working really well! I've been looking a bit through the code base, and I was surprised to see that the attention layer here:

k = k.softmax(dim=-1)

computes the softmax on the projected key values instead of computing it on the product of query and key.

Usually, I know self-attention as:

Value x Softmax(Query x Key^T / sqrt(d_k))

but it seems like here it is

(Value x Softmax(Key)) x Query

=> Is it similar to self-attention? Where does it come from?
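
To make the contrast concrete, here is a small self-contained sketch of the two patterns as I read them (shapes, names and einsum strings are my own, just for illustration, not copied from the repo):

```python
import torch

# Toy channel-first shapes: batch, heads, head dim, sequence length
b, h, d, n = 2, 4, 16, 50
q = torch.randn(b, h, d, n)
k = torch.randn(b, h, d, n)
v = torch.randn(b, h, d, n)

# Classical scaled dot-product attention:
# softmax over the key positions of Q^T K / sqrt(d_k), then applied to V.
scores = torch.einsum('bhdn,bhdm->bhnm', q, k) / d ** 0.5
out_standard = torch.einsum('bhnm,bhdm->bhdn', scores.softmax(dim=-1), v)

# The pattern quoted above: softmax over the keys alone, a (d x d) context
# matrix built from keys and values, and only then a product with the queries.
k_soft = k.softmax(dim=-1)                            # over sequence positions
context = torch.einsum('bhdn,bhen->bhde', k_soft, v)  # global context, no Q involved
out_keysoftmax = torch.einsum('bhde,bhdn->bhen', context, q)
```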

Best,
Patrick

@ivanvovk
Contributor

Hi, @patrickvonplaten! Sorry for the late reply, and thank you very much for pointing that out!

Actually, this is a form of the compute- and memory-efficient attention mechanism called Efficient Attention. Mathematically, it is claimed to be approximately equivalent to classical dot-product attention.

However, we unfortunately noticed that we missed taking the softmax of the query vectors, our bad. That said, the softmax here is just a form of normalization, so it's no surprise that it worked out of the box anyway.
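
For reference, with the softmax normalization from the Efficient Attention paper applied to both projections, the layer would look roughly like this (a minimal sketch; the layout and names are assumed for illustration, not the exact GradTTS code):

```python
import torch

b, h, d, n = 2, 4, 16, 50  # batch, heads, head dim, sequence length
q = torch.randn(b, h, d, n)
k = torch.randn(b, h, d, n)
v = torch.randn(b, h, d, n)

q = q.softmax(dim=-2)  # normalize each query over its feature dimension
k = k.softmax(dim=-1)  # normalize each key feature over the sequence positions

# Aggregate keys and values into a small global context first (linear in n),
# then let every query read from that context.
context = torch.einsum('bhdn,bhen->bhde', k, v)
out = torch.einsum('bhde,bhdn->bhen', context, q)
```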

@patrickvonplaten
Author

I see, that makes sense! Thanks for replying so quickly!
