Attention layer in GradTTS #15

Open

patrickvonplaten opened this issue Jun 28, 2022 · 2 comments

Comments

@patrickvonplaten

Hey @ivanvovk et al.

Thanks a lot for open-sourcing the model - it's working really well! I've been looking a bit through the code base, and I was surprised to see that the attention layer here:

k = k.softmax(dim=-1)

computes the softmax on the projected key values instead of computing it on the product of query and key.

Usually, I know self-attention as:

Value x Softmax(Query x Key^T / sqrt(d_k))

but it seems like here it is

(Value x Softmax(Key)) x Query

=> Is it similar to self-attention? Where does it come from?
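
To make the contrast concrete, here is a small self-contained sketch of the two patterns as I read them (shapes, names and einsum strings are my own, just for illustration, not copied from the repo):

```python
import torch

# Toy channel-first shapes: batch, heads, head dim, sequence length
b, h, d, n = 2, 4, 16, 50
q = torch.randn(b, h, d, n)
k = torch.randn(b, h, d, n)
v = torch.randn(b, h, d, n)

# Classical scaled dot-product attention:
# softmax over the key positions of Q^T K / sqrt(d_k), then applied to V.
scores = torch.einsum('bhdn,bhdm->bhnm', q, k) / d ** 0.5
out_standard = torch.einsum('bhnm,bhdm->bhdn', scores.softmax(dim=-1), v)

# The pattern quoted above: softmax over the keys alone, a (d x d) context
# matrix built from keys and values, and only then a product with the queries.
k_soft = k.softmax(dim=-1)                            # over sequence positions
context = torch.einsum('bhdn,bhen->bhde', k_soft, v)  # global context, no Q involved
out_keysoftmax = torch.einsum('bhde,bhdn->bhen', context, q)
```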

Best,
Patrick

@ivanvovk
Contributor

Hi, @patrickvonplaten! Sorry for the late reply, and thank you very much for pointing that out!

Actually, this is a form of the compute- and memory-efficient attention mechanism called Efficient Attention. Mathematically, it is claimed to be approximately equivalent to classical dot-product attention.

However, we unfortunately noticed that we missed taking the softmax of the query vectors, our bad. That said, the softmax here is just a form of normalization, so it's no surprise that it worked out of the box anyway.
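
For reference, with the softmax normalization from the Efficient Attention paper applied to both projections, the layer would look roughly like this (a minimal sketch; the layout and names are assumed for illustration, not the exact GradTTS code):

```python
import torch

b, h, d, n = 2, 4, 16, 50  # batch, heads, head dim, sequence length
q = torch.randn(b, h, d, n)
k = torch.randn(b, h, d, n)
v = torch.randn(b, h, d, n)

q = q.softmax(dim=-2)  # normalize each query over its feature dimension
k = k.softmax(dim=-1)  # normalize each key feature over the sequence positions

# Aggregate keys and values into a small global context first (linear in n),
# then let every query read from that context.
context = torch.einsum('bhdn,bhen->bhde', k, v)
out = torch.einsum('bhde,bhdn->bhen', context, q)
```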

@patrickvonplaten
Author

I see, that makes sense! Thanks for replying so quickly!
