
Why can we remove the softmax? #6

Open
HaoWuSR opened this issue Nov 28, 2022 · 0 comments
HaoWuSR commented Nov 28, 2022

Hi,
Thanks for your excellent work!
I have read your paper carefully but am still confused about the following issues:

  1. I understand the normalization of Q and K for balancing the weights, but why can we remove the SOFTMAX? The description in the paper is: "Given this simple normalization method, we remove the SOFTMAX layer, so the attention block can be ......" What is the reason the SOFTMAX can be removed?

  2. The paper says the attention values are bounded between -D and D due to the normalization. What is the difference between this form and the traditional form? (A rough sketch of the two forms, as I understand them, follows this list.)
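
For concreteness, here is a minimal sketch of the two forms as I read them: standard softmax attention versus attention that only normalizes Q and K and drops the softmax. The per-row L2 normalization and the tensor shapes are my own assumptions for illustration, not necessarily your exact formulation:

```python
# Minimal sketch (PyTorch): softmax attention vs. a softmax-free variant
# that only normalizes Q and K. The per-row L2 normalization here is an
# assumption for illustration, not necessarily the paper's exact scheme.
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    # q, k, v: (batch, seq_len, dim)
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5   # (batch, seq_len, seq_len)
    return F.softmax(scores, dim=-1) @ v

def normalized_attention(q, k, v):
    # Normalize each row of Q and K to unit L2 norm, so every score
    # q_i . k_j is already bounded in [-1, 1]; no softmax is applied
    # before weighting V.
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)
    scores = q @ k.transpose(-2, -1)              # bounded scores
    return scores @ v

q, k, v = (torch.randn(2, 4, 8) for _ in range(3))
print(softmax_attention(q, k, v).shape)     # torch.Size([2, 4, 8])
print(normalized_attention(q, k, v).shape)  # torch.Size([2, 4, 8])
```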

Thanks for your time; I look forward to your reply.
