linear_focus_attention question #1

Open
chenying0722 opened this issue Aug 24, 2024 · 5 comments

Comments

@chenying0722

In the linear_focus_attention part, why isn't an operation like phi_qs = (F.relu(qs) + 1e-6) / (self.norm_scale.abs() + 1e-6) also applied to v? I ask because Equation (15) in the paper applies the Phi function to Q_s, K_s, and V_s.

@marlin-codes
Collaborator

Thank you for your comment. This is a typo; our results were obtained without this operation. Our motivation was to make attention more focused, particularly with respect to $q$ and $k$, so we did not apply this operation to $v$. Nevertheless, after you opened this issue we also tried applying it to $v$ and found that it has minimal impact on the final results. Thank you again.
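
For illustration only, here is a minimal pure-Python sketch of the distinction discussed above. It is not the repository's implementation: `phi` copies only the expression quoted in the question, while `linear_attention`, its `phi_on_v` flag, and the toy inputs are hypothetical names and a generic linear-attention form assumed for the example.

```python
def phi(row, norm_scale=1.0, eps=1e-6):
    """Map from the quoted line: (relu(x) + eps) / (|norm_scale| + eps)."""
    return [(max(x, 0.0) + eps) / (abs(norm_scale) + eps) for x in row]


def linear_attention(qs, ks, vs, norm_scale=1.0, phi_on_v=False):
    """Generic linear attention (assumed form): phi(Q) (phi(K)^T V), row-normalized."""
    phi_qs = [phi(q, norm_scale) for q in qs]
    phi_ks = [phi(k, norm_scale) for k in ks]
    if phi_on_v:
        # Variant asked about in the issue: also pass v through phi.
        vs = [phi(v, norm_scale) for v in vs]
    d_k, d_v = len(phi_ks[0]), len(vs[0])
    # kv[a][b] = sum_j phi_k[j][a] * v[j][b], computed once for all queries.
    kv = [[sum(k[a] * v[b] for k, v in zip(phi_ks, vs)) for b in range(d_v)]
          for a in range(d_k)]
    # k_sum[a] = sum_j phi_k[j][a], used for the normalizer.
    k_sum = [sum(k[a] for k in phi_ks) for a in range(d_k)]
    out = []
    for q in phi_qs:
        denom = sum(q[a] * k_sum[a] for a in range(d_k))  # > 0 because phi > 0
        out.append([sum(q[a] * kv[a][b] for a in range(d_k)) / denom
                    for b in range(d_v)])
    return out


# Toy inputs (made up for the example).
qs = [[1.0, -2.0], [0.5, 0.3]]
ks = [[0.2, 1.0], [-1.0, 0.4]]
vs = [[1.0, -1.0], [2.0, 3.0]]

out_default = linear_attention(qs, ks, vs)               # phi on q and k only
out_phi_v = linear_attention(qs, ks, vs, phi_on_v=True)  # phi on q, k, and v
```

In this sketch the attention weights are positive in both variants, because `phi` is applied to q and k either way. The flag only changes whether the values being averaged keep their sign.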

@chenying0722
Author

I understand. Thank you for your answer!

@chenying0722
Author

Sorry, I still have a question. In the paper, formula (18) handles Z_t by subtracting the reciprocal of the curvature k, while the code adds the curvature k. Looking forward to your answer!

@chenying0722 chenying0722 reopened this Aug 30, 2024
@marlin-codes
Collaborator

Thanks for your question!

For the first question

If the curvature c is fixed to a negative value (i.e., c < 0), then -c == abs(c); for example, -(-1) == abs(-1) == 1.

Therefore, we directly define k as a positive value and add it in the equations; with k = 1 this is +1, which is equivalent to -(-1).

For the second question

Using either k or 1/k is acceptable; it depends on how you define the curvature, but it is important to be consistent. In the code, +k means that the curvature is −1/k.

Why do we use +k in the code? Because it is easier to implement and makes the parameter easier to optimize.
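
As a small numeric check of the equivalence described above, here is a sketch that assumes the time coordinate takes the usual hyperboloid form, i.e. the square root of the squared spatial norm plus a curvature term. The function names and sample values are hypothetical and are not taken from the repository.

```python
import math


def time_from_k(z_space, k):
    """Code-style convention: k > 0 and the curvature is -1/k, so k is added."""
    return math.sqrt(sum(x * x for x in z_space) + k)


def time_from_curvature(z_space, c):
    """Paper-style convention: c < 0 is the curvature, so 1/c is subtracted."""
    return math.sqrt(sum(x * x for x in z_space) - 1.0 / c)


z_s = [0.3, -1.2, 2.0]   # made-up spatial coordinates
k = 2.0                  # positive parameter, as in the code
c = -1.0 / k             # the negative curvature it corresponds to

t_code = time_from_k(z_s, k)
t_paper = time_from_curvature(z_s, c)

# Lorentzian norm of (t, z_s): -t^2 + ||z_s||^2, which equals -k = 1/c here.
lorentz_norm = -t_code ** 2 + sum(x * x for x in z_s)
```

Both functions return the same time coordinate, so the two conventions differ only in which quantity is stored as the parameter.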

@chenying0722
Author

Thank you for your answer!
I saw that the curvature was described as k < 0, which led me to think that k is also negative in the code. Now I understand.
Thanks again!
