
How does the recovery of locality work? #1

Open
liyucheng09 opened this issue Oct 25, 2024 · 3 comments

Comments

@liyucheng09

Congrats on the great paper!

[image: figure from the paper]

I have a simple question about the recovery of locality.

By adding W, it literally shifts the bottom-left corner back to the right, which is just undoing the first step of shifting leftwards. Am I misunderstanding this operation, or are there more tricks or findings here?

@ChenxinAn-fdu
Contributor

ChenxinAn-fdu commented Oct 26, 2024

Hi! Thank you for the great question.

I think you can understand this operation that way. Let me explain using the shifted position matrix of Llama 3.1 128K as an example:
[image: shifted position matrix of Llama 3.1 128K]

If we do not set local_value = 128, the 42K-th slash line (diagonal) of the matrix will be set to 0 instead of 128. However, as we all know, LLMs rely heavily on their neighboring N tokens to maintain fluent generation. By setting local_value = 128, every token is guaranteed that its 128 neighboring tokens have the closest distances.
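Here is a minimal toy sketch of how I think about this step (illustrative numbers only, not the exact Llama 3.1 128K settings, and the variable names are hypothetical; the actual code in this repo may differ):

```python
import numpy as np

def shifted_positions(seq_len: int, shift: int, local_value: int) -> np.ndarray:
    """Toy STRING-style relative-position matrix (illustrative only)."""
    q = np.arange(seq_len)[:, None]   # query indices i
    k = np.arange(seq_len)[None, :]   # key indices j
    rel = q - k                       # standard relative distances i - j
    # Distances >= shift are moved leftwards by `shift`, then pushed back
    # to the right by `local_value` so the nearest tokens keep the
    # smallest (best-trained) positions for themselves.
    shifted = np.where(rel >= shift, rel - shift + local_value, rel)
    return np.tril(shifted)           # causal: only i >= j is used

# Toy setting: sequence length 16, shift 8, local window 2
mat = shifted_positions(seq_len=16, shift=8, local_value=2)
print(mat[-1])  # [9 8 7 6 5 4 3 2 7 6 5 4 3 2 1 0]
```

With local_value = 0 the shifted far tokens would collide with the nearest tokens at distances 0 and 1; with local_value = 2 only the two nearest tokens get the two smallest distances, which is the locality being recovered.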

If my answer does not fully address your question, please feel free to ask for further clarification.

@liyucheng09
Author

Thanks for the response! I assume there will be some overlap in the position ids, i.e., multiple tokens will be associated with position 128? Can the model work in this setting without further training?

@ChenxinAn-fdu
Contributor

ChenxinAn-fdu commented Oct 27, 2024

In fact, positions 128-42K are used twice in STRING. I think there may be some negative effects from this duplication. However, based on the experimental results we obtained, the side effects of the repetition appear to be significantly outweighed by the performance improvements gained from using well-trained positional embeddings.
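As a quick illustration of that duplication with the same toy numbers as above (again hypothetical, not the real Llama 3.1 128K configuration):

```python
from collections import Counter

def last_row_positions(seq_len: int, shift: int, local_value: int) -> list[int]:
    """Position ids the last query token sees under the toy shifted scheme."""
    return [rel - shift + local_value if rel >= shift else rel
            for rel in range(seq_len - 1, -1, -1)]

row = last_row_positions(seq_len=16, shift=8, local_value=2)
dupes = {pos for pos, n in Counter(row).items() if n > 1}
print(row)            # [9, 8, 7, 6, 5, 4, 3, 2, 7, 6, 5, 4, 3, 2, 1, 0]
print(sorted(dupes))  # [2, 3, 4, 5, 6, 7]  <- the band used twice
```

In the real 128K setup this duplicated band is the 128-42K range mentioned above.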
