How to use the rope scaling with x_transformers? #159

Open · cutoken opened this issue Jun 29, 2023 · 21 comments

@cutoken commented Jun 29, 2023

Hi,
Thank you for this wonderful library. I'm wondering if there is a way to use the scaling workaround from the recent RoPE paper with x_transformers. I have seen your recent change to the rotary position embedding repo but wasn't able to identify where to make a similar modification in the x_transformers repo.

@lucidrains (Owner)

@cutoken oh hey

do you mean the interpolation of the positions? do you have a pretrained model with RoPE you are finetuning?

@lucidrains (Owner)

you are referring to lucidrains/rotary-embedding-torch@e7ce8e0 ?
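For reference, the change being referenced amounts to position interpolation: positions are divided by a scale factor before the rotary frequencies are built, so a longer context is compressed back into the position range seen during pretraining. A minimal sketch of the idea (illustrative, not the library's exact code):

```python
# Minimal sketch of position interpolation for RoPE (illustrative, not the
# library's exact code): positions are divided by `interpolate_factor` before
# the rotary frequencies are built, compressing a longer sequence back into
# the position range seen during pretraining.
import torch

def interpolated_rotary_freqs(seq_len, dim, interpolate_factor = 1., theta = 10000):
    # standard RoPE inverse frequencies
    inv_freq = 1. / (theta ** (torch.arange(0, dim, 2).float() / dim))
    # the interpolation step: scale positions down by new_len / old_len
    positions = torch.arange(seq_len).float() / interpolate_factor
    freqs = torch.einsum('i , j -> i j', positions, inv_freq)
    return torch.cat((freqs, freqs), dim = -1)  # (seq_len, dim), ready for sin / cos

# e.g. a model pretrained at 2048 tokens, fine-tuned at 8192 tokens
freqs = interpolated_rotary_freqs(8192, 64, interpolate_factor = 8192 / 2048)
```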

lucidrains added a commit that referenced this issue Jun 29, 2023

@cutoken (Author) commented Jun 29, 2023

Thank you @lucidrains for your quick help. To answer your other question: I have some small models trained on TinyStories with RoPE which I wanted to use to reproduce the interpolation paper's results. So far no luck :)

@cutoken (Author) commented Jun 29, 2023

It is actually quite weird. I'm able to reproduce the issue of the loss increasing as the context is extended, but the interpolation solution doesn't really work for me. Have you had success reproducing the paper on any smaller model?

@lucidrains (Owner)

@cutoken do you mean you tried it just now and it didn't work? did you follow their recipe of fine-tuning on 1k longer-context samples, with a rotary_interpolation_factor = new_max_seq_len / old_max_seq_len?
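In x-transformers terms, the setup being described would look roughly like the sketch below. `rotary_interpolation_factor` is the keyword named in this thread (added by the commit referenced above and reverted later in the thread), so treat it as illustrative rather than guaranteed current API:

```python
# Illustrative only: `rotary_interpolation_factor` is the kwarg discussed in
# this thread (added by the commit above and reverted later on), so it may
# not exist in current x-transformers releases.
from x_transformers import TransformerWrapper, Decoder

old_max_seq_len = 2048   # context length the checkpoint was pretrained with
new_max_seq_len = 8192   # longer context to fine-tune at

model = TransformerWrapper(
    num_tokens = 32000,
    max_seq_len = new_max_seq_len,
    attn_layers = Decoder(
        dim = 512,
        depth = 8,
        heads = 8,
        rotary_pos_emb = True,
        rotary_interpolation_factor = new_max_seq_len / old_max_seq_len  # = 4.
    )
)

# then load the pretrained weights and fine-tune on ~1k longer-context
# samples, per the recipe mentioned above
```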

@lucidrains (Owner)

@cutoken if i hear it doesn't work from a few people, i may just remove it in favor of xpos

@cutoken (Author) commented Jun 30, 2023

Will confirm once again in a fresh experiment @lucidrains

@lucidrains (Owner)

share it with w&b!

@cutoken (Author) commented Jun 30, 2023

Can confirm it is actually worse than directly fine-tuning without interpolation. I'm using a really small model, but I don't see why that should matter since the interpolation factor is being set as per the paper's guidelines. I can send you the weights and biases of the smaller pre-interpolated one so that you can also test it if needed (please provide your email address in that case).

@lucidrains (Owner)

@cutoken that would be great! could you share the training script too? i think i'll go ahead and remove it from this repository until i hear more feedback (or see a paper that corroborates the technique)

lucidrains added a commit that referenced this issue Jun 30, 2023
This reverts commit 8fa7b4c.
@lucidrains (Owner) commented Jun 30, 2023

@cutoken do you want to double check your experiments? i see someone legit corroborating the results from Meta https://kaiokendev.github.io/context (seems to be concurrent work)

@lucidrains (Owner)

@cutoken they also released a new long context model, LongChat, at 16k

@cutoken (Author) commented Jun 30, 2023

I hope it works as well :)

The link below contains the training script I have used and also the checkpoint with RoPE enabled (both checkpoints are the same; I just made a backup in case you end up overwriting it). You can use it as the starting point to train with and without the newly added parameter to see the difference. You will need the TinyStories dataset from Hugging Face. I have added the SentencePiece vocab file already, so you would only need sentencepiece available - no need to tokenize the dataset again. Let me know if you face any issues running it.

warning - large file due to the model size

@lucidrains (Owner)

@cutoken thanks! i'll allot some time this Sunday to do some training

@kaiokendev

Hello, it is really cool you added this feature! One thing I would mention is that, in my case, the intuition is that a large model may overfit to the position embedding, such that it is easier to train on the interpolated positions than on OOD positions. The counterpoint is that small models may not be overfit in the same way - I see Meta only trained on 7B parameters and up, so it's possible the effect decreases for smaller models. There was also no ablation performed for non-LLaMA RoPE models, so it is unknown how much it depends on other factors as well. Just a thought

@cutoken (Author) commented Jul 2, 2023

@kaiokendev, that sounds like a good explanation of why I'm not seeing the same results with a smaller model (50M params).

@lucidrains (Owner)

@kaiokendev oh interesting; you can probably run a few experiments to back up your idea, and share it on twitter

would be an important caveat that should be noted in their paper!

@lucidrains (Owner)

@kaiokendev there's actually a number of papers popping up here and there that try to reduce overfitting of the positional embeddings

the two i've seen are (1) randomly offset the positions by some constant, and (2) within a range from 0 to a length L, where L > the maximum number of tokens, use a random subset of positions from that range, in ascending order
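Both tricks can be expressed as a small transform on the position ids before they are handed to the positional embedding; the following is a rough illustration (names and details are mine, not from a specific paper or implementation):

```python
# Rough illustration of the two tricks described above; names and details
# are illustrative, not taken from a specific paper or implementation.
import torch

def random_offset_positions(seq_len, max_offset):
    # (1) shift every position by one random constant per sequence
    offset = torch.randint(0, max_offset + 1, (1,))
    return torch.arange(seq_len) + offset

def random_subset_positions(seq_len, L):
    # (2) draw `seq_len` distinct positions from [0, L) and keep them ascending
    assert L >= seq_len
    positions = torch.randperm(L)[:seq_len]
    return positions.sort().values

# e.g. train at 1024 tokens while exposing the model to positions up to 4096
pos_ids = random_subset_positions(1024, L = 4096)
```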

@lucidrains (Owner)

@kaiokendev have you seen https://arxiv.org/abs/2307.03170 ?

@kaiokendev

@kaiokendev have you seen https://arxiv.org/abs/2307.03170 ?

I did skim it, there are a lot of external memory approaches I saw, but I do not really play with the cases involving approximated attention like that one
