How to use RoPE scaling with x_transformers? #159
Comments
@cutoken oh hey do you mean the interpolation of the positions? do you have a pretrained model with RoPE you are finetuning?
you are referring to lucidrains/rotary-embedding-torch@e7ce8e0?
@cutoken try using this setting 8fa7b4c#diff-2e64ac8840195d7dc3e07a3aac70b50bbab1cdf80f3a7432be40105e6097fc0aR896
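For anyone landing on this thread later, here is a minimal sketch of how that setting would be passed when fine-tuning at a longer context. It assumes the kwarg added in the referenced commit is named `rotary_interpolation_factor` and that the factor is the ratio of the new context length to the pretraining context length; treat it as an illustration rather than the repository's official usage.

```python
# Minimal sketch: enabling rotary position interpolation in x_transformers.
# Assumption: the kwarg from the referenced commit is `rotary_interpolation_factor`.
import torch
from x_transformers import TransformerWrapper, Decoder

model = TransformerWrapper(
    num_tokens = 256,
    max_seq_len = 2048,                      # extended context used for fine-tuning
    attn_layers = Decoder(
        dim = 512,
        depth = 8,
        heads = 8,
        rotary_pos_emb = True,               # use RoPE
        rotary_interpolation_factor = 4.     # e.g. pretrained at 512 -> 2048 / 512
    )
)

tokens = torch.randint(0, 256, (1, 2048))
logits = model(tokens)                       # (1, 2048, 256)
```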
Thank you @lucidrains for your quick help. To answer your other query: I have some small models trained on TinyStories with RoPE, which I wanted to use to reproduce the interpolation paper's results. So far no luck :)
It is actually quite weird. I'm able to reproduce the issue of loss increasing as the context increases, but the interpolation solution doesn't really work for me. Have you had success reproducing the paper's results on any smaller model?
@cutoken do you mean you tried it just now and it didn't work? did you follow their recipe of fine-tuning on 1k longer-context samples, with an interpolation factor matching the context extension?
@cutoken if i hear it doesn't work from a few people, i may just remove it in favor of xpos
Will confirm once again in a fresh experiment @lucidrains
share it with w&b!
Can confirm it is actually worse than directly fine-tuning without interpolation. I'm using a really small model, but I don't see why that should matter, as the interpolation factor follows the paper's guidelines. I can send you the weights and biases of the smaller pre-interpolation checkpoint so that you can also test if needed (please provide your email address in that case).
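For context on the "paper guidelines" mentioned above: position interpolation amounts to compressing positions by the context-extension ratio before computing the rotary angles, so the fine-tuned model never sees angles beyond those encountered during pretraining. A rough illustration (not the library's actual implementation):

```python
# Rough illustration of RoPE position interpolation: positions are scaled by
# original_len / extended_len so rotary angles stay within the pretraining range.
import torch

def interpolated_rotary_angles(seq_len, head_dim, original_len, extended_len, base = 10000.):
    scale = original_len / extended_len                      # e.g. 512 / 2048 = 0.25
    inv_freq = 1. / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(seq_len).float() * scale        # "interpolated" positions
    return torch.einsum('i,j->ij', positions, inv_freq)      # (seq_len, head_dim // 2)
```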
@cutoken that would be great! could you share the training script too? i think i'll go ahead and remove it from this repository until i hear more feedback (or see a paper that corroborates the technique)
@cutoken do you want to double check your experiments? i see someone legit corroborating the results from Meta https://kaiokendev.github.io/context (seems to be concurrent work)
@cutoken they also released a new long context model, LongChat, at 16k
I hope it works as well :) The link below contains the training script I used and the checkpoint with RoPE enabled (both checkpoints are the same; I just made a backup in case you end up overwriting it). You can use it as the starting point to train with and without the newly added parameter to see the difference. You will need the TinyStories dataset from Hugging Face. I have already added the sentencepiece vocab file, so you would only need sentencepiece available - no need to tokenize the dataset again. Let me know if you face any issues running it.
@cutoken thanks! i'll allot some time this Sunday to do some training
Hello, it is really cool you added this feature! One thing I would mention is that, in my case, the intuition is that the large model may overfit to the position embedding, such that it is easier to train on the interpolated positions than on OOD positions. The counterpoint is that small models may not be overfit in the same way - I see Meta only trained on 7B parameters and up, so it's possible the effect decreases for smaller models. There was also no ablation performed for non-LLaMA RoPE models, so it is unknown how much it depends on other factors as well. Just a thought
@kaiokendev, that sounds like a good explanation of why I'm not seeing the same results with a smaller model (50M params).
@kaiokendev oh interesting; you could probably run a few experiments to back up your idea and share it on twitter. it would be an important caveat that should be noted in their paper!
@kaiokendev there's actually a number of papers popping up here and there that try to reduce overfitting of the positional embeddings. the two i've seen are (1) randomly offset positions by some constant, and (2) within a range of 0 to a length L, where L > the maximum number of tokens, use a random ascending subset of positions from that range
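For readers unfamiliar with those two tricks, here is a rough sketch of what the position randomization looks like; this is illustrative only and not taken from any of those papers' code.

```python
# Sketch of the two position-randomization schemes described above.
import torch

def randomly_offset_positions(seq_len, max_offset):
    # (1) shift every position by one random constant per sequence
    offset = torch.randint(0, max_offset, (1,))
    return torch.arange(seq_len) + offset

def random_ascending_positions(seq_len, L):
    # (2) sample an ascending random subset of positions from [0, L), with L > seq_len
    assert L > seq_len
    return torch.randperm(L)[:seq_len].sort().values
```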
@kaiokendev have you seen https://arxiv.org/abs/2307.03170 ?
I did skim it; there are a lot of external memory approaches, but I do not really play with cases involving approximated attention like that one
Hi,
Thank you for this wonderful library. I'm wondering if there is a way to use the recent RoPE paper's scaling workaround with x_transformers. I have seen your recent change to the rotary position encoding repo, but wasn't able to identify where to make a similar modification in the x_transformers repo.