Increased context length with NTK Rope Scaling #158

Open
juanps90 opened this issue Jul 16, 2023 · 13 comments

Comments

@juanps90

I am having bad quality results with prompts longer than 2048 tokens with a LoRA trained with alpaca_lora_4bit.

These are the settings I am using:

config = ExLlamaConfig(model_config_path)               # create config from config.json
config.model_path = model_path                          # supply path to model weights file
config.alpha_value = 2
config.max_seq_len = 4096

config.gpu_peer_fix = True
config.set_auto_map("10,24")

I tried higher values of alpha_value and max_seq_len, and also adjusted temperature and similar sampling settings, but it still fails. Short sequences work fine with the LoRA under this configuration, so it seems to be an issue with the extended context; longer sequences just output garbage.

@Panchovix
Contributor

With static NTK RoPE scaling you shouldn't have any issues above 2048 and up to ~3400 context; I use it like that on 65B at least.

As for a LoRA specifically, I'm not sure. I've been using it on base models and on an NTK-finetuned model (https://huggingface.co/bhenrym14/airoboros-33b-gpt4-1.4.1-NTK-16384-GPTQ).

Based on some tests, at least, it should look like this:

[Chart "ppls ntkv2": perplexity vs. context length for linear (compress_pos_emb) and NTK (alpha_value) scaling]

Note that the Linear value is inverted on exllama (it is expressed as embedding compression, compress_pos_emb).
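
For reference, here is a minimal sketch of how the two knobs map onto ExLlamaConfig, using the attribute names that appear elsewhere in this thread (compress_pos_emb for linear/PI scaling, alpha_value for NTK); the specific values are just illustrative:

config = ExLlamaConfig(model_config_path)   # create config from config.json
config.model_path = model_path              # supply path to model weights file

# Linear (positional interpolation) scaling: exllama takes the factor as
# embedding compression, so a finetune trained with 4x linear scaling
# (2048 -> 8192) is loaded with compress_pos_emb = 4 (i.e. 1/4 in the
# chart's "Linear" terms).
config.compress_pos_emb = 4
config.max_seq_len = 8192

# NTK scaling is set through alpha_value instead (leave compress_pos_emb at 1).
# config.alpha_value = 2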

@juanps90
Author

juanps90 commented Jul 16, 2023

I am using Neko-Institute-of-Science_LLaMA-30B-4bit-128g with no context scaling training at all. As I understand, NTK RoPE Scaling does not require any finetuning at all, unlike SuperHOT.

Am I setting the NTK RoPE parameters correctly?

Update: Loading the LoRA with this model and switching from alpha_value to compress_pos_emb works a LOT better.

@EyeDeck
Contributor

EyeDeck commented Jul 16, 2023

I think you need to call config.calculate_rotary_embedding_base() with the current way RoPE NTK scaling is implemented for the settings to properly take effect. Make sure config.alpha_value is already set when you do.
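
Concretely, a corrected version of the snippet from the first post would look something like this (a sketch; the key point is that calculate_rotary_embedding_base() runs after alpha_value is set):

config = ExLlamaConfig(model_config_path)   # create config from config.json
config.model_path = model_path              # supply path to model weights file

config.alpha_value = 2                      # NTK RoPE scaling factor
config.max_seq_len = 4096
config.calculate_rotary_embedding_base()    # recompute the RoPE base from alpha_value

config.gpu_peer_fix = True
config.set_auto_map("10,24")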

@juanps90
Author

> I think you need to call config.calculate_rotary_embedding_base() with the current way RoPE NTK scaling is implemented for the settings to properly take effect. Make sure config.alpha_value is already set when you do.

Thanks a lot! Works wonders with the stock 30B LLaMA model!

@juanps90
Author

juanps90 commented Jul 16, 2023

I'm having a weird issue where it just skips or adds digits to numbers. For example, if there's a phone number in the prompt, the generated text may add another digit to it, or maybe skip one of the digits.

It's also displaying, for example, $1.6280 when it should display $1.628.

Has anyone noticed this? The generated text looks solid but the numbers seem to be garbled.

Single- or double-digit numbers seem fine.

@juanps90 juanps90 reopened this Jul 16, 2023
@EyeDeck
Contributor

EyeDeck commented Jul 17, 2023

I've seen that effect while running a linear-scaled LoRA (SuperHOT or Airoboros 8k or 16k) with the wrong compress_pos_emb value. If it's set to anything other than what it was trained on (typically 4 for 8k or 8 for 16k) it causes brain damage, which is usually fairly subtle except when numbers are involved, and then it almost always screws them up. Haven't seen that happen with NTK scaling though.
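
For example, a SuperHOT-style 8k linear finetune of LLaMA v1 (native 2048) would be loaded with the factor it was trained on, something like this (a sketch; the exact value depends on the finetune):

config = ExLlamaConfig(model_config_path)
config.model_path = model_path

# Linear-scaled finetunes must use the same factor they were trained on:
# typically 4 for an 8k finetune (2048 * 4) or 8 for a 16k one.
config.compress_pos_emb = 4
config.max_seq_len = 8192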

@juanps90
Author

> I've seen that effect while running a linear-scaled LoRA (SuperHOT or Airoboros 8k or 16k) with the wrong compress_pos_emb value. If it's set to anything other than what it was trained on (typically 4 for 8k or 8 for 16k) it causes brain damage, which is usually fairly subtle except when numbers are involved, and then it almost always screws them up. Haven't seen that happen with NTK scaling though.

Thank you. I am using Neko LLaMA 30B with a LoRA trained on it. Using only alpha_value and no compress_pos_emb.

The results with NTK appear to be much better than PI, though it's having issues with numbers. Will try a different model and check the code once I'm back home.

@juanps90
Author

juanps90 commented Jul 20, 2023

Well, LLaMA v2 13B GPTQ from The-Bloke goes NUTS after I do:

config = ExLlamaConfig(model_config_path)               # create config from config.json
config.model_path = model_path                          # supply path to model weights file

config.alpha_value = 2
config.compress_pos_emb = 1
config.max_seq_len = 8192
config.calculate_rotary_embedding_base()

If alpha_value = 1 and max_seq_len = 4096 (model's native length), the outputs are perfect with the LoRA applied.

@EyeDeck
Contributor

EyeDeck commented Jul 20, 2023

NTKv1 alpha=2 won't get you 2x context; try something like alpha=2.6. I just picked that number arbitrarily and verified that it works. There's almost certainly a more optimal value >2 and <2.6, but you'd have to find it by trial and error.
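
For context, this is roughly what NTKv1 ("alpha") scaling does under the hood; a sketch of the standard formula, assuming head_dim = 128 and a default RoPE base of 10000 (not exllama's exact code):

# NTKv1 scaling increases the RoPE frequency base rather than compressing
# positions, so the usable context doesn't grow linearly with alpha.
head_dim = 128
base = 10000.0

def ntk_base(alpha: float) -> float:
    # base' = base * alpha ** (d / (d - 2))
    return base * alpha ** (head_dim / (head_dim - 2))

print(ntk_base(2.0))   # ~20221
print(ntk_base(2.6))   # ~26397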

NTK-by-parts (which the Transformers devs had proposed shortening to NTKv2, though that might now mean dynamic NTK-by-parts, which is what Transformers ultimately implemented) is supposed to correct this, so that a scaling value of 2 = 2x, 4 = 4x, and so on. Anecdotally, I tried implementing it (the non-dynamic version) in ExLlama; the code ran and perplexity tested a little lower, but actual generation was definitely a little off, with weird token repetition and other issues I never managed to debug.
Turboderp is working on it though, and I'm much more confident in his ability than mine. #174

@juanps90
Author

juanps90 commented Jul 20, 2023

I understand that alpha=2 should still allow for 4.5k or 5k of context (which it was failing to do), right? Also, what is the relationship between alpha_value and max_seq_len? Can you just set max_seq_len=8192 with alpha_value=2.6 or 2.7, or any other number, just like that?

I was under the impression (probably getting confused with cpe) that alpha_value * native_max_seq_len = max_seq_len (even if it goes off the rails with fewer tokens than max_seq_len), but from your message it seems a value like 2.6 will also work, just with an effective maximum context length that's less than max_seq_len?

@EyeDeck
Contributor

EyeDeck commented Jul 20, 2023

> I was under the impression (probably getting confused with cpe) that alpha_value * native_max_seq_len = max_seq_len

See the chart in #158 (comment): compress_pos_emb corresponds to the Linear lines, except the value is inverted (1/n), and alpha_value corresponds to the NTK lines; probably multiply everything by 2 for LLaMA v2.
Also, I'm pretty sure that chart compared the same SuperHOT 8k finetune (or it might have been the 13B 16k one) for all the "Linear" lines against a regular LLaMA v1 2k model with NTK scaling applied for the NTK lines. That isn't a fair comparison for two reasons: one, linear-scaled finetunes only work properly with the same linear value they were finetuned on, despite what perplexity metrics say; and two, it's possible to finetune for NTK too.

Anyway, I've only tested this specific quant (LLaMA 2 13B, no finetune, 4-bit 128g act-order), and with alpha_value=2 it seems to be good until ~6800 tokens, then starts devolving into incoherence and eventually noise. Not sure what's up if alpha_value=2 doesn't even get you to 4.5k.

Also, not sure if you've trained new LoRAs or not, but keep in mind that LoRAs made for LLaMA v1 aren't compatible with v2, since it's a complete retrain; at best they'll do nothing, at worst they'll cause brain damage.

Yes, you can run with whatever numbers you want. max_seq_len just controls some memory allocation and when to start throwing out old context, while alpha_value controls how far you can go before the model goes nuts.
So e.g. alpha_value=2, max_seq_len=2048 is exactly the same as alpha_value=2, max_seq_len=4096 between 0 and 2048 tokens; after that, max_seq_len=2048 starts truncating the oldest tokens, while max_seq_len=4096 waits twice as long before truncating, and so on.
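
In other words (a sketch reusing the config attributes from earlier in the thread), these two setups behave identically for the first 2048 tokens and only differ in how much context is kept before old tokens get dropped:

# Same NTK scaling, different context windows.
config_a = ExLlamaConfig(model_config_path)
config_a.model_path = model_path
config_a.alpha_value = 2
config_a.max_seq_len = 2048                 # truncates oldest tokens after 2048
config_a.calculate_rotary_embedding_base()

config_b = ExLlamaConfig(model_config_path)
config_b.model_path = model_path
config_b.alpha_value = 2
config_b.max_seq_len = 4096                 # keeps going up to 4096 before truncating
config_b.calculate_rotary_embedding_base()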

@juanps90
Author

Thank you for your reply. Yes, the LoRA is freshly trained on v2 and works great up to 4k.

Have you tried using a LoRA with NTK and exllama?

@EyeDeck
Contributor

EyeDeck commented Jul 21, 2023

Well, I just tried loading this LoRA (first LLaMA 2 LoRA I could find on HF) on top of this quant, using an alpha value of 6 and max_seq_len of 16384. Then I gave it the first 9 pages of The Hobbit (11899 tokens) and let it go for 6800 tokens, up to a total token count of 18699, where of course towards the end the first few thousand tokens had fallen off. Here's the output (with some linebreaks and a === inserted after generation, to separate the original text). I can't vouch for the quality of the text, but it's definitely coherent.
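
For anyone who wants to try something similar, here's a rough sketch of that setup; the module and class names for LoRA loading (lora.ExLlamaLora, generator.lora) are taken from exllama's LoRA example and are assumptions here, so check them against your checkout:

from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator
from lora import ExLlamaLora                # assumed, as in exllama's example_lora.py

config = ExLlamaConfig(model_config_path)   # create config from config.json
config.model_path = model_path              # supply path to model weights file
config.alpha_value = 6                      # NTK alpha used in the test above
config.max_seq_len = 16384
config.calculate_rotary_embedding_base()

model = ExLlama(config)
tokenizer = ExLlamaTokenizer(tokenizer_path)
cache = ExLlamaCache(model)
generator = ExLlamaGenerator(model, tokenizer, cache)

lora = ExLlamaLora(model, lora_config_path, lora_path)   # apply the LLaMA 2 LoRA
generator.lora = lora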
