Increased context length with NTK Rope Scaling #158
Above 2048 context you shouldn't have any issues up to ~3400 context with static NTK RoPE scaling; I use it that way on 65B, at least. As for a LoRA itself, I'm not sure. I have been using it on base models and on an NTK-finetuned model (https://huggingface.co/bhenrym14/airoboros-33b-gpt4-1.4.1-NTK-16384-GPTQ). Based on some tests, at least, it should behave like this. Note that linear scaling has the value inverted on exllama (embedding compression).
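For reference, here is a minimal sketch (plain Python/NumPy, not exllama's actual code) of the difference between the two approaches: linear scaling (compress_pos_emb) divides the position indices, while NTK scaling (alpha_value) enlarges the RoPE base using the commonly cited exponent dim/(dim-2). Parameter names here are illustrative, not the library's API.

```python
import numpy as np

def rope_angles(positions, head_dim=128, base=10000.0,
                compress_pos_emb=1.0, alpha=1.0):
    """Compute RoPE rotation angles for the given positions.

    compress_pos_emb: linear (PI) scaling -- positions are divided by this
                      factor (the "inverted"/compression view exllama uses).
    alpha:            NTK scaling -- the base is multiplied by
                      alpha ** (head_dim / (head_dim - 2)).
    """
    scaled_base = base * alpha ** (head_dim / (head_dim - 2))
    inv_freq = 1.0 / (scaled_base ** (np.arange(0, head_dim, 2) / head_dim))
    scaled_positions = np.asarray(positions, dtype=np.float64) / compress_pos_emb
    return np.outer(scaled_positions, inv_freq)

# Linear scaling for a 4x-trained LoRA: compress_pos_emb=4, alpha=1.
# Static NTK on a stock model:          compress_pos_emb=1, alpha=2.6 (for example).
```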
I am using Neko-Institute-of-Science_LLaMA-30B-4bit-128g with no context-scaling training at all. As I understand it, NTK RoPE scaling does not require any finetuning, unlike SuperHOT. Am I setting the NTK RoPE parameters correctly? Update: Loading the LoRA with this model and switching from alpha_value to compress_pos_emb works a LOT better.
I think you need to call
Thanks a lot! Works wonders with the stock 30B LLaMA model!
I'm having a weird issue where it just skips or adds digits to numbers. For example, if there's a phone number in the prompt, the generated text may add another digit to it, or skip one of the digits. It's also displaying, for example, $1.6280 when it should display $1.628. Has anyone noticed this? The generated text looks solid, but the numbers seem to be garbled. Single or double digits seem fine.
I've seen that effect while running a linear-scaled LoRA (SuperHOT or Airoboros 8k or 16k) with the wrong compress_pos_emb value. If it's set to anything other than what it was trained on (typically 4 for 8k or 8 for 16k), it causes brain damage, which is usually fairly subtle except when numbers are involved, and then it almost always screws them up. I haven't seen that happen with NTK scaling, though.
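As a quick sanity check on those numbers (my own arithmetic, not taken from the LoRA cards), the linear factor is just the trained context divided by the native 2048:

```python
def compress_factor(trained_ctx, native_ctx=2048):
    """Linear (PI) compression factor a scaled LoRA was trained with."""
    return trained_ctx // native_ctx

assert compress_factor(8192) == 4    # SuperHOT / Airoboros 8k
assert compress_factor(16384) == 8   # 16k variants
```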
Thank you. I am using Neko LLaMA 30B with a LoRA trained on it, using only alpha_value and no compress_pos_emb. The results with NTK appear to be much better than PI, though it's having issues with numbers. I will try a different model and check the code once I'm back home.
Well, LLaMA v2 13B GPTQ from The-Bloke goes NUTS after I do:
If alpha_value = 1 and max_seq_len = 4096 (the model's native length), the outputs are perfect with the LoRA applied.
NTKv1 alpha=2 won't get you 2x context; try something like alpha=2.6. I just picked that number arbitrarily and tested that it works; there's almost certainly a more optimal value >2 and <2.6, but you'd have to trial-and-error it. NTK-by-parts (which the Transformers devs had proposed to shorthand as NTKv2, though that might now mean dynamic NTK-by-parts, which is what Transformers ultimately implemented) is supposed to correct this, so that a scaling value of 2 = 2x, 4 = 4x, and so on. Anecdotally, I tried implementing it (the non-dynamic version) in ExLlama, and while the code ran and perplexity tested a little lower, actual generation was definitely a little off: weird token repetition and other issues that I never managed to debug.
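Since the usable context for a given NTKv1 alpha has to be found empirically, the trial-and-error can be wrapped in a small sweep. Rough sketch only; `measure_ppl` is a hypothetical stand-in for whatever perplexity harness you already use, and the candidate list and threshold are arbitrary:

```python
def find_alpha(target_ctx, measure_ppl,
               candidates=(2.0, 2.2, 2.4, 2.6, 3.0), max_ppl=6.0):
    """Return the smallest alpha whose perplexity at target_ctx stays acceptable.

    measure_ppl(alpha, ctx_len) is assumed to reload/evaluate the model with
    static NTK alpha scaling and return perplexity on held-out text at ctx_len.
    """
    for alpha in candidates:
        ppl = measure_ppl(alpha, target_ctx)
        if ppl <= max_ppl:
            return alpha, ppl
    return None

# Example with a dummy evaluator (swap in a real harness):
# find_alpha(4096, measure_ppl=lambda a, n: 5.9 if a >= 2.6 else 7.5)
```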
I understand that alpha=2 should still allow for 4.5k or 5k token length (which it was failing to do), right? Also, I wonder what the relationship between alpha_value and max_seq_len is. Can you just set max_seq_len=8192 with alpha_value=2.6 or alpha_value=2.7, or any other number, just like that? I was under the impression (probably getting confused with cpe) that alpha_value * native_max_seq_len = max_seq_len (even if it would go off the rails with fewer tokens than max_seq_len), but it seems from your message that it will work even with a value like 2.6, just with a maximum usable context length that's less than max_seq_len?
See the chart in #158 (comment). Anyway, I've only tested this specific quant (LLaMA 2 13B, no finetune, 4-bit 128g act-order), and with Also, not sure if you've got new LoRAs or not, but keep in mind that LoRAs for LLaMA v1 aren't compatible with v2, since it's a complete retrain; at best they'll do nothing, at worst they'll cause brain damage. Yes, you can run with whatever numbers you want,
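To make the "whatever numbers you want" point concrete: as I understand it, max_seq_len and alpha_value are independent knobs (the first only controls how far the cache/prompt is allowed to grow, the second only changes the RoPE base), so nothing forces alpha * 2048 == max_seq_len. A toy sketch with a hypothetical settings object that just mirrors the field names used in this thread, not the real loader:

```python
from dataclasses import dataclass

@dataclass
class RopeSettings:           # hypothetical, for illustration only
    max_seq_len: int = 2048   # cache / truncation length only
    alpha_value: float = 1.0  # static NTK base multiplier only
    head_dim: int = 128

    def rope_base(self, base=10000.0):
        return base * self.alpha_value ** (self.head_dim / (self.head_dim - 2))

a = RopeSettings(max_seq_len=8192, alpha_value=2.6)
b = RopeSettings(max_seq_len=4096, alpha_value=2.6)
assert a.rope_base() == b.rope_base()  # max_seq_len never touches the RoPE base
```

The catch is simply that with alpha=2.6 the model degrades somewhere short of 8192, so a large max_seq_len just means you can run past the point where quality falls off.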
Thank you for your reply. Yes, the LoRA is freshly trained on v2 and works great up to 4k. Have you tried using a LoRA with NTK and exllama?
Well, I just tried loading this LoRA (the first LLaMA 2 LoRA I could find on HF) on top of this quant, using an alpha value of 6 and
I am getting bad-quality results with prompts longer than 2048 tokens when using a LoRA trained with alpaca_lora_4bit.
These are the settings I am using:
I tried with higher values of alpha_value and max_seq_len, as well as temperature and similar settings, but it still fails. It doesn't fail with shorter sequences, so it seems to be an issue with the extended context. Using this configuration, short sequences work fine with the LoRA, but longer sequences just output garbage.