
Weird issue with context length #220

Open
zzzacwork opened this issue Aug 3, 2023 · 6 comments

@zzzacwork

First of all, thanks a lot for this great project!

I ran into a weird issue when generating with Llama 2 at a 4096-token context using generator.generate_simple:

  File "/codebase/research/exllama/model.py", line 556, in forward
    cache.key_states[self.index].narrow(2, past_len, q_len).narrow(0, 0, bsz)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: start (3313) + length (808) exceeds 

As I understand the code, it already limits the number of new tokens so the total stays under the context limit. Are there any settings I might need to change?
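For reference, here is a minimal sketch of the kind of clamp being described, using the ExLlama objects that appear later in this thread; the clamped_new_tokens helper is hypothetical, and it assumes tokenizer.encode returns the prompt's token ids as a tensor and that config.max_seq_len holds the context length:

    # Hypothetical helper: cap max_new_tokens so prompt + completion fit in max_seq_len.
    def clamped_new_tokens(prompt, tokenizer, config, requested):
        prompt_len = tokenizer.encode(prompt).shape[-1]   # number of prompt tokens
        return max(0, min(requested, config.max_seq_len - prompt_len))

    requested = 1000                                      # whatever the caller asks for
    max_new = clamped_new_tokens(prompt, tokenizer, config, requested)
    output = generator.generate_simple(prompt, max_new_tokens = max_new)

If config.max_seq_len were still at the 2048 default while the prompt is longer than that, the unguarded difference would be negative, which is the failure mode mentioned in the reply below.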

@turboderp
Owner

What is the sequence length set to in the model config? Maybe something weird is happening if you haven't changed it from the default (2048), and it tries to generate a negative number of tokens.
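A minimal sketch of that fix, using the loading pattern quoted later in this thread; it assumes max_seq_len defaults to 2048 and that the cache is sized from it:

    config = ExLlamaConfig(model_config_path)   # max_seq_len defaults to 2048 here
    config.max_seq_len = 4096                   # match Llama 2's 4096-token context
    model = ExLlama(config)
    cache = ExLlamaCache(model)                 # the cache is sized from the config, so set max_seq_len first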

@zzzacwork
Author

Thanks for the reply.

{
    "architectures": [
        "LlamaForCausalLM"
    ],
    "bos_token_id": 1,
    "eos_token_id": 2,
    "hidden_act": "silu",
    "hidden_size": 8192,
    "initializer_range": 0.02,
    "intermediate_size": 28672,
    "max_position_embeddings": 4096,
    "max_length": 4096,
    "model_type": "llama",
    "num_attention_heads": 64,
    "num_hidden_layers": 80,
    "num_key_value_heads": 8,
    "pad_token_id": 0,
    "pretraining_tp": 1,
    "rms_norm_eps": 1e-05,
    "rope_scaling": null,
    "tie_word_embeddings": false,
    "torch_dtype": "float16",
    "transformers_version": "4.32.0.dev0",
    "use_cache": true,
    "vocab_size": 32000
}

Here is the model config file; I got the model from Llama-2-70B-chat-gptq.

@turboderp
Owner

turboderp commented Aug 5, 2023

Is there more to this error message?

  File "/codebase/research/exllama/model.py", line 556, in forward
    cache.key_states[self.index].narrow(2, past_len, q_len).narrow(0, 0, bsz)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: start (3313) + length (808) exceeds 

It looks like it's been cut off.

Also, the line number is odd. Has something else been modified in model.py? ExLlamaAttention.forward ends on line 502.

@w013nad

w013nad commented Aug 11, 2023

I got a similar error. It seems to come from putting too many tokens into the model; I was feeding roughly 5k words in.

  File "C:\Users\ndurkee\.conda\envs\exllama\Lib\site-packages\flask\app.py", line 2190, in wsgi_app
    response = self.full_dispatch_request()
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\ndurkee\.conda\envs\exllama\Lib\site-packages\flask\app.py", line 1486, in full_dispatch_request
    rv = self.handle_user_exception(e)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\ndurkee\.conda\envs\exllama\Lib\site-packages\flask\app.py", line 1484, in full_dispatch_request
    rv = self.dispatch_request()
         ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\ndurkee\.conda\envs\exllama\Lib\site-packages\flask\app.py", line 1469, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "F:\Flashblade_ML_Data_Test\meeting_notes\exllama\example_flask.py", line 48, in inferContextP
    outputs = generator.generate_simple(prompt, max_new_tokens=16000)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "F:\Flashblade_ML_Data_Test\meeting_notes\exllama\generator.py", line 316, in generate_simple
    self.gen_begin(ids, mask = mask)
  File "F:\Flashblade_ML_Data_Test\meeting_notes\exllama\generator.py", line 186, in gen_begin
    self.model.forward(self.sequence[:, :-1], self.cache, preprocess_only = True, lora = self.lora, input_mask = mask)
  File "F:\Flashblade_ML_Data_Test\meeting_notes\exllama\model.py", line 967, in forward
    r = self._forward(input_ids[:, chunk_begin : chunk_end],
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "F:\Flashblade_ML_Data_Test\meeting_notes\exllama\model.py", line 1053, in _forward
    hidden_states = decoder_layer.forward(hidden_states, cache, buffers[device], lora)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "F:\Flashblade_ML_Data_Test\meeting_notes\exllama\model.py", line 536, in forward
    hidden_states = self.self_attn.forward(hidden_states, cache, buffer, lora)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "F:\Flashblade_ML_Data_Test\meeting_notes\exllama\model.py", line 440, in forward
    new_keys = cache.key_states[self.index].narrow(2, past_len, q_len).narrow(0, 0, bsz)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: start (2048) + length (1265) exceeds dimension size (2048).

I'm using the_bloke/vicuna-13B-v1.5-16K-GPTQ, which is supposed to be a 16k-context model, so it should be able to handle it. At any rate, these are the relevant portions of the config.json:

    "max_sequence_length": 16384,
    "max_position_embeddings": 4096,

What worked for me was changing the parameters on lines 82-87 of model.py:

        self.max_seq_len = 16384  # Reduce to save memory. Can also be increased, ideally while also using compress_pos_emb and a compatible model/LoRA
        self.max_input_len = 4096  # Maximum length of input IDs in a single forward pass. Sequences longer than this will be processed in multiple steps
        self.max_attention_size = 2048**2  # Sequences will be processed in chunks to keep the size of the attention weights matrix <= this
        self.compress_pos_emb = 4.0  # Increase to compress positional embeddings applied to sequence

Previously, these were 2048, 2048, 4096, and 1.0, respectively. This worked and seems to give reasonable results, but I'm not sure if it's the correct way to go about it.
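The same values can apparently be overridden on the ExLlamaConfig object instead of being hard-coded in model.py, as turboderp notes below; a sketch using the attribute names quoted above:

    config = ExLlamaConfig(model_config_path)
    config.max_seq_len = 16384            # 16k context for the scaled model
    config.compress_pos_emb = 4.0         # 16384 / 4096 base positions
    config.max_input_len = 2048           # chunk length per forward pass
    config.max_attention_size = 2048**2   # see the note on max_input_len vs. max_attention_size below
    model = ExLlama(config)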

@Rajmehta123

Rajmehta123 commented Sep 14, 2023

@w013nad Where do you make those changes? In the source code, or in the generator settings?

config = ExLlamaConfig(model_config_path)               # create config from config.json
config.model_path = model_path                          # supply path to model weights file

model = ExLlama(config)                                 # create ExLlama instance and load the weights
tokenizer = ExLlamaTokenizer(tokenizer_path)            # create tokenizer from tokenizer model file

cache = ExLlamaCache(model)                             # create cache for inference
generator = ExLlamaGenerator(model, tokenizer, cache)   # create generator

# Configure generator

generator.disallow_tokens([tokenizer.eos_token_id])

generator.settings.token_repetition_penalty_max = 1.2
generator.settings.temperature = 0.05
generator.settings.top_p = 0.65
generator.settings.top_k = 100
generator.settings.typical = 0.5
generator.settings.max_seq_len = 16000
# Produce a simple generation

output = generator.generate_simple(prompt_template, max_new_tokens = 500)

I am using the same model, but I'm getting the following error:

RuntimeError: start (2048) + length (1265) exceeds dimension size (2048).

@turboderp
Owner

@w013nad You wouldn't need to hard-code new values into the config class. You can just override the values after creating the config.

Also, it looks like that config file is incorrect. "max_sequence_length" and "max_position_embeddings" should mean the same thing, or at least I don't know how to interpret those values if they're different.

The max_input_len argument means specifically the longest sequence to allow during a forward pass. Longer sequences will be chunked into portions of this length to reduce VRAM usage during inference, and to make the VRAM requirement predictable which is sort of required when splitting the model across multiple devices. But max_attention_size imposes an additional restriction on the chunk length. In short, setting max_input_len > sqrt(max_attention_size) just wastes a bit of VRAM.
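A small worked example of that last point, with the numbers used earlier in this thread (the exact chunking rule is internal to model.py, so treat this as an approximation):

    max_attention_size = 2048 ** 2                           # 4,194,304 attention elements per chunk
    max_input_len = 4096                                     # requested chunk length from the earlier comment
    largest_useful_chunk = int(max_attention_size ** 0.5)    # 2048
    # The attention matrix for a chunk is roughly q_len * seq_len elements, and seq_len >= q_len,
    # so chunks never grow past sqrt(max_attention_size); a max_input_len of 4096 here is never
    # reached and only reserves extra VRAM for the larger input buffers.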

@Rajmehta123 The max_seq_len parameter is in the ExLlamaConfig object, not the generator settings.
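Applied to the snippet above, that would look something like this (a sketch; the compress_pos_emb value assumes the 16k Vicuna model discussed earlier):

    config = ExLlamaConfig(model_config_path)     # create config from config.json
    config.model_path = model_path                # supply path to model weights file
    config.max_seq_len = 16384                    # set on the config, not on generator.settings
    config.compress_pos_emb = 4.0                 # 16384 / 4096 base positions

    model = ExLlama(config)                       # cache and buffers pick up the new length from here
    tokenizer = ExLlamaTokenizer(tokenizer_path)
    cache = ExLlamaCache(model)
    generator = ExLlamaGenerator(model, tokenizer, cache)

    output = generator.generate_simple(prompt_template, max_new_tokens = 500)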
