qwen coder32b run on colab t4 #682

Open · werruww opened this issue Nov 23, 2024 · 10 comments
Labels: bug (Something isn't working)

Comments

werruww commented Nov 23, 2024

OS

Linux

GPU Library

CUDA 12.x

Python version

3.10

Pytorch version

xxxxxxxxxxx

Model

turboderp/Mistral-7B-instruct-exl2

Describe the bug

Warning: Flash Attention is installed but unsupported GPUs were detected

Reproduction steps

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Expected behavior

Does not run.

Logs

!cd exllamav2 && python examples/chat.py -m ../my_model -mode llama -pt -ncf -ngram
Loading exllamav2_ext extension (JIT)...
Building C++/CUDA extension ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:13:45 0:00:00

Warning: Flash Attention is installed but unsupported GPUs were detected.

-- Model: ../my_model
-- Options: []
-- Loading tokenizer...
-- Loading model...
-- Loading model...
-- Prompt format: llama
-- System prompt:

You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

User: welcome

Traceback (most recent call last):
File "/content/exllamav2/examples/chat.py", line 310, in
generator.begin_stream_ex(active_context, settings)
File "/content/exllamav2/exllamav2/generator/streaming.py", line 363, in begin_stream_ex
self._gen_begin_reuse(input_ids, gen_settings)
File "/content/exllamav2/exllamav2/generator/streaming.py", line 731, in _gen_begin_reuse
self._gen_begin(in_tokens, gen_settings)
File "/content/exllamav2/exllamav2/generator/streaming.py", line 692, in _gen_begin
self.model.forward(self.sequence_ids[:, :-1],
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/content/exllamav2/exllamav2/model.py", line 898, in forward
r = self.forward_chunk(
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/content/exllamav2/exllamav2/model.py", line 1004, in forward_chunk
x = module.forward(x, cache = cache, attn_params = attn_params, past_len = past_len, loras = loras, **kwargs)
File "/content/exllamav2/exllamav2/attn.py", line 1125, in forward
attn_output = attn_func(batch_size, q_len, q_states, k_states, v_states, attn_params, cfg)
File "/content/exllamav2/exllamav2/attn.py", line 929, in _attn_flash
attn_output = flash_attn_func(
File "/usr/local/lib/python3.10/dist-packages/flash_attn/flash_attn_interface.py", line 1163, in flash_attn_func
return FlashAttnFunc.apply(
File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 575, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/usr/local/lib/python3.10/dist-packages/flash_attn/flash_attn_interface.py", line 810, in forward
out_padded, softmax_lse, S_dmask, rng_state = _wrapped_flash_attn_forward(
File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 1116, in call
return self._op(*args, **(kwargs or {}))
File "/usr/local/lib/python3.10/dist-packages/torch/_library/custom_ops.py", line 324, in backend_impl
result = self._backend_fns[device_type](*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/_compile.py", line 32, in inner
return disable_fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/eval_frame.py", line 632, in _fn
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/_library/custom_ops.py", line 367, in wrapped_fn
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/flash_attn/flash_attn_interface.py", line 91, in _flash_attn_forward
out, softmax_lse, S_dmask, rng_state = flash_attn_cuda.fwd(
RuntimeError: FlashAttention only supports Ampere GPUs or newer.

Additional context

No response

Acknowledgements

  • I have looked for similar issues before submitting this one.
  • I understand that the developers have lives and my issue will be answered when possible.
  • I understand the developers of this program are human, and I will ask my questions politely.
werruww added the bug label Nov 23, 2024

werruww commented Nov 23, 2024

The problem was solved after uninstalling flash-attn.
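
For reference, a minimal version of that fix, assuming flash-attn was installed from PyPI into the Colab runtime (the package name and the -y flag are the standard pip invocation, not something quoted in this thread):

# Assumption: flash-attn came from PyPI; removing it makes exllamav2 fall back to
# its other attention paths instead of calling flash_attn_func as in the traceback above.
!pip uninstall -y flash-attn

The chat.py usage output quoted further down also lists an -nfa flag, which looks like a switch for disabling flash attention without uninstalling the package, though that reading of the flag is an assumption.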

werruww commented Nov 23, 2024

!mkdir my_model2
!huggingface-cli download bartowski/Qwen2.5-Coder-32B-Instruct-exl2 --revision 2_2 --local-dir my_model2

!cd exllamav2 && python examples/chat.py -m ../my_model2 -mode llama -pt -ncf -ngram

-- Model: ../my_model2
-- Options: []
-- Loading tokenizer...
-- Loading model...
-- Loading model...
Traceback (most recent call last):
File "/content/exllamav2/examples/chat.py", line 174, in <module>
cache = cache_type(model, lazy = not model.loaded)
File "/content/exllamav2/exllamav2/cache.py", line 256, in __init__
self.create_state_tensors(copy_from, lazy)
File "/content/exllamav2/exllamav2/cache.py", line 92, in create_state_tensors
p_value_states = torch.zeros(self.shape_wv, dtype = self.dtype, device = device).contiguous()
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 MiB. GPU 0 has a total capacity of 14.75 GiB of which 23.06 MiB is free. Process 166535 has 14.72 GiB memory in use. Of the allocated memory 14.58 GiB is allocated by PyTorch, and 11.14 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
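
For context on why this OOM happens even with 2.2 bpw weights, here is a rough back-of-the-envelope estimate of the FP16 KV cache that chat.py tries to allocate. The numbers are assumptions taken from Qwen2.5-32B's published shape (64 layers, 8 KV heads, head dim 128) and a 32k context, not values read from this log:

# Rough KV-cache size estimate (all figures below are assumptions, not from the log).
layers, kv_heads, head_dim = 64, 8, 128      # assumed Qwen2.5-32B shape
ctx_len = 32768                              # assumed context length
bytes_fp16 = 2
per_token = 2 * layers * kv_heads * head_dim * bytes_fp16    # K and V tensors
print(per_token / 2**10, "KiB per token")                    # 256.0 KiB
print(per_token * ctx_len / 2**30, "GiB at full context")    # 8.0 GiB in FP16

An FP16 cache on that order, on top of roughly 9 GiB of 2.2 bpw weights, does not fit in the T4's 14.75 GiB, which is consistent with the fix in the next comments: quantizing the cache.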

werruww commented Nov 23, 2024

Does not run on Colab T4 with bartowski/Qwen2.5-Coder-32B-Instruct-exl2.

werruww commented Nov 23, 2024

bartowski/Qwen2.5-Coder-32B-Instruct-exl2
The problem was solved by adding -cq4:

!cd exllamav2 && python examples/chat.py -m ../my_model2 -mode llama -pt -ncf -ngram -cq4
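
The same thing can be done through the exllamav2 Python API by picking a quantized cache class instead of the default FP16 one. This is a minimal sketch, assuming a recent exllamav2 release that exposes ExLlamaV2Cache_Q4 and load_autosplit, and reusing the ../my_model2 path from the commands above:

# Minimal sketch (not from the thread): load the 2.2 bpw quant with a Q4 KV cache.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_Q4, ExLlamaV2Tokenizer

config = ExLlamaV2Config("../my_model2")       # same directory as the chat.py commands
model = ExLlamaV2(config)
cache = ExLlamaV2Cache_Q4(model, lazy = True)  # quantized cache, allocated during load
model.load_autosplit(cache)                    # fills cache tensors as weights are loaded
tokenizer = ExLlamaV2Tokenizer(config)

chat.py does essentially this when -cq4 is passed: the cache = cache_type(model, lazy = not model.loaded) line in the traceback above shows it instantiating whichever cache class the flag selects.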

werruww commented Nov 23, 2024

qwencoder32bex12

werruww commented Nov 23, 2024

werruww changed the title from "not run on colab t4" to "qwen coder32b run on colab t4" Nov 23, 2024

werruww commented Nov 23, 2024

usage: chat.py [-h] [-dm DRAFT_MODEL_DIR] [-nds] [-dn DRAFT_N_TOKENS] [-modes]
[-mode {raw,llama,llama3,codellama,chatml,tinyllama,zephyr,deepseek,solar,openchat,nous,gemma,cohere,phi3,granite}]
[-un USERNAME] [-bn BOTNAME] [-sp SYSTEM_PROMPT] [-temp TEMPERATURE]
[-smooth SMOOTHING_FACTOR] [-dyntemp DYNAMIC_TEMPERATURE] [-topk TOP_K]
[-topp TOP_P] [-topa TOP_A] [-skew SKEW] [-typical TYPICAL]
[-repp REPETITION_PENALTY] [-freqpen FREQUENCY_PENALTY] [-prespen PRESENCE_PENALTY]
[-xtcp XTC_PROBABILITY] [-xtct XTC_THRESHOLD] [-drym DRY_MULTIPLIER]
[-drya DRY_ALLOWED_LENGTH] [-dryb DRY_BASE] [-dryr DRY_RANGE]
[-maxr MAX_RESPONSE_TOKENS] [-resc RESPONSE_CHUNK] [-ncf] [-c8] [-cq4] [-cq6]
[-cq8] [-ngram] [-pt] [-amnesia] [-m MODEL_DIR] [-gs GPU_SPLIT] [-tp] [-l LENGTH]
[-rs ROPE_SCALE] [-ra ROPE_ALPHA] [-ry ROPE_YARN] [-nfa] [-nxf] [-nsdpa] [-ng]
[-lm] [-ept EXPERTS_PER_TOKEN] [-lq4] [-fst] [-ic] [-chunk CHUNK_SIZE]

werruww commented Nov 23, 2024

Qwen Coder 32B runs on Colab T4 with -c8:

!cd exllamav2 && python examples/chat.py -m ../my_model2 -mode llama -pt -ncf -ngram -c8
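
To check how much headroom is left once the model and the 8-bit cache are loaded, free and total VRAM can be queried from the same runtime. torch.cuda.mem_get_info is a standard PyTorch call; the only assumption here is that device 0 is the T4:

import torch

# Returns (free, total) device memory in bytes; device 0 is assumed to be the T4.
free_b, total_b = torch.cuda.mem_get_info(0)
print(f"free {free_b / 2**30:.2f} GiB of {total_b / 2**30:.2f} GiB")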

werruww commented Nov 23, 2024

c8

DocShotgun (Contributor) commented:

The T4 is too old of a GPU to use Flash Attention 2.
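
That matches the traceback: FlashAttention 2 requires an Ampere-class GPU (compute capability 8.0) or newer, and the T4 is a Turing part at compute capability 7.5. A quick check from the Colab runtime, using only standard PyTorch calls:

import torch

# Compute capability as (major, minor); FlashAttention 2 needs (8, 0) or higher.
name = torch.cuda.get_device_name(0)
cc = torch.cuda.get_device_capability(0)
print(name, f"sm_{cc[0]}{cc[1]}", "FA2 supported" if cc >= (8, 0) else "FA2 unsupported")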
