qwen coder32b run on colab t4 #682

Open · werruww opened this issue Nov 23, 2024 · 10 comments
Labels: bug (Something isn't working)

Comments

werruww commented Nov 23, 2024

OS

Linux

GPU Library

CUDA 12.x

Python version

3.10

Pytorch version

xxxxxxxxxxx

Model

turboderp/Mistral-7B-instruct-exl2

Describe the bug

Warning: Flash Attention is installed but unsupported GPUs were detected

Reproduction steps

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Expected behavior

Does not run.

Logs

!cd exllamav2 && python examples/chat.py -m ../my_model -mode llama -pt -ncf -ngram
Loading exllamav2_ext extension (JIT)...
Building C++/CUDA extension ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:13:45 0:00:00

Warning: Flash Attention is installed but unsupported GPUs were detected.

-- Model: ../my_model
-- Options: []
-- Loading tokenizer...
-- Loading model...
-- Loading model...
-- Prompt format: llama
-- System prompt:

You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

User: welcome

Traceback (most recent call last):
File "/content/exllamav2/examples/chat.py", line 310, in
generator.begin_stream_ex(active_context, settings)
File "/content/exllamav2/exllamav2/generator/streaming.py", line 363, in begin_stream_ex
self._gen_begin_reuse(input_ids, gen_settings)
File "/content/exllamav2/exllamav2/generator/streaming.py", line 731, in _gen_begin_reuse
self._gen_begin(in_tokens, gen_settings)
File "/content/exllamav2/exllamav2/generator/streaming.py", line 692, in _gen_begin
self.model.forward(self.sequence_ids[:, :-1],
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/content/exllamav2/exllamav2/model.py", line 898, in forward
r = self.forward_chunk(
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/content/exllamav2/exllamav2/model.py", line 1004, in forward_chunk
x = module.forward(x, cache = cache, attn_params = attn_params, past_len = past_len, loras = loras, **kwargs)
File "/content/exllamav2/exllamav2/attn.py", line 1125, in forward
attn_output = attn_func(batch_size, q_len, q_states, k_states, v_states, attn_params, cfg)
File "/content/exllamav2/exllamav2/attn.py", line 929, in _attn_flash
attn_output = flash_attn_func(
File "/usr/local/lib/python3.10/dist-packages/flash_attn/flash_attn_interface.py", line 1163, in flash_attn_func
return FlashAttnFunc.apply(
File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 575, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/usr/local/lib/python3.10/dist-packages/flash_attn/flash_attn_interface.py", line 810, in forward
out_padded, softmax_lse, S_dmask, rng_state = _wrapped_flash_attn_forward(
File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 1116, in call
return self._op(*args, **(kwargs or {}))
File "/usr/local/lib/python3.10/dist-packages/torch/_library/custom_ops.py", line 324, in backend_impl
result = self._backend_fns[device_type](*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/_compile.py", line 32, in inner
return disable_fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/eval_frame.py", line 632, in _fn
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/_library/custom_ops.py", line 367, in wrapped_fn
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/flash_attn/flash_attn_interface.py", line 91, in _flash_attn_forward
out, softmax_lse, S_dmask, rng_state = flash_attn_cuda.fwd(
RuntimeError: FlashAttention only supports Ampere GPUs or newer.

Additional context

No response

Acknowledgements

  • I have looked for similar issues before submitting this one.
  • I understand that the developers have lives and my issue will be answered when possible.
  • I understand the developers of this program are human, and I will ask my questions politely.
werruww added the bug label Nov 23, 2024

werruww commented Nov 23, 2024

The problem was solved after uninstalling flash-attn.
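
For reference, a minimal version of that fix, assuming flash-attn was installed from PyPI into the Colab runtime (the package name and the -y flag are the standard pip invocation, not something quoted in this thread):

# Assumption: flash-attn came from PyPI; removing it makes exllamav2 fall back to
# its other attention paths instead of calling flash_attn_func as in the traceback above.
!pip uninstall -y flash-attn

The chat.py usage output quoted further down also lists an -nfa flag, which looks like a switch for disabling flash attention without uninstalling the package, though that reading of the flag is an assumption.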

werruww commented Nov 23, 2024

!mkdir my_model2
!huggingface-cli download bartowski/Qwen2.5-Coder-32B-Instruct-exl2 --revision 2_2 --local-dir my_model2

!cd exllamav2 && python examples/chat.py -m ../my_model2 -mode llama -pt -ncf -ngram

-- Model: ../my_model2
-- Options: []
-- Loading tokenizer...
-- Loading model...
-- Loading model...
Traceback (most recent call last):
File "/content/exllamav2/examples/chat.py", line 174, in <module>
cache = cache_type(model, lazy = not model.loaded)
File "/content/exllamav2/exllamav2/cache.py", line 256, in __init__
self.create_state_tensors(copy_from, lazy)
File "/content/exllamav2/exllamav2/cache.py", line 92, in create_state_tensors
p_value_states = torch.zeros(self.shape_wv, dtype = self.dtype, device = device).contiguous()
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 MiB. GPU 0 has a total capacity of 14.75 GiB of which 23.06 MiB is free. Process 166535 has 14.72 GiB memory in use. Of the allocated memory 14.58 GiB is allocated by PyTorch, and 11.14 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
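
For context on why this OOM happens even with 2.2 bpw weights, here is a rough back-of-the-envelope estimate of the FP16 KV cache that chat.py tries to allocate. The numbers are assumptions taken from Qwen2.5-32B's published shape (64 layers, 8 KV heads, head dim 128) and a 32k context, not values read from this log:

# Rough KV-cache size estimate (all figures below are assumptions, not from the log).
layers, kv_heads, head_dim = 64, 8, 128      # assumed Qwen2.5-32B shape
ctx_len = 32768                              # assumed context length
bytes_fp16 = 2
per_token = 2 * layers * kv_heads * head_dim * bytes_fp16    # K and V tensors
print(per_token / 2**10, "KiB per token")                    # 256.0 KiB
print(per_token * ctx_len / 2**30, "GiB at full context")    # 8.0 GiB in FP16

An FP16 cache on that order, on top of roughly 9 GiB of 2.2 bpw weights, does not fit in the T4's 14.75 GiB, which is consistent with the fix in the next comments: quantizing the cache.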

werruww commented Nov 23, 2024

Does not run on Colab T4 with bartowski/Qwen2.5-Coder-32B-Instruct-exl2.

werruww commented Nov 23, 2024

bartowski/Qwen2.5-Coder-32B-Instruct-exl2
The problem was solved by adding -cq4:

!cd exllamav2 && python examples/chat.py -m ../my_model2 -mode llama -pt -ncf -ngram -cq4
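
The same thing can be done through the exllamav2 Python API by picking a quantized cache class instead of the default FP16 one. This is a minimal sketch, assuming a recent exllamav2 release that exposes ExLlamaV2Cache_Q4 and load_autosplit, and reusing the ../my_model2 path from the commands above:

# Minimal sketch (not from the thread): load the 2.2 bpw quant with a Q4 KV cache.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_Q4, ExLlamaV2Tokenizer

config = ExLlamaV2Config("../my_model2")       # same directory as the chat.py commands
model = ExLlamaV2(config)
cache = ExLlamaV2Cache_Q4(model, lazy = True)  # quantized cache, allocated during load
model.load_autosplit(cache)                    # fills cache tensors as weights are loaded
tokenizer = ExLlamaV2Tokenizer(config)

chat.py does essentially this when -cq4 is passed: the cache = cache_type(model, lazy = not model.loaded) line in the traceback above shows it instantiating whichever cache class the flag selects.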

werruww commented Nov 23, 2024

qwencoder32bex12

werruww commented Nov 23, 2024

werruww changed the title from "not run on colab t4" to "qwen coder32b run on colab t4" Nov 23, 2024

werruww commented Nov 23, 2024

usage: chat.py [-h] [-dm DRAFT_MODEL_DIR] [-nds] [-dn DRAFT_N_TOKENS] [-modes]
[-mode {raw,llama,llama3,codellama,chatml,tinyllama,zephyr,deepseek,solar,openchat,nous,gemma,cohere,phi3,granite}]
[-un USERNAME] [-bn BOTNAME] [-sp SYSTEM_PROMPT] [-temp TEMPERATURE]
[-smooth SMOOTHING_FACTOR] [-dyntemp DYNAMIC_TEMPERATURE] [-topk TOP_K]
[-topp TOP_P] [-topa TOP_A] [-skew SKEW] [-typical TYPICAL]
[-repp REPETITION_PENALTY] [-freqpen FREQUENCY_PENALTY] [-prespen PRESENCE_PENALTY]
[-xtcp XTC_PROBABILITY] [-xtct XTC_THRESHOLD] [-drym DRY_MULTIPLIER]
[-drya DRY_ALLOWED_LENGTH] [-dryb DRY_BASE] [-dryr DRY_RANGE]
[-maxr MAX_RESPONSE_TOKENS] [-resc RESPONSE_CHUNK] [-ncf] [-c8] [-cq4] [-cq6]
[-cq8] [-ngram] [-pt] [-amnesia] [-m MODEL_DIR] [-gs GPU_SPLIT] [-tp] [-l LENGTH]
[-rs ROPE_SCALE] [-ra ROPE_ALPHA] [-ry ROPE_YARN] [-nfa] [-nxf] [-nsdpa] [-ng]
[-lm] [-ept EXPERTS_PER_TOKEN] [-lq4] [-fst] [-ic] [-chunk CHUNK_SIZE]

werruww commented Nov 23, 2024

Qwen Coder 32B runs on Colab T4 with -c8:

!cd exllamav2 && python examples/chat.py -m ../my_model2 -mode llama -pt -ncf -ngram -c8
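
To check how much headroom is left once the model and the 8-bit cache are loaded, free and total VRAM can be queried from the same runtime. torch.cuda.mem_get_info is a standard PyTorch call; the only assumption here is that device 0 is the T4:

import torch

# Returns (free, total) device memory in bytes; device 0 is assumed to be the T4.
free_b, total_b = torch.cuda.mem_get_info(0)
print(f"free {free_b / 2**30:.2f} GiB of {total_b / 2**30:.2f} GiB")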

werruww commented Nov 23, 2024

c8

DocShotgun (Contributor) commented:

The T4 is too old of a GPU to use Flash Attention 2.
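
That matches the traceback: FlashAttention 2 requires an Ampere-class GPU (compute capability 8.0) or newer, and the T4 is a Turing part at compute capability 7.5. A quick check from the Colab runtime, using only standard PyTorch calls:

import torch

# Compute capability as (major, minor); FlashAttention 2 needs (8, 0) or higher.
name = torch.cuda.get_device_name(0)
cc = torch.cuda.get_device_capability(0)
print(name, f"sm_{cc[0]}{cc[1]}", "FA2 supported" if cc >= (8, 0) else "FA2 unsupported")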
