OS
Linux
GPU Library
CUDA 12.x
Python version
3.11
Describe the bug
Compatibility mode is not enabled for draft models, so draft models can't be used on older GPUs.
Reproduction steps
Enable a draft model on an older GPU: loading works without the draft model and errors when the draft model is enabled.
Expected behavior
Compatibility mode should also be enabled for draft models.
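For illustration, here is a minimal sketch of what that could look like (my own assumption, not TabbyAPI's actual code): mirror the main model's flash-attn fallback onto the draft model config. `no_flash_attn` is exllamav2's existing fallback flag and `torch.cuda.get_device_capability()` is the standard way to detect pre-Ampere cards; the model path and the `supports_flash_attn` helper are hypothetical.

```python
# Sketch only: apply the same flash-attn fallback to the draft config
# that the main model already gets. FlashAttention 2 requires SM 8.0
# (Ampere) or newer, so older cards should set exllamav2's no_flash_attn flag.
import torch
from exllamav2 import ExLlamaV2Config

def supports_flash_attn() -> bool:
    # True only if every visible GPU is Ampere (compute capability 8.0) or newer
    return all(
        torch.cuda.get_device_capability(i) >= (8, 0)
        for i in range(torch.cuda.device_count())
    )

draft_config = ExLlamaV2Config()
draft_config.model_dir = "/models/draft-model"  # hypothetical path
draft_config.prepare()

# Mirror the compatibility fallback onto the draft config so that
# load_autosplit_gen() never reaches the flash_attn code path on Turing cards.
if not supports_flash_attn():
    draft_config.no_flash_attn = True
```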
Logs
Loading draft modules ╸ 2% 1/51 -:--:--
Traceback (most recent call last):
File "/usr/src/app/main.py", line 168, in <module>
entrypoint()
File "/usr/src/app/main.py", line 164, in entrypoint
asyncio.run(entrypoint_async())
File "/usr/local/lib/python3.11/asyncio/runners.py", line 190, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/asyncio/base_events.py", line 654, in run_until_complete
return future.result()
^^^^^^^^^^^^^^^
File "/usr/src/app/main.py", line 70, in entrypoint_async
await model.load_model(
File "/usr/src/app/common/model.py", line 101, in load_model
async for _ in load_model_gen(model_path, **kwargs):
File "/usr/src/app/common/model.py", line 80, in load_model_gen
async for module, modules in load_status:
File "/usr/src/app/backends/exllamav2/model.py", line 533, in load_gen
async for value in iterate_in_threadpool(model_load_generator):
File "/usr/src/app/common/concurrency.py", line 30, in iterate_in_threadpool
yield await asyncio.to_thread(gen_next, generator)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/asyncio/threads.py", line 25, in to_thread
return await loop.run_in_executor(None, func_call)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/src/app/common/concurrency.py", line 20, in gen_next
return next(generator)
^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 57, in generator_context
response = gen.send(request)
^^^^^^^^^^^^^^^^^
File "/usr/src/app/backends/exllamav2/model.py", line 598, in load_model_sync
for value in self.draft_model.load_autosplit_gen(
File "/usr/local/lib/python3.11/site-packages/exllamav2/model.py", line 623, in load_autosplit_gen
hidden_state = module.forward(hidden_state, cache = cache, attn_params = attn_params, past_len = past_len, loras = loras)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/exllamav2/attn.py", line 1125, in forward
attn_output = attn_func(batch_size, q_len, q_states, k_states, v_states, attn_params, cfg)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/exllamav2/attn.py", line 929, in _attn_flash
attn_output = flash_attn_func(
^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/flash_attn/flash_attn_interface.py", line 880, in flash_attn_func
return FlashAttnFunc.apply(
^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/torch/autograd/function.py", line 574, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/flash_attn/flash_attn_interface.py", line 546, in forward
out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = _flash_attn_forward(
^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/flash_attn/flash_attn_interface.py", line 52, in _flash_attn_forward
out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = flash_attn_cuda.fwd(
^^^^^^^^^^^^^^^^^^^^
RuntimeError: FlashAttention only supports Ampere GPUs or newer.
Additional context
The GPUs are a 2080 and a 2070 Ti; the main model loads fine in compatibility mode without a draft model.
Acknowledgements
I have looked for similar issues before submitting this one.
I have read the disclaimer, and this issue is related to a code bug. If I have a question, I will use the Discord server.
I understand that the developers have lives and my issue will be answered when possible.
I understand the developers of this program are human, and I will ask my questions politely.
Can you test if this fixes it before I mark the PR as ready to merge? #244
I don't have a machine with hardware incompatible with flash attention 2 to test this, and I was unable to reproduce your error by simply uninstalling the flash-attn package on my setup; torch was just throwing a warning.
EDIT: I was able to load up a Colab instance with a T4, reproduce the issue, and verify that the PR fixes this.
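For anyone who wants to confirm their card hits this code path before testing the PR: flash-attn 2 requires compute capability 8.0 (Ampere) or newer, and the T4, 2070 Ti, and 2080 are all SM 7.5. A quick check using only standard PyTorch calls:

```python
# Print each visible GPU and whether flash-attn 2 supports it (needs SM 8.0+).
import torch

for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    name = torch.cuda.get_device_name(i)
    status = "flash-attn OK" if (major, minor) >= (8, 0) else "needs compatibility mode"
    print(f"{name}: SM {major}.{minor} -> {status}")
```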