OS
Linux
GPU Library
CUDA 12.x
Python version
3.11
Describe the bug
Compatibility mode is not enabled for draft models, so draft models can't be used on older GPUs.
Reproduction steps
Enable a draft model on an older GPU: loading works without the draft model and errors when the draft model is enabled.
Expected behavior
Compatibility mode should also be enabled for draft models.
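For illustration, here is a minimal sketch of what that could look like (my own assumption, not TabbyAPI's actual code): mirror the main model's flash-attn fallback onto the draft model config. `no_flash_attn` is exllamav2's existing fallback flag and `torch.cuda.get_device_capability()` is the standard way to detect pre-Ampere cards; the model path and the `supports_flash_attn` helper are hypothetical.

```python
# Sketch only: apply the same flash-attn fallback to the draft config
# that the main model already gets. FlashAttention 2 requires SM 8.0
# (Ampere) or newer, so older cards should set exllamav2's no_flash_attn flag.
import torch
from exllamav2 import ExLlamaV2Config

def supports_flash_attn() -> bool:
    # True only if every visible GPU is Ampere (compute capability 8.0) or newer
    return all(
        torch.cuda.get_device_capability(i) >= (8, 0)
        for i in range(torch.cuda.device_count())
    )

draft_config = ExLlamaV2Config()
draft_config.model_dir = "/models/draft-model"  # hypothetical path
draft_config.prepare()

# Mirror the compatibility fallback onto the draft config so that
# load_autosplit_gen() never reaches the flash_attn code path on Turing cards.
if not supports_flash_attn():
    draft_config.no_flash_attn = True
```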
Logs
Loading draft modules ╸ 2% 1/51 -:--:--
Traceback (most recent call last):
File "/usr/src/app/main.py", line 168, in <module>
entrypoint()
File "/usr/src/app/main.py", line 164, in entrypoint
asyncio.run(entrypoint_async())
File "/usr/local/lib/python3.11/asyncio/runners.py", line 190, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/asyncio/base_events.py", line 654, in run_until_complete
return future.result()
^^^^^^^^^^^^^^^
File "/usr/src/app/main.py", line 70, in entrypoint_async
await model.load_model(
File "/usr/src/app/common/model.py", line 101, in load_model
async for _ in load_model_gen(model_path, **kwargs):
File "/usr/src/app/common/model.py", line 80, in load_model_gen
async for module, modules in load_status:
File "/usr/src/app/backends/exllamav2/model.py", line 533, in load_gen
async for value in iterate_in_threadpool(model_load_generator):
File "/usr/src/app/common/concurrency.py", line 30, in iterate_in_threadpool
yield await asyncio.to_thread(gen_next, generator)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/asyncio/threads.py", line 25, in to_thread
return await loop.run_in_executor(None, func_call)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/src/app/common/concurrency.py", line 20, in gen_next
return next(generator)
^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 57, in generator_context
response = gen.send(request)
^^^^^^^^^^^^^^^^^
File "/usr/src/app/backends/exllamav2/model.py", line 598, in load_model_sync
for value in self.draft_model.load_autosplit_gen(
File "/usr/local/lib/python3.11/site-packages/exllamav2/model.py", line 623, in load_autosplit_gen
hidden_state = module.forward(hidden_state, cache = cache, attn_params = attn_params, past_len = past_len, loras = loras)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/exllamav2/attn.py", line 1125, in forward
attn_output = attn_func(batch_size, q_len, q_states, k_states, v_states, attn_params, cfg)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/exllamav2/attn.py", line 929, in _attn_flash
attn_output = flash_attn_func(
^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/flash_attn/flash_attn_interface.py", line 880, in flash_attn_func
return FlashAttnFunc.apply(
^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/torch/autograd/function.py", line 574, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/flash_attn/flash_attn_interface.py", line 546, in forward
out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = _flash_attn_forward(
^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/flash_attn/flash_attn_interface.py", line 52, in _flash_attn_forward
out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = flash_attn_cuda.fwd(
^^^^^^^^^^^^^^^^^^^^
RuntimeError: FlashAttention only supports Ampere GPUs or newer.
Additional context
The GPUs are a 2080 and a 2070 Ti; the main model loads fine in compatibility mode without a draft model.
Acknowledgements
I have looked for similar issues before submitting this one.
I have read the disclaimer, and this issue is related to a code bug. If I have a question, I will use the Discord server.
I understand that the developers have lives and my issue will be answered when possible.
I understand the developers of this program are human, and I will ask my questions politely.
Can you test if this fixes it before I mark the PR as ready to merge? #244
I don't have a machine with hardware incompatible with flash attention 2 to test this, and I was unable to reproduce your error by simply uninstalling the flash-attn package on my setup; torch was just throwing a warning.
EDIT: I was able to load up a Colab instance with a T4, reproduce the issue, and verify that the PR fixes this.
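For anyone who wants to confirm their card hits this code path before testing the PR: flash-attn 2 requires compute capability 8.0 (Ampere) or newer, and the T4, 2070 Ti, and 2080 are all SM 7.5. A quick check using only standard PyTorch calls:

```python
# Print each visible GPU and whether flash-attn 2 supports it (needs SM 8.0+).
import torch

for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    name = torch.cuda.get_device_name(i)
    status = "flash-attn OK" if (major, minor) >= (8, 0) else "needs compatibility mode"
    print(f"{name}: SM {major}.{minor} -> {status}")
```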