[BUG] Compatibility mode (CUDA < 8.0) not working with draft models #238

Closed
matatonic opened this issue Nov 14, 2024 · 4 comments · Fixed by #244
Labels
bug Something isn't working

Comments

@matatonic

OS

Linux

GPU Library

CUDA 12.x

Python version

3.11

Describe the bug

Compatibility mode is not enabled for draft models, so draft models can't be used on older GPUs.
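
(For context, "CUDA < 8.0" in the title presumably refers to compute capability rather than the CUDA toolkit version: FlashAttention 2 only supports compute capability 8.0 (Ampere) and newer, which is why a 2080/2070 needs the fallback attention path. A minimal sketch of that check; `needs_compat_mode` is a hypothetical helper, not code from this repo:)

```python
import torch

def needs_compat_mode(device_index: int = 0) -> bool:
    """Hypothetical helper: FlashAttention 2 needs compute capability >= 8.0 (Ampere)."""
    major, _minor = torch.cuda.get_device_capability(device_index)
    return major < 8

# On a 2080 / 2070 Ti (sm_75) this returns True, so flash-attn should be skipped
# for every model that gets loaded, including the draft model.
print(needs_compat_mode())
```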

Reproduction steps

Enable a draft model on an older GPU: loading works without the draft model and errors when the draft model is enabled.

Expected behavior

Compatibility mode should also be enabled for draft models.
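
In other words, whatever flag compatibility mode sets on the main model's config should be mirrored onto the draft model's config before the draft weights are loaded. A rough sketch of the idea, assuming ExLlamaV2's `no_flash_attn` config flag; the paths and the `compat_mode` variable are placeholders, not the project's actual code:

```python
from exllamav2 import ExLlamaV2Config

# Placeholder paths, for illustration only.
config = ExLlamaV2Config("/models/main-model")
draft_config = ExLlamaV2Config("/models/draft-model")

compat_mode = True  # e.g. detected because compute capability < 8.0

if compat_mode:
    config.no_flash_attn = True        # already happens for the main model
    draft_config.no_flash_attn = True  # missing today: the draft config never gets the flag
```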

Logs

Loading draft modules ╸                                         2%  1/51 -:--:--
Traceback (most recent call last):
  File "/usr/src/app/main.py", line 168, in <module>
    entrypoint()
  File "/usr/src/app/main.py", line 164, in entrypoint
    asyncio.run(entrypoint_async())
  File "/usr/local/lib/python3.11/asyncio/runners.py", line 190, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/base_events.py", line 654, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/usr/src/app/main.py", line 70, in entrypoint_async
    await model.load_model(
  File "/usr/src/app/common/model.py", line 101, in load_model
    async for _ in load_model_gen(model_path, **kwargs):
  File "/usr/src/app/common/model.py", line 80, in load_model_gen
    async for module, modules in load_status:
  File "/usr/src/app/backends/exllamav2/model.py", line 533, in load_gen
    async for value in iterate_in_threadpool(model_load_generator):
  File "/usr/src/app/common/concurrency.py", line 30, in iterate_in_threadpool
    yield await asyncio.to_thread(gen_next, generator)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/threads.py", line 25, in to_thread
    return await loop.run_in_executor(None, func_call)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/app/common/concurrency.py", line 20, in gen_next
    return next(generator)
           ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 57, in generator_context
    response = gen.send(request)
               ^^^^^^^^^^^^^^^^^
  File "/usr/src/app/backends/exllamav2/model.py", line 598, in load_model_sync
    for value in self.draft_model.load_autosplit_gen(
  File "/usr/local/lib/python3.11/site-packages/exllamav2/model.py", line 623, in load_autosplit_gen
    hidden_state = module.forward(hidden_state, cache = cache, attn_params = attn_params, past_len = past_len, loras = loras)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/exllamav2/attn.py", line 1125, in forward
    attn_output = attn_func(batch_size, q_len, q_states, k_states, v_states, attn_params, cfg)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/exllamav2/attn.py", line 929, in _attn_flash
    attn_output = flash_attn_func(
                  ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/flash_attn/flash_attn_interface.py", line 880, in flash_attn_func
    return FlashAttnFunc.apply(
           ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/torch/autograd/function.py", line 574, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/flash_attn/flash_attn_interface.py", line 546, in forward
    out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = _flash_attn_forward(
                                                                ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/flash_attn/flash_attn_interface.py", line 52, in _flash_attn_forward
    out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = flash_attn_cuda.fwd(
                                                                ^^^^^^^^^^^^^^^^^^^^
RuntimeError: FlashAttention only supports Ampere GPUs or newer.

Additional context

GPUs are a 2080 and a 2070 Ti; the main model loads fine in compatibility mode without a draft model.

Acknowledgements

  • I have looked for similar issues before submitting this one.
  • I have read the disclaimer, and this issue is related to a code bug. If I have a question, I will use the Discord server.
  • I understand that the developers have lives and my issue will be answered when possible.
  • I understand the developers of this program are human, and I will ask my questions politely.
matatonic added the bug label on Nov 14, 2024
@DocShotgun
Member

DocShotgun commented Nov 16, 2024

Can you test if this fixes it before I mark the PR as ready to merge? #244

I don't have a machine with hardware incompatible with flash attention 2 to test this, and I was unable to reproduce your error by simply uninstalling the flash attention package on my setup - torch was just throwing a warning.

EDIT: Was able to load up a Colab instance with a T4 and reproduce the issue / verify that the PR fixes it.

@matatonic
Author

matatonic commented Nov 16, 2024

Will do, I can test now

@matatonic
Author

Confirmed, #244 fixes the issue. Thanks @DocShotgun!

@bdashore3
Member

Thanks for testing, will merge when possible.
