
Refactored exl2 method to add LoRA, 8bit cache, and other features supported by exllama #729

Merged: 26 commits, Mar 13, 2024

Conversation

psych0v0yager (Contributor)

Refactored the exl2 function in exllamav2.py.

The new version offers the following benefits:

  1. Auto split support. You no longer need to split a large model across two GPUs manually; exllama will do it for you.
  2. 8-bit cache support. The 8-bit cache squeezes more context onto the same GPU.
  3. Additional exllamav2 options. Supports low_mem and fasttensors.
  4. num_experts no longer needs to be passed in; it is optional.
  5. Future support for the 4-bit cache. Whenever turbo updates the pip package, uncomment the 4-bit lines to enable it.
  6. Refactored the function parameters. The model_kwargs dictionary was replaced with individual parameters; combined with documentation, this makes it easier for new users to understand which options they can select (see the sketch after this list).
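
As a rough sketch, loading a model with the refactored signature might look like the snippet below. Only model_path, max_seq_len, device, gpu_split, and verbose are taken from the LoRA example further down; cache_8bit, low_mem, and fasttensors are assumed spellings of the new options and may differ from the merged code.

from outlines import models

# Minimal sketch, not the exact merged signature.
model = models.exl2(
    model_path="/path/to/model",  # hypothetical local path to an EXL2 quant
    max_seq_len=8192,
    device="cuda",
    gpu_split="auto",             # let exllamav2 auto-split the weights across GPUs
    cache_8bit=True,              # assumed flag enabling the 8-bit cache
    low_mem=False,                # assumed exllamav2 low_mem option
    fasttensors=True,             # assumed fasttensors loading option
    verbose=True,
)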

Future effort.

  1. Will look into replacing the Hugging Face tokenizer with the ExLlamaV2 tokenizer. Currently I am unsure what benefits it will provide, but it is worth a shot.

rlouf (Member) commented on Mar 6, 2024

Great! Is this ready for review?

psych0v0yager (Contributor, Author) commented on Mar 7, 2024

Updates:

Added LoRA support. LoRAs can now be hot-swapped dynamically as needed.

Here is an example of how to use the LoRA feature:

from outlines import models, generate

# Load the base model; "auto" lets exllamav2 split the weights across GPUs.
model = models.exl2(
    model_path="/path/to/mistral_openorca",
    max_seq_len=8192,
    device="cuda",
    gpu_split="auto",
    verbose=True,
)

# Generate with the base model. The prompt is Russian for "Why is the grass green?"
generator = generate.text(model)
answer = generator("Почему трава зеленая?", max_tokens=100)
print(answer)

# Hot-swap a Russian LoRA adapter onto the loaded model.
model.update_lora("/path/to/russian_openorca")

generator = generate.text(model)
answer = generator("Почему трава зеленая?", max_tokens=100)
print(answer)

# Passing None unloads the adapter and restores the base model.
model.update_lora(None)

generator = generate.text(model)
answer = generator("Почему трава зеленая?", max_tokens=100)
print(answer)

This is a demonstration showing the new loading/unloading capabilities.

The following model and adapter were used in this demo:

Model:
https://huggingface.co/Open-Orca/Mistral-7B-OpenOrca/tree/d0d05321894845b388ce6ea85321b2e3a59aaf5f
(must use safetensors)

Adapter:
https://huggingface.co/IlyaGusev/saiga_mistral_7b_lora

psych0v0yager changed the title to "Refactored exl2 method to add LoRA, 8bit cache, and other features supported by exllama" on Mar 7, 2024
rlouf (Member) commented on Mar 7, 2024

This is really awesome! Let me know when I can review

psych0v0yager (Contributor, Author)

Yeah, the inputs were messed up. I put the try/except block inside update_lora. It may be a little slower to start, but subsequent loads still report 0.0 s.

The latest update will fix the error.
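
For illustration only, a lazy try/except inside update_lora might look roughly like the sketch below. This is not the PR's actual code; the attribute names and the use of ExLlamaV2Lora.from_directory are assumptions about the exllamav2 API and the wrapper class.

from typing import Optional

def update_lora(self, lora_path: Optional[str] = None):
    """Hypothetical sketch: load, hot-swap, or unload a LoRA adapter."""
    if lora_path is None:
        # Unload the adapter and fall back to the base weights.
        self.lora = None
        return
    try:
        # Checked here so the dependency is only exercised when a LoRA is used.
        from exllamav2 import ExLlamaV2Lora
    except ImportError:
        raise ImportError("The `exllamav2` library is required to load LoRA adapters.")
    # The first call reads the adapter from disk; later swaps are reported as ~0.0 s.
    self.lora = ExLlamaV2Lora.from_directory(self.model, lora_path)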

psych0v0yager (Contributor, Author)

Please feel free to review this branch. It is ready

[Review comments on outlines/models/exllamav2.py, now resolved]
psych0v0yager (Contributor, Author)

I can make all these changes and push ASAP.

psych0v0yager (Contributor, Author)

I pushed the changes and updated my branch. There are some issues between my implementation and the regex changes pushed last week.

A week ago this code ran without errors:

import outlines
from outlines import models

model = models.exl2(
    model_path="./miqu_GPTQ",
    max_seq_len=25000,
    device="cuda",
    gpu_split="auto",
    verbose=True,
)

prompt = """You are a sentiment-labelling assistant.
Is the following review positive or negative?

Review: This restaurant is just awesome!
"""

generator = outlines.generate.choice(model, ["Positive", "Negative"])
answer = generator(prompt)

However, I now receive the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[2], line 7
      1 prompt = """You are a sentiment-labelling assistant.
      2 Is the following review positive or negative?
      3 
      4 Review: This restaurant is just awesome!
      5 """
----> 7 generator = outlines.generate.choice(model, ["Positive", "Negative"])
      8 answer = generator(prompt)

File [...]/functools.py:909, in singledispatch.<locals>.wrapper(*args, **kw)
    905 if not args:
    906     raise TypeError(f'{funcname} requires at least '
    907                     '1 positional argument')
--> 909 return dispatch(args[0].__class__)(*args, **kw)

File [...]/outlines/generate/choice.py:17, in choice(model, choices, sampler)
     11 @singledispatch
     12 def choice(
     13     model, choices: List[str], sampler: Sampler = multinomial()
     14 ) -> SequenceGenerator:
     15     regex_str = r"(" + r"|".join(choices) + r")"
---> 17     generator = regex(model, regex_str, sampler)
     18     generator.format_sequence = lambda x: x
     20     return generator

File [...]/functools.py:909, in singledispatch.<locals>.wrapper(*args, **kw)
    905 if not args:
    906     raise TypeError(f'{funcname} requires at least '
    907                     '1 positional argument')
--> 909 return dispatch(args[0].__class__)(*args, **kw)

File [...]/outlines/generate/regex.py:35, in regex(model, regex_str, sampler)
     14 @singledispatch
     15 def regex(model, regex_str: str, sampler: Sampler = multinomial()):
     16     """Generate structured text in the language of a regular expression.
     17 
     18     Parameters
   (...)
     33 
     34     """
---> 35     fsm = RegexGuide(regex_str, model.tokenizer)
     37     device = model.device
     38     generator = SequenceGenerator(fsm, model, sampler, device)

File [...]/outlines/fsm/guide.py:132, in RegexGuide.__init__(self, regex_string, tokenizer)
    126         raise ValueError(
    127             "The vocabulary does not allow us to build a sequence that matches the input regex"
    128         )
    130     return states_to_token_maps, empty_token_ids, regex_fsm.finals
--> 132 (
    133     self.states_to_token_maps,
    134     self.empty_token_ids,
    135     fsm_finals,
    136 ) = create_states_mapping(
    137     regex_string, tuple(sorted(tokenizer.vocabulary.items()))
    138 )
    139 self.vocabulary = list(tokenizer.vocabulary.values())
    140 self.eos_token_id = tokenizer.eos_token_id

ValueError: not enough values to unpack (expected 3, got 2)

rlouf (Member) commented on Mar 12, 2024

Can you clear the cache using outlines.caching.clear_cache() and try again?
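
For reference, clearing the cache before retrying is a one-liner using the clear_cache function named above:

import outlines.caching

# Remove outlines' cached artifacts so that state mappings computed by an
# older version are not reused after upgrading.
outlines.caching.clear_cache()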

rlouf (Member) commented on Mar 12, 2024

I don't have this problem locally, so it must be the cache. I still need to try the LoRA hot-swapping functionality; I will take a look tomorrow and hopefully merge this.

rlouf merged commit 03c71f7 into dottxt-ai:main on Mar 13, 2024. 5 checks passed.
rlouf (Member) commented on Mar 13, 2024

Great work, thank you!
