
Exllamav2 Integration #1010

Closed
wants to merge 33 commits

Conversation

isamu-isozaki
Contributor

@isamu-isozaki commented Jun 29, 2024

This fixes #1009
Also fixes #807

The tests I did were:

For loading:

from outlines.integrations.exllamav2 import RegexFilter, TextFilter, JSONFilter, ChoiceFilter
import json
import sys  # needed for sys.stdout.flush() in the streaming loops below
import torch
from exllamav2.generator.filters import ExLlamaV2PrefixFilter
from pydantic import BaseModel
from typing import Literal
from exllamav2 import (
    ExLlamaV2,
    ExLlamaV2Config,
    ExLlamaV2Cache,
    ExLlamaV2Cache_8bit,
    ExLlamaV2Cache_Q4,
    ExLlamaV2Tokenizer,
)

from exllamav2.generator import ExLlamaV2DynamicGenerator, ExLlamaV2Sampler, ExLlamaV2DynamicJob
from transformers import AutoTokenizer
import uuid

repo_id = "../Phi-3-mini-128k-instruct-exl2"
paged = False
model_dir = repo_id
total_context = 8192
max_context = 1024 
max_batch_size = 4 if paged else 1
max_chunk_size = 1024
max_new_tokens = 1024
healing = True
draft_model = None
draft_cache = None
use_ngram_draft = None
use_ngram = None

config = ExLlamaV2Config(model_dir)
config.max_input_len = max_chunk_size
config.max_attention_size = max_chunk_size ** 2

config.max_seq_len = max_context
model = ExLlamaV2(config)

cache = ExLlamaV2Cache_Q4(
    model,
    max_seq_len = total_context,
    lazy = True
)
tokenizer = ExLlamaV2Tokenizer(config)
hf_tokenizer_kwargs = {}
hf_tokenizer_kwargs.setdefault("padding_side", "left")
hf_tokenizer = AutoTokenizer.from_pretrained(model_dir, **hf_tokenizer_kwargs)
model.load_autosplit(cache, progress = True)
generator = ExLlamaV2DynamicGenerator(
    model = model,
    cache = cache,
    draft_model = draft_model,
    draft_cache = draft_cache,
    tokenizer = tokenizer,
    max_batch_size = max_batch_size,
    use_ngram_draft = use_ngram,
    max_chunk_size = max_chunk_size,
    paged = paged,
)

Choices test:

filters = [
    ChoiceFilter(["bob", "fred"], hf_tokenizer)
]
context_ids = torch.empty((1, 0), dtype = torch.long)


instruction = "Who is better bob or fred?"
print()
print("Assistant:", end = "")

instruction_ids = tokenizer.encode(f"[INST] {instruction} [/INST]", add_bos = True)
context_ids = torch.cat([context_ids, instruction_ids], dim = -1)

generator.enqueue(
    ExLlamaV2DynamicJob(
        input_ids = context_ids,
        max_new_tokens = 1024,
        stop_conditions = [],
        filters=filters
    )
)

eos = False
while not eos:
    results = generator.iterate()
    for result in results:
        if result["stage"] == "streaming":
            eos = result["eos"]
            if "text" in result:
                print(result["text"], end="")
                sys.stdout.flush()
            if "token_ids" in result:
                context_ids = torch.cat([context_ids, result["token_ids"]], dim = -1)

print()

Returns

Assistant:bob

JSON test

class JSONResponse(BaseModel):
    response: str
    confidence: Literal["low", "medium", "high"]
    is_subjective: Literal["no", "yes", "possibly"]
filters = [
    JSONFilter(JSONResponse, hf_tokenizer)
]
context_ids = torch.empty((1, 0), dtype = torch.long)


instruction = f"Give a sample response in the format of {JSONResponse.schema()} on a movie review of love actually"
print()
print("Assistant: ", end = "")

instruction_ids = tokenizer.encode(f"[INST] {instruction} [/INST]", add_bos = True)
context_ids = torch.cat([context_ids, instruction_ids], dim = -1)

generator.enqueue(
    ExLlamaV2DynamicJob(
        input_ids = context_ids,
        max_new_tokens = 1024,
        stop_conditions = [tokenizer.eos_token_id],
        filters=filters
    )
)

eos = False
while not eos:
    results = generator.iterate()
    for result in results:
        if result["stage"] == "streaming":
            eos = result["eos"]
            if "text" in result:
                print(result["text"], end="")
                sys.stdout.flush()
            if "token_ids" in result:
                context_ids = torch.cat([context_ids, result["token_ids"]], dim = -1)

print()

Returns

Assistant: {"response": "Love Actually is a charming and heartwarming romantic comedy that delivers a delightful experience. The performances by the lead actors, especially Drew Barrymore and Gael García Bernal, are genuinely commendable. The film beautifully blends humor with heart-tugging moments, making it an ideal watch for those in search of a feel-good cinematic experience. Despite some predictable plot trends, the overall impact of the film remains largely positive. Rating: 7/10", "confidence": "medium", "is_subjective": "no"}

@isamu-isozaki changed the title from "Exllamav2 filter" to "Exllamav2 Integration" Jun 29, 2024
@isamu-isozaki
Contributor Author

Some questions I had for the maintainers:

  1. Should we do the prefix logic here? I noticed that most exllamav2 filters in their repo ignore the prefix, but one of them uses it.
  2. Do we want to return the stop tokens? It requires checking every allowed token to see which one leads to a final state, which may be a bit slower.

Contributor

@lapp0 left a comment


Sorry for the late reply and request for refactor.

We've been moving towards using SequenceGeneratorAdapter and outlines.processors in outlines.generate. Currently the only local outlines.models backend which doesn't have a SequenceGeneratorAdapter-based implementation is exllamav2.

Would you be able to refactor this to use SequenceGeneratorAdapter instead?

This would involve

@isamu-isozaki
Contributor Author

@lapp0 makes sense! Let me try doing this tomorrow.

@lapp0
Contributor

lapp0 commented Jul 20, 2024

Thanks so much, please let me know if you have any questions!

@isamu-isozaki
Contributor Author

isamu-isozaki commented Jul 21, 2024

@lapp0 sorry for the delay! Two questions.
Background:
The current exllamav2 model in outlines (the ExllamaV2 class), as can be seen here, doesn't support filters (exllamav2's equivalent of a logits processor). Filters are mainly used in exllamav2's custom generators such as ExLlamaV2DynamicGenerator, ExLlamaV2DynamicGeneratorAsync, etc.
So my questions are:

  1. Do we want the logits_processor in SequenceGeneratorAdapter to be converted to an exllamav2 filter like in their library (similar to logits_processor in llamacpp)? In that case we might have to change the logic here to use one of their generators. Another option is using ExLlamaV2Sampler, but then we would be redoing the generator logic ourselves.
  2. This depends on the previous question, but do you have a recommended generator? I have mainly used ExLlamaV2DynamicGenerator, which is built for handling multiple asynchronous requests and responses; that is not necessarily what I think outlines is going for (one request, one response at a time), but it seems like the best-supported generator in exllamav2.

Sorry for the delayed response, and let me know if I'm going in the right direction!

@lapp0
Contributor

lapp0 commented Jul 21, 2024

Great questions!

Converting it to a filter is a bit hacky IMO, but may be the simplest solution and doesn't require an upstream change.

Alternatively, we could apply logits processing directly. The way exllamav2's library is structured makes this tricky: ExLlamaV2Sampler.sample() is a staticmethod, and gen_settings: ExLlamaV2Sampler.Settings is a generate(...) argument, however the sampler itself is not. I think the only clean way to handle this is an upstream PR:

  • Option 1: update ExLlamaV2Sampler.Settings to accept a logits_processor argument, and update def sample() to apply the settings.logits_processor if it exists.
  • Option 2: update ExLlamaV2DynamicGenerator.generate() to accept a sampler argument, defaulting to ExLlamaV2Sampler, allowing us to inject our own sampler class.

The first option makes more sense to me since it is generator-class agnostic; a rough sketch of what it could look like is below.
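For illustration, here is a minimal sketch of the shape Option 1 could take; the attribute name, the simplified sample() signature, and the call order are assumptions made for the example rather than ExLlamaV2's actual API.

```python
import torch
from typing import Callable, List, Optional

# A logits processor takes the generated token ids plus the raw logits and
# returns (possibly masked) logits.
LogitsProcessor = Callable[[List[List[int]], torch.Tensor], torch.Tensor]

class Settings:
    """Stand-in for ExLlamaV2Sampler.Settings with one extra, optional field."""
    def __init__(self, temperature: float = 1.0,
                 logits_processor: Optional[LogitsProcessor] = None):
        self.temperature = temperature
        self.logits_processor = logits_processor  # new: applied before sampling if set

def sample(logits: torch.Tensor, settings: Settings, input_ids: List[List[int]]) -> int:
    """Single-sequence sampling sketch: apply the processor, then sample."""
    if settings.logits_processor is not None:
        logits = settings.logits_processor(input_ids, logits)
    probs = torch.softmax(logits / settings.temperature, dim=-1)
    return int(torch.multinomial(probs, num_samples=1).item())
```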

This depends on the previous question, but do you have a recommended generator? I have mainly used ExLlamaV2DynamicGenerator, which is built for handling multiple asynchronous requests and responses; that is not necessarily what I think outlines is going for (one request, one response at a time), but it seems like the best-supported generator in exllamav2.

Tbh, I'm not sure how well outlines.models works with asynchronous structured generation. It is a reasonable use case though, and necessary for #655.

@isamu-isozaki
Contributor Author

isamu-isozaki commented Jul 22, 2024

@lapp0 Sounds good! I think I'll go with option 1. For this, I think the steps needed are:

  • Make a new class inheriting from OutlinesLogitsProcessor, similar to the structured one, but have it return only the next allowed tokens instead of a mask. I might even override the call since conversion to pytorch might not be necessary (similar to filters). Happy to talk about this further. The main issue is that this changes the logic of outlines processors to be closer to filters; the alternative is passing something like torch.ones as the logits and using torch.where on the mask.
  • To models/exllamav2, add a class ExllamaV2SamplerOutlines(ExLlamaV2Sampler) with an optional logits processor.
  • See if I can convert the current exllamav2 base model's forward logic etc. to ExLlamaV2DynamicGenerator.
  • Add it to the unified dispatcher.
  • Test code.

Let me know if it looks good. I'll try to finish this within a week or two.

@rlouf added this to the 0.1 milestone Jul 22, 2024
@lapp0
Contributor

lapp0 commented Jul 22, 2024

Rather than implementing a new logits processor, I'm awaiting correspondence with the ExLlamaV2 maintainer, turboderp, regarding whether a logits_processor argument would be acceptable within their sampler.

@isamu-isozaki
Contributor Author

isamu-isozaki commented Jul 22, 2024

@lapp0 interesting! The main reason I was thinking of a new logits processor is that we do some steps that are redundant with respect to exllamav2's code base. For them, they first

  1. get the pass tokens (the next allowed tokens) and apply that filter in CUDA code, like ext_c.logit_filter_exclusive(logit_filter, [sorted(list(pass_tokens))]), and
  2. only then are the logits finally computed in CUDA code,

while in our case, we start from the assumption that the logits have already been computed and then construct the mask etc.

So I thought some of the steps here overlap with our current logits processor. But I'm very much happy to get advice here, since this PR is just making the exllamav2 filter. And I'm also happy to hear what turboderp thinks.
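To make the comparison concrete, here is a minimal sketch of the mask-on-precomputed-logits approach mentioned above (a torch.where over a boolean allowed-token mask); the function name is made up for illustration and is not outlines or exllamav2 API.

```python
import torch

def allowed_tokens_to_bias(allowed_token_ids, vocab_size: int, device="cpu") -> torch.Tensor:
    """Build an additive bias: 0 for allowed tokens, -inf everywhere else."""
    allowed = torch.zeros(vocab_size, dtype=torch.bool, device=device)
    allowed[list(allowed_token_ids)] = True  # mark the next allowed tokens
    return torch.where(
        allowed,
        torch.tensor(0.0, device=device),
        torch.tensor(float("-inf"), device=device),
    )

# Applied after the logits have already been computed:
# biased_logits = logits + allowed_tokens_to_bias({5, 42, 1023}, logits.shape[-1], logits.device)
```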

@lapp0
Contributor

lapp0 commented Jul 23, 2024

So I thought some of the steps here overlap with our current logits processor. But I'm very much happy to get advice here, since this PR is just making the exllamav2 filter. And I'm also happy to hear what turboderp thinks.

Yes, they will have multiple methods of filtering, but given Outlines' single logits processor implementation, which is tested against all inference engines, it's likely better to follow the same pattern with ExLlamaV2. This will ensure bug fixes, optimizations, enhancements, and new features present in one integration are available to all integrations!

I spoke with turboderp on their discord server, he is open to having a logits_processor argument in ExLlamaV2Sampler.Settings.

Here's the steps I think we should take, let me know what you think:

    1. Update the outlines.generate.* generators so they use the default dispatcher for ExLlamaV2 (simply delete the ExLlamaV2 dispatcher in each outlines.generate module). This ensures the default path of SequenceGeneratorAdapter and outlines.processors is used.
    2. Implement a turboderp/exllamav2 fork with a logits_processor argument in ExLlamaV2Sampler.Settings which is applied in ExLlamaV2Sampler.sample() (let me know if you'd like to take this over, or if you'd like me to take a shot at it).
    3. Implement a new model, outlines.models.exllamav2, which is compatible with the fork.
    4. Test it against outlines.models.exllamav2 by adding an exllamav2 fixture to tests/generate/test_generate.py and running pytest -s tests/generate/test_generate.py -k exllamav2 (a rough sketch of such a fixture is included below).

Let me know if you think this is the right path.
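Regarding step 4, a fixture along these lines is roughly what I have in mind; the fixture name, the model repo, and the way test_generate.py picks up model fixtures are assumptions for illustration, not the test suite's actual layout.

```python
import pytest
import outlines

@pytest.fixture(scope="session")
def model_exllamav2():
    # Any small EXL2-quantized repo should do for a smoke test.
    return outlines.models.exl2(
        model_path="blockblockblock/TinyLlama-1.1B-Chat-v1.0-bpw4-exl2",
        cache_q4=True,
        paged=False,
    )
```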

Thanks so much for your great work on this PR. The users in the ExLlamaV2 discord were excited to hear about this PR!

@isamu-isozaki
Contributor Author

@lapp0 wow, I didn't know exllamav2 had a discord server! And that makes perfect sense.
If you can do step 2 that'll be awesome, since I was thinking about it and couldn't think of a clean way to do it at the moment.
For step 3, sounds good. I'll try converting it to the dynamic generator.

@lapp0
Contributor

lapp0 commented Jul 27, 2024

@isamu-isozaki can you please take a look at this changeset and the provided example json_schema_outlines.py?

lapp0/exllamav2#1

I believe it should provide a sufficient basis for implementing outlines.models.exllamav2.

Let me know if you see anything that should be changed in my implementation. If you have any questions, please do not hesitate! Good luck!

Edit: Also please add "Fixes #807" to the PR description.

@isamu-isozaki
Contributor Author

@lapp0 sounds good. And sorry, I got a bit sidetracked by some work. I'll try to get to this by the weekend at the latest. Sorry for the delay!

@isamu-isozaki
Contributor Author

Sorry for the delay. I finally got the exllamav2 fork built and was able to run the current PR's code with the script below, which worked!

import sys
sys.path.append("../outlines-dev")
import outlines

from enum import Enum
from pydantic import BaseModel, constr

model = outlines.models.exl2(
    model_path="turboderp/TinyLlama-1B-32k-exl2",
    cache_q4=True,
    paged=False
)

prompt = """You are a sentiment-labelling assistant.
Is the following review positive or negative?

Review: This restaurant is just awesome!
"""

generator = outlines.generate.choice(model, ["Positive", "Negative"])
answer = generator(prompt)
print(answer)

prompt = "<s>result of 9 + 9 = 18</s><s>result of 1 + 2 = "
answer = outlines.generate.format(model, int)(prompt, max_tokens=1)
print(answer)

generator = outlines.generate.format(model, float)
answer = generator(prompt, max_tokens=10)
print(answer)

generator = outlines.generate.text(model)
unstructured = generator(prompt, max_tokens=30)

generator = outlines.generate.regex(
    model,
    r"((25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)",
)
structured = generator(prompt, max_tokens=30)

print(unstructured)
# What is the IP address of the Google DNS servers?
#
# Passive DNS servers are at DNS servers that are private.
# In other words, both IP servers are private. The database
# does not contain Chelsea Manning

print(structured)

class Weapon(str, Enum):
    sword = "sword"
    axe = "axe"
    mace = "mace"
    spear = "spear"
    bow = "bow"
    crossbow = "crossbow"


class Armor(str, Enum):
    leather = "leather"
    chainmail = "chainmail"
    plate = "plate"


class Character(BaseModel):
    name: constr(max_length=10)
    age: int
    armor: Armor
    weapon: Weapon
    strength: int


# Construct structured sequence generator
generator = outlines.generate.json(model, Character)

# Draw a sample
seed = 789001

character = generator("Give me a character description", seed=seed)

print(repr(character))
# Character(name='Anderson', age=28, armor=<Armor.chainmail: 'chainmail'>, weapon=<Weapon.sword: 'sword'>, strength=8)

character = generator("Give me an interesting character description", seed=seed)

print(repr(character))
# Character(name='Vivian Thr', age=44, armor=<Armor.plate: 'plate'>, weapon=<Weapon.crossbow: 'crossbow'>, strength=125)

@isamu-isozaki
Contributor Author

The current main issue is that I can't seem to run the tests due to an error with pyairports. @lapp0 do you have any advice on how to fix this?

 pytest -s tests/generate/test_generate.py -k exllamav2
======================== test session starts =========================
platform linux -- Python 3.10.12, pytest-8.3.2, pluggy-1.5.0
rootdir: /mnt/d/personal_projects/whiterabbitneo-pentestgpt/outlines-dev
configfile: pyproject.toml
plugins: anyio-3.6.2
collected 0 items / 1 error

=============================== ERRORS ===============================
__________ ERROR collecting tests/generate/test_generate.py __________
tests/generate/test_generate.py:6: in <module>
    import outlines.generate as generate
outlines/__init__.py:6: in <module>
    import outlines.types
outlines/types/__init__.py:1: in <module>
    from . import airports, countries
outlines/types/airports.py:4: in <module>
    from pyairports.airports import AIRPORT_LIST
/home/isamu/miniconda3/lib/python3.10/site-packages/pyairports/airports.py:1: in <module>
    from pkg_resources import resource_string
/home/isamu/miniconda3/lib/python3.10/site-packages/pkg_resources/__init__.py:3663: in <module>
    def _initialize_master_working_set():
/home/isamu/miniconda3/lib/python3.10/site-packages/pkg_resources/__init__.py:3646: in _call_aside
    f(*args, **kwargs)
/home/isamu/miniconda3/lib/python3.10/site-packages/pkg_resources/__init__.py:3687: in _initialize_master_working_set
    tuple(dist.activate(replace=False) for dist in working_set)
/home/isamu/miniconda3/lib/python3.10/site-packages/pkg_resources/__init__.py:3687: in <genexpr>
    tuple(dist.activate(replace=False) for dist in working_set)
/home/isamu/miniconda3/lib/python3.10/site-packages/pkg_resources/__init__.py:3144: in activate
    declare_namespace(pkg)
/home/isamu/miniconda3/lib/python3.10/site-packages/pkg_resources/__init__.py:2542: in declare_namespace
    warnings.warn(msg, DeprecationWarning, stacklevel=2)
E   DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('google')`.
E   Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
====================== short test summary info =======================
ERROR tests/generate/test_generate.py - DeprecationWarning: Deprecated call to `pkg_resources.declare_nam...
!!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!
========================= 1 error in 17.91s ==========================

@lapp0
Contributor

lapp0 commented Aug 12, 2024

@isamu-isozaki sorry for the delayed response.

pyairports is an annoying library which has caused a lot of issues for me as well, and the only thing we use it for is loading the 3-letter airport code list from https://github.com/ozeliger/pyairports/blob/f611ee5a5a82b4e98b22641bb99693d862c802e4/pyairports/data/airport_list.json

A quick and easy hack is to remove the import and run the tests again.
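For instance, based on the import chain in the traceback above, a local-only edit like the following should let the tests collect (a throwaway hack for running locally, not something to commit):

```python
# outlines/types/__init__.py
# before:
#   from . import airports, countries
# after (drops the pyairports-backed module while running tests locally):
from . import countries
```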

@lapp0 mentioned this pull request Aug 12, 2024
@remichu-ai

Hi, just want to pop by and see how it is going. Will this feature be released soon? If there is a dev branch I can try it as well.

@lapp0
Contributor

lapp0 commented Aug 22, 2024

@lapp0 Got it and thanks! I think I'm only missing coverage which I'll try making tests for once I get time

Great, please let me know when you're ready for review!

Hi, just want to pop by and see how it is going. Will this feature be released soon? If there is a dev branch I can try it as well.

You might be able to get it working with the installation commands below. Please report back with any issues or feedback; it will help with this PR!

pip install git+https://github.com/isamu-isozaki/outlines@exllamav2_filter
pip install git+https://github.com/lapp0/exllamav2@sampler-logits-processor

@isamu-isozaki
Contributor Author

@remichu-ai Hi! If you had an issue building exllamav2 like me, you can just install outlines from my initial commit to this PR and use the code examples above; it should work.
However, I did hear about some inference-speed issues in that case if you have a weak CPU. I'm not sure how much more performant the current latest commit is.
You can definitely use this branch to test it out, since the main thing left is just writing tests etc., not much for functionality.

@isamu-isozaki
Contributor Author

isamu-isozaki commented Aug 30, 2024

@lapp0 hi! Sorry for more questions. I wrote some tests to try to fill out coverage for exllamav2.py. Coverage is 100% locally for exllamav2.py, but it seems that if the tests are skipped they don't count towards coverage (which is the case for this pipeline). Do you happen to know a simple way to fix this by any chance?
Other than this I think I'm ready for review!

Contributor

@lapp0 left a comment


My example script works with the code. Minor change requests, great work!

@isamu-isozaki
Contributor Author

@lapp0 Thanks for the review! Let me check it out tomorrow.

@isamu-isozaki
Contributor Author

isamu-isozaki commented Sep 20, 2024

@lapp0 Thanks for the review. I made all the changes, and all my tests pass locally (including pre-commit):

(base) outlines-dev$ pytest -s tests/generate/test_integration_exllamav2.py --cov=outlines.models
============================================ test session starts =============================================
platform linux -- Python 3.10.12, pytest-8.3.2, pluggy-1.5.0
rootdir: /mnt/d/personal_projects/whiterabbitneo-pentestgpt/outlines-dev
configfile: pyproject.toml
plugins: anyio-3.6.2, cov-5.0.0
collected 19 items

Loading: blockblockblock/TinyLlama-1.1B-Chat-v1.0-bpw4-exl2 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:31 0:00:00
Loading tokenizer...
Loading: blockblockblock/TinyLlama-1.1B-Chat-v1.0-bpw4-exl2 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 0:00:00
Loading tokenizer...
Loading: blockblockblock/TinyLlama-1.1B-Chat-v1.0-bpw4-exl2 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:01 0:00:00
Loading tokenizer...
Loading: blockblockblock/TinyLlama-1.1B-Chat-v1.0-bpw4-exl2 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 0:00:00
Loading tokenizer...
Loading: blockblockblock/TinyLlama-1.1B-Chat-v1.0-bpw4-exl2 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:01 0:00:00
Loading tokenizer...
Loading: blockblockblock/TinyLlama-1.1B-Chat-v1.0-bpw4-exl2 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 0:00:00
Loading tokenizer...
.

---------- coverage: platform linux, python 3.10.12-final-0 ----------
Name                                     Stmts   Miss Branch BrPart  Cover   Missing
------------------------------------------------------------------------------------
outlines/models/__init__.py                  9      0      0      0   100%
outlines/models/exllamav2.py               140      0     62      0   100%
outlines/models/llamacpp.py                154    110     60      0    21%   27-53, 56-57, 62-73, 76-84, 87-89, 92-94, 98, 107, 142, 146, 160-239, 277-293, 332-355, 358-362, 386-407
outlines/models/mlxlm.py                    81     72     30      0     8%   25-27, 38-41, 70-122, 147-196, 230-247
outlines/models/openai.py                  176    134     58      0    19%   97-105, 138-155, 158, 183-251, 255, 258, 261, 292-313, 318-322, 349-364, 381-388, 394-415, 420, 429-452, 461-484
outlines/models/tokenizer.py                12      0      0      0   100%
outlines/models/transformers.py            168    140     52      0    13%   28-56, 68-82, 87-90, 93-94, 97-106, 109-116, 119, 122-123, 126, 137-138, 163-184, 192-195, 225-253, 268-297, 309-340, 349-368, 371-381, 415-435, 444-452
outlines/models/transformers_vision.py      38     30     14      0    15%   12-13, 46-63, 73, 109-138
outlines/models/vllm.py                     78     66     42      0    10%   24-27, 30-42, 87-149, 159, 164-169, 184-188, 208-226
------------------------------------------------------------------------------------
TOTAL                                      856    552    318      0    31%


================================== 18 passed, 1 skipped in 72.95s (0:01:12) ==================================
(base) outlines-dev$ pytest -s tests/generate/test_generate.py -k exllamav2
============================================ test session starts =============================================
platform linux -- Python 3.10.12, pytest-8.3.2, pluggy-1.5.0
rootdir: /mnt/d/personal_projects/whiterabbitneo-pentestgpt/outlines-dev
configfile: pyproject.toml
plugins: anyio-3.6.2, cov-5.0.0
collected 320 items / 288 deselected / 32 selected

Loading: blockblockblock/TinyLlama-1.1B-Chat-v1.0-bpw4-exl2 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:37 0:00:00
Loading tokenizer...
Compiling FSM index for all state transitions: 100%|██████████████████████████| 10/10 [00:00<00:00, 45.03it/s]
Compiling FSM index for all state transitions: 100%|██████████████████████████| 25/25 [00:00<00:00, 95.85it/s]
Compiling FSM index for all state transitions: 100%|██████████████████████████| 21/21 [00:00<00:00, 95.23it/s]
Compiling FSM index for all state transitions: 100%|██████████████████████████| 10/10 [00:00<00:00, 96.69it/s]
Compiling FSM index for all state transitions: 100%|█████████████████████████| 25/25 [00:00<00:00, 139.23it/s]
Compiling FSM index for all state transitions: 100%|██████████████████████████| 21/21 [00:00<00:00, 95.51it/s]
Compiling FSM index for all state transitions: 100%|████████████████████████████| 6/6 [00:00<00:00, 73.53it/s]
Compiling FSM index for all state transitions: 100%|████████████████████████████| 8/8 [00:00<00:00, 92.24it/s]
Compiling FSM index for all state transitions: 100%|██████████████████████████| 10/10 [00:00<00:00, 92.73it/s]
...................

========================== 31 passed, 1 skipped, 288 deselected in 85.01s (0:01:25) ==========================
outlines-dev> pre-commit run --all-files        
check for merge conflicts................................................Passed
debug statements (python)................................................Passed
fix end of files.........................................................Passed
trim trailing whitespace.................................................Passed
isort....................................................................Passed
pyupgrade................................................................Passed
flake8...................................................................Passed
black....................................................................Passed
mypy.....................................................................Passed

@lapp0
Contributor

lapp0 commented Sep 23, 2024

Great job @isamu-isozaki !

I've opened the EXL2 PR for logits processors

turboderp/exllamav2#634

@isamu-isozaki
Contributor Author

@lapp0 awesome!

@lapp0
Contributor

lapp0 commented Oct 1, 2024

@isamu-isozaki I'm not sure whether the ExLlamaV2 PR will be merged soon; it's been a week without comment. To get this out the door, could you please update the Outlines ExLlamaV2 documentation to make the following clear:

  • ExLlamaV2 doesn't have logits processor support yet.
  • There is a third party fork which supports logits processors and is compatible with outlines
    • The install command is pip install git+https://github.com/lapp0/exllamav2@sampler-logits-processor

Could you also let me know what build issues you experienced? I didn't run into any but I'd like to ensure the install-from-git command doesn't result in additional confusion.

We can revert the documentation to reference the main ExLlamaV2 branch once the PR is merged.

@isamu-isozaki
Contributor Author

@lapp0 Sorry, I was away for a bit, and that sounds good! Also, thanks for making that PR! One question: I was following the discussion in exllamav2, but even after this, will exllamav2's constrained generation be the slowest (because of the double creation of the logits mask, plus relying on the CPU for it)?

@lapp0
Contributor

lapp0 commented Oct 5, 2024

@isamu-isozaki No problem, thanks for your great work!

I expect constrained generation with models.exllamav2 to have similar overhead to other models.

The mask is currently applied on CPU, but it can be applied on GPU, prior to CPU offloading. I've profiled this mode and it increased token throughput in "normal" mode to 145 tok/s. This isn't pushed yet though.

I'm not sure what you mean by double creation of logits mask. The mask should only be applied once per token generated.

Please let me know if you have any other questions.
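For reference, here is a minimal sketch of the GPU-side masking idea described above, assuming a boolean allowed-token mask produced by the logits processor; the function and variable names are invented for the example and don't correspond to either code base.

```python
import torch

def mask_then_offload(logits: torch.Tensor, allowed: torch.Tensor) -> torch.Tensor:
    """Apply the structured-generation mask on the logits' own device (e.g. CUDA),
    then move the already-masked logits to the CPU for the sampler.

    logits:  (batch, vocab) float tensor on the GPU
    allowed: (batch, vocab) bool tensor, True where a token is permitted
    """
    masked = logits.masked_fill(~allowed, float("-inf"))  # mask applied on GPU
    return masked.cpu()                                    # offload afterwards

# Usage sketch:
# logits = forward_pass(...)                       # stays on CUDA
# allowed = allowed_token_mask(...)                # bool mask on CUDA
# cpu_logits = mask_then_offload(logits, allowed)  # sampler sees masked logits
```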

@rlouf closed this in #1191 Oct 5, 2024
@isamu-isozaki
Contributor Author

@lapp0 Ah, I misread, and that sounds good! To do mask creation on the GPU, can this then be done fully on the outlines side (after your PR)?

@lapp0
Contributor

lapp0 commented Oct 5, 2024

Two changes are necessary:

Edit: Here are the benchmarks for overhead for a single token generated (batch size 4) for #1192. I'm not sure we need any changes for now, as CPU and GPU operations are both efficient. Let me know if you observe something different though

| After [8aa0b0d] | Benchmark (Parameter)                             |
|-----------------|---------------------------------------------------|
| 94.7±0.2μs      | time_structured_generation('torch', 'Z*')         |
| 149±0.4μs       | time_structured_generation('torch_cuda', 'Z*')    |
| 386±1μs         | time_structured_generation('torch', '[^Z]*')      |
| 229±1μs         | time_structured_generation('torch_cuda', '[^Z]*') |

@isamu-isozaki
Contributor Author

@lapp0 Thanks! Yeah, I have someone hosting an outlines server with this code, and he noticed that his production server with very weak CPUs was very slow, which matches our observation, haha. I think what's best may depend on the type of hardware in production.

@lapp0
Contributor

lapp0 commented Oct 9, 2024

@isamu-isozaki could you please open a discussion with further details about their performance regression? I'll take a look and see what the root cause is.
