Exllamav2 Integration #1010
Conversation
Some questions I had for maintainers were:
Sorry for the late reply and request for refactor.

We've been moving towards using `SequenceGeneratorAdapter` and `outlines.processors` in `outlines.generate`. Currently the only local outlines model which doesn't have a `SequenceGeneratorAdapter`-based implementation is `exllamav2`.

Would you be able to refactor this to use `SequenceGeneratorAdapter` instead?
This would involve:

- Best starting point: adding an `ExLlamaV2` fixture to `tests/generate/test_generate.py`, which will automatically test all generation methods (structured, batch, stream, etc.) against the model here
- Adding `ExLlamaV2` to the `*_unified` dispatcher: https://github.com/outlines-dev/outlines/blob/main/outlines/generate/regex.py#L42-L53 (see the sketch after this list)
- Ensuring the passed `OutlinesLogitsProcessor` is applied when exllamav2's `generator.generate(prompt)` is called
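For the dispatcher step, here is a rough sketch of what the registration might look like, following the `singledispatch` pattern the linked `regex.py` already uses for other local models. The `ExLlamaV2Model` import path and the exact signatures are assumptions, not the final API:

```python
# Hypothetical sketch of registering ExLlamaV2 with the *_unified
# dispatcher; names marked "assumed" are illustrative.
from functools import singledispatch

from outlines.generate.api import SequenceGeneratorAdapter
from outlines.models.exllamav2 import ExLlamaV2Model  # assumed class/module name
from outlines.processors import RegexLogitsProcessor
from outlines.samplers import Sampler, multinomial


@singledispatch
def regex(model, regex_str: str, sampler: Sampler = multinomial()):
    ...  # existing default implementation in outlines/generate/regex.py


@regex.register(ExLlamaV2Model)
def regex_exllamav2(model, regex_str: str, sampler: Sampler = multinomial()):
    # Reuse the shared regex logits processor and wrap the model in the
    # adapter, so structured, batch, and stream generation come for free.
    logits_processor = RegexLogitsProcessor(regex_str, tokenizer=model.tokenizer)
    return SequenceGeneratorAdapter(model, logits_processor, sampler)
```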
@lapp0 makes sense! Let me try doing this tomorrow.

Thanks so much, please let me know if you have any questions!
@lapp0 sorry for the delay! Two questions:

Sorry for the delayed response, and let me know if I'm going in the right direction!
Great questions! Converting it to a filter is a bit hacky IMO, but it may be the simplest solution and doesn't require an upstream change. Alternatively, we could apply logits processing directly.

The first option makes more sense to me; it is generator-class agnostic.

To be honest, I'm not sure how well
@lapp0 Sounds good! I think I'll go with option 1. For this, I think the steps needed are:

Rather than implementing a new logits processor, I'm awaiting correspondence with the ExLlamaV2 maintainer, turboderp, regarding whether a
@lapp0 Interesting! The main reason I was thinking of a new logits processor is that, relative to exllamav2's code base, I thought we do some redundant steps. For them, they first
while in our case we start from the assumption that the logits have already been computed and then construct the mask, etc. So I thought some of the steps here overlap with our current logits processor. But I'm very happy to get advice here, since this is just about making the exllamav2 filter. I'm also happy to hear what turboderp thinks.
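To make the overlap concrete, here is a minimal sketch of the "logits first, mask second" pattern described above. The `fsm` interface is a stand-in for illustration, not the actual `OutlinesLogitsProcessor` API:

```python
import torch


class MaskComputedLogits:
    """Sketch: assume the logits were already computed, then build the mask,
    in contrast to ExLlamaV2 filters that narrow the candidate set up front."""

    def __init__(self, fsm):
        self.fsm = fsm                  # stand-in guide object (assumed)
        self.state = fsm.initial_state  # assumed attribute

    def __call__(self, input_ids: torch.Tensor, logits: torch.Tensor) -> torch.Tensor:
        # input_ids is part of the processor-style signature; the mask here
        # depends only on the tracked FSM state.
        allowed = self.fsm.allowed_token_ids(self.state)  # assumed helper
        mask = torch.full_like(logits, float("-inf"))
        mask[..., allowed] = 0.0  # keep only tokens legal in the current state
        return logits + mask
```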
Yes, they will have multiple methods of filtering, but given Outlines' singular logits processor implementation, which is tested against all inference engines, it's likely better to follow the same pattern with ExLlamaV2. This will ensure bug fixes, optimizations, enhancements, and new features present in one integration are available to all integrations! I spoke with turboderp on their Discord server; he is open to having a

Here are the steps I think we should take; let me know what you think:

Let me know if you think this is the right path. Thanks so much for your great work on this PR. The users in the ExLlamaV2 Discord were excited to hear about this PR!
@lapp0 wow, I didn't know exllamav2 had a Discord server! And that makes perfect sense.
@isamu-isozaki can you please take a look at this changeset and the provided example? I believe it should provide a sufficient basis for the implementation. Let me know if you see anything that should be changed in my implementation. If you have any questions, please do not hesitate! Good luck!

Edit: Also please add "Fixes #807" to the PR description.
@lapp0 sounds good. Sorry, I got a bit sidetracked by some work. I'll try to get to this by the weekend at the latest. Sorry for the delay!
Sorry for the delay! I finally got the exllamav2 fork built, and I was able to run the current PR's code with the script below, which worked!

```python
import sys

sys.path.append("../outlines-dev")
import outlines
from enum import Enum
from pydantic import BaseModel, constr

model = outlines.models.exl2(
    model_path="turboderp/TinyLlama-1B-32k-exl2",
    cache_q4=True,
    paged=False
)

prompt = """You are a sentiment-labelling assistant.
Is the following review positive or negative?

Review: This restaurant is just awesome!
"""
generator = outlines.generate.choice(model, ["Positive", "Negative"])
answer = generator(prompt)
print(answer)

prompt = "<s>result of 9 + 9 = 18</s><s>result of 1 + 2 = "
answer = outlines.generate.format(model, int)(prompt, max_tokens=1)
print(answer)

generator = outlines.generate.format(model, float)
answer = generator(prompt, max_tokens=10)
print(answer)

generator = outlines.generate.text(model)
unstructured = generator(prompt, max_tokens=30)

generator = outlines.generate.regex(
    model,
    r"((25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)",
)
structured = generator(prompt, max_tokens=30)

print(unstructured)
# What is the IP address of the Google DNS servers?
#
# Passive DNS servers are at DNS servers that are private.
# In other words, both IP servers are private. The database
# does not contain Chelsea Manning
print(structured)


class Weapon(str, Enum):
    sword = "sword"
    axe = "axe"
    mace = "mace"
    spear = "spear"
    bow = "bow"
    crossbow = "crossbow"


class Armor(str, Enum):
    leather = "leather"
    chainmail = "chainmail"
    plate = "plate"


class Character(BaseModel):
    name: constr(max_length=10)
    age: int
    armor: Armor
    weapon: Weapon
    strength: int


# Construct structured sequence generator
generator = outlines.generate.json(model, Character)

# Draw a sample
seed = 789001

character = generator("Give me a character description", seed=seed)
print(repr(character))
# Character(name='Anderson', age=28, armor=<Armor.chainmail: 'chainmail'>, weapon=<Weapon.sword: 'sword'>, strength=8)

character = generator("Give me an interesting character description", seed=seed)
print(repr(character))
# Character(name='Vivian Thr', age=44, armor=<Armor.plate: 'plate'>, weapon=<Weapon.crossbow: 'crossbow'>, strength=125)
```
The current main issue is that I can't seem to run the tests due to an error involving pyairports. @lapp0, do you have any advice on how to fix this?

```
pytest -s tests/generate/test_generate.py -k exllamav2
======================== test session starts =========================
platform linux -- Python 3.10.12, pytest-8.3.2, pluggy-1.5.0
rootdir: /mnt/d/personal_projects/whiterabbitneo-pentestgpt/outlines-dev
configfile: pyproject.toml
plugins: anyio-3.6.2
collected 0 items / 1 error

=============================== ERRORS ===============================
__________ ERROR collecting tests/generate/test_generate.py __________
tests/generate/test_generate.py:6: in <module>
    import outlines.generate as generate
outlines/__init__.py:6: in <module>
    import outlines.types
outlines/types/__init__.py:1: in <module>
    from . import airports, countries
outlines/types/airports.py:4: in <module>
    from pyairports.airports import AIRPORT_LIST
/home/isamu/miniconda3/lib/python3.10/site-packages/pyairports/airports.py:1: in <module>
    from pkg_resources import resource_string
/home/isamu/miniconda3/lib/python3.10/site-packages/pkg_resources/__init__.py:3663: in <module>
    def _initialize_master_working_set():
/home/isamu/miniconda3/lib/python3.10/site-packages/pkg_resources/__init__.py:3646: in _call_aside
    f(*args, **kwargs)
/home/isamu/miniconda3/lib/python3.10/site-packages/pkg_resources/__init__.py:3687: in _initialize_master_working_set
    tuple(dist.activate(replace=False) for dist in working_set)
/home/isamu/miniconda3/lib/python3.10/site-packages/pkg_resources/__init__.py:3687: in <genexpr>
    tuple(dist.activate(replace=False) for dist in working_set)
/home/isamu/miniconda3/lib/python3.10/site-packages/pkg_resources/__init__.py:3144: in activate
    declare_namespace(pkg)
/home/isamu/miniconda3/lib/python3.10/site-packages/pkg_resources/__init__.py:2542: in declare_namespace
    warnings.warn(msg, DeprecationWarning, stacklevel=2)
E   DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('google')`.
E   Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
====================== short test summary info =======================
ERROR tests/generate/test_generate.py - DeprecationWarning: Deprecated call to `pkg_resources.declare_nam...
!!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!
========================= 1 error in 17.91s ==========================
```
@isamu-isozaki sorry for the delayed response.

A quick and easy hack is to remove the import and run the tests again.
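For reference, that hack could look like the following edit to `outlines/types/__init__.py` (the import line appears in the traceback above). This is a throwaway local workaround, not a committed fix:

```python
# outlines/types/__init__.py: local hack to skip the pyairports-backed
# airports import that breaks test collection.
# from . import airports, countries
from . import countries
```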
Hi, just wanted to pop by and see how it is going. Will this feature be released soon? If there is a dev branch, I can try it as well.
Great, please let me know when you're ready for review!

You might be able to get it working with the installation commands below. Please report back with any issues or feedback; it will help with this PR!
@remichu-ai Hi! If you had an issue building exllamav2 like me, you can just install outlines from my initial commit to this PR and use the code examples; it should work.
@lapp0 hi! Sorry for more questions. I wrote some tests to attempt to fill out exllamav2.py, and coverage is 100% locally for exllamav2.py. But it seems that if the tests are skipped they don't count towards coverage (which is the case for this pipeline). Do you happen to know a simple way to fix this, by any chance?
My example script works with the code. Minor change requests, great work!

@lapp0 Thanks for the review! Let me check it out tomorrow.
@lapp0 Thanks for the review. I made all the changes, and all my tests passed locally (including pre-commit):

```
(base) outlines-dev$ pytest -s tests/generate/test_integration_exllamav2.py --cov=outlines.models
============================================ test session starts =============================================
platform linux -- Python 3.10.12, pytest-8.3.2, pluggy-1.5.0
rootdir: /mnt/d/personal_projects/whiterabbitneo-pentestgpt/outlines-dev
configfile: pyproject.toml
plugins: anyio-3.6.2, cov-5.0.0
collected 19 items

Loading: blockblockblock/TinyLlama-1.1B-Chat-v1.0-bpw4-exl2 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:31 0:00:00
Loading tokenizer...
Loading: blockblockblock/TinyLlama-1.1B-Chat-v1.0-bpw4-exl2 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 0:00:00
Loading tokenizer...
Loading: blockblockblock/TinyLlama-1.1B-Chat-v1.0-bpw4-exl2 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:01 0:00:00
Loading tokenizer...
Loading: blockblockblock/TinyLlama-1.1B-Chat-v1.0-bpw4-exl2 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 0:00:00
Loading tokenizer...
Loading: blockblockblock/TinyLlama-1.1B-Chat-v1.0-bpw4-exl2 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:01 0:00:00
Loading tokenizer...
Loading: blockblockblock/TinyLlama-1.1B-Chat-v1.0-bpw4-exl2 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 0:00:00
Loading tokenizer...
.

---------- coverage: platform linux, python 3.10.12-final-0 ----------
Name                                     Stmts   Miss Branch BrPart  Cover   Missing
------------------------------------------------------------------------------------
outlines/models/__init__.py                  9      0      0      0   100%
outlines/models/exllamav2.py               140      0     62      0   100%
outlines/models/llamacpp.py                154    110     60      0    21%   27-53, 56-57, 62-73, 76-84, 87-89, 92-94, 98, 107, 142, 146, 160-239, 277-293, 332-355, 358-362, 386-407
outlines/models/mlxlm.py                    81     72     30      0     8%   25-27, 38-41, 70-122, 147-196, 230-247
outlines/models/openai.py                  176    134     58      0    19%   97-105, 138-155, 158, 183-251, 255, 258, 261, 292-313, 318-322, 349-364, 381-388, 394-415, 420, 429-452, 461-484
outlines/models/tokenizer.py                12      0      0      0   100%
outlines/models/transformers.py            168    140     52      0    13%   28-56, 68-82, 87-90, 93-94, 97-106, 109-116, 119, 122-123, 126, 137-138, 163-184, 192-195, 225-253, 268-297, 309-340, 349-368, 371-381, 415-435, 444-452
outlines/models/transformers_vision.py      38     30     14      0    15%   12-13, 46-63, 73, 109-138
outlines/models/vllm.py                     78     66     42      0    10%   24-27, 30-42, 87-149, 159, 164-169, 184-188, 208-226
------------------------------------------------------------------------------------
TOTAL                                      856    552    318      0    31%

================================== 18 passed, 1 skipped in 72.95s (0:01:12) ==================================

(base) outlines-dev$ pytest -s tests/generate/test_generate.py -k exllamav2
============================================ test session starts =============================================
platform linux -- Python 3.10.12, pytest-8.3.2, pluggy-1.5.0
rootdir: /mnt/d/personal_projects/whiterabbitneo-pentestgpt/outlines-dev
configfile: pyproject.toml
plugins: anyio-3.6.2, cov-5.0.0
collected 320 items / 288 deselected / 32 selected

Loading: blockblockblock/TinyLlama-1.1B-Chat-v1.0-bpw4-exl2 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:37 0:00:00
Loading tokenizer...
Compiling FSM index for all state transitions: 100%|██████████████████████████| 10/10 [00:00<00:00, 45.03it/s]
Compiling FSM index for all state transitions: 100%|██████████████████████████| 25/25 [00:00<00:00, 95.85it/s]
Compiling FSM index for all state transitions: 100%|██████████████████████████| 21/21 [00:00<00:00, 95.23it/s]
Compiling FSM index for all state transitions: 100%|██████████████████████████| 10/10 [00:00<00:00, 96.69it/s]
Compiling FSM index for all state transitions: 100%|█████████████████████████| 25/25 [00:00<00:00, 139.23it/s]
Compiling FSM index for all state transitions: 100%|██████████████████████████| 21/21 [00:00<00:00, 95.51it/s]
Compiling FSM index for all state transitions: 100%|████████████████████████████| 6/6 [00:00<00:00, 73.53it/s]
Compiling FSM index for all state transitions: 100%|████████████████████████████| 8/8 [00:00<00:00, 92.24it/s]
Compiling FSM index for all state transitions: 100%|██████████████████████████| 10/10 [00:00<00:00, 92.73it/s]
...................
========================== 31 passed, 1 skipped, 288 deselected in 85.01s (0:01:25) ==========================

outlines-dev> pre-commit run --all-files
check for merge conflicts................................................Passed
debug statements (python)................................................Passed
fix end of files.........................................................Passed
trim trailing whitespace.................................................Passed
isort....................................................................Passed
pyupgrade................................................................Passed
flake8...................................................................Passed
black....................................................................Passed
mypy.....................................................................Passed
```
Great job @isamu-isozaki! I've opened the EXL2 PR for logits processors.

@lapp0 awesome!
@isamu-isozaki I'm not sure whether the ExLlamaV2 PR will be merged soon; it's been a week without comment. To get this out the door, could you please update the Outlines ExLlamaV2 documentation to make the following clear:

Could you also let me know what build issues you experienced? I didn't run into any, but I'd like to ensure the install-from-git command doesn't result in additional confusion. We can revert the documentation to reference the main
@lapp0 Sorry, I was away for a bit, and that sounds good! Also, thanks for making that PR! One question: I was following the discussion in exllamav2, but even after this, will exllamav2's constrained generation be the slowest (because of the double creation of the logits mask, plus its reliance on the CPU for this)?
@isamu-isozaki No problem, thanks for your great work! I expect constrained generation with

The mask is currently applied on the CPU, but it can be applied on the GPU, prior to CPU offloading. I've profiled this mode, and it increased token throughput in "normal" mode to 145 tok/s. This isn't pushed yet, though.

I'm not sure what you mean by double creation of the logits mask. The mask should only be applied once per token generated. Please let me know if you have any other questions.
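For intuition, here is a minimal sketch of applying the mask on the device that already holds the logits, so nothing crosses to the CPU before masking. The helper name and call site are assumptions, not the profiled ExLlamaV2 change:

```python
import torch


def apply_mask_on_device(logits: torch.Tensor, allowed_token_ids: torch.Tensor) -> torch.Tensor:
    """Mask disallowed tokens on whatever device holds the logits (e.g. the GPU),
    avoiding a full-logits transfer to the CPU just to apply the mask.
    `allowed_token_ids` is assumed to be a 1-D LongTensor of legal token ids."""
    mask = torch.full_like(logits, float("-inf"))
    mask.index_fill_(-1, allowed_token_ids.to(logits.device), 0.0)
    return logits + mask
```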
@lapp0 Ah, I misread; that sounds good! To do mask creation on the GPU, can this then be done fully on the outlines side (after your PR)?
Two changes are necessary:

Edit: Here are the benchmarks of the overhead for a single generated token (batch size 4) for #1192. I'm not sure we need any changes for now, as the CPU and GPU operations are both efficient. Let me know if you observe something different, though.

| Benchmark (Parameter) | After [8aa0b0d] |
| --- | --- |
@lapp0 Thanks! Yeah, I know someone hosting an outlines server with this code, and he noticed that his production server with very weak CPUs was very slow, which matches our observation, haha. I think what's best may depend on the type of hardware in production.
@isamu-isozaki could you please open a discussion with further details about their performance regression? I'll take a look and see what the root cause is.
This fixes #1009. Also fixes #807.

The tests I did were:

For loading:

Choices test:

Returns

JSON test:

Returns