Exllamav2 Integration #1010
Conversation
Some questions I had for maintainers were:
Sorry for the late reply and request for refactor.

We've been moving towards using `SequenceGeneratorAdapter` and `outlines.processors` in `outlines.generate`. Currently the only local outlines model which doesn't have a `SequenceGeneratorAdapter`-based implementation is `exllamav2`.

Would you be able to refactor this to use `SequenceGeneratorAdapter` instead?
This would involve:

- Best starting point: adding an `ExLlamaV2` fixture to `tests/generate/test_generate.py`, which will automatically test all generation methods (structured, batch, stream, etc.) against the model here
- Adding `ExLlamaV2` to the `*_unified` dispatcher: https://github.com/outlines-dev/outlines/blob/main/outlines/generate/regex.py#L42-L53 (see the sketch after this list)
- Ensuring the passed `OutlinesLogitsProcessor` is applied when exllamav2's `generator.generate(prompt)` is called
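For the dispatcher step, here is a rough sketch of what the registration might look like, following the `singledispatch` pattern the linked `regex.py` already uses for other local models. The `ExLlamaV2Model` import path and the exact signatures are assumptions, not the final API:

```python
# Hypothetical sketch of registering ExLlamaV2 with the *_unified
# dispatcher; names marked "assumed" are illustrative.
from functools import singledispatch

from outlines.generate.api import SequenceGeneratorAdapter
from outlines.models.exllamav2 import ExLlamaV2Model  # assumed class/module name
from outlines.processors import RegexLogitsProcessor
from outlines.samplers import Sampler, multinomial


@singledispatch
def regex(model, regex_str: str, sampler: Sampler = multinomial()):
    ...  # existing default implementation in outlines/generate/regex.py


@regex.register(ExLlamaV2Model)
def regex_exllamav2(model, regex_str: str, sampler: Sampler = multinomial()):
    # Reuse the shared regex logits processor and wrap the model in the
    # adapter, so structured, batch, and stream generation come for free.
    logits_processor = RegexLogitsProcessor(regex_str, tokenizer=model.tokenizer)
    return SequenceGeneratorAdapter(model, logits_processor, sampler)
```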
@lapp0 makes sense! Let me try doing this tomorrow.

Thanks so much, please let me know if you have any questions!
@lapp0 sorry for the delay! Two questions:

Sorry for the delayed response, and let me know if I'm going in the right direction!
Great questions! Converting it to a filter is a bit hacky IMO, but it may be the simplest solution and doesn't require an upstream change. Alternatively, we could apply logits processing directly.

The first option makes more sense to me; it is generator-class agnostic.

To be honest, I'm not sure how well
@lapp0 Sounds good! I think I'll go with option 1. For this, I think the steps needed are:

Rather than implementing a new logits processor, I'm awaiting correspondence with the ExLlamaV2 maintainer, turboderp, regarding whether a
@lapp0 Interesting! The main reason I was thinking of a new logits processor is that, relative to exllamav2's code base, I thought we do some redundant steps. For them, they first
while in our case we start from the assumption that the logits have already been computed and then construct the mask, etc. So I thought some of the steps here overlap with our current logits processor. But I'm very happy to get advice here, since this is just about making the exllamav2 filter. I'm also happy to hear what turboderp thinks.
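To make the overlap concrete, here is a minimal sketch of the "logits first, mask second" pattern described above. The `fsm` interface is a stand-in for illustration, not the actual `OutlinesLogitsProcessor` API:

```python
import torch


class MaskComputedLogits:
    """Sketch: assume the logits were already computed, then build the mask,
    in contrast to ExLlamaV2 filters that narrow the candidate set up front."""

    def __init__(self, fsm):
        self.fsm = fsm                  # stand-in guide object (assumed)
        self.state = fsm.initial_state  # assumed attribute

    def __call__(self, input_ids: torch.Tensor, logits: torch.Tensor) -> torch.Tensor:
        # input_ids is part of the processor-style signature; the mask here
        # depends only on the tracked FSM state.
        allowed = self.fsm.allowed_token_ids(self.state)  # assumed helper
        mask = torch.full_like(logits, float("-inf"))
        mask[..., allowed] = 0.0  # keep only tokens legal in the current state
        return logits + mask
```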
Yes, they will have multiple methods of filtering, but given Outlines' singular logits processor implementation, which is tested against all inference engines, it's likely better to follow the same pattern with ExLlamaV2. This will ensure bug fixes, optimizations, enhancements, and new features present in one integration are available to all integrations! I spoke with turboderp on their Discord server; he is open to having a

Here are the steps I think we should take; let me know what you think:

Let me know if you think this is the right path. Thanks so much for your great work on this PR. The users in the ExLlamaV2 Discord were excited to hear about this PR!
@lapp0 wow, I didn't know exllamav2 had a Discord server! And that makes perfect sense.
@isamu-isozaki can you please take a look at this changeset and the provided example? I believe it should provide a sufficient basis for the implementation. Let me know if you see anything that should be changed in my implementation. If you have any questions, please do not hesitate! Good luck!

Edit: Also please add "Fixes #807" to the PR description.
@lapp0 sounds good. Sorry, I got a bit sidetracked by some work. I'll try to get to this by the weekend at the latest. Sorry for the delay!
Sorry for the delay! I finally got the exllamav2 fork built, and I was able to run the current PR's code with the script below, which worked!

```python
import sys

sys.path.append("../outlines-dev")
import outlines
from enum import Enum
from pydantic import BaseModel, constr

model = outlines.models.exl2(
    model_path="turboderp/TinyLlama-1B-32k-exl2",
    cache_q4=True,
    paged=False
)

prompt = """You are a sentiment-labelling assistant.
Is the following review positive or negative?

Review: This restaurant is just awesome!
"""
generator = outlines.generate.choice(model, ["Positive", "Negative"])
answer = generator(prompt)
print(answer)

prompt = "<s>result of 9 + 9 = 18</s><s>result of 1 + 2 = "
answer = outlines.generate.format(model, int)(prompt, max_tokens=1)
print(answer)

generator = outlines.generate.format(model, float)
answer = generator(prompt, max_tokens=10)
print(answer)

generator = outlines.generate.text(model)
unstructured = generator(prompt, max_tokens=30)

generator = outlines.generate.regex(
    model,
    r"((25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)",
)
structured = generator(prompt, max_tokens=30)

print(unstructured)
# What is the IP address of the Google DNS servers?
#
# Passive DNS servers are at DNS servers that are private.
# In other words, both IP servers are private. The database
# does not contain Chelsea Manning
print(structured)


class Weapon(str, Enum):
    sword = "sword"
    axe = "axe"
    mace = "mace"
    spear = "spear"
    bow = "bow"
    crossbow = "crossbow"


class Armor(str, Enum):
    leather = "leather"
    chainmail = "chainmail"
    plate = "plate"


class Character(BaseModel):
    name: constr(max_length=10)
    age: int
    armor: Armor
    weapon: Weapon
    strength: int


# Construct structured sequence generator
generator = outlines.generate.json(model, Character)

# Draw a sample
seed = 789001

character = generator("Give me a character description", seed=seed)
print(repr(character))
# Character(name='Anderson', age=28, armor=<Armor.chainmail: 'chainmail'>, weapon=<Weapon.sword: 'sword'>, strength=8)

character = generator("Give me an interesting character description", seed=seed)
print(repr(character))
# Character(name='Vivian Thr', age=44, armor=<Armor.plate: 'plate'>, weapon=<Weapon.crossbow: 'crossbow'>, strength=125)
```
The current main issue is that I can't seem to run the tests due to an error involving pyairports. @lapp0, do you have any advice on how to fix this?

```
pytest -s tests/generate/test_generate.py -k exllamav2
======================== test session starts =========================
platform linux -- Python 3.10.12, pytest-8.3.2, pluggy-1.5.0
rootdir: /mnt/d/personal_projects/whiterabbitneo-pentestgpt/outlines-dev
configfile: pyproject.toml
plugins: anyio-3.6.2
collected 0 items / 1 error

=============================== ERRORS ===============================
__________ ERROR collecting tests/generate/test_generate.py __________
tests/generate/test_generate.py:6: in <module>
    import outlines.generate as generate
outlines/__init__.py:6: in <module>
    import outlines.types
outlines/types/__init__.py:1: in <module>
    from . import airports, countries
outlines/types/airports.py:4: in <module>
    from pyairports.airports import AIRPORT_LIST
/home/isamu/miniconda3/lib/python3.10/site-packages/pyairports/airports.py:1: in <module>
    from pkg_resources import resource_string
/home/isamu/miniconda3/lib/python3.10/site-packages/pkg_resources/__init__.py:3663: in <module>
    def _initialize_master_working_set():
/home/isamu/miniconda3/lib/python3.10/site-packages/pkg_resources/__init__.py:3646: in _call_aside
    f(*args, **kwargs)
/home/isamu/miniconda3/lib/python3.10/site-packages/pkg_resources/__init__.py:3687: in _initialize_master_working_set
    tuple(dist.activate(replace=False) for dist in working_set)
/home/isamu/miniconda3/lib/python3.10/site-packages/pkg_resources/__init__.py:3687: in <genexpr>
    tuple(dist.activate(replace=False) for dist in working_set)
/home/isamu/miniconda3/lib/python3.10/site-packages/pkg_resources/__init__.py:3144: in activate
    declare_namespace(pkg)
/home/isamu/miniconda3/lib/python3.10/site-packages/pkg_resources/__init__.py:2542: in declare_namespace
    warnings.warn(msg, DeprecationWarning, stacklevel=2)
E   DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('google')`.
E   Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
====================== short test summary info =======================
ERROR tests/generate/test_generate.py - DeprecationWarning: Deprecated call to `pkg_resources.declare_nam...
!!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!
========================= 1 error in 17.91s ==========================
```
@isamu-isozaki sorry for the delayed response.

A quick and easy hack is to remove the import and run the tests again.
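For reference, that hack could look like the following edit to `outlines/types/__init__.py` (the import line appears in the traceback above). This is a throwaway local workaround, not a committed fix:

```python
# outlines/types/__init__.py: local hack to skip the pyairports-backed
# airports import that breaks test collection.
# from . import airports, countries
from . import countries
```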
Hi, just wanted to pop by and see how it is going. Will this feature be released soon? If there is a dev branch, I can try it as well.
Great, please let me know when you're ready for review!

You might be able to get it working with the installation commands below. Please report back with any issues or feedback; it will help with this PR!
@remichu-ai Hi! If you had an issue building exllamav2 like me, you can just install outlines from my initial commit to this PR and use the code examples; it should work.
@lapp0 hi! Sorry for more questions. I wrote some tests to attempt to fill out exllamav2.py, and coverage is 100% locally for exllamav2.py. But it seems that if the tests are skipped they don't count towards coverage (which is the case for this pipeline). Do you happen to know a simple way to fix this, by any chance?
My example script works with the code. Minor change requests, great work!

@lapp0 Thanks for the review! Let me check it out tomorrow.
@lapp0 Thanks for the review. I made all the changes, and all my tests passed locally (including pre-commit):

```
(base) outlines-dev$ pytest -s tests/generate/test_integration_exllamav2.py --cov=outlines.models
============================================ test session starts =============================================
platform linux -- Python 3.10.12, pytest-8.3.2, pluggy-1.5.0
rootdir: /mnt/d/personal_projects/whiterabbitneo-pentestgpt/outlines-dev
configfile: pyproject.toml
plugins: anyio-3.6.2, cov-5.0.0
collected 19 items

Loading: blockblockblock/TinyLlama-1.1B-Chat-v1.0-bpw4-exl2 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:31 0:00:00
Loading tokenizer...
Loading: blockblockblock/TinyLlama-1.1B-Chat-v1.0-bpw4-exl2 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 0:00:00
Loading tokenizer...
Loading: blockblockblock/TinyLlama-1.1B-Chat-v1.0-bpw4-exl2 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:01 0:00:00
Loading tokenizer...
Loading: blockblockblock/TinyLlama-1.1B-Chat-v1.0-bpw4-exl2 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 0:00:00
Loading tokenizer...
Loading: blockblockblock/TinyLlama-1.1B-Chat-v1.0-bpw4-exl2 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:01 0:00:00
Loading tokenizer...
Loading: blockblockblock/TinyLlama-1.1B-Chat-v1.0-bpw4-exl2 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 0:00:00
Loading tokenizer...
.

---------- coverage: platform linux, python 3.10.12-final-0 ----------
Name                                     Stmts   Miss Branch BrPart  Cover   Missing
------------------------------------------------------------------------------------
outlines/models/__init__.py                  9      0      0      0   100%
outlines/models/exllamav2.py               140      0     62      0   100%
outlines/models/llamacpp.py                154    110     60      0    21%   27-53, 56-57, 62-73, 76-84, 87-89, 92-94, 98, 107, 142, 146, 160-239, 277-293, 332-355, 358-362, 386-407
outlines/models/mlxlm.py                    81     72     30      0     8%   25-27, 38-41, 70-122, 147-196, 230-247
outlines/models/openai.py                  176    134     58      0    19%   97-105, 138-155, 158, 183-251, 255, 258, 261, 292-313, 318-322, 349-364, 381-388, 394-415, 420, 429-452, 461-484
outlines/models/tokenizer.py                12      0      0      0   100%
outlines/models/transformers.py            168    140     52      0    13%   28-56, 68-82, 87-90, 93-94, 97-106, 109-116, 119, 122-123, 126, 137-138, 163-184, 192-195, 225-253, 268-297, 309-340, 349-368, 371-381, 415-435, 444-452
outlines/models/transformers_vision.py      38     30     14      0    15%   12-13, 46-63, 73, 109-138
outlines/models/vllm.py                     78     66     42      0    10%   24-27, 30-42, 87-149, 159, 164-169, 184-188, 208-226
------------------------------------------------------------------------------------
TOTAL                                      856    552    318      0    31%

================================== 18 passed, 1 skipped in 72.95s (0:01:12) ==================================

(base) outlines-dev$ pytest -s tests/generate/test_generate.py -k exllamav2
============================================ test session starts =============================================
platform linux -- Python 3.10.12, pytest-8.3.2, pluggy-1.5.0
rootdir: /mnt/d/personal_projects/whiterabbitneo-pentestgpt/outlines-dev
configfile: pyproject.toml
plugins: anyio-3.6.2, cov-5.0.0
collected 320 items / 288 deselected / 32 selected

Loading: blockblockblock/TinyLlama-1.1B-Chat-v1.0-bpw4-exl2 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:37 0:00:00
Loading tokenizer...
Compiling FSM index for all state transitions: 100%|██████████████████████████| 10/10 [00:00<00:00, 45.03it/s]
Compiling FSM index for all state transitions: 100%|██████████████████████████| 25/25 [00:00<00:00, 95.85it/s]
Compiling FSM index for all state transitions: 100%|██████████████████████████| 21/21 [00:00<00:00, 95.23it/s]
Compiling FSM index for all state transitions: 100%|██████████████████████████| 10/10 [00:00<00:00, 96.69it/s]
Compiling FSM index for all state transitions: 100%|█████████████████████████| 25/25 [00:00<00:00, 139.23it/s]
Compiling FSM index for all state transitions: 100%|██████████████████████████| 21/21 [00:00<00:00, 95.51it/s]
Compiling FSM index for all state transitions: 100%|████████████████████████████| 6/6 [00:00<00:00, 73.53it/s]
Compiling FSM index for all state transitions: 100%|████████████████████████████| 8/8 [00:00<00:00, 92.24it/s]
Compiling FSM index for all state transitions: 100%|██████████████████████████| 10/10 [00:00<00:00, 92.73it/s]
...................
========================== 31 passed, 1 skipped, 288 deselected in 85.01s (0:01:25) ==========================

outlines-dev> pre-commit run --all-files
check for merge conflicts................................................Passed
debug statements (python)................................................Passed
fix end of files.........................................................Passed
trim trailing whitespace.................................................Passed
isort....................................................................Passed
pyupgrade................................................................Passed
flake8...................................................................Passed
black....................................................................Passed
mypy.....................................................................Passed
```
Great job @isamu-isozaki! I've opened the EXL2 PR for logits processors.

@lapp0 awesome!
@isamu-isozaki I'm not sure whether the ExLlamaV2 PR will be merged soon; it's been a week without comment. To get this out the door, could you please update the Outlines ExLlamaV2 documentation to make the following clear:

Could you also let me know what build issues you experienced? I didn't run into any, but I'd like to ensure the install-from-git command doesn't result in additional confusion. We can revert the documentation to reference the main
@lapp0 Sorry, I was away for a bit, and that sounds good! Also, thanks for making that PR! One question: I was following the discussion in exllamav2, but even after this, will exllamav2's constrained generation be the slowest (because of the double creation of the logits mask, plus its reliance on the CPU for this)?
@isamu-isozaki No problem, thanks for your great work! I expect constrained generation with

The mask is currently applied on the CPU, but it can be applied on the GPU, prior to CPU offloading. I've profiled this mode, and it increased token throughput in "normal" mode to 145 tok/s. This isn't pushed yet, though.

I'm not sure what you mean by double creation of the logits mask. The mask should only be applied once per token generated. Please let me know if you have any other questions.
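For intuition, here is a minimal sketch of applying the mask on the device that already holds the logits, so nothing crosses to the CPU before masking. The helper name and call site are assumptions, not the profiled ExLlamaV2 change:

```python
import torch


def apply_mask_on_device(logits: torch.Tensor, allowed_token_ids: torch.Tensor) -> torch.Tensor:
    """Mask disallowed tokens on whatever device holds the logits (e.g. the GPU),
    avoiding a full-logits transfer to the CPU just to apply the mask.
    `allowed_token_ids` is assumed to be a 1-D LongTensor of legal token ids."""
    mask = torch.full_like(logits, float("-inf"))
    mask.index_fill_(-1, allowed_token_ids.to(logits.device), 0.0)
    return logits + mask
```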
@lapp0 Ah, I misread; that sounds good! To do mask creation on the GPU, can this then be done fully on the outlines side (after your PR)?
Two changes are necessary:

Edit: Here are the benchmarks of the overhead for a single generated token (batch size 4) for #1192. I'm not sure we need any changes for now, as the CPU and GPU operations are both efficient. Let me know if you observe something different, though.

| Benchmark (Parameter) | After [8aa0b0d] |
| --- | --- |
@lapp0 Thanks! Yeah, I know someone hosting an outlines server with this code, and he noticed that his production server with very weak CPUs was very slow, which matches our observation, haha. I think what's best may depend on the type of hardware in production.
@isamu-isozaki could you please open a discussion with further details about their performance regression? I'll take a look and see what the root cause is.
This fixes #1009. Also fixes #807.

The tests I did were:

For loading:

Choices test:

Returns

JSON test:

Returns