Dynamic Generator #109

Merged
bdashore3 merged 11 commits into main from dynamic-gen
May 26, 2024


bdashore3
Member

@bdashore3 bdashore3 commented May 26, 2024

Why should this feature be added?
Dependency compat upgrade

Additional context
Adds support for ExllamaV2's dynamic generator, which handles prompt processing and multi-user management internally.

Resolves #77, #98, #107

bdashore3 added 11 commits May 23, 2024 00:13
Adds basic support for ExllamaV2's dynamic generator. Can generate
a streaming and non-streaming completion.

Signed-off-by: kingbri <[email protected]>
Dynamic gen takes in filters differently. Adjust to set the filter list
per class rather than in the generation function.

Signed-off-by: kingbri <[email protected]>
The new async dynamic job allows for native async support without the
need of threading. Also add logprobs and metrics back to responses.

Signed-off-by: kingbri <[email protected]>
The dynamic generator needs multiple prompts to be tokenized and sent
together so that they are sampled in serial but generated in parallel.

Signed-off-by: kingbri <[email protected]>
In the form of min_new_tokens. Stopping strings take priority.

Signed-off-by: kingbri <[email protected]>
The dynamic generator requires Flash attention 2.5.7 or higher to
be installed. This is only supported on Nvidia's 30 series and higher.

If a card is AMD or lower than the 30 series, switch to compatibility
mode which functions the same way as the older generator, except
without parallel batching and any features that depend on it, such as
CFG.

Signed-off-by: kingbri <[email protected]>
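
To make the fallback rule in the commit above concrete, here is a minimal sketch of the decision it describes. It is not the code from this PR: `GpuInfo`, `parse_version`, and `use_compatibility_mode` are hypothetical names, and the check assumes that "30 series and higher" corresponds to CUDA compute capability 8.x or above (Ampere).

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

MIN_FLASH_ATTN = (2, 5, 7)  # minimum Flash Attention version named in the commit
MIN_COMPUTE_MAJOR = 8       # assumption: Ampere (30 series) is compute capability 8.x


@dataclass
class GpuInfo:
    """Hypothetical description of one GPU; not a type from the PR."""
    is_amd: bool
    compute_major: int


def parse_version(version: str) -> Tuple[int, int, int]:
    """Turn '2.5.7', '2.5.9.post1' or '2.5.7+cu122' into a tuple of three ints."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break  # keep only the leading digits of each component
            digits += ch
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)  # pad short versions such as '2.5'
    return tuple(parts)


def use_compatibility_mode(flash_attn_version: Optional[str],
                           gpus: List[GpuInfo]) -> bool:
    """Return True when the older, non-batching generator path should be used."""
    if flash_attn_version is None:
        return True  # Flash Attention is not installed
    if parse_version(flash_attn_version) < MIN_FLASH_ATTN:
        return True
    # A single unsupported card is enough to fall back.
    return any(g.is_amd or g.compute_major < MIN_COMPUTE_MAJOR for g in gpus)
```

In a real server the GPU list would come from the runtime (for example from the CUDA device properties) rather than being built by hand.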
Add a sequential lock and wait until jobs are completed before executing
any loading requests that directly alter the model. However, we also
need to block any new requests that come in until the load is finished,
so add a condition that triggers once the lock is free.

Signed-off-by: kingbri <[email protected]>
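
The locking scheme in the commit above can be sketched with an `asyncio.Condition`. This is an illustration under assumptions, not the implementation from this PR: `ModelGate` and its methods are hypothetical names, and the sleeps in `demo` stand in for real generation and model loading.

```python
import asyncio


class ModelGate:
    """Hypothetical helper: jobs run concurrently, loads get exclusive access."""

    def __init__(self):
        self._cond = asyncio.Condition()
        self._active_jobs = 0
        self._loading = False

    async def start_job(self):
        async with self._cond:
            # New requests wait here while a load is pending or running.
            await self._cond.wait_for(lambda: not self._loading)
            self._active_jobs += 1

    async def finish_job(self):
        async with self._cond:
            self._active_jobs -= 1
            self._cond.notify_all()

    async def start_load(self):
        async with self._cond:
            # Allow only one load at a time.
            await self._cond.wait_for(lambda: not self._loading)
            self._loading = True  # from here on, new jobs are blocked
            # Wait until every job that was already running has finished.
            await self._cond.wait_for(lambda: self._active_jobs == 0)

    async def finish_load(self):
        async with self._cond:
            self._loading = False
            self._cond.notify_all()  # release the requests that queued up


async def demo():
    gate = ModelGate()
    events = []

    async def job(name, duration):
        await gate.start_job()
        events.append(f"{name} start")
        await asyncio.sleep(duration)  # stand-in for generation
        events.append(f"{name} end")
        await gate.finish_job()

    async def load():
        await gate.start_load()
        events.append("load start")
        await asyncio.sleep(0.05)  # stand-in for loading a model
        events.append("load end")
        await gate.finish_load()

    first = asyncio.create_task(job("a", 0.2))
    await asyncio.sleep(0.02)  # job "a" is now running
    loader = asyncio.create_task(load())
    await asyncio.sleep(0.02)  # the load is now waiting for "a"
    late = asyncio.create_task(job("b", 0.01))
    await asyncio.gather(first, loader, late)
    return events


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In the demo, the load waits for job "a" to finish, and job "b", which arrives while the load is pending, starts only after the load completes. Production code would call `finish_job` and `finish_load` in `try`/`finally` blocks so a failed job or load cannot leave the gate closed.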
v0.1.0

Signed-off-by: kingbri <[email protected]>
FA2 v2.5.7 and up is not supported on GPUs older than Ampere or on AMD GPUs.
Clarify the error message and explain what happens as a result.

Signed-off-by: kingbri <[email protected]>
No need to use extend if the array is length 1.

Signed-off-by: kingbri <[email protected]>
List comprehensions are the more "pythonic" way to approach mapping
values to a list. They're also more flexible across different collection
types than the built-in map function. It's best to keep one convention
rather than splitting into two.

Signed-off-by: kingbri <[email protected]>
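
As a small illustration of the convention described in the commit above (the values are made up and not taken from the PR), the same mapping can be written with `map` or with a comprehension, and the comprehension syntax carries over to sets and dicts:

```python
token_ids = [3, 1, 4]  # hypothetical sample data

# Built-in map returns a lazy iterator, so it has to be wrapped in list().
via_map = list(map(str, token_ids))

# The equivalent list comprehension builds the list directly.
via_comprehension = [str(t) for t in token_ids]

# The same syntax extends to other collection types.
as_set = {str(t) for t in token_ids}
as_dict = {t: str(t) for t in token_ids}
```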
@bdashore3 bdashore3 merged commit 9fbbc5a into main May 26, 2024
1 check passed
@bdashore3 bdashore3 deleted the dynamic-gen branch May 27, 2024 16:10
Development

Successfully merging this pull request may close these issues.

Cleanup and improve API performance/concurrency
1 participant