Tensor p #173

Merged
bdashore3 merged 5 commits into main from tensor-p on Aug 22, 2024
Conversation

bdashore3
Member

Merges tensor parallel support into main.

Unifies the switch statement across both draft and model caches.

Signed-off-by: kingbri <[email protected]>
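As a rough sketch (not the exact code in this PR), a single cache-type switch shared by the draft and main model caches could look like the following, assuming a `cache_mode` string option and exllamav2's cache classes:

```python
from exllamav2 import ExLlamaV2Cache, ExLlamaV2Cache_Q4


def create_cache(model, cache_mode: str, lazy: bool = False):
    # One switch used for both the draft and the main model cache,
    # instead of separate if/else chains in each loader path.
    match cache_mode:
        case "Q4":
            return ExLlamaV2Cache_Q4(model, lazy=lazy)
        case _:
            # FP16 cache is the default.
            return ExLlamaV2Cache(model, lazy=lazy)
```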
Use the tensor parallel loader when the flag is enabled. The new loader
has its own autosplit implementation, so gpu_split_auto isn't valid
here.

Also make it easier to determine which cache type to use rather than
multiple if/else statements.

Signed-off-by: kingbri <[email protected]>
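A hedged sketch of the loader selection described above (function and parameter names here are illustrative, not TabbyAPI's actual config fields): when the tensor parallel flag is set, the TP loader is used and gpu_split_auto is skipped, since load_tp performs its own split.

```python
from exllamav2 import (
    ExLlamaV2,
    ExLlamaV2Cache,
    ExLlamaV2Cache_TP,
    ExLlamaV2Config,
)


def load_exl2_model(model_dir: str, tensor_parallel: bool,
                    gpu_split_auto: bool, gpu_split=None):
    config = ExLlamaV2Config(model_dir)
    model = ExLlamaV2(config)

    if tensor_parallel:
        # The TP loader has its own autosplit implementation, so
        # gpu_split_auto is not valid here; an explicit split can
        # still be passed through if one was configured.
        model.load_tp(gpu_split=gpu_split)
        cache = ExLlamaV2Cache_TP(model)
    elif gpu_split_auto:
        # Autosplit loads layer by layer while filling a lazy cache.
        cache = ExLlamaV2Cache(model, lazy=True)
        model.load_autosplit(cache)
    else:
        model.load(gpu_split=gpu_split)
        cache = ExLlamaV2Cache(model)

    return model, cache
```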
Newer versions of exl2 (v1.9-dev) have quantized cache implemented. Add
those APIs.

Signed-off-by: kingbri <[email protected]>
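The quantized cache APIs referenced here are presumably exllamav2's Q4/Q6/Q8 KV cache classes; a minimal sketch of exposing them behind a cache mode option (the mode strings and mapping are assumptions, not necessarily TabbyAPI's exact table):

```python
from exllamav2 import (
    ExLlamaV2Cache,
    ExLlamaV2Cache_Q4,
    ExLlamaV2Cache_Q6,
    ExLlamaV2Cache_Q8,
)

# Cache mode -> exllamav2 cache class; Q6 and Q8 are the newer additions.
CACHE_CLASSES = {
    "FP16": ExLlamaV2Cache,
    "Q4": ExLlamaV2Cache_Q4,
    "Q6": ExLlamaV2Cache_Q6,
    "Q8": ExLlamaV2Cache_Q8,
}
```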
v0.1.9

Signed-off-by: kingbri <[email protected]>
Exists in stable ExllamaV2 version.

Signed-off-by: kingbri <[email protected]>
@bdashore3 bdashore3 merged commit 364032e into main Aug 22, 2024
1 check passed
@bdashore3 bdashore3 deleted the tensor-p branch September 15, 2024 02:19