Tensor p #173

Merged
bdashore3 merged 5 commits into main from tensor-p on Aug 22, 2024
Conversation

bdashore3
Member

Merges tensor parallel support into main.

Unifies the switch statement across both draft and model caches.

Signed-off-by: kingbri <[email protected]>
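As a rough sketch (not the exact code in this PR), a single cache-type switch shared by the draft and main model caches could look like the following, assuming a `cache_mode` string option and exllamav2's cache classes:

```python
from exllamav2 import ExLlamaV2Cache, ExLlamaV2Cache_Q4


def create_cache(model, cache_mode: str, lazy: bool = False):
    # One switch used for both the draft and the main model cache,
    # instead of separate if/else chains in each loader path.
    match cache_mode:
        case "Q4":
            return ExLlamaV2Cache_Q4(model, lazy=lazy)
        case _:
            # FP16 cache is the default.
            return ExLlamaV2Cache(model, lazy=lazy)
```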
Use the tensor parallel loader when the flag is enabled. The new loader
has its own autosplit implementation, so gpu_split_auto isn't valid
here.

Also make it easier to determine which cache type to use rather than
multiple if/else statements.

Signed-off-by: kingbri <[email protected]>
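A hedged sketch of the loader selection described above (function and parameter names here are illustrative, not TabbyAPI's actual config fields): when the tensor parallel flag is set, the TP loader is used and gpu_split_auto is skipped, since load_tp performs its own split.

```python
from exllamav2 import (
    ExLlamaV2,
    ExLlamaV2Cache,
    ExLlamaV2Cache_TP,
    ExLlamaV2Config,
)


def load_exl2_model(model_dir: str, tensor_parallel: bool,
                    gpu_split_auto: bool, gpu_split=None):
    config = ExLlamaV2Config(model_dir)
    model = ExLlamaV2(config)

    if tensor_parallel:
        # The TP loader has its own autosplit implementation, so
        # gpu_split_auto is not valid here; an explicit split can
        # still be passed through if one was configured.
        model.load_tp(gpu_split=gpu_split)
        cache = ExLlamaV2Cache_TP(model)
    elif gpu_split_auto:
        # Autosplit loads layer by layer while filling a lazy cache.
        cache = ExLlamaV2Cache(model, lazy=True)
        model.load_autosplit(cache)
    else:
        model.load(gpu_split=gpu_split)
        cache = ExLlamaV2Cache(model)

    return model, cache
```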
Newer versions of exl2 (v1.9-dev) have quantized cache implemented. Add
those APIs.

Signed-off-by: kingbri <[email protected]>
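The quantized cache APIs referenced here are presumably exllamav2's Q4/Q6/Q8 KV cache classes; a minimal sketch of exposing them behind a cache mode option (the mode strings and mapping are assumptions, not necessarily TabbyAPI's exact table):

```python
from exllamav2 import (
    ExLlamaV2Cache,
    ExLlamaV2Cache_Q4,
    ExLlamaV2Cache_Q6,
    ExLlamaV2Cache_Q8,
)

# Cache mode -> exllamav2 cache class; Q6 and Q8 are the newer additions.
CACHE_CLASSES = {
    "FP16": ExLlamaV2Cache,
    "Q4": ExLlamaV2Cache_Q4,
    "Q6": ExLlamaV2Cache_Q6,
    "Q8": ExLlamaV2Cache_Q8,
}
```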
v0.1.9

Signed-off-by: kingbri <[email protected]>
Exists in stable ExllamaV2 version.

Signed-off-by: kingbri <[email protected]>
@bdashore3 bdashore3 merged commit 364032e into main Aug 22, 2024
1 check passed
@bdashore3 bdashore3 deleted the tensor-p branch September 15, 2024 02:19