-
@Koesn Sorry for the late response, I didn't get a notification from GH discussions. The llama.cpp server is OpenAI-compatible, and you can use the OpenAI-like endpoints with Paddler the same way you use the llama.cpp-specific ones: https://github.com/ggerganov/llama.cpp/tree/master/examples/server#post-v1chatcompletions-openai-compatible-chat-completions-api From what I see, llama.cpp already supports everything that exllamav2 does. Do you have any benefits in mind from using TabbyAPI over llama.cpp? Overall I am all for supporting runners other than llama.cpp.
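For illustration, here is a minimal sketch of calling the OpenAI-compatible chat-completions endpoint through a Paddler balancer, using only the Python standard library. The host, port, and model name are assumptions for this example; substitute the address your balancer actually listens on.

```python
import json
import urllib.request

# Hypothetical Paddler balancer address; replace with your own deployment.
PADDLER_URL = "http://127.0.0.1:8080/v1/chat/completions"

# OpenAI-style request body; llama.cpp server accepts this shape.
payload = {
    "model": "default",  # model name is largely informational for llama.cpp server
    "messages": [
        {"role": "user", "content": "Hello, who are you?"},
    ],
}

req = urllib.request.Request(
    PADDLER_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# urllib.request.urlopen(req) would return an OpenAI-style JSON response
# with a "choices" list; it is left commented out so the sketch runs offline.
```

Because Paddler just balances requests across llama.cpp instances, the same request body works whether you talk to a single llama.cpp server or to the balancer.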
-
This is great work. Does Paddler work specifically with the llama.cpp server, or is it OpenAI-compatible? Are there any plans to support an exllamav2 server like TabbyAPI?