
Support dynamic LoRA serving #132

Open
nstogner opened this issue Aug 27, 2024 · 6 comments

@nstogner
Contributor

KubeAI should be able to serve dynamically loaded LoRA adapters. Eventually KubeAI could produce these adapters by supporting a finetuning API endpoint; however, that can be implemented separately.

I see 2 primary options:

Option A

KubeAI handles shipping adapters to different server Pods (e.g. kubectl cp or a shared filesystem) and keeps track of which Pods have which adapters.

Option B

Server backends handle dynamic loading of adapters themselves, and KubeAI just keeps track of which Pods already have which adapters in order to load balance effectively.

Not yet supported in vLLM: vllm-project/vllm#6275
Currently supported by Lorax: https://github.com/predibase/lorax
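
For illustration, here is a minimal sketch of the adapter-to-Pod bookkeeping that Option B implies on the KubeAI side (names like `AdapterTracker` and `pick_pod` are hypothetical, not existing KubeAI APIs):

```python
# Hypothetical sketch of Option B bookkeeping: route requests to a Pod that
# already has the requested adapter loaded, otherwise fall back to any Pod
# and let the backend load the adapter on demand.
import random
from collections import defaultdict


class AdapterTracker:
    def __init__(self):
        # adapter name -> set of Pod names known to have it loaded
        self.pods_with_adapter = defaultdict(set)

    def mark_loaded(self, adapter: str, pod: str) -> None:
        self.pods_with_adapter[adapter].add(pod)

    def mark_pod_gone(self, pod: str) -> None:
        # Forget a Pod when it is deleted or becomes unready.
        for pods in self.pods_with_adapter.values():
            pods.discard(pod)

    def pick_pod(self, adapter: str, all_pods: list[str]) -> str:
        ready = self.pods_with_adapter[adapter] & set(all_pods)
        return random.choice(sorted(ready) if ready else all_pods)


tracker = AdapterTracker()
tracker.mark_loaded("customer-a", "vllm-pod-0")
print(tracker.pick_pod("customer-a", ["vllm-pod-0", "vllm-pod-1"]))
```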

@ffais

ffais commented Oct 31, 2024

This feature would be very useful. Is there an estimate for when the integration will land?

@nstogner
Contributor Author

nstogner commented Oct 31, 2024

We have a few competing priorities right now:

  1. LoRA
  2. vLLM Cache-aware routing (should greatly improve perf when model replicas >1)
  3. Scaling a given model across different types of infra (i.e. CPU -> GPU -> TPU)

Will work to define some dates soon. Sounds like LoRA would be your number 1 priority out of those?

@ffais

ffais commented Oct 31, 2024

Yes, serving dynamically loaded LoRA adapters is our number 1 priority.

@nstogner
Contributor Author

nstogner commented Nov 1, 2024

Can you provide some details on your use case so that we can make sure that we will solve it? Where do you store the adapters? What would be the total expected number of adapter variants you would have for a given model? How are you serving them today?

@ffais

ffais commented Nov 4, 2024

At the moment we have 2 instances of Lorax deployed, each with a different base model and 4-5 adapters per model. We're storing all adapters on S3.
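
For context, a per-request adapter selection against a LoRAX server looks roughly like the sketch below (the URL, adapter ID, and S3 path are placeholders; it assumes LoRAX's `adapter_id`/`adapter_source` request parameters):

```python
# Illustrative only: asking a LoRAX deployment to apply a specific adapter
# on top of its base model for a single request.
import requests

resp = requests.post(
    "http://lorax.example.internal/generate",  # placeholder endpoint
    json={
        "inputs": "Summarize the ticket below:\n...",
        "parameters": {
            "max_new_tokens": 128,
            # The adapter to apply for this request, fetched from S3.
            "adapter_id": "my-bucket/path/to/adapter",
            "adapter_source": "s3",
        },
    },
    timeout=60,
)
print(resp.json()["generated_text"])
```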

@nstogner
Contributor Author

nstogner commented Nov 6, 2024

Started work on this; we will tackle it as our next big feature.
