
Support dynamic LoRA serving #132

Open
nstogner opened this issue Aug 27, 2024 · 6 comments

@nstogner
Contributor

KubeAI should be able to serve dynamically loaded LoRA adapters. Eventually KubeAI could produce these adapters by supporting a finetuning API endpoint; however, that can be implemented separately.

I see 2 primary options:

Option A

KubeAI handles shipping adapters to different server Pods (e.g. kubectl cp or a shared filesystem) and keeps track of which Pods have which adapters.

Option B

Server backends handle dynamic loading of adapters themselves, and KubeAI just keeps track of which Pods already have which adapters in order to load balance effectively.

Not yet supported in vLLM: vllm-project/vllm#6275
Currently supported by Lorax: https://github.com/predibase/lorax
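
For illustration, here is a minimal sketch of the adapter-to-Pod bookkeeping that Option B implies on the KubeAI side (names like `AdapterTracker` and `pick_pod` are hypothetical, not existing KubeAI APIs):

```python
# Hypothetical sketch of Option B bookkeeping: route requests to a Pod that
# already has the requested adapter loaded, otherwise fall back to any Pod
# and let the backend load the adapter on demand.
import random
from collections import defaultdict


class AdapterTracker:
    def __init__(self):
        # adapter name -> set of Pod names known to have it loaded
        self.pods_with_adapter = defaultdict(set)

    def mark_loaded(self, adapter: str, pod: str) -> None:
        self.pods_with_adapter[adapter].add(pod)

    def mark_pod_gone(self, pod: str) -> None:
        # Forget a Pod when it is deleted or becomes unready.
        for pods in self.pods_with_adapter.values():
            pods.discard(pod)

    def pick_pod(self, adapter: str, all_pods: list[str]) -> str:
        ready = self.pods_with_adapter[adapter] & set(all_pods)
        return random.choice(sorted(ready) if ready else all_pods)


tracker = AdapterTracker()
tracker.mark_loaded("customer-a", "vllm-pod-0")
print(tracker.pick_pod("customer-a", ["vllm-pod-0", "vllm-pod-1"]))
```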

@ffais

ffais commented Oct 31, 2024

This feature would be very useful. Is there an estimate for when the integration will land?

@nstogner
Contributor Author

nstogner commented Oct 31, 2024

We have a few competing priorities right now:

  1. LoRA
  2. vLLM Cache-aware routing (should greatly improve perf when model replicas >1)
  3. Scaling a given model across different types of infra (i.e. CPU -> GPU -> TPU)

Will work to define some dates soon. Sounds like LoRA would be your number 1 priority out of those?

@ffais

ffais commented Oct 31, 2024

Yes, serving dynamically loaded LoRA adapters is our number 1 priority.

@nstogner
Contributor Author

nstogner commented Nov 1, 2024

Can you provide some details on your use case so that we can make sure that we will solve it? Where do you store the adapters? What would be the total expected number of adapter variants you would have for a given model? How are you serving them today?

@ffais

ffais commented Nov 4, 2024

At the moment we have 2 instances of Lorax deployed, each with a different base model and 4-5 adapters per model. We're storing all adapters on S3.
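
For context, a per-request adapter selection against a LoRAX server looks roughly like the sketch below (the URL, adapter ID, and S3 path are placeholders; it assumes LoRAX's `adapter_id`/`adapter_source` request parameters):

```python
# Illustrative only: asking a LoRAX deployment to apply a specific adapter
# on top of its base model for a single request.
import requests

resp = requests.post(
    "http://lorax.example.internal/generate",  # placeholder endpoint
    json={
        "inputs": "Summarize the ticket below:\n...",
        "parameters": {
            "max_new_tokens": 128,
            # The adapter to apply for this request, fetched from S3.
            "adapter_id": "my-bucket/path/to/adapter",
            "adapter_source": "s3",
        },
    },
    timeout=60,
)
print(resp.json()["generated_text"])
```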

@nstogner
Contributor Author

nstogner commented Nov 6, 2024

Started work on this; we will tackle it as our next big feature.
