Support dynamic LoRA serving #132
This feature would be very useful. Are there any estimates on the integration?
We have a few competing priorities right now:
We will work to define some dates soon. It sounds like LoRA would be your number 1 priority out of those?
Yes, serving dynamically loaded LoRA adapters is our number 1 priority.
Can you provide some details on your use case so that we can make sure we solve it? Where do you store the adapters? What is the total expected number of adapter variants you would have for a given model? How are you serving them today?
At the moment we have 2 instances of Lorax deployed, each with a different base model and 4-5 adapters. We're storing all adapters on S3.
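For illustration, per-request adapter selection against a LoRAX-style OpenAI-compatible endpoint looks roughly like the sketch below. The endpoint URL and adapter ID are placeholders, and the assumption that the adapter is named via the model field may vary by version; the base model itself stays fixed per deployment.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	// Placeholder endpoint -- substitute the actual service address.
	const endpoint = "http://lorax.internal.example/v1/completions"

	// With LoRAX-style serving, the adapter is chosen per request
	// (assumed here to be passed via the "model" field).
	body, err := json.Marshal(map[string]any{
		"model":      "my-adapter", // placeholder adapter identifier stored on S3
		"prompt":     "Hello",
		"max_tokens": 16,
	})
	if err != nil {
		panic(err)
	}

	resp, err := http.Post(endpoint, "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```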
We've started work on this and will tackle it as our next big feature.
KubeAI should be able to serve dynamically loaded LoRA adapters. Eventually KubeAI could also produce these adapters by supporting a fine-tuning API endpoint; however, that can be implemented separately.
I see 2 primary options:
Option A
KubeAI handles shipping adapters to different server Pods (e.g. `kubectl cp` or a shared filesystem) and keeps track of which Pods have which adapters.
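For illustration, the bookkeeping Option A implies could look roughly like the sketch below. The type and function names are made up for this example (not actual KubeAI code); it only covers remembering which Pods an adapter has been shipped to so requests can be routed without another copy.

```go
package main

import "fmt"

// adapterPlacement sketches Option A's controller-side state: KubeAI copies
// adapter files to Pods (e.g. kubectl cp or a shared volume) and records
// which Pod holds which adapter.
type adapterPlacement struct {
	// podAdapters maps Pod name -> set of adapter IDs present on that Pod.
	podAdapters map[string]map[string]bool
}

func newAdapterPlacement() *adapterPlacement {
	return &adapterPlacement{podAdapters: map[string]map[string]bool{}}
}

// MarkShipped records that an adapter has been copied onto a Pod.
func (p *adapterPlacement) MarkShipped(pod, adapter string) {
	if p.podAdapters[pod] == nil {
		p.podAdapters[pod] = map[string]bool{}
	}
	p.podAdapters[pod][adapter] = true
}

// PodsWith returns the Pods that already hold the adapter.
func (p *adapterPlacement) PodsWith(adapter string) []string {
	var pods []string
	for pod, adapters := range p.podAdapters {
		if adapters[adapter] {
			pods = append(pods, pod)
		}
	}
	return pods
}

func main() {
	pl := newAdapterPlacement()
	pl.MarkShipped("model-pod-0", "adapter-a")
	pl.MarkShipped("model-pod-1", "adapter-b")
	fmt.Println(pl.PodsWith("adapter-a")) // [model-pod-0]
}
```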
Option B
Server backends handle dynamic loading of adapters themselves, and KubeAI just keeps track of which Pods already have which adapters in order to load balance effectively.
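For Option B, the load-balancing side could look roughly like the sketch below: prefer a Pod that already has the adapter loaded, otherwise fall back to some Pod and ask its backend to load the adapter itself. The load_adapter endpoint and port are hypothetical; the real call depends on the backend's own dynamic-loading API (see the links below).

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

// pickPod prefers a Pod that has already loaded the adapter; otherwise it
// falls back to a Pod and asks that backend to load the adapter itself.
func pickPod(pods []string, loaded map[string][]string, adapter string) string {
	for _, pod := range pods {
		for _, a := range loaded[pod] {
			if a == adapter {
				return pod // affinity hit: route here, no extra load needed
			}
		}
	}
	if len(pods) == 0 {
		return ""
	}
	// Naive fallback; a real balancer would pick the least-loaded Pod.
	pod := pods[0]
	// Hypothetical backend call -- not a real vLLM/Lorax endpoint name.
	// Error handling omitted for brevity.
	body := []byte(fmt.Sprintf(`{"adapter_id": %q}`, adapter))
	http.Post("http://"+pod+":8000/load_adapter", "application/json", bytes.NewReader(body))
	return pod
}

func main() {
	loaded := map[string][]string{"pod-0": {"adapter-a"}}
	pods := []string{"pod-0", "pod-1"}
	fmt.Println(pickPod(pods, loaded, "adapter-a")) // pod-0 (already loaded)
	fmt.Println(pickPod(pods, loaded, "adapter-b")) // pod-0 (after triggering a load)
}
```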
Not yet supported in vLLM: vllm-project/vllm#6275
Currently supported by Lorax: https://github.com/predibase/lorax