-
Hi @nivibilla, yes absolutely, this is the perfect use case for the LiteLLM Router. If you want Python code to run inference against the servers, here's how you can do so: https://docs.litellm.ai/docs/routing

from litellm import Router
# list of model deployments
model_list = [
{
"model_name": "llama2", # model alias
"litellm_params": { # params for litellm completion/embedding call
"model": "llama2", # actual model name
"custom_llm_provider": "openai", # tell litellm to route this as an openai endpoint
"api_base": "http://192.168.1.23:8000/v1"
}
},
{
"model_name": "llama2",
"litellm_params": {
"model": "llama2",
"custom_llm_provider": "openai",
"api_base": "http://192.168.1.23:8010/v1"
}
},
{
"model_name": "llama2",
"litellm_params": {
"model": "llama2",
"custom_llm_provider": "openai",
"api_base": "http://192.168.1.23:8001/v1"
}
},
]
router = Router(model_list=model_list)
# openai.ChatCompletion.create replacement
response = router.completion(model="llama2",  # use the model alias defined in model_list
    messages=[{"role": "user", "content": "Hey, how's it going?"}])
print(response)
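As a follow-up: if you want to push a whole dataset through the router rather than a single request, one option (just a sketch on top of the router above - the prompt list and thread count are placeholders) is to wrap router.completion in a thread pool, so each request gets load-balanced across the deployments in model_list:

from concurrent.futures import ThreadPoolExecutor

prompts = ["What is the capital of France?", "Summarize the plot of Hamlet."]  # placeholder dataset

def ask(prompt):
    # each call is load-balanced by the Router across the llama2 deployments above
    return router.completion(
        model="llama2",
        messages=[{"role": "user", "content": prompt}],
    )

with ThreadPoolExecutor(max_workers=8) as pool:
    responses = list(pool.map(ask, prompts))

print(responses[0])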
-
Please let me know if you run into any issues - I want to make sure it works for your use case. Here's our Discord: https://discord.com/invite/wuPM9dRgDw + I reached out on LinkedIn.
-
Hi @ishaan-jaff. Thanks for the code, I understand the routing part. But I see these are endpoints. Is it possible to point them to variables instead? As in, if I had the model instances as variables, how can I route these / do parallel inference over a dataset? I'm clear on how to route for live inference, but I'm trying to find a solution for batch inference.
-
@nivibilla check this out - https://docs.litellm.ai/docs/providers/vllm. I believe this would just require changing the model name to the format shown there.
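For reference, a minimal sketch of the local-vLLM route that page describes (assuming the vllm/ model-name prefix from those docs and that vllm is installed; the model name here is just an example):

import litellm

# the "vllm/" prefix tells litellm to run the model with a local vLLM engine
# instead of calling an HTTP endpoint
response = litellm.completion(
    model="vllm/meta-llama/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "Hey, how's it going?"}],
)
print(response)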
-
Let me know if this solves your problem.
-
Hi @nivibilla, do you have a solution now? I am facing the same need. I also want to use multiple vLLM instances for batch inference over a dataset.
-
@darrenglow I've migrated to using Ray. You can instantiate N workers, each of which loads a vLLM instance, and distribute the data over all available workers.
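Roughly, the setup looks like this (a sketch only - the model name, worker count, and sampling settings are placeholders, not exactly what I ran):

import ray
from vllm import LLM, SamplingParams

@ray.remote(num_gpus=1)  # Ray pins one GPU per actor via CUDA_VISIBLE_DEVICES
class VLLMWorker:
    def __init__(self, model_name):
        self.llm = LLM(model=model_name)  # one independent vLLM instance per worker

    def generate(self, prompts):
        params = SamplingParams(temperature=0.0, max_tokens=256)
        return [out.outputs[0].text for out in self.llm.generate(prompts, params)]

ray.init()
num_workers = 4
prompts = ["prompt 1", "prompt 2", "prompt 3", "prompt 4"]  # placeholder dataset

workers = [VLLMWorker.remote("meta-llama/Llama-2-7b-chat-hf") for _ in range(num_workers)]
shards = [prompts[i::num_workers] for i in range(num_workers)]  # round-robin split of the data
results = ray.get([w.generate.remote(shard) for w, shard in zip(workers, shards)])

Each actor gets its own GPU, so the instances run completely independently, which is where the ~4x throughput comes from.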
-
Hi,
Is it possible to have multiple vLLM instances on a single machine that has multiple GPUs, and then run inference from code instead of through a server?
My reason being: currently I manually start 4 different notebooks to batch-inference over data using the same model, and I have to partition my dataset and run 4 instances of the same code.
If I could have a router to do this at the individual-request level and distribute requests to the 4 different model instances, throughput would increase by 4x. Btw, the reason for this is that tensor-parallel 4 does not increase throughput at all, whereas having 4 individual vLLM instances of the same model can give a 4x throughput increase.
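(For reference, the endpoint-per-GPU setup that the router example above load-balances across could be started with something like this sketch - the model name, ports, and GPU ids are purely illustrative:)

import os
import subprocess

MODEL = "meta-llama/Llama-2-7b-chat-hf"  # illustrative model name

procs = []
for gpu_id, port in enumerate([8000, 8001, 8002, 8003]):
    # one vLLM OpenAI-compatible server per GPU, each on its own port
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))
    procs.append(subprocess.Popen(
        ["python", "-m", "vllm.entrypoints.openai.api_server",
         "--model", MODEL, "--port", str(port)],
        env=env,
    ))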