-
Hi @nivibilla, yes absolutely, this is the perfect use case for the LiteLLM Router. If you want Python code to run inference against the servers, here's how you can do so: https://docs.litellm.ai/docs/routing

from litellm import Router
# list of model deployments
model_list = [
{
"model_name": "llama2", # model alias
"litellm_params": { # params for litellm completion/embedding call
"model": "llama2", # actual model name
"custom_llm_provider": "openai", # tell litellm to route this as an openai endpoint
"api_base": "http://192.168.1.23:8000/v1"
}
},
{
"model_name": "llama2",
"litellm_params": {
"model": "llama2",
"custom_llm_provider": "openai",
"api_base": "http://192.168.1.23:8010/v1"
}
},
{
"model_name": "llama2",
"litellm_params": {
"model": "llama2",
"custom_llm_provider": "openai",
"api_base": "http://192.168.1.23:8001/v1"
}
},
]
router = Router(model_list=model_list)
# openai.ChatCompletion.create replacement
response = router.completion(model="llama2",  # use the model alias defined in model_list
    messages=[{"role": "user", "content": "Hey, how's it going?"}])
print(response)
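As a follow-up: if you want to push a whole dataset through the router rather than a single request, one option (just a sketch on top of the router above - the prompt list and thread count are placeholders) is to wrap router.completion in a thread pool, so each request gets load-balanced across the deployments in model_list:

from concurrent.futures import ThreadPoolExecutor

prompts = ["What is the capital of France?", "Summarize the plot of Hamlet."]  # placeholder dataset

def ask(prompt):
    # each call is load-balanced by the Router across the llama2 deployments above
    return router.completion(
        model="llama2",
        messages=[{"role": "user", "content": prompt}],
    )

with ThreadPoolExecutor(max_workers=8) as pool:
    responses = list(pool.map(ask, prompts))

print(responses[0])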
-
Please let me know if you run into any issues - I want to make sure it works for your use case. Here's our Discord: https://discord.com/invite/wuPM9dRgDw + I reached out on LinkedIn.
-
Hi @ishaan-jaff. Thanks for the code, I understand the routing part. But I see these are endpoints. Is it possible to point them to variables instead? As in, if I had the model instances as variables, how can I route these / do parallel inference over a dataset? I'm clear on how to route for live inference, but I'm trying to find a solution for batch inference.
-
@nivibilla check this out - https://docs.litellm.ai/docs/providers/vllm. I believe this would just require changing the model name to the format shown there.
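For reference, a minimal sketch of the local-vLLM route that page describes (assuming the vllm/ model-name prefix from those docs and that vllm is installed; the model name here is just an example):

import litellm

# the "vllm/" prefix tells litellm to run the model with a local vLLM engine
# instead of calling an HTTP endpoint
response = litellm.completion(
    model="vllm/meta-llama/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "Hey, how's it going?"}],
)
print(response)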
-
Let me know if this solves your problem.
-
Hi @nivibilla, do you have a solution now? I am facing the same need. I also want to use multiple vLLM instances for batch inference over a dataset.
-
@darrenglow I've migrated to using Ray. You can instantiate N workers, each of which loads a vLLM instance, and distribute the data over all available workers.
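Roughly, the setup looks like this (a sketch only - the model name, worker count, and sampling settings are placeholders, not exactly what I ran):

import ray
from vllm import LLM, SamplingParams

@ray.remote(num_gpus=1)  # Ray pins one GPU per actor via CUDA_VISIBLE_DEVICES
class VLLMWorker:
    def __init__(self, model_name):
        self.llm = LLM(model=model_name)  # one independent vLLM instance per worker

    def generate(self, prompts):
        params = SamplingParams(temperature=0.0, max_tokens=256)
        return [out.outputs[0].text for out in self.llm.generate(prompts, params)]

ray.init()
num_workers = 4
prompts = ["prompt 1", "prompt 2", "prompt 3", "prompt 4"]  # placeholder dataset

workers = [VLLMWorker.remote("meta-llama/Llama-2-7b-chat-hf") for _ in range(num_workers)]
shards = [prompts[i::num_workers] for i in range(num_workers)]  # round-robin split of the data
results = ray.get([w.generate.remote(shard) for w, shard in zip(workers, shards)])

Each actor gets its own GPU, so the instances run completely independently, which is where the ~4x throughput comes from.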
-
Hi,
Is it possible to have multiple vLLM instances on a single machine that has multiple GPUs, and then run inference from code instead of through a server?
My reason being: currently I manually start 4 different notebooks to batch-inference over data using the same model, and I have to partition my dataset and run 4 instances of the same code.
If I could have a router to do this at the individual-request level and distribute requests to the 4 different model instances, throughput would increase by 4x. Btw, the reason for this is that tensor-parallel 4 does not increase throughput at all, whereas having 4 individual vLLM instances of the same model can give a 4x throughput increase.
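(For reference, the endpoint-per-GPU setup that the router example above load-balances across could be started with something like this sketch - the model name, ports, and GPU ids are purely illustrative:)

import os
import subprocess

MODEL = "meta-llama/Llama-2-7b-chat-hf"  # illustrative model name

procs = []
for gpu_id, port in enumerate([8000, 8001, 8002, 8003]):
    # one vLLM OpenAI-compatible server per GPU, each on its own port
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))
    procs.append(subprocess.Popen(
        ["python", "-m", "vllm.entrypoints.openai.api_server",
         "--model", MODEL, "--port", str(port)],
        env=env,
    ))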