A quick tutorial for deploying a vLLM instance using the official Docker image. You will need:

- Conda
- An NVIDIA GPU
If you are going to use Python with vLLM, it is best practice to set up a dedicated environment. You can follow the steps below to set up your Python environment. This was tested on Ubuntu only, but it should also work on macOS and Windows (WSL).
- Clone this repo:

  ```bash
  git clone https://github.com/pandego/vLLM-deployment.git
  cd vLLM-deployment
  ```
- Set up the environment:

  ```bash
  conda env create -f environment.yml
  conda activate vllm
  ```
- Install dependencies:

  ```bash
  poetry install --no-root
  ```
- Let's keep things clean: first, copy `default.env` into `.env`:

  ```bash
  cp default.env .env
  ```
- You might need to edit the contents of the `.env` file to add your Hugging Face token.
- Deploy your container:

  ```bash
  docker compose --env-file .env up --build -d
  ```

  Alternatively, you can run the same OpenAI-compatible server directly, without Docker:

  ```bash
  python -m vllm.entrypoints.openai.api_server \
      --model NousResearch/Meta-Llama-3-8B-Instruct \
      --dtype auto \
      --api-key EMPTY
  ```
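Once the server is up, a quick way to confirm it is responding is to hit the `/v1/models` endpoint. Below is a minimal Python sketch; it assumes the API is mapped to `localhost:11435`, as in the curl examples that follow:

```python
# Minimal liveness check for the OpenAI-compatible server (a sketch;
# port 11435 is assumed from the curl examples below, adjust if your
# compose file maps a different one).
import requests

resp = requests.get("http://localhost:11435/v1/models")
resp.raise_for_status()
print(resp.json())  # should list NousResearch/Meta-Llama-3-8B-Instruct
```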
- You can run the following commands to test the vLLM instance (see the Python sketch after this list). Be sure to change the `model` if necessary:
  - Using `completions`:

    ```bash
    curl http://localhost:11435/v1/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "NousResearch/Meta-Llama-3-8B-Instruct",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
      }'
    ```
  - Using `chat/completions`:

    ```bash
    curl http://localhost:11435/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "NousResearch/Meta-Llama-3-8B-Instruct",
        "messages": [
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": "Who won the world series in 2020?"}
        ]
      }'
    ```
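The same request can be made from Python with the `openai` client. This is a sketch, not code from the repo; it assumes `pip install openai` and reuses the endpoint and model from the curl examples above:

```python
# Sketch: the chat/completions request above, via the openai client
# (endpoint, api key, and model are assumptions taken from the curl examples).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11435/v1", api_key="EMPTY")

chat = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
    ],
)
print(chat.choices[0].message.content)
```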
- Run the following commands to test the deployed vLLM endpoint (a rough sketch of the LangChain approach follows this list):
  - `LangChain` example:

    ```bash
    python vLLM_example_LangChain.py
    ```

  - `OpenAI` example:

    ```bash
    python vLLM_example_OpenAI.py
    ```
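For a sense of what the LangChain script might contain, here is a hypothetical sketch (the actual `vLLM_example_LangChain.py` in the repo may differ; it assumes `pip install langchain-community` and the endpoint from the curl examples):

```python
# Hypothetical sketch of querying the vLLM endpoint through LangChain;
# the repo's vLLM_example_LangChain.py may do this differently.
from langchain_community.llms import VLLMOpenAI

llm = VLLMOpenAI(
    openai_api_key="EMPTY",
    openai_api_base="http://localhost:11435/v1",
    model_name="NousResearch/Meta-Llama-3-8B-Instruct",
)
print(llm.invoke("San Francisco is a"))
```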
Et Voilà ! 🎈
- You can check some more arguments in the `helper_args.json` file.
- Find more info in the vLLM documentation: https://docs.vllm.ai