Added the microservice of vLLM (#78)
* refine the vllm microservice

Signed-off-by: tianyil1 <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* rename the rayllm to ray_serve

Signed-off-by: tianyil1 <[email protected]>

* refactor the ray service code structure

Signed-off-by: tianyil1 <[email protected]>

* refine the vllm and readme

Signed-off-by: tianyil1 <[email protected]>

* update the readme with correct ray service name

Signed-off-by: tianyil1 <[email protected]>

* update the readme

Signed-off-by: tianyil1 <[email protected]>

* refine the readme

Signed-off-by: tianyil1 <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: tianyil1 <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
tianyil1 and pre-commit-ci[bot] authored May 30, 2024
1 parent 3986c4f commit f0b0690
Showing 22 changed files with 298 additions and 29 deletions.
106 changes: 100 additions & 6 deletions comps/llms/README.md
@@ -2,9 +2,9 @@

This microservice, designed for Language Model Inference (LLM), processes input consisting of a query string and associated reranked documents. It constructs a prompt based on the query and documents, which is then used to perform inference with a large language model. The service delivers the inference results as output.

A prerequisite for using this microservice is that users must have a Text Generation Inference (TGI) service already running. Users need to set the TGI service's endpoint into an environment variable. The microservice utilizes this endpoint to create an LLM object, enabling it to communicate with the TGI service for executing language model operations.
A prerequisite for using this microservice is that users must have an LLM text generation service (e.g., TGI, vLLM, or Ray) already running. Users need to set the LLM service's endpoint in an environment variable. The microservice utilizes this endpoint to create an LLM object, enabling it to communicate with the LLM service for executing language model operations.

Overall, this microservice offers a streamlined way to integrate large language model inference into applications, requiring minimal setup from the user beyond initiating a TGI service and configuring the necessary environment variables. This allows for the seamless processing of queries and documents to generate intelligent, context-aware responses.
Overall, this microservice offers a streamlined way to integrate large language model inference into applications, requiring minimal setup from the user beyond initiating a TGI/vLLM/Ray service and configuring the necessary environment variables. This allows for the seamless processing of queries and documents to generate intelligent, context-aware responses.
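
As a rough illustration of this flow, the sketch below reads the endpoint from an environment variable, folds the query and its reranked documents into a prompt, and sends the prompt to the backing service. The helper names and the use of `langchain_community.llms.HuggingFaceEndpoint` are assumptions for this example only, not the exact code in `llm.py`.

```python
# Illustrative sketch only; the actual microservice implementation may differ.
import os

from langchain_community.llms import HuggingFaceEndpoint  # assumed client for a TGI-style endpoint


def build_prompt(query: str, documents: list[str]) -> str:
    """Fold the reranked documents and the user query into a single prompt."""
    context = "\n".join(documents)
    return (
        "Answer the question based only on the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


def generate(query: str, documents: list[str]) -> str:
    # Endpoint of the already-running text generation service (TGI in this sketch).
    endpoint = os.getenv("TGI_LLM_ENDPOINT", "http://localhost:8008")
    llm = HuggingFaceEndpoint(endpoint_url=endpoint, max_new_tokens=128, temperature=0.01)
    return llm.invoke(build_prompt(query, documents))


if __name__ == "__main__":
    docs = ["Deep learning is a subset of machine learning based on neural networks."]
    print(generate("What is Deep Learning?", docs))
```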

# 🚀1. Start Microservice with Python (Option 1)

@@ -16,7 +16,9 @@ To start the LLM microservice, you need to install python packages first.
pip install -r requirements.txt
```

## 1.2 Start TGI Service
## 1.2 Start LLM Service

### 1.2.1 Start TGI Service

```bash
export HUGGINGFACEHUB_API_TOKEN=${your_hf_api_token}
@@ -26,7 +28,24 @@ export LANGCHAIN_PROJECT="opea/gen-ai-comps:llms"
docker run -p 8008:80 -v ./data:/data --name tgi_service --shm-size 1g ghcr.io/huggingface/text-generation-inference:1.4 --model-id ${your_hf_llm_model}
```

## 1.3 Verify the TGI Service
### 1.2.2 Start vLLM Service

```bash
export HUGGINGFACEHUB_API_TOKEN=${your_hf_api_token}
docker run -it --name vllm_service -p 8008:80 -e HF_TOKEN=${HUGGINGFACEHUB_API_TOKEN} -v ./data:/data vllm:cpu /bin/bash -c "cd / && export VLLM_CPU_KVCACHE_SPACE=40 && python3 -m vllm.entrypoints.openai.api_server --model ${your_hf_llm_model} --port 80"
```

### 1.2.3 Start Ray Service

```bash
export HUGGINGFACEHUB_API_TOKEN=${your_hf_api_token}
export TRUST_REMOTE_CODE=True
docker run -it --runtime=habana --name ray_serve_service -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --ipc=host -p 8008:80 -e HUGGINGFACEHUB_API_TOKEN=$HUGGINGFACEHUB_API_TOKEN -e TRUST_REMOTE_CODE=$TRUST_REMOTE_CODE ray_serve:habana /bin/bash -c "ray start --head && python api_server_openai.py --port_number 80 --model_id_or_path ${your_hf_llm_model} --chat_processor ${your_hf_chatprocessor}"
```

## 1.3 Verify the LLM Service

### 1.3.1 Verify the TGI Service

```bash
curl http://${your_ip}:8008/generate \
@@ -35,16 +54,54 @@ curl http://${your_ip}:8008/generate \
-H 'Content-Type: application/json'
```

## 1.4 Start LLM Service
### 1.3.2 Verify the vLLM Service

```bash
curl http://${your_ip}:8008/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": ${your_hf_llm_model},
"prompt": "What is Deep Learning?",
"max_tokens": 32,
"temperature": 0
}'
```
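
If you prefer to script the check, the snippet below is an equivalent call from Python using the `requests` package; the host and model id are placeholders to substitute with the values used to start the vLLM service.

```python
# Python equivalent of the curl check above (sketch; adjust host and model id).
import requests

resp = requests.post(
    "http://localhost:8008/v1/completions",  # replace localhost with ${your_ip}
    json={
        "model": "${your_hf_llm_model}",  # substitute the model id used to start vLLM
        "prompt": "What is Deep Learning?",
        "max_tokens": 32,
        "temperature": 0,
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```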

### 1.3.3 Verify the Ray Service

```bash
curl http://${your_ip}:8008/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": ${your_hf_llm_model},
"messages": [
{"role": "assistant", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is Deep Learning?"},
],
"max_tokens": 32,
"stream": True
}'
```
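
The same check can be driven from Python. The snippet below is a minimal sketch that posts a chat request and prints the streamed chunks, assuming the service follows the usual OpenAI-style server-sent-events format (`data: {...}` lines ending with `data: [DONE]`); adjust the host and model id as needed.

```python
# Minimal streaming client sketch; assumes an OpenAI-compatible SSE stream.
import json

import requests

URL = "http://localhost:8008/v1/chat/completions"  # replace localhost with ${your_ip}
payload = {
    "model": "${your_hf_llm_model}",  # substitute the model id used to start the service
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Deep Learning?"},
    ],
    "max_tokens": 32,
    "stream": True,
}

with requests.post(URL, json=payload, stream=True, timeout=600) as resp:
    resp.raise_for_status()
    for raw_line in resp.iter_lines():
        if not raw_line:
            continue
        line = raw_line.decode("utf-8")
        if not line.startswith("data: "):
            continue
        data = line[len("data: "):]
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        # Each chunk carries an incremental "delta" with part of the reply.
        delta = chunk["choices"][0].get("delta", {})
        print(delta.get("content", ""), end="", flush=True)
print()
```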

## 1.4 Start LLM Service with Python Script

### 1.4.1 Start the TGI Service

```bash
export TGI_LLM_ENDPOINT="http://${your_ip}:8008"
python text-generation/tgi/llm.py
```

### 1.4.2 Start the vLLM Service

```bash
export vLLM_LLM_ENDPOINT="http://${your_ip}:8008"
python text-generation/vllm/llm.py
```

# 🚀2. Start Microservice with Docker (Option 2)

If you start an LLM microservice with docker, the `docker_compose_llm.yaml` file will automatically start a TGI service with docker.
If you start the LLM microservice with Docker, the `docker_compose_llm.yaml` file will automatically start a TGI/vLLM backend service in Docker as well.

## 2.1 Setup Environment Variables

@@ -59,13 +116,33 @@ export LANGCHAIN_API_KEY=${your_langchain_api_key}
export LANGCHAIN_PROJECT="opea/llms"
```

In order to start the vLLM service and the LLM microservice, you need to set up the following environment variables first.

```bash
export HUGGINGFACEHUB_API_TOKEN=${your_hf_api_token}
export vLLM_LLM_ENDPOINT="http://${your_ip}:8008"
export LLM_MODEL_ID=${your_hf_llm_model}
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=${your_langchain_api_key}
export LANGCHAIN_PROJECT="opea/llms"
```

## 2.2 Build Docker Image

### 2.2.1 TGI

```bash
cd ../../
docker build -t opea/llm-tgi:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/llms/text-generation/tgi/Dockerfile .
```

### 2.2.2 vLLM

```bash
cd ../../
docker build -t opea/llm-vllm:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/llms/text-generation/vllm/Dockerfile .
```

To start a docker container, you have two options:

- A. Run Docker with CLI
@@ -75,17 +152,34 @@ You can choose one as needed.

## 2.3 Run Docker with CLI (Option A)

### 2.3.1 TGI

```bash
docker run -d --name="llm-tgi-server" -p 9000:9000 --ipc=host -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e TGI_LLM_ENDPOINT=$TGI_LLM_ENDPOINT -e HUGGINGFACEHUB_API_TOKEN=$HUGGINGFACEHUB_API_TOKEN opea/llm-tgi:latest
```

### 2.3.2 vLLM

```bash
docker run -d --name="llm-vllm-server" -p 9000:9000 --ipc=host -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e vLLM_LLM_ENDPOINT=$vLLM_LLM_ENDPOINT -e HUGGINGFACEHUB_API_TOKEN=$HUGGINGFACEHUB_API_TOKEN -e LLM_MODEL_ID=$LLM_MODEL_ID opea/llm-vllm:latest
```

## 2.4 Run Docker with Docker Compose (Option B)

### 2.4.1 TGI

```bash
cd text-generation/tgi
docker compose -f docker_compose_llm.yaml up -d
```

### 2.4.2 vLLM

```bash
cd text-generation/vllm
docker compose -f docker_compose_llm.yaml up -d
```

# 🚀3. Consume LLM Service

## 3.1 Check Service Status
@@ -21,13 +21,13 @@ export HUGGINGFACEHUB_API_TOKEN=<token>
And then you can make requests with the OpenAI-compatible APIs like below to check the service status:

```bash
curl http://127.0.0.1::8080/v1/chat/completions \
curl http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": <model_name>,
"messages": [
{"role": "assistant", "content": "You are a helpful assistant."},
{"role": "user", "content": args.input_text},
{"role": "user", "content": "What is Deep Learning?"},
],
"max_tokens": 32,
"stream": True
File renamed without changes.
@@ -15,8 +15,8 @@
from typing import Dict

from fastapi import HTTPException
from rayllm.api_openai_backend.openai_protocol import ModelCard, Prompt
from rayllm.api_openai_backend.request_handler import handle_request
from ray_serve.api_openai_backend.openai_protocol import ModelCard, Prompt
from ray_serve.api_openai_backend.request_handler import handle_request


class RouterQueryClient:
@@ -18,7 +18,7 @@

from fastapi import HTTPException, Request, status
from pydantic import ValidationError as PydanticValidationError
from rayllm.api_openai_backend.openai_protocol import ErrorResponse, FinishReason, ModelResponse, Prompt
from ray_serve.api_openai_backend.openai_protocol import ErrorResponse, FinishReason, ModelResponse, Prompt
from starlette.responses import JSONResponse


@@ -21,7 +21,7 @@
from fastapi import Response as FastAPIResponse
from fastapi import status
from fastapi.middleware.cors import CORSMiddleware
from rayllm.api_openai_backend.openai_protocol import (
from ray_serve.api_openai_backend.openai_protocol import (
ChatCompletionRequest,
ChatCompletionResponse,
ChatCompletionResponseChoice,
@@ -39,8 +39,8 @@
Prompt,
UsageInfo,
)
from rayllm.api_openai_backend.query_client import RouterQueryClient
from rayllm.api_openai_backend.request_handler import OpenAIHTTPException, openai_exception_handler
from ray_serve.api_openai_backend.query_client import RouterQueryClient
from ray_serve.api_openai_backend.request_handler import OpenAIHTTPException, openai_exception_handler
from starlette.responses import Response, StreamingResponse

# timeout in 10 minutes. Streaming can take longer than 3 min
@@ -19,7 +19,7 @@
from typing import List, Union

import jinja2
from rayllm.api_openai_backend.openai_protocol import ChatMessage, FunctionCall, ToolCall
from ray_serve.api_openai_backend.openai_protocol import ChatMessage, FunctionCall, ToolCall


class ToolsCallsTemplateContext(Enum):
@@ -20,9 +20,9 @@
import ray
from easydict import EasyDict as edict
from ray import serve
from rayllm.api_openai_backend.query_client import RouterQueryClient
from rayllm.api_openai_backend.router_app import Router, router_app
from rayllm.ray_serve import LLMServe
from ray_serve.api_openai_backend.query_client import RouterQueryClient
from ray_serve.api_openai_backend.router_app import Router, router_app
from ray_serve.ray_serve import LLMServe


def router_application(deployments, max_concurrent_queries):
@@ -18,7 +18,7 @@ cd docker

docker build \
-f Dockerfile ../../ \
-t rayllm:habana \
-t ray_serve:habana \
--network=host \
--build-arg http_proxy=${http_proxy} \
--build-arg https_proxy=${https_proxy} \
@@ -2,19 +2,19 @@ FROM vault.habana.ai/gaudi-docker/1.15.1/ubuntu22.04/habanalabs/pytorch-installe

ENV LANG=en_US.UTF-8

WORKDIR /root/rayllm
WORKDIR /root/ray_serve

# copy the source code to the package directory
COPY ../ray/ /root/rayllm
COPY ../ray_serve/ /root/ray_serve

RUN pip install -r /root/rayllm/docker/requirements.txt && \
RUN pip install -r /root/ray_serve/docker/requirements.txt && \
pip install --upgrade-strategy eager optimum[habana]

RUN sed -i 's/#PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config && \
service ssh restart

ENV no_proxy=localhost,127.0.0.1
ENV PYTHONPATH=$PYTHONPATH:/root:/root/rayllm
ENV PYTHONPATH=$PYTHONPATH:/root:/root/ray_serve

# Required by DeepSpeed
ENV RAY_EXPERIMENTAL_NOSET_HABANA_VISIBLE_MODULES=1
@@ -41,4 +41,4 @@ if [ "$#" -lt 0 ] || [ "$#" -gt 5 ]; then
fi

# Build the Docker run command based on the number of cards
docker run -it --runtime=habana --name="rayllm-habana" -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --ipc=host --network=host -e HUGGINGFACEHUB_API_TOKEN=$HUGGINGFACEHUB_API_TOKEN -e TRUST_REMOTE_CODE=$TRUST_REMOTE_CODE rayllm:habana /bin/bash -c "ray start --head && python api_server_openai.py --port_number $port_number --model_id_or_path $model_name --chat_processor $chat_processor --num_cpus_per_worker $num_cpus_per_worker --num_hpus_per_worker $num_hpus_per_worker"
docker run -it --runtime=habana --name="ChatQnA_server" -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --ipc=host --network=host -e HUGGINGFACEHUB_API_TOKEN=$HUGGINGFACEHUB_API_TOKEN -e TRUST_REMOTE_CODE=$TRUST_REMOTE_CODE ray_serve:habana /bin/bash -c "ray start --head && python api_server_openai.py --port_number $port_number --model_id_or_path $model_name --chat_processor $chat_processor --num_cpus_per_worker $num_cpus_per_worker --num_hpus_per_worker $num_hpus_per_worker"
@@ -25,8 +25,8 @@
from fastapi import HTTPException
from pydantic import BaseModel
from ray import serve
from rayllm.api_openai_backend.openai_protocol import ChatMessage, ErrorResponse, ModelResponse
from rayllm.api_openai_backend.tools import ChatPromptCapture, OpenAIToolsPrompter
from ray_serve.api_openai_backend.openai_protocol import ChatMessage, ErrorResponse, ModelResponse
from ray_serve.api_openai_backend.tools import ChatPromptCapture, OpenAIToolsPrompter
from starlette.requests import Request
from starlette.responses import JSONResponse, StreamingResponse
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer
37 changes: 37 additions & 0 deletions comps/llms/text-generation/vllm/Dockerfile
@@ -0,0 +1,37 @@
# Copyright (c) 2024 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

FROM langchain/langchain:latest

RUN apt-get update -y && apt-get install -y --no-install-recommends --fix-missing \
libgl1-mesa-glx \
libjemalloc-dev \
vim

RUN useradd -m -s /bin/bash user && \
mkdir -p /home/user && \
chown -R user /home/user/

USER user

COPY comps /home/user/comps

RUN pip install --no-cache-dir --upgrade pip && \
pip install --no-cache-dir -r /home/user/comps/llms/text-generation/vllm/requirements.txt

ENV PYTHONPATH=$PYTHONPATH:/home/user

WORKDIR /home/user/comps/llms/text-generation/vllm

ENTRYPOINT ["python", "llm.py"]
2 changes: 1 addition & 1 deletion comps/llms/text-generation/vllm/README.md
@@ -23,7 +23,7 @@ export HUGGINGFACEHUB_API_TOKEN=<token>
And then you can make requests like below to check the service status:

```bash
curl http://127.0.0.1::8080/v1/completions \
curl http://127.0.0.1:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": <model_name>,
2 changes: 1 addition & 1 deletion comps/llms/text-generation/vllm/build_docker_cpu.sh
@@ -16,4 +16,4 @@

git clone https://github.com/vllm-project/vllm.git
cd ./vllm/
docker build -f Dockerfile.cpu -t vllm-cpu-env --shm-size=128g . --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy
docker build -f Dockerfile.cpu -t vllm:cpu --shm-size=128g . --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy