Added the microservice of vLLM (#78)
* refine the vllm microservice

Signed-off-by: tianyil1 <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* rename the rayllm to ray_serve

Signed-off-by: tianyil1 <[email protected]>

* refactor the ray service code structure

Signed-off-by: tianyil1 <[email protected]>

* refine the vllm and readme

Signed-off-by: tianyil1 <[email protected]>

* update the readme with correct ray service name

Signed-off-by: tianyil1 <[email protected]>

* update the readme

Signed-off-by: tianyil1 <[email protected]>

* refine the readme

Signed-off-by: tianyil1 <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: tianyil1 <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
tianyil1 and pre-commit-ci[bot] authored May 30, 2024
1 parent 3986c4f commit f0b0690
Showing 22 changed files with 298 additions and 29 deletions.
106 changes: 100 additions & 6 deletions comps/llms/README.md
@@ -2,9 +2,9 @@

This microservice, designed for Language Model Inference (LLM), processes input consisting of a query string and associated reranked documents. It constructs a prompt based on the query and documents, which is then used to perform inference with a large language model. The service delivers the inference results as output.

A prerequisite for using this microservice is that users must have a Text Generation Inference (TGI) service already running. Users need to set the TGI service's endpoint into an environment variable. The microservice utilizes this endpoint to create an LLM object, enabling it to communicate with the TGI service for executing language model operations.
A prerequisite for using this microservice is that users must have an LLM text generation service (e.g., TGI, vLLM, or Ray) already running. Users need to set the LLM service's endpoint in an environment variable. The microservice utilizes this endpoint to create an LLM object, enabling it to communicate with the LLM service for executing language model operations.

Overall, this microservice offers a streamlined way to integrate large language model inference into applications, requiring minimal setup from the user beyond initiating a TGI service and configuring the necessary environment variables. This allows for the seamless processing of queries and documents to generate intelligent, context-aware responses.
Overall, this microservice offers a streamlined way to integrate large language model inference into applications, requiring minimal setup from the user beyond initiating a TGI/vLLM/Ray service and configuring the necessary environment variables. This allows for the seamless processing of queries and documents to generate intelligent, context-aware responses.
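
As a rough illustration of this flow, the sketch below reads the endpoint from an environment variable, folds the query and its reranked documents into a prompt, and sends the prompt to the backing service. The helper names and the use of `langchain_community.llms.HuggingFaceEndpoint` are assumptions for this example only, not the exact code in `llm.py`.

```python
# Illustrative sketch only; the actual microservice implementation may differ.
import os

from langchain_community.llms import HuggingFaceEndpoint  # assumed client for a TGI-style endpoint


def build_prompt(query: str, documents: list[str]) -> str:
    """Fold the reranked documents and the user query into a single prompt."""
    context = "\n".join(documents)
    return (
        "Answer the question based only on the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


def generate(query: str, documents: list[str]) -> str:
    # Endpoint of the already-running text generation service (TGI in this sketch).
    endpoint = os.getenv("TGI_LLM_ENDPOINT", "http://localhost:8008")
    llm = HuggingFaceEndpoint(endpoint_url=endpoint, max_new_tokens=128, temperature=0.01)
    return llm.invoke(build_prompt(query, documents))


if __name__ == "__main__":
    docs = ["Deep learning is a subset of machine learning based on neural networks."]
    print(generate("What is Deep Learning?", docs))
```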

# 🚀1. Start Microservice with Python (Option 1)

@@ -16,7 +16,9 @@ To start the LLM microservice, you need to install python packages first.
pip install -r requirements.txt
```

## 1.2 Start TGI Service
## 1.2 Start LLM Service

### 1.2.1 Start TGI Service

```bash
export HUGGINGFACEHUB_API_TOKEN=${your_hf_api_token}
@@ -26,7 +28,24 @@ export LANGCHAIN_PROJECT="opea/gen-ai-comps:llms"
docker run -p 8008:80 -v ./data:/data --name tgi_service --shm-size 1g ghcr.io/huggingface/text-generation-inference:1.4 --model-id ${your_hf_llm_model}
```

## 1.3 Verify the TGI Service
### 1.2.2 Start vLLM Service

```bash
export HUGGINGFACEHUB_API_TOKEN=${your_hf_api_token}
docker run -it --name vllm_service -p 8008:80 -e HF_TOKEN=${HUGGINGFACEHUB_API_TOKEN} -v ./data:/data vllm:cpu /bin/bash -c "cd / && export VLLM_CPU_KVCACHE_SPACE=40 && python3 -m vllm.entrypoints.openai.api_server --model ${your_hf_llm_model} --port 80"
```

### 1.2.3 Start Ray Service

```bash
export HUGGINGFACEHUB_API_TOKEN=${your_hf_api_token}
export TRUST_REMOTE_CODE=True
docker run -it --runtime=habana --name ray_serve_service -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --ipc=host -p 8008:80 -e HUGGINGFACEHUB_API_TOKEN=$HUGGINGFACEHUB_API_TOKEN -e TRUST_REMOTE_CODE=$TRUST_REMOTE_CODE ray_serve:habana /bin/bash -c "ray start --head && python api_server_openai.py --port_number 80 --model_id_or_path ${your_hf_llm_model} --chat_processor ${your_hf_chatprocessor}"
```

## 1.3 Verify the LLM Service

### 1.3.1 Verify the TGI Service

```bash
curl http://${your_ip}:8008/generate \
@@ -35,16 +54,54 @@ curl http://${your_ip}:8008/generate \
-H 'Content-Type: application/json'
```

## 1.4 Start LLM Service
### 1.3.2 Verify the vLLM Service

```bash
curl http://${your_ip}:8008/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": ${your_hf_llm_model},
"prompt": "What is Deep Learning?",
"max_tokens": 32,
"temperature": 0
}'
```
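
If you prefer to script the check, the snippet below is an equivalent call from Python using the `requests` package; the host and model id are placeholders to substitute with the values used to start the vLLM service.

```python
# Python equivalent of the curl check above (sketch; adjust host and model id).
import requests

resp = requests.post(
    "http://localhost:8008/v1/completions",  # replace localhost with ${your_ip}
    json={
        "model": "${your_hf_llm_model}",  # substitute the model id used to start vLLM
        "prompt": "What is Deep Learning?",
        "max_tokens": 32,
        "temperature": 0,
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```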

### 1.3.3 Verify the Ray Service

```bash
curl http://${your_ip}:8008/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": ${your_hf_llm_model},
"messages": [
{"role": "assistant", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is Deep Learning?"},
],
"max_tokens": 32,
"stream": True
}'
```
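
The same check can be driven from Python. The snippet below is a minimal sketch that posts a chat request and prints the streamed chunks, assuming the service follows the usual OpenAI-style server-sent-events format (`data: {...}` lines ending with `data: [DONE]`); adjust the host and model id as needed.

```python
# Minimal streaming client sketch; assumes an OpenAI-compatible SSE stream.
import json

import requests

URL = "http://localhost:8008/v1/chat/completions"  # replace localhost with ${your_ip}
payload = {
    "model": "${your_hf_llm_model}",  # substitute the model id used to start the service
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Deep Learning?"},
    ],
    "max_tokens": 32,
    "stream": True,
}

with requests.post(URL, json=payload, stream=True, timeout=600) as resp:
    resp.raise_for_status()
    for raw_line in resp.iter_lines():
        if not raw_line:
            continue
        line = raw_line.decode("utf-8")
        if not line.startswith("data: "):
            continue
        data = line[len("data: "):]
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        # Each chunk carries an incremental "delta" with part of the reply.
        delta = chunk["choices"][0].get("delta", {})
        print(delta.get("content", ""), end="", flush=True)
print()
```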

## 1.4 Start LLM Service with Python Script

### 1.4.1 Start the TGI Service

```bash
export TGI_LLM_ENDPOINT="http://${your_ip}:8008"
python text-generation/tgi/llm.py
```

### 1.4.2 Start the vLLM Service

```bash
export vLLM_LLM_ENDPOINT="http://${your_ip}:8008"
python text-generation/vllm/llm.py
```

# 🚀2. Start Microservice with Docker (Option 2)

If you start an LLM microservice with docker, the `docker_compose_llm.yaml` file will automatically start a TGI service with docker.
If you start the LLM microservice with Docker, the `docker_compose_llm.yaml` file will automatically start a TGI/vLLM backend service in Docker as well.

## 2.1 Setup Environment Variables

@@ -59,13 +116,33 @@ export LANGCHAIN_API_KEY=${your_langchain_api_key}
export LANGCHAIN_PROJECT="opea/llms"
```

In order to start the vLLM service and the LLM microservice, you need to set up the following environment variables first.

```bash
export HUGGINGFACEHUB_API_TOKEN=${your_hf_api_token}
export vLLM_LLM_ENDPOINT="http://${your_ip}:8008"
export LLM_MODEL_ID=${your_hf_llm_model}
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=${your_langchain_api_key}
export LANGCHAIN_PROJECT="opea/llms"
```

## 2.2 Build Docker Image

### 2.2.1 TGI

```bash
cd ../../
docker build -t opea/llm-tgi:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/llms/text-generation/tgi/Dockerfile .
```

### 2.2.2 vLLM

```bash
cd ../../
docker build -t opea/llm-vllm:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/llms/text-generation/vllm/Dockerfile .
```

To start a docker container, you have two options:

- A. Run Docker with CLI
@@ -75,17 +152,34 @@ You can choose one as needed.

## 2.3 Run Docker with CLI (Option A)

### 2.3.1 TGI

```bash
docker run -d --name="llm-tgi-server" -p 9000:9000 --ipc=host -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e TGI_LLM_ENDPOINT=$TGI_LLM_ENDPOINT -e HUGGINGFACEHUB_API_TOKEN=$HUGGINGFACEHUB_API_TOKEN opea/llm-tgi:latest
```

### 2.3.2 vLLM

```bash
docker run -d --name="llm-vllm-server" -p 9000:9000 --ipc=host -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e vLLM_LLM_ENDPOINT=$vLLM_LLM_ENDPOINT -e HUGGINGFACEHUB_API_TOKEN=$HUGGINGFACEHUB_API_TOKEN -e LLM_MODEL_ID=$LLM_MODEL_ID opea/llm-vllm:latest
```

## 2.4 Run Docker with Docker Compose (Option B)

### 2.4.1 TGI

```bash
cd text-generation/tgi
docker compose -f docker_compose_llm.yaml up -d
```

### 2.4.2 vLLM

```bash
cd text-generation/vllm
docker compose -f docker_compose_llm.yaml up -d
```

# 🚀3. Consume LLM Service

## 3.1 Check Service Status
@@ -21,13 +21,13 @@ export HUGGINGFACEHUB_API_TOKEN=<token>
And then you can make requests with the OpenAI-compatible APIs like below to check the service status:

```bash
curl http://127.0.0.1::8080/v1/chat/completions \
curl http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": <model_name>,
"messages": [
{"role": "assistant", "content": "You are a helpful assistant."},
{"role": "user", "content": args.input_text},
{"role": "user", "content": "What is Deep Learning?"},
],
"max_tokens": 32,
"stream": True
File renamed without changes.
@@ -15,8 +15,8 @@
from typing import Dict

from fastapi import HTTPException
from rayllm.api_openai_backend.openai_protocol import ModelCard, Prompt
from rayllm.api_openai_backend.request_handler import handle_request
from ray_serve.api_openai_backend.openai_protocol import ModelCard, Prompt
from ray_serve.api_openai_backend.request_handler import handle_request


class RouterQueryClient:
@@ -18,7 +18,7 @@

from fastapi import HTTPException, Request, status
from pydantic import ValidationError as PydanticValidationError
from rayllm.api_openai_backend.openai_protocol import ErrorResponse, FinishReason, ModelResponse, Prompt
from ray_serve.api_openai_backend.openai_protocol import ErrorResponse, FinishReason, ModelResponse, Prompt
from starlette.responses import JSONResponse


@@ -21,7 +21,7 @@
from fastapi import Response as FastAPIResponse
from fastapi import status
from fastapi.middleware.cors import CORSMiddleware
from rayllm.api_openai_backend.openai_protocol import (
from ray_serve.api_openai_backend.openai_protocol import (
ChatCompletionRequest,
ChatCompletionResponse,
ChatCompletionResponseChoice,
@@ -39,8 +39,8 @@
Prompt,
UsageInfo,
)
from rayllm.api_openai_backend.query_client import RouterQueryClient
from rayllm.api_openai_backend.request_handler import OpenAIHTTPException, openai_exception_handler
from ray_serve.api_openai_backend.query_client import RouterQueryClient
from ray_serve.api_openai_backend.request_handler import OpenAIHTTPException, openai_exception_handler
from starlette.responses import Response, StreamingResponse

# timeout in 10 minutes. Streaming can take longer than 3 min
@@ -19,7 +19,7 @@
from typing import List, Union

import jinja2
from rayllm.api_openai_backend.openai_protocol import ChatMessage, FunctionCall, ToolCall
from ray_serve.api_openai_backend.openai_protocol import ChatMessage, FunctionCall, ToolCall


class ToolsCallsTemplateContext(Enum):
@@ -20,9 +20,9 @@
import ray
from easydict import EasyDict as edict
from ray import serve
from rayllm.api_openai_backend.query_client import RouterQueryClient
from rayllm.api_openai_backend.router_app import Router, router_app
from rayllm.ray_serve import LLMServe
from ray_serve.api_openai_backend.query_client import RouterQueryClient
from ray_serve.api_openai_backend.router_app import Router, router_app
from ray_serve.ray_serve import LLMServe


def router_application(deployments, max_concurrent_queries):
@@ -18,7 +18,7 @@ cd docker

docker build \
-f Dockerfile ../../ \
-t rayllm:habana \
-t ray_serve:habana \
--network=host \
--build-arg http_proxy=${http_proxy} \
--build-arg https_proxy=${https_proxy} \
@@ -2,19 +2,19 @@ FROM vault.habana.ai/gaudi-docker/1.15.1/ubuntu22.04/habanalabs/pytorch-installe

ENV LANG=en_US.UTF-8

WORKDIR /root/rayllm
WORKDIR /root/ray_serve

# copy the source code to the package directory
COPY ../ray/ /root/rayllm
COPY ../ray_serve/ /root/ray_serve

RUN pip install -r /root/rayllm/docker/requirements.txt && \
RUN pip install -r /root/ray_serve/docker/requirements.txt && \
pip install --upgrade-strategy eager optimum[habana]

RUN sed -i 's/#PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config && \
service ssh restart

ENV no_proxy=localhost,127.0.0.1
ENV PYTHONPATH=$PYTHONPATH:/root:/root/rayllm
ENV PYTHONPATH=$PYTHONPATH:/root:/root/ray_serve

# Required by DeepSpeed
ENV RAY_EXPERIMENTAL_NOSET_HABANA_VISIBLE_MODULES=1
@@ -41,4 +41,4 @@ if [ "$#" -lt 0 ] || [ "$#" -gt 5 ]; then
fi

# Build the Docker run command based on the number of cards
docker run -it --runtime=habana --name="rayllm-habana" -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --ipc=host --network=host -e HUGGINGFACEHUB_API_TOKEN=$HUGGINGFACEHUB_API_TOKEN -e TRUST_REMOTE_CODE=$TRUST_REMOTE_CODE rayllm:habana /bin/bash -c "ray start --head && python api_server_openai.py --port_number $port_number --model_id_or_path $model_name --chat_processor $chat_processor --num_cpus_per_worker $num_cpus_per_worker --num_hpus_per_worker $num_hpus_per_worker"
docker run -it --runtime=habana --name="ChatQnA_server" -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --ipc=host --network=host -e HUGGINGFACEHUB_API_TOKEN=$HUGGINGFACEHUB_API_TOKEN -e TRUST_REMOTE_CODE=$TRUST_REMOTE_CODE ray_serve:habana /bin/bash -c "ray start --head && python api_server_openai.py --port_number $port_number --model_id_or_path $model_name --chat_processor $chat_processor --num_cpus_per_worker $num_cpus_per_worker --num_hpus_per_worker $num_hpus_per_worker"
@@ -25,8 +25,8 @@
from fastapi import HTTPException
from pydantic import BaseModel
from ray import serve
from rayllm.api_openai_backend.openai_protocol import ChatMessage, ErrorResponse, ModelResponse
from rayllm.api_openai_backend.tools import ChatPromptCapture, OpenAIToolsPrompter
from ray_serve.api_openai_backend.openai_protocol import ChatMessage, ErrorResponse, ModelResponse
from ray_serve.api_openai_backend.tools import ChatPromptCapture, OpenAIToolsPrompter
from starlette.requests import Request
from starlette.responses import JSONResponse, StreamingResponse
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer
37 changes: 37 additions & 0 deletions comps/llms/text-generation/vllm/Dockerfile
@@ -0,0 +1,37 @@
# Copyright (c) 2024 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

FROM langchain/langchain:latest

RUN apt-get update -y && apt-get install -y --no-install-recommends --fix-missing \
libgl1-mesa-glx \
libjemalloc-dev \
vim

RUN useradd -m -s /bin/bash user && \
mkdir -p /home/user && \
chown -R user /home/user/

USER user

COPY comps /home/user/comps

RUN pip install --no-cache-dir --upgrade pip && \
pip install --no-cache-dir -r /home/user/comps/llms/text-generation/vllm/requirements.txt

ENV PYTHONPATH=$PYTHONPATH:/home/user

WORKDIR /home/user/comps/llms/text-generation/vllm

ENTRYPOINT ["python", "llm.py"]
2 changes: 1 addition & 1 deletion comps/llms/text-generation/vllm/README.md
@@ -23,7 +23,7 @@ export HUGGINGFACEHUB_API_TOKEN=<token>
And then you can make requests like below to check the service status:

```bash
curl http://127.0.0.1::8080/v1/completions \
curl http://127.0.0.1:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": <model_name>,
2 changes: 1 addition & 1 deletion comps/llms/text-generation/vllm/build_docker_cpu.sh
@@ -16,4 +16,4 @@

git clone https://github.com/vllm-project/vllm.git
cd ./vllm/
docker build -f Dockerfile.cpu -t vllm-cpu-env --shm-size=128g . --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy
docker build -f Dockerfile.cpu -t vllm:cpu --shm-size=128g . --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy