diff --git a/comps/llms/README.md b/comps/llms/README.md index 96a69ebf0..e8fb9bd82 100644 --- a/comps/llms/README.md +++ b/comps/llms/README.md @@ -2,9 +2,9 @@ This microservice, designed for Language Model Inference (LLM), processes input consisting of a query string and associated reranked documents. It constructs a prompt based on the query and documents, which is then used to perform inference with a large language model. The service delivers the inference results as output. -A prerequisite for using this microservice is that users must have a Text Generation Inference (TGI) service already running. Users need to set the TGI service's endpoint into an environment variable. The microservice utilizes this endpoint to create an LLM object, enabling it to communicate with the TGI service for executing language model operations. +A prerequisite for using this microservice is that users must have an LLM text generation service (e.g., TGI, vLLM, or Ray) already running. Users need to set the LLM service's endpoint in an environment variable. The microservice utilizes this endpoint to create an LLM object, enabling it to communicate with the LLM service for executing language model operations. -Overall, this microservice offers a streamlined way to integrate large language model inference into applications, requiring minimal setup from the user beyond initiating a TGI service and configuring the necessary environment variables. This allows for the seamless processing of queries and documents to generate intelligent, context-aware responses. +Overall, this microservice offers a streamlined way to integrate large language model inference into applications, requiring minimal setup from the user beyond initiating a TGI/vLLM/Ray service and configuring the necessary environment variables. This allows for the seamless processing of queries and documents to generate intelligent, context-aware responses. # 🚀1. Start Microservice with Python (Option 1) @@ -16,7 +16,9 @@ To start the LLM microservice, you need to install python packages first.
pip install -r requirements.txt ``` -## 1.2 Start TGI Service +## 1.2 Start LLM Service + +### 1.2.1 Start TGI Service ```bash export HUGGINGFACEHUB_API_TOKEN=${your_hf_api_token} @@ -26,7 +28,24 @@ export LANGCHAIN_PROJECT="opea/gen-ai-comps:llms" docker run -p 8008:80 -v ./data:/data --name tgi_service --shm-size 1g ghcr.io/huggingface/text-generation-inference:1.4 --model-id ${your_hf_llm_model} ``` -## 1.3 Verify the TGI Service +### 1.2.2 Start vLLM Service + +```bash +export HUGGINGFACEHUB_API_TOKEN=${your_hf_api_token} +docker run -it --name vllm_service -p 8008:80 -e HF_TOKEN=${HUGGINGFACEHUB_API_TOKEN} -v ./data:/data vllm:cpu /bin/bash -c "cd / && export VLLM_CPU_KVCACHE_SPACE=40 && python3 -m vllm.entrypoints.openai.api_server --model ${your_hf_llm_model} --port 80" +``` + +### 1.2.3 Start Ray Service + +```bash +export HUGGINGFACEHUB_API_TOKEN=${your_hf_api_token} +export TRUST_REMOTE_CODE=True +docker run -it --runtime=habana --name ray_serve_service -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --ipc=host -p 8008:80 -e HUGGINGFACEHUB_API_TOKEN=$HUGGINGFACEHUB_API_TOKEN -e TRUST_REMOTE_CODE=$TRUST_REMOTE_CODE ray_serve:habana /bin/bash -c "ray start --head && python api_server_openai.py --port_number 80 --model_id_or_path ${your_hf_llm_model} --chat_processor ${your_hf_chatprocessor}" +``` + +## 1.3 Verify the LLM Service + +### 1.3.1 Verify the TGI Service ```bash curl http://${your_ip}:8008/generate \ @@ -35,16 +54,54 @@ curl http://${your_ip}:8008/generate \ -H 'Content-Type: application/json' ``` -## 1.4 Start LLM Service +### 1.3.2 Verify the vLLM Service + +```bash +curl http://${your_ip}:8008/v1/completions \ +  -H "Content-Type: application/json" \ +  -d '{ +  "model": "${your_hf_llm_model}", +  "prompt": "What is Deep Learning?", +  "max_tokens": 32, +  "temperature": 0 +  }' +``` + +### 1.3.3 Verify the Ray Service + +```bash +curl http://${your_ip}:8008/v1/chat/completions \ +  -H "Content-Type: application/json" \ +  -d '{ +  "model": "${your_hf_llm_model}", +  "messages": [ +    {"role": "assistant", "content": "You are a helpful assistant."}, +    {"role": "user", "content": "What is Deep Learning?"} +  ], +  "max_tokens": 32, +  "stream": true +  }' +``` + +## 1.4 Start LLM Service with Python Script + +### 1.4.1 Start the LLM Service with TGI ```bash export TGI_LLM_ENDPOINT="http://${your_ip}:8008" python text-generation/tgi/llm.py ``` +### 1.4.2 Start the LLM Service with vLLM + +```bash +export vLLM_LLM_ENDPOINT="http://${your_ip}:8008" +python text-generation/vllm/llm.py +``` + # 🚀2. Start Microservice with Docker (Option 2) -If you start an LLM microservice with docker, the `docker_compose_llm.yaml` file will automatically start a TGI service with docker. +If you start an LLM microservice with docker, the `docker_compose_llm.yaml` file will automatically start a TGI/vLLM service with docker. ## 2.1 Setup Environment Variables @@ -59,13 +116,33 @@ export LANGCHAIN_API_KEY=${your_langchain_api_key} export LANGCHAIN_PROJECT="opea/llms" ``` +In order to start the vLLM and LLM services, you need to set up the following environment variables first.
+ +```bash +export HUGGINGFACEHUB_API_TOKEN=${your_hf_api_token} +export vLLM_LLM_ENDPOINT="http://${your_ip}:8008" +export LLM_MODEL_ID=${your_hf_llm_model} +export LANGCHAIN_TRACING_V2=true +export LANGCHAIN_API_KEY=${your_langchain_api_key} +export LANGCHAIN_PROJECT="opea/llms" +``` + ## 2.2 Build Docker Image +### 2.2.1 TGI + ```bash cd ../../ docker build -t opea/llm-tgi:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/llms/text-generation/tgi/Dockerfile . ``` +### 2.2.2 vLLM + +```bash +cd ../../ +docker build -t opea/llm-vllm:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/llms/text-generation/vllm/Dockerfile . +``` + To start a docker container, you have two options: - A. Run Docker with CLI @@ -75,17 +152,34 @@ You can choose one as needed. ## 2.3 Run Docker with CLI (Option A) +### 2.3.1 TGI + ```bash docker run -d --name="llm-tgi-server" -p 9000:9000 --ipc=host -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e TGI_LLM_ENDPOINT=$TGI_LLM_ENDPOINT -e HUGGINGFACEHUB_API_TOKEN=$HUGGINGFACEHUB_API_TOKEN opea/llm-tgi:latest ``` +### 2.3.2 vLLM + +```bash +docker run -d --name="llm-vllm-server" -p 9000:9000 --ipc=host -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e vLLM_LLM_ENDPOINT=$vLLM_LLM_ENDPOINT -e HUGGINGFACEHUB_API_TOKEN=$HUGGINGFACEHUB_API_TOKEN -e LLM_MODEL_ID=$LLM_MODEL_ID opea/llm-vllm:latest +``` + ## 2.4 Run Docker with Docker Compose (Option B) +### 2.4.1 TGI + ```bash cd text-generation/tgi docker compose -f docker_compose_llm.yaml up -d ``` +### 2.4.2 vLLM + +```bash +cd text-generation/vllm +docker compose -f docker_compose_llm.yaml up -d +``` + # 🚀3. Consume LLM Service ## 3.1 Check Service Status diff --git a/comps/llms/text-generation/ray/README.md b/comps/llms/text-generation/ray_serve/README.md similarity index 95% rename from comps/llms/text-generation/ray/README.md rename to comps/llms/text-generation/ray_serve/README.md index a0cf3ecdd..1b664bfe1 100644 --- a/comps/llms/text-generation/ray/README.md +++ b/comps/llms/text-generation/ray_serve/README.md @@ -21,13 +21,13 @@ export HUGGINGFACEHUB_API_TOKEN= And then you can make requests with the OpenAI-compatible APIs like below to check the service status: ```bash -curl http://127.0.0.1::8080/v1/chat/completions \ +curl http://127.0.0.1:8080/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": , "messages": [ {"role": "assistant", "content": "You are a helpful assistant."}, - {"role": "user", "content": args.input_text}, + {"role": "user", "content": "What is Deep Learning?"}, ], "max_tokens": 32, "stream": True diff --git a/comps/llms/text-generation/ray/__init__.py b/comps/llms/text-generation/ray_serve/__init__.py similarity index 100% rename from comps/llms/text-generation/ray/__init__.py rename to comps/llms/text-generation/ray_serve/__init__.py diff --git a/comps/llms/text-generation/ray/api_openai_backend/__init__.py b/comps/llms/text-generation/ray_serve/api_openai_backend/__init__.py similarity index 100% rename from comps/llms/text-generation/ray/api_openai_backend/__init__.py rename to comps/llms/text-generation/ray_serve/api_openai_backend/__init__.py diff --git a/comps/llms/text-generation/ray/api_openai_backend/openai_protocol.py b/comps/llms/text-generation/ray_serve/api_openai_backend/openai_protocol.py similarity index 100% rename from comps/llms/text-generation/ray/api_openai_backend/openai_protocol.py rename to 
comps/llms/text-generation/ray_serve/api_openai_backend/openai_protocol.py diff --git a/comps/llms/text-generation/ray/api_openai_backend/query_client.py b/comps/llms/text-generation/ray_serve/api_openai_backend/query_client.py similarity index 94% rename from comps/llms/text-generation/ray/api_openai_backend/query_client.py rename to comps/llms/text-generation/ray_serve/api_openai_backend/query_client.py index 4cc03f767..0f6b395fc 100644 --- a/comps/llms/text-generation/ray/api_openai_backend/query_client.py +++ b/comps/llms/text-generation/ray_serve/api_openai_backend/query_client.py @@ -15,8 +15,8 @@ from typing import Dict from fastapi import HTTPException -from rayllm.api_openai_backend.openai_protocol import ModelCard, Prompt -from rayllm.api_openai_backend.request_handler import handle_request +from ray_serve.api_openai_backend.openai_protocol import ModelCard, Prompt +from ray_serve.api_openai_backend.request_handler import handle_request class RouterQueryClient: diff --git a/comps/llms/text-generation/ray/api_openai_backend/request_handler.py b/comps/llms/text-generation/ray_serve/api_openai_backend/request_handler.py similarity index 97% rename from comps/llms/text-generation/ray/api_openai_backend/request_handler.py rename to comps/llms/text-generation/ray_serve/api_openai_backend/request_handler.py index 1b228b8ef..08e100ac5 100644 --- a/comps/llms/text-generation/ray/api_openai_backend/request_handler.py +++ b/comps/llms/text-generation/ray_serve/api_openai_backend/request_handler.py @@ -18,7 +18,7 @@ from fastapi import HTTPException, Request, status from pydantic import ValidationError as PydanticValidationError -from rayllm.api_openai_backend.openai_protocol import ErrorResponse, FinishReason, ModelResponse, Prompt +from ray_serve.api_openai_backend.openai_protocol import ErrorResponse, FinishReason, ModelResponse, Prompt from starlette.responses import JSONResponse diff --git a/comps/llms/text-generation/ray/api_openai_backend/router_app.py b/comps/llms/text-generation/ray_serve/api_openai_backend/router_app.py similarity index 98% rename from comps/llms/text-generation/ray/api_openai_backend/router_app.py rename to comps/llms/text-generation/ray_serve/api_openai_backend/router_app.py index 375ac6dda..908095747 100644 --- a/comps/llms/text-generation/ray/api_openai_backend/router_app.py +++ b/comps/llms/text-generation/ray_serve/api_openai_backend/router_app.py @@ -21,7 +21,7 @@ from fastapi import Response as FastAPIResponse from fastapi import status from fastapi.middleware.cors import CORSMiddleware -from rayllm.api_openai_backend.openai_protocol import ( +from ray_serve.api_openai_backend.openai_protocol import ( ChatCompletionRequest, ChatCompletionResponse, ChatCompletionResponseChoice, @@ -39,8 +39,8 @@ Prompt, UsageInfo, ) -from rayllm.api_openai_backend.query_client import RouterQueryClient -from rayllm.api_openai_backend.request_handler import OpenAIHTTPException, openai_exception_handler +from ray_serve.api_openai_backend.query_client import RouterQueryClient +from ray_serve.api_openai_backend.request_handler import OpenAIHTTPException, openai_exception_handler from starlette.responses import Response, StreamingResponse # timeout in 10 minutes. 
Streaming can take longer than 3 min diff --git a/comps/llms/text-generation/ray/api_openai_backend/tools.py b/comps/llms/text-generation/ray_serve/api_openai_backend/tools.py similarity index 98% rename from comps/llms/text-generation/ray/api_openai_backend/tools.py rename to comps/llms/text-generation/ray_serve/api_openai_backend/tools.py index 46966a604..1060e25d5 100644 --- a/comps/llms/text-generation/ray/api_openai_backend/tools.py +++ b/comps/llms/text-generation/ray_serve/api_openai_backend/tools.py @@ -19,7 +19,7 @@ from typing import List, Union import jinja2 -from rayllm.api_openai_backend.openai_protocol import ChatMessage, FunctionCall, ToolCall +from ray_serve.api_openai_backend.openai_protocol import ChatMessage, FunctionCall, ToolCall class ToolsCallsTemplateContext(Enum): diff --git a/comps/llms/text-generation/ray/api_server_openai.py b/comps/llms/text-generation/ray_serve/api_server_openai.py similarity index 96% rename from comps/llms/text-generation/ray/api_server_openai.py rename to comps/llms/text-generation/ray_serve/api_server_openai.py index 6db43e875..0a286d26a 100644 --- a/comps/llms/text-generation/ray/api_server_openai.py +++ b/comps/llms/text-generation/ray_serve/api_server_openai.py @@ -20,9 +20,9 @@ import ray from easydict import EasyDict as edict from ray import serve -from rayllm.api_openai_backend.query_client import RouterQueryClient -from rayllm.api_openai_backend.router_app import Router, router_app -from rayllm.ray_serve import LLMServe +from ray_serve.api_openai_backend.query_client import RouterQueryClient +from ray_serve.api_openai_backend.router_app import Router, router_app +from ray_serve.ray_serve import LLMServe def router_application(deployments, max_concurrent_queries): diff --git a/comps/llms/text-generation/ray/build_docker.sh b/comps/llms/text-generation/ray_serve/build_docker.sh similarity index 96% rename from comps/llms/text-generation/ray/build_docker.sh rename to comps/llms/text-generation/ray_serve/build_docker.sh index c49530d71..3fc34448a 100755 --- a/comps/llms/text-generation/ray/build_docker.sh +++ b/comps/llms/text-generation/ray_serve/build_docker.sh @@ -18,7 +18,7 @@ cd docker docker build \ -f Dockerfile ../../ \ - -t rayllm:habana \ + -t ray_serve:habana \ --network=host \ --build-arg http_proxy=${http_proxy} \ --build-arg https_proxy=${https_proxy} \ diff --git a/comps/llms/text-generation/ray/Dockerfile b/comps/llms/text-generation/ray_serve/docker/Dockerfile similarity index 78% rename from comps/llms/text-generation/ray/Dockerfile rename to comps/llms/text-generation/ray_serve/docker/Dockerfile index 9d74cccf9..0b04af85a 100644 --- a/comps/llms/text-generation/ray/Dockerfile +++ b/comps/llms/text-generation/ray_serve/docker/Dockerfile @@ -2,19 +2,19 @@ FROM vault.habana.ai/gaudi-docker/1.15.1/ubuntu22.04/habanalabs/pytorch-installe ENV LANG=en_US.UTF-8 -WORKDIR /root/rayllm +WORKDIR /root/ray_serve # copy the source code to the package directory -COPY ../ray/ /root/rayllm +COPY ../ray_serve/ /root/ray_serve -RUN pip install -r /root/rayllm/docker/requirements.txt && \ +RUN pip install -r /root/ray_serve/docker/requirements.txt && \ pip install --upgrade-strategy eager optimum[habana] RUN sed -i 's/#PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config && \ service ssh restart ENV no_proxy=localhost,127.0.0.1 -ENV PYTHONPATH=$PYTHONPATH:/root:/root/rayllm +ENV PYTHONPATH=$PYTHONPATH:/root:/root/ray_serve # Required by DeepSpeed ENV RAY_EXPERIMENTAL_NOSET_HABANA_VISIBLE_MODULES=1 diff --git 
a/comps/llms/text-generation/ray/requirements.txt b/comps/llms/text-generation/ray_serve/docker/requirements.txt similarity index 100% rename from comps/llms/text-generation/ray/requirements.txt rename to comps/llms/text-generation/ray_serve/docker/requirements.txt diff --git a/comps/llms/text-generation/ray/launch_ray_service.sh b/comps/llms/text-generation/ray_serve/launch_ray_service.sh similarity index 79% rename from comps/llms/text-generation/ray/launch_ray_service.sh rename to comps/llms/text-generation/ray_serve/launch_ray_service.sh index de775a97e..9371183f5 100755 --- a/comps/llms/text-generation/ray/launch_ray_service.sh +++ b/comps/llms/text-generation/ray_serve/launch_ray_service.sh @@ -41,4 +41,4 @@ if [ "$#" -lt 0 ] || [ "$#" -gt 5 ]; then fi # Build the Docker run command based on the number of cards -docker run -it --runtime=habana --name="rayllm-habana" -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --ipc=host --network=host -e HUGGINGFACEHUB_API_TOKEN=$HUGGINGFACEHUB_API_TOKEN -e TRUST_REMOTE_CODE=$TRUST_REMOTE_CODE rayllm:habana /bin/bash -c "ray start --head && python api_server_openai.py --port_number $port_number --model_id_or_path $model_name --chat_processor $chat_processor --num_cpus_per_worker $num_cpus_per_worker --num_hpus_per_worker $num_hpus_per_worker" +docker run -it --runtime=habana --name="ChatQnA_server" -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --ipc=host --network=host -e HUGGINGFACEHUB_API_TOKEN=$HUGGINGFACEHUB_API_TOKEN -e TRUST_REMOTE_CODE=$TRUST_REMOTE_CODE ray_serve:habana /bin/bash -c "ray start --head && python api_server_openai.py --port_number $port_number --model_id_or_path $model_name --chat_processor $chat_processor --num_cpus_per_worker $num_cpus_per_worker --num_hpus_per_worker $num_hpus_per_worker" diff --git a/comps/llms/text-generation/ray/ray_serve.py b/comps/llms/text-generation/ray_serve/ray_serve.py similarity index 99% rename from comps/llms/text-generation/ray/ray_serve.py rename to comps/llms/text-generation/ray_serve/ray_serve.py index b004a61ae..df373ab16 100644 --- a/comps/llms/text-generation/ray/ray_serve.py +++ b/comps/llms/text-generation/ray_serve/ray_serve.py @@ -25,8 +25,8 @@ from fastapi import HTTPException from pydantic import BaseModel from ray import serve -from rayllm.api_openai_backend.openai_protocol import ChatMessage, ErrorResponse, ModelResponse -from rayllm.api_openai_backend.tools import ChatPromptCapture, OpenAIToolsPrompter +from ray_serve.api_openai_backend.openai_protocol import ChatMessage, ErrorResponse, ModelResponse +from ray_serve.api_openai_backend.tools import ChatPromptCapture, OpenAIToolsPrompter from starlette.requests import Request from starlette.responses import JSONResponse, StreamingResponse from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer diff --git a/comps/llms/text-generation/vllm/Dockerfile b/comps/llms/text-generation/vllm/Dockerfile new file mode 100644 index 000000000..9e978d452 --- /dev/null +++ b/comps/llms/text-generation/vllm/Dockerfile @@ -0,0 +1,37 @@ +# Copyright (c) 2024 Intel Corporation +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +FROM langchain/langchain:latest + +RUN apt-get update -y && apt-get install -y --no-install-recommends --fix-missing \ + libgl1-mesa-glx \ + libjemalloc-dev \ + vim + +RUN useradd -m -s /bin/bash user && \ + mkdir -p /home/user && \ + chown -R user /home/user/ + +USER user + +COPY comps /home/user/comps + +RUN pip install --no-cache-dir --upgrade pip && \ + pip install --no-cache-dir -r /home/user/comps/llms/text-generation/vllm/requirements.txt + +ENV PYTHONPATH=$PYTHONPATH:/home/user + +WORKDIR /home/user/comps/llms/text-generation/vllm + +ENTRYPOINT ["python", "llm.py"] diff --git a/comps/llms/text-generation/vllm/README.md b/comps/llms/text-generation/vllm/README.md index 3d0491326..de98e7521 100644 --- a/comps/llms/text-generation/vllm/README.md +++ b/comps/llms/text-generation/vllm/README.md @@ -23,7 +23,7 @@ export HUGGINGFACEHUB_API_TOKEN= And then you can make requests like below to check the service status: ```bash -curl http://127.0.0.1::8080/v1/completions \ +curl http://127.0.0.1:8080/v1/completions \ -H "Content-Type: application/json" \ -d '{ "model": , diff --git a/comps/llms/text-generation/vllm/build_docker_cpu.sh b/comps/llms/text-generation/vllm/build_docker_cpu.sh index 89e11e060..3947c9389 100644 --- a/comps/llms/text-generation/vllm/build_docker_cpu.sh +++ b/comps/llms/text-generation/vllm/build_docker_cpu.sh @@ -16,4 +16,4 @@ git clone https://github.com/vllm-project/vllm.git cd ./vllm/ -docker build -f Dockerfile.cpu -t vllm-cpu-env --shm-size=128g . --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy +docker build -f Dockerfile.cpu -t vllm:cpu --shm-size=128g . --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy diff --git a/comps/llms/text-generation/vllm/docker_compose_llm.yaml b/comps/llms/text-generation/vllm/docker_compose_llm.yaml new file mode 100644 index 000000000..543da06f7 --- /dev/null +++ b/comps/llms/text-generation/vllm/docker_compose_llm.yaml @@ -0,0 +1,47 @@ +# Copyright (c) 2024 Intel Corporation +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +version: "3.8" + +services: + vllm_service: + image: vllm:cpu + container_name: vllm-service + ports: + - "8008:80" + volumes: + - "./data:/data" + environment: + http_proxy: ${http_proxy} + https_proxy: ${https_proxy} + LLM_MODEL_ID: ${LLM_MODEL_ID} + HF_TOKEN: ${HUGGINGFACEHUB_API_TOKEN} + command: cd / && export VLLM_CPU_KVCACHE_SPACE=40 && python3 -m vllm.entrypoints.openai.api_server --model $LLM_MODEL_ID --port 80 + llm: + image: opea/gen-ai-comps:llm-vllm-server + container_name: llm-vllm-server + ports: + - "9000:9000" + ipc: host + environment: + http_proxy: ${http_proxy} + https_proxy: ${https_proxy} + vLLM_LLM_ENDPOINT: ${vLLM_LLM_ENDPOINT} + LLM_MODEL_ID: ${LLM_MODEL_ID} + HUGGINGFACEHUB_API_TOKEN: ${HUGGINGFACEHUB_API_TOKEN} + restart: unless-stopped + +networks: + default: + driver: bridge diff --git a/comps/llms/text-generation/vllm/launch_vllm_service.sh b/comps/llms/text-generation/vllm/launch_vllm_service.sh index a63050ece..1baa6c9e3 100644 --- a/comps/llms/text-generation/vllm/launch_vllm_service.sh +++ b/comps/llms/text-generation/vllm/launch_vllm_service.sh @@ -32,4 +32,4 @@ fi volume=$PWD/data # Build the Docker run command based on the number of cards -docker run -it --rm --name="ChatQnA_server" -p $port_number:$port_number --network=host -v $volume:/data -e HTTPS_PROXY=$https_proxy -e HTTP_PROXY=$https_proxy vllm-cpu-env /bin/bash -c "cd / && export VLLM_CPU_KVCACHE_SPACE=40 && python3 -m vllm.entrypoints.openai.api_server --model $model_name --port $port_number" +docker run -it --rm --name="ChatQnA_server" -p $port_number:$port_number --network=host -v $volume:/data -e HTTPS_PROXY=$https_proxy -e HTTP_PROXY=$https_proxy -e HF_TOKEN=${HUGGINGFACEHUB_API_TOKEN} vllm:cpu /bin/bash -c "cd / && export VLLM_CPU_KVCACHE_SPACE=40 && python3 -m vllm.entrypoints.openai.api_server --model $model_name --port $port_number" diff --git a/comps/llms/text-generation/vllm/llm.py b/comps/llms/text-generation/vllm/llm.py new file mode 100644 index 000000000..1070c2cda --- /dev/null +++ b/comps/llms/text-generation/vllm/llm.py @@ -0,0 +1,80 @@ +# Copyright (c) 2024 Intel Corporation +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os + +from fastapi.responses import StreamingResponse +from langchain_community.llms import VLLMOpenAI + +from comps import GeneratedDoc, LLMParamsDoc, ServiceType, opea_microservices, opea_telemetry, register_microservice + + +@opea_telemetry +def post_process_text(text: str): + if text == " ": + return "data: @#$\n\n" + if text == "\n": + return "data:
<br/>\n\n" +    if text.isspace(): +        return None +    new_text = text.replace(" ", "@#$") +    return f"data: {new_text}\n\n" + + +@register_microservice( +    name="opea_service@llm_vllm", +    service_type=ServiceType.LLM, +    endpoint="/v1/chat/completions", +    host="0.0.0.0", +    port=9000, +) +@opea_telemetry +def llm_generate(input: LLMParamsDoc): +    llm_endpoint = os.getenv("vLLM_LLM_ENDPOINT", "http://localhost:8080") +    llm = VLLMOpenAI( +        openai_api_key="EMPTY", +        openai_api_base=llm_endpoint + "/v1", +        max_tokens=input.max_new_tokens, +        model_name=os.getenv("LLM_MODEL_ID", "meta-llama/Meta-Llama-3-8B-Instruct"), +        top_p=input.top_p, +        temperature=input.temperature, +        presence_penalty=input.repetition_penalty, +        streaming=input.streaming, +    ) + +    if input.streaming: + +        def stream_generator(): +            chat_response = "" +            for text in llm.stream(input.query): +                chat_response += text +                processed_text = post_process_text(text) +                if text and processed_text: +                    if "</s>" in text: +                        res = text.split("</s>")[0] +                        if res != "": +                            yield res +                        break +                    yield processed_text +            print(f"[llm - chat_stream] stream response: {chat_response}") +            yield "data: [DONE]\n\n" + +        return StreamingResponse(stream_generator(), media_type="text/event-stream") +    else: +        response = llm.invoke(input.query) +        return GeneratedDoc(text=response, prompt=input.query) + + +if __name__ == "__main__": +    opea_microservices["opea_service@llm_vllm"].start() diff --git a/comps/llms/text-generation/vllm/requirements.txt b/comps/llms/text-generation/vllm/requirements.txt new file mode 100644 index 000000000..7d72d98b5 --- /dev/null +++ b/comps/llms/text-generation/vllm/requirements.txt @@ -0,0 +1,11 @@ +docarray[full] +fastapi +huggingface_hub +langchain==0.1.16 +langserve +opentelemetry-api +opentelemetry-exporter-otlp +opentelemetry-sdk +shortuuid +transformers +vllm
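For reference, the streaming path added in `comps/llms/text-generation/vllm/llm.py` above emits server-sent events in which `post_process_text` encodes spaces as `@#$` and newlines as `<br/>`, and terminates the stream with `data: [DONE]`. Below is a minimal client sketch showing one way to consume and decode that stream; it assumes the microservice is reachable at `http://localhost:9000`, that the `requests` package is installed, and that sending only the `query` and `streaming` fields of `LLMParamsDoc` is sufficient (the remaining generation parameters fall back to their defaults).

```python
# Minimal client sketch (not part of this patch) for the /v1/chat/completions
# endpoint exposed by comps/llms/text-generation/vllm/llm.py on port 9000.
import requests

url = "http://localhost:9000/v1/chat/completions"  # assumed host/port
payload = {"query": "What is Deep Learning?", "streaming": True}

with requests.post(url, json=payload, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        # Each SSE event arrives as a "data: ..." line followed by a blank line.
        if not line or not line.startswith("data: "):
            continue
        chunk = line[len("data: "):]
        if chunk == "[DONE]":
            break
        # Undo the encoding applied by post_process_text():
        # spaces are sent as "@#$" and newlines as "<br/>".
        print(chunk.replace("@#$", " ").replace("<br/>", "\n"), end="", flush=True)
print()
```

With `"streaming": false`, the same endpoint instead returns a single `GeneratedDoc` JSON object containing the generated `text` and the original `prompt`.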