add doc about serving option on dstack (vllm-project#3074)
Co-authored-by: Roger Wang <[email protected]>
deep-diver and ywang96 authored May 30, 2024
1 parent a9bcc7a commit 429d897
Showing 2 changed files with 104 additions and 0 deletions.
103 changes: 103 additions & 0 deletions docs/source/serving/deploying_with_dstack.rst
@@ -0,0 +1,103 @@
.. _deploying_with_dstack:

Deploying with dstack
============================

.. raw:: html

    <p align="center">
        <img src="https://i.ibb.co/71kx6hW/vllm-dstack.png" alt="vLLM_plus_dstack"/>
    </p>

vLLM can be run on a cloud-based GPU machine with `dstack <https://dstack.ai/>`__, an open-source framework for running LLMs on any cloud. This tutorial assumes that you have already configured credentials, a gateway, and GPU quotas on your cloud environment.

To install the dstack client, run:

.. code-block:: console

    $ pip install "dstack[all]"
    $ dstack server

Next, to configure your dstack project, run:

.. code-block:: console

    $ mkdir -p vllm-dstack
    $ cd vllm-dstack
    $ dstack init

Next, to provision a VM instance with an LLM of your choice (`NousResearch/Llama-2-7b-chat-hf` in this example), create the following `serve.dstack.yml` file for the dstack `Service`:

.. code-block:: yaml

    type: service

    python: "3.11"
    env:
      - MODEL=NousResearch/Llama-2-7b-chat-hf
    port: 8000
    resources:
      gpu: 24GB
    commands:
      - pip install vllm
      - python -m vllm.entrypoints.openai.api_server --model $MODEL --port 8000
    model:
      format: openai
      type: chat
      name: NousResearch/Llama-2-7b-chat-hf

Then, run the following CLI command to provision the service:

.. code-block:: console

    $ dstack run . -f serve.dstack.yml
    ⠸ Getting run plan...
     Configuration  serve.dstack.yml
     Project        deep-diver-main
     User           deep-diver
     Min resources  2..xCPU, 8GB.., 1xGPU (24GB)
     Max price      -
     Max duration   -
     Spot policy    auto
     Retry policy   no

     #  BACKEND  REGION       INSTANCE       RESOURCES                               SPOT  PRICE
     1  gcp      us-central1  g2-standard-4  4xCPU, 16GB, 1xL4 (24GB), 100GB (disk)  yes   $0.223804
     2  gcp      us-east1     g2-standard-4  4xCPU, 16GB, 1xL4 (24GB), 100GB (disk)  yes   $0.223804
     3  gcp      us-west1     g2-standard-4  4xCPU, 16GB, 1xL4 (24GB), 100GB (disk)  yes   $0.223804
        ...
     Shown 3 of 193 offers, $5.876 max

    Continue? [y/n]: y
    ⠙ Submitting run...
    ⠏ Launching spicy-treefrog-1 (pulling)
    spicy-treefrog-1 provisioning completed (running)
    Service is published at ...

After the provisioning, you can interact with the model by using the OpenAI SDK:

.. code-block:: python

    from openai import OpenAI

    client = OpenAI(
        base_url="https://gateway.<gateway domain>",
        api_key="<YOUR-DSTACK-SERVER-ACCESS-TOKEN>"
    )

    completion = client.chat.completions.create(
        model="NousResearch/Llama-2-7b-chat-hf",
        messages=[
            {
                "role": "user",
                "content": "Compose a poem that explains the concept of recursion in programming.",
            }
        ]
    )

    print(completion.choices[0].message.content)

.. note::

    dstack automatically handles authentication on the gateway using dstack's tokens. Meanwhile, if you don't want to configure a gateway, you can provision a dstack `Task` instead of a `Service`. The `Task` is intended for development purposes only. For more hands-on material on serving vLLM with dstack, check out `this repository <https://github.com/dstackai/dstack-examples/tree/main/deployment/vllm>`__.
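If you do want to try the `Task` route for development, the configuration is broadly similar to the `Service` above. The following is a minimal, untested sketch; the filename `task.dstack.yml` and the exact `Task` fields (`type: task`, a `ports` list instead of `port`) are assumptions based on the `Service` example, not taken from this commit:

.. code-block:: yaml

    # Sketch only: assumes the Task schema mirrors the Service one,
    # with `type: task` and a `ports` list instead of a single `port`.
    type: task

    python: "3.11"
    env:
      - MODEL=NousResearch/Llama-2-7b-chat-hf
    ports:
      - 8000
    resources:
      gpu: 24GB
    commands:
      - pip install vllm
      - python -m vllm.entrypoints.openai.api_server --model $MODEL --port 8000

You would run it the same way as the service, e.g. `dstack run . -f task.dstack.yml`; a task can forward the configured port to your local machine, so no gateway is required (see the dstack documentation for details).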
1 change: 1 addition & 0 deletions docs/source/serving/integrations.rst
@@ -9,4 +9,5 @@ Integrations
deploying_with_triton
deploying_with_bentoml
deploying_with_lws
deploying_with_dstack
serving_with_langchain
