add doc about serving option on dstack (vllm-project#3074)
Co-authored-by: Roger Wang <[email protected]>
1 parent a9bcc7a · commit 429d897
Showing 2 changed files with 104 additions and 0 deletions.
.. _deploying_with_dstack:

Deploying with dstack
============================

.. raw:: html

    <p align="center">
        <img src="https://i.ibb.co/71kx6hW/vllm-dstack.png" alt="vLLM_plus_dstack"/>
    </p>
vLLM can be run on a cloud-based GPU machine with `dstack <https://dstack.ai/>`__, an open-source framework for running LLMs on any cloud. This tutorial assumes that you have already configured credentials, a gateway, and GPU quotas in your cloud environment.
To install the dstack client and start the dstack server, run:

.. code-block:: console

    $ pip install "dstack[all]"
    $ dstack server
Next, to configure your dstack project, run:

.. code-block:: console

    $ mkdir -p vllm-dstack
    $ cd vllm-dstack
    $ dstack init
Next, to provision a VM instance running the LLM of your choice (`NousResearch/Llama-2-7b-chat-hf` for this example), create the following `serve.dstack.yml` file for the dstack `Service`:

.. code-block:: yaml

    type: service

    python: "3.11"
    env:
      - MODEL=NousResearch/Llama-2-7b-chat-hf
    port: 8000
    resources:
      gpu: 24GB
    commands:
      - pip install vllm
      - python -m vllm.entrypoints.openai.api_server --model $MODEL --port 8000
    model:
      format: openai
      type: chat
      name: NousResearch/Llama-2-7b-chat-hf
Then, run the following CLI for provisioning:

.. code-block:: console

    $ dstack run . -f serve.dstack.yml
    ⠸ Getting run plan...
     Configuration  serve.dstack.yml
     Project        deep-diver-main
     User           deep-diver
     Min resources  2..xCPU, 8GB.., 1xGPU (24GB)
     Max price      -
     Max duration   -
     Spot policy    auto
     Retry policy   no

     #  BACKEND  REGION       INSTANCE       RESOURCES                               SPOT  PRICE
     1  gcp      us-central1  g2-standard-4  4xCPU, 16GB, 1xL4 (24GB), 100GB (disk)  yes   $0.223804
     2  gcp      us-east1     g2-standard-4  4xCPU, 16GB, 1xL4 (24GB), 100GB (disk)  yes   $0.223804
     3  gcp      us-west1     g2-standard-4  4xCPU, 16GB, 1xL4 (24GB), 100GB (disk)  yes   $0.223804
        ...
     Shown 3 of 193 offers, $5.876 max

    Continue? [y/n]: y
    ⠙ Submitting run...
    ⠏ Launching spicy-treefrog-1 (pulling)
    spicy-treefrog-1 provisioning completed (running)
    Service is published at ...
After the provisioning, you can interact with the model by using the OpenAI SDK:

.. code-block:: python

    from openai import OpenAI

    client = OpenAI(
        base_url="https://gateway.<gateway domain>",
        api_key="<YOUR-DSTACK-SERVER-ACCESS-TOKEN>"
    )

    completion = client.chat.completions.create(
        model="NousResearch/Llama-2-7b-chat-hf",
        messages=[
            {
                "role": "user",
                "content": "Compose a poem that explains the concept of recursion in programming.",
            }
        ]
    )

    print(completion.choices[0].message.content)
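Because the service exposes vLLM's OpenAI-compatible API, you can also request token-by-token output with the OpenAI SDK's streaming mode. The snippet below is a minimal sketch that assumes the same gateway URL and dstack access token as the example above:

.. code-block:: python

    from openai import OpenAI

    # Assumes the same gateway URL and dstack access token as in the example above.
    client = OpenAI(
        base_url="https://gateway.<gateway domain>",
        api_key="<YOUR-DSTACK-SERVER-ACCESS-TOKEN>"
    )

    # stream=True makes the server return chunks as they are generated by vLLM.
    stream = client.chat.completions.create(
        model="NousResearch/Llama-2-7b-chat-hf",
        messages=[{"role": "user", "content": "Explain recursion in one sentence."}],
        stream=True,
    )

    for chunk in stream:
        # Each chunk carries an incremental piece of the reply; it can be None on the final chunk.
        if chunk.choices[0].delta.content is not None:
            print(chunk.choices[0].delta.content, end="")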
.. note::

    dstack automatically handles authentication on the gateway using dstack's tokens. If you don't want to configure a gateway, you can provision a dstack `Task` instead of a `Service`; a `Task` is intended for development purposes only. For more hands-on material on serving vLLM with dstack, check out `this repository <https://github.com/dstackai/dstack-examples/tree/main/deployment/vllm>`__.
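If you go the `Task` route for development, there is no gateway in front of the model. The sketch below is one possible way to query it, assuming dstack forwards the task's port 8000 to your local machine and that the vLLM server was started without an API key (so any placeholder key is accepted); adjust both assumptions to your setup:

.. code-block:: python

    from openai import OpenAI

    # Assumption: the dstack `Task` forwards port 8000 to localhost, and the
    # vLLM OpenAI-compatible server was started without an API key, so a
    # placeholder value is sufficient here.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    completion = client.chat.completions.create(
        model="NousResearch/Llama-2-7b-chat-hf",
        messages=[{"role": "user", "content": "Say hello from a dstack Task."}],
    )

    print(completion.choices[0].message.content)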