add example of minicpm #3854
base: master
Changes from 1 commit

llm/minicpm/README.md
@@ -0,0 +1,79 @@
📰 **Update (26 April 2024) -** SkyPilot now also supports the [**MiniCPM-2B**](https://openbmb.vercel.app/?category=Chinese+Blog/) model! Use [serve-2b.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/minicpm/serve-2b.yaml) to serve the 2B model.

📰 **Update (6 Jun 2024) -** SkyPilot now also supports the [**MiniCPM-1B**](https://openbmb.vercel.app/?category=Chinese+Blog/) model!

<p align="center">
  <img src="https://i.imgur.com/d7tEhAl.gif" alt="minicpm" width="600"/>
</p>

## References
* [MiniCPM blog](https://openbmb.vercel.app/?category=Chinese+Blog/)

## Why use SkyPilot to deploy over commercial hosted solutions?

* Get the best GPU availability by utilizing multiple resource pools across multiple regions and clouds.
* Pay the absolute minimum — SkyPilot picks the cheapest resources across regions and clouds. No managed-solution markups.
* Scale up to multiple replicas across different locations and accelerators, all served with a single endpoint.
* Everything stays in your cloud account (your VMs & buckets).
* Completely private - no one else sees your chat history.

## Running your own MiniCPM with SkyPilot

After [installing SkyPilot](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html), run your own MiniCPM model on vLLM with SkyPilot in 1-click:

1. Start serving MiniCPM on a single instance with any available GPU in the list specified in [serve-2b.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/minicpm/serve-2b.yaml), behind a vLLM-powered OpenAI-compatible endpoint. You can also switch to [serve-1b.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/minicpm/serve-1b.yaml) for the 1B model, or [serve-cpmv2_6.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/minicpm/serve-cpmv2_6.yaml) for the multimodal MiniCPM-V 2.6 model. (Each YAML also defines a `service:` section for scaling; see the sketch after this list.)

```bash
sky launch -c cpm serve-2b.yaml
```
2. Send a request to the endpoint for completion:
```bash
IP=$(sky status --ip cpm)

curl http://$IP:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openbmb/MiniCPM-2B-sft-bf16",
    "prompt": "My favorite food is",
    "max_tokens": 512
  }' | jq -r '.choices[0].text'
```

3. Send a request for chat completion:
```bash
curl http://$IP:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openbmb/MiniCPM-2B-sft-bf16",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful and honest chat expert."
      },
      {
        "role": "user",
        "content": "What is the best food?"
      }
    ],
    "max_tokens": 512
  }' | jq -r '.choices[0].message.content'
```
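
The serve YAMLs in this example each define a `service:` section (a readiness probe and a replica count), so they can also be deployed as a managed SkyServe service rather than a single cluster. A minimal sketch, assuming you name the service `cpm`:

```bash
# Launch replicas behind a single load-balanced endpoint (service name is illustrative).
sky serve up -n cpm serve-2b.yaml

# Fetch the service endpoint (host:port) and query it as before.
ENDPOINT=$(sky serve status --endpoint cpm)
curl http://$ENDPOINT/v1/models
```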

## **Optional:** Accessing MiniCPM with Chat GUI

Review comment: Should we update the name of the model here?
Reply: Already modified

It is also possible to access the MiniCPM service with a GUI using [vLLM](https://github.com/vllm-project/vllm).

1. Start the chat web UI (set the `--env` flags to the model and endpoint you are running):
```bash
sky launch -c cpm-gui ./gui.yaml --env MODEL_NAME='openbmb/MiniCPM-2B-sft-bf16' --env ENDPOINT=$(sky status --ip cpm):8000
# If you deployed a SkyServe service instead, pass its endpoint:
# sky launch -c cpm-gui ./gui.yaml --env MODEL_NAME='openbmb/MiniCPM-2B-sft-bf16' --env ENDPOINT=$(sky serve status --endpoint cpm)
```

2. Then, we can access the GUI at the returned gradio link:
```
| INFO | stdout | Running on public URL: https://6141e84201ce0bb4ed.gradio.live
```
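
When you are done, tear everything down; a short sketch, assuming the cluster and service names used above (`cpm`, `cpm-gui`):

```bash
# Stop and delete the serving and GUI clusters.
sky down cpm cpm-gui

# If you deployed with SkyServe, also take down the service.
sky serve down cpm
```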

llm/minicpm/gui.yaml
@@ -0,0 +1,44 @@
# Starts a GUI server that connects to the MiniCPM OpenAI API server.
#
# Refer to llm/minicpm/README.md for more details.
#
# Usage:
#
# 1. If you have an endpoint started on a cluster (sky launch):
#    `sky launch -c cpm-gui ./gui.yaml --env ENDPOINT=$(sky status --ip cpm):8000`
# 2. If you have a SkyPilot Service started (sky serve up) called cpm:
#    `sky launch -c cpm-gui ./gui.yaml --env ENDPOINT=$(sky serve status --endpoint cpm)`
#
# After the GUI server is started, you will see a gradio link in the output and
# you can click on it to open the GUI.

envs:
  ENDPOINT: x.x.x.x:3031  # Address of the API server running MiniCPM.
  MODEL_NAME: openbmb/MiniCPM-2B-sft-bf16

resources:
  cpus: 2

setup: |
  conda activate cpm
  if [ $? -ne 0 ]; then
    conda create -n cpm python=3.10 -y
    conda activate cpm
  fi

  # Install Gradio for web UI.
  pip install gradio openai

run: |
  conda activate cpm
  export PATH=$PATH:/sbin
  WORKER_IP=$(hostname -I | cut -d' ' -f1)
  CONTROLLER_PORT=21001
  WORKER_PORT=21002

  echo 'Starting gradio server...'
  git clone https://github.com/vllm-project/vllm.git || true
  python vllm/examples/gradio_openai_chatbot_webserver.py \
    -m $MODEL_NAME \
    --port 8811 \
    --model-url http://$ENDPOINT/v1 | tee ~/gradio.log
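
Before launching the GUI, it can help to confirm the endpoint is reachable; a minimal sketch, assuming the API server from the README is running on a cluster named `cpm` (vLLM's OpenAI-compatible server exposes `/v1/models`):

```bash
# Quick reachability check against the API server.
ENDPOINT=$(sky status --ip cpm):8000
curl http://$ENDPOINT/v1/models
```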

llm/minicpm/serve-1b.yaml
@@ -0,0 +1,41 @@
envs:
  MODEL_NAME: openbmb/MiniCPM-1B-sft-bf16

service:
  # Specifying the path to the endpoint to check the readiness of the replicas.
  readiness_probe:
    path: /v1/chat/completions
    post_data:
      model: $MODEL_NAME
      messages:
        - role: user
          content: Hello! What is your name?
      max_tokens: 1
    initial_delay_seconds: 1200
  # How many replicas to manage.
  replicas: 2

resources:
  accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB}
  disk_tier: best
  ports: 8000

setup: |
  conda activate cpm
  if [ $? -ne 0 ]; then
    conda create -n cpm python=3.10 -y
    conda activate cpm
  fi
  pip install vllm==0.5.4
  pip install flash-attn==2.5.9.post1

run: |
  conda activate cpm
  export PATH=$PATH:/sbin
  python -u -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --model $MODEL_NAME \
    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
    --max-num-seqs 16 | tee ~/openai_api_server.log
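
For reference, the `readiness_probe` above amounts to roughly the following request against a replica; a minimal sketch, where `$ENDPOINT` is a placeholder for a replica's address and port:

```bash
# Equivalent of the readiness probe's POST (model matches MODEL_NAME above).
curl http://$ENDPOINT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openbmb/MiniCPM-1B-sft-bf16",
    "messages": [{"role": "user", "content": "Hello! What is your name?"}],
    "max_tokens": 1
  }'
```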

llm/minicpm/serve-2b.yaml
@@ -0,0 +1,40 @@
envs:
  MODEL_NAME: openbmb/MiniCPM-2B-sft-bf16

service:
  # Specifying the path to the endpoint to check the readiness of the replicas.
  readiness_probe:
    path: /v1/chat/completions
    post_data:
      model: $MODEL_NAME
      messages:
        - role: user
          content: Hello! What is your name?
      max_tokens: 1
    initial_delay_seconds: 1200
  # How many replicas to manage.
  replicas: 2

resources:
  accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB}
  disk_tier: best
  ports: 8000

setup: |
  conda activate cpm
  if [ $? -ne 0 ]; then
    conda create -n cpm python=3.10 -y
    conda activate cpm
  fi
  pip install vllm==0.5.4
  pip install flash-attn==2.5.9.post1

run: |
  conda activate cpm
  export PATH=$PATH:/sbin
  python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --model $MODEL_NAME \
    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
    --max-model-len 1024 | tee ~/openai_api_server.log
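
Because `MODEL_NAME` is declared under `envs`, the same YAML can be pointed at a different MiniCPM checkpoint at launch time without editing the file; a hedged sketch, where the alternative checkpoint name is purely illustrative:

```bash
# Override the default model via SkyPilot's --env flag (checkpoint name is illustrative).
sky launch -c cpm serve-2b.yaml --env MODEL_NAME=openbmb/MiniCPM-2B-dpo-bf16
```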

llm/minicpm/serve-cpmv2_6.yaml
@@ -0,0 +1,40 @@
envs:
  MODEL_NAME: openbmb/MiniCPM-V-2_6

service:
  # Specifying the path to the endpoint to check the readiness of the replicas.
  readiness_probe:
    path: /v1/chat/completions
    post_data:
      model: $MODEL_NAME
      messages:
        - role: user
          content: Hello! What is your name?
      max_tokens: 1
    initial_delay_seconds: 1200
  # How many replicas to manage.
  replicas: 2

resources:
  accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB}
  disk_tier: best
  ports: 8000

setup: |
  conda activate cpm
  if [ $? -ne 0 ]; then
    conda create -n cpm python=3.10 -y
    conda activate cpm
  fi
  pip install vllm==0.5.4
  pip install flash-attn==2.5.9.post1

run: |
  conda activate cpm
  export PATH=$PATH:/sbin
  python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --model $MODEL_NAME \
    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
    --max-model-len 1024 | tee ~/openai_api_server.log
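
Since MiniCPM-V 2.6 is a vision-language model, chat requests can include images. A hedged sketch, assuming vLLM's OpenAI-compatible server accepts OpenAI-style `image_url` content for this model; the endpoint variable and image URL are placeholders:

```bash
ENDPOINT=$(sky status --ip cpm):8000   # or $(sky serve status --endpoint <service>) for a SkyServe deployment

curl http://$ENDPOINT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openbmb/MiniCPM-V-2_6",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image."},
          {"type": "image_url", "image_url": {"url": "https://example.com/sample.jpg"}}
        ]
      }
    ],
    "max_tokens": 128
  }' | jq -r '.choices[0].message.content'
```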

Review comment: By adding this here, we need to soft link the readme file at `llm/minicpm/README.md` to `docs/source/_gallery_original/llms/`.
Reply: Already modified