add example of minicpm #3854

Closed
wants to merge 2 commits into from
2 changes: 2 additions & 0 deletions README.md
@@ -26,6 +26,7 @@

----
:fire: *News* :fire:
- [Aug, 2024] Serve [**MiniCPM**](https://github.com/OpenBMB/MiniCPM) on your infra: [**example**](./llm/minicpm/)
- [Jul, 2024] [Finetune](./llm/llama-3_1-finetuning/) and [serve](./llm/llama-3_1/) **Llama 3.1** on your infra
- [Jun, 2024] Reproduce **GPT** with [llm.c](https://github.com/karpathy/llm.c/discussions/481) on any cloud: [**guide**](./llm/gpt-2/)
- [Apr, 2024] Serve and finetune [**Llama 3**](https://skypilot.readthedocs.io/en/latest/gallery/llms/llama-3.html) on any cloud or Kubernetes: [**example**](./llm/llama-3/)
@@ -156,6 +157,7 @@ To learn more, see our [Documentation](https://skypilot.readthedocs.io/en/latest
<!-- Keep this section in sync with index.rst in SkyPilot Docs -->
Runnable examples:
- LLMs on SkyPilot
- [MiniCPM](./llm/minicpm/)
- [Llama 3.1 finetuning](./llm/llama-3_1-finetuning/) and [serving](./llm/llama-3_1/)
- [GPT-2 via `llm.c`](./llm/gpt-2/)
- [Llama 3](./llm/llama-3/)
1 change: 1 addition & 0 deletions docs/source/_gallery_original/index.rst
@@ -35,6 +35,7 @@ Contents
:caption: LLM Models

Mixtral (Mistral AI) <llms/mixtral>
MiniCPM (OpenBMB) <llms/minicpm/README.md>
Mistral 7B (Mistral AI) <https://docs.mistral.ai/self-deployment/skypilot/>
DBRX (Databricks) <llms/dbrx>
Llama-2 (Meta) <llms/llama-2>
1 change: 1 addition & 0 deletions docs/source/docs/index.rst
@@ -87,6 +87,7 @@ Runnable examples:
* `Databricks DBRX <https://github.com/skypilot-org/skypilot/tree/master/llm/dbrx>`_
* `Gemma <https://github.com/skypilot-org/skypilot/tree/master/llm/gemma>`_
* `Mixtral 8x7B <https://github.com/skypilot-org/skypilot/tree/master/llm/mixtral>`_; `Mistral 7B <https://docs.mistral.ai/self-deployment/skypilot>`_ (from official Mistral team)
* `MiniCPM <https://github.com/skypilot-org/skypilot/tree/master/llm/minicpm>`_ (from official OpenBMB team)
* `Code Llama <https://github.com/skypilot-org/skypilot/tree/master/llm/codellama/>`_
* `vLLM: Serving LLM 24x Faster On the Cloud <https://github.com/skypilot-org/skypilot/tree/master/llm/vllm>`_ (from official vLLM team)
* `SGLang: Fast and Expressive LLM Serving On the Cloud <https://github.com/skypilot-org/skypilot/tree/master//llm/sglang/>`_ (from official SGLang team)
78 changes: 78 additions & 0 deletions llm/minicpm/README.md
@@ -0,0 +1,78 @@


📰 **Update (26 April 2024) -** SkyPilot now also supports the [**MiniCPM-2B**](https://openbmb.vercel.app/?category=Chinese+Blog/) model! Use [serve-2b.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/minicpm/serve-2b.yaml) to serve the 2B model.

📰 **Update (6 Jun 2024) -** SkyPilot now also supports the [**MiniCPM-1B**](https://openbmb.vercel.app/?category=Chinese+Blog/) model!

<p align="center">
<img src="https://i.imgur.com/d7tEhAl.gif" alt="minicpm" width="600"/>
</p>

## References
* [MiniCPM blog](https://openbmb.vercel.app/?category=Chinese+Blog/)

## Why use SkyPilot to deploy over commercial hosted solutions?

* Get the best GPU availability by utilizing multiple resource pools across multiple regions and clouds.
* Pay the absolute minimum: SkyPilot picks the cheapest resources across regions and clouds, with no managed-solution markups.
* Scale up to multiple replicas across different locations and accelerators, all served with a single endpoint.
* Everything stays in your cloud account (your VMs and buckets).
* Completely private: no one else sees your chat history.


## Running your own MiniCPM with SkyPilot

After [installing SkyPilot](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html), run your own MiniCPM model on vLLM with SkyPilot with a single command:

1. Start serving MiniCPM on a single instance, using any available GPU from the list specified in [serve-2b.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/minicpm/serve-2b.yaml), behind a vLLM-powered OpenAI-compatible endpoint. You can also switch to [serve-1b.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/minicpm/serve-1b.yaml) for the 1B model, or to [serve-cpmv2_6.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/minicpm/serve-cpmv2_6.yaml) for the multimodal MiniCPM-V 2.6 model:

```bash
sky launch -c cpm serve-2b.yaml
```
2. Send a request to the endpoint for completion:
```bash
# Look up the IP of the cluster launched in step 1 (named `cpm`).
IP=$(sky status --ip cpm)

curl http://$IP:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openbmb/MiniCPM-2B-sft-bf16",
    "prompt": "My favorite food is",
    "max_tokens": 512
  }' | jq -r '.choices[0].text'
```

3. Send a request for chat completion:
```bash
# The model name must match the model being served (serve-2b.yaml in step 1).
curl http://$IP:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openbmb/MiniCPM-2B-sft-bf16",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful and honest chat expert."
      },
      {
        "role": "user",
        "content": "What is the best food?"
      }
    ],
    "max_tokens": 512
  }' | jq -r '.choices[0].message.content'
```
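
The same requests can also be issued from Python. The following is a minimal sketch using only the standard library. The helper names (`build_chat_payload`, `extract_chat_text`, `post_json`, `chat`) are hypothetical and not part of SkyPilot or vLLM, and the sketch assumes the endpoint on port 8000 is serving `openbmb/MiniCPM-2B-sft-bf16` as configured in `serve-2b.yaml`.

```python
import json
import urllib.request

# Assumption: the model served by serve-2b.yaml.
MODEL = "openbmb/MiniCPM-2B-sft-bf16"


def build_chat_payload(user_message, model=MODEL, max_tokens=512):
    """Build the JSON body for POST /v1/chat/completions."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful and honest chat expert."},
            {"role": "user", "content": user_message},
        ],
        "max_tokens": max_tokens,
    }


def extract_chat_text(response):
    """Python equivalent of `jq -r '.choices[0].message.content'`."""
    return response["choices"][0]["message"]["content"]


def post_json(url, payload, timeout=60):
    """POST a JSON payload and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.load(reply)


def chat(ip, user_message):
    """Send one chat request to the endpoint at http://<ip>:8000."""
    url = f"http://{ip}:8000/v1/chat/completions"
    return extract_chat_text(post_json(url, build_chat_payload(user_message)))
```

Calling `chat(ip, "What is the best food?")` with the address printed by `sky status --ip cpm` should return the same text as the `curl` command in step 3.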


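## **Optional:** Accessing MiniCPM with a Chat GUI

It is also possible to access the MiniCPM service through a GUI, using the Gradio chatbot example from [vLLM](https://github.com/vllm-project/vllm).

1. Start the chat web UI (change the `--env` flags to match the model you are running). The command below assumes a SkyPilot service named `cpm` started with `sky serve up`; for a cluster started with `sky launch`, use `--env ENDPOINT=$(sky status --ip cpm):8000` instead: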
```bash
sky launch -c cpm-gui ./gui.yaml --env MODEL_NAME='openbmb/MiniCPM-2B-sft-bf16' --env ENDPOINT=$(sky serve status --endpoint cpm)
```

2. Then, we can access the GUI at the returned gradio link:
```
| INFO | stdout | Running on public URL: https://6141e84201ce0bb4ed.gradio.live
```
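
If the link scrolls out of view, it can be recovered from the log that `gui.yaml` writes to `~/gradio.log` on the GUI cluster. A minimal sketch, assuming the log line has the format shown above; `find_gradio_url` is a hypothetical helper, not part of SkyPilot or Gradio:

```python
import re

# Assumption: Gradio announces the share link with this exact phrase,
# as in the sample log line above.
_PUBLIC_URL = re.compile(r"Running on public URL:\s*(https://\S+)")


def find_gradio_url(log_text):
    """Return the first public Gradio URL found in the log text, or None."""
    match = _PUBLIC_URL.search(log_text)
    return match.group(1) if match else None
```

The function takes the log contents as a string, so it can be applied to the output of `cat ~/gradio.log` on the GUI cluster.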
44 changes: 44 additions & 0 deletions llm/minicpm/gui.yaml
@@ -0,0 +1,44 @@
# Starts a GUI server that connects to the MiniCPM OpenAI API server.
#
# Refer to llm/minicpm/README.md for more details.
#
# Usage:
#
#  1. If you have an endpoint started on a cluster (sky launch):
#     `sky launch -c cpm-gui ./gui.yaml --env ENDPOINT=$(sky status --ip cpm):8000`
#  2. If you have a SkyPilot Service started (sky serve up) called cpm:
#     `sky launch -c cpm-gui ./gui.yaml --env ENDPOINT=$(sky serve status --endpoint cpm)`
#
# After the GUI server is started, you will see a gradio link in the output and
# you can click on it to open the GUI.

envs:
  ENDPOINT: x.x.x.x:3031 # Address of the API server running MiniCPM.
  MODEL_NAME: openbmb/MiniCPM-2B-sft-bf16

resources:
  cpus: 2

setup: |
  conda activate cpm
  if [ $? -ne 0 ]; then
    conda create -n cpm python=3.10 -y
    conda activate cpm
  fi

  # Install Gradio for web UI.
  pip install gradio openai

run: |
  conda activate cpm
  export PATH=$PATH:/sbin
  WORKER_IP=$(hostname -I | cut -d' ' -f1)
  CONTROLLER_PORT=21001
  WORKER_PORT=21002

  echo 'Starting gradio server...'
  git clone https://github.com/vllm-project/vllm.git || true
  python vllm/examples/gradio_openai_chatbot_webserver.py \
    -m $MODEL_NAME \
    --port 8811 \
    --model-url http://$ENDPOINT/v1 | tee ~/gradio.log
41 changes: 41 additions & 0 deletions llm/minicpm/serve-1b.yaml
@@ -0,0 +1,41 @@
envs:
  MODEL_NAME: openbmb/MiniCPM-1B-sft-bf16

service:
  # Specifying the path to the endpoint to check the readiness of the replicas.
  readiness_probe:
    path: /v1/chat/completions
    post_data:
      model: $MODEL_NAME
      messages:
        - role: user
          content: Hello! What is your name?
      max_tokens: 1
    initial_delay_seconds: 1200
  # How many replicas to manage.
  replicas: 2


resources:
  accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB}
  disk_tier: best
  ports: 8000

setup: |
  conda activate cpm
  if [ $? -ne 0 ]; then
    conda create -n cpm python=3.10 -y
    conda activate cpm
  fi
  pip install vllm==0.5.4
  pip install flash-attn==2.5.9.post1

run: |
  conda activate cpm
  export PATH=$PATH:/sbin
  python -u -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --model $MODEL_NAME \
    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
    --max-num-seqs 16 | tee ~/openai_api_server.log

40 changes: 40 additions & 0 deletions llm/minicpm/serve-2b.yaml
@@ -0,0 +1,40 @@
envs:
  MODEL_NAME: openbmb/MiniCPM-2B-sft-bf16

service:
  # Specifying the path to the endpoint to check the readiness of the replicas.
  readiness_probe:
    path: /v1/chat/completions
    post_data:
      model: $MODEL_NAME
      messages:
        - role: user
          content: Hello! What is your name?
      max_tokens: 1
    initial_delay_seconds: 1200
  # How many replicas to manage.
  replicas: 2


resources:
  accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB}
  disk_tier: best
  ports: 8000

setup: |
  conda activate cpm
  if [ $? -ne 0 ]; then
    conda create -n cpm python=3.10 -y
    conda activate cpm
  fi
  pip install vllm==0.5.4
  pip install flash-attn==2.5.9.post1

run: |
  conda activate cpm
  export PATH=$PATH:/sbin
  python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --model $MODEL_NAME \
    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
    --max-model-len 1024 | tee ~/openai_api_server.log
40 changes: 40 additions & 0 deletions llm/minicpm/serve-cpmv2_6.yaml
@@ -0,0 +1,40 @@
envs:
  MODEL_NAME: openbmb/MiniCPM-V-2_6

service:
  # Specifying the path to the endpoint to check the readiness of the replicas.
  readiness_probe:
    path: /v1/chat/completions
    post_data:
      model: $MODEL_NAME
      messages:
        - role: user
          content: Hello! What is your name?
      max_tokens: 1
    initial_delay_seconds: 1200
  # How many replicas to manage.
  replicas: 2


resources:
  accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB}
  disk_tier: best
  ports: 8000

setup: |
  conda activate cpm
  if [ $? -ne 0 ]; then
    conda create -n cpm python=3.10 -y
    conda activate cpm
  fi
  pip install vllm==0.5.4
  pip install flash-attn==2.5.9.post1

run: |
  conda activate cpm
  export PATH=$PATH:/sbin
  python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --model $MODEL_NAME \
    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
    --max-model-len 1024 | tee ~/openai_api_server.log