diff --git a/llm/yi/README.md b/llm/yi/README.md
new file mode 100644
index 00000000000..76fcf6151e6
--- /dev/null
+++ b/llm/yi/README.md
@@ -0,0 +1,60 @@
+# Serving Yi on Your Own Kubernetes or Cloud
+
+🤖 The Yi series models are the next generation of open-source large language models trained from scratch by [01.AI](https://www.lingyiwanwu.com/en).
+
+**Update (Sep 19, 2024) -** SkyPilot now supports the [**Yi**](https://01-ai.github.io/) models (Yi-Coder, Yi-1.5)!
+
+## Why use SkyPilot to deploy over commercial hosted solutions?
+
+* Get the best GPU availability by utilizing multiple resource pools across Kubernetes clusters and multiple regions/clouds.
+* Pay the absolute minimum: SkyPilot picks the cheapest resources across Kubernetes clusters and regions/clouds. No managed-solution markups.
+* Scale up to multiple replicas across different locations and accelerators, all served behind a single endpoint.
+* Everything stays in your Kubernetes cluster or cloud account (your VMs & buckets).
+* Completely private: no one else sees your chat history.
+
+## Running Yi models with SkyPilot
+
+After [installing SkyPilot](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html), run your own Yi model on vLLM with a single command:
+
+1. Start serving Yi-1.5 34B on a single instance, using any available GPU from the list in [yi15-34b.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/yi/yi15-34b.yaml), behind a vLLM-powered OpenAI-compatible endpoint (you can also switch to [yicoder-9b.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/yi/yicoder-9b.yaml) or [other models](https://github.com/skypilot-org/skypilot/tree/master/llm/yi) for a smaller model):
+
+```console
+sky launch -c yi yi15-34b.yaml
+```
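+
+   Provisioning and the model download can take a while. Optionally, tail the job output with `sky logs` to watch the vLLM server come up (the YAML's `run` command also mirrors the server output to `~/openai_api_server.log` on the cluster via `tee`):
+
+   ```console
+   sky logs yi
+   ```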
+2. Send a request to the endpoint for completion:
+```bash
+ENDPOINT=$(sky status --endpoint 8000 yi)
+
+curl http://$ENDPOINT/v1/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "01-ai/Yi-1.5-34B-Chat",
+    "prompt": "Who are you?",
+    "max_tokens": 512
+  }' | jq -r '.choices[0].text'
+```
+
+3. Send a request for chat completion:
+```bash
+curl http://$ENDPOINT/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "01-ai/Yi-1.5-34B-Chat",
+    "messages": [
+      {
+        "role": "system",
+        "content": "You are a helpful assistant."
+      },
+      {
+        "role": "user",
+        "content": "Who are you?"
+      }
+    ],
+    "max_tokens": 512
+  }' | jq -r '.choices[0].message.content'
+```
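+
+Since the endpoint is OpenAI-compatible, you can also query it from Python with the official `openai` client instead of `curl`. A minimal sketch, assuming `pip install openai` and that `ENDPOINT` is exported with the value obtained above:
+
+```python
+# Minimal sketch: chat with the vLLM-served Yi model through its
+# OpenAI-compatible API. Assumes the ENDPOINT environment variable holds
+# the value returned by `sky status --endpoint 8000 yi`.
+import os
+
+from openai import OpenAI
+
+client = OpenAI(
+    base_url=f"http://{os.environ['ENDPOINT']}/v1",
+    api_key="EMPTY",  # vLLM does not check the API key by default
+)
+
+response = client.chat.completions.create(
+    model="01-ai/Yi-1.5-34B-Chat",
+    messages=[
+        {"role": "system", "content": "You are a helpful assistant."},
+        {"role": "user", "content": "Who are you?"},
+    ],
+    max_tokens=512,
+)
+print(response.choices[0].message.content)
+```
+
+When you are done, shut down the cluster so the GPUs stop accruing charges:
+
+```console
+sky down yi
+```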
diff --git a/llm/yi/yi15-34b.yaml b/llm/yi/yi15-34b.yaml
new file mode 100644
index 00000000000..99fe5481d7a
--- /dev/null
+++ b/llm/yi/yi15-34b.yaml
@@ -0,0 +1,20 @@
+envs:
+  MODEL_NAME: 01-ai/Yi-1.5-34B-Chat
+
+resources:
+  accelerators: {A100:4, A100:8, A100-80GB:2, A100-80GB:4, A100-80GB:8}
+  disk_size: 1024
+  disk_tier: best
+  memory: 32+
+  ports: 8000
+
+setup: |
+  pip install vllm==0.6.1.post2
+  pip install vllm-flash-attn
+
+run: |
+  export PATH=$PATH:/sbin
+  vllm serve $MODEL_NAME \
+    --host 0.0.0.0 \
+    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
+    --max-model-len 1024 | tee ~/openai_api_server.log
diff --git a/llm/yi/yi15-6b.yaml b/llm/yi/yi15-6b.yaml
new file mode 100644
index 00000000000..879f5ffea9c
--- /dev/null
+++ b/llm/yi/yi15-6b.yaml
@@ -0,0 +1,18 @@
+envs:
+  MODEL_NAME: 01-ai/Yi-1.5-6B-Chat
+
+resources:
+  accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB}
+  disk_tier: best
+  ports: 8000
+
+setup: |
+  pip install vllm==0.6.1.post2
+  pip install vllm-flash-attn
+
+run: |
+  export PATH=$PATH:/sbin
+  vllm serve $MODEL_NAME \
+    --host 0.0.0.0 \
+    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
+    --max-model-len 1024 | tee ~/openai_api_server.log
diff --git a/llm/yi/yi15-9b.yaml b/llm/yi/yi15-9b.yaml
new file mode 100644
index 00000000000..b7ac40b4e11
--- /dev/null
+++ b/llm/yi/yi15-9b.yaml
@@ -0,0 +1,18 @@
+envs:
+  MODEL_NAME: 01-ai/Yi-1.5-9B-Chat
+
+resources:
+  accelerators: {L4:8, A10g:8, A10:8, A100:4, A100:8, A100-80GB:2, A100-80GB:4, A100-80GB:8}
+  disk_tier: best
+  ports: 8000
+
+setup: |
+  pip install vllm==0.6.1.post2
+  pip install vllm-flash-attn
+
+run: |
+  export PATH=$PATH:/sbin
+  vllm serve $MODEL_NAME \
+    --host 0.0.0.0 \
+    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
+    --max-model-len 1024 | tee ~/openai_api_server.log
diff --git a/llm/yi/yicoder-1_5b.yaml b/llm/yi/yicoder-1_5b.yaml
new file mode 100644
index 00000000000..383f88b657d
--- /dev/null
+++ b/llm/yi/yicoder-1_5b.yaml
@@ -0,0 +1,18 @@
+envs:
+  MODEL_NAME: 01-ai/Yi-Coder-1.5B-Chat
+
+resources:
+  accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB}
+  disk_tier: best
+  ports: 8000
+
+setup: |
+  pip install vllm==0.6.1.post2
+  pip install vllm-flash-attn
+
+run: |
+  export PATH=$PATH:/sbin
+  vllm serve $MODEL_NAME \
+    --host 0.0.0.0 \
+    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
+    --max-model-len 1024 | tee ~/openai_api_server.log
diff --git a/llm/yi/yicoder-9b.yaml b/llm/yi/yicoder-9b.yaml
new file mode 100644
index 00000000000..28e74b45bb5
--- /dev/null
+++ b/llm/yi/yicoder-9b.yaml
@@ -0,0 +1,18 @@
+envs:
+  MODEL_NAME: 01-ai/Yi-Coder-9B-Chat
+
+resources:
+  accelerators: {L4:8, A10g:8, A10:8, A100:4, A100:8, A100-80GB:2, A100-80GB:4, A100-80GB:8}
+  disk_tier: best
+  ports: 8000
+
+setup: |
+  pip install vllm==0.6.1.post2
+  pip install vllm-flash-attn
+
+run: |
+  export PATH=$PATH:/sbin
+  vllm serve $MODEL_NAME \
+    --host 0.0.0.0 \
+    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
+    --max-model-len 1024 | tee ~/openai_api_server.log
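Each YAML pins its model through the `MODEL_NAME` value under `envs:`, and SkyPilot lets you override `envs:` values at launch time with `--env`. The same file can therefore serve a different Yi variant of comparable size; a sketch, assuming the alternate weights fit the listed accelerators (note the `run` command still caps context at `--max-model-len 1024`):

```console
sky launch -c yi yi15-9b.yaml --env MODEL_NAME=01-ai/Yi-1.5-9B-Chat-16K
```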