[LLM] Add docs for serving Llama3 (#3449)
* Add scripts for llama-3

* use vllm release instead of github

* Fix llama3 readme and yaml

* Add gallery

* Add accelerators for 8B

* fix gui.yaml

* fix comments

* Add options for 8B in yaml

* use fastchat gradio instead for better prompt

* version for fastchat

* fix --register

* add flash attn

* Fix guis

* Fix installation

* fix chat template

* fix demo

* ADD NEW

* Update llm/llama-3/README.md

Co-authored-by: Zongheng Yang <[email protected]>

* Update llm/llama-3/README.md

Co-authored-by: Zongheng Yang <[email protected]>

* address comments

* address comments

* fix title

* Add logo

* fix logo

* fix

* smaller logo

* smaller

* minor

---------

Co-authored-by: Zongheng Yang <[email protected]>
Michaelvll and concretevitamin authored Apr 19, 2024
1 parent cade827 commit 24fcb44
Showing 9 changed files with 535 additions and 4 deletions.
1 change: 1 addition & 0 deletions docs/source/_gallery_original/index.rst
@@ -36,6 +36,7 @@ Contents
Mistral 7B (Mistral AI) <https://docs.mistral.ai/self-deployment/skypilot/>
DBRX (Databricks) <llms/dbrx>
Llama-2 (Meta) <llms/llama-2>
+ Llama-3 (Meta) <llms/llama-3>
CodeLlama (Meta) <llms/codellama>
Gemma (Google) <llms/gemma>

1 change: 1 addition & 0 deletions docs/source/_gallery_original/llms/llama-3.md
1 change: 1 addition & 0 deletions docs/source/_static/custom.js
@@ -29,6 +29,7 @@ document.addEventListener('DOMContentLoaded', () => {
{ selector: '.toctree-l1 > a', text: 'Running on Kubernetes' },
{ selector: '.toctree-l1 > a', text: 'DBRX (Databricks)' },
{ selector: '.toctree-l1 > a', text: 'Ollama' },
+ { selector: '.toctree-l1 > a', text: 'Llama-3 (Meta)' },
];
newItems.forEach(({ selector, text }) => {
document.querySelectorAll(selector).forEach((el) => {
6 changes: 4 additions & 2 deletions llm/codellama/gui.yaml
@@ -38,7 +38,9 @@ run: |
"model_name": "codellama/CodeLlama-70b-Instruct-hf",
"api_base": "http://${ENDPOINT}/v1",
"api_key": "empty",
"model_path": "codellama/CodeLlama-70b-Instruct-hf"
"model_path": "codellama/CodeLlama-70b-Instruct-hf",
"anony_only": false,
"api_type": "openai"
}
}
EOF
@@ -47,4 +49,4 @@ run: |
echo 'Starting gradio server...'
python -u -m fastchat.serve.gradio_web_server --share \
-   --register-openai-compatible-models ~/model_info.json | tee ~/gradio.log
+   --register ~/model_info.json | tee ~/gradio.log
2 changes: 1 addition & 1 deletion llm/dbrx/README.md
@@ -171,7 +171,7 @@ Wait until the model is ready (this can take 10+ minutes), as indicated by these
...
(task, pid=17433) INFO 03-28 04:32:50 metrics.py:218] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
```
- :tada: **Congratulations!** :tada: You have now launched the DBRX Instruct LLM on your infra.
+ 🎉 **Congratulations!** 🎉 You have now launched the DBRX Instruct LLM on your infra.

You can play with the model via
- Standard OpenAI-compatible endpoints (e.g., `/v1/chat/completions`)
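The snippet above points at the model's OpenAI-compatible endpoint (`/v1/chat/completions`). For illustration only (not part of this commit), here is a minimal sketch of querying such an endpoint with the official `openai` Python client, assuming `openai>=1.0` is installed, the server was launched from the DBRX recipe with model `databricks/dbrx-instruct`, and `ENDPOINT` holds the `IP:port` returned by `sky status --endpoint`:

```python
import os

from openai import OpenAI

# Assumption: ENDPOINT="IP:port" of the vLLM OpenAI API server,
# e.g. ENDPOINT=$(sky status --endpoint 8081 dbrx).
client = OpenAI(base_url=f"http://{os.environ['ENDPOINT']}/v1", api_key="empty")

resp = client.chat.completions.create(
    model="databricks/dbrx-instruct",  # Assumed; use whatever model the server loaded.
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who are you?"},
    ],
)
print(resp.choices[0].message.content)
```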
354 changes: 354 additions & 0 deletions llm/llama-3/README.md

Large diffs are not rendered by default.

46 changes: 46 additions & 0 deletions llm/llama-3/gui.yaml
@@ -0,0 +1,46 @@
# Starts a GUI server that connects to the Llama-3 OpenAI API server.
#
# This works with the serving endpoint launched by llama3.yaml; please refer to
# llm/llama-3/README.md for more details.
#
# Usage:
#
#  1. If you have an endpoint started on a cluster (sky launch):
#     `sky launch -c llama3-gui ./gui.yaml --env ENDPOINT=$(sky status --endpoint 8081 llama3)`
#  2. If you have a SkyPilot Service started (sky serve up) called llama3:
#     `sky launch -c llama3-gui ./gui.yaml --env ENDPOINT=$(sky serve status --endpoint llama3)`
#
# After the GUI server is started, you will see a gradio link in the output and
# you can click on it to open the GUI.

envs:
  MODEL_NAME: meta-llama/Meta-Llama-3-70B-Instruct
  ENDPOINT: x.x.x.x:3031  # Address of the API server running llama3.

resources:
  cpus: 2

setup: |
  conda activate llama3
  if [ $? -ne 0 ]; then
    conda create -n llama3 python=3.10 -y
    conda activate llama3
  fi
  # Install Gradio for web UI.
  pip install gradio openai

run: |
  conda activate llama3
  export PATH=$PATH:/sbin
  WORKER_IP=$(hostname -I | cut -d' ' -f1)
  CONTROLLER_PORT=21001
  WORKER_PORT=21002
  echo 'Starting gradio server...'
  git clone https://github.com/vllm-project/vllm.git || true
  python vllm/examples/gradio_openai_chatbot_webserver.py \
    -m $MODEL_NAME \
    --port 8811 \
    --model-url http://$ENDPOINT/v1 \
    --stop-token-ids 128009,128001 | tee ~/gradio.log
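The `gradio_openai_chatbot_webserver.py` script invoked above is vLLM's bundled example chatbot. As a rough sketch (not the script's actual source), the same idea — a Gradio chat UI that streams from the OpenAI-compatible endpoint and passes Llama-3's stop token ids — could look like the following, assuming `gradio` and `openai` are installed and `ENDPOINT`/`MODEL_NAME` are set as in the YAML:

```python
import os

import gradio as gr
from openai import OpenAI

# Assumptions: ENDPOINT="IP:port" of the vLLM OpenAI API server; MODEL_NAME as in the YAML.
client = OpenAI(base_url=f"http://{os.environ['ENDPOINT']}/v1", api_key="empty")
model = os.environ.get("MODEL_NAME", "meta-llama/Meta-Llama-3-70B-Instruct")


def predict(message, history):
    # Gradio's ChatInterface passes history as [user, assistant] pairs.
    messages = []
    for user_msg, assistant_msg in history:
        messages.append({"role": "user", "content": user_msg})
        messages.append({"role": "assistant", "content": assistant_msg})
    messages.append({"role": "user", "content": message})

    stream = client.chat.completions.create(
        model=model,
        messages=messages,
        stream=True,
        # vLLM extension: stop on Llama-3's <|eot_id|> / <|end_of_text|> tokens.
        extra_body={"stop_token_ids": [128009, 128001]},
    )
    partial = ""
    for chunk in stream:
        partial += chunk.choices[0].delta.content or ""
        yield partial  # Stream the growing reply back to the UI.


# Serve the chat UI on the same port the YAML uses for Gradio.
gr.ChatInterface(predict).launch(server_port=8811, share=True)
```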
126 changes: 126 additions & 0 deletions llm/llama-3/llama3.yaml
@@ -0,0 +1,126 @@
# Serving Meta Llama-3 on your own infra.
#
# Usage:
#
#  HF_TOKEN=xxx sky launch llama3.yaml -c llama3 --env HF_TOKEN
#
# curl /v1/chat/completions:
#
#  ENDPOINT=$(sky status --endpoint 8081 llama3)
#
#  # We need to manually specify the stop_token_ids to make sure the model
#  # finishes on <|eot_id|>.
#  curl http://$ENDPOINT/v1/chat/completions \
#    -H "Content-Type: application/json" \
#    -d '{
#      "model": "meta-llama/Meta-Llama-3-8B-Instruct",
#      "messages": [
#        {
#          "role": "system",
#          "content": "You are a helpful assistant."
#        },
#        {
#          "role": "user",
#          "content": "Who are you?"
#        }
#      ],
#      "stop_token_ids": [128009, 128001]
#    }'
#
# Chat with the model via the Gradio UI (URLs are printed in the run logs):
#
#   Running on local URL:  http://127.0.0.1:8811
#   Running on public URL: https://<hash>.gradio.live
#
# Scale up with SkyServe:
#
#  HF_TOKEN=xxx sky serve up llama3.yaml -n llama3 --env HF_TOKEN
#
# curl /v1/chat/completions:
#
#  ENDPOINT=$(sky serve status --endpoint llama3)
#  curl -L $ENDPOINT/v1/models
#  curl -L http://$ENDPOINT/v1/chat/completions \
#    -H "Content-Type: application/json" \
#    -d '{
#      "model": "meta-llama/Meta-Llama-3-70B-Instruct",
#      "messages": [
#        {
#          "role": "system",
#          "content": "You are a helpful assistant."
#        },
#        {
#          "role": "user",
#          "content": "Who are you?"
#        }
#      ]
#    }'


envs:
  MODEL_NAME: meta-llama/Meta-Llama-3-70B-Instruct
  # MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
  HF_TOKEN: <your-huggingface-token>  # Change to your own huggingface token, or use --env to pass.

service:
  replicas: 2
  # An actual request for readiness probe.
  readiness_probe:
    path: /v1/chat/completions
    post_data:
      model: $MODEL_NAME
      messages:
        - role: user
          content: Hello! What is your name?
      max_tokens: 1

resources:
  accelerators: {L4:8, A10g:8, A10:8, A100:4, A100:8, A100-80GB:2, A100-80GB:4, A100-80GB:8}
  # accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB}  # We can use cheaper accelerators for 8B model.
  cpus: 32+
  use_spot: True
  disk_size: 512  # Ensure model checkpoints can fit.
  disk_tier: best
  ports: 8081  # Expose to internet traffic.

setup: |
  conda activate vllm
  if [ $? -ne 0 ]; then
    conda create -n vllm python=3.10 -y
    conda activate vllm
  fi
  pip install vllm==0.4.0.post1
  # Install Gradio for web UI.
  pip install gradio openai
  pip install flash-attn==2.5.7

run: |
  conda activate vllm
  echo 'Starting vllm api server...'
  # https://github.com/vllm-project/vllm/issues/3098
  export PATH=$PATH:/sbin
  # NOTE: --gpu-memory-utilization 0.95 needed for 4-GPU nodes.
  python -u -m vllm.entrypoints.openai.api_server \
    --port 8081 \
    --model $MODEL_NAME \
    --trust-remote-code --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
    --gpu-memory-utilization 0.95 \
    --max-num-seqs 64 \
    2>&1 | tee api_server.log &
  while ! `cat api_server.log | grep -q 'Uvicorn running on'`; do
    echo 'Waiting for vllm api server to start...'
    sleep 5
  done
  echo 'Starting gradio server...'
  git clone https://github.com/vllm-project/vllm.git || true
  python vllm/examples/gradio_openai_chatbot_webserver.py \
    -m $MODEL_NAME \
    --port 8811 \
    --model-url http://localhost:8081/v1 \
    --stop-token-ids 128009,128001
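As the commented `curl` examples at the top of this file note, Llama-3 needs `stop_token_ids: [128009, 128001]` so that generation stops at `<|eot_id|>`. Here is a hedged sketch of passing those ids from the `openai` Python client via vLLM's `extra_body` mechanism, assuming `openai>=1.0` and `ENDPOINT` set to the `IP:port` from `sky status --endpoint 8081 llama3` (or `sky serve status --endpoint llama3`):

```python
import os

from openai import OpenAI

# Assumption: ENDPOINT="IP:port" of the server launched by this YAML.
client = OpenAI(base_url=f"http://{os.environ['ENDPOINT']}/v1", api_key="empty")

resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-70B-Instruct",  # Or the 8B variant, if that is what you served.
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who are you?"},
    ],
    # vLLM accepts stop_token_ids as an extra request field; the OpenAI client
    # forwards such fields through extra_body.
    extra_body={"stop_token_ids": [128009, 128001]},
)
print(resp.choices[0].message.content)
```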
2 changes: 1 addition & 1 deletion llm/qwen/gui.yaml
@@ -50,4 +50,4 @@ run: |
echo 'Starting gradio server...'
python -u -m fastchat.serve.gradio_web_server --share \
-   --register-openai-compatible-models ~/model_info.json | tee ~/gradio.log
+   --register ~/model_info.json | tee ~/gradio.log
