diff --git a/README.md b/README.md
index 5c6bc6ca4e..76cd0100ee 100644
--- a/README.md
+++ b/README.md
@@ -77,6 +77,7 @@ Refer to [torchserve docker](docker/README.md) for details.
 
 ## 🏆 Highlighted Examples
 
+* [Serving Llama 2 with TorchServe](examples/LLM/llama2/README.md)
 * [Chatbot with Llama 2 on Mac 🦙💬](examples/LLM/llama2/chat_app)
 * [🤗 HuggingFace Transformers](examples/Huggingface_Transformers) with a [Better Transformer Integration/ Flash Attention & Xformer Memory Efficient ](examples/Huggingface_Transformers#Speed-up-inference-with-Better-Transformer)
 * [Model parallel inference](examples/Huggingface_Transformers#model-parallelism)
diff --git a/examples/LLM/llama2/README.md b/examples/LLM/llama2/README.md
new file mode 100644
index 0000000000..6959e55c68
--- /dev/null
+++ b/examples/LLM/llama2/README.md
@@ -0,0 +1,59 @@
+# Llama 2: Next generation of Meta's Language Model
+![Llama 2](./images/llama.png)
+
+TorchServe supports serving Llama 2 in a number of ways. The examples in this document range from a simple chat app for users who are new to TorchServe, to advanced use of TorchServe with micro batching and streaming responses for Llama 2.
+
+## 🦙💬 Llama 2 Chatbot
+
+### [Example Link](https://github.com/pytorch/serve/tree/master/examples/LLM/llama2/chat_app)
+
+This example shows how to deploy a Llama 2 chat app using TorchServe.
+We use [streamlit](https://github.com/streamlit/streamlit) to create the app.
+
+This example uses [llama-cpp-python](https://github.com/abetlen/llama-cpp-python).
+
+You can run this example on your laptop to understand how to use TorchServe, how to scale TorchServe backend workers up and down, and how changing batch_size affects inference time.
+
+![Chatbot Architecture](./chat_app/screenshots/architecture.png)
+
+## Llama 2 with HuggingFace
+
+### [Example Link](https://github.com/pytorch/serve/tree/master/examples/large_models/Huggingface_accelerate/llama2)
+
+This example shows how to serve the Llama 2 70B model with limited resources using [HuggingFace](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf). It shows the following optimizations:
+ 1) HuggingFace `accelerate`. This option can be activated with `low_cpu_mem_usage=True`.
+ 2) Quantization from [`bitsandbytes`](https://github.com/TimDettmers/bitsandbytes) using `load_in_8bit=True`.
+The model is first created on the meta device (with empty weights), and the state dict is then loaded into it (shard by shard in the case of a sharded checkpoint).
+
+## Llama 2 on Inferentia
+
+### [Example Link](https://github.com/pytorch/serve/tree/master/examples/large_models/inferentia2/llama2)
+
+### [PyTorch Blog](https://pytorch.org/blog/high-performance-llama/)
+
+This example shows how to serve the [Llama 2](https://huggingface.co/meta-llama) model on [AWS Inferentia2](https://aws.amazon.com/ec2/instance-types/inf2/) for text completion with [micro batching](https://github.com/pytorch/serve/tree/96450b9d0ab2a7290221f0e07aea5fda8a83efaf/examples/micro_batching) and [streaming response](https://github.com/pytorch/serve/blob/96450b9d0ab2a7290221f0e07aea5fda8a83efaf/docs/inference_api.md#curl-example-1) support.
+
+Inferentia2 uses the [Neuron SDK](https://aws.amazon.com/machine-learning/neuron/), which is built on top of the PyTorch/XLA stack. For large model inference, the [`transformers-neuronx`](https://github.com/aws-neuron/transformers-neuronx) package is used, which takes care of model partitioning and running inference.
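+
+As a quick illustration of the streaming response, the minimal sketch below shows one way a client might consume the streamed tokens; the model name and prompt are hypothetical placeholders, so adjust them to match your deployment.
+
+```python
+# Minimal client-side sketch of consuming TorchServe's streaming response.
+# Assumes TorchServe is running locally on the default inference port (8080)
+# and that a Llama 2 model is registered under the hypothetical name "llama-2-13b".
+import requests
+
+response = requests.post(
+    "http://localhost:8080/predictions/llama-2-13b",
+    data="Today the weather is really nice and I am planning on ",
+    stream=True,
+)
+
+# The generated text arrives in HTTP chunks as tokens are produced, so the
+# client can print partial output before generation finishes.
+for chunk in response.iter_content(chunk_size=None):
+    if chunk:
+        print(chunk.decode("utf-8"), end="", flush=True)
+```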
+
+![Inferentia 2 Software Stack](./images/software_stack_inf2.jpg)
diff --git a/examples/LLM/llama2/chat_app/client_app.py b/examples/LLM/llama2/chat_app/client_app.py
index a006e6139f..ae637f3e71 100644
--- a/examples/LLM/llama2/chat_app/client_app.py
+++ b/examples/LLM/llama2/chat_app/client_app.py
@@ -6,7 +6,6 @@
 # App title
 st.set_page_config(page_title="🦙💬 Llama 2 Chatbot")
 
-# Replicate Credentials
 with st.sidebar:
     st.title("🦙💬 Llama 2 Chatbot")
 
diff --git a/examples/LLM/llama2/images/llama.png b/examples/LLM/llama2/images/llama.png
new file mode 100644
index 0000000000..82673a5e65
Binary files /dev/null and b/examples/LLM/llama2/images/llama.png differ
diff --git a/examples/LLM/llama2/images/software_stack_inf2.jpg b/examples/LLM/llama2/images/software_stack_inf2.jpg
new file mode 100644
index 0000000000..e4115b69ca
Binary files /dev/null and b/examples/LLM/llama2/images/software_stack_inf2.jpg differ
diff --git a/ts_scripts/spellcheck_conf/wordlist.txt b/ts_scripts/spellcheck_conf/wordlist.txt
index a7e3a176fa..f8fe15e126 100644
--- a/ts_scripts/spellcheck_conf/wordlist.txt
+++ b/ts_scripts/spellcheck_conf/wordlist.txt
@@ -1117,4 +1117,4 @@ sharding
 quantized
 Chatbot
 LLM
-
+bitsandbytes