TorchServe supports serving Llama 2 in a number of ways. The examples covered in this document range from someone new to TorchServe learning how to serve Llama 2 with an app, to an advanced user of TorchServe serving Llama 2 with micro-batching and streaming responses.
This example shows how to deploy a Llama 2 chat app using TorchServe. We use Streamlit to create the app, and the model is served with llama-cpp-python.
You can run this example on your laptop to understand how to use TorchServe, how to scale TorchServe backend workers up and down, and how to experiment with batch_size to see its effect on inference time.
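For orientation, a TorchServe custom handler that wraps llama-cpp-python might look roughly like the sketch below. The GGUF filename and generation parameters are illustrative assumptions, not the exact settings shipped with the example.

```python
# Minimal sketch of a TorchServe custom handler wrapping llama-cpp-python.
# The checkpoint filename and max_tokens value are assumptions for illustration.
from ts.torch_handler.base_handler import BaseHandler
from llama_cpp import Llama


class LlamaCppHandler(BaseHandler):
    def initialize(self, ctx):
        # Load the quantized checkpoint from the model directory.
        model_dir = ctx.system_properties.get("model_dir")
        self.model = Llama(model_path=f"{model_dir}/llama-2-7b-chat.Q4_0.gguf")
        self.initialized = True

    def preprocess(self, requests):
        # Each request carries the prompt in its "data" or "body" field.
        prompts = []
        for req in requests:
            data = req.get("data") or req.get("body")
            if isinstance(data, (bytes, bytearray)):
                data = data.decode("utf-8")
            prompts.append(data)
        return prompts

    def inference(self, prompts):
        # Generate a completion for every prompt in the batch.
        return [self.model(p, max_tokens=128)["choices"][0]["text"] for p in prompts]

    def postprocess(self, outputs):
        return outputs
```

The number of backend workers and the batch size are typically controlled through the model's configuration (or the management API) rather than in the handler itself, which is what makes it easy to experiment with scaling and batch_size in this example.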
This example shows how to serve the Llama 2 70B model with limited resources using HuggingFace. It shows the following optimizations:

1) HuggingFace accelerate. This option can be activated with `low_cpu_mem_usage=True`. The model is first created on the meta device (with empty weights) and the state dict is then loaded into it (shard by shard in the case of a sharded checkpoint).
2) Quantization from bitsandbytes using `load_in_8bit=True`.
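As a rough sketch, loading the model with both optimizations through `transformers` might look like the following. The checkpoint id and the generation call are illustrative assumptions; access to the gated Meta weights is required.

```python
# Sketch: load Llama 2 70B with accelerate's low-memory loading and 8-bit quantization.
# The checkpoint id below is an assumption for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-70b-chat-hf"  # assumed checkpoint id

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",        # let accelerate place shards across available devices
    low_cpu_mem_usage=True,   # build on the meta device, then load shard by shard
    load_in_8bit=True,        # bitsandbytes 8-bit quantization
    torch_dtype=torch.float16,
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```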
This example shows how to serve the Llama 2 model on AWS Inferentia2 for text completion with micro batching and streaming response support.
Inferentia2 uses the Neuron SDK, which is built on top of the PyTorch XLA stack. For large model inference, the transformers-neuronx package is used, which takes care of model partitioning and running inference.
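To illustrate how transformers-neuronx is typically used, here is a minimal text-completion sketch. The checkpoint path, tensor-parallel degree, precision, and sequence length are assumptions for illustration and may differ from the example's actual configuration.

```python
# Minimal sketch of Llama 2 text completion with transformers-neuronx on Inferentia2.
# model_dir, tp_degree, amp, batch_size and sequence_length are illustrative assumptions.
from transformers import AutoTokenizer
from transformers_neuronx.llama.model import LlamaForSampling

model_dir = "./llama-2-13b-split"  # assumed path to a pre-split checkpoint

# Load the model and compile/partition it across NeuronCores.
model = LlamaForSampling.from_pretrained(
    model_dir, batch_size=1, tp_degree=8, amp="f16"
)
model.to_neuron()

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")
input_ids = tokenizer("Hello, my name is", return_tensors="pt").input_ids

# Autoregressive sampling runs on the Neuron devices.
generated = model.sample(input_ids, sequence_length=128)
print(tokenizer.decode(generated[0]))
```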