TorchServe supports serving Llama 2 in a number of ways. The examples covered in this document range from someone new to TorchServe learning how to serve Llama 2 with an app, to an advanced user of TorchServe serving Llama 2 with micro-batching and streaming responses.
This example shows how to deploy a Llama 2 chat app using TorchServe. We use Streamlit to create the app, and the model is served with llama-cpp-python.
You can run this example on your laptop to understand how to use TorchServe, how to scale TorchServe backend workers up and down, and how to experiment with batch_size to see its effect on inference time.
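For orientation, a TorchServe custom handler that wraps llama-cpp-python might look roughly like the sketch below. The GGUF filename and generation parameters are illustrative assumptions, not the exact settings shipped with the example.

```python
# Minimal sketch of a TorchServe custom handler wrapping llama-cpp-python.
# The checkpoint filename and max_tokens value are assumptions for illustration.
from ts.torch_handler.base_handler import BaseHandler
from llama_cpp import Llama


class LlamaCppHandler(BaseHandler):
    def initialize(self, ctx):
        # Load the quantized checkpoint from the model directory.
        model_dir = ctx.system_properties.get("model_dir")
        self.model = Llama(model_path=f"{model_dir}/llama-2-7b-chat.Q4_0.gguf")
        self.initialized = True

    def preprocess(self, requests):
        # Each request carries the prompt in its "data" or "body" field.
        prompts = []
        for req in requests:
            data = req.get("data") or req.get("body")
            if isinstance(data, (bytes, bytearray)):
                data = data.decode("utf-8")
            prompts.append(data)
        return prompts

    def inference(self, prompts):
        # Generate a completion for every prompt in the batch.
        return [self.model(p, max_tokens=128)["choices"][0]["text"] for p in prompts]

    def postprocess(self, outputs):
        return outputs
```

The number of backend workers and the batch size are typically controlled through the model's configuration (or the management API) rather than in the handler itself, which is what makes it easy to experiment with scaling and batch_size in this example.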
This example shows how to serve the Llama 2 70B model with limited resources using HuggingFace. It shows the following optimizations:

1) HuggingFace accelerate. This option can be activated with `low_cpu_mem_usage=True`. The model is first created on the meta device (with empty weights) and the state dict is then loaded into it (shard by shard in the case of a sharded checkpoint).
2) Quantization from bitsandbytes using `load_in_8bit=True`.
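As a rough sketch, loading the model with both optimizations through `transformers` might look like the following. The checkpoint id and the generation call are illustrative assumptions; access to the gated Meta weights is required.

```python
# Sketch: load Llama 2 70B with accelerate's low-memory loading and 8-bit quantization.
# The checkpoint id below is an assumption for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-70b-chat-hf"  # assumed checkpoint id

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",        # let accelerate place shards across available devices
    low_cpu_mem_usage=True,   # build on the meta device, then load shard by shard
    load_in_8bit=True,        # bitsandbytes 8-bit quantization
    torch_dtype=torch.float16,
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```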
This example shows how to serve the Llama 2 model on AWS Inferentia2 for text completion with micro batching and streaming response support.
Inferentia2 uses the Neuron SDK, which is built on top of the PyTorch XLA stack. For large model inference, the transformers-neuronx package is used, which takes care of model partitioning and running inference.
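To illustrate how transformers-neuronx is typically used, here is a minimal text-completion sketch. The checkpoint path, tensor-parallel degree, precision, and sequence length are assumptions for illustration and may differ from the example's actual configuration.

```python
# Minimal sketch of Llama 2 text completion with transformers-neuronx on Inferentia2.
# model_dir, tp_degree, amp, batch_size and sequence_length are illustrative assumptions.
from transformers import AutoTokenizer
from transformers_neuronx.llama.model import LlamaForSampling

model_dir = "./llama-2-13b-split"  # assumed path to a pre-split checkpoint

# Load the model and compile/partition it across NeuronCores.
model = LlamaForSampling.from_pretrained(
    model_dir, batch_size=1, tp_degree=8, amp="f16"
)
model.to_neuron()

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")
input_ids = tokenizer("Hello, my name is", return_tensors="pt").input_ids

# Autoregressive sampling runs on the Neuron devices.
generated = model.sample(input_ids, sequence_length=128)
print(tokenizer.decode(generated[0]))
```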