OSS LLM Inference Template

License

This template is licensed under Apache 2.0 and contains the following open source components:

About this template

This project shows how to generate text output from a fine-tuned LLM (Falcon-7b fine-tuned for summarization) using different inference frameworks. It also contains code that deploys the fine-tuned LLM as a Model API and as an app in Domino. Please note that the execution time to generate output will differ based on the hardware and model you are using; the notebooks in this project were run on 1 V100 GPU with 24GB of VRAM.

In general, ctranslate2 is a good choice for running LLMs on CPUs and GPU accelerators and is highly performant, while vLLM is best suited for use cases that require scale, since it can be backed by Ray. The native Huggingface option is good for prototyping, development, and small-scale use cases that leverage GPUs.
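
To make the comparison concrete, the snippet below shows the basic ctranslate2 generation pattern this project uses (bench_ct2.ipynb and the app follow the same shape). It is a minimal sketch: the model directory name and the sampling parameters are illustrative assumptions, not values taken from the notebooks.

import ctranslate2
import transformers

# Illustrative paths and parameters; point model_path at the directory
# produced by the conversion step.
model_path = "falcon7b_ct2"
tokenizer = transformers.AutoTokenizer.from_pretrained("tiiuae/falcon-7b")
generator = ctranslate2.Generator(model_path, device="cuda")

prompt = "Summarize: The quick brown fox jumps over the lazy dog."
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))

# generate_batch takes token strings (not ids) and returns one result per prompt.
results = generator.generate_batch(
    [tokens],
    max_length=128,
    sampling_temperature=0.7,
    sampling_topk=10,
)
print(tokenizer.decode(results[0].sequences_ids[0]))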

Here is a list of the important files in the project that you might need to edit in order to customize it further for your use case. Illustrative sketches of the patterns these files use appear after the list.

  • ft_falcon7b_8bit_lora.ipynb : This notebook contains code to fine-tune a LoRA adapter for the Falcon-7b model to perform summarization. The code also logs training metrics to MLflow, which can be viewed in the Experiments section of the project (see the LoRA sketch after this list).

  • convert_hf_ct.ipynb : This notebook contains code to convert a Huggingface model to a ctranslate2 model. ctranslate2 does not support adapters out of the box, so we merge the adapter into the model and export it for subsequent use (see the conversion sketch after this list).

  • bench_ct2.ipynb : This notebook contains code that loads a ctranslate2 model and generates output from it, following the same pattern as the sketch above.

  • bench_hf.ipynb : This notebook contains code that loads a Huggingface model and generates output from it (see the Huggingface sketch after this list).

  • bench_vllm.ipynb : This notebook contains code that uses vLLM to generate output from a Huggingface model that has the summarization adapter attached to it (see the vLLM sketch after this list).

  • app.sh : The shell script needed to run the chat app.

  • app.py : Streamlit app code for the summarization app. The app uses the ctranslate2 model to generate responses (see the Streamlit sketch after this list).

  • model.py : This file has sample code that shows how to use the ctranslate2 model as a Model API (see the Model API sketch after this list). Please ensure that the build pods have enough resources to build the API and that the Model API has the right resource quota assigned to it.
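
The LoRA setup in ft_falcon7b_8bit_lora.ipynb follows the standard peft + bitsandbytes pattern. The sketch below is illustrative only; the rank, alpha, dropout, and target module values are assumptions, not the notebook's exact settings.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load Falcon-7b in 8-bit (bitsandbytes) so it fits on a single GPU.
model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b",
    load_in_8bit=True,
    trust_remote_code=True,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Attach a trainable LoRA adapter; only the adapter weights are updated.
lora_config = LoraConfig(
    r=16,                                # illustrative rank
    lora_alpha=32,
    target_modules=["query_key_value"],  # Falcon's fused attention projection
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()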
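
convert_hf_ct.ipynb merges the adapter into the base weights before export, because ctranslate2 cannot load adapters directly. A minimal sketch of that flow, with assumed paths:

from transformers import AutoModelForCausalLM
from peft import PeftModel
from ctranslate2.converters import TransformersConverter

# Attach the trained adapter and fold its weights into the base model
# so the exported checkpoint is self-contained.
base = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b", trust_remote_code=True)
model = PeftModel.from_pretrained(base, "falcon7b_summarization_adapter")  # assumed path
model.merge_and_unload().save_pretrained("falcon7b_merged")

# Convert the merged model to the ctranslate2 format.
TransformersConverter("falcon7b_merged", trust_remote_code=True).convert(
    "falcon7b_ct2", quantization="float16"
)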
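
bench_hf.ipynb uses the native transformers generate API. A minimal sketch, again with the adapter path assumed:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")
model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(model, "falcon7b_summarization_adapter")  # assumed path

inputs = tokenizer("Summarize: ...", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))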
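
bench_vllm.ipynb drives generation through vLLM's offline LLM class. vLLM serves a plain Huggingface checkpoint, so the sketch below assumes the merged (base + adapter) model directory produced by the conversion step:

from vllm import LLM, SamplingParams

# Model path is an assumption; vLLM handles batching internally.
llm = LLM(model="falcon7b_merged", trust_remote_code=True)
sampling = SamplingParams(temperature=0.7, top_k=10, max_tokens=128)

outputs = llm.generate(["Summarize: The quick brown fox jumps over the lazy dog."], sampling)
print(outputs[0].outputs[0].text)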
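
app.py follows the usual Streamlit pattern of caching the model and generating on demand. A hedged sketch (the model path and UI strings are assumptions, not the app's actual code):

import streamlit as st
import ctranslate2
import transformers

@st.cache_resource  # load the model once per app process
def load_model():
    tokenizer = transformers.AutoTokenizer.from_pretrained("tiiuae/falcon-7b")
    generator = ctranslate2.Generator("falcon7b_ct2", device="cuda")  # assumed path
    return tokenizer, generator

st.title("Falcon-7b Summarizer")
text = st.text_area("Text to summarize")
if st.button("Summarize") and text:
    tokenizer, generator = load_model()
    tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode("Summarize: " + text))
    result = generator.generate_batch([tokens], max_length=256)
    st.write(tokenizer.decode(result[0].sequences_ids[0]))

app.sh then only needs to launch the app (streamlit run app.py plus whatever host and port flags Domino requires).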
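
model.py exposes generation as a function that the Domino Model API can call. A minimal sketch of that shape; the function name, paths, and response fields are assumptions:

import ctranslate2
import transformers

# Load once at import time so each request only pays for generation.
_tokenizer = transformers.AutoTokenizer.from_pretrained("tiiuae/falcon-7b")
_generator = ctranslate2.Generator("falcon7b_ct2", device="cuda")  # assumed path

def generate(prompt: str) -> dict:
    # Hypothetical entry point to register with the Model API.
    tokens = _tokenizer.convert_ids_to_tokens(_tokenizer.encode(prompt))
    result = _generator.generate_batch([tokens], max_length=256)
    return {"summary": _tokenizer.decode(result[0].sequences_ids[0])}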

Setup instructions

This project requires the following compute environment to be present. Please also ensure that the ‘Automatically make compatible with Domino’ checkbox is selected, and create the environment from scratch rather than duplicating an existing environment that does not use the same base image and dependencies.

LLM Inference

Hardware Requirements

The notebooks in this project require 1 V100 GPU that has 24GB of VRAM.

Environment Requirements

nvcr.io/nvidia/pytorch:22.12-py3

Dockerfile Instructions

# System-level dependency injection runs as root
USER root:root

# Validate base image pre-requisites
# Complete requirements can be found at
# https://docs.dominodatalab.com/en/latest/user_guide/a00d1b/automatic-adaptation-of-custom-images/#_pre_requisites_for_automatic_custom_image_compatibility_with_domino
RUN /opt/domino/bin/pre-check.sh

# Configure /opt/domino to prepare for Domino executions
RUN /opt/domino/bin/init.sh

# Validate the environment
RUN /opt/domino/bin/validate.sh

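# The NGC image ships its own torch build; replace it with the standard
# CUDA 11.8 wheels expected by the libraries installed below.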
RUN pip uninstall --yes torch torchvision torchaudio

RUN pip install torch --index-url https://download.pytorch.org/whl/cu118

RUN pip uninstall -y protobuf
RUN pip install "protobuf==3.20.3" "mlflow==2.6.0"
RUN pip install -q -U bitsandbytes==0.39.1 "datasets>=2.10.0,<3" "ipywidgets" "ctranslate2==3.17.1"
RUN pip install -q -U py7zr einops tensorboardX transformers peft accelerate deepspeed
RUN pip install --no-cache-dir Flask Flask-Compress Flask-Cors jsonify uWSGI streamlit streamlit-chat "vllm==0.1.3"
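
# transformer-engine and apex ship with the NGC base image and were built
# against its original torch; remove them to avoid conflicts with the
# torch build installed above.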
RUN pip uninstall --yes transformer-engine

RUN pip uninstall -y apex
