Welcome to this GitHub repository. Here, we provide example scripts for deploying different Hugging Face models on Databricks Model Serving. These examples can also guide you in deploying other models by following similar steps.
We suggest beginning with the following scripts. The first notebook uses the MLflow `transformers` flavor to show how simple deployment can be. The second notebook uses `mlflow.pyfunc` to illustrate how to pass additional parameters or add pre-processing/post-processing around the deployed model.
- GPT-2 deployment using the MLflow `transformers` flavor
- GPT-2 deployment with `mlflow.pyfunc`
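For orientation, below is a minimal sketch of the `transformers`-flavor approach used in the first notebook. The model name, artifact path, and prompt are illustrative; the notebook itself contains the full deployment flow.

```python
import mlflow
from transformers import pipeline

# Build a small text-generation pipeline (GPT-2, as in the example notebook).
generator = pipeline("text-generation", model="gpt2")

# Log the pipeline with the MLflow `transformers` flavor so it can later be
# registered and served on Databricks Model Serving.
with mlflow.start_run():
    model_info = mlflow.transformers.log_model(
        transformers_model=generator,
        artifact_path="gpt2-text-generation",         # illustrative artifact path
        input_example="Databricks Model Serving is",   # illustrative prompt
    )

# Sanity-check the logged model locally before creating a serving endpoint.
loaded = mlflow.pyfunc.load_model(model_info.model_uri)
print(loaded.predict(["Databricks Model Serving is"]))
```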
Optimized LLM Serving enables you to take state-of-the-art open-source LLMs and deploy them on Databricks Model Serving with automatic optimizations for improved latency and throughput on GPUs. Currently, we support optimized serving for the MPT model family and will continue introducing more models with optimization support.
Use Case | Model | Deployment Script |
---|---|---|
Text generation following instructions | llama-2 | link to script |
Text generation following instructions | mpt-instruct | link to script |
Text generation following instructions | falcon-instruct | link to script |
Text generation following instructions | databricks-dolly | link to script |
Text generation following instructions | flan-t5-xl | link to script |
Text Embeddings | e5-large-v2 | link to script |
Transcription (speech to text) | whisper-large-v2 | link to script |
Image generation | stable-diffusion-2-1 | link to script |
Code generation | replit-code-v1-3b | link to script |
Simple Sentiment Analysis | bert-base-uncased-imdb | link to script |
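Once any of the models above is deployed, its endpoint can be queried over REST. The sketch below assumes a personal access token in the `DATABRICKS_TOKEN` environment variable; the workspace URL, endpoint name, and payload shape are placeholders to adjust for the model you deployed.

```python
import os
import requests

WORKSPACE_URL = "https://<your-workspace>.cloud.databricks.com"  # placeholder
ENDPOINT_NAME = "gpt2-text-generation"                           # placeholder

# Databricks Model Serving exposes deployed models at
# /serving-endpoints/<endpoint-name>/invocations.
response = requests.post(
    f"{WORKSPACE_URL}/serving-endpoints/{ENDPOINT_NAME}/invocations",
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    json={"inputs": ["Write a haiku about GPUs."]},
    timeout=60,
)
response.raise_for_status()
print(response.json())
```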
You can quantize models to reduce the computational and memory costs of running inference by representing the weights and activations with low-precision data types such as 8-bit integer (int8) instead of 16-bit brain floating point (bfloat16). With quantization, you can deploy a 13b model on a single A10 and a 7b model on a T4 GPU.
Note: Quantizing a model can degrade output quality and does not necessarily make inference faster.
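As a rough sketch of what int8 quantization looks like in practice, the snippet below loads a model with 8-bit weights using `bitsandbytes` via the `transformers` `BitsAndBytesConfig`. The model id is illustrative, and the `bitsandbytes` and `accelerate` packages must be installed; this is not the exact code used in the deployment scripts.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "databricks/dolly-v2-7b"  # illustrative; pick the model you are deploying

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # int8 weights
    device_map="auto",  # requires `accelerate`; places layers on the available GPU(s)
)
```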
Please refer to this repository for scripts that detail how to fine-tune LLMs on Databricks: https://github.com/databricks/databricks-ml-examples.
Task | Example Script |
---|---|
Calling Databricks endpoints with LangChain (see the sketch after this table) | link to script |
Payload logging using Inference Tables | link to script |
Measuring GPU Utilization | link to script |
Installing git Dependencies | link to script |
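As a quick preview of the LangChain example above, the sketch below wraps a Databricks Model Serving endpoint as a LangChain LLM. The endpoint name is a placeholder, and the import path may differ across LangChain versions; the linked script contains the full example.

```python
from langchain.llms import Databricks

# When run inside a Databricks notebook, the host and token are typically picked up
# from the runtime context; otherwise pass `host` and `api_token` explicitly.
llm = Databricks(endpoint_name="gpt2-text-generation")  # placeholder endpoint name

print(llm("What is MLflow?"))
```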
Before you start, please ensure you meet the following requirements:
- Ensure that you have Nvidia A10/A100 GPUs to run the script.
- Ensure that you have MLflow 2.3+ (MLR 13.1 beta) installed.
- Deployment requires GPU model serving. For more information on GPU model serving, contact the Databricks team or sign up here.

Here are some general guidelines for determining GPU requirements when serving a model.
GPU Type | GPU Memory | Approx Max Model Size (bfloat16) | Approx Max Model Size (int8) |
---|---|---|---|
T4 | 16 GB | 3b | 7b |
A10 | 24 GB | 7b | 20b |
4x A10 | 96 GB | 30b | 60b |
A100 | 80 GB | 30b | 60b |
4x A100 | 320 GB | 100b | |
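The sizes in the table follow a simple rule of thumb: weight memory is roughly the parameter count times the bytes per parameter, before accounting for activations and the KV cache. A small worked example:

```python
def approx_weight_memory_gb(n_params_billion: float, bytes_per_param: int) -> float:
    """Rough weight-only memory estimate, ignoring activation and KV-cache overhead."""
    return n_params_billion * 1e9 * bytes_per_param / 1e9

print(approx_weight_memory_gb(7, 2))   # 7b in bfloat16 -> ~14 GB, fits a 24 GB A10
print(approx_weight_memory_gb(13, 1))  # 13b in int8    -> ~13 GB, fits a 24 GB A10
print(approx_weight_memory_gb(7, 1))   # 7b in int8     -> ~7 GB,  fits a 16 GB T4
```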
Clone this repository and navigate to the desired script file. Follow the instructions within the script to deploy the model, ensuring you meet the requirements listed above.
Feel free to contribute to this project by forking this repo and creating pull requests. If you encounter any issues or have any questions, create an issue on this repo, and we'll try our best to respond in a timely manner.
This project is licensed under the terms of the MIT license. For the usage license of the individual models, please check the respective links provided above.