From 18eb67cb42abf88cc0ad56f84c3e489de28dc52b Mon Sep 17 00:00:00 2001 From: Jonathan Jin Date: Mon, 23 Dec 2024 11:15:44 -0500 Subject: [PATCH 01/11] Created using Colab --- ..._Hugging_Face_with_Ray_Serve,_MLflow.ipynb | 1323 +++++++++++++++++ 1 file changed, 1323 insertions(+) create mode 100644 Serving_Foundation_Models_from_Hugging_Face_with_Ray_Serve,_MLflow.ipynb diff --git a/Serving_Foundation_Models_from_Hugging_Face_with_Ray_Serve,_MLflow.ipynb b/Serving_Foundation_Models_from_Hugging_Face_with_Ray_Serve,_MLflow.ipynb new file mode 100644 index 00000000..4c5200f2 --- /dev/null +++ b/Serving_Foundation_Models_from_Hugging_Face_with_Ray_Serve,_MLflow.ipynb @@ -0,0 +1,1323 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "provenance": [], + "toc_visible": true, + "authorship_tag": "ABX9TyPKW7x903JxiHL2pqDZChKh", + "include_colab_link": true + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + }, + "language_info": { + "name": "python" + } + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "source": [ + "# Serving Foundation Models from Hugging Face with Ray Serve, MLflow\n", + "\n", + "Authored by: Jonathan Jin" + ], + "metadata": { + "id": "I17bSxxg1evl" + } + }, + { + "cell_type": "markdown", + "source": [ + "# Introduction\n", + "\n", + "This notebook explores solutions for streamlining the deployment of models from a model registry. For teams that want to productionize many models over time, investments at this \"transition point\" in the AI/ML project lifecycle can meaningfully drive down time-to-production. This can be important for a younger, smaller team that may not have the benefit of large swathes of existing infrastructure in place to form a \"golden path\" for serving online models in production.\n", + "\n", + "# Motivation\n", + "\n", + "Optimizing this stage of the model lifecycle is particularly important due to the production-facing aspect of the end result. At this stage, your model becomes, in effect, a microservice. This means that you now need to contend with all elements of service ownership, which can include:\n", + "\n", + "- Standardizing and enforcing API backwards-compatibility;\n", + "- Logging, metrics, and general observability concerns;\n", + "- Etc.\n", + "\n", + "Needing to repeat the same general-purpose setup each time you want to deploy a new model will result in development costs adding up significantly over time for you and your team. On the flip side, given the \"long tail\" of production-model ownership (assuming a productionized model is not likely to be decommissioned anytime soon), streamlining investments here can pay healthy dividends over time.\n", + "\n", + "Given all of the above, we motivate our exploration here with the following user story:\n", + "\n", + "> I would like to deploy a model from a model registry (such as [MLflow](https://mlflow.org/)) using **only the name of the model**. The less boilerplate and scaffolding that I need to replicate each time I want to deploy a new model, the better. 
I would like the ability to dynamically select between different versions of the model without needing to set up a whole new deployment to accommodate those new versions.\n" + ], + "metadata": { + "id": "IuS0daXP1lIa" + } + }, + { + "cell_type": "markdown", + "source": [ + "# Components\n", + "\n", + "For our exploration here, we'll use the following minimal stack:\n", + "\n", + "- MLflow for model registry;\n", + "- Ray Serve for model serving.\n", + "\n", + "For demonstrative purposes, we'll exclusively use off-the-shelf open-source models from Hugging Face Hub.\n", + "\n", + "We will **not** use GPUs for inference because inference performance is orthogonal to our focus here today. Needless to say, in \"real life,\" you will likely not be able to get away with serving your model with CPU compute.\n", + "\n", + "Let's install our dependencies now." + ], + "metadata": { + "id": "fXlB7AJr2foY" + } + }, + { + "cell_type": "code", + "source": [ + "!pip install \"transformers\" \"mlflow-skinny\" \"ray[serve]\"" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "collapsed": true, + "id": "HfLQGO6E2hnW", + "outputId": "c9634e63-5aaf-4e59-e970-aecb36d25b77" + }, + "execution_count": 65, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Requirement already satisfied: transformers in /usr/local/lib/python3.10/dist-packages (4.47.1)\n", + "Requirement already satisfied: mlflow-skinny in /usr/local/lib/python3.10/dist-packages (2.19.0)\n", + "Requirement already satisfied: ray[serve] in /usr/local/lib/python3.10/dist-packages (2.40.0)\n", + "Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from transformers) (3.16.1)\n", + "Requirement already satisfied: huggingface-hub<1.0,>=0.24.0 in /usr/local/lib/python3.10/dist-packages (from transformers) (0.27.0)\n", + "Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.10/dist-packages (from transformers) (1.26.4)\n", + "Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from transformers) (24.2)\n", + "Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from transformers) (6.0.2)\n", + "Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.10/dist-packages (from transformers) (2024.11.6)\n", + "Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from transformers) (2.32.3)\n", + "Requirement already satisfied: tokenizers<0.22,>=0.21 in /usr/local/lib/python3.10/dist-packages (from transformers) (0.21.0)\n", + "Requirement already satisfied: safetensors>=0.4.1 in /usr/local/lib/python3.10/dist-packages (from transformers) (0.4.5)\n", + "Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.10/dist-packages (from transformers) (4.67.1)\n", + "Requirement already satisfied: cachetools<6,>=5.0.0 in /usr/local/lib/python3.10/dist-packages (from mlflow-skinny) (5.5.0)\n", + "Requirement already satisfied: click<9,>=7.0 in /usr/local/lib/python3.10/dist-packages (from mlflow-skinny) (8.1.7)\n", + "Requirement already satisfied: cloudpickle<4 in /usr/local/lib/python3.10/dist-packages (from mlflow-skinny) (3.1.0)\n", + "Requirement already satisfied: databricks-sdk<1,>=0.20.0 in /usr/local/lib/python3.10/dist-packages (from mlflow-skinny) (0.40.0)\n", + "Requirement already satisfied: gitpython<4,>=3.1.9 in /usr/local/lib/python3.10/dist-packages (from mlflow-skinny) (3.1.43)\n", + "Requirement already satisfied: 
importlib_metadata!=4.7.0,<9,>=3.7.0 in /usr/local/lib/python3.10/dist-packages (from mlflow-skinny) (8.5.0)\n", + "Requirement already satisfied: opentelemetry-api<3,>=1.9.0 in /usr/local/lib/python3.10/dist-packages (from mlflow-skinny) (1.29.0)\n", + "Requirement already satisfied: opentelemetry-sdk<3,>=1.9.0 in /usr/local/lib/python3.10/dist-packages (from mlflow-skinny) (1.29.0)\n", + "Requirement already satisfied: protobuf<6,>=3.12.0 in /usr/local/lib/python3.10/dist-packages (from mlflow-skinny) (4.25.5)\n", + "Requirement already satisfied: sqlparse<1,>=0.4.0 in /usr/local/lib/python3.10/dist-packages (from mlflow-skinny) (0.5.3)\n", + "Requirement already satisfied: jsonschema in /usr/local/lib/python3.10/dist-packages (from ray[serve]) (4.23.0)\n", + "Requirement already satisfied: msgpack<2.0.0,>=1.0.0 in /usr/local/lib/python3.10/dist-packages (from ray[serve]) (1.1.0)\n", + "Requirement already satisfied: aiosignal in /usr/local/lib/python3.10/dist-packages (from ray[serve]) (1.3.2)\n", + "Requirement already satisfied: frozenlist in /usr/local/lib/python3.10/dist-packages (from ray[serve]) (1.5.0)\n", + "Requirement already satisfied: watchfiles in /usr/local/lib/python3.10/dist-packages (from ray[serve]) (1.0.3)\n", + "Requirement already satisfied: aiohttp-cors in /usr/local/lib/python3.10/dist-packages (from ray[serve]) (0.7.0)\n", + "Requirement already satisfied: opencensus in /usr/local/lib/python3.10/dist-packages (from ray[serve]) (0.11.4)\n", + "Requirement already satisfied: smart-open in /usr/local/lib/python3.10/dist-packages (from ray[serve]) (7.1.0)\n", + "Requirement already satisfied: aiohttp>=3.7 in /usr/local/lib/python3.10/dist-packages (from ray[serve]) (3.11.10)\n", + "Requirement already satisfied: colorful in /usr/local/lib/python3.10/dist-packages (from ray[serve]) (0.5.6)\n", + "Requirement already satisfied: prometheus-client>=0.7.1 in /usr/local/lib/python3.10/dist-packages (from ray[serve]) (0.21.1)\n", + "Requirement already satisfied: uvicorn[standard] in /usr/local/lib/python3.10/dist-packages (from ray[serve]) (0.34.0)\n", + "Requirement already satisfied: py-spy>=0.2.0 in /usr/local/lib/python3.10/dist-packages (from ray[serve]) (0.4.0)\n", + "Requirement already satisfied: starlette in /usr/local/lib/python3.10/dist-packages (from ray[serve]) (0.41.3)\n", + "Requirement already satisfied: pydantic!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.*,!=2.4.*,<3 in /usr/local/lib/python3.10/dist-packages (from ray[serve]) (2.10.3)\n", + "Requirement already satisfied: fastapi in /usr/local/lib/python3.10/dist-packages (from ray[serve]) (0.115.6)\n", + "Requirement already satisfied: virtualenv!=20.21.1,>=20.0.24 in /usr/local/lib/python3.10/dist-packages (from ray[serve]) (20.28.0)\n", + "Requirement already satisfied: grpcio>=1.42.0 in /usr/local/lib/python3.10/dist-packages (from ray[serve]) (1.68.1)\n", + "Requirement already satisfied: memray in /usr/local/lib/python3.10/dist-packages (from ray[serve]) (1.15.0)\n", + "Requirement already satisfied: aiohappyeyeballs>=2.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp>=3.7->ray[serve]) (2.4.4)\n", + "Requirement already satisfied: async-timeout<6.0,>=4.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp>=3.7->ray[serve]) (4.0.3)\n", + "Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp>=3.7->ray[serve]) (24.3.0)\n", + "Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.10/dist-packages (from aiohttp>=3.7->ray[serve]) 
(6.1.0)\n", + "Requirement already satisfied: propcache>=0.2.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp>=3.7->ray[serve]) (0.2.1)\n", + "Requirement already satisfied: yarl<2.0,>=1.17.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp>=3.7->ray[serve]) (1.18.3)\n", + "Requirement already satisfied: google-auth~=2.0 in /usr/local/lib/python3.10/dist-packages (from databricks-sdk<1,>=0.20.0->mlflow-skinny) (2.27.0)\n", + "Requirement already satisfied: gitdb<5,>=4.0.1 in /usr/local/lib/python3.10/dist-packages (from gitpython<4,>=3.1.9->mlflow-skinny) (4.0.11)\n", + "Requirement already satisfied: fsspec>=2023.5.0 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub<1.0,>=0.24.0->transformers) (2024.10.0)\n", + "Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub<1.0,>=0.24.0->transformers) (4.12.2)\n", + "Requirement already satisfied: zipp>=3.20 in /usr/local/lib/python3.10/dist-packages (from importlib_metadata!=4.7.0,<9,>=3.7.0->mlflow-skinny) (3.21.0)\n", + "Requirement already satisfied: deprecated>=1.2.6 in /usr/local/lib/python3.10/dist-packages (from opentelemetry-api<3,>=1.9.0->mlflow-skinny) (1.2.15)\n", + "Requirement already satisfied: opentelemetry-semantic-conventions==0.50b0 in /usr/local/lib/python3.10/dist-packages (from opentelemetry-sdk<3,>=1.9.0->mlflow-skinny) (0.50b0)\n", + "Requirement already satisfied: annotated-types>=0.6.0 in /usr/local/lib/python3.10/dist-packages (from pydantic!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.*,!=2.4.*,<3->ray[serve]) (0.7.0)\n", + "Requirement already satisfied: pydantic-core==2.27.1 in /usr/local/lib/python3.10/dist-packages (from pydantic!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.*,!=2.4.*,<3->ray[serve]) (2.27.1)\n", + "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (3.4.0)\n", + "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (3.10)\n", + "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (2.2.3)\n", + "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (2024.12.14)\n", + "Requirement already satisfied: distlib<1,>=0.3.7 in /usr/local/lib/python3.10/dist-packages (from virtualenv!=20.21.1,>=20.0.24->ray[serve]) (0.3.9)\n", + "Requirement already satisfied: platformdirs<5,>=3.9.1 in /usr/local/lib/python3.10/dist-packages (from virtualenv!=20.21.1,>=20.0.24->ray[serve]) (4.3.6)\n", + "Requirement already satisfied: anyio<5,>=3.4.0 in /usr/local/lib/python3.10/dist-packages (from starlette->ray[serve]) (3.7.1)\n", + "Requirement already satisfied: jsonschema-specifications>=2023.03.6 in /usr/local/lib/python3.10/dist-packages (from jsonschema->ray[serve]) (2024.10.1)\n", + "Requirement already satisfied: referencing>=0.28.4 in /usr/local/lib/python3.10/dist-packages (from jsonschema->ray[serve]) (0.35.1)\n", + "Requirement already satisfied: rpds-py>=0.7.1 in /usr/local/lib/python3.10/dist-packages (from jsonschema->ray[serve]) (0.22.3)\n", + "Requirement already satisfied: jinja2>=2.9 in /usr/local/lib/python3.10/dist-packages (from memray->ray[serve]) (3.1.4)\n", + "Requirement already satisfied: rich>=11.2.0 in /usr/local/lib/python3.10/dist-packages (from memray->ray[serve]) (13.9.4)\n", + "Requirement already satisfied: textual>=0.41.0 in 
/usr/local/lib/python3.10/dist-packages (from memray->ray[serve]) (1.0.0)\n", + "Requirement already satisfied: opencensus-context>=0.1.3 in /usr/local/lib/python3.10/dist-packages (from opencensus->ray[serve]) (0.1.3)\n", + "Requirement already satisfied: six~=1.16 in /usr/local/lib/python3.10/dist-packages (from opencensus->ray[serve]) (1.17.0)\n", + "Requirement already satisfied: google-api-core<3.0.0,>=1.0.0 in /usr/local/lib/python3.10/dist-packages (from opencensus->ray[serve]) (2.19.2)\n", + "Requirement already satisfied: wrapt in /usr/local/lib/python3.10/dist-packages (from smart-open->ray[serve]) (1.17.0)\n", + "Requirement already satisfied: h11>=0.8 in /usr/local/lib/python3.10/dist-packages (from uvicorn[standard]; extra == \"serve\"->ray[serve]) (0.14.0)\n", + "Requirement already satisfied: httptools>=0.6.3 in /usr/local/lib/python3.10/dist-packages (from uvicorn[standard]; extra == \"serve\"->ray[serve]) (0.6.4)\n", + "Requirement already satisfied: python-dotenv>=0.13 in /usr/local/lib/python3.10/dist-packages (from uvicorn[standard]; extra == \"serve\"->ray[serve]) (1.0.1)\n", + "Requirement already satisfied: uvloop!=0.15.0,!=0.15.1,>=0.14.0 in /usr/local/lib/python3.10/dist-packages (from uvicorn[standard]; extra == \"serve\"->ray[serve]) (0.21.0)\n", + "Requirement already satisfied: websockets>=10.4 in /usr/local/lib/python3.10/dist-packages (from uvicorn[standard]; extra == \"serve\"->ray[serve]) (14.1)\n", + "Requirement already satisfied: sniffio>=1.1 in /usr/local/lib/python3.10/dist-packages (from anyio<5,>=3.4.0->starlette->ray[serve]) (1.3.1)\n", + "Requirement already satisfied: exceptiongroup in /usr/local/lib/python3.10/dist-packages (from anyio<5,>=3.4.0->starlette->ray[serve]) (1.2.2)\n", + "Requirement already satisfied: smmap<6,>=3.0.1 in /usr/local/lib/python3.10/dist-packages (from gitdb<5,>=4.0.1->gitpython<4,>=3.1.9->mlflow-skinny) (5.0.1)\n", + "Requirement already satisfied: googleapis-common-protos<2.0.dev0,>=1.56.2 in /usr/local/lib/python3.10/dist-packages (from google-api-core<3.0.0,>=1.0.0->opencensus->ray[serve]) (1.66.0)\n", + "Requirement already satisfied: proto-plus<2.0.0dev,>=1.22.3 in /usr/local/lib/python3.10/dist-packages (from google-api-core<3.0.0,>=1.0.0->opencensus->ray[serve]) (1.25.0)\n", + "Requirement already satisfied: pyasn1-modules>=0.2.1 in /usr/local/lib/python3.10/dist-packages (from google-auth~=2.0->databricks-sdk<1,>=0.20.0->mlflow-skinny) (0.4.1)\n", + "Requirement already satisfied: rsa<5,>=3.1.4 in /usr/local/lib/python3.10/dist-packages (from google-auth~=2.0->databricks-sdk<1,>=0.20.0->mlflow-skinny) (4.9)\n", + "Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2>=2.9->memray->ray[serve]) (3.0.2)\n", + "Requirement already satisfied: markdown-it-py>=2.2.0 in /usr/local/lib/python3.10/dist-packages (from rich>=11.2.0->memray->ray[serve]) (3.0.0)\n", + "Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /usr/local/lib/python3.10/dist-packages (from rich>=11.2.0->memray->ray[serve]) (2.18.0)\n", + "Requirement already satisfied: mdurl~=0.1 in /usr/local/lib/python3.10/dist-packages (from markdown-it-py>=2.2.0->rich>=11.2.0->memray->ray[serve]) (0.1.2)\n", + "Requirement already satisfied: mdit-py-plugins in /usr/local/lib/python3.10/dist-packages (from markdown-it-py[linkify,plugins]>=2.1.0->textual>=0.41.0->memray->ray[serve]) (0.4.2)\n", + "Requirement already satisfied: linkify-it-py<3,>=1 in /usr/local/lib/python3.10/dist-packages (from 
markdown-it-py[linkify,plugins]>=2.1.0->textual>=0.41.0->memray->ray[serve]) (2.0.3)\n", + "Requirement already satisfied: pyasn1<0.7.0,>=0.4.6 in /usr/local/lib/python3.10/dist-packages (from pyasn1-modules>=0.2.1->google-auth~=2.0->databricks-sdk<1,>=0.20.0->mlflow-skinny) (0.6.1)\n", + "Requirement already satisfied: uc-micro-py in /usr/local/lib/python3.10/dist-packages (from linkify-it-py<3,>=1->markdown-it-py[linkify,plugins]>=2.1.0->textual>=0.41.0->memray->ray[serve]) (1.0.3)\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "# Register the Model\n", + "\n", + "First, let's define the model that we'll use for our explorations today. For simplicity's sake, we'll use a simple text translation model, where the source and destination languages are configurable at registration time. In effect, this means that different \"versions\" of the model can be registered to translate different languages, but that the underlying model architecture and weights can stay the same." + ], + "metadata": { + "id": "C0UziXBN4Szc" + } + }, + { + "cell_type": "code", + "source": [ + "import mlflow\n", + "from transformers import pipeline\n", + "\n", + "class MyTranslationModel(mlflow.pyfunc.PythonModel):\n", + " def load_context(self, context):\n", + " self.lang_from = context.model_config.get(\"lang_from\", \"en\")\n", + " self.lang_to = context.model_config.get(\"lang_to\", \"de\")\n", + "\n", + " self.input_label: str = context.model_config.get(\"input_label\", \"prompt\")\n", + "\n", + " self.model_ref: str = context.model_config.get(\"hfhub_name\", \"google-t5/t5-base\")\n", + "\n", + " self.pipeline = pipeline(\n", + " f\"translation_{self.lang_from}_to_{self.lang_to}\",\n", + " self.model_ref,\n", + " )\n", + "\n", + " def predict(self, context, model_input, params=None):\n", + " prompt = model_input[self.input_label].tolist()\n", + "\n", + " return self.pipeline(prompt)" + ], + "metadata": { + "id": "D2HsBFUa4nBM" + }, + "execution_count": 66, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "(You might be wondering why we even bothered making the input label configurable. This will be useful to us later.)\n", + "\n", + "Now that our model is defined, let's register an actual version of it. This particular version will use Google's [T5 Base](https://huggingface.co/google-t5/t5-base) model and be configured to translate from **English** to **German**." + ], + "metadata": { + "id": "-PFbVlpdIBHA" + } + }, + { + "cell_type": "code", + "source": [ + "import pandas as pd\n", + "\n", + "with mlflow.start_run():\n", + " model_info = mlflow.pyfunc.log_model(\n", + " \"translation_model\",\n", + " registered_model_name=\"translation_model\",\n", + " python_model=MyTranslationModel(),\n", + " pip_requirements=[\"transformers\"],\n", + " input_example=pd.DataFrame({\n", + " \"prompt\": [\"Hello my name is Jonathan.\"],\n", + " }),\n", + " model_config={\n", + " \"hfhub_name\": \"google-t5/t5-base\",\n", + " \"lang_from\": \"en\",\n", + " \"lang_to\": \"de\",\n", + " },\n", + " )" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "SpGCrnAx6eVf", + "outputId": "11218a74-11fa-471b-cc86-03a150b64f20" + }, + "execution_count": 67, + "outputs": [ + { + "output_type": "stream", + "name": "stderr", + "text": [ + "Device set to use cpu\n", + "Device set to use cpu\n", + "Registered model 'translation_model' already exists. 
Creating a new version of this model...\n", + "Created version '14' of model 'translation_model'.\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "Let's keep track of this exact version. This will be useful later." + ], + "metadata": { + "id": "NaUwo6E0DPbI" + } + }, + { + "cell_type": "code", + "source": [ + "en_to_de_version: str = str(model_info.registered_model_version)" + ], + "metadata": { + "id": "e0o4ICh38Pjy" + }, + "execution_count": 69, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "The registered model metadata contains some useful information for us. Most notably, the registered model version is associated with a strict **signature** that denotes the expected shape of its input and output. This will be useful to us later." + ], + "metadata": { + "id": "Jn0RU7fXDTdD" + } + }, + { + "cell_type": "code", + "source": [ + "model_info.signature" + ], + "metadata": { + "id": "ZKMgYR_jDhOA", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "7f1410df-cde3-4160-eee8-30788a402b3b" + }, + "execution_count": 70, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "inputs: \n", + " ['prompt': string (required)]\n", + "outputs: \n", + " ['translation_text': string (required)]\n", + "params: \n", + " None" + ] + }, + "metadata": {}, + "execution_count": 70 + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "# Serve the Model\n", + "\n", + "Now that our model is registered in MLflow, let's set up our serving scaffolding using [Ray Serve](https://docs.ray.io/en/latest/serve/index.html). For now, we'll limit our \"deployment\" to the following behavior:\n", + "\n", + "- Source the selected model and version from MLflow;\n", + "- Receive inference requests and return inference responses via a simple REST API." + ], + "metadata": { + "id": "iwa3o-0B9FPO" + } + }, + { + "cell_type": "code", + "source": [ + "import mlflow\n", + "import pandas as pd\n", + "\n", + "from ray import serve\n", + "from fastapi import FastAPI\n", + "\n", + "app = FastAPI()\n", + "\n", + "@serve.deployment\n", + "@serve.ingress(app)\n", + "class ModelDeployment:\n", + " def __init__(self, model_name: str = \"translation_model\", default_version: str = \"1\"):\n", + " self.model_name = model_name\n", + " self.default_version = default_version\n", + "\n", + " self.model = mlflow.pyfunc.load_model(f\"models:/{self.model_name}/{self.default_version}\")\n", + "\n", + "\n", + " @app.post(\"/serve\")\n", + " async def serve(self, input_string: str):\n", + " return self.model.predict(pd.DataFrame({\"prompt\": [input_string]}))\n", + "\n", + "deployment = ModelDeployment.bind(default_version=en_to_de_version)" + ], + "metadata": { + "id": "7OZ2lqOS9oqw" + }, + "execution_count": 74, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "You might have noticed that hard-coding `\"prompt\"` as the input label here introduces hidden coupling between the registered model's signature and the deployment implementation. We'll come back to this later.\n",
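+ "\n", + "As a quick aside, one way that coupling could be loosened is to derive the input label from the registered model's signature rather than hard-coding it. The following is a rough sketch of the idea (not wired into the deployment above; it assumes the signature's first input name is the label we want), which we'll build on properly in the \"Auto-Signature\" section below.\n", + "\n", + "```python\n", + "# Sketch: recover the input label from the registered model's signature.\n", + "sig = mlflow.models.get_model_info(\n", + " f\"models:/translation_model/{en_to_de_version}\"\n", + ").signature\n", + "input_label = sig.inputs.input_names()[0] # e.g. \"prompt\"\n", + "```\n", + "\n", + "Now, let's run the deployment and play around with it."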
+ ], + "metadata": { + "id": "f018wd2fEia7" + } + }, + { + "cell_type": "code", + "source": [ + "serve.run(deployment, blocking=False)" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "MudMnivd_DrC", + "outputId": "7f23394f-9f3e-4ce1-c67a-82c59a5bc25f" + }, + "execution_count": 75, + "outputs": [ + { + "output_type": "stream", + "name": "stderr", + "text": [ + "INFO 2024-12-23 16:00:03,032 serve 20385 -- Connecting to existing Serve app in namespace \"serve\". New http options will not be applied.\n", + "\u001b[36m(ServeController pid=27795)\u001b[0m INFO 2024-12-23 16:00:03,248 controller 27795 -- Deploying new version of Deployment(name='ModelDeployment', app='default') (initial target replicas: 1).\n", + "\u001b[36m(ServeController pid=27795)\u001b[0m INFO 2024-12-23 16:00:03,425 controller 27795 -- Stopping 1 replicas of Deployment(name='ModelDeployment', app='default') with outdated versions.\n", + "\u001b[36m(ServeController pid=27795)\u001b[0m INFO 2024-12-23 16:00:03,425 controller 27795 -- Adding 1 replica to Deployment(name='ModelDeployment', app='default').\n", + "\u001b[36m(ServeController pid=27795)\u001b[0m INFO 2024-12-23 16:00:05,548 controller 27795 -- Replica(id='ksuhh6uv', deployment='ModelDeployment', app='default') is stopped.\n", + "\u001b[36m(ServeReplica:default:ModelDeployment pid=32047)\u001b[0m 2024-12-23 16:00:21.273257: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n", + "\u001b[36m(ServeReplica:default:ModelDeployment pid=32047)\u001b[0m 2024-12-23 16:00:21.325581: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n", + "\u001b[36m(ServeReplica:default:ModelDeployment pid=32047)\u001b[0m 2024-12-23 16:00:21.341597: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n", + "\u001b[36m(ServeReplica:default:ModelDeployment pid=32047)\u001b[0m 2024-12-23 16:00:25.496368: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT\n", + "\u001b[36m(ServeController pid=27795)\u001b[0m WARNING 2024-12-23 16:00:33,573 controller 27795 -- Deployment 'ModelDeployment' in application 'default' has 1 replicas that have taken more than 30s to initialize.\n", + "\u001b[36m(ServeController pid=27795)\u001b[0m This may be caused by a slow __init__ or reconfigure method.\n", + "\u001b[36m(ServeReplica:default:ModelDeployment pid=32047)\u001b[0m Device set to use cpu\n", + "INFO 2024-12-23 16:00:36,639 serve 20385 -- Application 'default' is ready at http://127.0.0.1:8000/.\n", + "INFO 2024-12-23 16:00:36,642 serve 20385 -- Deployed app 'default' successfully.\n" + ] + }, + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "DeploymentHandle(deployment='ModelDeployment')" + ] + }, + "metadata": {}, + "execution_count": 75 + } + ] + }, + { + "cell_type": "code", + "source": [ + "import requests\n", + "\n", + "requests.post(\n", + " \"http://127.0.0.1:8000/serve/\",\n", + " params={\"input_string\": \"The weather is lovely today\"},\n", + ").json()" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "VTk1E5pp_gRz", + "outputId": 
"67a20366-f637-4a0a-8c51-0f71bf5e1ea6" + }, + "execution_count": 77, + "outputs": [ + { + "output_type": "stream", + "name": "stderr", + "text": [ + "\u001b[36m(ServeReplica:default:ModelDeployment pid=32047)\u001b[0m INFO 2024-12-23 16:00:41,540 default_ModelDeployment rekqfhvc 23cc9c43-746c-4575-968e-ee8d14972e6a -- POST /serve/ 307 5.8ms\n" + ] + }, + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "[{'translation_text': 'Das Wetter ist heute nett.'}]" + ] + }, + "metadata": {}, + "execution_count": 77 + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "This works fine, but you might have noticed that the REST API does not line up with the model signature. Namely, it uses the label `\"input_string\"` while the served model version itself uses the input label `\"prompt\"`. Similarly, the model can accept multiple inputs values, but the API only accepts one.\n", + "\n", + "If this feels [smelly](https://en.wikipedia.org/wiki/Code_smell) to you, keep reading; we'll come back to this." + ], + "metadata": { + "id": "i3CNI-mmE_22" + } + }, + { + "cell_type": "markdown", + "source": [ + "# Multiple Versions, One Endpoint\n", + "\n", + "Now we've got a basic endpoint set up for our model. Great! However, notice that this deployment is strictly tethered to a single version of this model -- specifically, version `1` of the registered `translation_model`.\n", + "\n", + "Imagine, now, that your team would like to come back and refine this model -- maybe retrain it on new data, or configure it to translate to a new language, e.g. French instead of German. Both would result in a new version of the `translation_model` getting registered. However, with our current deployment implementation, we'd need to set up a whole new endpoint for `translation_model/2`, require our users to remember which address and port corresponds to which version of the model, and so on. In other words: very cumbersome, very error-prone, very [toilsome](https://leaddev.com/velocity/what-toil-and-why-it-damaging-your-engineering-org).\n", + "\n", + "Conversely, imagine a scenario where we could reuse the exact same endpoint -- same signature, same address and port, same query conventions, etc. -- to serve both versions of this model. Our user can simply specify which version of the model they'd like to use, and we can treat one of them as the \"default\" in cases where the user didn't explicitly request one.\n", + "\n", + "This is one area where Ray Serve shines with a feature it calls [model multiplexing](https://docs.ray.io/en/latest/serve/model-multiplexing.html). In effect, this allows you to load up multiple \"versions\" of your model, dynamically hot-swapping them as needed, as well as unloading the versions that don't get used after some time. Very space-efficient, in other words.\n", + "\n", + "Let's try registering another version of the model -- this time, one that translates from English to French. We'll register this under the version `\"2\"`; the model server will retrieve the model version that way.\n", + "\n", + "But first, let's extend the model server with multiplexing support." 
+ ], + "metadata": { + "id": "hsJ65rNNDMVj" + } + }, + { + "cell_type": "code", + "source": [ + "from ray import serve\n", + "from fastapi import FastAPI\n", + "\n", + "app = FastAPI()\n", + "\n", + "@serve.deployment\n", + "@serve.ingress(app)\n", + "class MultiplexedModelDeployment:\n", + "\n", + " @serve.multiplexed(max_num_models_per_replica=2)\n", + " async def get_model(self, version: str):\n", + " return mlflow.pyfunc.load_model(f\"models:/{self.model_name}/{version}\")\n", + "\n", + " def __init__(\n", + " self,\n", + " model_name: str = \"translation_model\",\n", + " default_version: str = en_to_de_version,\n", + " ):\n", + " self.model_name = model_name\n", + " self.default_version = default_version\n", + "\n", + " @app.post(\"/serve\")\n", + " async def serve(self, input_string: str):\n", + " model = await self.get_model(serve.get_multiplexed_model_id())\n", + " return model.predict(pd.DataFrame({\"prompt\": [input_string]}))" + ], + "metadata": { + "id": "d8GcI3WLE3Sc" + }, + "execution_count": 78, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "multiplexed_deployment = MultiplexedModelDeployment.bind(model_name=\"translation_model\")\n", + "serve.run(multiplexed_deployment, blocking=False)" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "f-gisRU_FKlJ", + "outputId": "a0c7318d-8271-4163-d58d-9ed97df72266" + }, + "execution_count": 79, + "outputs": [ + { + "output_type": "stream", + "name": "stderr", + "text": [ + "INFO 2024-12-23 16:01:13,932 serve 20385 -- Connecting to existing Serve app in namespace \"serve\". New http options will not be applied.\n", + "\u001b[36m(ServeController pid=27795)\u001b[0m INFO 2024-12-23 16:01:14,037 controller 27795 -- Deploying new version of Deployment(name='MultiplexedModelDeployment', app='default') (initial target replicas: 1).\n", + "\u001b[36m(ProxyActor pid=27796)\u001b[0m INFO 2024-12-23 16:01:14,042 proxy 172.28.0.12 -- Got updated endpoints: {Deployment(name='MultiplexedModelDeployment', app='default'): EndpointInfo(route='/', app_is_cross_language=False)}.\n", + "\u001b[36m(ServeController pid=27795)\u001b[0m INFO 2024-12-23 16:01:14,144 controller 27795 -- Removing 1 replica from Deployment(name='ModelDeployment', app='default').\n", + "\u001b[36m(ServeController pid=27795)\u001b[0m INFO 2024-12-23 16:01:14,144 controller 27795 -- Adding 1 replica to Deployment(name='MultiplexedModelDeployment', app='default').\n", + "\u001b[36m(ServeController pid=27795)\u001b[0m INFO 2024-12-23 16:01:16,310 controller 27795 -- Replica(id='rekqfhvc', deployment='ModelDeployment', app='default') is stopped.\n", + "INFO 2024-12-23 16:01:19,109 serve 20385 -- Application 'default' is ready at http://127.0.0.1:8000/.\n", + "INFO 2024-12-23 16:01:19,112 serve 20385 -- Deployed app 'default' successfully.\n" + ] + }, + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "DeploymentHandle(deployment='MultiplexedModelDeployment')" + ] + }, + "metadata": {}, + "execution_count": 79 + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "Now let's actually register the new model version." 
+ ], + "metadata": { + "id": "Qs7snXhxdlUR" + } + }, + { + "cell_type": "code", + "source": [ + "import pandas as pd\n", + "\n", + "with mlflow.start_run():\n", + " model_info = mlflow.pyfunc.log_model(\n", + " \"translation_model\",\n", + " registered_model_name=\"translation_model\",\n", + " python_model=MyTranslationModel(),\n", + " pip_requirements=[\"transformers\"],\n", + " input_example=pd.DataFrame({\n", + " \"prompt\": [\n", + " \"Hello my name is Jon.\",\n", + " ],\n", + " }),\n", + " model_config={\n", + " \"hfhub_name\": \"google-t5/t5-base\",\n", + " \"lang_from\": \"en\",\n", + " \"lang_to\": \"fr\",\n", + " },\n", + " )\n", + "\n", + "en_to_fr_version: str = str(model_info.registered_model_version)" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "K3_essFBEuCo", + "outputId": "b7f4f9e7-62bf-40ae-ed8a-db0110ad2e4f" + }, + "execution_count": 80, + "outputs": [ + { + "output_type": "stream", + "name": "stderr", + "text": [ + "Device set to use cpu\n", + "Device set to use cpu\n", + "Registered model 'translation_model' already exists. Creating a new version of this model...\n", + "Created version '15' of model 'translation_model'.\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "Now that that's registered, we can query for it via the model server like so..." + ], + "metadata": { + "id": "rxOzkg65dnZW" + } + }, + { + "cell_type": "code", + "source": [ + "import requests\n", + "\n", + "requests.post(\n", + " \"http://127.0.0.1:8000/serve/\",\n", + " params={\"input_string\": \"The weather is lovely today\"},\n", + " headers={\"serve_multiplexed_model_id\": en_to_fr_version},\n", + ").json()" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "EyeLmnPJFuRH", + "outputId": "9dfb8df0-f207-42ae-b78b-db51d8843c15" + }, + "execution_count": 81, + "outputs": [ + { + "output_type": "stream", + "name": "stderr", + "text": [ + "\u001b[36m(ServeReplica:default:MultiplexedModelDeployment pid=32383)\u001b[0m INFO 2024-12-23 16:01:41,179 default_MultiplexedModelDeployment hnpendkt 1943df13-e56a-47d0-a49f-55fb78aa665b -- POST /serve/ 307 4.3ms\n", + "\u001b[36m(ServeReplica:default:MultiplexedModelDeployment pid=32383)\u001b[0m INFO 2024-12-23 16:01:43,214 default_MultiplexedModelDeployment hnpendkt ee559e3e-a71d-48aa-8c24-10de5d7ad7df -- Loading model '15'.\n", + "\u001b[36m(ServeReplica:default:MultiplexedModelDeployment pid=32383)\u001b[0m 2024-12-23 16:01:52.414753: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n", + "\u001b[36m(ServeReplica:default:MultiplexedModelDeployment pid=32383)\u001b[0m 2024-12-23 16:01:52.472202: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n", + "\u001b[36m(ServeReplica:default:MultiplexedModelDeployment pid=32383)\u001b[0m 2024-12-23 16:01:52.491131: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n", + "\u001b[36m(ServeReplica:default:MultiplexedModelDeployment pid=32383)\u001b[0m 2024-12-23 16:01:55.152832: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT\n", + 
"\u001b[36m(ServeReplica:default:MultiplexedModelDeployment pid=32383)\u001b[0m Device set to use cpu\n", + "\u001b[36m(ServeReplica:default:MultiplexedModelDeployment pid=32383)\u001b[0m INFO 2024-12-23 16:02:00,506 default_MultiplexedModelDeployment hnpendkt ee559e3e-a71d-48aa-8c24-10de5d7ad7df -- Successfully loaded model '15' in 17292.0ms.\n" + ] + }, + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "[{'translation_text': \"Le temps est beau aujourd'hui\"}]" + ] + }, + "metadata": {}, + "execution_count": 81 + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "Note how we were able to immediately access the model version **without redeploying the model server**. Ray Serve's multiplexing capabilities allow it to dynamically fetch the model weights in a just-in-time fashion; if I never requested version 2, it never gets loaded. This helps conserve compute resources for the models that **do** get queried. What's even more useful is that, if the number of models loaded up exceeds the configured maximum (`max_num_models_per_replica`), the [least-recently used model version will get evicted](https://docs.ray.io/en/latest/serve/model-multiplexing.html#why-model-multiplexing).\n", + "\n", + "Given that we set `max_num_models_per_replica=2` above, the \"default\" English-to-German version of the model should still be loaded up and readily available to serve requests without any cold-start time. Let's confirm that now:" + ], + "metadata": { + "id": "jVMCS4CedudN" + } + }, + { + "cell_type": "code", + "source": [ + "requests.post(\n", + " \"http://127.0.0.1:8000/serve/\",\n", + " params={\"input_string\": \"The weather is lovely today\"},\n", + " headers={\"serve_multiplexed_model_id\": en_to_de_version},\n", + ").json()" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "jEJFQNlwGGKh", + "outputId": "b847d92e-fe0f-4439-bd87-e5773680c4d1" + }, + "execution_count": 83, + "outputs": [ + { + "output_type": "stream", + "name": "stderr", + "text": [ + "\u001b[36m(ServeReplica:default:MultiplexedModelDeployment pid=32383)\u001b[0m INFO 2024-12-23 16:02:13,267 default_MultiplexedModelDeployment hnpendkt 8e680170-df74-49ba-856c-a7e9009abaab -- POST /serve/ 307 26.0ms\n" + ] + }, + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "[{'translation_text': 'Das Wetter ist heute nett.'}]" + ] + }, + "metadata": {}, + "execution_count": 83 + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "## Auto-Signature\n", + "\n", + "This is all well and good. However, notice that the following friction point still exists: when defining the server, we need to define a whole new signature for the API itself. At best, this is just some code duplication of the model signature itself (which is registered in MLflow). At worst, this can result in inconsistent APIs across all models that your team or organization owns, which can cause confusion and frustration in your downstream dependencies.\n", + "\n", + "In this particular case, it means that `MultiplexedModelDeployment` is secretly actually **tightly coupled** to the use-case for `translation_model`. What if we wanted to deploy another set of models that don't have to do with language translation? 
The defined `/serve` API, which returns a JSON object that looks like `{\"translation_text\": \"foo\"}`, would no longer make sense.\n", + "\n", + "To address this issue, **what if the API signature for `MultiplexedModelDeployment` could automatically mirror the signature of the underlying models it's serving**?\n", + "\n", + "Thankfully, with MLflow Model Registry metadata and some Python dynamic-class-creation shenanigans, this is entirely possible.\n", + "\n", + "Let's set things up so that the model server signature is inferred from the registered model itself. Since different versions of an MLflow model can have different signatures, we'll use the \"default version\" to \"pin\" the signature; any attempt to multiplex a model version whose signature is incompatible with the default version's will throw an error.\n", + "\n", + "Since Ray Serve binds the request and response signatures at class-definition time, we will generate the deployment class dynamically inside a factory function, parameterized by the specified model name and default model version." + ], + "metadata": { + "id": "D8CgPXcsIg5C" + } + }, + { + "cell_type": "code", + "source": [ + "import mlflow\n", + "import pydantic\n", + "\n", + "def schema_to_pydantic(schema: mlflow.types.schema.Schema, *, name: str) -> type[pydantic.BaseModel]:\n", + " return pydantic.create_model(\n", + " name,\n", + " **{\n", + " k: (v.type.to_python(), pydantic.Field(...)) # Ellipsis marks the field as required\n", + " for k, v in schema.input_dict().items()\n", + " }\n", + " )\n", + "\n", + "def get_req_resp_signatures(model_signature: mlflow.models.ModelSignature) -> tuple[type[pydantic.BaseModel], type[pydantic.BaseModel]]:\n", + " inputs: mlflow.types.schema.Schema = model_signature.inputs\n", + " outputs: mlflow.types.schema.Schema = model_signature.outputs\n", + "\n", + " return (schema_to_pydantic(inputs, name=\"InputModel\"), schema_to_pydantic(outputs, name=\"OutputModel\"))" + ], + "metadata": { + "id": "u9GPbQrnP7OD" + }, + "execution_count": 84, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "import mlflow\n", + "\n", + "from fastapi import FastAPI, Response, status\n", + "from ray import serve\n", + "from typing import List\n", + "\n", + "def deployment_from_model_name(model_name: str, default_version: str = \"1\"):\n", + " app = FastAPI()\n", + " model_info = mlflow.models.get_model_info(f\"models:/{model_name}/{default_version}\")\n", + " input_datamodel, output_datamodel = get_req_resp_signatures(model_info.signature)\n", + "\n", + " @serve.deployment\n", + " @serve.ingress(app)\n", + " class DynamicallyDefinedDeployment:\n", + "\n", + " MODEL_NAME: str = model_name\n", + " DEFAULT_VERSION: str = default_version\n", + "\n", + " @serve.multiplexed(max_num_models_per_replica=2)\n", + " async def get_model(self, model_version: str):\n", + " model = mlflow.pyfunc.load_model(f\"models:/{self.MODEL_NAME}/{model_version}\")\n", + "\n", + " if model.metadata.get_model_info().signature != model_info.signature:\n", + " raise ValueError(f\"Requested version {model_version} has signature incompatible with that of default version {self.DEFAULT_VERSION}\")\n", + " return model\n", + "\n", + " # TODO: Extend this to support batching (lists of inputs and outputs)\n", + " @app.post(\"/serve\", response_model=List[output_datamodel])\n", + " async def serve(self, model_input: input_datamodel, response: Response):\n", + " model_id = serve.get_multiplexed_model_id()\n", + " if model_id == \"\":\n", + " model_id = self.DEFAULT_VERSION\n", + "\n", + " try:\n", + " model = await self.get_model(model_id)\n", + " except ValueError:\n", + " response.status_code = status.HTTP_409_CONFLICT\n", + " return [{\"translation_text\": \"FAILED\"}]\n", + "\n", + " return model.predict(model_input.dict())\n", + "\n", + " return DynamicallyDefinedDeployment\n", + "\n", + "deployment = deployment_from_model_name(\"translation_model\", default_version=en_to_fr_version)\n", + "\n", + "serve.run(deployment.bind(), blocking=False)" + ], + "metadata": { + "id": "PgetOY1LKp6m", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "ada066e3-72b3-42af-c284-41118fcb2e20" + }, + "execution_count": 95, + "outputs": [ + { + "output_type": "stream", + "name": "stderr", + "text": [ + "INFO 2024-12-23 16:06:17,054 serve 20385 -- Connecting to existing Serve app in namespace \"serve\". New http options will not be applied.\n", + "\u001b[36m(ServeController pid=27795)\u001b[0m INFO 2024-12-23 16:06:17,244 controller 27795 -- Deploying new version of Deployment(name='DynamicallyDefinedDeployment', app='default') (initial target replicas: 1).\n", + "\u001b[36m(ServeController pid=27795)\u001b[0m INFO 2024-12-23 16:06:17,368 controller 27795 -- Stopping 1 replicas of Deployment(name='DynamicallyDefinedDeployment', app='default') with outdated versions.\n", + "\u001b[36m(ServeController pid=27795)\u001b[0m INFO 2024-12-23 16:06:17,368 controller 27795 -- Adding 1 replica to Deployment(name='DynamicallyDefinedDeployment', app='default').\n", + "\u001b[36m(ServeReplica:default:DynamicallyDefinedDeployment pid=33001)\u001b[0m INFO 2024-12-23 16:06:19,388 default_DynamicallyDefinedDeployment iwidgax2 -- Unloading model '15'.\n", + "\u001b[36m(ServeReplica:default:DynamicallyDefinedDeployment pid=33001)\u001b[0m INFO 2024-12-23 16:06:19,394 default_DynamicallyDefinedDeployment iwidgax2 -- Successfully unloaded model '15' in 0.4ms.\n", + "\u001b[36m(ServeController pid=27795)\u001b[0m INFO 2024-12-23 16:06:19,538 controller 27795 -- Replica(id='iwidgax2', deployment='DynamicallyDefinedDeployment', app='default') is stopped.\n", + "INFO 2024-12-23 16:06:38,966 serve 20385 -- Application 'default' is ready at http://127.0.0.1:8000/.\n", + "INFO 2024-12-23 16:06:38,968 serve 20385 -- Deployed app 'default' successfully.\n" + ] + }, + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "DeploymentHandle(deployment='DynamicallyDefinedDeployment')" + ] + }, + "metadata": {}, + "execution_count": 95 + } + ] + }, + { + "cell_type": "code", + "source": [ + "import requests\n", + "\n", + "resp = requests.post(\n", + " \"http://127.0.0.1:8000/serve/\",\n", + " json={\"prompt\": \"The weather is lovely today\"},\n", + ")\n", + "\n", + "assert resp.ok\n", + "assert resp.status_code == 200\n", + "\n", + "resp.json()" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "x911zDhomWMj", + "outputId": "7dc78df7-4f06-4871-d45f-37cfb852ffc5" + }, + "execution_count": 88, + "outputs": [ + { + "output_type": "stream", + "name": "stderr", + "text": [ + "\u001b[36m(ServeReplica:default:DynamicallyDefinedDeployment pid=33001)\u001b[0m INFO 2024-12-23 16:03:30,503 default_DynamicallyDefinedDeployment iwidgax2 8989a73b-3173-48d0-a0dc-d301363e731c -- POST /serve/ 307 10.8ms\n", + "\u001b[36m(ServeReplica:default:DynamicallyDefinedDeployment pid=33001)\u001b[0m INFO 2024-12-23 16:03:30,544 default_DynamicallyDefinedDeployment iwidgax2 e00d9137-a259-4954-8a12-81a3314bc5d2 -- Loading model '15'.\n", + "\u001b[36m(ServeReplica:default:DynamicallyDefinedDeployment pid=33001)\u001b[0m 2024-12-23 16:03:38.056305: E 
external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n", + "\u001b[36m(ServeReplica:default:DynamicallyDefinedDeployment pid=33001)\u001b[0m 2024-12-23 16:03:38.085864: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n", + "\u001b[36m(ServeReplica:default:DynamicallyDefinedDeployment pid=33001)\u001b[0m 2024-12-23 16:03:38.098177: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n", + "\u001b[36m(ServeReplica:default:DynamicallyDefinedDeployment pid=33001)\u001b[0m 2024-12-23 16:03:39.580308: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT\n", + "\u001b[36m(ServeReplica:default:DynamicallyDefinedDeployment pid=33001)\u001b[0m Device set to use cpu\n", + "\u001b[36m(ServeReplica:default:DynamicallyDefinedDeployment pid=33001)\u001b[0m INFO 2024-12-23 16:03:50,242 default_DynamicallyDefinedDeployment iwidgax2 e00d9137-a259-4954-8a12-81a3314bc5d2 -- Successfully loaded model '15' in 19697.5ms.\n", + "\u001b[36m(ServeReplica:default:DynamicallyDefinedDeployment pid=33001)\u001b[0m :40: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.10/migration/\n" + ] + }, + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "[{'translation_text': \"Le temps est beau aujourd'hui\"}]" + ] + }, + "metadata": {}, + "execution_count": 88 + } + ] + }, + { + "cell_type": "code", + "source": [ + "import requests\n", + "\n", + "resp = requests.post(\n", + " \"http://127.0.0.1:8000/serve/\",\n", + " json={\"prompt\": \"The weather is lovely today\"},\n", + " headers={\"serve_multiplexed_model_id\": str(en_to_fr_version)},\n", + ")\n", + "\n", + "assert resp.ok\n", + "assert resp.status_code == 200\n", + "\n", + "resp.json()" + ], + "metadata": { + "id": "EX7ff2wg5PjL", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "edf0587a-abf5-4160-a621-f9ac4faee6bf" + }, + "execution_count": 89, + "outputs": [ + { + "output_type": "stream", + "name": "stderr", + "text": [ + "\u001b[36m(ServeReplica:default:DynamicallyDefinedDeployment pid=33001)\u001b[0m INFO 2024-12-23 16:03:57,563 default_DynamicallyDefinedDeployment iwidgax2 df6b7526-edee-486a-a06e-f15407d4e1aa -- POST /serve/ 307 7.2ms\n" + ] + }, + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "[{'translation_text': \"Le temps est beau aujourd'hui\"}]" + ] + }, + "metadata": {}, + "execution_count": 89 + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "Let's now confirm that the signature-check provision we put in place actually works. For this, let's register this same model with a **slightly** different signature. This should be enough to trigger the failsafe.\n", + "\n", + "(Remember when we made the input label configurable at the start of this exercise? This is where that finally comes into play. 
😎)" + ], + "metadata": { + "id": "kwkDDzebG_dd" + } + }, + { + "cell_type": "code", + "source": [ + "import pandas as pd\n", + "\n", + "with mlflow.start_run():\n", + " incompatible_version = str(mlflow.pyfunc.log_model(\n", + " \"translation_model\",\n", + " registered_model_name=\"translation_model\",\n", + " python_model=MyTranslationModel(),\n", + " pip_requirements=[\"transformers\"],\n", + " input_example=pd.DataFrame({\n", + " \"text_to_translate\": [\n", + " \"Hello my name is Jon.\",\n", + " ],\n", + " }),\n", + " model_config={\n", + " \"input_label\": \"text_to_translate\",\n", + " \"hfhub_name\": \"google-t5/t5-base\",\n", + " \"lang_from\": \"en\",\n", + " \"lang_to\": \"de\",\n", + " },\n", + " ).registered_model_version)" + ], + "metadata": { + "id": "JYydMogXHsOJ", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "d8cd96f0-58d2-462b-8902-d9a65b604dc0" + }, + "execution_count": 90, + "outputs": [ + { + "output_type": "stream", + "name": "stderr", + "text": [ + "Device set to use cpu\n", + "Device set to use cpu\n", + "Registered model 'translation_model' already exists. Creating a new version of this model...\n", + "Created version '16' of model 'translation_model'.\n" + ] + } + ] + }, + { + "cell_type": "code", + "source": [ + "import requests\n", + "\n", + "resp = requests.post(\n", + " \"http://127.0.0.1:8000/serve/\",\n", + " json={\"prompt\": \"The weather is lovely today\"},\n", + " headers={\"serve_multiplexed_model_id\": incompatible_version},\n", + ")\n", + "assert not resp.ok\n", + "assert resp.status_code == 409\n", + "\n", + "assert resp.json()[0][\"translation_text\"] == \"FAILED\"" + ], + "metadata": { + "id": "5Yn-5VlIH6gs", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "e22f1791-b013-445c-a2ab-08916c5c1032" + }, + "execution_count": 99, + "outputs": [ + { + "output_type": "stream", + "name": "stderr", + "text": [ + "\u001b[36m(ServeReplica:default:DynamicallyDefinedDeployment pid=33782)\u001b[0m INFO 2024-12-23 16:07:41,052 default_DynamicallyDefinedDeployment c6ow5kq8 4847d79e-7b6f-4825-9d05-df0061222108 -- POST /serve/ 307 17.4ms\n", + "\u001b[36m(ServeReplica:default:DynamicallyDefinedDeployment pid=33782)\u001b[0m INFO 2024-12-23 16:07:43,253 default_DynamicallyDefinedDeployment c6ow5kq8 80d5bb80-c4e9-4dd5-ae51-f5fd1fe9b50c -- Loading model '16'.\n", + "\u001b[36m(ServeReplica:default:DynamicallyDefinedDeployment pid=33782)\u001b[0m Device set to use cpu\n", + "\u001b[36m(ServeReplica:default:DynamicallyDefinedDeployment pid=33782)\u001b[0m ERROR 2024-12-23 16:07:49,186 default_DynamicallyDefinedDeployment c6ow5kq8 80d5bb80-c4e9-4dd5-ae51-f5fd1fe9b50c -- Failed to load model '16'. Error: Requested version 16 has signature incompatible with that of default version 15\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "(The technically \"correct\" thing to do here would be to implement a response container that allows for an \"error message\" to be defined as part of the actual response, rather than \"abusing\" the `translation_text` field like we do here. For demonstration purposes, however, this'll do.)" + ], + "metadata": { + "id": "DMhjLZh-jCVa" + } + },
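+ { + "cell_type": "markdown", + "source": [ + "As a rough illustration, such an error-carrying response container might look something like the sketch below. This is hypothetical and not implemented by the deployment above; the `ServeResponse` and `ServeError` names are made up for demonstration.\n", + "\n", + "```python\n", + "from typing import List, Optional\n", + "\n", + "import pydantic\n", + "\n", + "class ServeError(pydantic.BaseModel):\n", + " message: str # e.g. the signature-mismatch explanation\n", + "\n", + "class ServeResponse(pydantic.BaseModel):\n", + " # Exactly one of these would be populated per response.\n", + " predictions: Optional[List[dict]] = None\n", + " error: Optional[ServeError] = None\n", + "```" + ], + "metadata": {} + }, + { + "cell_type": "markdown", + "source": [ + "To fully close things out, let's try registering an entirely different model -- with an entirely different signature -- and deploying that via `deployment_from_model_name()`. This will help us confirm that the entire signature is derived from the loaded model."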
+ ], + "metadata": { + "id": "cCLtQCgsjwPM" + } + }, + { + "cell_type": "code", + "source": [ + "import mlflow\n", + "from transformers import pipeline\n", + "\n", + "class QuestionAnswererModel(mlflow.pyfunc.PythonModel):\n", + " def load_context(self, context):\n", + "\n", + " self.model_context = context.model_config.get(\n", + " \"model_context\",\n", + " \"My name is Hans and I live in Germany.\",\n", + " )\n", + " self.model_name = context.model_config.get(\n", + " \"model_name\",\n", + " \"deepset/roberta-base-squad2\",\n", + " )\n", + "\n", + " self.tokenizer_name = context.model_config.get(\n", + " \"tokenizer_name\",\n", + " \"deepset/roberta-base-squad2\",\n", + " )\n", + "\n", + " self.pipeline = pipeline(\n", + " \"question-answering\",\n", + " model=self.model_name,\n", + " tokenizer=self.tokenizer_name,\n", + " )\n", + "\n", + " def predict(self, context, model_input, params=None):\n", + " resp = self.pipeline(\n", + " question=model_input[\"question\"].tolist(),\n", + " context=self.model_context,\n", + " )\n", + "\n", + " return [resp] if type(resp) is not list else resp" + ], + "metadata": { + "id": "fXUPRszjIGYN" + }, + "execution_count": 124, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "import pandas as pd\n", + "\n", + "with mlflow.start_run():\n", + " model_info = mlflow.pyfunc.log_model(\n", + " \"question_answerer\",\n", + " registered_model_name=\"question_answerer\",\n", + " python_model=QuestionAnswererModel(),\n", + " pip_requirements=[\"transformers\"],\n", + " input_example=pd.DataFrame({\n", + " \"question\": [\n", + " \"Where do you live?\",\n", + " \"What is your name?\",\n", + " ],\n", + " }),\n", + " model_config={\n", + " \"model_context\": \"My name is Hans and I live in Germany.\",\n", + " },\n", + " )" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "_p4FrmmhPAuq", + "outputId": "d5293b38-e56b-4b3f-c4e1-9906ba9c4383" + }, + "execution_count": 125, + "outputs": [ + { + "output_type": "stream", + "name": "stderr", + "text": [ + "Device set to use cpu\n", + "/usr/local/lib/python3.10/dist-packages/mlflow/types/utils.py:435: UserWarning: Hint: Inferred schema contains integer column(s). Integer columns in Python cannot represent missing values. If your input data contains missing values at inference time, it will be encoded as floats and will cause a schema enforcement error. The best way to avoid this problem is to infer the model schema based on a realistic data sample (training dataset) that includes missing values. Alternatively, you can declare integer columns as doubles (float64) whenever these columns may have missing values. See `Handling Integers With Missing Values `_ for more details.\n", + " warnings.warn(\n", + "Device set to use cpu\n", + "Registered model 'question_answerer' already exists. 
Creating a new version of this model...\n", + "Created version '8' of model 'question_answerer'.\n" + ] + } + ] + }, + { + "cell_type": "code", + "source": [ + "model_info.signature" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "g-0mQytrKyOc", + "outputId": "dd59ef90-ed96-490a-c27f-8f5dbc023ed3" + }, + "execution_count": 117, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "inputs: \n", + " ['question': string (required)]\n", + "outputs: \n", + " ['score': double (required), 'start': long (required), 'end': long (required), 'answer': string (required)]\n", + "params: \n", + " None" + ] + }, + "metadata": {}, + "execution_count": 117 + } + ] + }, + { + "cell_type": "code", + "source": [ + "from ray import serve\n", + "\n", + "serve.run(\n", + " deployment_from_model_name(\n", + " \"question_answerer\",\n", + " default_version=str(model_info.registered_model_version),\n", + " ).bind(),\n", + " blocking=False\n", + ")" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "afpSjdgYPaCw", + "outputId": "b01dcf25-289c-4ed6-f878-172966e88438" + }, + "execution_count": 127, + "outputs": [ + { + "output_type": "stream", + "name": "stderr", + "text": [ + "INFO 2024-12-23 16:14:03,641 serve 20385 -- Connecting to existing Serve app in namespace \"serve\". New http options will not be applied.\n", + "\u001b[36m(ServeController pid=27795)\u001b[0m INFO 2024-12-23 16:14:03,782 controller 27795 -- Deploying new version of Deployment(name='DynamicallyDefinedDeployment', app='default') (initial target replicas: 1).\n", + "\u001b[36m(ServeController pid=27795)\u001b[0m INFO 2024-12-23 16:14:03,905 controller 27795 -- Stopping 1 replicas of Deployment(name='DynamicallyDefinedDeployment', app='default') with outdated versions.\n", + "\u001b[36m(ServeController pid=27795)\u001b[0m INFO 2024-12-23 16:14:03,906 controller 27795 -- Adding 1 replica to Deployment(name='DynamicallyDefinedDeployment', app='default').\n", + "\u001b[36m(ServeReplica:default:DynamicallyDefinedDeployment pid=34337)\u001b[0m INFO 2024-12-23 16:14:05,922 default_DynamicallyDefinedDeployment zeqhtzxj -- Unloading model '4'.\n", + "\u001b[36m(ServeReplica:default:DynamicallyDefinedDeployment pid=34337)\u001b[0m INFO 2024-12-23 16:14:05,923 default_DynamicallyDefinedDeployment zeqhtzxj -- Successfully unloaded model '4' in 1.1ms.\n", + "\u001b[36m(ServeController pid=27795)\u001b[0m INFO 2024-12-23 16:14:06,047 controller 27795 -- Replica(id='zeqhtzxj', deployment='DynamicallyDefinedDeployment', app='default') is stopped.\n", + "INFO 2024-12-23 16:14:10,755 serve 20385 -- Application 'default' is ready at http://127.0.0.1:8000/.\n", + "INFO 2024-12-23 16:14:10,757 serve 20385 -- Deployed app 'default' successfully.\n" + ] + }, + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "DeploymentHandle(deployment='DynamicallyDefinedDeployment')" + ] + }, + "metadata": {}, + "execution_count": 127 + } + ] + }, + { + "cell_type": "code", + "source": [ + "import requests\n", + "\n", + "resp = requests.post(\n", + " \"http://127.0.0.1:8000/serve/\",\n", + " json={\"question\": \"The weather is lovely today\"},\n", + ")\n", + "resp.json()\n" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "MsLq5vbsS84T", + "outputId": "73489ce0-984b-4915-e8e0-27db7a8966ec" + }, + "execution_count": 130, + "outputs": [ + { + "output_type": "stream", + "name": "stderr", + "text": [ + 
"\u001b[36m(ServeReplica:default:DynamicallyDefinedDeployment pid=35834)\u001b[0m INFO 2024-12-23 16:14:40,551 default_DynamicallyDefinedDeployment z6r4w9bp f766b328-7f11-467b-b6fa-04f6d6c17a84 -- POST /serve/ 307 8.6ms\n" + ] + }, + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "[{'score': 3.255750561947934e-05,\n", + " 'start': 30,\n", + " 'end': 38,\n", + " 'answer': 'Germany.'}]" + ] + }, + "metadata": {}, + "execution_count": 130 + }, + { + "output_type": "stream", + "name": "stderr", + "text": [ + "\u001b[36m(ServeReplica:default:DynamicallyDefinedDeployment pid=35834)\u001b[0m INFO 2024-12-23 16:14:42,857 default_DynamicallyDefinedDeployment z6r4w9bp 74527eca-776e-497f-b478-b4dc8e24f53a -- POST /serve 200 2181.2ms\n" + ] + } + ] + } + ] +} \ No newline at end of file From bb07cda4e4ea1757d1656c475ea27cb90ce4b2d5 Mon Sep 17 00:00:00 2001 From: Jonathan Jin Date: Mon, 23 Dec 2024 16:19:27 +0000 Subject: [PATCH 02/11] Remove unnecessary output --- ..._Hugging_Face_with_Ray_Serve,_MLflow.ipynb | 860 ++++++------------ 1 file changed, 297 insertions(+), 563 deletions(-) diff --git a/Serving_Foundation_Models_from_Hugging_Face_with_Ray_Serve,_MLflow.ipynb b/Serving_Foundation_Models_from_Hugging_Face_with_Ray_Serve,_MLflow.ipynb index 4c5200f2..1b094e94 100644 --- a/Serving_Foundation_Models_from_Hugging_Face_with_Ray_Serve,_MLflow.ipynb +++ b/Serving_Foundation_Models_from_Hugging_Face_with_Ray_Serve,_MLflow.ipynb @@ -1,27 +1,10 @@ { - "nbformat": 4, - "nbformat_minor": 0, - "metadata": { - "colab": { - "provenance": [], - "toc_visible": true, - "authorship_tag": "ABX9TyPKW7x903JxiHL2pqDZChKh", - "include_colab_link": true - }, - "kernelspec": { - "name": "python3", - "display_name": "Python 3" - }, - "language_info": { - "name": "python" - } - }, "cells": [ { "cell_type": "markdown", "metadata": { - "id": "view-in-github", - "colab_type": "text" + "colab_type": "text", + "id": "view-in-github" }, "source": [ "\"Open" @@ -29,17 +12,20 @@ }, { "cell_type": "markdown", + "metadata": { + "id": "I17bSxxg1evl" + }, "source": [ "# Serving Foundation Models from Hugging Face with Ray Serve, MLflow\n", "\n", "Authored by: Jonathan Jin" - ], - "metadata": { - "id": "I17bSxxg1evl" - } + ] }, { "cell_type": "markdown", + "metadata": { + "id": "IuS0daXP1lIa" + }, "source": [ "# Introduction\n", "\n", @@ -58,13 +44,13 @@ "Given all of the above, we motivate our exploration here with the following user story:\n", "\n", "> I would like to deploy a model from a model registry (such as [MLflow](https://mlflow.org/)) using **only the name of the model**. The less boilerplate and scaffolding that I need to replicate each time I want to deploy a new model,the better. I would like the ability to dynamically select between different versions of the model without needing to set up a whole new deployment to accommodate those new versions.\n" - ], - "metadata": { - "id": "IuS0daXP1lIa" - } + ] }, { "cell_type": "markdown", + "metadata": { + "id": "fXlB7AJr2foY" + }, "source": [ "# Components\n", "\n", @@ -78,16 +64,11 @@ "We will **not** use GPUs for inference because inference performance is orthogonal to our focus here today. Needless to say, in \"real life,\" you will likely not be able to get away with serving your model with CPU compute.\n", "\n", "Let's install our dependencies now." 
- ], - "metadata": { - "id": "fXlB7AJr2foY" - } + ] }, { "cell_type": "code", - "source": [ - "!pip install \"transformers\" \"mlflow-skinny\" \"ray[serve]\"" - ], + "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" @@ -96,123 +77,29 @@ "id": "HfLQGO6E2hnW", "outputId": "c9634e63-5aaf-4e59-e970-aecb36d25b77" }, - "execution_count": 65, - "outputs": [ - { - "output_type": "stream", - "name": "stdout", - "text": [ - "Requirement already satisfied: transformers in /usr/local/lib/python3.10/dist-packages (4.47.1)\n", - "Requirement already satisfied: mlflow-skinny in /usr/local/lib/python3.10/dist-packages (2.19.0)\n", - "Requirement already satisfied: ray[serve] in /usr/local/lib/python3.10/dist-packages (2.40.0)\n", - "Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from transformers) (3.16.1)\n", - "Requirement already satisfied: huggingface-hub<1.0,>=0.24.0 in /usr/local/lib/python3.10/dist-packages (from transformers) (0.27.0)\n", - "Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.10/dist-packages (from transformers) (1.26.4)\n", - "Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from transformers) (24.2)\n", - "Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from transformers) (6.0.2)\n", - "Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.10/dist-packages (from transformers) (2024.11.6)\n", - "Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from transformers) (2.32.3)\n", - "Requirement already satisfied: tokenizers<0.22,>=0.21 in /usr/local/lib/python3.10/dist-packages (from transformers) (0.21.0)\n", - "Requirement already satisfied: safetensors>=0.4.1 in /usr/local/lib/python3.10/dist-packages (from transformers) (0.4.5)\n", - "Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.10/dist-packages (from transformers) (4.67.1)\n", - "Requirement already satisfied: cachetools<6,>=5.0.0 in /usr/local/lib/python3.10/dist-packages (from mlflow-skinny) (5.5.0)\n", - "Requirement already satisfied: click<9,>=7.0 in /usr/local/lib/python3.10/dist-packages (from mlflow-skinny) (8.1.7)\n", - "Requirement already satisfied: cloudpickle<4 in /usr/local/lib/python3.10/dist-packages (from mlflow-skinny) (3.1.0)\n", - "Requirement already satisfied: databricks-sdk<1,>=0.20.0 in /usr/local/lib/python3.10/dist-packages (from mlflow-skinny) (0.40.0)\n", - "Requirement already satisfied: gitpython<4,>=3.1.9 in /usr/local/lib/python3.10/dist-packages (from mlflow-skinny) (3.1.43)\n", - "Requirement already satisfied: importlib_metadata!=4.7.0,<9,>=3.7.0 in /usr/local/lib/python3.10/dist-packages (from mlflow-skinny) (8.5.0)\n", - "Requirement already satisfied: opentelemetry-api<3,>=1.9.0 in /usr/local/lib/python3.10/dist-packages (from mlflow-skinny) (1.29.0)\n", - "Requirement already satisfied: opentelemetry-sdk<3,>=1.9.0 in /usr/local/lib/python3.10/dist-packages (from mlflow-skinny) (1.29.0)\n", - "Requirement already satisfied: protobuf<6,>=3.12.0 in /usr/local/lib/python3.10/dist-packages (from mlflow-skinny) (4.25.5)\n", - "Requirement already satisfied: sqlparse<1,>=0.4.0 in /usr/local/lib/python3.10/dist-packages (from mlflow-skinny) (0.5.3)\n", - "Requirement already satisfied: jsonschema in /usr/local/lib/python3.10/dist-packages (from ray[serve]) (4.23.0)\n", - "Requirement already satisfied: msgpack<2.0.0,>=1.0.0 in 
/usr/local/lib/python3.10/dist-packages (from ray[serve]) (1.1.0)\n", - "Requirement already satisfied: aiosignal in /usr/local/lib/python3.10/dist-packages (from ray[serve]) (1.3.2)\n", - "Requirement already satisfied: frozenlist in /usr/local/lib/python3.10/dist-packages (from ray[serve]) (1.5.0)\n", - "Requirement already satisfied: watchfiles in /usr/local/lib/python3.10/dist-packages (from ray[serve]) (1.0.3)\n", - "Requirement already satisfied: aiohttp-cors in /usr/local/lib/python3.10/dist-packages (from ray[serve]) (0.7.0)\n", - "Requirement already satisfied: opencensus in /usr/local/lib/python3.10/dist-packages (from ray[serve]) (0.11.4)\n", - "Requirement already satisfied: smart-open in /usr/local/lib/python3.10/dist-packages (from ray[serve]) (7.1.0)\n", - "Requirement already satisfied: aiohttp>=3.7 in /usr/local/lib/python3.10/dist-packages (from ray[serve]) (3.11.10)\n", - "Requirement already satisfied: colorful in /usr/local/lib/python3.10/dist-packages (from ray[serve]) (0.5.6)\n", - "Requirement already satisfied: prometheus-client>=0.7.1 in /usr/local/lib/python3.10/dist-packages (from ray[serve]) (0.21.1)\n", - "Requirement already satisfied: uvicorn[standard] in /usr/local/lib/python3.10/dist-packages (from ray[serve]) (0.34.0)\n", - "Requirement already satisfied: py-spy>=0.2.0 in /usr/local/lib/python3.10/dist-packages (from ray[serve]) (0.4.0)\n", - "Requirement already satisfied: starlette in /usr/local/lib/python3.10/dist-packages (from ray[serve]) (0.41.3)\n", - "Requirement already satisfied: pydantic!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.*,!=2.4.*,<3 in /usr/local/lib/python3.10/dist-packages (from ray[serve]) (2.10.3)\n", - "Requirement already satisfied: fastapi in /usr/local/lib/python3.10/dist-packages (from ray[serve]) (0.115.6)\n", - "Requirement already satisfied: virtualenv!=20.21.1,>=20.0.24 in /usr/local/lib/python3.10/dist-packages (from ray[serve]) (20.28.0)\n", - "Requirement already satisfied: grpcio>=1.42.0 in /usr/local/lib/python3.10/dist-packages (from ray[serve]) (1.68.1)\n", - "Requirement already satisfied: memray in /usr/local/lib/python3.10/dist-packages (from ray[serve]) (1.15.0)\n", - "Requirement already satisfied: aiohappyeyeballs>=2.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp>=3.7->ray[serve]) (2.4.4)\n", - "Requirement already satisfied: async-timeout<6.0,>=4.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp>=3.7->ray[serve]) (4.0.3)\n", - "Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp>=3.7->ray[serve]) (24.3.0)\n", - "Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.10/dist-packages (from aiohttp>=3.7->ray[serve]) (6.1.0)\n", - "Requirement already satisfied: propcache>=0.2.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp>=3.7->ray[serve]) (0.2.1)\n", - "Requirement already satisfied: yarl<2.0,>=1.17.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp>=3.7->ray[serve]) (1.18.3)\n", - "Requirement already satisfied: google-auth~=2.0 in /usr/local/lib/python3.10/dist-packages (from databricks-sdk<1,>=0.20.0->mlflow-skinny) (2.27.0)\n", - "Requirement already satisfied: gitdb<5,>=4.0.1 in /usr/local/lib/python3.10/dist-packages (from gitpython<4,>=3.1.9->mlflow-skinny) (4.0.11)\n", - "Requirement already satisfied: fsspec>=2023.5.0 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub<1.0,>=0.24.0->transformers) (2024.10.0)\n", - "Requirement already satisfied: typing-extensions>=3.7.4.3 in 
/usr/local/lib/python3.10/dist-packages (from huggingface-hub<1.0,>=0.24.0->transformers) (4.12.2)\n", - "Requirement already satisfied: zipp>=3.20 in /usr/local/lib/python3.10/dist-packages (from importlib_metadata!=4.7.0,<9,>=3.7.0->mlflow-skinny) (3.21.0)\n", - "Requirement already satisfied: deprecated>=1.2.6 in /usr/local/lib/python3.10/dist-packages (from opentelemetry-api<3,>=1.9.0->mlflow-skinny) (1.2.15)\n", - "Requirement already satisfied: opentelemetry-semantic-conventions==0.50b0 in /usr/local/lib/python3.10/dist-packages (from opentelemetry-sdk<3,>=1.9.0->mlflow-skinny) (0.50b0)\n", - "Requirement already satisfied: annotated-types>=0.6.0 in /usr/local/lib/python3.10/dist-packages (from pydantic!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.*,!=2.4.*,<3->ray[serve]) (0.7.0)\n", - "Requirement already satisfied: pydantic-core==2.27.1 in /usr/local/lib/python3.10/dist-packages (from pydantic!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.*,!=2.4.*,<3->ray[serve]) (2.27.1)\n", - "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (3.4.0)\n", - "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (3.10)\n", - "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (2.2.3)\n", - "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (2024.12.14)\n", - "Requirement already satisfied: distlib<1,>=0.3.7 in /usr/local/lib/python3.10/dist-packages (from virtualenv!=20.21.1,>=20.0.24->ray[serve]) (0.3.9)\n", - "Requirement already satisfied: platformdirs<5,>=3.9.1 in /usr/local/lib/python3.10/dist-packages (from virtualenv!=20.21.1,>=20.0.24->ray[serve]) (4.3.6)\n", - "Requirement already satisfied: anyio<5,>=3.4.0 in /usr/local/lib/python3.10/dist-packages (from starlette->ray[serve]) (3.7.1)\n", - "Requirement already satisfied: jsonschema-specifications>=2023.03.6 in /usr/local/lib/python3.10/dist-packages (from jsonschema->ray[serve]) (2024.10.1)\n", - "Requirement already satisfied: referencing>=0.28.4 in /usr/local/lib/python3.10/dist-packages (from jsonschema->ray[serve]) (0.35.1)\n", - "Requirement already satisfied: rpds-py>=0.7.1 in /usr/local/lib/python3.10/dist-packages (from jsonschema->ray[serve]) (0.22.3)\n", - "Requirement already satisfied: jinja2>=2.9 in /usr/local/lib/python3.10/dist-packages (from memray->ray[serve]) (3.1.4)\n", - "Requirement already satisfied: rich>=11.2.0 in /usr/local/lib/python3.10/dist-packages (from memray->ray[serve]) (13.9.4)\n", - "Requirement already satisfied: textual>=0.41.0 in /usr/local/lib/python3.10/dist-packages (from memray->ray[serve]) (1.0.0)\n", - "Requirement already satisfied: opencensus-context>=0.1.3 in /usr/local/lib/python3.10/dist-packages (from opencensus->ray[serve]) (0.1.3)\n", - "Requirement already satisfied: six~=1.16 in /usr/local/lib/python3.10/dist-packages (from opencensus->ray[serve]) (1.17.0)\n", - "Requirement already satisfied: google-api-core<3.0.0,>=1.0.0 in /usr/local/lib/python3.10/dist-packages (from opencensus->ray[serve]) (2.19.2)\n", - "Requirement already satisfied: wrapt in /usr/local/lib/python3.10/dist-packages (from smart-open->ray[serve]) (1.17.0)\n", - "Requirement already satisfied: h11>=0.8 in /usr/local/lib/python3.10/dist-packages (from uvicorn[standard]; extra == \"serve\"->ray[serve]) (0.14.0)\n", - "Requirement already satisfied: httptools>=0.6.3 
in /usr/local/lib/python3.10/dist-packages (from uvicorn[standard]; extra == \"serve\"->ray[serve]) (0.6.4)\n", - "Requirement already satisfied: python-dotenv>=0.13 in /usr/local/lib/python3.10/dist-packages (from uvicorn[standard]; extra == \"serve\"->ray[serve]) (1.0.1)\n", - "Requirement already satisfied: uvloop!=0.15.0,!=0.15.1,>=0.14.0 in /usr/local/lib/python3.10/dist-packages (from uvicorn[standard]; extra == \"serve\"->ray[serve]) (0.21.0)\n", - "Requirement already satisfied: websockets>=10.4 in /usr/local/lib/python3.10/dist-packages (from uvicorn[standard]; extra == \"serve\"->ray[serve]) (14.1)\n", - "Requirement already satisfied: sniffio>=1.1 in /usr/local/lib/python3.10/dist-packages (from anyio<5,>=3.4.0->starlette->ray[serve]) (1.3.1)\n", - "Requirement already satisfied: exceptiongroup in /usr/local/lib/python3.10/dist-packages (from anyio<5,>=3.4.0->starlette->ray[serve]) (1.2.2)\n", - "Requirement already satisfied: smmap<6,>=3.0.1 in /usr/local/lib/python3.10/dist-packages (from gitdb<5,>=4.0.1->gitpython<4,>=3.1.9->mlflow-skinny) (5.0.1)\n", - "Requirement already satisfied: googleapis-common-protos<2.0.dev0,>=1.56.2 in /usr/local/lib/python3.10/dist-packages (from google-api-core<3.0.0,>=1.0.0->opencensus->ray[serve]) (1.66.0)\n", - "Requirement already satisfied: proto-plus<2.0.0dev,>=1.22.3 in /usr/local/lib/python3.10/dist-packages (from google-api-core<3.0.0,>=1.0.0->opencensus->ray[serve]) (1.25.0)\n", - "Requirement already satisfied: pyasn1-modules>=0.2.1 in /usr/local/lib/python3.10/dist-packages (from google-auth~=2.0->databricks-sdk<1,>=0.20.0->mlflow-skinny) (0.4.1)\n", - "Requirement already satisfied: rsa<5,>=3.1.4 in /usr/local/lib/python3.10/dist-packages (from google-auth~=2.0->databricks-sdk<1,>=0.20.0->mlflow-skinny) (4.9)\n", - "Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2>=2.9->memray->ray[serve]) (3.0.2)\n", - "Requirement already satisfied: markdown-it-py>=2.2.0 in /usr/local/lib/python3.10/dist-packages (from rich>=11.2.0->memray->ray[serve]) (3.0.0)\n", - "Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /usr/local/lib/python3.10/dist-packages (from rich>=11.2.0->memray->ray[serve]) (2.18.0)\n", - "Requirement already satisfied: mdurl~=0.1 in /usr/local/lib/python3.10/dist-packages (from markdown-it-py>=2.2.0->rich>=11.2.0->memray->ray[serve]) (0.1.2)\n", - "Requirement already satisfied: mdit-py-plugins in /usr/local/lib/python3.10/dist-packages (from markdown-it-py[linkify,plugins]>=2.1.0->textual>=0.41.0->memray->ray[serve]) (0.4.2)\n", - "Requirement already satisfied: linkify-it-py<3,>=1 in /usr/local/lib/python3.10/dist-packages (from markdown-it-py[linkify,plugins]>=2.1.0->textual>=0.41.0->memray->ray[serve]) (2.0.3)\n", - "Requirement already satisfied: pyasn1<0.7.0,>=0.4.6 in /usr/local/lib/python3.10/dist-packages (from pyasn1-modules>=0.2.1->google-auth~=2.0->databricks-sdk<1,>=0.20.0->mlflow-skinny) (0.6.1)\n", - "Requirement already satisfied: uc-micro-py in /usr/local/lib/python3.10/dist-packages (from linkify-it-py<3,>=1->markdown-it-py[linkify,plugins]>=2.1.0->textual>=0.41.0->memray->ray[serve]) (1.0.3)\n" - ] - } + "outputs": [], + "source": [ + "!pip install \"transformers\" \"mlflow-skinny\" \"ray[serve]\"" ] }, { "cell_type": "markdown", + "metadata": { + "id": "C0UziXBN4Szc" + }, "source": [ "# Register the Model\n", "\n", "First, let's define the model that we'll use for our explorations today. 
For simplicity's sake, we'll use a simple text translation model, where the source and destination languages are configurable at registration time. In effect, this means that different \"versions\" of the model can be registered to translate different languages, but that the underlying model architecture and weights can stay the same." - ], - "metadata": { - "id": "C0UziXBN4Szc" - } + ] }, { "cell_type": "code", + "execution_count": 66, + "metadata": { + "id": "D2HsBFUa4nBM" + }, + "outputs": [], "source": [ "import mlflow\n", "from transformers import pipeline\n", @@ -235,26 +122,30 @@ " prompt = model_input[self.input_label].tolist()\n", "\n", " return self.pipeline(prompt)" - ], - "metadata": { - "id": "D2HsBFUa4nBM" - }, - "execution_count": 66, - "outputs": [] + ] }, { "cell_type": "markdown", + "metadata": { + "id": "-PFbVlpdIBHA" + }, "source": [ "(You might be wondering why we even bothered making the input label configurable. This will be useful to us later.)\n", "\n", "Now that our model is defined, let's register an actual version of it. This particular version will use Google's [T5 Base](https://huggingface.co/google-t5/t5-base) model and be configured to translate from **English** to **German**." - ], - "metadata": { - "id": "-PFbVlpdIBHA" - } + ] }, { "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "SpGCrnAx6eVf", + "outputId": "11218a74-11fa-471b-cc86-03a150b64f20" + }, + "outputs": [], "source": [ "import pandas as pd\n", "\n", @@ -273,73 +164,49 @@ " \"lang_to\": \"de\",\n", " },\n", " )" - ], - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "SpGCrnAx6eVf", - "outputId": "11218a74-11fa-471b-cc86-03a150b64f20" - }, - "execution_count": 67, - "outputs": [ - { - "output_type": "stream", - "name": "stderr", - "text": [ - "Device set to use cpu\n", - "Device set to use cpu\n", - "Registered model 'translation_model' already exists. Creating a new version of this model...\n", - "Created version '14' of model 'translation_model'.\n" - ] - } ] }, { "cell_type": "markdown", - "source": [ - "Let's keep track of this exact version. This will be useful later." - ], "metadata": { "id": "NaUwo6E0DPbI" - } + }, + "source": [ + "Let's keep track of this exact version. This will be useful later." + ] }, { "cell_type": "code", - "source": [ - "en_to_de_version: str = str(model_info.registered_model_version)" - ], + "execution_count": 69, "metadata": { "id": "e0o4ICh38Pjy" }, - "execution_count": 69, - "outputs": [] + "outputs": [], + "source": [ + "en_to_de_version: str = str(model_info.registered_model_version)" + ] }, { "cell_type": "markdown", - "source": [ - "The registered model metadata contains some useful information for us. Most notably, the registered model version is associated with a strict **signature** that denotes the expected shape of its input and output. This will be useful to us later." - ], "metadata": { "id": "Jn0RU7fXDTdD" - } + }, + "source": [ + "The registered model metadata contains some useful information for us. Most notably, the registered model version is associated with a strict **signature** that denotes the expected shape of its input and output. This will be useful to us later." 
+ ] }, { "cell_type": "code", - "source": [ - "model_info.signature" - ], + "execution_count": 70, "metadata": { - "id": "ZKMgYR_jDhOA", "colab": { "base_uri": "https://localhost:8080/" }, + "id": "ZKMgYR_jDhOA", "outputId": "7f1410df-cde3-4160-eee8-30788a402b3b" }, - "execution_count": 70, "outputs": [ { - "output_type": "execute_result", "data": { "text/plain": [ "inputs: \n", @@ -350,13 +217,20 @@ " None" ] }, + "execution_count": 70, "metadata": {}, - "execution_count": 70 + "output_type": "execute_result" } + ], + "source": [ + "model_info.signature" ] }, { "cell_type": "markdown", + "metadata": { + "id": "iwa3o-0B9FPO" + }, "source": [ "# Serve the Model\n", "\n", @@ -364,13 +238,15 @@ "\n", "- Source the seleted model and version from MLflow;\n", "- Receive inference requests and return inference responses via a simple REST API." - ], - "metadata": { - "id": "iwa3o-0B9FPO" - } + ] }, { "cell_type": "code", + "execution_count": 74, + "metadata": { + "id": "7OZ2lqOS9oqw" + }, + "outputs": [], "source": [ "import mlflow\n", "import pandas as pd\n", @@ -395,29 +271,22 @@ " return self.model.predict(pd.DataFrame({\"prompt\": [input_string]}))\n", "\n", "deployment = ModelDeployment.bind(default_version=en_to_de_version)" - ], - "metadata": { - "id": "7OZ2lqOS9oqw" - }, - "execution_count": 74, - "outputs": [] + ] }, { "cell_type": "markdown", + "metadata": { + "id": "f018wd2fEia7" + }, "source": [ "You might have notice that hard-coding `\"prompt\"` as the input label here introduces hidden coupling between the registered model's signature and the deployment implementation. We'll come back to this later.\n", "\n", "Now, let's run the deployment and play around with it." - ], - "metadata": { - "id": "f018wd2fEia7" - } + ] }, { "cell_type": "code", - "source": [ - "serve.run(deployment, blocking=False)" - ], + "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" @@ -425,50 +294,14 @@ "id": "MudMnivd_DrC", "outputId": "7f23394f-9f3e-4ce1-c67a-82c59a5bc25f" }, - "execution_count": 75, - "outputs": [ - { - "output_type": "stream", - "name": "stderr", - "text": [ - "INFO 2024-12-23 16:00:03,032 serve 20385 -- Connecting to existing Serve app in namespace \"serve\". 
New http options will not be applied.\n", - "\u001b[36m(ServeController pid=27795)\u001b[0m INFO 2024-12-23 16:00:03,248 controller 27795 -- Deploying new version of Deployment(name='ModelDeployment', app='default') (initial target replicas: 1).\n", - "\u001b[36m(ServeController pid=27795)\u001b[0m INFO 2024-12-23 16:00:03,425 controller 27795 -- Stopping 1 replicas of Deployment(name='ModelDeployment', app='default') with outdated versions.\n", - "\u001b[36m(ServeController pid=27795)\u001b[0m INFO 2024-12-23 16:00:03,425 controller 27795 -- Adding 1 replica to Deployment(name='ModelDeployment', app='default').\n", - "\u001b[36m(ServeController pid=27795)\u001b[0m INFO 2024-12-23 16:00:05,548 controller 27795 -- Replica(id='ksuhh6uv', deployment='ModelDeployment', app='default') is stopped.\n", - "\u001b[36m(ServeReplica:default:ModelDeployment pid=32047)\u001b[0m 2024-12-23 16:00:21.273257: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n", - "\u001b[36m(ServeReplica:default:ModelDeployment pid=32047)\u001b[0m 2024-12-23 16:00:21.325581: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n", - "\u001b[36m(ServeReplica:default:ModelDeployment pid=32047)\u001b[0m 2024-12-23 16:00:21.341597: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n", - "\u001b[36m(ServeReplica:default:ModelDeployment pid=32047)\u001b[0m 2024-12-23 16:00:25.496368: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT\n", - "\u001b[36m(ServeController pid=27795)\u001b[0m WARNING 2024-12-23 16:00:33,573 controller 27795 -- Deployment 'ModelDeployment' in application 'default' has 1 replicas that have taken more than 30s to initialize.\n", - "\u001b[36m(ServeController pid=27795)\u001b[0m This may be caused by a slow __init__ or reconfigure method.\n", - "\u001b[36m(ServeReplica:default:ModelDeployment pid=32047)\u001b[0m Device set to use cpu\n", - "INFO 2024-12-23 16:00:36,639 serve 20385 -- Application 'default' is ready at http://127.0.0.1:8000/.\n", - "INFO 2024-12-23 16:00:36,642 serve 20385 -- Deployed app 'default' successfully.\n" - ] - }, - { - "output_type": "execute_result", - "data": { - "text/plain": [ - "DeploymentHandle(deployment='ModelDeployment')" - ] - }, - "metadata": {}, - "execution_count": 75 - } + "outputs": [], + "source": [ + "serve.run(deployment, blocking=False)" ] }, { "cell_type": "code", - "source": [ - "import requests\n", - "\n", - "requests.post(\n", - " \"http://127.0.0.1:8000/serve/\",\n", - " params={\"input_string\": \"The weather is lovely today\"},\n", - ").json()" - ], + "execution_count": 77, "metadata": { "colab": { "base_uri": "https://localhost:8080/" @@ -476,40 +309,50 @@ "id": "VTk1E5pp_gRz", "outputId": "67a20366-f637-4a0a-8c51-0f71bf5e1ea6" }, - "execution_count": 77, "outputs": [ { - "output_type": "stream", "name": "stderr", + "output_type": "stream", "text": [ "\u001b[36m(ServeReplica:default:ModelDeployment pid=32047)\u001b[0m INFO 2024-12-23 16:00:41,540 default_ModelDeployment rekqfhvc 23cc9c43-746c-4575-968e-ee8d14972e6a -- POST /serve/ 307 5.8ms\n" ] }, { - "output_type": "execute_result", "data": { "text/plain": [ 
"[{'translation_text': 'Das Wetter ist heute nett.'}]" ] }, + "execution_count": 77, "metadata": {}, - "execution_count": 77 + "output_type": "execute_result" } + ], + "source": [ + "import requests\n", + "\n", + "requests.post(\n", + " \"http://127.0.0.1:8000/serve/\",\n", + " params={\"input_string\": \"The weather is lovely today\"},\n", + ").json()" ] }, { "cell_type": "markdown", + "metadata": { + "id": "i3CNI-mmE_22" + }, "source": [ "This works fine, but you might have noticed that the REST API does not line up with the model signature. Namely, it uses the label `\"input_string\"` while the served model version itself uses the input label `\"prompt\"`. Similarly, the model can accept multiple inputs values, but the API only accepts one.\n", "\n", "If this feels [smelly](https://en.wikipedia.org/wiki/Code_smell) to you, keep reading; we'll come back to this." - ], - "metadata": { - "id": "i3CNI-mmE_22" - } + ] }, { "cell_type": "markdown", + "metadata": { + "id": "hsJ65rNNDMVj" + }, "source": [ "# Multiple Versions, One Endpoint\n", "\n", @@ -524,13 +367,15 @@ "Let's try registering another version of the model -- this time, one that translates from English to French. We'll register this under the version `\"2\"`; the model server will retrieve the model version that way.\n", "\n", "But first, let's extend the model server with multiplexing support." - ], - "metadata": { - "id": "hsJ65rNNDMVj" - } + ] }, { "cell_type": "code", + "execution_count": 78, + "metadata": { + "id": "d8GcI3WLE3Sc" + }, + "outputs": [], "source": [ "from ray import serve\n", "from fastapi import FastAPI\n", @@ -557,19 +402,11 @@ " async def serve(self, input_string: str):\n", " model = await self.get_model(serve.get_multiplexed_model_id())\n", " return model.predict(pd.DataFrame({\"prompt\": [input_string]}))" - ], - "metadata": { - "id": "d8GcI3WLE3Sc" - }, - "execution_count": 78, - "outputs": [] + ] }, { "cell_type": "code", - "source": [ - "multiplexed_deployment = MultiplexedModelDeployment.bind(model_name=\"translation_model\")\n", - "serve.run(multiplexed_deployment, blocking=False)" - ], + "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" @@ -577,45 +414,32 @@ "id": "f-gisRU_FKlJ", "outputId": "a0c7318d-8271-4163-d58d-9ed97df72266" }, - "execution_count": 79, - "outputs": [ - { - "output_type": "stream", - "name": "stderr", - "text": [ - "INFO 2024-12-23 16:01:13,932 serve 20385 -- Connecting to existing Serve app in namespace \"serve\". 
New http options will not be applied.\n", - "\u001b[36m(ServeController pid=27795)\u001b[0m INFO 2024-12-23 16:01:14,037 controller 27795 -- Deploying new version of Deployment(name='MultiplexedModelDeployment', app='default') (initial target replicas: 1).\n", - "\u001b[36m(ProxyActor pid=27796)\u001b[0m INFO 2024-12-23 16:01:14,042 proxy 172.28.0.12 -- Got updated endpoints: {Deployment(name='MultiplexedModelDeployment', app='default'): EndpointInfo(route='/', app_is_cross_language=False)}.\n", - "\u001b[36m(ServeController pid=27795)\u001b[0m INFO 2024-12-23 16:01:14,144 controller 27795 -- Removing 1 replica from Deployment(name='ModelDeployment', app='default').\n", - "\u001b[36m(ServeController pid=27795)\u001b[0m INFO 2024-12-23 16:01:14,144 controller 27795 -- Adding 1 replica to Deployment(name='MultiplexedModelDeployment', app='default').\n", - "\u001b[36m(ServeController pid=27795)\u001b[0m INFO 2024-12-23 16:01:16,310 controller 27795 -- Replica(id='rekqfhvc', deployment='ModelDeployment', app='default') is stopped.\n", - "INFO 2024-12-23 16:01:19,109 serve 20385 -- Application 'default' is ready at http://127.0.0.1:8000/.\n", - "INFO 2024-12-23 16:01:19,112 serve 20385 -- Deployed app 'default' successfully.\n" - ] - }, - { - "output_type": "execute_result", - "data": { - "text/plain": [ - "DeploymentHandle(deployment='MultiplexedModelDeployment')" - ] - }, - "metadata": {}, - "execution_count": 79 - } + "outputs": [], + "source": [ + "multiplexed_deployment = MultiplexedModelDeployment.bind(model_name=\"translation_model\")\n", + "serve.run(multiplexed_deployment, blocking=False)" ] }, { "cell_type": "markdown", - "source": [ - "Now let's actually register the new model version." - ], "metadata": { "id": "Qs7snXhxdlUR" - } + }, + "source": [ + "Now let's actually register the new model version." + ] }, { "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "K3_essFBEuCo", + "outputId": "b7f4f9e7-62bf-40ae-ed8a-db0110ad2e4f" + }, + "outputs": [], "source": [ "import pandas as pd\n", "\n", @@ -638,48 +462,20 @@ " )\n", "\n", "en_to_fr_version: str = str(model_info.registered_model_version)" - ], + ] + }, + { + "cell_type": "markdown", "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "K3_essFBEuCo", - "outputId": "b7f4f9e7-62bf-40ae-ed8a-db0110ad2e4f" + "id": "rxOzkg65dnZW" }, - "execution_count": 80, - "outputs": [ - { - "output_type": "stream", - "name": "stderr", - "text": [ - "Device set to use cpu\n", - "Device set to use cpu\n", - "Registered model 'translation_model' already exists. Creating a new version of this model...\n", - "Created version '15' of model 'translation_model'.\n" - ] - } - ] - }, - { - "cell_type": "markdown", "source": [ "Now that that's registered, we can query for it via the model server like so..." 
- ], - "metadata": { - "id": "rxOzkg65dnZW" - } + ] }, { "cell_type": "code", - "source": [ - "import requests\n", - "\n", - "requests.post(\n", - " \"http://127.0.0.1:8000/serve/\",\n", - " params={\"input_string\": \"The weather is lovely today\"},\n", - " headers={\"serve_multiplexed_model_id\": en_to_fr_version},\n", - ").json()" - ], + "execution_count": 81, "metadata": { "colab": { "base_uri": "https://localhost:8080/" @@ -687,11 +483,10 @@ "id": "EyeLmnPJFuRH", "outputId": "9dfb8df0-f207-42ae-b78b-db51d8843c15" }, - "execution_count": 81, "outputs": [ { - "output_type": "stream", "name": "stderr", + "output_type": "stream", "text": [ "\u001b[36m(ServeReplica:default:MultiplexedModelDeployment pid=32383)\u001b[0m INFO 2024-12-23 16:01:41,179 default_MultiplexedModelDeployment hnpendkt 1943df13-e56a-47d0-a49f-55fb78aa665b -- POST /serve/ 307 4.3ms\n", "\u001b[36m(ServeReplica:default:MultiplexedModelDeployment pid=32383)\u001b[0m INFO 2024-12-23 16:01:43,214 default_MultiplexedModelDeployment hnpendkt ee559e3e-a71d-48aa-8c24-10de5d7ad7df -- Loading model '15'.\n", @@ -704,37 +499,40 @@ ] }, { - "output_type": "execute_result", "data": { "text/plain": [ "[{'translation_text': \"Le temps est beau aujourd'hui\"}]" ] }, + "execution_count": 81, "metadata": {}, - "execution_count": 81 + "output_type": "execute_result" } + ], + "source": [ + "import requests\n", + "\n", + "requests.post(\n", + " \"http://127.0.0.1:8000/serve/\",\n", + " params={\"input_string\": \"The weather is lovely today\"},\n", + " headers={\"serve_multiplexed_model_id\": en_to_fr_version},\n", + ").json()" ] }, { "cell_type": "markdown", + "metadata": { + "id": "jVMCS4CedudN" + }, "source": [ "Note how we were able to immediately access the model version **without redeploying the model server**. Ray Serve's multiplexing capabilities allow it to dynamically fetch the model weights in a just-in-time fashion; if I never requested version 2, it never gets loaded. This helps conserve compute resources for the models that **do** get queried. What's even more useful is that, if the number of models loaded up exceeds the configured maximum (`max_num_models_per_replica`), the [least-recently used model version will get evicted](https://docs.ray.io/en/latest/serve/model-multiplexing.html#why-model-multiplexing).\n", "\n", "Given that we set `max_num_models_per_replica=2` above, the \"default\" English-to-German version of the model should still be loaded up and readily available to serve requests without any cold-start time. 
Let's confirm that now:" - ], - "metadata": { - "id": "jVMCS4CedudN" - } + ] }, { "cell_type": "code", - "source": [ - "requests.post(\n", - " \"http://127.0.0.1:8000/serve/\",\n", - " params={\"input_string\": \"The weather is lovely today\"},\n", - " headers={\"serve_multiplexed_model_id\": en_to_de_version},\n", - ").json()" - ], + "execution_count": 83, "metadata": { "colab": { "base_uri": "https://localhost:8080/" @@ -742,29 +540,38 @@ "id": "jEJFQNlwGGKh", "outputId": "b847d92e-fe0f-4439-bd87-e5773680c4d1" }, - "execution_count": 83, "outputs": [ { - "output_type": "stream", "name": "stderr", + "output_type": "stream", "text": [ "\u001b[36m(ServeReplica:default:MultiplexedModelDeployment pid=32383)\u001b[0m INFO 2024-12-23 16:02:13,267 default_MultiplexedModelDeployment hnpendkt 8e680170-df74-49ba-856c-a7e9009abaab -- POST /serve/ 307 26.0ms\n" ] }, { - "output_type": "execute_result", "data": { "text/plain": [ "[{'translation_text': 'Das Wetter ist heute nett.'}]" ] }, + "execution_count": 83, "metadata": {}, - "execution_count": 83 + "output_type": "execute_result" } + ], + "source": [ + "requests.post(\n", + " \"http://127.0.0.1:8000/serve/\",\n", + " params={\"input_string\": \"The weather is lovely today\"},\n", + " headers={\"serve_multiplexed_model_id\": en_to_de_version},\n", + ").json()" ] }, { "cell_type": "markdown", + "metadata": { + "id": "D8CgPXcsIg5C" + }, "source": [ "## Auto-Signature\n", "\n", @@ -779,13 +586,15 @@ "Let's set things up so that the model server signature is inferred from the registered model itself. Since different versions of an MLflow can have different signatures, we'll use the \"default version\" to \"pin\" the signature; any attempt to multiplex an incompatible-signature model version we will have throw an error.\n", "\n", "Since Ray Serve binds the request and response signatures at class-definition time, we will use a Python metaclass to set this as a function of the specified model name and default model version." - ], - "metadata": { - "id": "D8CgPXcsIg5C" - } + ] }, { "cell_type": "code", + "execution_count": 84, + "metadata": { + "id": "u9GPbQrnP7OD" + }, + "outputs": [], "source": [ "import mlflow\n", "import pydantic\n", @@ -804,15 +613,19 @@ " outputs: mlflow.types.schema.Schema = model_signature.outputs\n", "\n", " return (schema_to_pydantic(inputs, name=\"InputModel\"), schema_to_pydantic(outputs, name=\"OutputModel\"))" - ], - "metadata": { - "id": "u9GPbQrnP7OD" - }, - "execution_count": 84, - "outputs": [] + ] }, { "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "PgetOY1LKp6m", + "outputId": "ada066e3-72b3-42af-c284-41118fcb2e20" + }, + "outputs": [], "source": [ "import mlflow\n", "\n", @@ -860,58 +673,11 @@ "deployment = deployment_from_model_name(\"translation_model\", default_version=en_to_fr_version)\n", "\n", "serve.run(deployment.bind(), blocking=False)" - ], - "metadata": { - "id": "PgetOY1LKp6m", - "colab": { - "base_uri": "https://localhost:8080/" - }, - "outputId": "ada066e3-72b3-42af-c284-41118fcb2e20" - }, - "execution_count": 95, - "outputs": [ - { - "output_type": "stream", - "name": "stderr", - "text": [ - "INFO 2024-12-23 16:06:17,054 serve 20385 -- Connecting to existing Serve app in namespace \"serve\". 
New http options will not be applied.\n", - "\u001b[36m(ServeController pid=27795)\u001b[0m INFO 2024-12-23 16:06:17,244 controller 27795 -- Deploying new version of Deployment(name='DynamicallyDefinedDeployment', app='default') (initial target replicas: 1).\n", - "\u001b[36m(ServeController pid=27795)\u001b[0m INFO 2024-12-23 16:06:17,368 controller 27795 -- Stopping 1 replicas of Deployment(name='DynamicallyDefinedDeployment', app='default') with outdated versions.\n", - "\u001b[36m(ServeController pid=27795)\u001b[0m INFO 2024-12-23 16:06:17,368 controller 27795 -- Adding 1 replica to Deployment(name='DynamicallyDefinedDeployment', app='default').\n", - "\u001b[36m(ServeReplica:default:DynamicallyDefinedDeployment pid=33001)\u001b[0m INFO 2024-12-23 16:06:19,388 default_DynamicallyDefinedDeployment iwidgax2 -- Unloading model '15'.\n", - "\u001b[36m(ServeReplica:default:DynamicallyDefinedDeployment pid=33001)\u001b[0m INFO 2024-12-23 16:06:19,394 default_DynamicallyDefinedDeployment iwidgax2 -- Successfully unloaded model '15' in 0.4ms.\n", - "\u001b[36m(ServeController pid=27795)\u001b[0m INFO 2024-12-23 16:06:19,538 controller 27795 -- Replica(id='iwidgax2', deployment='DynamicallyDefinedDeployment', app='default') is stopped.\n", - "INFO 2024-12-23 16:06:38,966 serve 20385 -- Application 'default' is ready at http://127.0.0.1:8000/.\n", - "INFO 2024-12-23 16:06:38,968 serve 20385 -- Deployed app 'default' successfully.\n" - ] - }, - { - "output_type": "execute_result", - "data": { - "text/plain": [ - "DeploymentHandle(deployment='DynamicallyDefinedDeployment')" - ] - }, - "metadata": {}, - "execution_count": 95 - } ] }, { "cell_type": "code", - "source": [ - "import requests\n", - "\n", - "resp = requests.post(\n", - " \"http://127.0.0.1:8000/serve/\",\n", - " json={\"prompt\": \"The weather is lovely today\"},\n", - ")\n", - "\n", - "assert resp.ok\n", - "assert resp.status_code == 200\n", - "\n", - "resp.json()" - ], + "execution_count": 88, "metadata": { "colab": { "base_uri": "https://localhost:8080/" @@ -919,11 +685,10 @@ "id": "x911zDhomWMj", "outputId": "7dc78df7-4f06-4871-d45f-37cfb852ffc5" }, - "execution_count": 88, "outputs": [ { - "output_type": "stream", "name": "stderr", + "output_type": "stream", "text": [ "\u001b[36m(ServeReplica:default:DynamicallyDefinedDeployment pid=33001)\u001b[0m INFO 2024-12-23 16:03:30,503 default_DynamicallyDefinedDeployment iwidgax2 8989a73b-3173-48d0-a0dc-d301363e731c -- POST /serve/ 307 10.8ms\n", "\u001b[36m(ServeReplica:default:DynamicallyDefinedDeployment pid=33001)\u001b[0m INFO 2024-12-23 16:03:30,544 default_DynamicallyDefinedDeployment iwidgax2 e00d9137-a259-4954-8a12-81a3314bc5d2 -- Loading model '15'.\n", @@ -937,74 +702,96 @@ ] }, { - "output_type": "execute_result", "data": { "text/plain": [ "[{'translation_text': \"Le temps est beau aujourd'hui\"}]" ] }, + "execution_count": 88, "metadata": {}, - "execution_count": 88 + "output_type": "execute_result" } - ] - }, - { - "cell_type": "code", + ], "source": [ "import requests\n", "\n", "resp = requests.post(\n", " \"http://127.0.0.1:8000/serve/\",\n", " json={\"prompt\": \"The weather is lovely today\"},\n", - " headers={\"serve_multiplexed_model_id\": str(en_to_fr_version)},\n", ")\n", "\n", "assert resp.ok\n", "assert resp.status_code == 200\n", "\n", "resp.json()" - ], + ] + }, + { + "cell_type": "code", + "execution_count": 89, "metadata": { - "id": "EX7ff2wg5PjL", "colab": { "base_uri": "https://localhost:8080/" }, + "id": "EX7ff2wg5PjL", "outputId": 
"edf0587a-abf5-4160-a621-f9ac4faee6bf" }, - "execution_count": 89, "outputs": [ { - "output_type": "stream", "name": "stderr", + "output_type": "stream", "text": [ "\u001b[36m(ServeReplica:default:DynamicallyDefinedDeployment pid=33001)\u001b[0m INFO 2024-12-23 16:03:57,563 default_DynamicallyDefinedDeployment iwidgax2 df6b7526-edee-486a-a06e-f15407d4e1aa -- POST /serve/ 307 7.2ms\n" ] }, { - "output_type": "execute_result", "data": { "text/plain": [ "[{'translation_text': \"Le temps est beau aujourd'hui\"}]" ] }, + "execution_count": 89, "metadata": {}, - "execution_count": 89 + "output_type": "execute_result" } + ], + "source": [ + "import requests\n", + "\n", + "resp = requests.post(\n", + " \"http://127.0.0.1:8000/serve/\",\n", + " json={\"prompt\": \"The weather is lovely today\"},\n", + " headers={\"serve_multiplexed_model_id\": str(en_to_fr_version)},\n", + ")\n", + "\n", + "assert resp.ok\n", + "assert resp.status_code == 200\n", + "\n", + "resp.json()" ] }, { "cell_type": "markdown", + "metadata": { + "id": "kwkDDzebG_dd" + }, "source": [ "Let's now confirm that the signature-check provision we put in place actually works. For this, let's register this same model with a **slightly** different signature. This should be enough to trigger the failsafe.\n", "\n", "(Remember when we made the input label configurable at the start of this exercise? This is where that finally comes into play. 😎)" - ], - "metadata": { - "id": "kwkDDzebG_dd" - } + ] }, { "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "JYydMogXHsOJ", + "outputId": "d8cd96f0-58d2-462b-8902-d9a65b604dc0" + }, + "outputs": [], "source": [ "import pandas as pd\n", "\n", @@ -1026,30 +813,19 @@ " \"lang_to\": \"de\",\n", " },\n", " ).registered_model_version)" - ], + ] + }, + { + "cell_type": "code", + "execution_count": null, "metadata": { - "id": "JYydMogXHsOJ", "colab": { "base_uri": "https://localhost:8080/" }, - "outputId": "d8cd96f0-58d2-462b-8902-d9a65b604dc0" + "id": "5Yn-5VlIH6gs", + "outputId": "e22f1791-b013-445c-a2ab-08916c5c1032" }, - "execution_count": 90, - "outputs": [ - { - "output_type": "stream", - "name": "stderr", - "text": [ - "Device set to use cpu\n", - "Device set to use cpu\n", - "Registered model 'translation_model' already exists. 
Creating a new version of this model...\n", - "Created version '16' of model 'translation_model'.\n" - ] - } - ] - }, - { - "cell_type": "code", + "outputs": [], "source": [ "import requests\n", "\n", @@ -1059,51 +835,36 @@ " headers={\"serve_multiplexed_model_id\": incompatible_version},\n", ")\n", "assert not resp.ok\n", - "assert resp.status_code == 409\n", + "resp.status_code == 409\n", "\n", "assert resp.json()[0][\"translation_text\"] == \"FAILED\"" - ], - "metadata": { - "id": "5Yn-5VlIH6gs", - "colab": { - "base_uri": "https://localhost:8080/" - }, - "outputId": "e22f1791-b013-445c-a2ab-08916c5c1032" - }, - "execution_count": 99, - "outputs": [ - { - "output_type": "stream", - "name": "stderr", - "text": [ - "\u001b[36m(ServeReplica:default:DynamicallyDefinedDeployment pid=33782)\u001b[0m INFO 2024-12-23 16:07:41,052 default_DynamicallyDefinedDeployment c6ow5kq8 4847d79e-7b6f-4825-9d05-df0061222108 -- POST /serve/ 307 17.4ms\n", - "\u001b[36m(ServeReplica:default:DynamicallyDefinedDeployment pid=33782)\u001b[0m INFO 2024-12-23 16:07:43,253 default_DynamicallyDefinedDeployment c6ow5kq8 80d5bb80-c4e9-4dd5-ae51-f5fd1fe9b50c -- Loading model '16'.\n", - "\u001b[36m(ServeReplica:default:DynamicallyDefinedDeployment pid=33782)\u001b[0m Device set to use cpu\n", - "\u001b[36m(ServeReplica:default:DynamicallyDefinedDeployment pid=33782)\u001b[0m ERROR 2024-12-23 16:07:49,186 default_DynamicallyDefinedDeployment c6ow5kq8 80d5bb80-c4e9-4dd5-ae51-f5fd1fe9b50c -- Failed to load model '16'. Error: Requested version 16 has signature incompatible with that of default version 15\n" - ] - } ] }, { "cell_type": "markdown", - "source": [ - "(The technically \"correct\" thing to do here would be to implement a response container that allows for an \"error message\" to be defined as part of the actual response, rather than \"abusing\" the `translation_text` field like we do here. For demonstration purposes, however, this'll do.)" - ], "metadata": { "id": "DMhjLZh-jCVa" - } + }, + "source": [ + "(The technically \"correct\" thing to do here would be to implement a response container that allows for an \"error message\" to be defined as part of the actual response, rather than \"abusing\" the `translation_text` field like we do here. For demonstration purposes, however, this'll do.)" + ] }, { "cell_type": "markdown", - "source": [ - "To fully close things out, let's try registering an entirely different model -- with an entirely different signature -- and deploying that via `deployment_from_model_name()`. This will help us confirm that the entire signature is defined from the loaded model." - ], "metadata": { "id": "cCLtQCgsjwPM" - } + }, + "source": [ + "To fully close things out, let's try registering an entirely different model -- with an entirely different signature -- and deploying that via `deployment_from_model_name()`. This will help us confirm that the entire signature is defined from the loaded model." 
+ ] }, { "cell_type": "code", + "execution_count": 124, + "metadata": { + "id": "fXUPRszjIGYN" + }, + "outputs": [], "source": [ "import mlflow\n", "from transformers import pipeline\n", @@ -1138,15 +899,19 @@ " )\n", "\n", " return [resp] if type(resp) is not list else resp" - ], - "metadata": { - "id": "fXUPRszjIGYN" - }, - "execution_count": 124, - "outputs": [] + ] }, { "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "_p4FrmmhPAuq", + "outputId": "d5293b38-e56b-4b3f-c4e1-9906ba9c4383" + }, + "outputs": [], "source": [ "import pandas as pd\n", "\n", @@ -1166,35 +931,11 @@ " \"model_context\": \"My name is Hans and I live in Germany.\",\n", " },\n", " )" - ], - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "_p4FrmmhPAuq", - "outputId": "d5293b38-e56b-4b3f-c4e1-9906ba9c4383" - }, - "execution_count": 125, - "outputs": [ - { - "output_type": "stream", - "name": "stderr", - "text": [ - "Device set to use cpu\n", - "/usr/local/lib/python3.10/dist-packages/mlflow/types/utils.py:435: UserWarning: Hint: Inferred schema contains integer column(s). Integer columns in Python cannot represent missing values. If your input data contains missing values at inference time, it will be encoded as floats and will cause a schema enforcement error. The best way to avoid this problem is to infer the model schema based on a realistic data sample (training dataset) that includes missing values. Alternatively, you can declare integer columns as doubles (float64) whenever these columns may have missing values. See `Handling Integers With Missing Values `_ for more details.\n", - " warnings.warn(\n", - "Device set to use cpu\n", - "Registered model 'question_answerer' already exists. Creating a new version of this model...\n", - "Created version '8' of model 'question_answerer'.\n" - ] - } ] }, { "cell_type": "code", - "source": [ - "model_info.signature" - ], + "execution_count": 117, "metadata": { "colab": { "base_uri": "https://localhost:8080/" @@ -1202,10 +943,8 @@ "id": "g-0mQytrKyOc", "outputId": "dd59ef90-ed96-490a-c27f-8f5dbc023ed3" }, - "execution_count": 117, "outputs": [ { - "output_type": "execute_result", "data": { "text/plain": [ "inputs: \n", @@ -1216,13 +955,26 @@ " None" ] }, + "execution_count": 117, "metadata": {}, - "execution_count": 117 + "output_type": "execute_result" } + ], + "source": [ + "model_info.signature" ] }, { "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "afpSjdgYPaCw", + "outputId": "b01dcf25-289c-4ed6-f878-172966e88438" + }, + "outputs": [], "source": [ "from ray import serve\n", "\n", @@ -1233,54 +985,11 @@ " ).bind(),\n", " blocking=False\n", ")" - ], - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "afpSjdgYPaCw", - "outputId": "b01dcf25-289c-4ed6-f878-172966e88438" - }, - "execution_count": 127, - "outputs": [ - { - "output_type": "stream", - "name": "stderr", - "text": [ - "INFO 2024-12-23 16:14:03,641 serve 20385 -- Connecting to existing Serve app in namespace \"serve\". 
New http options will not be applied.\n", - "\u001b[36m(ServeController pid=27795)\u001b[0m INFO 2024-12-23 16:14:03,782 controller 27795 -- Deploying new version of Deployment(name='DynamicallyDefinedDeployment', app='default') (initial target replicas: 1).\n", - "\u001b[36m(ServeController pid=27795)\u001b[0m INFO 2024-12-23 16:14:03,905 controller 27795 -- Stopping 1 replicas of Deployment(name='DynamicallyDefinedDeployment', app='default') with outdated versions.\n", - "\u001b[36m(ServeController pid=27795)\u001b[0m INFO 2024-12-23 16:14:03,906 controller 27795 -- Adding 1 replica to Deployment(name='DynamicallyDefinedDeployment', app='default').\n", - "\u001b[36m(ServeReplica:default:DynamicallyDefinedDeployment pid=34337)\u001b[0m INFO 2024-12-23 16:14:05,922 default_DynamicallyDefinedDeployment zeqhtzxj -- Unloading model '4'.\n", - "\u001b[36m(ServeReplica:default:DynamicallyDefinedDeployment pid=34337)\u001b[0m INFO 2024-12-23 16:14:05,923 default_DynamicallyDefinedDeployment zeqhtzxj -- Successfully unloaded model '4' in 1.1ms.\n", - "\u001b[36m(ServeController pid=27795)\u001b[0m INFO 2024-12-23 16:14:06,047 controller 27795 -- Replica(id='zeqhtzxj', deployment='DynamicallyDefinedDeployment', app='default') is stopped.\n", - "INFO 2024-12-23 16:14:10,755 serve 20385 -- Application 'default' is ready at http://127.0.0.1:8000/.\n", - "INFO 2024-12-23 16:14:10,757 serve 20385 -- Deployed app 'default' successfully.\n" - ] - }, - { - "output_type": "execute_result", - "data": { - "text/plain": [ - "DeploymentHandle(deployment='DynamicallyDefinedDeployment')" - ] - }, - "metadata": {}, - "execution_count": 127 - } ] }, { "cell_type": "code", - "source": [ - "import requests\n", - "\n", - "resp = requests.post(\n", - " \"http://127.0.0.1:8000/serve/\",\n", - " json={\"question\": \"The weather is lovely today\"},\n", - ")\n", - "resp.json()\n" - ], + "execution_count": 130, "metadata": { "colab": { "base_uri": "https://localhost:8080/" @@ -1288,17 +997,15 @@ "id": "MsLq5vbsS84T", "outputId": "73489ce0-984b-4915-e8e0-27db7a8966ec" }, - "execution_count": 130, "outputs": [ { - "output_type": "stream", "name": "stderr", + "output_type": "stream", "text": [ "\u001b[36m(ServeReplica:default:DynamicallyDefinedDeployment pid=35834)\u001b[0m INFO 2024-12-23 16:14:40,551 default_DynamicallyDefinedDeployment z6r4w9bp f766b328-7f11-467b-b6fa-04f6d6c17a84 -- POST /serve/ 307 8.6ms\n" ] }, { - "output_type": "execute_result", "data": { "text/plain": [ "[{'score': 3.255750561947934e-05,\n", @@ -1307,17 +1014,44 @@ " 'answer': 'Germany.'}]" ] }, + "execution_count": 130, "metadata": {}, - "execution_count": 130 + "output_type": "execute_result" }, { - "output_type": "stream", "name": "stderr", + "output_type": "stream", "text": [ "\u001b[36m(ServeReplica:default:DynamicallyDefinedDeployment pid=35834)\u001b[0m INFO 2024-12-23 16:14:42,857 default_DynamicallyDefinedDeployment z6r4w9bp 74527eca-776e-497f-b478-b4dc8e24f53a -- POST /serve 200 2181.2ms\n" ] } + ], + "source": [ + "import requests\n", + "\n", + "resp = requests.post(\n", + " \"http://127.0.0.1:8000/serve/\",\n", + " json={\"question\": \"The weather is lovely today\"},\n", + ")\n", + "resp.json()\n" ] } - ] -} \ No newline at end of file + ], + "metadata": { + "colab": { + "authorship_tag": "ABX9TyPKW7x903JxiHL2pqDZChKh", + "include_colab_link": true, + "provenance": [], + "toc_visible": true + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + }, + "language_info": { + "name": "python" + } + }, + "nbformat": 4, + 
"nbformat_minor": 0 +} From a6e8d1078160f75467a198a9530cdee140c57e8a Mon Sep 17 00:00:00 2001 From: Jonathan Jin Date: Mon, 23 Dec 2024 11:31:17 -0500 Subject: [PATCH 03/11] rename + add to toc --- notebooks/en/_toctree.yml | 7 ++++++- .../en/mlflow_ray_serve.ipynb | 2 +- 2 files changed, 7 insertions(+), 2 deletions(-) rename Serving_Foundation_Models_from_Hugging_Face_with_Ray_Serve,_MLflow.ipynb => notebooks/en/mlflow_ray_serve.ipynb (99%) diff --git a/notebooks/en/_toctree.yml b/notebooks/en/_toctree.yml index cc99286d..0005f5ad 100644 --- a/notebooks/en/_toctree.yml +++ b/notebooks/en/_toctree.yml @@ -5,6 +5,12 @@ - local: index title: Overview + - title: MLOps Recipes + isExpanded: false + sections: + - local: mlflow_ray_serve.ipynb + title: Signature-Aware Model Serving from MLflow with Ray Serve + - title: LLM Recipes isExpanded: false sections: @@ -101,7 +107,6 @@ - local: fine_tuning_vlm_dpo_smolvlm_instruct title: Fine-tuning SmolVLM using direct preference optimization (DPO) with TRL on a consumer GPU - - title: Search Recipes isExpanded: false sections: diff --git a/Serving_Foundation_Models_from_Hugging_Face_with_Ray_Serve,_MLflow.ipynb b/notebooks/en/mlflow_ray_serve.ipynb similarity index 99% rename from Serving_Foundation_Models_from_Hugging_Face_with_Ray_Serve,_MLflow.ipynb rename to notebooks/en/mlflow_ray_serve.ipynb index 1b094e94..709bd5ef 100644 --- a/Serving_Foundation_Models_from_Hugging_Face_with_Ray_Serve,_MLflow.ipynb +++ b/notebooks/en/mlflow_ray_serve.ipynb @@ -16,7 +16,7 @@ "id": "I17bSxxg1evl" }, "source": [ - "# Serving Foundation Models from Hugging Face with Ray Serve, MLflow\n", + "# Signature-Aware Model Serving from MLflow with Ray Serve\n", "\n", "Authored by: Jonathan Jin" ] From 949b27ed96c81f268b6f34c815363c72837f26a8 Mon Sep 17 00:00:00 2001 From: Jonathan Jin Date: Mon, 23 Dec 2024 11:41:55 -0500 Subject: [PATCH 04/11] Add conclusion + exercises --- notebooks/en/mlflow_ray_serve.ipynb | 22 ++++++++++++++++++++++ 1 file changed, 22 insertions(+) diff --git a/notebooks/en/mlflow_ray_serve.ipynb b/notebooks/en/mlflow_ray_serve.ipynb index 709bd5ef..61661135 100644 --- a/notebooks/en/mlflow_ray_serve.ipynb +++ b/notebooks/en/mlflow_ray_serve.ipynb @@ -1035,6 +1035,28 @@ ")\n", "resp.json()\n" ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Conclusion\n", + "\n", + "In this notebook, we've leveraged MLflow's built-in support for tracking model signatures to heavily streamline the process of deploying an HTTP server to serve that model in online fashion. We've taken Ray Serve's powerful-but-fiddly primitives to empower ourselves to, in one line, deploy a model server with:\n", + "\n", + "- Version multiplexing;\n", + "- Automatic REST API signature setup;\n", + "- Safeguards to prevent use of model versions with incompatible signatures.\n", + "\n", + "In doing so, we've demonstrated Ray Serve's value and potential as a toolkit upon which you and your team can [\"build your own ML platform\"](https://docs.ray.io/en/latest/serve/index.html#how-does-serve-compare-to).\n", + "\n", + "We've also demonstrated ways to reduce the integration overhead and toil associated with using multiple tools in combination with each other. Seamless integration is a powerful argument in favor of self-contained all-encompassing platforms such as AWS Sagemaker or GCP Vertex AI. 
+    "\n",
+    "## Exercises\n",
+    "\n",
+    "- The generated API signature is **very similar** to the model signature, but there's still some mismatch. Can you identify where it is? Try fixing it. Hint: What happens when you try passing in multiple questions to the question-answerer endpoint we set up?\n",
+    "- We use the name `DynamicallyDefinedDeployment` every single time we generate a new deployment, regardless of what model name and version we pass in. Is this a problem? If so, what kind of issues do you foresee this approach creating? Try tweaking `deployment_from_model_name()` to handle those issues."
+   ]
   }
  ],
  "metadata": {

From ad0981fa5abd67115c43da9356cf3e3c8508cef4 Mon Sep 17 00:00:00 2001
From: Jonathan Jin
Date: Mon, 23 Dec 2024 11:45:34 -0500
Subject: [PATCH 05/11] add to index

---
 notebooks/en/index.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/notebooks/en/index.md b/notebooks/en/index.md
index 884789af..2b940e53 100644
--- a/notebooks/en/index.md
+++ b/notebooks/en/index.md
@@ -7,6 +7,7 @@ applications and solving various machine learning tasks using open-source tools
 
 Check out the recently added notebooks:
 
+- [Signature-Aware Model Serving from MLflow with Ray Serve](mlflow_ray_serve)
 - [Multimodal Retrieval-Augmented Generation (RAG) with Document Retrieval (ColPali) and Vision Language Models (VLMs)](multimodal_rag_using_document_retrieval_and_vlms)
 - [Fine-Tuning a Vision Language Model (Qwen2-VL-7B) with the Hugging Face Ecosystem (TRL)](fine_tuning_vlm_trl)
 - [Multi-agent RAG System 🤖🤝🤖](multiagent_rag_system)

From bda8b343e1b017a1474a41b9879e30343e72730b Mon Sep 17 00:00:00 2001
From: Jonathan Jin
Date: Mon, 23 Dec 2024 11:50:21 -0500
Subject: [PATCH 06/11] Update notebooks/en/mlflow_ray_serve.ipynb

---
 notebooks/en/mlflow_ray_serve.ipynb | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/notebooks/en/mlflow_ray_serve.ipynb b/notebooks/en/mlflow_ray_serve.ipynb
index 61661135..ea193f23 100644
--- a/notebooks/en/mlflow_ray_serve.ipynb
+++ b/notebooks/en/mlflow_ray_serve.ipynb
@@ -18,7 +18,7 @@
   "source": [
    "# Signature-Aware Model Serving from MLflow with Ray Serve\n",
    "\n",
-    "Authored by: Jonathan Jin"
+    "_Authored by: [Jonathan Jin](https://huggingface.co/jinnovation)_"
   ]

From d0884decb4f4e4eae6a46f849b09d5ea760f04c7 Mon Sep 17 00:00:00 2001
From: Jonathan Jin
Date: Mon, 30 Dec 2024 15:16:09 -0500
Subject: [PATCH 07/11] copy edits

---
 notebooks/en/mlflow_ray_serve.ipynb | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/notebooks/en/mlflow_ray_serve.ipynb b/notebooks/en/mlflow_ray_serve.ipynb
index ea193f23..8253ceeb 100644
--- a/notebooks/en/mlflow_ray_serve.ipynb
+++ b/notebooks/en/mlflow_ray_serve.ipynb
@@ -90,7 +90,7 @@
   "source": [
    "# Register the Model\n",
    "\n",
-    "First, let's define the model that we'll use for our explorations today. For simplicity's sake, we'll use a simple text translation model, where the source and destination languages are configurable at registration time. In effect, this means that different \"versions\" of the model can be registered to translate different languages, but that the underlying model architecture and weights can stay the same."
+    "First, let's define the model that we'll use for our exploration today. 
For simplicity's sake, we'll use a simple text translation model, where the source and destination languages are configurable at registration time. In effect, this means that different \"versions\" of the model can be registered to translate different languages, but the underlying model architecture and weights can stay the same." ] }, { From 97f7e48b2746e0b9382eccb8e0d5c57750501539 Mon Sep 17 00:00:00 2001 From: Jonathan Jin Date: Mon, 30 Dec 2024 15:17:30 -0500 Subject: [PATCH 08/11] Update notebooks/en/_toctree.yml Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> --- notebooks/en/_toctree.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/notebooks/en/_toctree.yml b/notebooks/en/_toctree.yml index 0005f5ad..5b8d50e4 100644 --- a/notebooks/en/_toctree.yml +++ b/notebooks/en/_toctree.yml @@ -8,7 +8,7 @@ - title: MLOps Recipes isExpanded: false sections: - - local: mlflow_ray_serve.ipynb + - local: mlflow_ray_serve title: Signature-Aware Model Serving from MLflow with Ray Serve - title: LLM Recipes From a9a3202f1d06c7b9c57b80a58aaa56221fed9f61 Mon Sep 17 00:00:00 2001 From: Jonathan Jin Date: Mon, 30 Dec 2024 15:18:29 -0500 Subject: [PATCH 09/11] update recently added notebooks --- notebooks/en/index.md | 10 +++------- 1 file changed, 3 insertions(+), 7 deletions(-) diff --git a/notebooks/en/index.md b/notebooks/en/index.md index 2b940e53..e8b6c0b5 100644 --- a/notebooks/en/index.md +++ b/notebooks/en/index.md @@ -8,14 +8,10 @@ applications and solving various machine learning tasks using open-source tools Check out the recently added notebooks: - [Signature-Aware Model Serving from MLflow with Ray Serve](mlflow_ray_serve) -- [Multimodal Retrieval-Augmented Generation (RAG) with Document Retrieval (ColPali) and Vision Language Models (VLMs)](multimodal_rag_using_document_retrieval_and_vlms) -- [Fine-Tuning a Vision Language Model (Qwen2-VL-7B) with the Hugging Face Ecosystem (TRL)](fine_tuning_vlm_trl) -- [Multi-agent RAG System 🤖🤝🤖](multiagent_rag_system) -- [Multimodal RAG with ColQwen2, Reranker, and Quantized VLMs on Consumer GPUs](multimodal_rag_using_document_retrieval_and_reranker_and_vlms) -- [Fine-tuning SmolVLM with TRL on a consumer GPU](fine_tuning_smol_vlm_sft_trl) -- [Smol Multimodal RAG: Building with ColSmolVLM and SmolVLM on Colab's Free-Tier GPU](multimodal_rag_using_document_retrieval_and_smol_vlm) - [Fine-tuning SmolVLM using direct preference optimization (DPO) with TRL on a consumer GPU](fine_tuning_vlm_dpo_smolvlm_instruct) - +- [Smol Multimodal RAG: Building with ColSmolVLM and SmolVLM on Colab's Free-Tier GPU](multimodal_rag_using_document_retrieval_and_smol_vlm) +- [Fine-tuning SmolVLM with TRL on a consumer GPU](fine_tuning_smol_vlm_sft_trl) +- [Multimodal RAG with ColQwen2, Reranker, and Quantized VLMs on Consumer GPUs](multimodal_rag_using_document_retrieval_and_reranker_and_vlms) You can also check out the notebooks in the cookbook's [GitHub repo](https://github.com/huggingface/cookbook). 
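As an aside on the "Register the Model" cell that PATCH 07 above copy-edits: here is a minimal, hypothetical sketch of what registering such a configurable translation model could look like. The `t5-small` checkpoint, the `translation_en_to_de` task, and the registered name `translator` are illustrative assumptions, not necessarily what the notebook itself uses.

```python
import mlflow
from transformers import pipeline

# The language pair is baked in when the pipeline is built, so each
# registered version can target a different pair while the underlying
# architecture and weights stay the same.
translator = pipeline("translation_en_to_de", model="t5-small")

with mlflow.start_run():
    mlflow.transformers.log_model(
        transformers_model=translator,
        artifact_path="model",
        registered_model_name="translator",
    )
```
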
From 0e3388b9a813eaa293f043840d57f9d768fa8aae Mon Sep 17 00:00:00 2001 From: Jonathan Jin Date: Mon, 30 Dec 2024 15:20:40 -0500 Subject: [PATCH 10/11] copy edits --- notebooks/en/mlflow_ray_serve.ipynb | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/notebooks/en/mlflow_ray_serve.ipynb b/notebooks/en/mlflow_ray_serve.ipynb index 8253ceeb..b68d0be5 100644 --- a/notebooks/en/mlflow_ray_serve.ipynb +++ b/notebooks/en/mlflow_ray_serve.ipynb @@ -29,7 +29,7 @@ "source": [ "# Introduction\n", "\n", - "This notebook explores solutions for streamlining the deployment of models from a model registry. For teams that want to productionize many models over time, investments at this \"transition point\" in the AI/ML project lifecycle can meaningful drive down time-to-production. This can be important for a younger, smaller team that may not have the benefit of large swathes of existing infrastructure in place to form a \"golden path\" for serving online models in production.\n", + "This notebook explores solutions for streamlining the deployment of models from a model registry. For teams that want to productionize many models over time, investments at this \"transition point\" in the AI/ML project lifecycle can meaningfully drive down time-to-production. This can be important for a younger, smaller team that may not have the benefit of existing infrastructure to form a \"golden path\" for serving online models in production.\n", "\n", "# Motivation\n", "\n", @@ -43,7 +43,7 @@ "\n", "Given all of the above, we motivate our exploration here with the following user story:\n", "\n", - "> I would like to deploy a model from a model registry (such as [MLflow](https://mlflow.org/)) using **only the name of the model**. The less boilerplate and scaffolding that I need to replicate each time I want to deploy a new model,the better. I would like the ability to dynamically select between different versions of the model without needing to set up a whole new deployment to accommodate those new versions.\n" + "> I would like to deploy a model from a model registry (such as [MLflow](https://mlflow.org/)) using **only the name of the model**. The less boilerplate and scaffolding that I need to replicate each time I want to deploy a new model, the better. I would like the ability to dynamically select between different versions of the model without needing to set up a whole new deployment to accommodate those new versions.\n" ] }, { From 79347dc13b3162e4819191b6e87bacaefee771fe Mon Sep 17 00:00:00 2001 From: Jonathan Jin Date: Mon, 30 Dec 2024 15:28:36 -0500 Subject: [PATCH 11/11] add two new exercises --- notebooks/en/mlflow_ray_serve.ipynb | 2 ++ 1 file changed, 2 insertions(+) diff --git a/notebooks/en/mlflow_ray_serve.ipynb b/notebooks/en/mlflow_ray_serve.ipynb index b68d0be5..d2d93523 100644 --- a/notebooks/en/mlflow_ray_serve.ipynb +++ b/notebooks/en/mlflow_ray_serve.ipynb @@ -1055,6 +1055,8 @@ "## Exercises\n", "\n", "- The generated API signature is **very similar** to the model signature, but there's still some mismatch. Can you identify where it is? Try fixing it. Hint: What happens when you try passing in multiple questions to the question-answerer endpoint we set up?\n", + "- MLflow model signatures allow for [optional inputs](https://mlflow.org/docs/latest/model/signatures.html#required-vs-optional-input-fields). Our current implementation does not account for this. 
How might we extend the implementation here to support optional inputs?\n", + "- Similarly, MLflow model signatures allow for non-input [\"inference parameters\"](https://mlflow.org/docs/latest/model/signatures.html#model-signatures-with-inference-params), which our current implementation also does not support. How might we extend our implementation here to support inference parameters?\n", "- We use the name `DynamicallyDefinedDeployment` every single time we generate a new deployment, regardless of what model name and version we pass in. Is this a problem? If so, what kind of issues do you foresee this approach creating? Try tweaking `deployment_from_model_name()` to handle those issues." ] }
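
For the final exercise above, one possible direction is to derive the Serve deployment name from the model name via Ray Serve's `.options(name=...)`. The sketch below is a hypothetical stand-in, not the notebook's actual implementation; `EchoDeployment` is a placeholder for the real deployment class.

```python
from ray import serve


@serve.deployment
class EchoDeployment:
    async def __call__(self, request):
        # Placeholder body; the real deployment would load and invoke
        # the registered MLflow model here.
        return await request.json()


def named_deployment(model_name: str):
    # Namespacing the deployment by model name means that deploying a
    # second model no longer tears down the replicas serving the first.
    return EchoDeployment.options(name=f"Deployment-{model_name}")


serve.run(named_deployment("question_answerer").bind(), blocking=False)
```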