diff --git a/CHANGELOG.md b/CHANGELOG.md index a56b9074..93f92f3e 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -2,11 +2,19 @@ ## [Unreleased] +### Added + +- :sparkles: Enable overriding parameters and the runner at predict time in ``KedroPipelineModel`` ([#445](https://github.com/Galileo-Galilei/kedro-mlflow/issues/445), [#612](https://github.com/Galileo-Galilei/kedro-mlflow/pull/612)) + +### Changed + +- :boom: :pushpin: Pin ``mlflow>=2.7.0`` to support predict parameters for custom models (see above feature) + ## [0.13.4] - 2024-12-14 ### Fixed -- :bug: :ambulance: Ensure `MlflowArtifactDataset` logs in the same run that parameters to when using `mlflow>=2.18` in combination with `ThreadRunner` [#613](https://github.com/Galileo-Galilei/kedro-mlflow/issues/613)) +- :bug: :ambulance: Ensure `MlflowArtifactDataset` logs in the same run as the parameters when using `mlflow>=2.18` in combination with `ThreadRunner` ([#613](https://github.com/Galileo-Galilei/kedro-mlflow/issues/613)) ## [0.13.3] - 2024-10-29 diff --git a/README.md b/README.md index 2a7483c8..b7dfb2ee 100644 --- a/README.md +++ b/README.md @@ -30,27 +30,24 @@ **Important: ``kedro-mlflow`` is only compatible with ``kedro>=0.16.0`` and ``mlflow>=1.0.0``. 
If you have a project created with an older version of ``Kedro``, see this [migration guide](https://github.com/quantumblacklabs/kedro/blob/master/RELEASE.md#migration-guide-from-kedro-015-to-016).** -``kedro-mlflow`` is available on PyPI, so you can install it with ``pip``: -```console -pip install kedro-mlflow -``` +You can install ``kedro-mlflow`` from several packaging platforms: -If you want to use the most up to date version of the package which is under development and not released yet, you can install the package from github: +| **Logo** | **Platform** | **Command** | +|:-----------------------------------------------------------------:|:------------:|:----------------------------------------------------:| +| ![PyPI logo](https://simpleicons.org/icons/pypi.svg) | PyPI | ``pip install kedro-mlflow`` | +| ![Conda Forge logo](https://simpleicons.org/icons/condaforge.svg) | Conda Forge | ``conda install kedro-mlflow --channel conda-forge`` | +| ![GitHub logo](https://simpleicons.org/icons/github.svg) | GitHub | ``pip install --upgrade git+https://github.com/Galileo-Galilei/kedro-mlflow.git`` | -```console -pip install --upgrade git+https://github.com/Galileo-Galilei/kedro-mlflow.git -``` - -I strongly recommend to use ``conda`` (a package manager) to create an environment and to read [``kedro`` installation guide](https://kedro.readthedocs.io/en/latest/get_started/install.html). +I strongly recommend using ``conda`` (a package manager) to create a virtual environment and reading the [``kedro`` installation guide](https://kedro.readthedocs.io/en/latest/get_started/install.html). # Getting started The documentation contains: -- [A "hello world" example](https://kedro-mlflow.readthedocs.io/en/latest/source/03_getting_started/index.html) which demonstrates how you to **setup your project**, **version parameters** and **datasets**, and browse your runs in the UI. 
-- A section for [advanced machine learning versioning](https://kedro-mlflow.readthedocs.io/en/latest/source/04_experimentation_tracking/index.html) to show more advanced features (mlflow configuration through the plugin, package and serve a kedro ``Pipeline``...) -- A section to demonstrate how to use `kedro-mlflow` as a [machine learning framework](https://kedro-mlflow.readthedocs.io/en/latest/source/05_framework_ml/index.html) to deliver production ready pipelines and serve them. This section comes with [an example repo](https://github.com/Galileo-Galilei/kedro-mlflow-tutorial) you can clone and try out. +- [A quickstart in 1 minute](https://kedro-mlflow.readthedocs.io/en/latest/source/03_quickstart/index.html) which demonstrates how to **set up your project**, **version parameters** and **datasets**, and browse your runs in the UI. +- A section for [advanced machine learning versioning](https://kedro-mlflow.readthedocs.io/en/latest/source/10_experiment_tracking/index.html) to show more advanced features (mlflow configuration through the plugin, package and serve a kedro ``Pipeline``...) +- A section to demonstrate how to use `kedro-mlflow` as a [machine learning framework](https://kedro-mlflow.readthedocs.io/en/latest/source/21_pipeline_serving/index.html) to deliver production-ready pipelines and serve them. This section comes with [an example repo](https://github.com/Galileo-Galilei/kedro-mlflow-tutorial) you can clone and try out. Some frequently asked questions on more advanced features: diff --git a/docs/index.rst b/docs/index.rst index b1cc2b73..f7f4e8c0 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -7,21 +7,33 @@ Welcome to kedro-mlflow's documentation! ======================================== .. 
toctree:: - :maxdepth: 6 + :maxdepth: -1 + :caption: Getting started Introduction Installation - Getting Started - Experimentation tracking - Pipeline serving - A mlops framework for continuous model serving - Interactive use - Python objects + Quickstart in 1 minute .. toctree:: - :maxdepth: 6 + :maxdepth: -1 + :caption: Experiment tracking - API documentation + In a kedro project + In a notebook + +.. toctree:: + :maxdepth: -1 + :caption: Pipeline serving + + Custom mlflow model for kedro pipelines + An MLOps framework for continuous model serving + +.. toctree:: + :maxdepth: -1 + :caption: Technical documentation + + Python objects + API documentation Indices and tables ================== diff --git a/docs/source/01_introduction/01_introduction.md b/docs/source/01_introduction/01_introduction.md index 802dd56e..323cdee3 100644 --- a/docs/source/01_introduction/01_introduction.md +++ b/docs/source/01_introduction/01_introduction.md @@ -27,7 +27,7 @@ While ``Kedro`` and ``Mlflow`` do not compete in the same field, they provide so | I/O configuration files | - ``catalog.yml``
- ``parameters.yml`` | ``MLproject`` | | Compute abstraction | - ``Pipeline``
- ``Node`` | N/A | | Compute configuration files | - ``hooks.py``
- ``run.py`` | `MLproject` | -| Parameters and data versioning | - ``Journal``
- ``AbstractVersionedDataset`` | - ``log_metric``
- ``log_artifact``
- ``log_param`` | +| Parameters and data versioning | - ``Journal`` (deprecated)
- Experiment tracking (deprecated)
- ``AbstractVersionedDataset`` | - ``log_metric``
- ``log_artifact``
- ``log_param``| | Cli execution | command ``kedro run`` | command ``mlflow run`` | | Code packaging | command ``kedro package`` | N/A | | Model packaging | N/A | - ``Mlflow Models`` (``mlflow.XXX.log_model`` functions)
- ``Mlflow Flavours`` | @@ -39,23 +39,17 @@ We discuss hereafter how the two libraries compete on the different functionalit ``Mlflow`` and ``Kedro`` are essentially overlapping on the way they offer a dedicated configuration files for running the pipeline from CLI. However: -- ``Mlflow`` provides a single configuration file (the ``MLProject``) where all elements are declared (data, parameters and pipelines). Its goal is mainly to enable CLI execution of the project, but it is not very flexible. In my opinion, this file is **production oriented** and is not really intended to use for exploration. +- ``Mlflow`` provides a single configuration file (the ``MLProject``) where all elements are declared (data, parameters and pipelines). Its goal is mainly to enable CLI execution of the project, but it is not very flexible. This file is **production oriented** and is not really intended for exploration and development. - ``Kedro`` offers a bunch of files (``catalog.yml``, ``parameters.yml``, ``pipeline.py``) and their associated abstraction (``AbstractDataset``, ``DataCatalog``, ``Pipeline`` and ``node`` objects). ``Kedro`` is much more opinionated: each object has a dedicated place (and only one!) in the template. This makes the framework both **exploration and production oriented**. The downside is that it could make the learning curve a bit sharper since a newcomer has to learn all ``Kedro`` specifications. It also provides a ``kedro-viz`` plugin to visualize the DAG interactively, which is particularly handy in medium-to-big projects. -> **``Kedro`` is a clear winner here, since it provides more functionnalities than ``Mlflow``. It handles very well _by design_ the exploration phase of data science projects when Mlflow is less flexible.** +```{note} +**``Kedro`` is a clear winner here, since it provides more functionalities than ``Mlflow``. 
It handles the exploration phase of data science projects very well _by design_, whereas Mlflow is less flexible.** +``` ### Versioning: Kedro 1 - 1 Mlflow -** This section will be updated soon with the brand new experiment tracking functionality of kedro** - -The ``Kedro`` ``Journal`` aimed at reproducibility (it was removed in ``kedro==0.18``), but is not focused on machine learning. The `Journal` keeps track of two elements: - -- the CLI arguments, including *on the fly* parameters. This makes the command used to run the pipeline fully reproducible. - the ``AbstractVersionedDataset`` for which versioning is activated. It consists in copying the data whom ``versioned`` argument is ``True`` when the ``save`` method of the ``AbstractVersionedDataset`` is called. -This approach suffers from two main drawbacks: - - the configuration is assumed immutable (including parameters), which is not realistic ni machine learning projects where they are very volatile. To fix this, the ``git sha`` has been recently added to the ``Journal``, but it has still some bugs in my experience (including the fact that the current ``git sha`` is logged even if the pipeline is ran with uncommitted change, which prevents reproducibility). This is still recent and will likely evolve in the future. - - there is no support for browsing old runs, which prevents [cleaning the database with old and unused datasets](https://github.com/quantumblacklabs/kedro/issues/406), compare runs between each other... +Kedro has made several attempts in the world of experiment tracking: the ``Journal`` in its early days (``kedro<=0.18``), then an [experiment tracking functionality](https://docs.kedro.org/projects/kedro-viz/en/v9.2.0/experiment_tracking.html) which kept track of parameters but will be removed in ``kedro>=0.20`` due to a lack of traction ([kedro-viz#2202](https://github.com/kedro-org/kedro-viz/issues/2202)). 
On the other hand, ``Mlflow``: @@ -64,7 +58,9 @@ On the other hand, ``Mlflow``: - [comes with a *User Interface* (UI)](https://mlflow.org/docs/latest/tracking.html#id7) which enable to browse / filter / sort the runs, display graphs of the metrics, render plots... This make the run management much easier than in ``Kedro``. - has a command to reproduce exactly the run from a given ``git sha``, [which is not possible in ``Kedro``](https://github.com/quantumblacklabs/kedro/issues/297). -> **``Mlflow`` is a clear winner here, because _UI_ and _run querying_ are must-have for machine learning projects. It is more mature than ``Kedro`` for versioning and more focused on machine learning.** +```{note} +**``Mlflow`` is a clear winner here, because _UI_ and _run querying_ are must-haves for machine learning projects. It is more mature than ``Kedro`` for versioning and more focused on machine learning.** +``` ### Model packaging and service: Kedro 1 - 2 Mlflow @@ -79,8 +75,10 @@ On the other hand, ``Mlflow``: When a stored model meets these requirements, ``Mlflow`` provides built-in tools to serve the model (as an API or for batch prediction) on many machine learning tools (Microsoft Azure ML, Amazon Sagemaker, Apache SparkUDF) and locally. -> **``Mlflow`` is currently the only tool which adresses model serving. This is currently not the top priority for ``Kedro``, but may come in the future ([through Kedro Server maybe?](https://github.com/quantumblacklabs/kedro/issues/143))** +```{note} +``Mlflow`` is currently the only tool which addresses model serving. Some [plugins address model deployment and serving](https://docs.kedro.org/en/stable/extend_kedro/plugins.html#community-developed-plugins) in the Kedro ecosystem, but they are not as well maintained as the core framework. 
+``` ### Conclusion: Use Kedro and add Mlflow for machine learning projects -In my opinion, ``Kedro``'s will to enforce software engineering best practice makes it really useful for machine learning teams. It is extremely well documented and the support is excellent, which makes it very user friendly even for people with no computer science background. However, it lacks some machine learning-specific functionalities (better versioning, model service), and it is where ``Mlflow`` fills the gap. +``Kedro``'s will to enforce software engineering best practices makes it really useful for machine learning teams. It is extremely well documented and the support is excellent, which makes it very user friendly even for people with no computer science background. However, it lacks some machine learning-specific functionalities (better versioning, model service), and this is where ``Mlflow`` fills the gap. diff --git a/docs/source/01_introduction/02_motivation.md b/docs/source/01_introduction/02_motivation.md index 4bf199a6..5b7caeb4 100644 --- a/docs/source/01_introduction/02_motivation.md +++ b/docs/source/01_introduction/02_motivation.md @@ -4,7 +4,7 @@ Basically, you should use `kedro-mlflow` in **any `Kedro` project which involves machine learning** / deep learning. As stated in the [introduction](./01_introduction.md), `Kedro`'s current versioning (as of version `0.16.6`) is not sufficient for machine learning projects: it lacks a UI and a ``run`` management system. Besides, the `KedroPipelineModel` ability to serve a kedro pipeline as an API or a batch in one line of code is a great addition for collaboration and transition to production. -If you do not use ``Kedro`` or if you do pure data processing which do not involve *machine learning*, this plugin is not what you are seeking for ;) +If you do not use ``Kedro`` or if you do pure data processing which does not involve *machine learning*, this plugin is not what you are looking for ;-) ## Why should I use kedro-mlflow? 
diff --git a/docs/source/02_installation/01_installation.md b/docs/source/02_installation/01_installation.md index b72d8e1c..48d6f4b7 100644 --- a/docs/source/02_installation/01_installation.md +++ b/docs/source/02_installation/01_installation.md @@ -42,7 +42,7 @@ Requires: pip-tools, cachetools, fsspec, toposort, anyconfig, PyYAML, click, plu ## Install the plugin -The current version of the plugin is compatible with ``kedro>=0.16.0``. Since Kedro tries to enforce backward compatibility, it will very likely remain compatible with further versions. +There are versions of the plugin compatible with ``kedro>=0.16.0`` and ``mlflow>=0.8.0``. ``kedro-mlflow`` stops adding features to a minor version between 2 and 6 months after a new kedro release. ### Install from PyPI @@ -70,7 +70,7 @@ Type ``kedro info`` in a terminal to check the installation. If it has succeede | |/ / _ \/ _` | '__/ _ \ | < __/ (_| | | | (_) | |_|\_\___|\__,_|_| \___/ -v0.16. +v0.. kedro allows teams to create analytics projects. It is developed as part of @@ -95,9 +95,4 @@ Usage: kedro mlflow [OPTIONS] COMMAND [ARGS]... Options: -h, --help Show this message and exit. - -Commands: - new Create a new kedro project with updated template. ``` - -*Note: For now, the `kedro mlflow new` command is not implemented. You must use `kedro new` to create a project, and then call `kedro mlflow init` inside this new project.* diff --git a/docs/source/02_installation/02_setup.md b/docs/source/02_installation/02_setup.md index d7629d64..5029934b 100644 --- a/docs/source/02_installation/02_setup.md +++ b/docs/source/02_installation/02_setup.md @@ -15,7 +15,7 @@ In order to use the ``kedro-mlflow`` plugin, you need to setup its configuration ### Setting up the ``kedro-mlflow`` configuration file -``kedro-mlflow`` is [configured](../07_python_objects/05_Configuration.md) through an ``mlflow.yml`` file. 
The recommended way to initialize the `mlflow.yml` is by using [the ``kedro-mlflow`` CLI](../07_python_objects/04_CLI.md), but you can create it manually. +``kedro-mlflow`` is [configured](../30_python_objects/05_Configuration.md) through an ``mlflow.yml`` file. The recommended way to initialize the `mlflow.yml` is by using [the ``kedro-mlflow`` CLI](../30_python_objects/04_CLI.md), but you can create it manually. ```{note} Since ``kedro-mlflow>=0.11.2``, the configuration file is optional. However, the plugin will use default ``mlflow`` configuration. Specifically, the runs will be stored in a ``mlruns`` folder at the root fo the kedro project since no ``mlflow_tracking_uri`` is configured. diff --git a/docs/source/02_installation/03_migration_guide.md b/docs/source/02_installation/03_migration_guide.md index 128b758e..33b8c8fb 100644 --- a/docs/source/02_installation/03_migration_guide.md +++ b/docs/source/02_installation/03_migration_guide.md @@ -117,9 +117,9 @@ Be aware that if you have saved a pipeline as a mlflow model with `pipeline_ml_f ```json { - predictions: + "predictions": { - + "" } } ``` @@ -128,7 +128,7 @@ to: ```json { - + "" } ``` diff --git a/docs/source/02_installation/index.rst b/docs/source/02_installation/index.rst index bbb824f5..968517fd 100644 --- a/docs/source/02_installation/index.rst +++ b/docs/source/02_installation/index.rst @@ -4,6 +4,7 @@ Introduction .. 
toctree:: :maxdepth: 4 + Install the plugin <01_installation.md> Setup your kedro project <02_setup.md> Migration guide between versions <03_migration_guide.md> diff --git a/docs/source/03_getting_started/00_intro_tutorial.md b/docs/source/03_quickstart/00_intro_tutorial.md similarity index 100% rename from docs/source/03_getting_started/00_intro_tutorial.md rename to docs/source/03_quickstart/00_intro_tutorial.md diff --git a/docs/source/03_getting_started/01_example_project.md b/docs/source/03_quickstart/01_example_project.md similarity index 97% rename from docs/source/03_getting_started/01_example_project.md rename to docs/source/03_quickstart/01_example_project.md index 4c28dab1..2c4b652e 100644 --- a/docs/source/03_getting_started/01_example_project.md +++ b/docs/source/03_quickstart/01_example_project.md @@ -5,9 +5,9 @@ Create a conda environment and install ``kedro-mlflow`` (this will automatically install ``kedro>=0.16.0``). ```console -conda create -n km_example python=3.9 --yes +conda create -n km_example python=3.10 --yes conda activate km_example -pip install kedro-mlflow==0.13.4 +pip install kedro-mlflow ``` ## Install the toy project diff --git a/docs/source/03_getting_started/02_first_steps.md b/docs/source/03_quickstart/02_first_steps.md similarity index 92% rename from docs/source/03_getting_started/02_first_steps.md rename to docs/source/03_quickstart/02_first_steps.md index f73859f8..24cf114d 100644 --- a/docs/source/03_getting_started/02_first_steps.md +++ b/docs/source/03_quickstart/02_first_steps.md @@ -2,10 +2,14 @@ ## Initialize kedro-mlflow -First, you need to initialize your project and add the plugin-specific configuration file with this command: +```{note} +This step is optional if you use ``kedro-mlflow>=0.11.2``. If you do not create a ``mlflow.yml`` configuration file, ``kedro-mlflow`` will use the defaults. However, creating one is heavily recommended because professional setups often need some specific enterprise configuration. 
+``` + +You can initialize your project with the plugin-specific configuration file with this command: ```console -kedro mlflow init +kedro mlflow init --env=local ``` You will see the following message: @@ -18,6 +22,7 @@ The ``conf/local`` folder is updated and you can see the `mlflow.yml` file: ![initialized_project](../imgs/initialized_project.png) + *Optional: If you have configured your own mlflow server, you can specify the tracking uri in the ``mlflow.yml`` (replace the highlighted line below):* ![mlflow_yml](../imgs/mlflow_yml.png) @@ -109,9 +114,6 @@ You should see the following graph: which indicates clearly which parameters are logged (in the red boxes with the "parameter" icon). -### Journal information - -The informations provided by the ``Kedro``'s ``Journal`` are also recorded as ``tags`` in the mlflow ui in order to make reproducible. In particular, the exact command used for running the pipeline and the kedro version used are stored. ### Artifacts @@ -159,4 +161,4 @@ This works for any type of file (including images with ``MatplotlibWriter``) and Above vanilla example is just the beginning of your experience with ``kedro-mlflow``. 
Check out the next sections to see how `kedro-mlflow`: - offers advanced capabilities for machine learning versioning -- can help to create standardize pipelines for deployment in production +- offers a way to create custom mlflow models from your kedro pipelines to deploy them effortlessly in production diff --git a/docs/source/03_getting_started/index.rst b/docs/source/03_quickstart/index.rst similarity index 100% rename from docs/source/03_getting_started/index.rst rename to docs/source/03_quickstart/index.rst diff --git a/docs/source/05_pipeline_serving/02_custom_kedro_pipeline_model.md b/docs/source/05_pipeline_serving/02_custom_kedro_pipeline_model.md deleted file mode 100644 index 1cfa31d8..00000000 --- a/docs/source/05_pipeline_serving/02_custom_kedro_pipeline_model.md +++ /dev/null @@ -1,47 +0,0 @@ -## Register a pipeline to mlflow with ``KedroPipelineModel`` custom mlflow model - -``kedro-mlflow`` has a ``KedroPipelineModel`` class (which inherits from ``mlflow.pyfunc.PythonModel``) which can turn any kedro ``Pipeline`` object to a Mlflow Model. - -To convert a ``Pipeline`` to a mlflow model, you need to create a ``KedroPipelineModel`` and then log it to mlflow. 
An example is given in below snippet: - -```python -from pathlib import Path -from kedro.framework.session import KedroSession -from kedro.framework.startup import bootstrap_project - -bootstrap_project(r"") -session = KedroSession.create(project_path=r"") - -# "pipeline" is the Pipeline object you want to convert to a mlflow model - -context = session.load_context() # this setups mlflow configuration -catalog = context.catalog -pipeline = context.pipelines[""] -input_name = "instances" - - -# artifacts are all the inputs of the inference pipelines that are persisted in the catalog - -# (optional) get the schema of the input dataset -input_data = catalog.load(input_name) -model_signature = infer_signature(model_input=input_data) - -# you can optionally pass other arguments, like the "copy_mode" to be used for each dataset -kedro_pipeline_model = KedroPipelineModel( - pipeline=pipeline, catalog=catalog, input_name=input_name -) - -artifacts = kedro_pipeline_model.extract_pipeline_artifacts() - -mlflow.pyfunc.log_model( - artifact_path="model", - python_model=kedro_pipeline_model, - artifacts=artifacts, - conda_env={"python": "3.10.0", dependencies: ["kedro==0.18.11"]}, - model_signature=model_signature, -) -``` - -Note that you need to provide the ``log_model`` function a bunch of non trivial-to-retrieve informations (the conda environment, the "artifacts" i.e. the persisted data you need to reuse like tokenizers / ml models / encoders, the model signature i.e. the columns names and types...). The ``KedroPipelineModel`` object has methods like `extract_pipeline_artifacts` to help you, but it needs some work on your side. - -> Saving Kedro pipelines as Mlflow Model objects is convenient and enable pipeline serving. However, it does not does not solve the decorrelation between training and inference: each time one triggers a training pipeline, (s)he must think to save it immediately afterwards. 
`kedro-mlflow` offers a convenient API to simplify this workflow, as described in the following sections. diff --git a/docs/source/05_pipeline_serving/03_cli_modelify.md b/docs/source/05_pipeline_serving/03_cli_modelify.md deleted file mode 100644 index be7bf334..00000000 --- a/docs/source/05_pipeline_serving/03_cli_modelify.md +++ /dev/null @@ -1,11 +0,0 @@ -## Register a pipeline to mlflow with ``KedroPipelineModel`` custom mlflow model - -You can log a Kedro ``Pipeline`` to mlflow as a custom model through the CLI with ``modelify`` command: - -```bash -kedro mlflow modelify --pipeline= --input-name -``` - -This command will create a new run with an artifact named ``model``. Open the user interface with ``kedro mlflow ui`` to check the result. You can also: -- specify the run id in which you want to log the pipeline with the ``--run-id`` argument, and its name with the --run-name argument. -- pass almost all arguments accepted by [``mlflow.pyfunc.log_model``](https://www.mlflow.org/docs/latest/python_api/mlflow.pyfunc.html#mlflow.pyfunc.log_model), see the list of all accepted arguments in the [API documentation](https://kedro-mlflow.readthedocs.io/en/stable/source/08_API/kedro_mlflow.framework.cli.html#modelify) diff --git a/docs/source/05_pipeline_serving/04_hook_pipeline_ml.md b/docs/source/05_pipeline_serving/04_hook_pipeline_ml.md deleted file mode 100644 index d78121c7..00000000 --- a/docs/source/05_pipeline_serving/04_hook_pipeline_ml.md +++ /dev/null @@ -1,75 +0,0 @@ -## Automatically log an inference after running the training pipeline - -For consistency, you may want to log an inference pipeline (including some data preprocessing and prediction post processing) after you ran a training pipeline, with all the artifacts newly generated (the new model, encoders, vectorizers...). - -### Getting started - -1. Install ``kedro-mlflow`` ``MlflowHook`` (this is done automatically if you have installed ``kedro-mlflow`` in a ``kedro>=0.16.5`` project) -2. 
Turn your training pipeline in a ``PipelineML`` object with ``pipeline_ml_factory`` function in your ``pipeline_registry.py``: - - ```python - # pipeline_registry.py for kedro>=0.17.2 (hooks.py for ``kedro>=0.16.5, <0.17.2) - - from kedro_mlflow_tutorial.pipelines.ml_app.pipeline import create_ml_pipeline - - - def register_pipelines(self) -> Dict[str, Pipeline]: - ml_pipeline = create_ml_pipeline() - training_pipeline_ml = pipeline_ml_factory( - training=ml_pipeline.only_nodes_with_tags("training"), - inference=ml_pipeline.only_nodes_with_tags("inference"), - input_name="instances", - log_model_kwargs=dict( - artifact_path="kedro_mlflow_tutorial", - conda_env={ - "python": 3.10, - "dependencies": [f"kedro_mlflow_tutorial=={PROJECT_VERSION}"], - }, - signature="auto", - ), - ) - - return {"training": training_pipeline_ml} - ``` - -3. Persist your artifacts locally in the ``catalog.yml`` - - ```yaml - label_encoder: - type: pickle.PickleDataset # <- This must be any Kedro Dataset other than "MemoryDataset" - filepath: data/06_models/label_encoder.pkl # <- This must be a local path, no matter what is your mlflow storage (S3 or other) - ``` - -4. Launch your training pipeline: - - ```bash - kedro run --pipeline=training - ``` - - **The inference pipeline will _automagically_ be logged as a mlflow model at the end!** - -5. Go to the UI, retrieve the run id of your "inference pipeline" model and use it as you want, e.g. 
in the `catalog.yml`: - - ```yaml - # catalog.yml - - pipeline_inference_model: - type: kedro_mlflow.io.models.MlflowModelTrackingDataset - flavor: mlflow.pyfunc - pyfunc_workflow: python_model - artifact_path: kedro_mlflow_tutorial # the name of your mlflow folder = the model_name in pipeline_ml_factory - run_id: - ``` - -### Complete step by step demo project with code - -A step by step tutorial with code is available in the [kedro-mlflow-tutorial repository on github](https://github.com/Galileo-Galilei/kedro-mlflow-tutorial#serve-the-inference-pipeline-to-a-end-user). - -You have also other resources to understand the rationale: -- an explanation of the [``PipelineML`` class in the python objects section](../07_python_objects/03_Pipelines.md) -- detailed explanations [on this issue](https://github.com/Galileo-Galilei/kedro-mlflow/issues/16). -- an example of use in a user project [in this repo](https://github.com/laurids-reichardt/kedro-examples/blob/kedro-mlflow-hotfix2/text-classification/src/text_classification/pipelines/pipeline.py). - -### Motivation - -You can find more about the motivations in . diff --git a/docs/source/05_pipeline_serving/05_deployment_patterns.md b/docs/source/05_pipeline_serving/05_deployment_patterns.md deleted file mode 100644 index 7ff50aa7..00000000 --- a/docs/source/05_pipeline_serving/05_deployment_patterns.md +++ /dev/null @@ -1,3 +0,0 @@ -## Deployment patterns for kedro pipelines - -A step by step tutorial with code is available in the [kedro-mlflow-tutorial repository on github](https://github.com/Galileo-Galilei/kedro-mlflow-tutorial#serve-the-inference-pipeline-to-an-end-user) which explains how to serve the pipeline as an API or a batch. diff --git a/docs/source/05_pipeline_serving/index.rst b/docs/source/05_pipeline_serving/index.rst deleted file mode 100644 index 0b30c877..00000000 --- a/docs/source/05_pipeline_serving/index.rst +++ /dev/null @@ -1,11 +0,0 @@ -Introduction -============ - -.. 
toctree:: - :maxdepth: 4 - - Reminder on Mlflow Models <01_mlflow_models.md> - Log a Pipeline as model with ``KedroPipelineModel`` <02_custom_kedro_pipeline_model.md> - Log a Pipeline as model with the CLI <03_cli_modelify.md> - Automatically log inference pipeline after training <04_hook_pipeline_ml.md> - Deployments patterns for ``KedroPipelineModel`` models <05_deployment_patterns.md> diff --git a/docs/source/08_API/kedro_mlflow.extras.extensions.rst b/docs/source/08_API/kedro_mlflow.extras.extensions.rst deleted file mode 100644 index 82011ab3..00000000 --- a/docs/source/08_API/kedro_mlflow.extras.extensions.rst +++ /dev/null @@ -1,7 +0,0 @@ -Notebook -====================================================== - -.. automodule:: kedro_mlflow.extras.extensions.ipython - :members: - :undoc-members: - :show-inheritance: diff --git a/docs/source/08_API/kedro_mlflow.framework.hooks.rst b/docs/source/08_API/kedro_mlflow.framework.hooks.rst deleted file mode 100644 index f7ca39ce..00000000 --- a/docs/source/08_API/kedro_mlflow.framework.hooks.rst +++ /dev/null @@ -1,18 +0,0 @@ -Hooks -====== - -Node Hook ------------ - -.. automodule:: kedro_mlflow.framework.hooks.node_hook - :members: - :undoc-members: - :show-inheritance: - -Pipeline Hook -------------- - -.. 
automodule:: kedro_mlflow.framework.hooks.pipeline_hook - :members: - :undoc-members: - :show-inheritance: diff --git a/docs/source/04_experimentation_tracking/01_configuration.md b/docs/source/10_experiment_tracking/01_configuration.md similarity index 77% rename from docs/source/04_experimentation_tracking/01_configuration.md rename to docs/source/10_experiment_tracking/01_configuration.md index 5eb36eb0..de17119f 100644 --- a/docs/source/04_experimentation_tracking/01_configuration.md +++ b/docs/source/10_experiment_tracking/01_configuration.md @@ -17,32 +17,90 @@ The rationale behind the separation of the backend store and the artifacts store ## The ``mlflow.yml`` file -The ``mlflow.yml`` file contains all configuration you can pass either to kedro or mlflow through the plugin. Note that you can duplicate `mlflow.yml` file in as many environments (i.e. `conf/` folders) as you need. To create a ``mlflow.yml`` file in a kedro configuration environment, use ``kedro mlflow init --env=``. +The ``mlflow.yml`` file contains all configuration you can pass either to kedro or mlflow through the plugin. Note that you can duplicate the `mlflow.yml` file in as many environments (i.e. `conf/` folders) as you need. To create a ``mlflow.yml`` file in a kedro configuration environment, use ``kedro mlflow init --env=``. You'll get the following result: + +```yaml +# SERVER CONFIGURATION ------------------- + +# `mlflow_tracking_uri` is the path where the runs will be recorded. +# For more information, see https://www.mlflow.org/docs/latest/tracking.html#where-runs-are-recorded +# kedro-mlflow accepts relative path from the project root. +# For instance, default `mlruns` will create a mlruns folder +# at the root of the project + +# All credentials needed for mlflow must be stored in credentials.yml as a dict +# they will be exported as environment variables +# If you want to set some credentials, e.g. 
AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY +# > in `credentials.yml`: +# your_mlflow_credentials: +# AWS_ACCESS_KEY_ID: 132456 +# AWS_SECRET_ACCESS_KEY: 132456 +# > in this file `mlflow.yml`: +# credentials: mlflow_credentials + +server: + mlflow_tracking_uri: null # if null, will use mlflow.get_tracking_uri() as a default + mlflow_registry_uri: null # if null, mlflow_tracking_uri will be used as mlflow default + credentials: null # must be a valid key in credentials.yml which refers to a dict of sensitive mlflow environment variables (password, tokens...). See top of the file. + request_header_provider: # this is only useful to deal with expiring tokens, see https://github.com/Galileo-Galilei/kedro-mlflow/issues/357 + type: null # The path to a class: my_project.pipelines.module.MyClass. Should inherit from https://github.com/mlflow/mlflow/blob/master/mlflow/tracking/request_header/abstract_request_header_provider.py#L4 + pass_context: False # should the class be instantiated with "kedro_context" argument? + init_kwargs: {} # any kwargs to pass to the class when it is instantiated + +tracking: + # You can specify a list of pipeline names for which tracking will be disabled + # Running "kedro run --pipeline=" will not log parameters + # in a new mlflow run + + disable_tracking: + pipelines: [] + + experiment: + name: {{ python_package }} + restore_if_deleted: True # if the experiment `name` was previously deleted, should we restore it? + + run: + id: null # if `id` is None, a new run will be created + name: null # if `name` is None, pipeline name will be used for the run name. You can use "${km.random_name:}" to generate a random name (mlflow's default) + nested: True # if `nested` is False, you won't be able to launch sub-runs inside your nodes + params: + dict_params: + flatten: False # if True, parameters which are dictionaries will be split into multiple parameters when logged in mlflow, one for each key.
+ recursive: True # Should the dictionary flattening be applied recursively (i.e. for nested dictionaries)? Not used if `flatten_dict_params` is False. + sep: "." # In case of recursive flattening, what separator should be used between the keys? E.g. {hyperparam1: {p1:1, p2:2}} will be logged as hyperparam1.p1 and hyperparam1.p2 in mlflow. + long_params_strategy: fail # One of ["fail", "tag", "truncate"] If a parameter is above the mlflow limit (currently 250 characters), what should kedro-mlflow do? -> fail, set as a tag instead of a parameter, or truncate it to its first 250 letters? + + +# UI-RELATED PARAMETERS ----------------- + +ui: + port: "5000" # the port to use for the ui. 5000 is the mlflow default. + host: "127.0.0.1" # the host to use for the ui. "127.0.0.1" is the mlflow default. +``` ```{note} If no ``mlflow.yml`` file is found in the environment, ``kedro-mlflow`` will still work and use all ``mlflow.yml`` default values as configuration. ``` ```{important} -If the kedro run is started in a process where a mlflow run is already active, ``kedro-mlflow`` will ignore all the configuration in ``mlflow.yml`` and use the active run. The mlflow run will NOT be closed at the end of the kedro run. This enable using ``kedro-mlflow`` with an orchestrator (e.g airflow, AzureML...) which starts the mlflow run and configuraiton itself. +If the kedro run is started in a process where a mlflow run is already active, ``kedro-mlflow`` will ignore all the configuration in ``mlflow.yml`` and use the active run. The mlflow run will NOT be closed at the end of the kedro run. This enables using ``kedro-mlflow`` with an orchestrator (e.g. airflow, AzureML...) which starts the mlflow run itself. ``` ### Configure the tracking server - #### Configure the tracking and registry uri ``kedro-mlflow`` needs the tracking uri of your mlflow tracking server to operate properly.
The ``mlflow.yml`` file must have the ``mlflow_tracking_uri`` key set to a [valid mlflow_tracking_uri](https://mlflow.org/docs/latest/tracking.html#where-runs-are-recorded) value. The ``mlflow.yml`` default has this key set to ``null``. This means that it will look for a ``MLFLOW_TRACKING_URI`` environment variable, and if it is not set, it will create a ``mlruns`` folder locally at the root of your kedro project. This enables you to use the plugin without any setup of a mlflow tracking server. -Unlike mlflow, `kedro-mlflow` allows the `mlflow_tracking_uri` to be a relative path. It will convert it to an absolute uri automatically. +```{tip} +Unlike mlflow, `kedro-mlflow` allows the `mlflow_tracking_uri` to be a relative path. It will convert it to an absolute uri automatically and prefix it with `file:///`. +``` ```yaml server: - mlflow_tracking_uri: mlruns + mlflow_tracking_uri: mlruns # or http://path/your/server ``` -This is the **only mandatory key in the `mlflow.yml` file**, but there are many others described hereafter that provide fine-grained control on your mlflow setup. - You can also specify the registry uri: ```yaml @@ -50,7 +108,7 @@ server: mlflow_registry_uri: sqlite:///path/to/registry.db ```
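The relative-to-absolute conversion mentioned in the tip above can be sketched with the standard library (a simplified illustration, not kedro-mlflow's actual code; the helper name is hypothetical):

```python
from pathlib import Path


def to_absolute_uri(uri: str, project_root: str = ".") -> str:
    """Turn a relative tracking path into an absolute file:/// uri,
    leaving real uris (http://..., sqlite:///...) untouched."""
    if "://" in uri:  # already has a proper uri scheme
        return uri
    return (Path(project_root) / uri).resolve().as_uri()


print(to_absolute_uri("mlruns"))  # e.g. file:///path/to/project/mlruns
```

This is why a bare ``mlruns`` value in ``mlflow.yml`` ends up as a local ``file:///`` store.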
``` diff --git a/docs/source/04_experimentation_tracking/02_version_parameters.md b/docs/source/10_experiment_tracking/02_version_parameters.md similarity index 60% rename from docs/source/04_experimentation_tracking/02_version_parameters.md rename to docs/source/10_experiment_tracking/02_version_parameters.md index 1af7f45a..41cb0b8f 100644 --- a/docs/source/04_experimentation_tracking/02_version_parameters.md +++ b/docs/source/10_experiment_tracking/02_version_parameters.md @@ -2,19 +2,19 @@ ## Automatic parameters versioning -Parameters versioning is automatic when the ``MlflowNodeHook`` is added to [the hook list of the ``ProjectContext``](https://kedro-mlflow.readthedocs.io/en/latest/source/02_installation/02_setup.html#declaring-kedro-mlflow-hooks). The `mlflow.yml` configuration file has a parameter called ``flatten_dict_params`` which enables to [log as distinct parameters the (key, value) pairs of a ```Dict`` parameter](../07_python_objects/02_Hooks.md). +Parameters versioning is automatic when the ``MlflowHook`` is added to [the hook list of the ``ProjectContext``](https://kedro-mlflow.readthedocs.io/en/latest/source/02_installation/02_setup.html#declaring-kedro-mlflow-hooks). The `mlflow.yml` configuration file has a parameter called ``flatten_dict_params`` which makes it possible to [log as distinct parameters the (key, value) pairs of a ``Dict`` parameter](../30_python_objects/02_Hooks.md). You **do not need any additional configuration** to benefit from parameters versioning. -## How does ``MlflowNodeHook`` operates under the hood? +## How does ``MlflowHook`` operate under the hood? The [medium post which introduces hooks](https://medium.com/quantumblack/introducing-kedro-hooks-fd5bc4c03ff5) explains in detail the different execution steps ``Kedro`` executes when the user calls the ``kedro run`` command.
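The recursive flattening behind ``flatten_dict_params`` can be sketched as follows (a simplified illustration of the idea, not the plugin's actual implementation):

```python
def flatten_dict(params, sep="."):
    """Flatten nested dict parameters so each leaf becomes its own mlflow parameter."""
    flat = {}
    for key, value in params.items():
        if isinstance(value, dict):
            # recurse into nested dicts and prefix sub-keys with the parent key
            for sub_key, sub_value in flatten_dict(value, sep).items():
                flat[f"{key}{sep}{sub_key}"] = sub_value
        else:
            flat[key] = value
    return flat


# {"hyperparam1": {"p1": 1, "p2": 2}} is logged as hyperparam1.p1 and hyperparam1.p2
print(flatten_dict({"hyperparam1": {"p1": 1, "p2": 2}}))
```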
![](../imgs/hook_registration_process.png) -The `MlflowNodeHook` registers the parameters before each node (entry point number 3 on above picture) by calling `mlflow.log_parameter(param_name, param_value)` on each parameters of the node. +The `MlflowHook` registers the parameters before each node (entry point number 3 on the above picture) by calling `mlflow.log_param(param_name, param_value)` on each parameter of the node. -## Frequently Asked Questions +## Frequently asked questions ### Will parameters be recorded if the pipeline fails during execution? diff --git a/docs/source/04_experimentation_tracking/03_version_datasets.md b/docs/source/10_experiment_tracking/03_version_datasets.md similarity index 94% rename from docs/source/04_experimentation_tracking/03_version_datasets.md rename to docs/source/10_experiment_tracking/03_version_datasets.md index 679cfab1..7e876547 100644 --- a/docs/source/04_experimentation_tracking/03_version_datasets.md +++ b/docs/source/10_experiment_tracking/03_version_datasets.md @@ -79,7 +79,11 @@ The location where artifacts will be stored does not depend on the logging function - how to [configure a mlflow tracking server](https://www.mlflow.org/docs/latest/tracking.html#mlflow-tracking-servers) - how to [configure an artifact store](https://www.mlflow.org/docs/latest/tracking.html#id10) with cloud storage. -Setting the `mlflow_tracking_uri` key of `mlflow.yml` to the url of this server is the only additional configuration you need to send your datasets to this remote server. Note that you still need to specify a **local** path for the underlying dataset, mlflow will take care of the upload to the server by itself. +**Setting the `mlflow_tracking_uri` key of `mlflow.yml` to the url of this properly configured server** is the only additional configuration you need to send your datasets to this remote server.
+ +```{important} +You still need to specify a **local** path for the underlying dataset (even when storing it on remote storage); mlflow will take care of the upload to the server by itself. +``` You can refer to [this issue](https://github.com/Galileo-Galilei/kedro-mlflow/issues/15) for further details. diff --git a/docs/source/04_experimentation_tracking/04_version_models.md b/docs/source/10_experiment_tracking/04_version_models.md similarity index 98% rename from docs/source/04_experimentation_tracking/04_version_models.md rename to docs/source/10_experiment_tracking/04_version_models.md index efd35c45..1341869f 100644 --- a/docs/source/04_experimentation_tracking/04_version_models.md +++ b/docs/source/10_experiment_tracking/04_version_models.md @@ -21,11 +21,11 @@ my_sklearn_model: flavor: mlflow.sklearn ``` -More informations on available parameters are available in the [dedicated section](../07_python_objects/01_DataSets.md#mlflowmodeltrackingdataset). +More information on the available parameters is given in the [dedicated section](../30_python_objects/01_DataSets.md#mlflowmodeltrackingdataset). You are now able to use ``my_sklearn_model`` in your nodes. Since this model is registered in mlflow, you can also leverage the [mlflow model serving abilities](https://www.mlflow.org/docs/latest/cli.html#mlflow-models-serve) or [predicting on batch abilities](https://www.mlflow.org/docs/latest/cli.html#mlflow-models-predict), as well as the [mlflow models registry](https://www.mlflow.org/docs/latest/model-registry.html) to manage the lifecycle of this model. -## Frequently asked questions? +## Frequently asked questions ### How is it working under the hood?
diff --git a/docs/source/04_experimentation_tracking/05_version_metrics.md b/docs/source/10_experiment_tracking/05_version_metrics.md similarity index 100% rename from docs/source/04_experimentation_tracking/05_version_metrics.md rename to docs/source/10_experiment_tracking/05_version_metrics.md diff --git a/docs/source/04_experimentation_tracking/06_mlflow_ui.md b/docs/source/10_experiment_tracking/06_mlflow_ui.md similarity index 100% rename from docs/source/04_experimentation_tracking/06_mlflow_ui.md rename to docs/source/10_experiment_tracking/06_mlflow_ui.md diff --git a/docs/source/04_experimentation_tracking/index.rst b/docs/source/10_experiment_tracking/index.rst similarity index 100% rename from docs/source/04_experimentation_tracking/index.rst rename to docs/source/10_experiment_tracking/index.rst diff --git a/docs/source/06_interactive_use/01_notebook_use.md b/docs/source/11_interactive_use/01_notebook_use.md similarity index 99% rename from docs/source/06_interactive_use/01_notebook_use.md rename to docs/source/11_interactive_use/01_notebook_use.md index 26039081..8051e40c 100644 --- a/docs/source/06_interactive_use/01_notebook_use.md +++ b/docs/source/11_interactive_use/01_notebook_use.md @@ -22,16 +22,16 @@ Open your notebook / ipython session with the Kedro CLI: kedro jupyter notebook ``` - Or if you are on JupyterLab, -```notebook +``` %load_ext kedro.ipython ``` Kedro [creates a bunch of global variables](https://kedro.readthedocs.io/en/stable/tools_integration/ipython.html#use-kedro-with-ipython-and-jupyter), including a `session`, a ``context`` and a ``catalog`` which are automatically accessible. When the context was created, ``kedro-mlflow`` automatically: + - loaded and set up (created the tracking uri, exported credentials...) the mlflow configuration of your `mlflow.yml` - imported ``mlflow``, which is now accessible in your notebook @@ -50,7 +50,9 @@ session.run( to_outputs="data_7", ) ``` +
+ - if you need to interact manually with the mlflow server, you can use ``context.mlflow.server._mlflow_client``. ## Guidelines and best practices suggestions @@ -58,6 +60,7 @@ but it is not very likely in a notebook. During experimentation phase, you will likely not run entire pipelines (or sub pipelines filtered out between some inputs and outputs). Hence, you cannot benefit from Kedro's ``hooks`` (and hence from ``kedro-mlflow`` tracking). From this moment on, perfect reproducbility is impossible to achieve: there is no chance that you manage to maintain a perfectly linear workflow, as you will go back and forth modifying parameters and code to create your model. I suggest to : + - **focus on versioning parameters and metrics**. The goal is to finetune your hyperparameters and to be able to remember later the best setup. It is not very important to this stage to version all parameters (e.g. preprocessing ones) nor models (after all you will need an entire pipeline to predict and it is very unlikely that you will need to reuse these experiment models one day.) It may be interesting to use ``mlflow.autolog()`` feature to have a easy basic setup. - **transition quickly to kedro pipelines**. For instance, when you preprocessing is roughly defined, try to put it in kedro pipelines. You can then use notebooks to experiment / perfom hyperparameter tuning while keeping preprocessing "fixed" to enhance reproducibility. 
You can run this pipeline interactively with: diff --git a/docs/source/06_interactive_use/index.rst b/docs/source/11_interactive_use/index.rst similarity index 100% rename from docs/source/06_interactive_use/index.rst rename to docs/source/11_interactive_use/index.rst diff --git a/docs/source/05_pipeline_serving/01_mlflow_models.md b/docs/source/21_pipeline_serving/01_mlflow_models.md similarity index 81% rename from docs/source/05_pipeline_serving/01_mlflow_models.md rename to docs/source/21_pipeline_serving/01_mlflow_models.md index 2f32363d..f805cdc7 100644 --- a/docs/source/05_pipeline_serving/01_mlflow_models.md +++ b/docs/source/21_pipeline_serving/01_mlflow_models.md @@ -5,6 +5,7 @@ [Mlflow Models are a standardised agnostic format to store machine learning models](https://www.mlflow.org/docs/latest/models.html). They are intended to be standalone and as portable as possible, so that they can be deployed virtually anywhere, and mlflow provides built-in CLI commands to deploy a mlflow model to most common cloud platforms or to create an API. A Mlflow Model is composed of: + - a ``MLmodel`` file which is a configuration file to indicate to mlflow how to load the model. This file may also contain the ``Signature`` of the model (i.e. the ``Schema`` of the input and output of your model, including the columns names and order) as well as example data. - a ``conda.yml`` file which contains the specifications of the virtual conda environment inside which the model should run. It contains the packages versions necessary for your model to be executed. - a ``model.pkl`` (or a ``python_function.pkl`` for custom models) file containing the trained model. @@ -15,9 +16,12 @@ Mlflow enables creating custom model "flavors" to convert any object to a Mlflow Model. ## Pre-requisite for serving a pipeline You can log any Kedro ``Pipeline`` matching the following requirements: + - one of its inputs must be a ``pandas.DataFrame``, a ``spark.DataFrame`` or a ``numpy.array``.
This is the **input which contains the data to predict on**. This can be any Kedro ``AbstractDataset`` which loads data in one of the previous three formats. It can also be a ``MemoryDataset`` and not be persisted in the ``catalog.yml``. -- all its other inputs must be persisted on disk (e.g. if the machine learning model must already be trained and saved so we can export it). +- all its other inputs must be persisted on disk (e.g. the machine learning model must already be trained and saved so we can export it) or declared as "parameters" in the model ``Signature``. ```{note} -If the pipeline has parameters, they will be persisted before exporting the model, which implies that you will not be able to modify them at runtime. This is a limitation of ``mlflow<2.6.0``, recently relaxed and that will be adressed by https://github.com/Galileo-Galilei/kedro-mlflow/issues/445. +If the pipeline has parameters: +- For ``mlflow<2.7.0``, the parameters need to be persisted before exporting the model, which implies that you will not be able to modify them at runtime. This is a limitation of ``mlflow<2.7.0``. +- For ``mlflow>=2.7.0``, they can be declared in the signature and modified at runtime. See https://github.com/Galileo-Galilei/kedro-mlflow/issues/445 for more information. ``` diff --git a/docs/source/21_pipeline_serving/02_scikit_learn_like_pipeline.md b/docs/source/21_pipeline_serving/02_scikit_learn_like_pipeline.md new file mode 100644 index 00000000..e9b23e89 --- /dev/null +++ b/docs/source/21_pipeline_serving/02_scikit_learn_like_pipeline.md @@ -0,0 +1,117 @@ +# Scikit-learn like Kedro pipelines - Automatically log the inference pipeline after training + +For consistency, you may want to **log an inference pipeline** (including some data preprocessing and prediction post processing) **automatically after you ran a training pipeline**, with all the artifacts generated during training (the new model, encoders, vectorizers...).
+ +```{hint} +You can think of ``pipeline_ml_factory`` as "**scikit-learn like pipeline in kedro**". Running ``kedro run -p training`` performs scikit-learn's ``pipeline.fit()`` operation, storing all components (e.g. a model) we need to reuse further as mlflow artifacts and the inference pipeline as code. Hence, you can later use this mlflow model which will perform scikit-learn's ``pipeline.predict(new_data)`` operation by running the entire kedro inference pipeline. +``` + +## Getting started with pipeline_ml_factory + +```{note} +The code below assumes that for inference, you want to skip some nodes that are training specific, e.g. you don't want to train the model, you just want to predict with it; you don't want to fit and transform with your encoder, but only transform. Make sure these 2 steps ("train" and "predict", or "fit" and "transform") are separated in 2 different nodes in your pipeline, so you can skip the train / fit step at inference time. +``` + +You can configure your project as follows: + +1. Install the ``kedro-mlflow`` ``MlflowHook`` (this is done automatically if you have installed ``kedro-mlflow`` in a ``kedro>=0.16.5`` project) +2.
Turn your training pipeline into a ``PipelineML`` object with the ``pipeline_ml_factory`` function in your ``pipeline_registry.py``: + + ```python + # pipeline_registry.py for kedro>=0.17.2 (hooks.py for kedro>=0.16.5,<0.17.2) + + from typing import Dict + + from kedro.pipeline import Pipeline + from kedro_mlflow.pipeline import pipeline_ml_factory + from kedro_mlflow_tutorial.pipelines.ml_app.pipeline import create_ml_pipeline + + + def register_pipelines() -> Dict[str, Pipeline]: + ml_pipeline = create_ml_pipeline() + training_pipeline_ml = pipeline_ml_factory( + training=ml_pipeline.only_nodes_with_tags( + "training" + ), # nodes : encode_labels + preprocess + train_model + predict + postprocess + evaluate + inference=ml_pipeline.only_nodes_with_tags( + "inference" + ), # nodes : preprocess + predict + postprocess + input_name="instances", + log_model_kwargs=dict( + artifact_path="kedro_mlflow_tutorial", + conda_env={ + "python": "3.10", + "dependencies": [f"kedro_mlflow_tutorial=={PROJECT_VERSION}"], + }, + signature="auto", + ), + ) + + return {"training": training_pipeline_ml} + ``` + +3. Persist all your artifacts locally in the ``catalog.yml`` + + ```yaml + label_encoder: + type: pickle.PickleDataset # <- This must be any Kedro Dataset other than "MemoryDataset" + filepath: data/06_models/label_encoder.pkl # <- This must be a local path, no matter what is your mlflow storage (S3 or other) + ``` + + and as well for your model if necessary. + +4. Launch your training pipeline: + + ```bash + kedro run --pipeline=training + ``` + + **The inference pipeline will _automagically_ be logged as a custom mlflow model** (a ``KedroPipelineModel``) **at the end of the training pipeline!** + +5. Go to the UI, retrieve the run id of your "inference pipeline" model and use it as you want, e.g.
in the `catalog.yml`: + + ```yaml + # catalog.yml + + pipeline_inference_model: + type: kedro_mlflow.io.models.MlflowModelTrackingDataset + flavor: mlflow.pyfunc + pyfunc_workflow: python_model + artifact_path: kedro_mlflow_tutorial # the name of your mlflow folder = the model_name in pipeline_ml_factory + run_id: + ``` + + Now you can run the entire inference pipeline inside a node as part of another pipeline. + +## Advanced configuration for pipeline_ml_factory + +### Register the model as a new version in the mlflow registry + +The ``log_model_kwargs`` argument is passed to the underlying [mlflow.pyfunc.log_model](https://mlflow.org/docs/latest/python_api/mlflow.pyfunc.html#mlflow.pyfunc.log_model). Specifically, it accepts a ``registered_model_name`` argument: + +```python +pipeline_ml_factory( + training=ml_pipeline.only_nodes_with_tags("training"), + inference=ml_pipeline.only_nodes_with_tags("inference"), + input_name="instances", + log_model_kwargs=dict( + artifact_path="kedro_mlflow_tutorial", + registered_model_name="my_inference_pipeline", # a new version of the "my_inference_pipeline" model will be registered each time you run the "training" pipeline + conda_env={ + "python": "3.10", + "dependencies": [f"kedro_mlflow_tutorial=={PROJECT_VERSION}"], + }, + signature="auto", + ), +) +``` + +## Complete step by step demo project with code + +A step by step tutorial with code is available in the [kedro-mlflow-tutorial repository on github](https://github.com/Galileo-Galilei/kedro-mlflow-tutorial#serve-the-inference-pipeline-to-a-end-user). + +There are also other resources to understand the rationale: + +- an explanation of the [``PipelineML`` class in the python objects section](../30_python_objects/03_Pipelines.md) +- detailed explanations [on this issue](https://github.com/Galileo-Galilei/kedro-mlflow/issues/16) and [this discussion](https://github.com/Galileo-Galilei/kedro-mlflow/discussions/229).
- an example of use in a user project [in this repo](https://github.com/laurids-reichardt/kedro-examples/blob/kedro-mlflow-hotfix2/text-classification/src/text_classification/pipelines/pipeline.py). + +## Motivation + +You can find more about the motivations in . diff --git a/docs/source/21_pipeline_serving/03_deployment_patterns.md b/docs/source/21_pipeline_serving/03_deployment_patterns.md new file mode 100644 index 00000000..1204b6cd --- /dev/null +++ b/docs/source/21_pipeline_serving/03_deployment_patterns.md @@ -0,0 +1,65 @@ +# Deployment patterns for kedro pipelines + +A step by step tutorial with code is available in the [kedro-mlflow-tutorial repository on github](https://github.com/Galileo-Galilei/kedro-mlflow-tutorial#serve-the-inference-pipeline-to-an-end-user) which explains how to serve the pipeline as an API or a batch. + +## Deploying a KedroPipelineModel + +### Reuse from a python script + +See tutorial: https://github.com/Galileo-Galilei/kedro-mlflow-tutorial?tab=readme-ov-file#scenario-1-reuse-from-a-python-script + +### Reuse in a kedro pipeline + +See tutorial: https://github.com/Galileo-Galilei/kedro-mlflow-tutorial?tab=readme-ov-file#scenario-2-reuse-in-a-kedro-pipeline + +### Serve the model with mlflow + +See tutorial: https://github.com/Galileo-Galilei/kedro-mlflow-tutorial?tab=readme-ov-file#scenario-3-serve-the-model-with-mlflow + +## Pass parameters at runtime to a Kedro PipelineModel + +### Pipeline parameters + +Since ``kedro-mlflow>0.14.0``, you can pass parameters when predicting with a ``KedroPipelineModel`` object. + +We assume you've trained a model with ``pipeline_ml_factory``. First, load the model, e.g.
through the catalog or as described in the previous section: + +```yaml +# catalog.yml +pipeline_inference_model: + type: kedro_mlflow.io.models.MlflowModelTrackingDataset + flavor: mlflow.pyfunc + pyfunc_workflow: python_model + artifact_path: kedro_mlflow_tutorial # the name of your mlflow folder = the model_name in pipeline_ml_factory + run_id: +``` + +Then, pass params as a dict under the ``params`` argument of the ``predict`` method: + +```python +model = catalog.load("pipeline_inference_model") # You can also load it in a node "as usual" +predictions = model.predict(input_data, params={"my_param": ""}) +``` + +```{warning} +This will only work if ``my_param`` is a parameter (i.e. prefixed with ``params:``) of the inference pipeline. +``` + +```{tip} +Available params are visible in the model signature in the UI. +``` + +### Configuring the runner + +Assuming the syntax of the previous section, a special key in "params" is reserved for the kedro runner: + +```python +model = catalog.load("pipeline_inference_model") +predictions = model.predict( + input_data, params={"my_param": "", "runner": "ThreadRunner"} +) +``` + +```{tip} +You can pass any kedro runner, or even a custom runner by using the path to the module: ``params={"runner": "my_package.my_module.MyRunner"}`` +``` diff --git a/docs/source/21_pipeline_serving/04_custom_kedro_pipeline_model.md b/docs/source/21_pipeline_serving/04_custom_kedro_pipeline_model.md new file mode 100644 index 00000000..c347437a --- /dev/null +++ b/docs/source/21_pipeline_serving/04_custom_kedro_pipeline_model.md @@ -0,0 +1,84 @@ +# Custom registering of a ``KedroPipelineModel`` + +```{warning} +The goal of this section is to give tools to machine learning engineers or platform engineers to reuse the objects and customize the workflow. This is especially useful in case you need high customisation or fine grained control of the kedro objects or the mlflow model attributes.
It is **very unlikely that you need this section** if you are using a kedro project "in the standard way" as a data scientist, in which case you should refer to the section [scikit-learn like pipeline in kedro](https://kedro-mlflow.readthedocs.io/en/stable/source/). +``` + +## Log a pipeline to mlflow programmatically with the ``KedroPipelineModel`` custom mlflow model + +```{hint} +When using the ``KedroPipelineModel`` programmatically, we focus only on the ``inference`` pipeline. We assume that you already ran the ``training`` pipeline previously, and that you now want to log the ``inference`` pipeline in mlflow manually by retrieving all the needed objects to create the custom model. +``` + +``kedro-mlflow`` has a ``KedroPipelineModel`` class (which inherits from ``mlflow.pyfunc.PythonModel``) which can turn any kedro ``Pipeline`` object into a Mlflow Model. + +To convert a ``Pipeline`` to a mlflow model, you need to create a ``KedroPipelineModel`` and then log it to mlflow. An example is given in the snippet below: + +```python +from pathlib import Path + +import mlflow +from kedro.framework.session import KedroSession +from kedro.framework.startup import bootstrap_project +from mlflow.models import infer_signature + +from kedro_mlflow.mlflow import KedroPipelineModel + +bootstrap_project(r"") +session = KedroSession.create(project_path=r"") + +# "pipeline" is the Pipeline object you want to convert to a mlflow model + +context = session.load_context() # this sets up the mlflow configuration +catalog = context.catalog +pipeline = context.pipelines[""] +input_name = "instances" + + +# artifacts are all the inputs of the inference pipelines that are persisted in the catalog + +# (optional) get the schema of the input dataset +input_data = catalog.load(input_name) +model_signature = infer_signature( + model_input=input_data +) # if you want to pass parameters in "predict", you should specify them in the signature + +# you can optionally pass other arguments, like the "copy_mode" to be used for each dataset +kedro_pipeline_model = KedroPipelineModel( + pipeline=pipeline, catalog=catalog,
input_name=input_name +) + +artifacts = kedro_pipeline_model.extract_pipeline_artifacts() + +mlflow.pyfunc.log_model( + artifact_path="model", + python_model=kedro_pipeline_model, + artifacts=artifacts, + conda_env={"python": "3.10.0", "dependencies": ["kedro==0.18.11"]}, + signature=model_signature, +) +``` + +```{important} +Note that you need to provide the ``log_model`` function a bunch of non-trivial-to-retrieve information (the conda environment, the "artifacts" i.e. the persisted data you need to reuse like tokenizers / ml models / encoders, the model signature i.e. the columns names and types and the predict parameters...). The ``KedroPipelineModel`` object has methods like `extract_pipeline_artifacts` to help you, but it needs some work on your side. +``` + +```{note} +Saving Kedro pipelines as Mlflow Model objects is convenient and enables pipeline serving. However, it does not solve the decorrelation between training and inference: each time you trigger a training pipeline, you must remember to save the inference pipeline immediately afterwards. `kedro-mlflow` offers a convenient API through hooks to simplify this workflow, as described in the section [scikit-learn like pipeline in kedro](https://kedro-mlflow.readthedocs.io/en/stable/source/). +``` + +## Log a pipeline to mlflow with the CLI + +```{note} +This command is mainly a helper to relog a model manually without retraining (e.g. because you slightly modified the preprocessing or post processing and don't want to train again.) +``` + +```{warning} +We **assume that you already ran the ``training`` pipeline previously**, which created persisted artifacts. Now you want to trigger logging the ``inference`` pipeline in mlflow through the CLI. This is dangerous because the command does not check that your pipeline is working correctly or that the persisted model has not been modified.
+``` + +You can log a Kedro ``Pipeline`` to mlflow as a custom model through the CLI with the ``modelify`` command: + +```bash +kedro mlflow modelify --pipeline= --input-name +``` + +This command will create a new run with an artifact named ``model`` and persist the code of your pipeline and all its inputs as artifacts (hence they should have been created *before* running this command, e.g. the model should already be persisted on the disk). Open the user interface with ``kedro mlflow ui`` to check the result. You can also: + +- specify the run id in which you want to log the pipeline with the ``--run-id`` argument, and its name with the ``--run-name`` argument. +- pass almost all arguments accepted by [``mlflow.pyfunc.log_model``](https://www.mlflow.org/docs/latest/python_api/mlflow.pyfunc.html#mlflow.pyfunc.log_model), see the list of all accepted arguments in the [API documentation](https://kedro-mlflow.readthedocs.io/en/stable/source/31_API/kedro_mlflow.framework.cli.html#modelify) diff --git a/docs/source/21_pipeline_serving/index.rst b/docs/source/21_pipeline_serving/index.rst new file mode 100644 index 00000000..d21e490b --- /dev/null +++ b/docs/source/21_pipeline_serving/index.rst @@ -0,0 +1,10 @@ +Introduction +============ + +..
toctree:: + :maxdepth: 4 + + Reminder on Mlflow Models <01_mlflow_models.md> + Scikit-learn like kedro pipelines with ``KedroPipelineModel`` <02_scikit_learn_like_pipeline.md> + Deployments patterns for ``KedroPipelineModel`` models <03_deployment_patterns.md> + Advanced logging for ``KedroPipelineModel`` <04_custom_kedro_pipeline_model.md> diff --git a/docs/source/05_framework_ml/01_why_framework.md b/docs/source/22_framework_ml/01_why_framework.md similarity index 100% rename from docs/source/05_framework_ml/01_why_framework.md rename to docs/source/22_framework_ml/01_why_framework.md diff --git a/docs/source/05_framework_ml/02_ml_project_components.md b/docs/source/22_framework_ml/02_ml_project_components.md similarity index 100% rename from docs/source/05_framework_ml/02_ml_project_components.md rename to docs/source/22_framework_ml/02_ml_project_components.md diff --git a/docs/source/05_framework_ml/03_framework_solutions.md b/docs/source/22_framework_ml/03_framework_solutions.md similarity index 100% rename from docs/source/05_framework_ml/03_framework_solutions.md rename to docs/source/22_framework_ml/03_framework_solutions.md diff --git a/docs/source/05_framework_ml/index.rst b/docs/source/22_framework_ml/index.rst similarity index 66% rename from docs/source/05_framework_ml/index.rst rename to docs/source/22_framework_ml/index.rst index 86ef1314..dfc0e02f 100644 --- a/docs/source/05_framework_ml/index.rst +++ b/docs/source/22_framework_ml/index.rst @@ -6,4 +6,4 @@ Introduction Why we need a mlops framework for development lifecycle <01_why_framework.md> The architecture of a machine learning project <02_ml_project_components.md> - An efficient tool for model serving and training / inference synchronization <03_framework_solutions.md> + A framework for training / inference synchronization <03_framework_solutions.md> diff --git a/docs/source/07_python_objects/01_DataSets.md b/docs/source/30_python_objects/01_DataSets.md similarity index 99% rename from 
docs/source/07_python_objects/01_DataSets.md
rename to docs/source/30_python_objects/01_DataSets.md
index a1da1c6a..97bab481 100644
--- a/docs/source/07_python_objects/01_DataSets.md
+++ b/docs/source/30_python_objects/01_DataSets.md
@@ -1,4 +1,4 @@
-# New ``DataSet``
+# New ``Dataset``s
 
 ## ``MlflowArtifactDataset``
 
diff --git a/docs/source/07_python_objects/02_Hooks.md b/docs/source/30_python_objects/02_Hooks.md
similarity index 100%
rename from docs/source/07_python_objects/02_Hooks.md
rename to docs/source/30_python_objects/02_Hooks.md
diff --git a/docs/source/07_python_objects/03_Pipelines.md b/docs/source/30_python_objects/03_Pipelines.md
similarity index 100%
rename from docs/source/07_python_objects/03_Pipelines.md
rename to docs/source/30_python_objects/03_Pipelines.md
diff --git a/docs/source/07_python_objects/04_CLI.md b/docs/source/30_python_objects/04_CLI.md
similarity index 98%
rename from docs/source/07_python_objects/04_CLI.md
rename to docs/source/30_python_objects/04_CLI.md
index b43e8aba..e77e302d 100644
--- a/docs/source/07_python_objects/04_CLI.md
+++ b/docs/source/30_python_objects/04_CLI.md
@@ -15,7 +15,7 @@
 ``kedro mlflow ui``: this command opens the mlflow UI (basically launches the ``mlflow ui`` command )
 
-`ui` accepts the port and host arguments of [``mlflow ui`` command](https://www.mlflow.org/docs/latest/cli.html#mlflow-ui). The default values used will be the ones defined in the [``mlflow.yml`` configuration file under the `ui`](../04_experimentation_tracking/01_configuration.md#configure-the-user-interface).
+`ui` accepts the port and host arguments of [``mlflow ui`` command](https://www.mlflow.org/docs/latest/cli.html#mlflow-ui). The default values used will be the ones defined in the [``mlflow.yml`` configuration file under the `ui`](../10_experimentation_tracking/01_configuration.md#configure-the-user-interface).
 
 If you provide the arguments at runtime, they will take priority over the ``mlflow.yml``, e.g.
if you have: diff --git a/docs/source/07_python_objects/05_Configuration.md b/docs/source/30_python_objects/05_Configuration.md similarity index 100% rename from docs/source/07_python_objects/05_Configuration.md rename to docs/source/30_python_objects/05_Configuration.md diff --git a/docs/source/07_python_objects/index.rst b/docs/source/30_python_objects/index.rst similarity index 100% rename from docs/source/07_python_objects/index.rst rename to docs/source/30_python_objects/index.rst diff --git a/docs/source/08_API/kedro_mlflow.config.rst b/docs/source/31_API/kedro_mlflow.config.rst similarity index 100% rename from docs/source/08_API/kedro_mlflow.config.rst rename to docs/source/31_API/kedro_mlflow.config.rst diff --git a/docs/source/08_API/kedro_mlflow.framework.cli.rst b/docs/source/31_API/kedro_mlflow.framework.cli.rst similarity index 100% rename from docs/source/08_API/kedro_mlflow.framework.cli.rst rename to docs/source/31_API/kedro_mlflow.framework.cli.rst diff --git a/docs/source/31_API/kedro_mlflow.framework.hooks.rst b/docs/source/31_API/kedro_mlflow.framework.hooks.rst new file mode 100644 index 00000000..f1848d0e --- /dev/null +++ b/docs/source/31_API/kedro_mlflow.framework.hooks.rst @@ -0,0 +1,10 @@ +Hooks +====== + +Node Hook +----------- + +.. automodule:: kedro_mlflow.framework.hooks.mlflow_hook + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/08_API/kedro_mlflow.io.rst b/docs/source/31_API/kedro_mlflow.io.rst similarity index 93% rename from docs/source/08_API/kedro_mlflow.io.rst rename to docs/source/31_API/kedro_mlflow.io.rst index 32426d3d..61e2b33d 100644 --- a/docs/source/08_API/kedro_mlflow.io.rst +++ b/docs/source/31_API/kedro_mlflow.io.rst @@ -23,7 +23,7 @@ Metrics DataSet :show-inheritance: -.. automodule:: kedro_mlflow.io.metrics.mlflow_metrics_dataset +.. 
automodule:: kedro_mlflow.io.metrics.mlflow_metrics_history_dataset :members: :undoc-members: :show-inheritance: diff --git a/docs/source/08_API/kedro_mlflow.mlflow.rst b/docs/source/31_API/kedro_mlflow.mlflow.rst similarity index 100% rename from docs/source/08_API/kedro_mlflow.mlflow.rst rename to docs/source/31_API/kedro_mlflow.mlflow.rst diff --git a/docs/source/08_API/kedro_mlflow.pipeline.rst b/docs/source/31_API/kedro_mlflow.pipeline.rst similarity index 100% rename from docs/source/08_API/kedro_mlflow.pipeline.rst rename to docs/source/31_API/kedro_mlflow.pipeline.rst diff --git a/docs/source/08_API/kedro_mlflow.rst b/docs/source/31_API/kedro_mlflow.rst similarity index 86% rename from docs/source/08_API/kedro_mlflow.rst rename to docs/source/31_API/kedro_mlflow.rst index f27c18cc..059c18f5 100644 --- a/docs/source/08_API/kedro_mlflow.rst +++ b/docs/source/31_API/kedro_mlflow.rst @@ -9,5 +9,4 @@ kedro\_mlflow package kedro_mlflow.pipeline kedro_mlflow.mlflow kedro_mlflow.config - kedro_mlflow.extras.extensions kedro_mlflow.framework.hooks diff --git a/kedro_mlflow/framework/hooks/mlflow_hook.py b/kedro_mlflow/framework/hooks/mlflow_hook.py index 9db324b8..a50e5e54 100644 --- a/kedro_mlflow/framework/hooks/mlflow_hook.py +++ b/kedro_mlflow/framework/hooks/mlflow_hook.py @@ -399,7 +399,18 @@ def after_pipeline_run( if isinstance(model_signature, str): if model_signature == "auto": input_data = catalog.load(pipeline.input_name) - model_signature = infer_signature(model_input=input_data) + + # all pipeline params will be overridable at predict time: https://mlflow.org/docs/latest/model/signatures.html#model-signatures-with-inference-params + # I add the special "runner" parameter to be able to choose it at runtime + pipeline_params = { + ds_name[7:]: catalog.load(ds_name) + for ds_name in pipeline.inference.inputs() + if ds_name.startswith("params:") + } | {"runner": "SequentialRunner"} + model_signature = infer_signature( + model_input=input_data, + 
params=pipeline_params,
+            )
 
         mlflow.pyfunc.log_model(
             python_model=kedro_pipeline_model,
@@ -427,7 +438,7 @@ def on_pipeline_error(
         catalog: DataCatalog,
     ):
         """Hook invoked when the pipeline execution fails.
-        All the mlflow runs must be closed to avoid interference with further execution.
+        All the mlflow runs must be closed to avoid interference with further execution.
 
         Args:
             error: (Not used) The uncaught exception thrown during the pipeline run.
diff --git a/kedro_mlflow/mlflow/kedro_pipeline_model.py b/kedro_mlflow/mlflow/kedro_pipeline_model.py
index 5c8d2ce7..c630b7b1 100644
--- a/kedro_mlflow/mlflow/kedro_pipeline_model.py
+++ b/kedro_mlflow/mlflow/kedro_pipeline_model.py
@@ -6,6 +6,7 @@ from kedro.io import DataCatalog, MemoryDataset
 from kedro.pipeline import Pipeline
 from kedro.runner import AbstractRunner, SequentialRunner
+from kedro.utils import load_obj
 from kedro_datasets.pickle import PickleDataset
 from mlflow.pyfunc import PythonModel
@@ -196,17 +197,47 @@ def load_context(self, context):
             updated_catalog._datasets[name]._filepath = Path(uri)
             self.loaded_catalog.save(name=name, data=updated_catalog.load(name))
 
-    def predict(self, context, model_input):
+    def predict(self, context, model_input, params=None):
         # we create an empty hook manager but do NOT register hooks
         # because we want this model to be executable outside of a kedro project
+
+        # params that can be passed at predict time:
+        # TODO globals
+        # TODO runtime
+        # TODO parameters -> I'd prefer not to have them, but it would require the catalog to support not being fully resolved if we want to pass runtime and globals
+        # TODO hooks
+        # TODO runner
+
+        params = params or {}
+
+        runner_class = params.pop("runner", "SequentialRunner")
+
+        # we don't want to recreate the runner object on each predict
+        # because reimporting comes with a performance penalty in a serving setup
+        # so if it is the default we just use the existing runner
+        runner = (
+            self.runner
+            if runner_class == type(self.runner).__name__
+            else load_obj(
runner_class, "kedro.runner"
+            )()  # do not forget to instantiate the class with a trailing ()
+        )
+
         hook_manager = _create_hook_manager()
+        # _register_hooks(hook_manager, predict_params.hooks)
+
+        for name, value in params.items():
+            # no need to check if params are in the catalog, because mlflow already checks that the params match the signature
+            param = f"params:{name}"
+            self._logger.info(f"Using {param}={value} for the prediction")
+            self.loaded_catalog.save(name=param, data=value)
 
         self.loaded_catalog.save(
             name=self.input_name,
             data=model_input,
         )
 
-        run_output = self.runner.run(
+        run_output = runner.run(
             pipeline=self.pipeline,
             catalog=self.loaded_catalog,
             hook_manager=hook_manager,
diff --git a/requirements.txt b/requirements.txt
index 732f9f72..dbaffa5e 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,4 +1,4 @@
 kedro>=0.19.0, <0.20.0
 kedro_datasets
-mlflow>=1.29.0, <3.0.0
+mlflow>=2.7.0, <3.0.0
 pydantic>=1.0.0, <3.0.0
diff --git a/tests/framework/hooks/test_hook_pipeline_ml.py b/tests/framework/hooks/test_hook_pipeline_ml.py
index 82c8d0b4..70e4575a 100644
--- a/tests/framework/hooks/test_hook_pipeline_ml.py
+++ b/tests/framework/hooks/test_hook_pipeline_ml.py
@@ -45,7 +45,7 @@ def preprocess_fun(data):
         return data
 
     def train_fun(data, param):
-        return 2
+        return 1
 
     def predict_fun(model, data):
         return data * model
@@ -105,7 +105,7 @@ def remove_stopwords(data, stopwords):
         return data
 
     def train_fun_hyperparam(data, hyperparam):
-        return 2
+        return 1
 
     def predict_fun(model, data):
         return data * model
@@ -156,10 +156,36 @@ def convert_probs_to_pred(data, threshold):
     return pipeline_ml_with_parameters
 
 
+@pytest.fixture
+def catalog_with_parameters(kedro_project_with_mlflow_conf):
+    catalog_with_parameters = DataCatalog(
+        {
+            "data": MemoryDataset(pd.DataFrame(data=[0.5], columns=["a"])),
+            "cleaned_data": MemoryDataset(),
+            "params:stopwords": MemoryDataset(["Hello", "Hi"]),
+            "params:penalty": MemoryDataset(0.1),
+            "model": PickleDataset(
filepath=( + kedro_project_with_mlflow_conf / "data" / "model.csv" + ).as_posix() + ), + "params:threshold": MemoryDataset(0.5), + } + ) + return catalog_with_parameters + + @pytest.fixture def dummy_signature(dummy_catalog, dummy_pipeline_ml): input_data = dummy_catalog.load(dummy_pipeline_ml.input_name) - dummy_signature = infer_signature(input_data) + params_dict = { + key: dummy_catalog.load(key) + for key in dummy_pipeline_ml.inference.inputs() + if key.startswith("params:") + } + dummy_signature = infer_signature( + model_input=input_data, params={**params_dict, "runner": "SequentialRunner"} + ) return dummy_signature @@ -303,7 +329,7 @@ def test_mlflow_hook_save_pipeline_ml( assert trained_model.metadata.signature.to_dict() == { "inputs": '[{"type": "long", "name": "a", "required": true}]', "outputs": None, - "params": None, + "params": '[{"name": "runner", "default": "SequentialRunner", "shape": null, "type": "string"}]', } @@ -434,6 +460,7 @@ def test_mlflow_hook_save_pipeline_ml_with_default_copy_mode_assign( def test_mlflow_hook_save_pipeline_ml_with_parameters( kedro_project_with_mlflow_conf, # a fixture to be in a kedro project pipeline_ml_with_parameters, + catalog_with_parameters, dummy_run_params, ): # config_with_base_mlflow_conf is a conftest fixture @@ -441,21 +468,6 @@ def test_mlflow_hook_save_pipeline_ml_with_parameters( with KedroSession.create(project_path=kedro_project_with_mlflow_conf) as session: context = session.load_context() - catalog_with_parameters = DataCatalog( - { - "data": MemoryDataset(pd.DataFrame(data=[1], columns=["a"])), - "cleaned_data": MemoryDataset(), - "params:stopwords": MemoryDataset(["Hello", "Hi"]), - "params:penalty": MemoryDataset(0.1), - "model": PickleDataset( - filepath=( - kedro_project_with_mlflow_conf / "data" / "model.csv" - ).as_posix() - ), - "params:threshold": MemoryDataset(0.5), - } - ) - mlflow_hook = MlflowHook() mlflow_hook.after_context_created(context) @@ -687,3 +699,148 @@ def 
test_mlflow_hook_save_pipeline_ml_with_dataset_factory(
         trained_model = mlflow.pyfunc.load_model(f"runs:/{run_id}/artifacts")
         # the real test is that the model is loaded without error
         assert trained_model is not None
+
+
+def test_mlflow_hook_save_and_load_pipeline_ml_with_inference_parameters(
+    kedro_project_with_mlflow_conf,  # a fixture to be in a kedro project
+    pipeline_ml_with_parameters,
+    catalog_with_parameters,
+    dummy_run_params,
+):
+    bootstrap_project(kedro_project_with_mlflow_conf)
+    with KedroSession.create(project_path=kedro_project_with_mlflow_conf) as session:
+        context = session.load_context()
+
+        mlflow_hook = MlflowHook()
+        mlflow_hook.after_context_created(context)
+
+        runner = SequentialRunner()
+        mlflow_hook.after_catalog_created(
+            catalog=catalog_with_parameters,
+            # `after_catalog_created` is not using any of the arguments below,
+            # so we are setting them to empty values.
+            conf_catalog={},
+            conf_creds={},
+            feed_dict={},
+            save_version="",
+            load_versions="",
+        )
+        mlflow_hook.before_pipeline_run(
+            run_params=dummy_run_params,
+            pipeline=pipeline_ml_with_parameters,
+            catalog=catalog_with_parameters,
+        )
+        runner.run(
+            pipeline_ml_with_parameters, catalog_with_parameters, session._hook_manager
+        )
+
+        current_run_id = mlflow.active_run().info.run_id
+
+        # This is what we want to test: parameters should be passed by default to the signature
+        mlflow_hook.after_pipeline_run(
+            run_params=dummy_run_params,
+            pipeline=pipeline_ml_with_parameters,
+            catalog=catalog_with_parameters,
+        )
+
+        # test 1 : parameters should have been logged
+        trained_model = mlflow.pyfunc.load_model(f"runs:/{current_run_id}/model")
+
+        # The "threshold" parameter of the inference pipeline should be in the signature
+        assert (
+            '{"name": "threshold", "default": 0.5, "shape": null, "type": "double"}'
+            in
trained_model.metadata.signature.to_dict()["params"]
+        )
+
+        # test 2 : we get different results when passing parameters
+
+        inference_data = pd.DataFrame(data=[0.2, 0.6, 0.9], columns=["a"])
+
+        assert all(
+            trained_model.predict(inference_data)
+            == pd.DataFrame([0, 1, 1]).values  # no params: threshold defaults to 0.5
+        )
+
+        assert all(
+            trained_model.predict(
+                inference_data,
+                params={"threshold": 0.8},
+            )
+            == pd.DataFrame([0, 0, 1]).values  # 0.6 is now below the threshold
+        )
+
+
+def test_mlflow_hook_save_and_load_pipeline_ml_specify_runner(
+    kedro_project_with_mlflow_conf,  # a fixture to be in a kedro project
+    pipeline_ml_with_parameters,
+    catalog_with_parameters,
+    dummy_run_params,
+):
+    bootstrap_project(kedro_project_with_mlflow_conf)
+    with KedroSession.create(project_path=kedro_project_with_mlflow_conf) as session:
+        context = session.load_context()
+
+        mlflow_hook = MlflowHook()
+        mlflow_hook.after_context_created(context)
+
+        runner = SequentialRunner()
+        mlflow_hook.after_catalog_created(
+            catalog=catalog_with_parameters,
+            # `after_catalog_created` is not using any of the arguments below,
+            # so we are setting them to empty values.
+            conf_catalog={},
+            conf_creds={},
+            feed_dict={},
+            save_version="",
+            load_versions="",
+        )
+        mlflow_hook.before_pipeline_run(
+            run_params=dummy_run_params,
+            pipeline=pipeline_ml_with_parameters,
+            catalog=catalog_with_parameters,
+        )
+        runner.run(
+            pipeline_ml_with_parameters, catalog_with_parameters, session._hook_manager
+        )
+
+        current_run_id = mlflow.active_run().info.run_id
+
+        # This is what we want to test: parameters should be passed by default to the signature
+        mlflow_hook.after_pipeline_run(
+            run_params=dummy_run_params,
+            pipeline=pipeline_ml_with_parameters,
+            catalog=catalog_with_parameters,
+        )
+
+        # test : parameters should have been logged
+        trained_model = mlflow.pyfunc.load_model(f"runs:/{current_run_id}/model")
+
+        # test 1 : the parameters in the signature should have the runner with a default "SequentialRunner"
+        assert (
+            '{"name": "runner", "default": "SequentialRunner", "shape": null, "type": "string"}'
+            in trained_model.metadata.signature.to_dict()["params"]
+        )
+
+        inference_data = pd.DataFrame(data=[0.2, 0.6, 0.9], columns=["a"])
+
+        # raise an error with a non-existing runner
+        with pytest.raises(
+            AttributeError,
+            match="module 'kedro.runner' has no attribute 'non_existing_runner'",
+        ):
+            trained_model.predict(
+                inference_data, params={"runner": "non_existing_runner"}
+            )
+
+        # test 2 : run with another runner (the log clearly shows that it is indeed
+        # the other runner which is picked)
+        assert all(
+            trained_model.predict(inference_data, params={"runner": "ThreadRunner"})
+            == pd.DataFrame([0, 1, 1]).values  # no params: threshold defaults to 0.5
+        )