
AzureMLPipelineDataSet not compatible with pipeline_ml_factory method from kedro-mlflow #53

Open
jpoullet2000 opened this issue Apr 26, 2023 · 9 comments
Labels
bug Something isn't working

Comments

@jpoullet2000 commented Apr 26, 2023

The pipeline_ml_factory method in kedro-mlflow is a useful way to store artifacts (transformers, models) automatically (via a kedro-mlflow hook). However, this method calls extract_pipeline_artifacts, which requires the _filepath attribute to be available (see here).
The AzureMLPipelineDataSet class does not provide this attribute.
Would it be possible to add it to the class attributes?
Do you have any other suggestion for storing the MLflow pipeline?
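
For context, the extraction essentially resolves each artifact dataset's _filepath into a local path for mlflow, roughly like this (a paraphrase, not the actual kedro-mlflow code; extract_artifacts_sketch is a made-up name):

    from pathlib import Path

    def extract_artifacts_sketch(datasets: dict) -> dict:
        # Map artifact name -> resolved local path, the shape that
        # mlflow.pyfunc.log_model expects for its `artifacts` argument.
        # This raises AttributeError for any dataset without `_filepath`.
        return {
            name: str(Path(ds._filepath).resolve())
            for name, ds in datasets.items()
        }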

@marrrcin (Contributor)

If adding _filepath helps, then no problem. We're open to PRs :)
@Galileo-Galilei any other suggestions?

@marrrcin (Contributor) commented Apr 28, 2023

Added _filepath in:

    def _filepath(self) -> str:
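
The idea is roughly the following (a sketch with simplified names, not the exact plugin code):

    from pathlib import Path

    class PipelineDataSetSketch:
        """Illustrative stand-in for AzureMLPipelineDataSet."""

        def __init__(self, path: str):
            self.path = path

        @property
        def _filepath(self) -> str:
            # Expose the underlying path so that plugins such as kedro-mlflow
            # (extract_pipeline_artifacts) can locate the serialized artifact.
            return str(Path(self.path))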

It's already released in 0.4.0. @jpoullet2000 please let me know if it fixes the problem.

FYI @tomasvanpottelbergh

@marrrcin marrrcin added the bug Something isn't working label Apr 28, 2023
@jpoullet2000 (Author)

Currently out of office; I'll get back to you in two weeks.

@Galileo-Galilei

Sorry for the late reply, I was on holiday too. Just to understand: what is this dataset intended to do?

Actually, kedro-mlflow should only check the filepath for the datasets it needs to use as mlflow artifacts. So either this is a bug (kedro-mlflow checks the filepath on a dataset it should not), or this solution won't work (kedro-mlflow won't complain, but if there is no data at the given filepath, it will not be able to log it in mlflow nor to fetch it at inference time). What does your pipeline look like? What are you trying to do?

@jpoullet2000 (Author)

Hi, sorry for the late reply. The goal is to store an MLflow pipeline while running an Azure ML pipeline that wraps a kedro pipeline; I'd like to use the pipeline_ml_factory method for that. The issue seems to come from the fact that kedro-azureml decomposes the kedro nodes into Azure ML nodes, so the transformers and models (pickle files) are not shared between the training pipeline and the inference pipeline. That's why I wanted to use AzureMLPipelineDataSet, which should pass the data from one node to the other. But I'm still not convinced that it solves the issue (still testing).

@jpoullet2000 (Author)

As an illustration, here is a simple pipeline, viewed with kedro-viz:

[kedro-viz rendering of the pipeline]

When I try to run the etl_ml_pipeline pipeline corresponding to this code:

    # imports assumed from context (not shown in the original snippet)
    from platform import python_version
    from kedro_mlflow.pipeline import pipeline_ml_factory

    etl_ml_pipeline = create_etl_ml_pipeline()
    inference_pipeline_etl_ml = etl_ml_pipeline.only_nodes_with_tags("inference")
    training_pipeline_etl_ml = pipeline_ml_factory(
        training=etl_ml_pipeline.only_nodes_with_tags("training"),
        inference=inference_pipeline_etl_ml,
        input_name="X_test",
        log_model_kwargs=dict(
            artifact_path="poc_kedro_azureml_mlflow",
            # conda_env="src/requirements.txt",
            conda_env={
                "python": python_version(),
                "build_dependencies": ["pip"],
                "dependencies": [
                    f"poc_kedro_azureml_mlflow=={PROJECT_VERSION}",
                    {"pip": dependencies},
                ],
            },
            signature="auto",
        ),
    )

I get the following error:

    KedroMlflowPipelineMLError: The following inputs are free for the inference
    pipeline:
        - scaler
        - rf_model
    No free input is allowed. Please make sure that 'inference.inputs()' are all in
    'training.all_outputs() + training.inputs()' except 'input_name' and parameters
    which starts with 'params:'
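
In other words, the validation that fails here is roughly the following (a paraphrase, not the actual kedro-mlflow source; check_no_free_inputs is a made-up name):

    def check_no_free_inputs(training, inference, input_name: str) -> None:
        # Every inference input must be produced or consumed by the training
        # pipeline, except the model input itself and parameters.
        free_inputs = (
            set(inference.inputs())
            - set(training.all_outputs())
            - set(training.inputs())
            - {input_name}
        )
        free_inputs = {n for n in free_inputs if not n.startswith("params:")}
        if free_inputs:
            raise ValueError(f"No free input is allowed: {sorted(free_inputs)}")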


The pipeline code is:

from kedro.pipeline import Pipeline, node
from poc_kedro_azureml_mlflow.pipelines.etl_ml_app.nodes import (
    split_data,
    scale_data_fit,
    scale_data_transform,
    train_rf_model,
    predict,
)


def create_pipeline(**kwargs) -> Pipeline:
    training_pipeline = Pipeline(
        [
            node(
                split_data,
                ["iris_data", "parameters"],
                outputs=["X_train", "X_test", "y_train", "y_test"],
                tags=["training", "etl_app"],
                name="split_data",
            ),
            node(
                scale_data_fit,
                "X_train",
                outputs=["X_train_scaled", "scaler"],
                tags=["training"],
                name="scale_data_fit",
            ),
            node(
                train_rf_model,
                ["X_train_scaled", "y_train"],
                "rf_model",
                tags="training",
                name="training_rf_model",
            ),
        ],
    )
    inference_pipeline = Pipeline(
        [
            node(
                scale_data_transform,
                ["X_test", "scaler"],
                outputs="X_test_scaled",
                tags=["inference"],
                name="scale_data_transform",
            ),
            node(
                predict,
                ["X_test_scaled", "rf_model"],
                "rf_predictions",
                tags="inference",
                name="predict_rf_model",
            ),
        ]
    )

    return training_pipeline + inference_pipeline

@marrrcin (Contributor)

This is specific to kedro-mlflow; any hints, @Galileo-Galilei?

> It seems that the issue comes from the fact that kedro-azureml decomposes the kedro nodes in azureml nodes and the transformers and models (pickle files) are not shared between the training pipeline and the inference pipeline.

As for that: we indeed split the kedro nodes into Azure ML nodes, but I don't understand the "are not shared between training and inference" part. Data is shared via Kedro's Data Catalog, so when any node needs to load something, it goes to the Data Catalog. While running on Azure ML, if the entry is missing from the catalog, our plugin automatically loads the data from the temporary storage set in azureml.yml (the temporary_storage section).

If you've opted into the preview feature pipeline_data_passing, the data will be passed via Azure ML-mounted files.
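
For reference, the relevant azureml.yml sections look roughly like this (a sketch; the storage values are placeholders and the key names should be double-checked against the plugin docs for your version):

    azure:
      # default: intermediate data is exchanged via temporary blob storage
      temporary_storage:
        account_name: mystorageaccount  # placeholder
        container: kedro-temp           # placeholder
      # opt-in preview feature: pass data between nodes as Azure ML-mounted files
      pipeline_data_passing:
        enabled: true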

Maybe it's a problem in kedro-mlflow, in that it cannot recognize that the data is passed implicitly. Have you tried explicitly defining your inputs/outputs (e.g. scaler, rf_model, etc.) in the catalog? If they are defined, there should be no issue with loading them from any node.
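
For example, something along these lines in catalog.yml (a sketch; dataset types and filepaths are illustrative, and on Azure ML the paths would typically point at shared blob storage):

    scaler:
      type: pickle.PickleDataSet
      filepath: data/06_models/scaler.pkl

    rf_model:
      type: pickle.PickleDataSet
      filepath: data/06_models/rf_model.pkl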

@Galileo-Galilei commented May 17, 2023

Hum, I'll have a deep dive into the code in the coming days, but I already have some comments:

  1. Your kedro-viz graph should likely not work "as is" with plain kedro-mlflow, so I am a bit confused: your scaler and RF model are (as the error message says) NOT inputs of your predict_with_mlflow function, while they should be (how can you predict without the model?).
  2. The pipeline_ml_factory "magic" comes from a kedro hook which retrieves the artifacts at the end of the pipeline; if the kedro nodes are converted to Azure ML nodes, you lose the benefits of kedro hooks, so there is no reason it would work out of the box.

@jpoullet2000 (Author)

The Pipeline Inference Model contains both the scaler and the RF model and is generated by pipeline_ml_factory.
I also have the feeling that kedro hooks at the pipeline level are not usable with kedro-azureml. @marrrcin, can you confirm?
