
AzureMLPipelineDataSet not compatible with pipeline_ml_factory method from kedro-mlflow #53

Open
jpoullet2000 opened this issue Apr 26, 2023 · 9 comments
Labels
bug Something isn't working

Comments

@jpoullet2000 commented Apr 26, 2023

The pipeline_ml_factory method in kedro-mlflow is a useful way to store artifacts (transformers, models) automatically (via a kedro-mlflow hook). However, this method calls extract_pipeline_artifacts, which requires the _filepath attribute to be available (see here).
The AzureMLPipelineDataSet class does not provide this attribute.
Would it be possible to add it to the class attributes?
Do you have any other suggestion for storing the MLflow pipeline?
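
For context, the extraction essentially resolves each artifact dataset's _filepath into a local path for mlflow, roughly like this (a paraphrase, not the actual kedro-mlflow code; extract_artifacts_sketch is a made-up name):

    from pathlib import Path

    def extract_artifacts_sketch(datasets: dict) -> dict:
        # Map artifact name -> resolved local path, the shape that
        # mlflow.pyfunc.log_model expects for its `artifacts` argument.
        # This raises AttributeError for any dataset without `_filepath`.
        return {
            name: str(Path(ds._filepath).resolve())
            for name, ds in datasets.items()
        }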

@marrrcin (Contributor)

If adding _filepath helps, then no problem. We're open to PRs :)
@Galileo-Galilei any other suggestions?

@marrrcin (Contributor) commented Apr 28, 2023

Added _filepath in:

    def _filepath(self) -> str:
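
The idea is roughly the following (a sketch with simplified names, not the exact plugin code):

    from pathlib import Path

    class PipelineDataSetSketch:
        """Illustrative stand-in for AzureMLPipelineDataSet."""

        def __init__(self, path: str):
            self.path = path

        @property
        def _filepath(self) -> str:
            # Expose the underlying path so that plugins such as kedro-mlflow
            # (extract_pipeline_artifacts) can locate the serialized artifact.
            return str(Path(self.path))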

It's already released in 0.4.0. @jpoullet2000 please let me know if it fixes the problem.

FYI @tomasvanpottelbergh

@marrrcin marrrcin added the bug Something isn't working label Apr 28, 2023
@jpoullet2000 (Author)

Currently out of office; I'll get back to you in two weeks.

@Galileo-Galilei

Sorry for the late reply, I was on holiday too. Just to understand: what is this dataset intended to do?

Actually, kedro-mlflow should only check the filepath for the datasets it needs to use as mlflow artifacts. So either this is a bug (kedro-mlflow checks the filepath on a dataset it should not), or this solution won't work (kedro-mlflow won't complain, but if there is no data at the given filepath, it will not be able to log it in mlflow nor to fetch it at inference time). What does your pipeline look like? What are you trying to do?

@jpoullet2000 (Author)

Hi, sorry for the late reply. The goal is to store an MLflow pipeline while running an Azure ML pipeline that wraps a kedro pipeline; I'd like to use the pipeline_ml_factory method for that. The issue seems to come from the fact that kedro-azureml decomposes the kedro nodes into Azure ML nodes, so the transformers and models (pickle files) are not shared between the training pipeline and the inference pipeline. That's why I wanted to use AzureMLPipelineDataSet, which should pass the data from one node to the other. But I'm still not convinced that it solves the issue (still testing).

@jpoullet2000 (Author)

As an illustration, here is a simple pipeline, viewed with kedro-viz:

[kedro-viz rendering of the pipeline]

When I try to run the etl_ml_pipeline pipeline corresponding to this code:

    # imports assumed from context (not shown in the original snippet)
    from platform import python_version
    from kedro_mlflow.pipeline import pipeline_ml_factory

    etl_ml_pipeline = create_etl_ml_pipeline()
    inference_pipeline_etl_ml = etl_ml_pipeline.only_nodes_with_tags("inference")
    training_pipeline_etl_ml = pipeline_ml_factory(
        training=etl_ml_pipeline.only_nodes_with_tags("training"),
        inference=inference_pipeline_etl_ml,
        input_name="X_test",
        log_model_kwargs=dict(
            artifact_path="poc_kedro_azureml_mlflow",
            # conda_env="src/requirements.txt",
            conda_env={
                "python": python_version(),
                "build_dependencies": ["pip"],
                "dependencies": [
                    f"poc_kedro_azureml_mlflow=={PROJECT_VERSION}",
                    {"pip": dependencies},
                ],
            },
            signature="auto",
        ),
    )

I get the following error:

    KedroMlflowPipelineMLError: The following inputs are free for the inference
    pipeline:
        - scaler
        - rf_model
    No free input is allowed. Please make sure that 'inference.inputs()' are all in
    'training.all_outputs() + training.inputs()' except 'input_name' and parameters
    which starts with 'params:'
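
In other words, the validation that fails here is roughly the following (a paraphrase, not the actual kedro-mlflow source; check_no_free_inputs is a made-up name):

    def check_no_free_inputs(training, inference, input_name: str) -> None:
        # Every inference input must be produced or consumed by the training
        # pipeline, except the model input itself and parameters.
        free_inputs = (
            set(inference.inputs())
            - set(training.all_outputs())
            - set(training.inputs())
            - {input_name}
        )
        free_inputs = {n for n in free_inputs if not n.startswith("params:")}
        if free_inputs:
            raise ValueError(f"No free input is allowed: {sorted(free_inputs)}")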


The pipeline code is:

from kedro.pipeline import Pipeline, node
from poc_kedro_azureml_mlflow.pipelines.etl_ml_app.nodes import (
    split_data,
    scale_data_fit,
    scale_data_transform,
    train_rf_model,
    predict,
)


def create_pipeline(**kwargs) -> Pipeline:
    training_pipeline = Pipeline(
        [
            node(
                split_data,
                ["iris_data", "parameters"],
                outputs=["X_train", "X_test", "y_train", "y_test"],
                tags=["training", "etl_app"],
                name="split_data",
            ),
            node(
                scale_data_fit,
                "X_train",
                outputs=["X_train_scaled", "scaler"],
                tags=["training"],
                name="scale_data_fit",
            ),
            node(
                train_rf_model,
                ["X_train_scaled", "y_train"],
                "rf_model",
                tags="training",
                name="training_rf_model",
            ),
        ],
    )
    inference_pipeline = Pipeline(
        [
            node(
                scale_data_transform,
                ["X_test", "scaler"],
                outputs="X_test_scaled",
                tags=["inference"],
                name="scale_data_transform",
            ),
            node(
                predict,
                ["X_test_scaled", "rf_model"],
                "rf_predictions",
                tags="inference",
                name="predict_rf_model",
            ),
        ]
    )

    return training_pipeline + inference_pipeline

@marrrcin (Contributor)

This is specific to kedro-mlflow; any hints, @Galileo-Galilei?

> It seems that the issue comes from the fact that kedro-azureml decomposes the kedro nodes in azureml nodes and the transformers and models (pickle files) are not shared between the training pipeline and the inference pipeline.

As for that: we indeed split the kedro nodes into Azure ML nodes, but I don't understand the "are not shared between training and inference" part. Data is shared via Kedro's Data Catalog, so when any node needs to load something, it goes to the Data Catalog. While running on Azure ML, if the entry is missing from the catalog, our plugin automatically loads the data from the temporary storage set in azureml.yml (the temporary_storage section).

If you've opted into the preview feature pipeline_data_passing, the data will be passed via Azure ML-mounted files.
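
For reference, the relevant azureml.yml sections look roughly like this (a sketch; the storage values are placeholders and the key names should be double-checked against the plugin docs for your version):

    azure:
      # default: intermediate data is exchanged via temporary blob storage
      temporary_storage:
        account_name: mystorageaccount  # placeholder
        container: kedro-temp           # placeholder
      # opt-in preview feature: pass data between nodes as Azure ML-mounted files
      pipeline_data_passing:
        enabled: true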

Maybe it's a problem in kedro-mlflow, in that it cannot recognize that the data is passed implicitly. Have you tried explicitly defining your inputs/outputs (e.g. scaler, rf_model, etc.) in the catalog? If they are defined, there should be no issue with loading them from any node.
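
For example, something along these lines in catalog.yml (a sketch; dataset types and filepaths are illustrative, and on Azure ML the paths would typically point at shared blob storage):

    scaler:
      type: pickle.PickleDataSet
      filepath: data/06_models/scaler.pkl

    rf_model:
      type: pickle.PickleDataSet
      filepath: data/06_models/rf_model.pkl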

@Galileo-Galilei commented May 17, 2023

Hum, I'll have a deep dive into the code in the coming days, but I already have some comments:

  1. Your kedro-viz graph should likely not work "as is" with plain kedro-mlflow, so I am a bit confused: your scaler and RF model are (as the error message says) NOT inputs of your predict_with_mlflow function, while they should be (how can you predict without the model?).
  2. The pipeline_ml_factory "magic" comes from a kedro hook which retrieves the artifacts at the end of the pipeline; if the kedro nodes are converted to Azure ML nodes, you lose the benefits of kedro hooks, so there is no reason it would work out of the box.

@jpoullet2000 (Author)

The Pipeline Inference Model contains both the scaler and the RF model and is generated by pipeline_ml_factory.
I also have the feeling that kedro hooks at the pipeline level are not usable with kedro-azureml. @marrrcin, can you confirm?
