Specifying `run_id` in catalog rewrites history #566
Labels: enhancement (new feature or request), need-design-decision (several ways of implementation are possible and one must be chosen)
Hello!
First of all, thanks for open-sourcing your plugin, which is very well documented and a great addition to the MLOps ecosystem. We work for a medium-sized organization and we are building our MLOps toolset around kedro, mlflow and your plugin.
We are facing a problem and are unsure how to solve it. Maybe you have already thought about it and there is a clean solution within your plugin that we have missed; in the meantime we are trying to add some extra functionality ourselves. I want to describe the problem here because it looks like something that should be common, and maybe there is a better way to deal with it.
Typically we have the following pipelines:
Of course, the output of `preprocessing` is the input of `train`. They are different pipelines and in fact, in many of our projects, even the runtime environment is different (`preprocessing` uses spark and `train` uses pandas/numpy/xgboost and other python libraries that use in-memory computation).
But in many projects we have several versions of `preprocessing` (because we might have different ways of cleaning the data, we might discard or not some particular data source and so on). We connect the two pipelines using `run_id`.
So let's say that we are happy with a particular execution of `preprocessing`. Then, in our catalog, we will add the `run_id` to specify that our training dataset is the one generated by that particular run:
Now, the problem is that the class `kedro_mlflow.io.artifacts.MlflowArtifactDataset` overwrites the path with this specific `run_id` if you try to execute the `preprocessing` pipeline. And this is not what we want: we would like to have the possibility of running the `preprocessing` pipeline again (and saving the result in the new `run_id` generated by `mlflow`), but what happens is that if you run `preprocessing` then the result overwrites the output of the `run_id` specified in the catalog, therefore "altering the history". We would like that this `run_id` specified in the catalog only affects the version of the `df_train` that is read when executing the `train` pipeline, but not the one that is written when running `preprocessing`.
For a minimal example, let me add a dumb `preprocessing` and `train` pipeline:
`preprocessing` pipeline:
`train` pipeline:
As I said, if I try to execute the `train` pipeline, it will correctly take the `df_train` corresponding to the `run_id` that I specified in the `catalog`. However, if I execute another run of the `preprocessing` pipeline, for example with different parameters, instead of storing the result in a new `run_id`, it will overwrite the `run_id` specified in the catalog.
Sorry for the long message, but I tried to make the issue as clear as possible.
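To make the read/write asymmetry described above concrete, here is a toy sketch in plain Python. It does not use kedro, mlflow or the plugin at all: `ToyTracker`, `PinnedDataset`, `ReadPinnedDataset` and `demo` are hypothetical names made up for illustration, and the in-memory dictionary only stands in for the artifact store. `PinnedDataset` mimics the behaviour reported here (the pinned `run_id` is used for both reading and writing), while `ReadPinnedDataset` mimics the behaviour we would like (pinned `run_id` for reading, active run for writing).

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional, Tuple


@dataclass
class ToyTracker:
    """Hypothetical stand-in for a tracking server: artifacts keyed by (run_id, name)."""
    artifacts: Dict[Tuple[str, str], Any] = field(default_factory=dict)
    active_run_id: Optional[str] = None
    _counter: int = 0

    def start_run(self) -> str:
        # Each pipeline execution gets a fresh run id: run_1, run_2, ...
        self._counter += 1
        self.active_run_id = f"run_{self._counter}"
        return self.active_run_id


@dataclass
class PinnedDataset:
    """Reported behaviour (toy model): a pinned run_id is used for load AND save."""
    tracker: ToyTracker
    name: str
    run_id: Optional[str] = None  # plays the role of the run_id set in the catalog

    def _target(self) -> Optional[str]:
        # Fall back to the active run only when nothing is pinned.
        return self.run_id or self.tracker.active_run_id

    def save(self, data: Any) -> None:
        self.tracker.artifacts[(self._target(), self.name)] = data

    def load(self) -> Any:
        return self.tracker.artifacts[(self._target(), self.name)]


class ReadPinnedDataset(PinnedDataset):
    """Requested behaviour (toy model): pinned run_id for load, active run for save."""

    def save(self, data: Any) -> None:
        self.tracker.artifacts[(self.tracker.active_run_id, self.name)] = data


def demo(dataset_cls):
    tracker = ToyTracker()
    first = tracker.start_run()                  # first preprocessing run
    dataset_cls(tracker, "df_train").save("v1")  # nothing pinned yet
    pinned = dataset_cls(tracker, "df_train", run_id=first)  # pin the first run
    tracker.start_run()                          # second preprocessing run
    pinned.save("v2")                            # where does this write go?
    return dict(tracker.artifacts), pinned.load()
```

Calling `demo(PinnedDataset)` leaves a single artifact under the first run, now holding the second run's output, which is the "altered history" case. Calling `demo(ReadPinnedDataset)` keeps both runs' artifacts, and reading still returns the output of the pinned run. Whether this should be the default or an opt-in option of the dataset is exactly the design decision we are unsure about.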
Thanks in advance!