-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Generalize predict processes for ML models #396
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
@@ -0,0 +1,45 @@ | ||||||
{ | ||||||
"id": "predict_ml_model_probabilities", | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Like above, this process name looks a bit weird to me. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thinking about the recent discussion (prefix is based on the primary input), it should probably be |
||||||
"summary": "Predict class probabilities using a ML model", | ||||||
"description": "Applies a machine learning model to an array and predicts (class) probabilities for them.", | ||||||
m-mohr marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
"categories": [ | ||||||
"machine learning", | ||||||
"reducer" | ||||||
], | ||||||
"experimental": true, | ||||||
"parameters": [ | ||||||
{ | ||||||
"name": "data", | ||||||
"description": "An array of numbers.", | ||||||
m-mohr marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
"schema": { | ||||||
"type": "array", | ||||||
"items": { | ||||||
"type": [ | ||||||
"number", | ||||||
"null" | ||||||
] | ||||||
} | ||||||
} | ||||||
}, | ||||||
{ | ||||||
"name": "model", | ||||||
"description": "A ML model that can be trained with one of the ML processes such as ``fit_regr_random_forest()``.", | ||||||
m-mohr marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
"schema": { | ||||||
"type": "object", | ||||||
"subtype": "ml-model" | ||||||
} | ||||||
} | ||||||
], | ||||||
"returns": { | ||||||
"description": "The predicted (class) probabilities. Returns `null` if any of the given values in the array is a no-data value.", | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
I'm not sure about the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is actually tricky - some thoughts:
I think as a user my preference would be for this process to assume that the model has already been constructed to handle nans correctly in both training and inference and therefore doesn't try to interfere. Hope this is useful! There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If I understand correctly, you agree that it would be better to drop
from the general description of There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Yeah, exactly! |
||||||
"schema": { | ||||||
"type": "array", | ||||||
"items": { | ||||||
"type": [ | ||||||
"number", | ||||||
"null" | ||||||
] | ||||||
} | ||||||
} | ||||||
} | ||||||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
predict_ml_model
looks a bit weird to me, it reads like you will be predicting the model (which is a confusing statement), instead of letting the model predict classes.Something like
predict_class
,ml_model_predict
or evenml_predict
would feel better.Note that we for the array, text, date related processes, we also use a prefix based naming (
array_append
,array_apply
,date_shift
) instead of a postfix based naming.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My aim was to align with
predict_curve
so that in docs they would be listed side by side. Unfortunately, we are not very consistent with prefixes/suffixes (array_apply / date_shift / load_collection / save_result / reduce_dimension)...Is there any other case where predict_class / predict_probabilities could become useful without ML? That was the main reason I added ml_model in the first place...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The thing with the name
predict_class
is that that process would not only work for ML classification, but also ML regression, so ideally, the termclass
should be avoided. Unless there is a good reason to have separate inference/predict processes for ML classification and ML regression, but as far as I know that's not the case.To me, "predict" implies "machine learning", so having both "ml" and "predict" in the name is slightly redundant.
However, it might be better for discoverability and self-documenting reasons to keep that bit of redundancy.