feat: Add evaluate function for classifiers #195
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅
I'm a bit on the fence about this one. If you follow my reasoning, you just end up with a thin wrapper around MultiLabelBinarizer.
@@ -227,6 +228,51 @@ def fit(
        self.eval()
        return self

    def evaluate(
I feel like this function is hitting the wrong abstraction level. Here are some observations:
- The function doesn't need to know what the classifier is, because multi_output labels look different, so you can leave out the check for self.multilabel.
- There's no need to encode labels for non-multilabel output; you can just pass the un-encoded labels.
- This function is not available to converted pipelines, but it is equally applicable there.
So I would refactor this into a function that takes a bunch of labels and then, based on the type and shape of the output, returns a report. That function is then called by evaluate.
So something like this:
def evaluate(self, ...):
    predictions = self.predict(...)
    return evaluate_single_or_multilabel(y, predictions)
The evaluate_single_or_multilabel function then simplifies to:
def evaluate_single_or_multilabel(y, pred):
    if _is_multi_label_shaped(y):
        # Binarization etc.
        return classification_report(y_binarized, pred_binarized)
    return classification_report(y, pred)
That way you can also test these functions without needing to have models, and can also reuse them in other contexts. The consequence of all of this, however, is that evaluate simplifies to:
evaluate_single_or_multilabel(ds["label"], model.predict(ds["text"]))
So maybe having evaluate is not even necessary anymore.
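For reference, here is a minimal sketch of what the two helpers could look like, assuming the binarization is done with MultiLabelBinarizer; the helper bodies below are just an illustration of the idea, not the actual implementation:

from sklearn.metrics import classification_report
from sklearn.preprocessing import MultiLabelBinarizer

def _is_multi_label_shaped(y) -> bool:
    # Multilabel targets are sequences of label collections, e.g. [["a", "b"], ["a"]].
    return len(y) > 0 and isinstance(y[0], (list, tuple, set))

def evaluate_single_or_multilabel(y, pred) -> str:
    if _is_multi_label_shaped(y):
        # Fit the binarizer on the gold labels so both arrays share the same label space.
        mlb = MultiLabelBinarizer()
        y_binarized = mlb.fit_transform(y)
        pred_binarized = mlb.transform(pred)
        return classification_report(
            y_binarized,
            pred_binarized,
            target_names=[str(c) for c in mlb.classes_],
            zero_division=0,
        )
    return classification_report(y, pred, zero_division=0)

Because the helpers only see labels, they can be tested directly, e.g. evaluate_single_or_multilabel([["a"], ["a", "b"]], [["a"], ["b"]]).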
I refactored the code as per your suggestions.
- In the inference model.py there are now evaluate_single_or_multilabel and _is_multi_label_shaped
- Both the inference and train model.py have an evaluate function that calls evaluate_single_or_multilabel with the model predictions
- The single-label case doesn't use a LabelEncoder anymore
This way evaluate is available to both trained models and pipeline-converted models. I also updated the tests to reflect this.
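For illustration, the evaluate method on both model classes then reduces to a thin delegation (a sketch based on the description above; the exact signature in model.py may differ):

def evaluate(self, X, y) -> str:
    # Predict once, then hand the gold and predicted labels to the shared helper.
    predictions = self.predict(X)
    return evaluate_single_or_multilabel(y, predictions)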
As for your other comment: yes, this is essentially a thin wrapper around MultiLabelBinarizer and classification_report. However, I think this is worth it. Consider the following example:
from sklearn.metrics import classification_report
from sklearn.preprocessing import MultiLabelBinarizer
predictions = classifier.predict(X)
mlb = MultiLabelBinarizer(classes=classifier.classes)
y_true = mlb.fit_transform(ds["test"]["labels"])
y_pred = mlb.transform(predictions)
print(classification_report(y_true, y_pred, target_names=classifier.classes, zero_division=0))
Vs:
print(classifier.evaluate(X, y))
This is much easier to run and understand in my opinion, and it fits in with the rest of our training code, which creates a wrapper around torch/lightning. While the function does not add much for the single-label case, it does provide a unified interface, and even in that case it gives a slightly nicer way to evaluate IMO:
from sklearn.metrics import classification_report
predictions = classifier.predict(X)
print(classification_report(y, predictions, target_names=classifier.classes, zero_division=0))
Vs:
print(classifier.evaluate(X, y))
Optional add-on: you don't need to pass classes to the evaluate function; you can derive the classes from the predicted or gold labels.
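For example (a sketch of that idea, not the final code), the class set can be collected from the gold and predicted labels themselves:

def _derive_classes(y, pred):
    # Union of every label seen in either the gold or the predicted labels.
    labels = set()
    for item in (*y, *pred):
        if isinstance(item, (list, tuple, set)):
            # Multilabel case: each item is a collection of labels.
            labels.update(item)
        else:
            # Single-label case: each item is a single label.
            labels.add(item)
    return sorted(labels, key=str)

MultiLabelBinarizer already does this implicitly when no classes argument is passed to it.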
This PR adds an evaluate function that can be used after fitting a classifier. It returns an sklearn classification report. It works for both multi- and single-label inputs, and for both str and int type labels. This makes it a bit easier to evaluate a model by abstracting away the logic for binarizing labels in the multilabel case, and in general provides a nice natural flow, e.g. print(classifier.evaluate(X, y)) as shown in the snippets above.

The tests are also updated to include int type labels. While our typing doesn't officially support list[int], many of the datasets we benchmarked on have int type labels, and not including that in the tests is dangerous (I accidentally broke the int label logic in my previous PR but our tests didn't catch that; now they will). In a followup we can think about potentially updating our typing.
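A sketch of the flow this enables, assuming a dataset split into train and test (the dataset name ds and the column names are placeholders):

# Hypothetical end-to-end usage; ds, "text" and "label" are placeholder names.
classifier.fit(ds["train"]["text"], ds["train"]["label"])
print(classifier.evaluate(ds["test"]["text"], ds["test"]["label"]))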