feat: Add evaluate function for classifiers #195

Merged: 37 commits into main from add-evaluate, Feb 17, 2025

Conversation

Pringled (Member) commented:

This PR adds an evaluate function that can be used after fitting a classifier. It returns an sklearn classification report. It works for both multi- and single-label inputs, and for both str and int labels. This makes it a bit easier to evaluate a model by abstracting away the logic for binarizing labels in the multilabel case, and in general it provides a natural flow, e.g.:

from datasets import load_dataset
from model2vec.train import StaticModelForClassification

# Load a model
classifier = StaticModelForClassification.from_pretrained(model_name="minishlab/potion-base-32M")
# Load a dataset
ds = load_dataset("setfit/subj")
# Fit
classifier.fit(ds["train"]["text"], ds["train"]["label"], max_epochs=1)
# Evaluate
print(classifier.evaluate(ds["test"]["text"], ds["test"]["label"]))

The tests are also updated to include int type labels. While our typing doesn't officially support list[int], many of the datasets we benchmarked on have int labels, and not covering that in the tests is dangerous (I accidentally broke the int-label logic in my previous PR and our tests didn't catch it; now they will). In a follow-up we can think about updating our typing.
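
For illustration, an int-label test could look roughly like this (a hypothetical sketch, not the exact test added in this PR):

import pytest
from model2vec.train import StaticModelForClassification

@pytest.mark.parametrize(
    "labels",
    [
        ["pos", "neg", "pos", "neg"],  # str labels
        [1, 0, 1, 0],                  # int labels, the case that previously went untested
    ],
)
def test_fit_and_evaluate_label_types(labels):
    # Hypothetical test: loads a pretrained model for brevity; the real tests use fixtures.
    classifier = StaticModelForClassification.from_pretrained(model_name="minishlab/potion-base-32M")
    texts = ["good movie", "bad movie", "great film", "awful film"] * 8
    classifier.fit(texts, labels * 8, max_epochs=1)
    report = classifier.evaluate(texts, labels * 8)
    assert report  # a non-empty sklearn classification report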

Pringled requested a review from stephantul on February 16, 2025 at 15:06.

codecov bot commented Feb 16, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Files with missing lines Coverage Δ
model2vec/inference/__init__.py 100.00% <100.00%> (ø)
model2vec/inference/model.py 92.38% <100.00%> (+1.79%) ⬆️
model2vec/train/classifier.py 97.38% <100.00%> (+0.07%) ⬆️
tests/conftest.py 100.00% <100.00%> (ø)
tests/test_inference.py 100.00% <100.00%> (ø)
tests/test_trainable.py 100.00% <100.00%> (ø)

stephantul (Member) left a comment:

I'm a bit on the fence about this one. If you follow my reasoning, you just end up with a thin wrapper around MultiLabelBinarizer.

@@ -227,6 +228,51 @@ def fit(
self.eval()
return self

def evaluate(
stephantul (Member) commented:

I feel like this function is hitting the wrong abstraction level. Here are some observations:

  • The function doesn't need to know what the classifier is, because multi-output labels look different, so you can leave out the check for self.multilabel.
  • There's no need to encode labels for non-multilabel output; you can just pass the un-encoded labels.
  • This function is not available to converted pipelines, but it is equally applicable there.

So I would refactor this into a function that takes a bunch of labels and, based on the type and shape of the output, returns a report. evaluate then simply calls that helper.

So something like this:

def evaluate(self, ...):
    predictions = self.predict(...)
    return evaluate_single_or_multilabel(y, predictions)

The evaluate_single_or_multilabel then simplifies to:

def evaluate_single_or_multilabel(y, pred):
    if _is_multi_label_shaped(y):
        # Binarization etc.
        return classification_report(y_binarized, pred_binarized)
    return classification_report(y, pred)

That way you can also test these functions without needing to have models, and can also reuse them in other contexts. The consequence of all of this, however, is that evaluate simplifies to:

evaluate_single_or_multilabel(ds["label"], model.predict(ds["text"]))

So maybe having evaluate is not even necessary any more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I refactored the code as per your suggestions:

  • In the inference model.py there are now evaluate_single_or_multilabel and _is_multi_label_shaped.
  • Both the inference model and the trainable classifier now have an evaluate function that calls evaluate_single_or_multilabel with the model predictions.
  • The single-label case no longer uses a LabelEncoder.

This way evaluate is available to both trained models and pipeline-converted models. I also updated the tests to reflect this.
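
Roughly, those helpers could be shaped like this (a minimal sketch assuming plain list inputs; the merged implementation may differ):

from sklearn.metrics import classification_report
from sklearn.preprocessing import MultiLabelBinarizer

def _is_multi_label_shaped(y) -> bool:
    # Multilabel targets arrive as a sequence of label collections,
    # e.g. [["a", "b"], ["a"]] rather than ["a", "b"].
    return len(y) > 0 and isinstance(y[0], (list, tuple, set))

def evaluate_single_or_multilabel(y, predictions):
    if _is_multi_label_shaped(y):
        # Binarize gold and predicted label sets with a shared class order.
        mlb = MultiLabelBinarizer()
        y_true = mlb.fit_transform(y)
        y_pred = mlb.transform(predictions)
        return classification_report(
            y_true, y_pred, target_names=[str(c) for c in mlb.classes_], zero_division=0
        )
    # Single-label case: pass the raw labels straight through.
    return classification_report(y, predictions, zero_division=0)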

As for your other comment: yes, this is essentially a thin wrapper around MultiLabelBinarizer and classification_report. However, I think it's worth it. Consider the following example:

from sklearn.metrics import classification_report
from sklearn.preprocessing import MultiLabelBinarizer

predictions = classifier.predict(X)

mlb = MultiLabelBinarizer(classes=classifier.classes)
y_true = mlb.fit_transform(y)
y_pred = mlb.transform(predictions)

print(classification_report(y_true, y_pred, target_names=classifier.classes, zero_division=0))

Vs:

print(classifier.evaluate(X, y))

This is much easier to run and understand in my opinion, and it fits in with the rest of our training code, which wraps torch/lightning. While the function doesn't add much for the single-label case, it still provides a unified interface, and even there it gives a slightly nicer way to evaluate IMO:

from sklearn.metrics import classification_report

predictions = classifier.predict(X)

print(classification_report(y, predictions, target_names=classifier.classes, zero_division=0))

Vs:

print(classifier.evaluate(X, y))

Pringled requested a review from stephantul on February 17, 2025 at 08:35.
stephantul (Member) left a comment:

Optional add-on: you don't need to pass classes to the evaluate function; you can derive the classes from the predicted or gold labels.
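
For example, something along these lines could derive them (hypothetical sketch, not code from this PR):

from itertools import chain

def _derive_classes(y_true, y_pred, multilabel: bool):
    # Union of all labels seen on either the gold or the predicted side.
    if multilabel:
        return sorted(set(chain.from_iterable(y_true)) | set(chain.from_iterable(y_pred)))
    return sorted(set(y_true) | set(y_pred))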

Pringled merged commit 722f939 into main on Feb 17, 2025. 6 checks passed.
Pringled deleted the add-evaluate branch on February 17, 2025 at 10:49.