Large changes to classifications #1248
-
TorchMetrics v0.10 is now out, significantly changing the whole classification package. This blog post goes over the reasons why the classification package needed to be refactored, what it means for our end users, and finally, what benefits it brings. A guide on how to upgrade your code to the recent changes can be found near the bottom.

Why the classification metrics need to change

We have known for a long time that there were some underlying problems with how we initially structured the classification package. Essentially, classification tasks can be divided into binary, multiclass, or multilabel, and determining which task a user is trying to run a given metric on is hard based on the input alone. The reason a package such as sklearn can do this is that it only supports input in very specific formats (no multi-dimensional arrays and no support for both integer and probability/logit formats). This meant that some metrics, especially for binary tasks, could have been calculating something different than expected if the user provided input of a shape other than the expected one. This goes against a core value of TorchMetrics: our users should be able to trust that the metric they are evaluating returns the expected result. Additionally, the classification metrics lacked consistency in their arguments and defaults.

The solution

The solution we went with was to split every classification metric into three separate metrics, one for each of the binary, multiclass, and multilabel tasks, as sketched below.
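As a minimal sketch of what the split looks like in practice (assuming torchmetrics >= 0.10; the class names below are from that release), each task gets its own metric class and is never inferred from the input shape:

```python
import torch
from torchmetrics.classification import BinaryAccuracy, MulticlassAccuracy, MultilabelAccuracy

# Binary: preds are probabilities/logits for the positive class
binary_acc = BinaryAccuracy()
binary_acc(torch.rand(10), torch.randint(2, (10,)))

# Multiclass: the number of classes is always stated explicitly
multiclass_acc = MulticlassAccuracy(num_classes=5)
multiclass_acc(torch.randn(10, 5), torch.randint(5, (10,)))

# Multilabel: the number of labels is always stated explicitly
multilabel_acc = MultilabelAccuracy(num_labels=3)
multilabel_acc(torch.rand(10, 3), torch.randint(2, (10, 3)))
```

The functional versions follow the same naming pattern, e.g. binary_accuracy, multiclass_accuracy, and multilabel_accuracy.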
Standardized arguments

The input arguments for the classification package are now much more standardized. A few examples are sketched below.
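For illustration, here is a small sketch (assuming torchmetrics >= 0.10; the argument values are arbitrary) of how metrics belonging to the same task now accept the same keyword arguments:

```python
from torchmetrics.classification import MulticlassPrecision, MulticlassRecall, MulticlassF1Score

# The same keyword arguments are accepted across metrics of the same task
common_args = dict(num_classes=5, average="macro", ignore_index=-1, validate_args=True)

precision = MulticlassPrecision(**common_args)
recall = MulticlassRecall(**common_args)
f1 = MulticlassF1Score(**common_args)
```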
Constant-memory implementations

Some of the most useful metrics for evaluating classification problems are ROC, AUROC, AveragePrecision, etc., because they evaluate your model not at a single threshold but over a whole range of thresholds, essentially letting you see the trade-off between Type I and Type II errors. However, a big problem with the standard formulation of these metrics (which we had been using) is that they require access to all data for their calculation, so our implementations were extremely memory-intensive for these kinds of metrics. In v0.10 of TorchMetrics, all these metrics now have an argument called thresholds. By default it is None, which computes the exact, non-binned result at the cost of memory that grows with the amount of data. Setting thresholds to an integer (or an explicit list of threshold values) instead evaluates the metric on that fixed grid of thresholds, giving an approximate result with constant memory usage; a sketch follows this section. This also means that the dedicated Binned* versions of these metrics are superseded by this argument.

All metrics are faster (ish)

By splitting each metric into three separate metrics, we reduce the number of calculations needed. We therefore expected out of the box that our new implementations would be faster. Benchmarking different metrics with the old and new implementations (with and without input validation) shows that the new implementations are generally faster.
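The following sketch illustrates the thresholds argument discussed above (assuming torchmetrics >= 0.10; exact defaults may differ slightly between versions):

```python
import torch
from torchmetrics.classification import BinaryAUROC

preds = torch.rand(1_000)
target = torch.randint(2, (1_000,))

# thresholds=None (the default): exact, non-binned computation;
# all predictions and targets are kept in memory until compute() is called.
exact_auroc = BinaryAUROC(thresholds=None)(preds, target)

# thresholds=100: the metric is evaluated on 100 evenly spaced thresholds,
# so only a fixed-size state (a confusion matrix per threshold) is stored.
binned_auroc = BinaryAUROC(thresholds=100)(preds, target)
```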
[0.10.0] - 2022-10-04

Added
Changed
Fixed
Contributors

@Borda, @bryant1410, @geoffrey-g-delhomme, @justusschock, @lucadiliello, @nicolas-dufour, @Queuecumber, @SkafteNicki, @stancld

If we forgot someone due to not matching the commit email with a GitHub account, let us know :]

This discussion was created from the release Large changes to classifications.
-
Hi! Quick question: does multilabel mean multiple binary labels? Is it implemented as independent binary metrics? So binary implies a single binary metric? Thanks.
-
Thanks for this significant update! Is there a link to the upgrade guide?
-
Hello, may I ask why the default in MultiClassAccuracy is set to 'macro' instead of 'micro'? Sklearn uses 'micro' statistics too: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html
-
Precision values are still different from sklearn.
The problem is the input order.
In sklearn it should be
metric(truth, preds)
while in torchmetrics it should be
metric(preds, truth)
(we use this order because it is consistent with the loss functions in torch). Switching the order for the sklearn calculations in your example fixes it.
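For example, a short sketch of the difference (assuming torchmetrics >= 0.10 and scikit-learn; binary precision is used here purely for illustration):

```python
import torch
from sklearn.metrics import precision_score
from torchmetrics.functional.classification import binary_precision

preds = torch.randint(2, (100,))   # predicted labels
target = torch.randint(2, (100,))  # ground-truth labels

tm_value = binary_precision(preds, target)                 # torchmetrics: (preds, target)
sk_value = precision_score(target.numpy(), preds.numpy())  # sklearn: (y_true, y_pred)

# With the correct argument order in each library, the two values agree
assert torch.isclose(tm_value, torch.tensor(float(sk_value)))
```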