Feature/brier score #3270

konstantinos-p · 2025-09-22T16:57:00Z

What does this PR do?

Implement the Brier score and its decomposition

The ECE and similar calibration metrics, such as the TACE, are not proper scoring rules. In particular, an untrained model with no predictive accuracy can minimize the ECE, for example by always predicting the class frequencies. This pathological behaviour can be fixed by using a proper scoring rule such as the Brier score.
The Brier score can be decomposed into Uncertainty, Reliability and Resolution. Reliability measures the model's calibration, while Resolution correlates with model accuracy. Uncertainty is the inherent difficulty of the problem.
The Brier score is used in many papers that evaluate the calibration of classification models, e.g. https://arxiv.org/abs/2302.04019 and https://arxiv.org/abs/2002.06470
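
To make the motivation concrete, here is a minimal pure-Python sketch (not the code in this PR) comparing a constant base-rate predictor with an informative one on toy data. The helper names `brier_score` and `expected_calibration_error` are hypothetical, and the ECE here is the simple binary variant with equal-width bins on the positive-class probability, which is an assumption made for illustration.

```python
def brier_score(probs, labels):
    """Mean squared difference between the predicted probability and the 0/1 label."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


def expected_calibration_error(probs, labels, n_bins=10):
    """Binary ECE sketch: weighted |mean predicted prob - observed positive rate| per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes into the last bin
        bins[idx].append((p, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_p = sum(p for p, _ in members) / len(members)
        pos_rate = sum(y for _, y in members) / len(members)
        ece += len(members) / len(labels) * abs(mean_p - pos_rate)
    return ece


# Toy data: 3 positives out of 10, so the base rate is 0.3.
labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
constant = [0.3] * 10                       # no predictive skill at all
informative = [0.9] * 3 + [0.1] * 7         # ranks every sample correctly

ece_constant = expected_calibration_error(constant, labels)
ece_informative = expected_calibration_error(informative, labels)
brier_constant = brier_score(constant, labels)
brier_informative = brier_score(informative, labels)
```

On this toy data the constant predictor gets an ECE of approximately 0 while the informative one gets about 0.1, so the ECE prefers the useless model. The Brier score ranks them the other way round: about 0.21 for the constant predictor versus about 0.01 for the informative one.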

I followed the original paper describing the decomposition of the Brier score into resolution, reliability and uncertainty

https://journals.ametsoc.org/view/journals/apme/12/4/1520-0450_1973_012_0595_anvpot_2_0_co_2.xml

and specifically the implementation found in

https://github.com/google-research/google-research/blob/master/uq_benchmark_2019/metrics_lib.py
and the paper
https://arxiv.org/abs/1906.02530

State of the PR
I added some rudimentary "tests" and the code seems to work for the Binary and Multiclass settings.

@SkafteNicki

Did you have fun?

Make sure you had fun coding 🙃

📚 Documentation preview 📚: https://torchmetrics--3270.org.readthedocs.build/en/3270/

The mean Brier score and its decomposition should satisfy the equation Brier = Uncertainty - Resolution + Reliability. After inspecting the results on a toy test case, I found that the Uncertainty was estimated with the wrong (negative) sign. The Brier score formula was also slightly wrong: it used (f-1)^2 = f^2 - 2f instead of (f-1)^2 = f^2 - 2f + 1, which made the score negative as well. With these fixes the decomposition equation is satisfied.
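
For reference, the identity can be written out for the binary case as follows. The notation is an assumption chosen for illustration, not taken from the PR code: N samples with forecast probability f_i and label y_i in {0, 1}; the forecasts take K distinct values f_k, each issued n_k times with observed positive frequency \bar{o}_k; \bar{o} is the overall base rate. The multiclass implementation in this PR may organise the terms differently.

```latex
\begin{aligned}
\mathrm{BS} &= \frac{1}{N}\sum_{i=1}^{N}(f_i - y_i)^2
  = \underbrace{\bar{o}\,(1-\bar{o})}_{\text{Uncertainty}}
  - \underbrace{\frac{1}{N}\sum_{k=1}^{K} n_k\,(\bar{o}_k-\bar{o})^2}_{\text{Resolution}}
  + \underbrace{\frac{1}{N}\sum_{k=1}^{K} n_k\,(f_k-\bar{o}_k)^2}_{\text{Reliability}}, \\
(f_i - y_i)^2 &= f_i^2 - 2 f_i + 1 \quad \text{if } y_i = 1,
\qquad (f_i - y_i)^2 = f_i^2 \quad \text{if } y_i = 0.
\end{aligned}
```

Dropping the constant in the y_i = 1 case leaves f_i^2 - 2 f_i = f_i (f_i - 2), which is at most 0 for f_i in [0, 1]. That is consistent with the negative score seen before the fix.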

There were some more mistakes in estimating the Uncertainty. Also, the confusion matrix had to be transposed so that the true labels are on the x axis.

Borda · 2025-09-22T17:03:26Z

src/torchmetrics/classification/brier.py

+
+
+class BinaryBrier(Metric):
+    r"""Compute the `confusion matrix`_ for binary tasks.


sounds strange...

To be clear, this is still early WIP. I could turn it into a draft(?)

konstantinos-p · 2025-10-05T14:15:35Z

I created #2196 a while back and decided to work on it a bit, since people have repeatedly commented and asked for it. I've also updated the description of the PR to better explain the motivation behind the Brier score versus the ECE variants.

@Borda @SkafteNicki @justusschock let me know if there is interest in this metric. I will then add tests, polish the existing code, etc. I've seen that for other metrics the tests compare against reference implementations, e.g. scikit-learn. After a brief search I didn't find any reference implementation of the Brier decomposition in other libraries, so I will need some guidance on how to proceed with these tests.
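
One possible direction, sketched here as an assumption rather than an agreed test plan, is to test invariants of the decomposition instead of comparing against a reference library. The helper `decompose` below is hypothetical pure Python for the binary case (one group per distinct forecast value) and is not the implementation in this PR; a real test would call the metric classes instead.

```python
import random
from collections import defaultdict


def decompose(probs, labels):
    """Hypothetical helper: binary Brier decomposition, grouping by distinct forecast value."""
    n = len(labels)
    base_rate = sum(labels) / n
    by_forecast = defaultdict(list)
    for p, y in zip(probs, labels):
        by_forecast[p].append(y)
    rel = res = 0.0
    for p, ys in by_forecast.items():
        freq = sum(ys) / len(ys)                  # observed positive rate for this forecast
        rel += len(ys) * (p - freq) ** 2          # calibration error of the group
        res += len(ys) * (freq - base_rate) ** 2  # how far the group is from the base rate
    return {
        "uncertainty": base_rate * (1 - base_rate),
        "resolution": res / n,
        "reliability": rel / n,
    }


def check_invariants(seed, n=200):
    """Check properties that should hold for any inputs, without a reference implementation."""
    rng = random.Random(seed)
    probs = [rng.choice([0.1, 0.3, 0.5, 0.7, 0.9]) for _ in range(n)]
    labels = [int(rng.random() < p) for p in probs]
    parts = decompose(probs, labels)
    direct = sum((p - y) ** 2 for p, y in zip(probs, labels)) / n
    recomposed = parts["uncertainty"] - parts["resolution"] + parts["reliability"]
    assert abs(direct - recomposed) < 1e-9                   # Brier = Unc - Res + Rel
    assert all(value >= 0 for value in parts.values())       # every component is non-negative
    assert parts["resolution"] <= parts["uncertainty"] + 1e-9  # resolution cannot exceed uncertainty
    return True


for seed in range(20):
    check_invariants(seed)
```

The total score could additionally be compared against an existing plain Brier score implementation, leaving only the split into components to be covered by invariants like these.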

Konstantinos Pitas added 5 commits September 22, 2025 16:06

BinaryBrier first version

06b51e2

Multiclass Brier first implementation

9a2f5e4

Some more fixes for the decomposition

0aee789

Change return type to dict

563883c

konstantinos-p requested review from SkafteNicki, Borda and justusschock as code owners September 22, 2025 16:57

github-actions bot added the topic: Classif label Sep 22, 2025

[pre-commit.ci] auto fixes from pre-commit.com hooks

5343a79

for more information, see https://pre-commit.ci

Borda reviewed Sep 22, 2025

Borda marked this pull request as draft September 22, 2025 17:17
