@konstantinos-p konstantinos-p commented Sep 22, 2025

What does this PR do?

Fixes #2196

Implement the Brier score and its decomposition

  • The ECE and similar calibration metrics, such as the TACE, are not proper scoring rules. In particular, a model with no predictive accuracy can still minimize the ECE, for example by always predicting the class base rates. This pathological behaviour can be fixed by using a proper scoring rule such as the Brier score.
  • The Brier score can be decomposed into Uncertainty, Reliability and Resolution. Reliability measures the model's calibration, while Resolution correlates with model accuracy. Uncertainty is the inherent difficulty of the problem.
  • The Brier score is used in many papers that evaluate the calibration of classification models, e.g. https://arxiv.org/abs/2302.04019 and https://arxiv.org/abs/2002.06470
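
To make the first bullet concrete, here is a minimal toy sketch (not the code in this PR). It assumes binary labels, a forecast of the positive-class probability, and a simple equal-width binned ECE; the function names `brier_score` and `expected_calibration_error` are hypothetical helpers, not torchmetrics API.

```python
# Illustrative sketch only: binary labels, forecasts are P(y = 1).
# Function names are hypothetical and are not part of torchmetrics.


def brier_score(probs, labels):
    """Mean squared difference between forecasts and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


def expected_calibration_error(probs, labels, n_bins=10):
    """Equal-width binned ECE on the positive-class probability (one ECE variant)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    total = len(labels)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        mean_y = sum(y for _, y in members) / len(members)
        ece += len(members) / total * abs(mean_p - mean_y)
    return ece


labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]  # base rate 0.3
constant = [0.3] * 10  # ignores the input, always predicts the base rate
informative = [0.9, 0.9, 0.9, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]

ece_constant = expected_calibration_error(constant, labels)
ece_informative = expected_calibration_error(informative, labels)
brier_constant = brier_score(constant, labels)
brier_informative = brier_score(informative, labels)
```

On this toy data the ECE ranks the constant predictor ahead of the informative one, while the Brier score ranks them the other way round.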

I followed the original paper describing the decomposition of the Brier score into resolution, reliability and uncertainty

https://journals.ametsoc.org/view/journals/apme/12/4/1520-0450_1973_012_0595_anvpot_2_0_co_2.xml

and specifically the implementation found in

https://github.com/google-research/google-research/blob/master/uq_benchmark_2019/metrics_lib.py
and the paper
https://arxiv.org/abs/1906.02530

State of the PR
I added some rudimentary "tests" and the code seems to work for the Binary and Multiclass settings.

@SkafteNicki

Did you have fun?

Make sure you had fun coding 🙃


📚 Documentation preview 📚: https://torchmetrics--3270.org.readthedocs.build/en/3270/

Konstantinos Pitas added 5 commits September 22, 2025 16:06
The mean Brier score and its decomposition should satisfy Brier = Uncertainty - Resolution + Reliability. After inspecting the results on a toy test case, I found that the Uncertainty was estimated with the wrong (negative) sign. The Brier score formula was also slightly wrong: it used (f-1)^2 = f^2 - 2f instead of (f-1)^2 = f^2 - 2f + 1, which made the score negative as well. With these fixes the decomposition equation is satisfied.
There were some more mistakes in estimating the Uncertainty. Also, the confusion matrix had to be transposed so that the true labels are on the x axis.
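
A minimal sketch of the kind of toy check described in these commits, under stated assumptions: it is not the PR's implementation, it covers only the binary case with a single positive-class probability (so Uncertainty is `base_rate * (1 - base_rate)`), and it groups samples by exact forecast value, which makes the identity hold exactly. The name `brier_decomposition` is hypothetical here.

```python
from collections import defaultdict

# Illustrative sketch only: binary Murphy-style decomposition, grouping by
# forecast value. Assumes forecasts take a small set of distinct values.


def brier_decomposition(probs, labels):
    """Return (brier, uncertainty, resolution, reliability)."""
    n = len(labels)
    base_rate = sum(labels) / n

    # Collect the observed outcomes for each distinct forecast value.
    groups = defaultdict(list)
    for p, y in zip(probs, labels):
        groups[p].append(y)

    brier = sum((p - y) ** 2 for p, y in zip(probs, labels)) / n
    uncertainty = base_rate * (1 - base_rate)  # non-negative by construction

    reliability = 0.0
    resolution = 0.0
    for p, outcomes in groups.items():
        freq = sum(outcomes) / len(outcomes)  # observed frequency in this group
        weight = len(outcomes) / n
        reliability += weight * (p - freq) ** 2
        resolution += weight * (freq - base_rate) ** 2
    return brier, uncertainty, resolution, reliability


probs = [0.2, 0.2, 0.2, 0.2, 0.8, 0.8, 0.8, 0.8]
labels = [0, 0, 0, 1, 1, 1, 1, 0]
brier, unc, res, rel = brier_decomposition(probs, labels)

# The identity from the commit message: Brier = Uncertainty - Resolution + Reliability
identity_gap = abs(brier - (unc - res + rel))
```

When forecasts vary within a group (for example, when grouping by predicted class or by probability bin), an extra within-group term can appear, so a check like this may then hold only approximately.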


class BinaryBrier(Metric):
r"""Compute the `confusion matrix`_ for binary tasks.
Collaborator

sounds strange...

Author

To be clear this is still early WIP. I could turn it into a draft(?)

@Borda Borda marked this pull request as draft September 22, 2025 17:17
Author

konstantinos-p commented Oct 5, 2025

I created #2196 a while back and decided to work on it, since people have repeatedly commented on it and asked for it. I've also updated the PR description to better explain the motivation for the Brier score over the ECE variants.

@Borda @SkafteNicki @justusschock let me know if there is interest in this metric. If so, I will add tests, polish the existing code, etc. I've seen that the tests for other metrics compare against reference implementations, e.g. scikit-learn. After a brief search I didn't find a reference implementation of the Brier decomposition in other libraries, so I will need some guidance on how to proceed with the tests.

Development

Successfully merging this pull request may close these issues.

Implement the Brier score and it's decomposition into resolution, reliability and uncertainty.