Average locations, then score the result #52
Comments
Thanks for writing. We acknowledge that while we aim to be objective in our benchmarking, we have made arbitrary decisions along the way which affect rank order. While we made the choices we present in that paper after a long process (years) of discussion among ourselves and other model experts, you certainly could choose differently. There is no one right way to do this. I am curious whether you can explain why you find your suggestion better? I try out all sorts of things from time to time and am open to suggestions.
The longer I do this, the more I think that 'score' is a poor term--it
implies that this is a game to be won or lost. We tend to think of these
scalars rather as a synthesis. We give you, for example, bias maps which
you could sort through. But this is a lot of information across all
datasets for all variables and all models. So we attempt to synthesize that
information for you in a 'useful' way as a single scalar, where 'useful' is
a goal we sometimes miss. It is our hope that differences in these scalars will alert the scientist to dig deeper and understand what has led to them.
Re: alpha. I left this part in the paper for completeness, but we have never actually used a value different from 1. Realize that using alpha=1 is also a choice, although one that we made implicitly.
Hope that gives you something to chew on, happy to chat more.
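For reference, here is a minimal sketch of the exponential scoring step being discussed (equations 13-15 of Collier et al. 2018), with alpha exposed as a parameter rather than fixed at 1. The function name, variable names, and the normalization by a reference scale are illustrative assumptions, not ILAMB's actual API:

```python
import numpy as np

def bias_score(model, reference, ref_scale, area_weights, alpha=1.0):
    """Sketch of the score-then-average procedure (illustrative, not ILAMB code)."""
    rel_error = np.abs(model - reference) / ref_scale      # relative bias error (eq. 13)
    local_score = np.exp(-alpha * rel_error)               # score each location (eq. 14)
    return np.average(local_score, weights=area_weights)   # area-weighted scalar score (eq. 15)
```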
Full disclosure, I wrote a paper on this topic: https://agupubs.onlinelibrary.wiley.com/doi/full/10.1029/2021MS002681

In that paper, we took a different tack and instead gave the 'overall' score based on a single objective metric. We chose MSE as a demonstration, though admittedly it falls short in many applications. Then we decomposed the MSE into things like bias and variance to show how these different 'concepts' contribute to the model's overall performance. In that way, the overall score is meaningful, in the sense that the model that best approximates reality will score highest. That "meaning" is retained if you average across locations then score. However, if you score then average, the relative rankings of models may change and the score loses some of its meaning.

Anyway, we respect all that your team has done. I'd agree the system is useful, but I thought this detail about alpha and rankings was something to be aware of.
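One standard way to make that decomposition concrete is the bias/variance/phase split of MSE (Murphy-style); the linked paper's exact decomposition may differ, so treat this as an illustrative sketch with made-up names rather than its implementation:

```python
import numpy as np

def mse_decomposition(model, obs):
    """Split MSE into mean, variability, and correlation error terms (illustrative)."""
    bias = model.mean() - obs.mean()
    sm, so = model.std(), obs.std()           # population standard deviations (ddof=0)
    r = np.corrcoef(model, obs)[0, 1]
    parts = {
        "bias^2":   bias**2,                  # error in the mean state
        "variance": (sm - so)**2,             # error in the amplitude of variability
        "phase":    2 * sm * so * (1 - r),    # error from imperfect correlation/timing
    }
    # The three terms sum exactly to the MSE.
    assert np.isclose(sum(parts.values()), np.mean((model - obs) ** 2))
    return parts
```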
Reading Collier et al 2018, it seems that the procedure for computing a score is as follows (using bias as an example):

1. calculate the relative bias error at a given location (equation 13)
2. score the relative error for that location (equation 14)
3. compute the scalar score as the average score across all locations (equation 15)

But I believe a better procedure is:

1. calculate the relative bias error
2. average across all locations
3. score the result

Or, when scoring a given location, just steps 1 and 3.

Have I misread the paper? When you use the first method, tweaking \alpha in the scoring function can alter how models rank relative to one another, which isn't ideal.
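A toy numerical illustration of that last point, with made-up relative errors (not ILAMB output) and the exponential scoring form s(e) = exp(-alpha * e) assumed from Collier et al. 2018. The average-then-score order always ranks the model with the smaller mean error first, while the score-then-average order can flip the ranking as alpha changes:

```python
import numpy as np

def score_then_average(errors, alpha=1.0):
    # Order in the paper (eqs. 13-15): score each location, then average the scores.
    return np.mean(np.exp(-alpha * errors))

def average_then_score(errors, alpha=1.0):
    # Suggested order: average the relative errors, then score the mean once.
    return np.exp(-alpha * errors.mean())

model_a = np.array([0.1, 1.9])  # uneven errors, mean 1.0
model_b = np.array([0.9, 0.9])  # even errors, mean 0.9

for alpha in (0.1, 1.0):
    s_a = score_then_average(model_a, alpha)
    s_b = score_then_average(model_b, alpha)
    winner = "A" if s_a > s_b else "B"
    print(f"alpha={alpha}: score-then-average prefers model {winner} "
          f"({s_a:.3f} vs {s_b:.3f})")
    # average-then-score is monotone in the mean error, so it prefers
    # model B (smaller mean error) for every positive alpha.
# Prints: alpha=0.1 prefers B, alpha=1.0 prefers A -- the ranking flips.
```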