Average locations, then score the result #52
Comments
Thanks for writing. We acknowledge that while we aim to be objective in our benchmarking, we have made arbitrary decisions along the way which affect rank order. While we made the choices we present in that paper after a long process (years) of discussion among ourselves and other model experts, you certainly could choose differently. There is no one right way to do this. I am curious whether you can explain why you find your suggestion better? I try out all sorts of things from time to time and am open to suggestions.
The longer I do this, the more I think that 'score' is a poor term--it
implies that this is a game to be won or lost. We tend to think of these
scalars rather as a synthesis. We give you, for example, bias maps which
you could sort through. But this is a lot of information across all
datasets for all variables and all models. So we attempt to synthesize that
information for you in a 'useful' way as a single scalar, where 'useful' is
a goal we sometimes miss. It is our hope that differences in these scalars will alert the scientist to dig deeper and understand what has led to them.
Re: alpha. I left this part in the paper for completeness, but we have never actually used a value different from 1. Realize that using alpha=1 is also a choice, although one that we made implicitly.
Hope that gives you something to chew on, happy to chat more.
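For reference, here is a minimal sketch of the exponential scoring step being discussed (equations 13-15 of Collier et al. 2018), with alpha exposed as a parameter rather than fixed at 1. The function name, variable names, and the normalization by a reference scale are illustrative assumptions, not ILAMB's actual API:

```python
import numpy as np

def bias_score(model, reference, ref_scale, area_weights, alpha=1.0):
    """Sketch of the score-then-average procedure (illustrative, not ILAMB code)."""
    rel_error = np.abs(model - reference) / ref_scale      # relative bias error (eq. 13)
    local_score = np.exp(-alpha * rel_error)               # score each location (eq. 14)
    return np.average(local_score, weights=area_weights)   # area-weighted scalar score (eq. 15)
```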
Full disclosure, I wrote a paper on this topic: https://agupubs.onlinelibrary.wiley.com/doi/full/10.1029/2021MS002681

In that paper, we took a different tack and instead gave the 'overall' score based on a single objective metric. We chose MSE as a demonstration, though admittedly it falls short in many applications. Then we decomposed the MSE into things like bias and variance to show how these different 'concepts' contribute to the model's overall performance. In that way, the overall score is meaningful, in the sense that the model that best approximates reality will score highest. That "meaning" is retained if you average across locations then score. However, if you score then average, the relative rankings of models may change and the score loses some of its meaning.

Anyway, we respect all that your team has done. I'd agree the system is useful, but I thought this detail about alpha and rankings was something to be aware of.
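One standard way to make that decomposition concrete is the bias/variance/phase split of MSE (Murphy-style); the linked paper's exact decomposition may differ, so treat this as an illustrative sketch with made-up names rather than its implementation:

```python
import numpy as np

def mse_decomposition(model, obs):
    """Split MSE into mean, variability, and correlation error terms (illustrative)."""
    bias = model.mean() - obs.mean()
    sm, so = model.std(), obs.std()           # population standard deviations (ddof=0)
    r = np.corrcoef(model, obs)[0, 1]
    parts = {
        "bias^2":   bias**2,                  # error in the mean state
        "variance": (sm - so)**2,             # error in the amplitude of variability
        "phase":    2 * sm * so * (1 - r),    # error from imperfect correlation/timing
    }
    # The three terms sum exactly to the MSE.
    assert np.isclose(sum(parts.values()), np.mean((model - obs) ** 2))
    return parts
```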
Reading Collier et al 2018, it seems that the procedure for computing a score is as follows (using bias as an example):

1. calculate the relative bias error at a given location (equation 13)
2. score the relative error for that location (equation 14)
3. compute the scalar score as the average score across all locations (equation 15)

But I believe a better procedure is:

1. calculate the relative bias error
2. average across all locations
3. score the result

Or, when scoring a given location, just steps 1 and 3.

Have I misread the paper? When you use the first method, tweaking \alpha in the scoring function can alter how models rank relative to one another, which isn't ideal.
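A toy numerical illustration of that last point, with made-up relative errors (not ILAMB output) and the exponential scoring form s(e) = exp(-alpha * e) assumed from Collier et al. 2018. The average-then-score order always ranks the model with the smaller mean error first, while the score-then-average order can flip the ranking as alpha changes:

```python
import numpy as np

def score_then_average(errors, alpha=1.0):
    # Order in the paper (eqs. 13-15): score each location, then average the scores.
    return np.mean(np.exp(-alpha * errors))

def average_then_score(errors, alpha=1.0):
    # Suggested order: average the relative errors, then score the mean once.
    return np.exp(-alpha * errors.mean())

model_a = np.array([0.1, 1.9])  # uneven errors, mean 1.0
model_b = np.array([0.9, 0.9])  # even errors, mean 0.9

for alpha in (0.1, 1.0):
    s_a = score_then_average(model_a, alpha)
    s_b = score_then_average(model_b, alpha)
    winner = "A" if s_a > s_b else "B"
    print(f"alpha={alpha}: score-then-average prefers model {winner} "
          f"({s_a:.3f} vs {s_b:.3f})")
    # average-then-score is monotone in the mean error, so it prefers
    # model B (smaller mean error) for every positive alpha.
# Prints: alpha=0.1 prefers B, alpha=1.0 prefers A -- the ranking flips.
```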