wmedian: the first weight is ignored when ProbabilityWeights is used #933
I am the author of the function and I agree with all these issues; they all come down to the arbitrariness of the first weight. That being said, I cannot think of a good fix. I just really cannot see how the quantile definition used by Julia can be naturally extended to a weighted vector without implying weird edge cases. If you can find a more natural way, feel free to open a PR. Multiplying all weights by a large number so that they become integers would not really solve your issues either, since replacing a 0 weight by eps() would lead to very different results too.
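For context, a minimal sketch of the unweighted definition in question, assuming Julia's default `Statistics.quantile` (linear interpolation between order statistics at position h = (n - 1)p + 1):

```julia
using Statistics

# With the default parameters, quantile(v, p) interpolates between
# the order statistics around h = (n - 1) * p + 1.
quantile([1, 2, 3, 4], 0.5)   # 2.5

# The sticking point discussed here: any rule for placing n weighted
# points on [0, 1] that reduces to this one for equal weights tends
# to run into edge cases; in the current implementation the first
# weight ends up being the special one.
```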
Hi @matthieugomez, thanks for replying. I am not familiar with Julia's quantile function, but in case it can advance this discussion in any way, here is the function I use in Python:

```python
import numpy as np

def weighted_quantiles(values, weights, quantiles, interpolate=False):
    """
    >>> weighted_quantiles(np.array([1, 2, 3, 4]), np.ones(4), 0.5)
    2
    >>> weighted_quantiles(np.array([1, 2, 3, 4]), np.ones(4), 0.5, interpolate=True)
    2.5
    >>> weighted_quantiles(np.array([1, 2, 3, 4]), np.array([1000, 1, 1, 1]), 0.5)
    1
    >>> weighted_quantiles(np.array([1, 2, 3, 4]), np.array([1, 1000, 1, 1]), 0.5)
    2
    >>> weighted_quantiles(np.array([1, 2, 3, 4]), np.array([1, 1, 1000, 1]), 0.5)
    3
    >>> weighted_quantiles(np.array([1, 2, 3, 4]), np.array([1, 1, 1, 1000]), 0.5)
    4
    >>> weighted_quantiles(np.array([1, 2, 3, 4]), np.array([1000, 1, 1, 1]), 0.5, interpolate=True)
    1.002997002997003
    >>> weighted_quantiles(np.array([1, 2, 3, 4]), np.array([1, 1000, 1, 1]), 0.5, interpolate=True)
    2.000999000999001
    >>> weighted_quantiles(np.array([1, 2, 3, 4]), np.array([1, 1, 1000, 1]), 0.5, interpolate=True)
    2.999000999000999
    >>> weighted_quantiles(np.array([1, 2, 3, 4]), np.array([1, 1, 1, 1000]), 0.5, interpolate=True)
    3.9970029970029968
    """
    # Sort the values and carry the weights along.
    i = values.argsort()
    sorted_weights = weights[i]
    sorted_values = values[i]
    # Cumulative weight up to and including each sorted value.
    Sn = sorted_weights.cumsum()
    if interpolate:
        # Place each value at the midpoint of its own weight mass,
        # then interpolate linearly between those positions.
        Pn = (Sn - sorted_weights / 2) / Sn[-1]
        return np.interp(quantiles, Pn, sorted_values)
    else:
        return sorted_values[np.searchsorted(Sn, quantiles * Sn[-1])]
```
Hi, I clearly lack subtlety about the various definitions of weighted quantiles (and I skimmed the above discussion as a result), but I thought I'd share another, more obvious example of what the current implementation is doing.
The frequency weights seem to do what I'd expect, unlike ProbabilityWeights, which seems to ignore the first weight entirely. Worse, it is numerically inaccurate: replacing a weight of exactly zero by eps() changes the result noticeably, whereas setting epsilon to exactly zero yields the same (and, for me, expected) result as FrequencyWeights. A sketch of the kind of comparison I mean is below.
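A minimal sketch of that comparison, assuming StatsBase's `median` together with the `fweights` and `pweights` constructors; the particular values are illustrative assumptions, not the original ones:

```julia
using StatsBase

x = [1.0, 2.0, 3.0, 4.0]

# FrequencyWeights: piling almost all of the mass on x[1] pulls the
# median onto it, as expected (1000 copies of 1 dominate the sample).
median(x, fweights([1000, 1, 1, 1]))   # expected: 1.0

# ProbabilityWeights: per this issue, the first weight is effectively
# ignored, so the same concentration of mass does not behave the same.
median(x, pweights([1000.0, 1.0, 1.0, 1.0]))

# Numerical inaccuracy near the limit: per the discussion above,
# replacing a zero weight by eps() changes the result noticeably,
median(x, pweights([eps(), 1.0, 1.0, 1.0]))
# whereas an exact zero gives the same result as FrequencyWeights.
median(x, pweights([0.0, 1.0, 1.0, 1.0]))
```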
IMO the above shows surprising results that go beyond the differences between the various definitions: in particular, the first weight is ignored, and there is possibly a discontinuity in the limit where the weights become concentrated on one element (granted, discontinuities are part of mathematics, but they rarely help when analyzing real data).
Anyway, for me the workaround will be to multiply my weights by a large number, convert them to integers, and use FrequencyWeights instead of ProbabilityWeights, roughly as in the sketch below.
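A minimal sketch of that workaround, assuming StatsBase; the helper name `as_frequency_weights` and the `scale` factor are illustrative choices:

```julia
using StatsBase

# Hypothetical helper: approximate probability weights by integer
# frequency weights. A larger `scale` loses less precision in the
# rounding, at the cost of pretending to have a larger sample.
function as_frequency_weights(w::AbstractVector{<:Real}; scale::Int=10^6)
    fweights(round.(Int, w .* (scale / sum(w))))
end

x = [1.0, 2.0, 3.0, 4.0]
w = [0.2, 0.3, 0.5, 0.0]
median(x, as_frequency_weights(w))
```

As the author notes above, this sidesteps rather than fixes the underlying definition problem: for instance, a weight of eps() rounds to zero here, which may or may not be what a given analysis wants.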
Originally posted by @perrette in #435 (comment)