Chauvenet criterion wrong results #8

Open
baptistelabat-syroco opened this issue Nov 4, 2024 · 3 comments
Labels
good first issue Good for newcomers

Comments

@baptistelabat-syroco

Describe the bug
Chauvenet's criterion should flag roughly one sample of a normal distribution as an outlier.
Here it is flagging 30 to 40% of the points as outliers.

To Reproduce
from pythresh.thresholds.chau import CHAU
import numpy as np

normal_array = np.random.randn(99)         # samples from a standard normal
outlier_array = CHAU().eval(normal_array)  # binary outlier labels (1 = outlier)
print(np.vstack([normal_array, outlier_array]).T)
print(np.sum(outlier_array))               # roughly 30-40 points get flagged

Expected behavior
We tested on a normal distribution, so only a few points (or none) should be considered outliers.

Desktop (please complete the following information):

  • OS: Linux
  • Version: 24.04 LTS

Additional context
https://www.statisticshowto.com/chauvenets-criterion/
This table can be obtained with the following code:
from scipy import stats as scipy_stats
n = len(normal_array)  # sample size
prob_threshold = 1.0 / (2.0 * n)
number_of_tails = 2
threshold = -scipy_stats.norm.isf(1 - prob_threshold / number_of_tails)
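
For comparison, here is a minimal sketch of the standard criterion applied end-to-end (the function name is illustrative, it is not PyThresh code):

import numpy as np
from scipy import stats as scipy_stats

def chauvenet_mask(data):
    # Flag points whose two-tailed normal tail probability is below 1 / (2 * n)
    n = len(data)
    z = np.abs(data - data.mean()) / data.std(ddof=1)
    z_threshold = scipy_stats.norm.isf(1.0 / (4.0 * n))  # 1/(2n) split over two tails
    return z > z_threshold

print(chauvenet_mask(np.random.randn(99)).sum())  # typically 0 or 1 points flagged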

@KulikDM KulikDM added the good first issue Good for newcomers label Nov 4, 2024
@KulikDM
Owner

KulikDM commented Nov 4, 2024

Hi @baptistelabat-syroco

Thanks for spotting this!
In all honesty, the current implementation of CHAU tries to cater for the fact that the decision scores generated by the outlier detection method are most likely not normally distributed. So the implementation is non-standard compared to the example you provided above.

That being said, perhaps CHAU could be better aligned with the standard definition you provided and then use robust z-scores, e.g.

import numpy as np

# decision holds the detector's decision scores
median = np.median(decision)
mad = np.median(np.abs(decision - median))  # Median Absolute Deviation

# Using a scaling factor of 1.4826 to approximate normality scaling
robust_std_dev = mad * 1.4826
z_scores = (decision - median) / robust_std_dev

This would produce the expected result for a normal distribution while also catering better for non-normal distributions.
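
As a rough sketch only (not the current PyThresh code; decision is the detector's score array), those robust z-scores could then be plugged into the standard Chauvenet cutoff like this:

import numpy as np
from scipy import stats as scipy_stats

def robust_chauvenet_mask(decision):
    # Sketch: median/MAD z-scores combined with the 1/(2n) Chauvenet cutoff
    n = len(decision)
    median = np.median(decision)
    mad = np.median(np.abs(decision - median))
    robust_std_dev = mad * 1.4826                        # normal-consistency scaling
    z_scores = np.abs(decision - median) / robust_std_dev
    z_threshold = scipy_stats.norm.isf(1.0 / (4.0 * n))  # two-tailed 1/(2n)
    return z_scores > z_threshold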
If you have any suggestions or ideas, your feedback is always appreciated!

@baptistelabat-syroco
Author

Thanks for your reply.
I am afraid I don't understand your scaling factor. It seems rather hacky to me. I have to admit I struggle to find a clear, common definition of Chauvenet's criterion, and I was expecting to use your library as a reference to validate our own implementation.

@KulikDM
Copy link
Owner

KulikDM commented Nov 6, 2024

Hi @baptistelabat-syroco,
It does seem a bit strange, but this should hopefully help clarify things: https://stats.stackexchange.com/questions/355943/how-to-estimate-the-scale-factor-for-mad-for-a-non-normal-distribution
The example I gave above used the normal-distribution scaling, but the factor can be derived in the same way for other distribution types.
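
For reference (a quick check, not PyThresh code), the 1.4826 constant for the normal case is just the reciprocal of the 75th-percentile quantile, and the same recipe gives the factor for other distributions:

from scipy import stats as scipy_stats

# Normal: MAD = sigma * Phi^-1(0.75), so the consistency factor is 1 / Phi^-1(0.75)
print(1.0 / scipy_stats.norm.ppf(0.75))      # ~1.4826
# Analogous factor for, e.g., a Laplace distribution
print(1.0 / scipy_stats.laplace.ppf(0.75))   # ~1.4427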
It's unfortunate that the results of the two methods don't match, but I don't see any issue with your implementation as far as a normal distribution is concerned... unless you want to cater for non-normal distributions as well, which is the route PyThresh took.
