Clip the range of histograms when there are outliers #1157
I'm sure the heuristic used to choose the range can be improved, so if anybody gets the chance to try it on a couple of datasets and see how it looks, that would be very helpful.
to avoid clipping very close to the actual min or max
skrub/_reporting/_plotting.py
Outdated
@@ -115,6 +117,67 @@ def _adjust_fig_size(fig, ax, target_w, target_h):
    fig.set_size_inches((w, h))


def _get_range(values, frac=0.2, factor=3.0):
    min_value, low_p, high_p, max_value = np.percentile(
        values, [0, frac * 100.0, (1.0 - frac) * 100.0, 100.0]
np.quantile to simplify this line?
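For reference, a minimal sketch of what that suggestion could look like. `np.quantile` takes fractions in [0, 1] where `np.percentile` takes values in [0, 100], so the multiplications by 100 go away. The helper names and the sample data below are hypothetical and only serve to compare the two forms; they are not part of the PR.

```python
import numpy as np


def percentile_form(values, frac=0.2):
    # Same call as in the diff: percentiles are expressed on a 0-100 scale.
    return np.percentile(values, [0, frac * 100.0, (1.0 - frac) * 100.0, 100.0])


def quantile_form(values, frac=0.2):
    # Suggested alternative: quantiles are expressed on a 0-1 scale.
    return np.quantile(values, [0.0, frac, 1.0 - frac, 1.0])


# Hypothetical sample with one large outlier.
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])
same = np.allclose(percentile_form(values), quantile_form(values))
```

Both forms should return the same four numbers, so the change would be purely cosmetic.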
    delta = high_p - low_p
    if not delta:
        return min_value, max_value
    margin = factor * delta
Interesting heuristic, how did you get these default parameters?
from the random number generator in my head 😅 as mentioned, I think the heuristic can definitely be refined or replaced with something else
I thought of using the inter-quantile range because that's what box plots do, but I think we could read more about simple outlier detection methods
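For comparison, the usual box-plot convention (Tukey's fences) places the whiskers at most 1.5 times the interquartile range beyond the first and third quartiles. A rough sketch of that rule follows; the function name `tukey_fences` and the sample data are hypothetical, and this is not what the PR implements (the PR uses the 20th and 80th percentiles with a factor of 3).

```python
import numpy as np


def tukey_fences(values, k=1.5):
    # First and third quartiles; their difference is the interquartile range.
    q1, q3 = np.quantile(values, [0.25, 0.75])
    iqr = q3 - q1
    # Points outside [q1 - k * iqr, q3 + k * iqr] are treated as outliers.
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical sample: nine values, the last one an obvious outlier.
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 1000.0])
low, high = tukey_fences(values)
outliers = values[(values < low) | (values > high)]
```

With this sample the quartiles are 3 and 7, so the fences land at -3 and 13 and only the value 1000 is flagged.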
thanks for the review @Vincent-Maladiere :)
fixes #1155
this limits the range of data shown in some histograms to avoid having all the data in one bin and seeing no details of the distribution
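To illustrate the problem and the idea, here is a small sketch. The first part of `clipped_range` mirrors the lines visible in the diff; the final `return` is an assumed completion (extend the central percentile range by the margin, but never beyond the actual min and max), since the rest of `_get_range` is not shown here. The function name and the synthetic data are hypothetical.

```python
import numpy as np


def clipped_range(values, frac=0.2, factor=3.0):
    # Mirrors the visible part of the diff.
    min_value, low_p, high_p, max_value = np.percentile(
        values, [0, frac * 100.0, (1.0 - frac) * 100.0, 100.0]
    )
    delta = high_p - low_p
    if not delta:
        return min_value, max_value
    margin = factor * delta
    # ASSUMPTION: the real function's remaining lines are not shown in the
    # conversation; this is one plausible way to finish it.
    return max(min_value, low_p - margin), min(max_value, high_p + margin)


# Synthetic data: 1000 roughly standard-normal values plus one huge outlier.
rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(size=1000), [1e6]])

# Without clipping, the outlier stretches the bins and hides the distribution.
full_counts, _ = np.histogram(values, bins=10)

# With clipping, the bins cover the bulk of the data; np.histogram ignores
# values that fall outside the given range.
low, high = clipped_range(values)
clipped_counts, _ = np.histogram(values, bins=10, range=(low, high))
```

In the unclipped histogram all 1000 regular values fall into the first bin, while the clipped one spreads them over several bins.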