[ENH] Box plot: Add 'order by importance' checkbox to groups #4055
Codecov Report
@@            Coverage Diff             @@
##           master    #4055      +/-   ##
==========================================
+ Coverage   85.94%   85.94%   +<.01%
==========================================
  Files         393      393
  Lines       70033    70090      +57
==========================================
+ Hits        60187    60240      +53
- Misses       9846     9850       +4
Gosh, I forgot what we decided in the end, so please forgive me if I'm wrong.
@BlazZupan and @lanzagar, please confirm functionality. I'll write/fix tests afterwards.
Comment by @BlazZupan: when stretching bars makes no sense (when there are no groups, or when the grouping variable is the same as the variable shown), bars should not be stretched. This should also disable the checkbox. This does not belong to this PR and is implemented in #4176.
Box plot shows the distribution of an attribute and, if grouping is enabled, how the distribution of this attribute varies across groups. It thus conveys information about the conditional probability of the target variable given the value of the grouping variable.
If "Order by importance" is checked, the widget computes the chi-square or ANOVA statistic between each target variable and the currently selected group, and sorts the variables accordingly. This may help the user answer the question "If I divide the data into such and such groups, which is the attribute by which these groups differ most?"
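To make the idea concrete, here is a minimal sketch of such an ordering. It is not the widget's actual code: it assumes NumPy and SciPy are available, and the names `p_value` and `rank_variables` as well as the toy data are hypothetical. Discrete variables are scored with a chi-square test on the contingency table, numeric ones with one-way ANOVA, and variables are sorted by ascending p-value.

```python
import numpy as np
from scipy.stats import chi2_contingency, f_oneway


def p_value(values, groups, discrete):
    """p-value of chi-square (discrete values) or one-way ANOVA (numeric values)."""
    values = np.asarray(values)
    groups = np.asarray(groups)
    levels = np.unique(groups)
    if discrete:
        # Contingency table: one row per group, one column per attribute value.
        cats = np.unique(values)
        table = np.array([[np.sum((groups == g) & (values == c)) for c in cats]
                          for g in levels])
        return chi2_contingency(table)[1]
    # One sample of numeric values per group.
    return f_oneway(*(values[groups == g] for g in levels))[1]


def rank_variables(variables, groups):
    """Sort (name, p-value) pairs so that the most group-dependent variable comes first.

    `variables` maps a name to a (values, is_discrete) pair (hypothetical layout).
    """
    scores = [(name, p_value(vals, groups, disc))
              for name, (vals, disc) in variables.items()]
    return sorted(scores, key=lambda item: item[1])


# Toy data, invented for illustration only.
groups = ["a"] * 4 + ["b"] * 4
variables = {
    "x": ([1, 2, 1, 2, 9, 10, 9, 10], False),               # differs strongly between groups
    "y": ([1, 2, 3, 4, 1, 2, 3, 4], False),                 # identical in both groups
    "d": (["u", "u", "v", "v", "u", "u", "v", "v"], True),  # independent of the group
}
ranked = rank_variables(variables, groups)
```

Note that the resulting list mixes p-values from two different tests, which is the point raised in the discussion below.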
However, since the widget shows the conditional probability of the target given the grouping variable, it might make more sense to sort grouping variables. This will allow the user to set the target (typically the outcome, the class) and see which (grouping) attribute is the most informative about this class. This is currently possible by setting the outcome as the grouping variable and sorting the variables whose distributions we're observing, but in this case the widget shows the wrong conditional probabilities.
Both ways make some sense, but I believe that sorting grouping variables makes more sense. One piece of circumstantial evidence for the latter: if we sort by variables, we compute a mixture of chi-square and ANOVA p-values and sort them. This is not wrong, since the p-values should be commensurable. If we sort by groups, however, we compute a single kind of test for all groups, either chi-square (if the variable is discrete) or ANOVA (if it's numeric), because groups are always discrete.
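The alternative, sorting grouping variables for a fixed target, could be sketched as follows. Again, this is only an illustration under assumptions (NumPy and SciPy available; `rank_groupings` and the toy data are hypothetical, not taken from the widget). Because every grouping variable is discrete, the type of the target alone decides the test, so all p-values in one ranking come from the same kind of test.

```python
import numpy as np
from scipy.stats import chi2_contingency, f_oneway


def rank_groupings(target, target_is_discrete, groupings):
    """Sort grouping variables by how informative they are about `target`.

    `groupings` maps a name to a sequence of discrete group labels
    (hypothetical layout). Returns (name, p-value) pairs, smallest p first.
    """
    target = np.asarray(target)
    scores = []
    for name, labels in groupings.items():
        labels = np.asarray(labels)
        levels = np.unique(labels)
        if target_is_discrete:
            # Chi-square on the group-by-class contingency table.
            classes = np.unique(target)
            table = np.array([[np.sum((labels == g) & (target == c)) for c in classes]
                              for g in levels])
            p = chi2_contingency(table)[1]
        else:
            # One-way ANOVA of the numeric target across the groups.
            p = f_oneway(*(target[labels == g] for g in levels))[1]
        scores.append((name, p))
    return sorted(scores, key=lambda item: item[1])


# Toy data, invented for illustration only.
outcome = [1, 2, 1, 2, 9, 10, 9, 10]
groupings = {
    "batch": ["p", "q", "p", "q", "p", "q", "p", "q"],  # barely related to the outcome
    "site": ["a", "a", "a", "a", "b", "b", "b", "b"],   # separates the outcome well
}
ranked_groupings = rank_groupings(outcome, False, groupings)
```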
I changed the widget so that it can be tried out, but haven't thoroughly checked the code yet. I would appreciate some comments before jumping into a change that we might decide to revert the day after tomorrow.
@BlazZupan?