[ENH] Preprocess: Add filtering by missing values #4266
d70df3e to 3a2b007
Codecov Report
@@            Coverage Diff             @@
##           master    #4266      +/-   ##
==========================================
+ Coverage   86.05%   86.82%   +0.76%
==========================================
  Files         394      396       +2
  Lines       70228    71622    +1394
==========================================
+ Hits        60435    62185    +1750
+ Misses       9793     9437     -356
f72f29c to ce493b0
Orange/widgets/data/owpreprocess.py
Outdated
@@ -252,35 +252,103 @@ def __repr__(self):


class RemoveSparseEditor(BaseEditor):

    options = ["Nan's", "0's"]
NaNs and 0s, no apostrophe. I prefer missing instead of NaN, but that's ok, too. 0 should probably be written with a word, 'zeros'.
I think we should use "missing". Muggles won't understand NaN.
Other preprocessors have a vertical layout, so I suggest placing the two filtering options vertically instead of horizontally.
1b930a3 to e565f56
Orange/preprocess/preprocess.py
Outdated
Minimal proportion of non-zero entries of a feature
threshold: int or float
    if >= 1, the argument represents the allowed number of 0s or NaNs;
    if below 0, it represents the allowed proportion of 0s or NaNs
below 0 -> below 1
Orange/preprocess/preprocess.py
Outdated
    if >= 1, the argument represents the allowed number of 0s or NaNs;
    if below 0, it represents the allowed proportion of 0s or NaNs
filter0: bool
    if True (default), preprocessor counts 0s, otherwise NoNs
NoNs -> NaNs
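As a rough illustration of the threshold semantics described in the docstring under review (with the "below 1" correction applied), here is a minimal pure-Python sketch. The function name `keep_flags` and the list-of-columns input are assumptions made for this example and are not Orange's implementation; treating a column as kept when its count is less than or equal to the allowed value is likewise an assumption about the boundary case.

```python
import math


def keep_flags(columns, threshold, filter0=True):
    """Return one keep/drop flag per column (hypothetical helper).

    threshold >= 1: allowed absolute number of zeros (or missing values).
    threshold < 1:  allowed proportion of zeros (or missing values).
    filter0=True counts zeros; filter0=False counts NaNs.
    """
    flags = []
    for col in columns:
        if filter0:
            count = sum(1 for v in col if v == 0)
        else:
            count = sum(1 for v in col if math.isnan(v))
        # Convert a proportion into an absolute allowance for this column.
        allowed = threshold if threshold >= 1 else threshold * len(col)
        flags.append(count <= allowed)
    return flags


nan = float("nan")
cols = [[0, 0, 1, 2], [1, 2, 3, nan], [0, 1, nan, nan]]

by_zeros = keep_flags(cols, 1, filter0=True)        # [False, True, True]
by_missing = keep_flags(cols, 0.25, filter0=False)  # [True, True, False]
```

With `threshold=1` at most one zero is allowed, so only the first column is dropped; with `threshold=0.25` and four rows, at most one missing value is allowed, so only the last column is dropped.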
Orange/preprocess/preprocess.py
Outdated
@@ -572,26 +572,42 @@ def __call__(self, data):


class RemoveSparse(Preprocess):
    """
    Remove sparse features. Sparseness is determined according to
    user-defined treshold.
    Filter out the features with too many nan's or 0. Threshold is user defined.
Filter out features with too many (>threshold) zeros or missing values.
Orange/preprocess/preprocess.py
Outdated
    """

    def __init__(self, threshold=0.05):
    def __init__(self, threshold=5, filter0=True):
I suggest leaving the default value at threshold=0.05 so it remains backwards compatible. (This does not affect the widget, as it always calls it with an explicit threshold anyway.)
e565f56 to 5cc1a3b
5cc1a3b to bb801df
Description of changes
I extended the Filter sparse features preprocessor with filtering of columns by NaNs. We discussed having three options: filter by 0s, by NaNs, or by both. I chose not to implement the third option, since the order of operations matters here, and the user can simply use this preprocessor twice and have complete control that way.
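The "use the preprocessor twice" approach might look roughly like the sketch below, where each pass gets its own threshold. It is written in plain Python so that it runs on its own; `drop_columns` is a hypothetical stand-in for a single pass of the preprocessor, not the Orange API, and the column names and thresholds are made up for illustration.

```python
import math


def drop_columns(columns, threshold, filter0):
    """One filtering pass over a dict of named columns (hypothetical helper)."""
    def count(col):
        if filter0:
            return sum(1 for v in col if v == 0)
        return sum(1 for v in col if math.isnan(v))

    def allowed(col):
        # threshold >= 1 is a count, threshold < 1 is a proportion.
        return threshold if threshold >= 1 else threshold * len(col)

    return {name: col for name, col in columns.items()
            if count(col) <= allowed(col)}


nan = float("nan")
data = {
    "a": [0, 0, 0, 1],
    "b": [1, nan, nan, nan],
    "c": [0, 1, nan, 2],
}

# First pass: allow at most one zero per column.
after_zeros = drop_columns(data, threshold=1, filter0=True)
# Second pass: allow at most half of the values to be missing.
after_both = drop_columns(after_zeros, threshold=0.5, filter0=False)
```

Here the first pass drops `a` (three zeros) and the second pass drops `b` (three missing values), leaving only `c`. Chaining two passes lets the user pick a separate threshold for zeros and for missing values, which a single combined option would not allow.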
Includes