[ENH] Preprocess: Add filtering by missing values #4266

AndrejaKovacic · 2019-12-13T09:43:37Z

Description of changes

I extended the Filter sparse features preprocessor with filtering of columns by NaNs. We spoke about having three options: filter by 0s, by NaNs, or by both. I chose not to implement the third option, since the order of operations matters here; the user can just use this preprocessor twice and have complete control that way.
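
For illustration, the two-pass usage mentioned above might look roughly like the sketch below. It is plain Python and not the code from this PR: `remove_sparse`, `count_matching`, and the dict-of-columns table are hypothetical stand-ins, and the threshold is assumed to be an absolute count only.

```python
import math


def is_missing(value):
    """Treat float NaN as a missing value (assumption for this sketch)."""
    return isinstance(value, float) and math.isnan(value)


def count_matching(column, filter0):
    """Count zeros if filter0 is True, otherwise count missing values."""
    if filter0:
        return sum(1 for v in column if v == 0)
    return sum(1 for v in column if is_missing(v))


def remove_sparse(columns, threshold, filter0=True):
    """Keep only columns whose count of zeros (or NaNs) is at most `threshold`."""
    return {
        name: values
        for name, values in columns.items()
        if count_matching(values, filter0) <= threshold
    }


nan = float("nan")
table = {
    "a": [0, 0, 0, 1.0],      # three zeros
    "b": [nan, nan, 1.0, 2.0],  # two missing values
    "c": [1.0, 0, nan, 3.0],    # one zero, one missing value
}

# First pass filters by zeros, second pass filters by missing values.
after_zeros = remove_sparse(table, threshold=1, filter0=True)
after_both = remove_sparse(after_zeros, threshold=1, filter0=False)
```

Each pass takes its own threshold, which is the control a single combined option would not give.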

Includes

Code changes
Tests
Documentation

codecov · 2019-12-13T09:59:43Z

Codecov Report

Merging #4266 into master will increase coverage by 0.76%.
The diff coverage is 98.91%.

@@            Coverage Diff             @@
##           master    #4266      +/-   ##
==========================================
+ Coverage   86.05%   86.82%   +0.76%     
==========================================
  Files         394      396       +2     
  Lines       70228    71622    +1394     
==========================================
+ Hits        60435    62185    +1750     
+ Misses       9793     9437     -356

ajdapretnar · 2019-12-19T11:28:20Z

Orange/widgets/data/owpreprocess.py

@@ -252,35 +252,103 @@ def __repr__(self):

 class RemoveSparseEditor(BaseEditor):

+    options = ["Nan's", "0's"]


NaNs and 0s, no apostrophe. I prefer missing instead of NaN, but that's ok, too. 0 should probably be written with a word, 'zeros'.

janezd · 2019-12-19

I think we should use "missing". Muggles won't understand NaN.

ajdapretnar · 2019-12-19T11:31:38Z

Other preprocessors have a vertical layout, so I suggest placing the two filtering options vertically instead of horizontally.
Also, 'Select random features' and 'Select relevant features' have the threshold options as 'Fixed' and 'Percentage'. I think it would be nice to have the same layout across all preprocessors. That said, there's an option in the Text add-on where words are filtered by their absolute frequency if the input is an integer and by their relative frequency if the input is a float (e.g. 0.1 == 10%). Perhaps this is something we could have in preprocessors, too. Not sure about the user perspective here, though.
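
A minimal sketch of the integer-versus-float idea, for discussion only. It is not taken from the Text add-on; `allowed_count_by_type` is a hypothetical helper, and the rule that floats must lie in [0, 1] is an assumption of this sketch.

```python
def allowed_count_by_type(threshold, n_rows):
    """Interpret an int as an absolute count and a float as a proportion of rows."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(threshold, bool):
        raise TypeError("threshold must be an int or a float")
    if isinstance(threshold, int):
        return threshold
    if isinstance(threshold, float):
        if not 0.0 <= threshold <= 1.0:
            raise ValueError("a float threshold must lie in [0, 1]")
        return threshold * n_rows
    raise TypeError("threshold must be an int or a float")
```

With this rule the type of the input carries meaning: `1` allows a single value, while `1.0` allows all of them, which may or may not be obvious to users.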

lanzagar · 2020-01-10T10:19:02Z

Orange/preprocess/preprocess.py

-        Minimal proportion of non-zero entries of a feature
+    threshold: int or float
+        if >= 1, the argument represents the allowed number of 0s or NaNs;
+        if below 0, it represents the allowed proportion of 0s or NaNs


below 0 -> below 1
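
To spell out the intended reading of the docstring once "below 0" becomes "below 1", here is a small sketch. The helper names `allowed_count` and `keep_feature` are hypothetical and this is not the implementation in `preprocess.py`; it only assumes that a feature is removed when it has more zeros or NaNs than the threshold allows.

```python
def allowed_count(threshold, n_rows):
    """Translate the threshold into an allowed number of zeros or NaNs."""
    if threshold >= 1:
        # Values of 1 or more are read as an absolute count.
        return threshold
    # Values below 1 are read as a proportion of the rows.
    return threshold * n_rows


def keep_feature(n_flagged, n_rows, threshold):
    """Keep the feature unless it has more flagged values than allowed."""
    return n_flagged <= allowed_count(threshold, n_rows)
```

For example, with 8 rows a threshold of 0.25 allows up to 2 flagged values, and a threshold of 5 allows up to 5 regardless of the number of rows.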

lanzagar · 2020-01-10T10:19:44Z

Orange/preprocess/preprocess.py

+        if >= 1, the argument represents the allowed number of 0s or NaNs;
+        if below 0, it represents the allowed proportion of 0s or NaNs
+    filter0: bool
+        if True (default), preprocessor counts 0s, otherwise NoNs


NoNs -> NaNs

lanzagar · 2020-01-10T10:27:00Z

Orange/preprocess/preprocess.py

@@ -572,26 +572,42 @@ def __call__(self, data):

 class RemoveSparse(Preprocess):
    """
-    Remove sparse  features. Sparseness is determined according to
-    user-defined treshold.
+    Filter out the features with too many nan's or 0. Threshold is user defined.


Filter out features with too many (>threshold) zeros or missing values.

lanzagar · 2020-01-10T10:29:47Z

Orange/preprocess/preprocess.py

    """
-
-    def __init__(self, threshold=0.05):
+    def __init__(self, threshold=5, filter0=True):


I suggest leaving the default value at threshold=0.05 so it remains backwards compatible.
(this does not impact the widget as it always calls it with a defined threshold anyway)

Add filtering by nans preprocessor

3a2b007

AndrejaKovacic force-pushed the filter_nans branch from d70df3e to 3a2b007 Compare December 13, 2019 09:59

Add tests

ce493b0

AndrejaKovacic force-pushed the filter_nans branch from f72f29c to ce493b0 Compare December 13, 2019 12:49

ajdapretnar reviewed Dec 19, 2019

Change layout

8d43115

janezd assigned lanzagar Dec 20, 2019

janezd force-pushed the filter_nans branch 2 times, most recently from 1b930a3 to e565f56 Compare December 20, 2019 14:18

lanzagar reviewed Jan 10, 2020

AndrejaKovacic force-pushed the filter_nans branch from e565f56 to 5cc1a3b Compare January 10, 2020 11:11

fix typos, comments

bb801df

AndrejaKovacic force-pushed the filter_nans branch from 5cc1a3b to bb801df Compare January 10, 2020 11:16

lanzagar changed the title ~~[ENH] Add filtering by nans~~ [ENH] Preprocess: Add filtering by missing values Jan 10, 2020

lanzagar merged commit 27aabbf into biolab:master Jan 10, 2020
