[ENH] Preprocess: Add filtering by missing values #4266

Merged: 4 commits, Jan 10, 2020

@AndrejaKovacic commented Dec 13, 2019

Description of changes

I extended the Filter sparse features preprocessor to also filter columns by NaNs. We discussed having three options: filter by 0s, by NaNs, or by both. I chose not to implement the third option, since the order of operations matters here; the user can simply apply this preprocessor twice and have complete control that way.

Includes
  • Code changes
  • Tests
  • Documentation

codecov bot commented Dec 13, 2019

Codecov Report

Merging #4266 into master will increase coverage by 0.76%.
The diff coverage is 98.91%.

@@            Coverage Diff             @@
##           master    #4266      +/-   ##
==========================================
+ Coverage   86.05%   86.82%   +0.76%     
==========================================
  Files         394      396       +2     
  Lines       70228    71622    +1394     
==========================================
+ Hits        60435    62185    +1750     
+ Misses       9793     9437     -356

@@ -252,35 +252,103 @@ def __repr__(self):

class RemoveSparseEditor(BaseEditor):

options = ["Nan's", "0's"]

NaNs and 0s, no apostrophe. I prefer missing instead of NaN, but that's ok, too. 0 should probably be written with a word, 'zeros'.

I think we should use "missing". Muggles won't understand NaN.

@ajdapretnar

Other preprocessors have a vertical layout, so I suggest placing the two filtering options vertically instead of horizontally.
Also, 'Select random features' and 'Select relevant features' have the threshold options as 'Fixed' and 'Percentage'. I think it would be nice to have the same layout across all preprocessors. That said, there's an option in the Text add-on where words are filtered by their absolute frequency if the input is an integer and by their relative frequency if the input is a float (e.g. 0.1 == 10%). Perhaps that is something we could have in preprocessors, too. Not sure about the user perspective here, though.
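
For illustration, the type-based convention mentioned above could look roughly like the sketch below. This is not the Text add-on's actual code; the helper name `resolve_threshold` and its arguments are hypothetical, and the range check on floats is an assumption.

```python
def resolve_threshold(value, n_items):
    """Turn a user-supplied threshold into an absolute count.

    Hypothetical helper: an int is taken as an absolute count, a float
    as a fraction of n_items (e.g. 0.1 means 10%).
    """
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool):
        raise TypeError("threshold must be an int or a float")
    if isinstance(value, int):
        return value
    if isinstance(value, float):
        # Assumption: a relative threshold only makes sense in [0, 1].
        if not 0.0 <= value <= 1.0:
            raise ValueError("a float threshold must lie in [0, 1]")
        return value * n_items
    raise TypeError("threshold must be an int or a float")


absolute = resolve_threshold(3, 200)     # int: used as a count
relative = resolve_threshold(0.25, 200)  # float: 25% of 200 items
```

One consequence of dispatching on type is that `1` and `1.0` mean different things (one item versus 100%), which may be part of the user-perspective concern.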

@janezd janezd force-pushed the filter_nans branch 2 times, most recently from 1b930a3 to e565f56 Compare December 20, 2019 14:18
Minimal proportion of non-zero entries of a feature
threshold: int or float
if >= 1, the argument represents the allowed number of 0s or NaNs;
if below 0, it represents the allowed proportion of 0s or NaNs

below 0 -> below 1

if >= 1, the argument represents the allowed number of 0s or NaNs;
if below 0, it represents the allowed proportion of 0s or NaNs
filter0: bool
if True (default), preprocessor counts 0s, otherwise NoNs

NoNs -> NaNs
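
To make the intended semantics concrete, here is a minimal pure-Python sketch of the rule in the docstring with the two corrections from the review applied (proportion when the threshold is below 1, and NaNs counted when `filter0` is False). It is not the code in this PR: the function name `keep_mask` is hypothetical, the data is a plain list of rows rather than an Orange table, and treating the threshold as inclusive is an assumption.

```python
import math


def keep_mask(rows, threshold, filter0=True):
    """Return one bool per column: True if the column should be kept.

    threshold >= 1 is the allowed number of 0s (or NaNs) per column;
    threshold < 1 is the allowed proportion of rows.
    """
    n_rows = len(rows)
    n_cols = len(rows[0]) if rows else 0
    # Convert a proportion into an absolute count of allowed entries.
    allowed = threshold if threshold >= 1 else threshold * n_rows
    mask = []
    for j in range(n_cols):
        column = [row[j] for row in rows]
        if filter0:
            count = sum(1 for v in column if v == 0)
        else:
            count = sum(1 for v in column if math.isnan(v))
        # Assumption: a column with exactly `allowed` entries is kept.
        mask.append(count <= allowed)
    return mask


nan = float("nan")
rows = [
    [0.0, 1.0, nan],
    [0.0, 2.0, nan],
    [0.0, 3.0, 1.0],
    [1.0, 0.0, 2.0],
]
zeros_kept = keep_mask(rows, threshold=1)                   # at most 1 zero
nans_kept = keep_mask(rows, threshold=0.25, filter0=False)  # at most 25% missing
```

In this example the first column has three zeros and fails the zero filter, while the last column has two missing values out of four rows and fails the 25% missing-value filter.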

@@ -572,26 +572,42 @@ def __call__(self, data):

class RemoveSparse(Preprocess):
"""
Remove sparse features. Sparseness is determined according to
user-defined treshold.
Filter out the features with too many nan's or 0. Threshold is user defined.

Filter out features with too many (>threshold) zeros or missing values.

"""

def __init__(self, threshold=0.05):
def __init__(self, threshold=5, filter0=True):

I suggest leaving the default value at threshold=0.05 so that it remains backwards compatible.
(This does not affect the widget, as it always calls the preprocessor with an explicit threshold anyway.)
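
A small sketch of why the old default matters for existing callers. The class below is a hypothetical stand-in named `RemoveSparseSketch`, not the real `RemoveSparse`; it only mirrors the constructor signature under discussion.

```python
class RemoveSparseSketch:
    """Stand-in that mirrors only the constructor being discussed."""

    def __init__(self, threshold=0.05, filter0=True):
        self.threshold = threshold
        self.filter0 = filter0


# Code written before this PR keeps its meaning:
old_style = RemoveSparseSketch()      # still a 0.05 proportion
positional = RemoveSparseSketch(0.1)  # positional threshold still works

# New behaviour is opt-in through the added keyword:
new_style = RemoveSparseSketch(threshold=5, filter0=False)
```

With `threshold=5` as the default, a bare `RemoveSparseSketch()` call would silently switch from a proportion to an absolute count, which is the incompatibility the suggestion avoids.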

@lanzagar lanzagar changed the title [ENH] Add filtering by nans [ENH] Preprocess: Add filtering by missing values Jan 10, 2020
@lanzagar lanzagar merged commit 27aabbf into biolab:master Jan 10, 2020