-
-
Notifications
You must be signed in to change notification settings - Fork 316
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add missing_only init param to all imputers. missing_only is used in combination w/ the variables param. #698
base: main
Are you sure you want to change the base?
Conversation
…imputation __init__.py
…riately revise unit test
hi @solegalli, When I'm guessing the latter because I can't think of a scenario in which we would want to eliminate variables that do not have missing values. If my intuition is correct, is the sole purpose of this new feature for computational efficiency? If not, what are the other objectives of this new feature? |
hola @solegalli, are you on vacation? Just wanted to see if you saw my question. No rush! I'm accustomed to your rapid responses, so I thought I check-in. |
Hey @Morgan-Sell
The aim is not to expand the feature space unnecessarily. If you add indicators for all variables, not just those with NA, you will end up with a lot of variables that just contain the value 0. Is this what you mean by efficiency? Cheers |
@solegalli, welcome back! Hope you enjoyed your "holiday" as they say on your side of the pond ;) Based on your explanation, it seems that the If this is the case, then should this functionality only reside in the AddMissingIndicator class? |
Its most meaningful / useful function is indeed for the MissingIndicator class. But we should add it in all imputers. It will play a role when variables =None. At the moment, when |
…er class. Add missing_only param to CategoricalEncoder __init__. All CategoricalEncoder tests pass.
I added |
@@ -175,6 +185,10 @@ def fit(self, X: pd.DataFrame, y: Optional[pd.Series] = None): | |||
# select all variables or check variables entered by the user | |||
self.variables_ = find_all_variables(X, self.variables) | |||
|
|||
# identify variables with missing values | |||
if self.missing_only and self.variables is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as previous, we need to combine this with find_all_variables so it doesnt search twice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my prior response. By implementing the elif approach, we'll encounter a similar issue. This time we only want categorical variables, but if we create a new elif then we'll have both categorical and numerical variables
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @Morgan-Sell !
Thank you so much for the changes. This is well ahead in the development :)
I made a few code suggestions and added a few comments, so here I summarize the main things:
- for numerical imputers, when
variables=None
and missing=True, it selects all numerical variables with nan - for categorical imputers, when variables=None and missing=True, it selects all categorical variables with nan
- for the transformer that adds missing indicators, or the random sample imputer, when variables=None and missing=True, it selects all all variables with nan
I think the function that you created needs a bit of refinement.
hi @solegalli, I responded to two of your comments. Once we agree on the approach, we will quickly move forward with this PR. It's, more or less, copying/pasting the agreed-upon approach on all transformers. |
…on between 'ignore_format' and 'missing_only'
hi @solegalli, i created two global functions to identify either categorical or numerical variables that have missing values. These functions incorporate the "check_or_find_all..." functions. I have one question about the ArbitraryImputer's fit() order of operations. See my above comment. I also saw that DropMissingData() already incorporates the In the meantime, I will implement the new code in the other imputation transformers. |
hi @solegalli, AddMissingIndicator() already has the Should I refactor the code so the AddMissingIndicator() code base has a similar structure to the other imputers? Or, keep the code as is? If you have time, take a look at the other imputers. We're almost there!!! |
hi @solegalli, Did you see my above questions regarding the AddMissingIndicator() class? |
Closes #388.
When
missing_only=True
andvariables=None
, all numerical variables that do not contain missing values will be omitted fromself.variables_
.This functionality does not apply to categorical variables.