-
-
Notifications
You must be signed in to change notification settings - Fork 316
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Math features migrate to numpy #774
base: main
Are you sure you want to change the base?
Math features migrate to numpy #774
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #774 +/- ##
==========================================
+ Coverage 97.52% 97.53% +0.01%
==========================================
Files 107 108 +1
Lines 4283 4341 +58
Branches 854 866 +12
==========================================
+ Hits 4177 4234 +57
Misses 62 62
- Partials 44 45 +1 ☔ View full report in Codecov by Sentry. |
…ys an instance of list and the raise can never happen.
… in _get_new_features_name cause func is always an instance of list and the else gets never reached. Code Section was never covered by tests.
…(pandas default of ddof is 1, numpy default of ddof is 0) default is 1 for backward compatability
…(pandas default of ddof is 1, numpy default of ddof is 0) default is 1 for backward compatability
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @olikra
Thank you so much for moving forward with these changes. And apologies for the delayed response. I was first sick and then on a short holiday break.
I added some comments, I am not convinced we need the CustomFuntions class, and I wonder if there is a better way of implementing the functionality other than so many elifs.
Could we not do something like what we have in relative features:
methods_dict = { |
Would be great if you could look at the comments and let me know your thoughts. Thank you!
""" | ||
|
||
|
||
class CustomFunctions: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this class? it seems to be implementing a very simple functionality. Can this not be a one liner in the main MathFeatures class?
Besides that, the class name does not immediately tell what the class is doing, we'd need docstrings and typehints.
@@ -68,6 +69,11 @@ class MathFeatures(BaseCreation): | |||
one name per function. If None, the transformer will assign arbitrary names, | |||
starting with the function and followed by the variables separated by _. | |||
|
|||
ddof: int, float, default = 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can ddof ever be a float? the ddof are usually integers, no?
@@ -130,6 +136,54 @@ class MathFeatures(BaseCreation): | |||
0 1 4 2.5 | |||
1 2 5 3.5 | |||
2 3 6 4.5 | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All of this, should go in the user guide, not here. It is too much information for the class docstrings. But let's hold on with the change until we finalize the class update
@@ -139,8 +193,19 @@ def __init__( | |||
new_variables_names: Optional[List[str]] = None, | |||
missing_values: str = "raise", | |||
drop_original: bool = False, | |||
ddof: Union[int, float] = 1, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think ddof should be just int
) -> None: | ||
|
||
# casting input parameter func to a list |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for compatibility with sklearn, we can't modify the parameters that the user enters at init. They need to remain the same. Anything that needs changing, needs to happen in the fit and be added as a new attribute followed by _ if necessary at all.
def np_transform(np_df, new_variable_names, np_variables, np_functions): | ||
np_result_df = pd.DataFrame() | ||
for np_function_idx, np_function in enumerate(np_functions): | ||
if callable(np_function): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if it is a callable, can we not apply it straightaway? is not getting the name and then the if loop adding complexity?
else: | ||
np_function_name = np_function | ||
|
||
if np_function_name in ("sum"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think asserting equality is faster than scanning something in set, is that the case? did you check?
np_result_df[new_variable_names[np_function_idx]] = pd.Series( | ||
result | ||
) | ||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need continue here? so that it reads the following elif? should we not just use if then?
@@ -4,6 +4,7 @@ | |||
from sklearn.pipeline import Pipeline | |||
|
|||
from feature_engine.creation import MathFeatures | |||
from feature_engine.creation.custom_functions import CustomFunctions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we do keep this function, we need to test it in a separate test file. We have 1 test file per class/script
@@ -39,6 +40,11 @@ def test_error_if_func_is_dictionary(): | |||
MathFeatures(variables=["Age", "Name"], func={"A": "sum", "B": "mean"}) | |||
|
|||
|
|||
def test_error_if_ddof_is_not_int_or_float(): | |||
with pytest.raises(ValueError): | |||
MathFeatures(variables=["Age", "Name"], func={"std", "var"}, ddof="A") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we also want to assert the error message is the expected one
I migrated the transform method in math_features from pandas.agg to native numpy functions with backward compatibility. At least all existing tests on math-feature pass.
Following key-functions are implemented:
In case a defined function is not in above collection it falls back to pandas.agg. In the Examples-section I described the way to call a custom function with numpy.apply_over_axes()
I additionally changed the deprecated
dob_datrange = pd.date_range("2020-02-24", periods=4, freq="T")
to
dob_datrange = pd.date_range("2020-02-24", periods=4, freq="min")
to avoid FutureWarnings from pandas.
Please have a look. @solegalli