Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Math features migrate to numpy #774

Open
wants to merge 20 commits into
base: main
Choose a base branch
from

Conversation

olikra
Copy link
Contributor

@olikra olikra commented Jun 24, 2024

I migrated the transform method in math_features from pandas.agg to native numpy functions with backward compatibility. At least all existing tests on math-feature pass.

Following key-functions are implemented:

  • "sum", "np.sum"
  • "mean", "np.mean"
  • "min", "np.min"
  • "max", "np.max"
  • "prod", "np.prod"
  • "median", "np.median"
  • "std", "np.std"
  • "var", "np.var"

In case a defined function is not in above collection it falls back to pandas.agg. In the Examples-section I described the way to call a custom function with numpy.apply_over_axes()

I additionally changed the deprecated

dob_datrange = pd.date_range("2020-02-24", periods=4, freq="T")

to

dob_datrange = pd.date_range("2020-02-24", periods=4, freq="min")

to avoid FutureWarnings from pandas.

Please have a look. @solegalli

Copy link

codecov bot commented Jun 24, 2024

Codecov Report

Attention: Patch coverage is 98.50746% with 1 line in your changes missing coverage. Please review.

Project coverage is 97.53%. Comparing base (8cfab26) to head (63f26c9).

Files Patch % Lines
feature_engine/creation/math_features.py 98.41% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #774      +/-   ##
==========================================
+ Coverage   97.52%   97.53%   +0.01%     
==========================================
  Files         107      108       +1     
  Lines        4283     4341      +58     
  Branches      854      866      +12     
==========================================
+ Hits         4177     4234      +57     
  Misses         62       62              
- Partials       44       45       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

olikra added 5 commits June 24, 2024 12:04
…ys an instance of list and the raise can never happen.
… in _get_new_features_name cause func is always an instance of list and the else gets never reached. Code Section was never covered by tests.
olikra added 5 commits June 26, 2024 17:58
…(pandas default of ddof is 1, numpy default of ddof is 0) default is 1 for backward compatability
…(pandas default of ddof is 1, numpy default of ddof is 0) default is 1 for backward compatability
Copy link
Collaborator

@solegalli solegalli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @olikra

Thank you so much for moving forward with these changes. And apologies for the delayed response. I was first sick and then on a short holiday break.

I added some comments, I am not convinced we need the CustomFuntions class, and I wonder if there is a better way of implementing the functionality other than so many elifs.

Could we not do something like what we have in relative features:

Would be great if you could look at the comments and let me know your thoughts. Thank you!

"""


class CustomFunctions:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this class? it seems to be implementing a very simple functionality. Can this not be a one liner in the main MathFeatures class?

Besides that, the class name does not immediately tell what the class is doing, we'd need docstrings and typehints.

@@ -68,6 +69,11 @@ class MathFeatures(BaseCreation):
one name per function. If None, the transformer will assign arbitrary names,
starting with the function and followed by the variables separated by _.

ddof: int, float, default = 1
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can ddof ever be a float? the ddof are usually integers, no?

@@ -130,6 +136,54 @@ class MathFeatures(BaseCreation):
0 1 4 2.5
1 2 5 3.5
2 3 6 4.5

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of this, should go in the user guide, not here. It is too much information for the class docstrings. But let's hold on with the change until we finalize the class update

@@ -139,8 +193,19 @@ def __init__(
new_variables_names: Optional[List[str]] = None,
missing_values: str = "raise",
drop_original: bool = False,
ddof: Union[int, float] = 1,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think ddof should be just int

) -> None:

# casting input parameter func to a list
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for compatibility with sklearn, we can't modify the parameters that the user enters at init. They need to remain the same. Anything that needs changing, needs to happen in the fit and be added as a new attribute followed by _ if necessary at all.

def np_transform(np_df, new_variable_names, np_variables, np_functions):
np_result_df = pd.DataFrame()
for np_function_idx, np_function in enumerate(np_functions):
if callable(np_function):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if it is a callable, can we not apply it straightaway? is not getting the name and then the if loop adding complexity?

else:
np_function_name = np_function

if np_function_name in ("sum"):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think asserting equality is faster than scanning something in set, is that the case? did you check?

np_result_df[new_variable_names[np_function_idx]] = pd.Series(
result
)
continue
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need continue here? so that it reads the following elif? should we not just use if then?

@@ -4,6 +4,7 @@
from sklearn.pipeline import Pipeline

from feature_engine.creation import MathFeatures
from feature_engine.creation.custom_functions import CustomFunctions
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we do keep this function, we need to test it in a separate test file. We have 1 test file per class/script

@@ -39,6 +40,11 @@ def test_error_if_func_is_dictionary():
MathFeatures(variables=["Age", "Name"], func={"A": "sum", "B": "mean"})


def test_error_if_ddof_is_not_int_or_float():
with pytest.raises(ValueError):
MathFeatures(variables=["Age", "Name"], func={"std", "var"}, ddof="A")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we also want to assert the error message is the expected one

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants