-
-
Notifications
You must be signed in to change notification settings - Fork 316
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MathFeatures seems much slower than pandas.sum() #576
Comments
hi @solegalli, Were you able to pinpoint the root cause of this issue? |
No. Didn't have time to check. |
"seems much slower" Do we know how much slower MathFeatures is? If desired I can do some checks and report back... |
I don't know exactly how much slower it is. It would be great if you could do the checks @olikra ! |
@solegalli I suggest following approach concentrating on the 4 basic arithmetics (add,subtract,divide,multiply):
In a second step we could do further calculation on the two columns itself with (log, sin,...) |
Sounds good @olikra |
@solegalli Did some investigation: 1. Performance Measurement Approach
A script reads in the testdata (looping from 100.000 to 1.000.000 in 10.000 increments) and doing the math 50 times to get better averages. For pandas i used this line of code: For numpy: for feature-engine math:
Probably, there are more performant solutions for pandas and numpy. But it's just about the comparison to feature-engine math! 2. Results For the 500.000 record block the runtime measured via time.time_ns() directly before and after the 3 line of code: Runtime over all record-blocks: Luckily the runtime is linear to number of records! I stepped into the feature-engine-math and measured the time for each codeline with transform and fit + transform_fit: I assume the issue is in the transform part :-) I'm not familiar with profiling the code, but used cProfile for a first look inside:
Not sure if above output is an indicator for the issue. Regards Addition: Performance measurement was done on a Debian system exclusive - no other tasks like mariadb,nginx,... influenced the run |
It crossed my mind: Of course I have to test the pandas.agg() function similarly. Step into it as soon as posible... |
Thank you @olikra !! This is really useful. So, by the looks of it, we'd have to replace the transform logic, stepping away from pandas agg, and probably replacing with numpy. Yes, the problem is during transform, where the feature creation/combination takes place. It would be great to have numpy sum in the comparison, to see if this makes a better solution: https://numpy.org/doc/stable/reference/generated/numpy.sum.html Would you be able to add that one too in your comparison? and then, i guess we can modify the logic to replace the functions with those of numpy |
@solegalli added numpy.sum to the stack:
Performance for the 500.000 record float based run: Note: To check for correctness of all computations I calculated a checksum (sum of floats in all features/samples) after each run. For the 500.000 record-runs it looks so: The checksum for the pandas-based calculations is the same- but the numpy-based ones are a little bit different. I assume the cause is the different handling of floats in numpy and pandas. Not sure, if this can be relevant for existing models when the calculation changes from pandas to numpy. Cause the measurement is already setup I did the same with integers: So no big differences from Float to Integer. |
Thank you so much @olikra ! This is great! We need to move from pandas to numpy then. Would you be also up for trying to change the logic of this transformer? |
@solegalli I can give it a try. Need to get more familiar with feature-engine and the coding-style, test-style. Until now it was a blackbox-action for me. So the plan is to replace the generic pandas.agg functionality by native numpy functionality, where a native numpy functionality is available at this code-segment: MathFeatures.transform(self, X)
|
Yes @olikra that is the bit to be replaced. Not sure there is a numpy equivalent of pandas.agg, if there is, that would be the simplest, if not, we need to break it down function by function :/ Hopefully, you wouldn't have to change the tests much. |
@solegalli Let me present a first idea how the migration from pandas.agg to numpy in .transform could be done. We have to keep in mind, that already existing implementations of MathFeatures (in your courses and in the rest of the world) have to work in the same way as before migration! sidenote: The "func function, str, list or dict" in pandas.agg with the endless combinations does make things not easier... :-) Overall solution We have to iterate over the func-list, cause I found no numpy equivalent of pandas.agg. So the func-list is handled like a stack and we use a dictionary to do the calculation for each element in the stack. In this example func will get following values: func = ['mean', 'sum', 'np.log', 'convert_kelvin_to_celsius'] 'np.log' is a call to numpy, because we haven't provide it in the following dictionary.
Inside the iterator the calculation is done by this (surrounded code not shown):
df_nan is populated with the 'mean' and 'sum' values and the stack looks this way: func = ['np.log', 'convert_kelvin_to_celsius'] Unfortunatley we have to call these two functions with the pandas.agg. If we could find a way to identify a custom-function (e.g. by an decorator) we could call the custom-function by:
Specific issues with 'std' and 'var' Pandas uses different default-values e.g. for ddof. Using following code:
Pandas uses ddof = 1 as default and numpy uses ddof = 0. Same for 'var' Specific issues with Nan Having some Nan's in the array:
results in this: Open question is, how we handle different default values against the background that MathFeatures is already in use. |
Hey @olikra thank you so much! This is an incredible amount of work. I think you are on the right track. The main thing is not to break backwards compatibility in terms of the functionality. That means, that the user can still instantiate the transformer as usual to get the results they expect. So in principle, we should try and implement as many of the pandas supported methods as possible. Then, I think that using ddof 1 or 0 is a minor detail, but if we want to be absolutely strict with backward compatibility we can enforce numpy to use ddof 1 as well. Is the idea to use numpy whenever possible and pandas agg for pandas functions like the convert_kelvin_to_celsius? How many functions are there? did you find a list? If this is the approach, it will accelerate the transformer quite a bit, because I presume that most users would use the standard functions like sum, mean etc and those are supported by numpy. Then, depending on how many extra functions are supported by agg, we can choose to reimplement them in numpy (if they are just a few) or the easiest would be to default to pandas. The latter has the advantage that if/when pandas releases another function, it will be available for MathFeatures by default, whereas if we re-code them ourselves, every time pandas makes an update, we need to update it from our side. |
@solegalli _cython_table = { 2. Backwards compatibility On my Jupiter-Notebook I created/installed a first version of a boosted-transform method. I implemented the direct usage of numpy.nan* for following functions-calls: calculations = ('mean', 'sum', 'min', 'max', 'prod', 'median', 'std', 'var') During exercising your training-courses I will check/adjust it and do some fine-tuning. 3. Performance transform (with pandas.agg() ) - Numbers in Seconds boosted-transform ( with numpy.nan_) numbers in Millisecond 4. Test of transform |
@solegalli ('mean','sum', 'min', 'max', 'prod', 'median', 'std', 'var') from pandas.agg to numpy. 32 Tests of 33 already passed... Will look into it tomorrow evening... |
@solegalli Math-Feature-conversion is done so far. I assume some parts need discussion. Do you prefer discussion in this issue or in the pull request? |
I have uploaded a pr (#774) |
We discuss the code changes in the PR. I added a few comments. |
Need to check what is going on and fix
The text was updated successfully, but these errors were encountered: