-
-
Notifications
You must be signed in to change notification settings - Fork 316
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SmartCorrelatedSelection
does not behave consistently on different hardware with highly correlated features
#824
Comments
I have submitted a PR with my suggestions, let me know if you would like me to make changes :) |
Hi @ClaudioSalvatoreArcidiacono Thanks for the issue. If you set the |
Hey @solegalli, Thanks for taking a look. I tried setting That's because
|
SmartCorrelatedSelection
does not behave consistently on different hardware with highly correlated features
Thank you @ClaudioSalvatoreArcidiacono for the detailed explanation of the problem. I do agree that returning different results on different operating systems is not ideal. My understanding is that operations with sets are faster than operations with lists. Sets also guarantee that we don't have the same feature twice. So I am disinclined to go from sets to lists. Could we use ordered sets instead? Changing from sets to lists would also break backward compatibility. I understand from your previous comment that just changing the algorithm to mergesort does not resolve the issue, correct? |
Hey @solegalli, You are correct, changing the algorithm to mergesort would solve the issue for all selection methods except for The overhead added by going from sets to lists in this context should be quite minimal given that we are not dealing with huge sets/lists. Moreover, mainly lookup operations (e.g. check if Ordered Sets could be an option, but they are not part of the python standard library, there is a library called ordered_set which implements OrderedSets, but I would say that adding an extra dependency is not really worthy for this change. Another option could be the OrderedDict class, or directly a I think that by how the In order to keep backward compatibility we could convert the lists back to sets as a last step (see example here). In conclusion, I would say that the only overhead added by the proposed change is the conversion from list to set, which is only necessary to keep backward compatibility. |
Describe the bug
The behavior of
SmartCorrelatedSelection
is unpredictable when there are features that are very similar, so similar that they have the same number of missing values or the same single feature model perforamnce.In particular I have noticed that running the exact same code, with same python version and package versions on my development machine (a mac m1) and on a different machine (a linux-based remote node) I get different results.
To Reproduce
Run the code below on two different machines:
On one machine I get:
Whether on another machine I get:
Notice how the dropped features mismatches in the two sets.
For the tests above I used the following packages versions:
Expected behavior
I would expect the two runs to give the exact same results.
Screenshots
N/A
Additional context
I have already implemented an alternative version of
SmartCorrelatedSelection
that does not have this issue and I would like to contribute to the project by sharing my version.There are 2 reasons for the issue above
quicksort
. Thequicksort
sorting algorithm is not stable (see pandas doc), meaning that in cases of ties it does not keep the original feature order. To solve that I have added the parameterkind="mergesort"
to every call to pd.Series.sort_values.mergesort
is a stable sorting algorithm and it ensures the same ordering, also in case of ties.selection_method='model_performance'
, the temp_set is here defined as a set, which is a collection that does not preserve order. When this value is returned and it is used in here the original order of the feature is not preserved anymore, when the features are finally sorted in here even withmergesort
as a sorting algorithm the result will differ in case of ties (because the original order is not preserved due to the set issue). To solve this second point I have changed thetemp_set
variable to be alist
instead of aset()
The text was updated successfully, but these errors were encountered: