
SmartCorrelatedSelection does not behave consistently on different hardware with highly correlated features #824

Open
ClaudioSalvatoreArcidiacono opened this issue Nov 29, 2024 · 5 comments · May be fixed by #825

Comments

@ClaudioSalvatoreArcidiacono (Contributor) commented on Nov 29, 2024

Describe the bug
The behavior of SmartCorrelatedSelection is unpredictable when there are features that are very similar, so similar that they have the same number of missing values or the same single-feature model performance.

In particular, I have noticed that running the exact same code, with the same Python version and package versions, on my development machine (a Mac M1) and on a different machine (a Linux-based remote node) yields different results.

To Reproduce
Run the code below on two different machines:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from feature_engine.selection import SmartCorrelatedSelection
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Generate a synthetic dataset
X, y = make_classification(n_samples=100, n_features=20, n_informative=10, random_state=42)

# Convert to DataFrame for feature naming
feature_names = [f'feature_{i}' for i in range(X.shape[1])]
X = pd.DataFrame(X, columns=feature_names)

# Duplicate features
for i in range(10):  # Duplicating the first 10 features
    X[f'feature_{i}_duplicate'] = X[f'feature_{i}']

# Initialize the Logistic Regression model
model = LogisticRegression()

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Initialize the SmartCorrelatedSelection
selector = SmartCorrelatedSelection(
    estimator=model,
    selection_method='model_performance',
    threshold=0.9,  # Example threshold for correlation
    cv=cv
)

# Fit the selector to the data
selector.fit(X, y)

# Get the features that have been dropped
dropped_features = selector.features_to_drop_
print(dropped_features)

On one machine I get:

['feature_0', 'feature_1_duplicate', 'feature_2_duplicate', 'feature_3', 'feature_4', 'feature_5', 'feature_6_duplicate', 'feature_7', 'feature_8', 'feature_9']

whereas on another machine I get:

['feature_0', 'feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_6', 'feature_7_duplicate', 'feature_8', 'feature_9']

Notice how the dropped features differ between the two runs: for example, the first run drops feature_1_duplicate while the second drops feature_1.

For the tests above I used the following package versions:

  • python: 3.8.16
  • feature_engine: 1.8.2
  • pandas: 2.2.3
  • numpy: 2.0.2

Expected behavior
I would expect the two runs to give the exact same results.

Screenshots
N/A

Additional context
I have already implemented an alternative version of SmartCorrelatedSelection that does not have this issue and I would like to contribute to the project by sharing my version.

There are two reasons for the issue above:

  1. When feature values are sorted (for example here), the default sorting algorithm of pandas.Series.sort_values is used, which is quicksort. Quicksort is not a stable sorting algorithm (see the pandas doc), meaning that in case of ties it does not preserve the original feature order. To solve this, I added the parameter kind="mergesort" to every call to pd.Series.sort_values; mergesort is a stable sorting algorithm and guarantees the same ordering even in case of ties.
  2. Specifically for selection_method='model_performance', the temp_set variable is here defined as a set, which is a collection that does not preserve insertion order. When this value is returned and used in here, the original order of the features is lost, so when the features are finally sorted in here, the result can differ in case of ties even with mergesort (because the original order was already scrambled by the set). To solve this second point, I changed the temp_set variable to be a list instead of a set(). A minimal sketch reproducing both effects follows this list.
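
To make both effects concrete, here is a minimal, self-contained sketch. This is not code from feature_engine; the series values and feature names are made up for illustration, and the set-ordering output may vary between interpreter runs because string hashing is randomised by default.

import pandas as pd

# 1) Sort stability: with tied values, only a stable sort guarantees that
#    the original index order is preserved.
s = pd.Series([1.0, 1.0, 1.0], index=["feature_a", "feature_b", "feature_c"])
print(s.sort_values().index.tolist())                  # quicksort: tie order is not guaranteed
print(s.sort_values(kind="mergesort").index.tolist())  # mergesort: ties keep the original order

# 2) Set ordering: a set does not preserve insertion order, and for strings
#    the iteration order also depends on hash randomisation, so it can
#    change from one process to another.
features = ["feature_1", "feature_1_duplicate", "feature_2"]
print(list(set(features)))  # order not guaranteed, may differ between runs
print(features)             # a list preserves the original order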
@ClaudioSalvatoreArcidiacono (Contributor, Author):

I have submitted a PR with my suggestions; let me know if you would like me to make any changes :)

@solegalli (Collaborator):

Hi @ClaudioSalvatoreArcidiacono

Thanks for the issue.

If you set the random_state in the LogisticRegression, do you still see this behaviour?

@ClaudioSalvatoreArcidiacono (Contributor, Author) commented on Dec 2, 2024

Hey @solegalli,

Thanks for taking a look.

I tried setting random_state in the LogisticRegression and I obtained the same result.

That's because random_state is not used by the default solver of LogisticRegression (lbfgs). The documentation for random_state indeed states:

random_state int, RandomState instance, default=None
Used when solver == ‘sag’, ‘saga’ or ‘liblinear’ to shuffle the data. See Glossary for details.

@ClaudioSalvatoreArcidiacono changed the title from "SmartCorrelatedSelector does not behave consistently on different hardware with highly correlated features" to "SmartCorrelatedSelection does not behave consistently on different hardware with highly correlated features" on Dec 2, 2024
@solegalli (Collaborator):

Thank you @ClaudioSalvatoreArcidiacono for the detailed explanation of the problem. I do agree that returning different results on different operating systems is not ideal.

My understanding is that operations with sets are faster than operations with lists. Sets also guarantee that we don't have the same feature twice. So I am disinclined to go from sets to lists. Could we use ordered sets instead?

Changing from sets to lists would also break backward compatibility.

I understand from your previous comment that just changing the algorithm to mergesort does not resolve the issue, correct?

@ClaudioSalvatoreArcidiacono (Contributor, Author):

Hey @solegalli,

You are correct: changing the sorting algorithm to mergesort would solve the issue for all selection methods except selection_method='model_performance'.

The overhead added by going from sets to lists in this context should be quite minimal, given that we are not dealing with huge sets/lists. Moreover, the operations that are faster on sets than on lists are mainly lookups (e.g. checking el in set vs el in list), insert-if-not-present, and set algebra; we do not perform lookups on the temp_set variable, mainly on the examined set, which will stay a set. In here we do perform a difference operation, which is faster on sets than on lists, but I believe we could find ways to avoid that, for example by initialising the variable as an empty list in here (an order-preserving alternative is sketched below).
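
To illustrate that last point, a set difference can be replaced by an order-preserving filter that still uses a set for O(1) membership tests. This is only a sketch with hypothetical variable names, not the actual feature_engine code:

# `candidates` plays the role of the ordered feature list; `examined` stays a set.
candidates = ["feature_0", "feature_1", "feature_2", "feature_3"]
examined = {"feature_1", "feature_3"}

# O(n) scan with O(1) lookups against the set; the surviving
# features keep their original order.
remaining = [f for f in candidates if f not in examined]
print(remaining)  # ['feature_0', 'feature_2']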

Ordered sets could be an option, but they are not part of the Python standard library. There is a library called ordered_set which implements an OrderedSet class, but I would say that adding an extra dependency is not really worth it for this change.

Another option could be the OrderedDict class, or directly a plain dict, since dicts preserve insertion order as of Python 3.7 (see the sketch below). However, storing a set of features as keys of a dict with no values seems kind of ugly to me.
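
For completeness, this is the idiom I mean, sketched with made-up feature names: a plain dict used as an order-preserving, de-duplicating set.

# dict.fromkeys de-duplicates while keeping insertion order (Python 3.7+).
temp = dict.fromkeys(["feature_2", "feature_0", "feature_2"])
print(list(temp))           # ['feature_2', 'feature_0']
temp["feature_5"] = None    # "add" an element
print("feature_0" in temp)  # True: O(1) membership test, like a set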

I think that, given how the find_correlated_features algorithm is written, it is not possible to get duplicate elements in temp_set. I tried to find a counterexample to this statement but could not come up with a set of inputs that proves it wrong; please let me know if you think otherwise.

In order to keep backward compatibility, we could convert the lists back to sets as a last step (see example here).

In conclusion, I would say that the only overhead added by the proposed change is the conversion from list to set, which is needed only to keep backward compatibility.
