
SmartCorrelatedSelection does not behave consistently on different hardware with highly correlated features #824

Open
ClaudioSalvatoreArcidiacono opened this issue Nov 29, 2024 · 5 comments · May be fixed by #825

Comments

@ClaudioSalvatoreArcidiacono (Contributor) commented on Nov 29, 2024

Describe the bug
The behavior of SmartCorrelatedSelection is unpredictable when there are features that are very similar, so similar that they have the same number of missing values or the same single-feature model performance.

In particular, I have noticed that running the exact same code, with the same Python version and package versions, on my development machine (a Mac M1) and on a different machine (a Linux-based remote node) yields different results.

To Reproduce
Run the code below on two different machines:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from feature_engine.selection import SmartCorrelatedSelection
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Generate a synthetic dataset
X, y = make_classification(n_samples=100, n_features=20, n_informative=10, random_state=42)

# Convert to DataFrame for feature naming
feature_names = [f'feature_{i}' for i in range(X.shape[1])]
X = pd.DataFrame(X, columns=feature_names)

# Duplicate features
for i in range(10):  # Duplicating the first 10 features
    X[f'feature_{i}_duplicate'] = X[f'feature_{i}']

# Initialize the Logistic Regression model
model = LogisticRegression()

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Initialize the SmartCorrelatedSelection
selector = SmartCorrelatedSelection(
    estimator=model,
    selection_method='model_performance',
    threshold=0.9,  # Example threshold for correlation
    cv=cv
)

# Fit the selector to the data
selector.fit(X, y)

# Get the features that have been dropped
dropped_features = selector.features_to_drop_
print(dropped_features)

On one machine I get:

['feature_0', 'feature_1_duplicate', 'feature_2_duplicate', 'feature_3', 'feature_4', 'feature_5', 'feature_6_duplicate', 'feature_7', 'feature_8', 'feature_9']

whereas on another machine I get:

['feature_0', 'feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_6', 'feature_7_duplicate', 'feature_8', 'feature_9']

Notice how the dropped features differ between the two runs: for example, the first run drops feature_1_duplicate while the second drops feature_1.

For the tests above I used the following package versions:

  • python: 3.8.16
  • feature_engine: 1.8.2
  • pandas: 2.2.3
  • numpy: 2.0.2

Expected behavior
I would expect the two runs to give the exact same results.

Screenshots
N/A

Additional context
I have already implemented an alternative version of SmartCorrelatedSelection that does not have this issue and I would like to contribute to the project by sharing my version.

There are two reasons for the issue above:

  1. When feature values are sorted (for example here), the default sorting algorithm of pandas.Series.sort_values is used, which is quicksort. Quicksort is not a stable sorting algorithm (see the pandas doc), meaning that in case of ties it does not preserve the original feature order. To solve this, I added the parameter kind="mergesort" to every call to pd.Series.sort_values; mergesort is a stable sorting algorithm and guarantees the same ordering even in case of ties.
  2. Specifically for selection_method='model_performance', the temp_set variable is here defined as a set, which is a collection that does not preserve insertion order. When this value is returned and used in here, the original order of the features is lost, so when the features are finally sorted in here, the result can differ in case of ties even with mergesort (because the original order was already scrambled by the set). To solve this second point, I changed the temp_set variable to be a list instead of a set(). A minimal sketch reproducing both effects follows this list.
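
To make both effects concrete, here is a minimal, self-contained sketch. This is not code from feature_engine; the series values and feature names are made up for illustration, and the set-ordering output may vary between interpreter runs because string hashing is randomised by default.

import pandas as pd

# 1) Sort stability: with tied values, only a stable sort guarantees that
#    the original index order is preserved.
s = pd.Series([1.0, 1.0, 1.0], index=["feature_a", "feature_b", "feature_c"])
print(s.sort_values().index.tolist())                  # quicksort: tie order is not guaranteed
print(s.sort_values(kind="mergesort").index.tolist())  # mergesort: ties keep the original order

# 2) Set ordering: a set does not preserve insertion order, and for strings
#    the iteration order also depends on hash randomisation, so it can
#    change from one process to another.
features = ["feature_1", "feature_1_duplicate", "feature_2"]
print(list(set(features)))  # order not guaranteed, may differ between runs
print(features)             # a list preserves the original order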
@ClaudioSalvatoreArcidiacono (Contributor, Author):

I have submitted a PR with my suggestions; let me know if you would like me to make any changes :)

@solegalli (Collaborator):

Hi @ClaudioSalvatoreArcidiacono

Thanks for the issue.

If you set the random_state in the LogisticRegression, do you still see this behaviour?

@ClaudioSalvatoreArcidiacono (Contributor, Author) commented on Dec 2, 2024

Hey @solegalli,

Thanks for taking a look.

I tried setting random_state in the LogisticRegression and I obtained the same result.

That's because random_state is not used by the default solver of LogisticRegression (lbfgs). The documentation for random_state indeed states:

random_state int, RandomState instance, default=None
Used when solver == ‘sag’, ‘saga’ or ‘liblinear’ to shuffle the data. See Glossary for details.

@ClaudioSalvatoreArcidiacono changed the title from "SmartCorrelatedSelector does not behave consistently on different hardware with highly correlated features" to "SmartCorrelatedSelection does not behave consistently on different hardware with highly correlated features" on Dec 2, 2024
@solegalli (Collaborator):

Thank you @ClaudioSalvatoreArcidiacono for the detailed explanation of the problem. I do agree that returning different results on different operating systems is not ideal.

My understanding is that operations with sets are faster than operations with lists. Sets also guarantee that we don't have the same feature twice. So I am disinclined to go from sets to lists. Could we use ordered sets instead?

Changing from sets to lists would also break backward compatibility.

I understand from your previous comment that just changing the algorithm to mergesort does not resolve the issue, correct?

@ClaudioSalvatoreArcidiacono (Contributor, Author):

Hey @solegalli,

You are correct: changing the sorting algorithm to mergesort would solve the issue for all selection methods except selection_method='model_performance'.

The overhead added by going from sets to lists in this context should be quite minimal, given that we are not dealing with huge sets/lists. Moreover, the operations that are faster on sets than on lists are mainly lookups (e.g. checking el in set vs el in list), insert-if-not-present, and set algebra; we do not perform lookups on the temp_set variable, mainly on the examined set, which will stay a set. In here we do perform a difference operation, which is faster on sets than on lists, but I believe we could find ways to avoid that, for example by initialising the variable as an empty list in here (an order-preserving alternative is sketched below).
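
To illustrate that last point, a set difference can be replaced by an order-preserving filter that still uses a set for O(1) membership tests. This is only a sketch with hypothetical variable names, not the actual feature_engine code:

# `candidates` plays the role of the ordered feature list; `examined` stays a set.
candidates = ["feature_0", "feature_1", "feature_2", "feature_3"]
examined = {"feature_1", "feature_3"}

# O(n) scan with O(1) lookups against the set; the surviving
# features keep their original order.
remaining = [f for f in candidates if f not in examined]
print(remaining)  # ['feature_0', 'feature_2']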

Ordered sets could be an option, but they are not part of the Python standard library. There is a library called ordered_set which implements an OrderedSet class, but I would say that adding an extra dependency is not really worth it for this change.

Another option could be the OrderedDict class, or directly a plain dict, since dicts preserve insertion order as of Python 3.7 (see the sketch below). However, storing a set of features as keys of a dict with no values seems kind of ugly to me.
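
For completeness, this is the idiom I mean, sketched with made-up feature names: a plain dict used as an order-preserving, de-duplicating set.

# dict.fromkeys de-duplicates while keeping insertion order (Python 3.7+).
temp = dict.fromkeys(["feature_2", "feature_0", "feature_2"])
print(list(temp))           # ['feature_2', 'feature_0']
temp["feature_5"] = None    # "add" an element
print("feature_0" in temp)  # True: O(1) membership test, like a set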

I think that, given how the find_correlated_features algorithm is written, it is not possible to get duplicate elements in temp_set. I tried to find a counterexample to this statement but could not come up with a set of inputs that proves it wrong; please let me know if you think otherwise.

In order to keep backward compatibility, we could convert the lists back to sets as a last step (see example here).

In conclusion, I would say that the only overhead added by the proposed change is the conversion from list to set, which is needed only to keep backward compatibility.
