Feature Request: Add Correlation with target as selection method for SmartCorrelatedSelection #826

Open
ClaudioSalvatoreArcidiacono opened this issue Dec 2, 2024 · 2 comments · May be fixed by #827

Comments

@ClaudioSalvatoreArcidiacono
Contributor

Is your feature request related to a problem? Please describe.
No.

Describe the solution you'd like
I would like to add a new selection method to SmartCorrelatedSelection, based on the correlation of each feature with the target variable.

Describe alternatives you've considered
I have not considered other alternatives.

Additional context
For linear models, it can make a lot of sense to select features by their correlation with the target, and this can be faster than computing the performance of single-feature models.

I would like to contribute to this project by implementing this feature myself.

@solegalli
Collaborator

Hi @ClaudioSalvatoreArcidiacono

This functionality is already supported by sklearn through f_regression.

I admit that f_regression is not great because it selects based on p-values and F-values instead of the correlation coefficient, although the correlation coefficient could be inferred from the p-value and the number of samples, so in essence it is more or less the same. Emphasis on more or less.
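
For reference, the relationship alluded to here can be written out. This is a sketch assuming a single feature whose Pearson correlation with the target is $r$, computed on $n$ samples with centered data (the default in sklearn's f_regression); the symbols are introduced for illustration and do not come from the thread:

$$
F = \frac{r^2}{1 - r^2}\,(n - 2)
\quad\Longleftrightarrow\quad
r^2 = \frac{F}{F + n - 2}
$$

Because $F$ increases monotonically with $r^2$ for a fixed $n$, ranking features by F-value gives the same order as ranking them by $|r|$, although the sign of the correlation is lost.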

@ClaudioSalvatoreArcidiacono
Contributor Author

Hey @solegalli, thanks for taking a look and for mentioning f_regression from sklearn. To be completely honest, it is one of those functionalities that I never fully explored.

I believe that adding correlation with target as a selection method to SmartCorrelatedSelection could still bring value to the library for the following reasons:

  • f_regression selects features to remove by looking at the relationship between each feature and the target, without considering correlations between features. In the proposed change, the candidate features to drop are identified by looking at their correlation with other features, and the correlation with the target is used to make the final decision. So they serve different purposes:
    • f_regression: removes features with a low correlation with the target.
    • SmartCorrelatedSelection with selection_method=corr_with_target: within each group of correlated features, removes the features with the lower correlation with the target.

I would say it is more like sklearn's VarianceThreshold vs. SmartCorrelatedSelection with variance as the selection method: they are quite different.

  • Unlike f_regression, which is not designed as an estimator, SmartCorrelatedSelection is. This makes it easier for users to incorporate it directly into their feature selection pipelines.
  • A direct correlation-based selection method can be more intuitive for many users, allowing them to easily understand and interpret the relationship between features and the target variable. This could enhance the usability and accessibility of the library.
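
To make the first point concrete, here is a minimal sketch of the proposed behaviour, not the library's implementation. It uses plain Python to stay dependency-free (in practice pandas' `DataFrame.corr` would be the natural choice), and the names `pearson`, `select_by_target_correlation` and `threshold` are hypothetical, chosen only for illustration. Features are ranked by absolute correlation with the target, and a feature is dropped when it is highly correlated with a stronger feature that has already been kept.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def select_by_target_correlation(features, target, threshold=0.8):
    """Hypothetical helper: return the names of the features to keep.

    features: dict mapping feature name -> list of values
    target: list of target values
    threshold: absolute correlation above which two features count as correlated
    """
    # Rank features by absolute correlation with the target, strongest first.
    ranked = sorted(
        features,
        key=lambda name: abs(pearson(features[name], target)),
        reverse=True,
    )
    kept = []
    for name in ranked:
        # Keep a feature only if it is not highly correlated with any
        # stronger feature that was already kept.
        if all(abs(pearson(features[name], features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


target = [1.0, 2.1, 2.9, 4.2, 5.1, 5.8]
features = {
    "x2": [1, 2, 3, 4, 5, 7],       # near-duplicate of x1, slightly weaker link to target
    "x1": [1, 2, 3, 4, 5, 6],       # strongest link to target
    "noise": [1, -1, 1, -1, 1, -1],  # weakly related to everything
}

kept = select_by_target_correlation(features, target)
print(kept)  # ['x1', 'noise']
```

Here x1 and x2 form a correlated group, and x2 is the one dropped because its correlation with the target is lower. A filter based only on correlation with the target, such as f_regression with a cut-off, would instead keep both x1 and x2 and drop noise.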

Again, thanks for your time and I am happy to discuss this further!

2 participants