Add filter feature selection using Pearson correlation #6

Open
wants to merge 1 commit into
base: master

mandjevant

In short:

This adds a filter-based feature selection method for applications that prioritize efficient computation over detecting complex relationships.

Implementation

  1. Transpose the data: self.dataX contains the data as a matrix. It must be transposed so that each row holds one feature before calculating the Pearson correlation.
  2. Calculate the Pearson correlation: since scipy is already a requirement for this library, we can simply use scipy.stats.pearsonr, which also returns the p-value.
  3. Remove the features whose absolute correlation falls outside the range [min_corr, max_corr] or whose p-value exceeds max_pvalue.
  4. Return the indices and the names of the selected features.

Function arguments:

Args:
    min_corr: Minimum absolute correlation for a feature to be selected. Default: 0.2
    max_corr: Maximum absolute correlation for a feature to be selected. Default: 1.0
    max_pvalue: Maximum p-value for the correlation to count as statistically significant. Default: 0.05
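
For reviewers, the steps and arguments above can be sketched roughly as follows. This is a minimal standalone illustration, not the code in this PR: the function name `pearson_filter` and the parameters `data_x`, `data_y`, and `feature_names` are hypothetical, and it assumes each feature is correlated against a 1-D target vector.

```python
import numpy as np
from scipy.stats import pearsonr


def pearson_filter(data_x, data_y, feature_names,
                   min_corr=0.2, max_corr=1.0, max_pvalue=0.05):
    """Return (indices, names) of the features that pass the filter.

    Hypothetical helper: data_x is an (n_samples, n_features) matrix,
    data_y an (n_samples,) target vector (assumption, see lead-in).
    """
    data_x = np.asarray(data_x, dtype=float)
    data_y = np.asarray(data_y, dtype=float)

    selected = []
    # Step 1: transpose so that each row is one feature.
    for idx, feature in enumerate(data_x.T):
        # Step 2: pearsonr returns the correlation and its p-value.
        corr, pvalue = pearsonr(feature, data_y)
        # Step 3: keep the feature only if both conditions hold.
        if min_corr <= abs(corr) <= max_corr and pvalue <= max_pvalue:
            selected.append(idx)

    # Step 4: indices and names of the selected features.
    return selected, [feature_names[i] for i in selected]
```

Because every feature is scored independently with a single call to `pearsonr`, the cost grows linearly with the number of features, which is consistent with the short runtime reported below for the filter method.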

Results

Wrapper method:

  • The following features were selected: ['RM', 'TAX', 'LSTAT', 'PTRATIO', 'DIS', 'AGE']
  • The estimated error of the developed model is: 2.7131430936707424
  • Method took 73.3180787563324 seconds to complete.

fst-pso method:

  • The following features have been selected: ['CRIM', 'ZN', 'NOX', 'RM', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'LSTAT'] with a MAE of 2.79
  • The estimated error of the developed model is: 2.907917419119525
  • Method took 3338.643133163452 seconds to complete.

Filter method:

  • The following features were selected: ['CRIM', 'ZN', 'INDUS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'LSTAT']
  • The estimated error of the developed model is: 2.6634497163494317
  • Method took 2.928159236907959 seconds to complete.
