[ENH] PCA on sparse data. #2313

Merged: 6 commits into biolab:master, May 26, 2017

Conversation

@acopar (Contributor) commented May 12, 2017

Issue

Fixes #2255.
Use TruncatedSVD (LSA) on sparse data instead of PCA, since PCA does not work on sparse data. The difference is that TruncatedSVD computes only the leading (truncated) set of singular vectors and does not center the data before running the SVD. For dense data, the functionality remains the same.
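
For context, here is a minimal sketch in plain scikit-learn (outside Orange, with arbitrary shapes and density) of why this fallback is needed; PCA rejects scipy sparse input while TruncatedSVD factorizes it directly:

import scipy.sparse as sp
from sklearn.decomposition import PCA, TruncatedSVD

X = sp.random(100, 50, density=0.05, format="csr", random_state=0)

# PCA(n_components=2).fit(X) would raise a TypeError on sparse input.
svd = TruncatedSVD(n_components=2)   # works on sparse matrices, no centering
Z = svd.fit_transform(X)             # dense array of shape (100, 2)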

Description of changes

Workflow:
Corpus (Select bookexcerpts.tab) -> Bag of Words -> PCA
(screenshot: pca)

  • widgets/unsupervised/owpca.py: Select decomposition with radio buttons.
  • projection/pca.py: TruncatedSVD model
  • tests: test_pca.py
Includes
  • Code changes
  • Tests
  • Documentation

@acopar mentioned this pull request May 12, 2017
# shape of X after preprocessing
# Add -1 to avoid error in scikit fit_transform
# ValueError: n_components must be < n_features (strict)
params["n_components"] = min(min(X.shape) - 1, self.max_components)
Contributor

Why, then, if n_components and max_components do about the same thing, not have just n_components and override it if the shape is too small?

E.g. what happens if n_components=1e12 is passed and X.shape == (3, 3)?

Contributor Author

I agree, max_components was in there because of similarities with PCA, but it is in fact not needed for the TruncatedSVD model. To answer your question, n_components is bound by max_components from inside the widget, so in this particular case it would not make any difference :) Will fix it.
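
As a hedged sketch (clamp_n_components is a hypothetical helper, not code from this PR), bounding the request by the preprocessed shape directly also answers the n_components=1e12 question above:

import numpy as np

def clamp_n_components(requested, X):
    # sklearn's TruncatedSVD requires n_components < n_features (strict),
    # so bound the requested value by the data shape.
    return min(min(X.shape) - 1, requested)

X = np.zeros((3, 3))
print(clamp_n_components(10**12, X))   # -> 2, even for an absurd request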

        self.__truncated_svd_test_helper(data, n_com=10, min_xpl_var=0.7)
        self.__truncated_svd_test_helper(data, n_com=31, min_xpl_var=0.99)

    def __truncated_svd_test_helper(self, data, n_com, min_xpl_var):
Contributor

It's actually ok to use full, expressive variable names where common abbreviations don't exist (e.g. n_ and min_ are ok).

@nikicc (Contributor) left a comment

Besides the comments above, I also found one functional glitch: if I pass Iris to PCA, clicking Center data on and off causes the number of components to jump from 3 to 4. Is this expected?

@@ -96,7 +96,7 @@ def __init__(self, pca):

     def __call__(self, data):
         if data.domain != self.pca.pre_domain:
-            data = data.transform(self.pca.pre_domain)
+            data = data.from_table(self.pca.pre_domain, data)
Contributor

What's wrong with data = data.transform(self.pca.pre_domain)?

    __wraps__ = skl_decomposition.TruncatedSVD
    name = 'truncated svd'

    def __init__(self, n_components=None, copy=True, whiten=False,
Contributor

TruncatedSVD doesn't seem to have copy, whiten, svd_solver or iterated_power attributes. On the contrary, it has algorithm and n_iter attributes, which are missing from our constructor.
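
A hedged sketch of a constructor aligned with sklearn's actual TruncatedSVD parameters; the SklProjector base class is taken from the surrounding diff, while the import path and the self.params = vars() idiom are assumed from the usual Orange wrapper pattern:

import sklearn.decomposition as skl_decomposition
from Orange.projection import SklProjector   # import path assumed

class TruncatedSVD(SklProjector):
    __wraps__ = skl_decomposition.TruncatedSVD
    name = 'truncated svd'
    supports_sparse = True

    def __init__(self, n_components=2, algorithm='randomized', n_iter=5,
                 random_state=None, tol=0.0, preprocessors=None):
        super().__init__(preprocessors=preprocessors)
        self.params = vars()   # forwarded to the wrapped sklearn estimator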

        super().__init__(preprocessors=preprocessors)
        if n_components is not None and max_components is not None:
            raise ValueError("n_components and max_components can not both be defined.")
        # max_components limits the number of PCA components if the minimum
Contributor

max_components limits the number of SVD components if the minimum

        self.__truncated_svd_test_helper(data, n_com=31, min_xpl_var=0.99)

    def __truncated_svd_test_helper(self, data, n_com, min_xpl_var):
        pca = TruncatedSVD(n_components=n_com)
Contributor

Should we use svd variable name here and below? This is a test for SVD, not PCA.

@@ -375,6 +386,26 @@ def _update_normalize(self):
        if self.data is None:
            self._invalidate_selection()

    def _init_projector(self):
        if self.center:
@nikicc (Contributor) May 16, 2017

What about:

projector = PCA if self.center else TruncatedSVD
self._pca_projector = projector(n_components=MAX_COMPONENTS)
self._pca_projector.component = self.ncomponents
self._pca_preprocessors = projector.preprocessors

@@ -151,6 +154,7 @@ def __init__(self):
self.plot.setRange(xRange=(0.0, 1.0), yRange=(0.0, 1.0))

self.mainArea.layout().addWidget(self.plot)
self._init_projector()
Contributor

If we are calling self._init_projector() here, can we then remove lines 70–74?

        self._pca_preprocessors = TruncatedSVD.preprocessors

    def _update_center(self):
        if self.center and self.data is not None and self.data.is_sparse():
Contributor

This code is currently unreachable, if I am not mistaken: you are setting self.center = False in set_data when the data is sparse. Is it here as a sanity check?

return
# PCA does not support sparse data
# Falling back to TruncatedSVD aka LSA
self.center = False
Contributor

Besides self.center, self.normalize should also be disabled and set to False for sparse data sets.

The default normalization cannot handle sparse data sets. It might have worked on BoW data (for which all features are marked to be skipped during normalization), but it does not work for sparse data sets in general.

To reproduce the problem, transform Iris to sparse in a Python Script widget, pass it to PCA and check Normalize data.
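
A hedged sketch of such a Python Script snippet (it assumes Table.from_numpy accepts a scipy sparse X, which is how a table ends up sparse):

import scipy.sparse as sp
from Orange.data import Table

iris = Table("iris")
# Rebuild the table with a sparse X so that is_sparse() returns True.
out_data = Table.from_numpy(iris.domain, sp.csr_matrix(iris.X), iris.Y)
# Send out_data to the PCA widget and tick "Normalize data" to reproduce.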

@@ -125,6 +126,8 @@ def __init__(self):
        self.options_box = gui.vBox(self.controlArea, "Options")
        gui.checkBox(self.options_box, self, "normalize", "Normalize data",
                     callback=self._update_normalize)
        self.center_box = gui.checkBox(self.options_box, self, "center",
                                       "Center data", callback=self._update_center)
Contributor

I'm not sure Center data is the right name for this checkbox. The problem, as I see it, is that we also have the Normalize data option above, which can be misleading, since data normalization by default implies data centering. If I select Normalize data and don't select Center data, I would expect the data to be only scaled, but this is not what happens.

In essence, this checkbox is selecting between PCA and TruncatedSVD. Should we maybe put them in a dropdown?

@acopar (Contributor Author) commented May 17, 2017

Besides the comments above, I also found one functional glitch: if I pass Iris to PCA, clicking Center data on and off causes the number of components to jump from 3 to 4. Is this expected?

This is a limitation of TruncatedSVD: it needs the number of components to be strictly less than the number of features. See the n_components parameter in TruncatedSVD.
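
A hedged illustration with plain scikit-learn on Iris (4 features): PCA can return all 4 components, while TruncatedSVD is capped at 3, which is why the count in the widget jumps:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, TruncatedSVD

X = load_iris().data                                         # shape (150, 4)
print(PCA(n_components=4).fit_transform(X).shape)            # (150, 4)
print(TruncatedSVD(n_components=3).fit_transform(X).shape)   # (150, 3)
# TruncatedSVD(n_components=4).fit(X) raises ValueError:
# n_components must be < n_features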

In essence, this checkbox is selecting between PCA and TruncatedSVD. Should we maybe put them in a dropdown?

I converted this to radio buttons instead.

All other requests should be fulfilled in this commit. Refer to the file changes and commit message for details.

@codecov-io commented May 17, 2017

Codecov Report

Merging #2313 into master will increase coverage by 0.06%.
The diff coverage is 97.14%.

@@            Coverage Diff            @@
##           master   #2313      +/-   ##
=========================================
+ Coverage   73.24%   73.3%   +0.06%     
=========================================
  Files         317     317              
  Lines       55382   55429      +47     
=========================================
+ Hits        40566   40634      +68     
+ Misses      14816   14795      -21

@acopar force-pushed the pca-sparse branch 4 times, most recently from 5a1a5b6 to 68d92b9 on May 18, 2017 11:47
        if not decomposition.supports_sparse:
            self.assertFalse(buttons[i].isEnabled())

data = Table("iris")
Contributor

A call to self.widget.set_data(data) is missing after this line.

self.assertTrue(decomposition.supports_sparse)
self.assertFalse(self.widget.normalize_box.isEnabled())

buttons = self.decomposition_box.group.box.buttons
@nikicc (Contributor) May 18, 2017

This should be just buttons = self.widget.decomposition_box.buttons

@@ -82,6 +82,8 @@ class SklProjector(Projector, metaclass=WrapperMeta):
    __wraps__ = None
    _params = {}
    name = 'skl projection'
    supports_sparse = False
Contributor

Should this be moved to Projector?

DECOMPOSITIONS = [
    PCA,
    TruncatedSVD
]
Contributor

Please add one more empty line to be compliant with the PEP 8 recommendation about blank lines.

        self.__truncated_svd_test_helper(data, n_components=31, min_variance=0.99)

    def __truncated_svd_test_helper(self, data, n_components, min_variance):
        trsvd = TruncatedSVD(n_components=n_components)
Contributor

This can probably be just model = TruncatedSVD(n_components=n_components)(data)?
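
For reference, a hedged sketch of the call convention this suggestion relies on (the Orange.projection import path is assumed): constructing the projector and calling it on data fits the model in one step, and the fitted model can then transform tables:

from Orange.data import Table
from Orange.projection import TruncatedSVD   # import path assumed

data = Table("iris")
model = TruncatedSVD(n_components=3)(data)   # construct and fit in one call
transformed = model(data)                    # table with 3 components
print(sum(model.explained_variance_ratio_))  # attribute assumed to be forwarded from sklearn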

"decomposition_idx", [d.name for d in DECOMPOSITIONS],
box="Decomposition", callback=self._update_decomposition
)

# Options
self.options_box = gui.vBox(self.controlArea, "Options")
Contributor

Lint suggestion: options_box and maxp_spin don't need to be attributes of self.

self.clear_outputs()
return
# PCA does not support sparse data
# Falling back to TruncatedSVD aka LSA
Contributor

This comment doesn't apply any more, does it?

@@ -160,6 +174,23 @@ def update_model(self):
        else:
            self.__timer.stop()

    def update_buttons_sparse(self):
Contributor

Instead of this I would simply put:

def toggle_buttons(self, sparse_data=False):
    buttons = self.decomposition_box.buttons
    for i, cls in enumerate(DECOMPOSITIONS):
        buttons[i].setDisabled(sparse_data and not cls.supports_sparse)

and then call with self.toggle_buttons(data is not None and data.is_sparse())

@@ -194,11 +226,21 @@ def set_data(self, data):
        self.start_button.setEnabled(True)
        if not isinstance(data, SqlTable):
            self.sampling_box.setVisible(False)

        self.openContext(data)
        if isinstance(data, Table):
            if data.is_sparse():
Contributor

I would put this:

    self.openContext(data)
    if data.is_sparse():
        self.normalize = False
        self.normalize_box.setEnabled(False)
    else:
        self.normalize_box.setEnabled(True)
    self._update_decomposition()

self.toggle_buttons(data is not None and data.is_sparse())
self.data = data
self.fit()

after checks for errors.

@acopar force-pushed the pca-sparse branch 6 times, most recently from f14a18c to dc7fdc6 on May 19, 2017 14:31
@acopar (Contributor Author) commented May 19, 2017

QRadioButton has no isDisabled() method, so I have to use isEnabled() in the comparisons. Other than that, all requests should be fixed.
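
A hedged, minimal check of that Qt API detail (the AnyQt import, as used elsewhere in Orange, is assumed):

from AnyQt.QtWidgets import QApplication, QRadioButton

app = QApplication([])
button = QRadioButton("PCA")
button.setDisabled(True)    # setDisabled()/setEnabled() are the setters
print(button.isEnabled())   # False; isEnabled() is the only getter, there is no isDisabled()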

# shape of the X matrix (after preprocessing) is higher than
# max_components, so that sklearn does not always compute the full
# transform, which is faster and uses less memory for big data.
self.max_components = max_components
Contributor

I'd have max_components removed (here and in PCA above). No good reason for it. None.

Contributor

+1 for removing max_components.

@acopar force-pushed the pca-sparse branch 2 times, most recently from 827979a to fa97f23 on May 19, 2017 15:33
@acopar (Contributor Author) commented May 19, 2017

max_components is removed from PCA and TruncatedSVD models. Instead, n_components is used to perform the same task.

@nikicc force-pushed the pca-sparse branch 2 times, most recently from fef00a4 to 99e9f96 on May 19, 2017 19:56
@nikicc (Contributor) commented May 19, 2017

@markotoplak we removed the max_components argument from PCA, which you recently introduced in #2234, and propose simply using n_components instead. Please check that all is OK, especially commit 99e9f96.

@nikicc added the DH2017 label May 25, 2017
@acopar force-pushed the pca-sparse branch 2 times, most recently from ef88585 to 44b34c1 on May 25, 2017 14:39
acopar and others added 6 commits May 26, 2017 12:01
supports_sparse indicates whether a projector can handle sparse data.
Added TruncatedSVD model (SklProjector) for sparse data.
The PCA widget uses sklearn's TruncatedSVD when dealing with sparse data, since sklearn's PCA does not support sparse data. The old behaviour is preserved for non-sparse data (sklearn.decomposition.PCA).

- Use radio buttons to select the decomposition (PCA, TruncatedSVD).
- On sparse data, TruncatedSVD is selected; PCA and normalization are disabled.

Changes:
widgets/unsupervised/owpca.py:
   - Default method: PCA
   - When the input data is sparse, the widget switches to TruncatedSVD automatically and PCA is disabled.
   - In addition, the user can choose TruncatedSVD for non-sparse data. Due to sklearn's limitation, only n_features-1 components can be shown.

projection/pca.py:
   - Camel-case Projector model names, because they are used to display the method name in the widget.
The test checks that decompositions which do not support sparse data are disabled, and that the normalize box is disabled for sparse data.
max_components and n_components cannot both be set in the model. max_components is redundant, since a copy of the parameters (not including max_components) is passed to scikit-learn. Instead, n_components is set directly, without the max_components threshold.
@markotoplak (Member)

@nikicc What removing max_components changes is that the number of components in the output can now be smaller than n_components (previously, an error was raised). This changes nothing in the widgets, but script users should take care to check the output. I think this is OK.
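
A hedged sketch of the check script users might now want to make (the Orange.projection import path is assumed; the clamped count depends on the data shape):

from Orange.data import Table
from Orange.projection import PCA

data = Table("iris")                  # only 4 features
model = PCA(n_components=100)(data)   # no longer raises an error
print(model.components_.shape[0])     # can be smaller than requested, here at most 4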

@markotoplak merged commit 070f696 into biolab:master May 26, 2017
@jerneju (Contributor) commented Jun 5, 2017
