[ENH] PCA on sparse data. #2313

Merged: 6 commits into biolab:master, May 26, 2017

Conversation

@acopar (Contributor) commented May 12, 2017

Issue

Fixes #2255.
Use TruncatedSVD (LSA) on sparse data instead of PCA, since PCA does not work on sparse data. The difference is that TruncatedSVD computes only the leading (truncated) set of singular vectors and does not center the data before running the SVD. For dense data, the functionality remains the same.
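
For context, here is a minimal sketch in plain scikit-learn (outside Orange, with arbitrary shapes and density) of why this fallback is needed; PCA rejects scipy sparse input while TruncatedSVD factorizes it directly:

import scipy.sparse as sp
from sklearn.decomposition import PCA, TruncatedSVD

X = sp.random(100, 50, density=0.05, format="csr", random_state=0)

# PCA(n_components=2).fit(X) would raise a TypeError on sparse input.
svd = TruncatedSVD(n_components=2)   # works on sparse matrices, no centering
Z = svd.fit_transform(X)             # dense array of shape (100, 2)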

Description of changes

Workflow:
Corpus (Select bookexcerpts.tab) -> Bag of Words -> PCA
(screenshot: pca)

  • widgets/unsupervised/owpca.py: Select decomposition with radio buttons.
  • projection/pca.py: TruncatedSVD model
  • tests: test_pca.py
Includes
  • Code changes
  • Tests
  • Documentation

@acopar mentioned this pull request May 12, 2017
# shape of X after preprocessing
# Add -1 to avoid error in scikit fit_transform
# ValueError: n_components must be < n_features (strict)
params["n_components"] = min(min(X.shape) - 1, self.max_components)
Contributor

Why, then, if n_components and max_components do about the same thing, not have just n_components and override it if the shape is too small?

E.g. what happens if n_components=1e12 is passed and X.shape == (3, 3)?

Contributor Author

I agree, max_components was in there because of similarities with PCA, but it is in fact not needed for the TruncatedSVD model. To answer your question, n_components is bound by max_components from inside the widget, so in this particular case it would not make any difference :) Will fix it.
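
As a hedged sketch (clamp_n_components is a hypothetical helper, not code from this PR), bounding the request by the preprocessed shape directly also answers the n_components=1e12 question above:

import numpy as np

def clamp_n_components(requested, X):
    # sklearn's TruncatedSVD requires n_components < n_features (strict),
    # so bound the requested value by the data shape.
    return min(min(X.shape) - 1, requested)

X = np.zeros((3, 3))
print(clamp_n_components(10**12, X))   # -> 2, even for an absurd request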

        self.__truncated_svd_test_helper(data, n_com=10, min_xpl_var=0.7)
        self.__truncated_svd_test_helper(data, n_com=31, min_xpl_var=0.99)

    def __truncated_svd_test_helper(self, data, n_com, min_xpl_var):
Contributor

It's actually ok to use full, expressive variable names where common abbreviations don't exist (e.g. n_ and min_ are ok).

@nikicc (Contributor) left a comment

Besides the comments above, I also found one functional glitch: if I pass Iris to PCA, clicking Center data on and off causes the number of components to jump from 3 to 4. Is this expected?

@@ -96,7 +96,7 @@ def __init__(self, pca):

     def __call__(self, data):
         if data.domain != self.pca.pre_domain:
-            data = data.transform(self.pca.pre_domain)
+            data = data.from_table(self.pca.pre_domain, data)
Contributor

What's wrong with data = data.transform(self.pca.pre_domain)?

    __wraps__ = skl_decomposition.TruncatedSVD
    name = 'truncated svd'

    def __init__(self, n_components=None, copy=True, whiten=False,
Contributor

TruncatedSVD doesn't seem to have copy, whiten, svd_solver or iterated_power attributes. On the contrary, it has algorithm and n_iter attributes, which are missing from our constructor.
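
A hedged sketch of a constructor aligned with sklearn's actual TruncatedSVD parameters; the SklProjector base class is taken from the surrounding diff, while the import path and the self.params = vars() idiom are assumed from the usual Orange wrapper pattern:

import sklearn.decomposition as skl_decomposition
from Orange.projection import SklProjector   # import path assumed

class TruncatedSVD(SklProjector):
    __wraps__ = skl_decomposition.TruncatedSVD
    name = 'truncated svd'
    supports_sparse = True

    def __init__(self, n_components=2, algorithm='randomized', n_iter=5,
                 random_state=None, tol=0.0, preprocessors=None):
        super().__init__(preprocessors=preprocessors)
        self.params = vars()   # forwarded to the wrapped sklearn estimator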

        super().__init__(preprocessors=preprocessors)
        if n_components is not None and max_components is not None:
            raise ValueError("n_components and max_components can not both be defined.")
        # max_components limits the number of PCA components if the minimum
Contributor

max_components limits the number of SVD components if the minimum

        self.__truncated_svd_test_helper(data, n_com=31, min_xpl_var=0.99)

    def __truncated_svd_test_helper(self, data, n_com, min_xpl_var):
        pca = TruncatedSVD(n_components=n_com)
Contributor

Should we use svd variable name here and below? This is a test for SVD, not PCA.

@@ -375,6 +386,26 @@ def _update_normalize(self):
        if self.data is None:
            self._invalidate_selection()

    def _init_projector(self):
        if self.center:
@nikicc (Contributor) May 16, 2017

What about:

projector = PCA if self.center else TruncatedSVD
self._pca_projector = projector(n_components=MAX_COMPONENTS)
self._pca_projector.component = self.ncomponents
self._pca_preprocessors = projector.preprocessors

@@ -151,6 +154,7 @@ def __init__(self):
self.plot.setRange(xRange=(0.0, 1.0), yRange=(0.0, 1.0))

self.mainArea.layout().addWidget(self.plot)
self._init_projector()
Contributor

If we are calling self._init_projector() here, can we then remove lines 70–74?

        self._pca_preprocessors = TruncatedSVD.preprocessors

    def _update_center(self):
        if self.center and self.data is not None and self.data.is_sparse():
Contributor

This code is currently unreachable, if I am not mistaken: you are setting self.center = False in set_data when the data is sparse. Is it here as a sanity check?

return
# PCA does not support sparse data
# Falling back to TruncatedSVD aka LSA
self.center = False
Contributor

Besides self.center, self.normalize should also be disabled and set to False for sparse data sets.

The default normalization cannot handle sparse data sets. It might have worked on BoW data (for which all features are marked to be skipped during normalization), but it does not work for sparse data sets in general.

To reproduce the problem, transform Iris to sparse in a Python Script widget, pass it to PCA and check Normalize data.
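
A hedged sketch of such a Python Script snippet (it assumes Table.from_numpy accepts a scipy sparse X, which is how a table ends up sparse):

import scipy.sparse as sp
from Orange.data import Table

iris = Table("iris")
# Rebuild the table with a sparse X so that is_sparse() returns True.
out_data = Table.from_numpy(iris.domain, sp.csr_matrix(iris.X), iris.Y)
# Send out_data to the PCA widget and tick "Normalize data" to reproduce.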

@@ -125,6 +126,8 @@ def __init__(self):
        self.options_box = gui.vBox(self.controlArea, "Options")
        gui.checkBox(self.options_box, self, "normalize", "Normalize data",
                     callback=self._update_normalize)
        self.center_box = gui.checkBox(self.options_box, self, "center",
                                       "Center data", callback=self._update_center)
Contributor

I'm not sure Center data is the right name for this checkbox. The problem, as I see it, is that we also have the Normalize data option above, which can be misleading, since data normalization by default implies data centering. If I select Normalize data and don't select Center data, I would expect the data to be only scaled, but this is not what happens.

In essence, this checkbox is selecting between PCA and TruncatedSVD. Should we maybe put them in a dropdown?

@acopar (Contributor Author) commented May 17, 2017

Besides the comments above, I also found one functional glitch: if I pass Iris to PCA, clicking Center data on and off causes the number of components to jump from 3 to 4. Is this expected?

This is a limitation of TruncatedSVD: it needs the number of components to be strictly less than the number of features. See the n_components parameter in TruncatedSVD.
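
A hedged illustration with plain scikit-learn on Iris (4 features): PCA can return all 4 components, while TruncatedSVD is capped at 3, which is why the count in the widget jumps:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, TruncatedSVD

X = load_iris().data                                         # shape (150, 4)
print(PCA(n_components=4).fit_transform(X).shape)            # (150, 4)
print(TruncatedSVD(n_components=3).fit_transform(X).shape)   # (150, 3)
# TruncatedSVD(n_components=4).fit(X) raises ValueError:
# n_components must be < n_features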

In essence, this checkbox is selecting between PCA and TruncatedSVD. Should we maybe put them in a dropdown?

I converted this to radio buttons instead.

All other requests should be fulfilled in this commit. Refer to the file changes and commit message for details.

@codecov-io commented May 17, 2017

Codecov Report

Merging #2313 into master will increase coverage by 0.06%.
The diff coverage is 97.14%.

@@            Coverage Diff            @@
##           master   #2313      +/-   ##
=========================================
+ Coverage   73.24%   73.3%   +0.06%     
=========================================
  Files         317     317              
  Lines       55382   55429      +47     
=========================================
+ Hits        40566   40634      +68     
+ Misses      14816   14795      -21

@acopar force-pushed the pca-sparse branch 4 times, most recently from 5a1a5b6 to 68d92b9 on May 18, 2017 11:47
        if not decomposition.supports_sparse:
            self.assertFalse(buttons[i].isEnabled())

data = Table("iris")
Contributor

A call to self.widget.set_data(data) is missing after this line.

self.assertTrue(decomposition.supports_sparse)
self.assertFalse(self.widget.normalize_box.isEnabled())

buttons = self.decomposition_box.group.box.buttons
@nikicc (Contributor) May 18, 2017

This should be just buttons = self.widget.decomposition_box.buttons

@@ -82,6 +82,8 @@ class SklProjector(Projector, metaclass=WrapperMeta):
    __wraps__ = None
    _params = {}
    name = 'skl projection'
    supports_sparse = False
Contributor

Should this be moved to Projector?

DECOMPOSITIONS = [
    PCA,
    TruncatedSVD
]
Contributor

Please add one more empty line to be compliant with the PEP 8 recommendation about blank lines.

        self.__truncated_svd_test_helper(data, n_components=31, min_variance=0.99)

    def __truncated_svd_test_helper(self, data, n_components, min_variance):
        trsvd = TruncatedSVD(n_components=n_components)
Contributor

This can probably be just model = TruncatedSVD(n_components=n_components)(data)?
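
For reference, a hedged sketch of the call convention this suggestion relies on (the Orange.projection import path is assumed): constructing the projector and calling it on data fits the model in one step, and the fitted model can then transform tables:

from Orange.data import Table
from Orange.projection import TruncatedSVD   # import path assumed

data = Table("iris")
model = TruncatedSVD(n_components=3)(data)   # construct and fit in one call
transformed = model(data)                    # table with 3 components
print(sum(model.explained_variance_ratio_))  # attribute assumed to be forwarded from sklearn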

"decomposition_idx", [d.name for d in DECOMPOSITIONS],
box="Decomposition", callback=self._update_decomposition
)

# Options
self.options_box = gui.vBox(self.controlArea, "Options")
Contributor

Lint suggestion: options_box and maxp_spin don't need to be attributes of self.

self.clear_outputs()
return
# PCA does not support sparse data
# Falling back to TruncatedSVD aka LSA
Contributor

This comment doesn't apply any more, does it?

@@ -160,6 +174,23 @@ def update_model(self):
        else:
            self.__timer.stop()

    def update_buttons_sparse(self):
Contributor

Instead of this I would simply put:

def toggle_buttons(self, sparse_data=False):
    buttons = self.decomposition_box.buttons
    for i, cls in enumerate(DECOMPOSITIONS):
        buttons[i].setDisabled(sparse_data and not cls.supports_sparse)

and then call with self.toggle_buttons(data is not None and data.is_sparse())

@@ -194,11 +226,21 @@ def set_data(self, data):
        self.start_button.setEnabled(True)
        if not isinstance(data, SqlTable):
            self.sampling_box.setVisible(False)

        self.openContext(data)
        if isinstance(data, Table):
            if data.is_sparse():
Contributor

I would put this:

    self.openContext(data)
    if data.is_sparse():
        self.normalize = False
        self.normalize_box.setEnabled(False)
    else:
        self.normalize_box.setEnabled(True)
    self._update_decomposition()

self.toggle_buttons(data is not None and data.is_sparse())
self.data = data
self.fit()

after checks for errors.

@acopar force-pushed the pca-sparse branch 6 times, most recently from f14a18c to dc7fdc6 on May 19, 2017 14:31
@acopar (Contributor Author) commented May 19, 2017

QRadioButton has no isDisabled() method, so I have to use isEnabled() in the comparisons. Other than that, all requests should be fixed.
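
A hedged, minimal check of that Qt API detail (the AnyQt import, as used elsewhere in Orange, is assumed):

from AnyQt.QtWidgets import QApplication, QRadioButton

app = QApplication([])
button = QRadioButton("PCA")
button.setDisabled(True)    # setDisabled()/setEnabled() are the setters
print(button.isEnabled())   # False; isEnabled() is the only getter, there is no isDisabled()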

# shape of the X matrix (after preprocessing) is higher than
# max_components, so that sklearn does not always compute the full
# transform, which is faster and uses less memory for big data.
self.max_components = max_components
Contributor

I'd have max_components removed (here and in PCA above). No good reason for it. None.

Contributor

+1 for removing max_components.

@acopar force-pushed the pca-sparse branch 2 times, most recently from 827979a to fa97f23 on May 19, 2017 15:33
@acopar (Contributor Author) commented May 19, 2017

max_components is removed from PCA and TruncatedSVD models. Instead, n_components is used to perform the same task.

@nikicc force-pushed the pca-sparse branch 2 times, most recently from fef00a4 to 99e9f96 on May 19, 2017 19:56
@nikicc (Contributor) commented May 19, 2017

@markotoplak we removed the max_components argument from PCA, which you recently introduced in #2234, and propose simply using n_components instead. Please check that all is OK, especially commit 99e9f96.

@nikicc added the DH2017 label May 25, 2017
@acopar force-pushed the pca-sparse branch 2 times, most recently from ef88585 to 44b34c1 on May 25, 2017 14:39
acopar and others added 6 commits May 26, 2017 12:01
supports_sparse indicates whether a projector can handle sparse data.
Added TruncatedSVD model (SklProjector) for sparse data.
The PCA widget uses sklearn's TruncatedSVD when dealing with sparse data, since sklearn's PCA does not support sparse data. The old behaviour is preserved for non-sparse data (sklearn.decomposition.PCA).

- Use radio buttons to select the decomposition (PCA, TruncatedSVD).
- On sparse data, TruncatedSVD is selected; PCA and normalization are disabled.

Changes:
widgets/unsupervised/owpca.py:
   - Default method: PCA
   - When the input data is sparse, the widget switches to TruncatedSVD automatically and PCA is disabled.
   - In addition, the user can choose TruncatedSVD for non-sparse data. Due to sklearn's limitation, only n_features-1 components can be shown.

projection/pca.py:
   - Camel-case Projector model names, because they are used to display the method name in the widget.
The test checks that decompositions which do not support sparse data are disabled, and that the normalize box is disabled for sparse data.
max_components and n_components cannot both be set in the model. max_components is redundant, since a copy of the parameters (not including max_components) is passed to scikit-learn. Instead, n_components is set directly, without the max_components threshold.
@markotoplak (Member)

@nikicc What removing max_components changes is that the number of components in the output can now be smaller than n_components (previously, an error was raised). This changes nothing in the widgets, but script users should take care to check the output. I think this is OK.
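
A hedged sketch of the check script users might now want to make (the Orange.projection import path is assumed; the clamped count depends on the data shape):

from Orange.data import Table
from Orange.projection import PCA

data = Table("iris")                  # only 4 features
model = PCA(n_components=100)(data)   # no longer raises an error
print(model.components_.shape[0])     # can be smaller than requested, here at most 4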

@markotoplak merged commit 070f696 into biolab:master May 26, 2017
@jerneju (Contributor) commented Jun 5, 2017
