-
-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ENH] PCA on sparse data. #2313
Conversation
Orange/projection/pca.py
Outdated
# shape of X after preprocessing | ||
# Add -1 to avoid error in scikit fit_transform | ||
# ValueError: n_components must be < n_features (strict) | ||
params["n_components"] = min(min(X.shape) - 1, self.max_components) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why, then, instead of n_components
and max_components
doing about the same thing not have just n_components
and override it if shape too small?
E.g. what happens if n_components=1e12
is passed and X.shape == (3, 3)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, max_components was in there because of similarities with PCA, but is in fact not needed for TruncatedSVD model. To answer your question, n_components is bound by max_components from inside the widget, so in this particular case it would not make any difference :) Will fix it.
Orange/tests/test_pca.py
Outdated
self.__truncated_svd_test_helper(data, n_com=10, min_xpl_var=0.7) | ||
self.__truncated_svd_test_helper(data, n_com=31, min_xpl_var=0.99) | ||
|
||
def __truncated_svd_test_helper(self, data, n_com, min_xpl_var): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's actually ok to use full, expressive variable names where common abbreviations don't exist (e.g. n_
and min_
are ok).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Besides comments above, I also found one functional glitch. If I pass Iris to PCA then clicking on and off to Center data
causes number of components to jump from 3 to 4. Is this expected?
Orange/projection/pca.py
Outdated
@@ -96,7 +96,7 @@ def __init__(self, pca): | |||
|
|||
def __call__(self, data): | |||
if data.domain != self.pca.pre_domain: | |||
data = data.transform(self.pca.pre_domain) | |||
data = data.from_table(self.pca.pre_domain, data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's wrong with data = data.transform(self.pca.pre_domain)
?
Orange/projection/pca.py
Outdated
__wraps__ = skl_decomposition.TruncatedSVD | ||
name = 'truncated svd' | ||
|
||
def __init__(self, n_components=None, copy=True, whiten=False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TruncatedSVD
doesn't seem to have copy
, whiten
, svd_solver
or iterated_power
attribute. Contrary, it has attributes algorithm
and n_iter
which are missing in our constructor.
Orange/projection/pca.py
Outdated
super().__init__(preprocessors=preprocessors) | ||
if n_components is not None and max_components is not None: | ||
raise ValueError("n_components and max_components can not both be defined.") | ||
# max_components limits the number of PCA components if the minimum |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
max_components limits the number of SVD components if the minimum
Orange/tests/test_pca.py
Outdated
self.__truncated_svd_test_helper(data, n_com=31, min_xpl_var=0.99) | ||
|
||
def __truncated_svd_test_helper(self, data, n_com, min_xpl_var): | ||
pca = TruncatedSVD(n_components=n_com) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we use svd
variable name here and below? This is a test for SVD, not PCA.
Orange/widgets/unsupervised/owpca.py
Outdated
@@ -375,6 +386,26 @@ def _update_normalize(self): | |||
if self.data is None: | |||
self._invalidate_selection() | |||
|
|||
def _init_projector(self): | |||
if self.center: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about:
projector = PCA if self.center else TruncatedSVD
self._pca_projector = projector(n_components=MAX_COMPONENTS)
self._pca_projector.component = self.ncomponents
self._pca_preprocessors = projector.preprocessors
Orange/widgets/unsupervised/owpca.py
Outdated
@@ -151,6 +154,7 @@ def __init__(self): | |||
self.plot.setRange(xRange=(0.0, 1.0), yRange=(0.0, 1.0)) | |||
|
|||
self.mainArea.layout().addWidget(self.plot) | |||
self._init_projector() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are calling self._init_projector()
here, can we then remove lines 70–74?
Orange/widgets/unsupervised/owpca.py
Outdated
self._pca_preprocessors = TruncatedSVD.preprocessors | ||
|
||
def _update_center(self): | ||
if self.center and self.data is not None and self.data.is_sparse(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code is currently unreachable, if I am not mistaken. You are setting self.center = False
if data is sparse in set_data
. It is here as a sanity check?
Orange/widgets/unsupervised/owpca.py
Outdated
return | ||
# PCA does not support sparse data | ||
# Falling back to TruncatedSVD aka LSA | ||
self.center = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Besides self.center
also self.normalize
shoud be disabled and set to False for sparse data sets.
Default normalization cannot handle sparse data set. It might have worked on BoW data (for which all features are marked to be skipped during normalization), but it does not work for sparse data sets in general.
To reproduce the problem, transform Iris to sparse in Python script, pass it to PCA and check Normalize data.
Orange/widgets/unsupervised/owpca.py
Outdated
@@ -125,6 +126,8 @@ def __init__(self): | |||
self.options_box = gui.vBox(self.controlArea, "Options") | |||
gui.checkBox(self.options_box, self, "normalize", "Normalize data", | |||
callback=self._update_normalize) | |||
self.center_box = gui.checkBox(self.options_box, self, "center", | |||
"Center data", callback=self._update_center) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure Center data
is the right name for this checkbox. The problem as I see it, is that we also have Normalize data
option above, which can be misleading since data normalization by default implicates data centering. If I select Normalize data and don't select Center data I would expect that the data is only scales, but this is not what happens.
In essence this checkbox is selecting between PCA and TruncatedSVD. Should we maybe put them to a dropdown?
This is a limitation of TruncatedSVD. It needs the number of components to be strictly less than the number of features. See n_components parameter in TruncatedSVD
I converted this to radio buttons instead. All other requests should be fulfilled in this commit. Refer to the file changes and commit message for details. |
Codecov Report
@@ Coverage Diff @@
## master #2313 +/- ##
=========================================
+ Coverage 73.24% 73.3% +0.06%
=========================================
Files 317 317
Lines 55382 55429 +47
=========================================
+ Hits 40566 40634 +68
+ Misses 14816 14795 -21 |
5a1a5b6
to
68d92b9
Compare
if not decomposition.supports_sparse: | ||
self.assertFalse(buttons[i].isEnabled()) | ||
|
||
data = Table("iris") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.widget.set_data(data)
after this line is missing.
self.assertTrue(decomposition.supports_sparse) | ||
self.assertFalse(self.widget.normalize_box.isEnabled()) | ||
|
||
buttons = self.decomposition_box.group.box.buttons |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be just buttons = self.widget.decomposition_box.buttons
@@ -82,6 +82,8 @@ class SklProjector(Projector, metaclass=WrapperMeta): | |||
__wraps__ = None | |||
_params = {} | |||
name = 'skl projection' | |||
supports_sparse = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be moved to Projector
?
DECOMPOSITIONS = [ | ||
PCA, | ||
TruncatedSVD | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add one more empty line to be compliant with PEP recommendation about blank lines.
Orange/tests/test_pca.py
Outdated
self.__truncated_svd_test_helper(data, n_components=31, min_variance=0.99) | ||
|
||
def __truncated_svd_test_helper(self, data, n_components, min_variance): | ||
trsvd = TruncatedSVD(n_components=n_components) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can probably be just model = TruncatedSVD(n_components=n_components)(data)
?
"decomposition_idx", [d.name for d in DECOMPOSITIONS], | ||
box="Decomposition", callback=self._update_decomposition | ||
) | ||
|
||
# Options | ||
self.options_box = gui.vBox(self.controlArea, "Options") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lint suggestion: options_box
and maxp_spin
don't need to be attributes of self
.
Orange/widgets/unsupervised/owpca.py
Outdated
self.clear_outputs() | ||
return | ||
# PCA does not support sparse data | ||
# Falling back to TruncatedSVD aka LSA |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment doesn't apply any more, does it?
Orange/widgets/unsupervised/owpca.py
Outdated
@@ -160,6 +174,23 @@ def update_model(self): | |||
else: | |||
self.__timer.stop() | |||
|
|||
def update_buttons_sparse(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of this I would simply put:
def toggle_buttons(self, sparse_data=False):
buttons = self.decomposition_box.buttons
for i, cls in enumerate(DECOMPOSITIONS):
buttons[i].setDisabled(sparse_data and not cls.supports_sparse)
and then call with self.toggle_buttons(data is not None and data.is_sparse())
Orange/widgets/unsupervised/owpca.py
Outdated
@@ -194,11 +226,21 @@ def set_data(self, data): | |||
self.start_button.setEnabled(True) | |||
if not isinstance(data, SqlTable): | |||
self.sampling_box.setVisible(False) | |||
|
|||
self.openContext(data) | |||
if isinstance(data, Table): | |||
if data.is_sparse(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would put this:
self.openContext(data)
if data.is_sparse():
self.normalize = False
self.normalize_box.setEnabled(False)
else:
self.normalize_box.setEnabled(True)
self._update_decomposition()
self.toggle_buttons(data is not None and data.is_sparse())
self.data = data
self.fit()
after checks for errors.
f14a18c
to
dc7fdc6
Compare
QRadioButton has no attribute isDisabled(), so I must use isEnabled function in comparisons. Other than that, all other requests should be fixed. |
Orange/projection/pca.py
Outdated
# shape of the X matrix (after preprocessing) is higher than | ||
# max_components, so that sklearn does not always compute the full | ||
# transform, which is faster and uses less memory for big data. | ||
self.max_components = max_components |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd have max_components
removed (here and in PCA
above). No good reason for it. None.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for removing max_components
.
827979a
to
fa97f23
Compare
max_components is removed from PCA and TruncatedSVD models. Instead, n_components is used to perform the same task. |
fef00a4
to
99e9f96
Compare
@markotoplak we removed |
ef88585
to
44b34c1
Compare
supports_sparse indicates whether projector can handle sparse data.
Added TruncatedSVD model (SklProjector) for sparse data.
PCA widget uses sklearn's TruncatedSVD when dealing with sparse data, since sklearn`s PCA does not support sparse data. Preserve old behaviour for non-sparse data (sklearn.decomposition.PCA). - Use radio button to select decomposition (PCA, TruncatedSVD). - On sparse data, TruncatedSVD is selected. PCA and normalization are disabled. Changes: widgets/unsupervised/owpca.py: - Default method: PCA - When input data is sparse, changes to TruncatedSVD automatically, PCA is disabled. - In addition, user can Choose TruncatedSVD for non-sparse data. Due to sklearn's limitations, results for only n_features-1 features can be shown. projection/pca.py: - Camel case Projector model names, because they are used to display method name in widget.
Test checks if decompositions that do not support sparse data are disabled. Also checks that normalize box is disabled for sparse data.
max_components and n_components cannot both be set in model. max_components is redundant, since a copy of parameters (not including max_components) is passed to scikit. Instead, n_components is set directly without the use of max_components threshold.
@nikicc What removing max_components changes is that the number of components on the output can now be smaller than n_components (before an error was raised). This changes nothing in the widgets, but the script users should take care to check the output. I think this is OK. |
Issue
Fixes #2255.
Use TruncatedSVD (LSA) on sparse data instead of PCA, since PCA does not work on sparse data. The difference is that TruncatedSVD truncates the vectors and does not center the data before calling svd. For dense data, functionality remains the same.
Description of changes
Workflow:
Corpus (Select bookexcerpts.tab) -> Bag of Words -> PCA
Includes