Sparse kde #222

Merged
merged 173 commits into from
Oct 9, 2024

Conversation

GardevoirX
Contributor

@GardevoirX GardevoirX commented Feb 16, 2024

This PR introduces SparseKDE:

  • The class SparseKDE is located at src/skmatter/utils/_sparsekde.py. It reduces the high cost of KDE on large datasets by estimating the density only at selected data points (e.g. grid points sampled by farthest point sampling). This class takes the original dataset as a parameter and fits the model using the sampled grid points.
  • Two auxiliary classes and some auxiliary functions for SparseKDE are also stored in src/skmatter/utils/_sparsekde.py.
  • Two distance metrics compatible with PBC, pairwise_euclidean_distances and pairwise_mahalanobis_distances, are implemented in src/skmatter/metrics/pairwise.py.
  • Tests for SparseKDE and some auxiliary functions are stored in tests/test_neighbors.py. Tests for distance metrics are stored in tests/test_metrics.py.
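
To illustrate the idea in the first bullet, here is a minimal NumPy sketch. It is not the code in this PR: it picks grid points by farthest point sampling and evaluates a fixed-bandwidth Gaussian KDE of the full dataset only at those grid points. The function names `farthest_point_sampling` and `kde_on_grid`, and the single scalar `bandwidth`, are assumptions made for illustration.

```python
import numpy as np


def farthest_point_sampling(X, n_grid, first=0):
    """Return indices of n_grid rows of X, chosen greedily to be far apart."""
    selected = [first]
    # Distance from every sample to its nearest already-selected point.
    min_dist = np.linalg.norm(X - X[first], axis=1)
    for _ in range(n_grid - 1):
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(X - X[nxt], axis=1))
    return np.array(selected)


def kde_on_grid(X, grid, bandwidth, weights=None):
    """Evaluate an isotropic Gaussian KDE of X at the grid points only."""
    n_samples, n_dim = X.shape
    if weights is None:
        weights = np.full(n_samples, 1.0 / n_samples)
    # Squared distances between grid points and samples: (n_grid, n_samples).
    sq_dists = ((grid[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    norm = (2.0 * np.pi * bandwidth**2) ** (-n_dim / 2.0)
    return norm * (np.exp(-0.5 * sq_dists / bandwidth**2) @ weights)


rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
grid_idx = farthest_point_sampling(X, n_grid=20)
density = kde_on_grid(X, X[grid_idx], bandwidth=0.5)
```

The density is evaluated for `n_grid * n_samples` pairs instead of `n_samples ** 2`, which is where the saving comes from when the grid is much smaller than the dataset.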

I am not sure if the current API of SparseKDE is OK and if the auxiliary classes should be integrated into SparseKDE. Also, SparseKDE seems to be too large and complex. Perhaps it needs to be decomposed into smaller parts, but I have not figured out how.

Contributor (creator of PR) checklist

  • Tests updated (for new features and bugfixes)?
  • Documentation updated (for new features)?
  • Issue referenced (for PRs that solve an issue)?

For Reviewer

  • CHANGELOG updated if important change?

📚 Documentation preview 📚: https://scikit-matter--222.org.readthedocs.build/en/222/

@PicoCentauri PicoCentauri self-assigned this May 1, 2024
Collaborator

@PicoCentauri PicoCentauri left a comment

Very good. Most of the comments are fairly easy to address. There are two main points. First, you should test whether the user provides a cell of the size you expect, and raise an error otherwise. Second, instead of requiring strings for some parameters, we can accept callables directly.

examples/neighbors/sparse-kde.py Outdated Show resolved Hide resolved
examples/neighbors/sparse-kde.py Outdated Show resolved Hide resolved
examples/neighbors/sparse-kde.py Outdated Show resolved Hide resolved
examples/neighbors/sparse-kde.py Outdated Show resolved Hide resolved
examples/neighbors/sparse-kde.py Outdated Show resolved Hide resolved
src/skmatter/neighbors/_sparsekde.py Outdated Show resolved Hide resolved
src/skmatter/metrics/_pairwise.py Outdated Show resolved Hide resolved
src/skmatter/metrics/_pairwise.py Outdated Show resolved Hide resolved
src/skmatter/neighbors/_sparsekde.py Show resolved Hide resolved
src/skmatter/neighbors/_sparsekde.py Outdated Show resolved Hide resolved
@PicoCentauri
Collaborator

Also, can you please update the CHANGELOG file.

Collaborator

@PicoCentauri PicoCentauri left a comment

Very nice improvements.

CHANGELOG Outdated
@@ -11,8 +11,15 @@ The rules for CHANGELOG file:

.. inclusion-marker-changelog-start

0.3.0 (XXXX/XX/XX)
0.2.1 (XXXX/XX/XX)
Collaborator

Keep the version 0.3.0.

Suggested change
0.2.1 (XXXX/XX/XX)
0.3.0 (XXXX/XX/XX)

In the pyproject.toml we also use version 0.3.0. We will decide once we release whether we do a major, minor or patch release.

Collaborator

This looks like it has not been done?

examples/neighbors/sparse-kde.py Outdated Show resolved Hide resolved
labels: np.ndarray,
probs: np.ndarray,
normpks: float,
metric: str,
Collaborator

But I think passing an instance would be even easier. You don't even have to keep a dictionary and users can use their own distance measures without changing the code.
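
A sketch of what accepting a callable instead of a string could look like. The helper `nearest_grid_point`, its signature, and the two metric functions are hypothetical and are not taken from this PR; they only show that no name-to-function dictionary is needed and that a user-defined distance works without changes to the library code.

```python
from typing import Callable

import numpy as np


def euclidean(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Pairwise Euclidean distances, shape (len(X), len(Y))."""
    return np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)


def manhattan(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """A user-supplied metric: pairwise L1 distances."""
    return np.abs(X[:, None, :] - Y[None, :, :]).sum(axis=-1)


def nearest_grid_point(X, grid, metric: Callable = euclidean):
    """Hypothetical helper: index of the closest grid point for each sample."""
    if not callable(metric):
        raise TypeError(f"metric must be callable, got {type(metric).__name__}")
    return np.argmin(metric(X, grid), axis=1)


X = np.array([[0.0, 0.0], [2.0, 2.0], [0.9, 0.0]])
grid = np.array([[0.0, 0.0], [1.0, 1.0]])
labels_l2 = nearest_grid_point(X, grid)
labels_l1 = nearest_grid_point(X, grid, metric=manhattan)
```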

src/skmatter/neighbors/_sparsekde.py Outdated Show resolved Hide resolved
src/skmatter/clustering/__init__.py Outdated Show resolved Hide resolved
src/skmatter/clustering/_quick_shift.py Show resolved Hide resolved
src/skmatter/metrics/__init__.py Outdated Show resolved Hide resolved
src/skmatter/metrics/_pairwise.py Outdated Show resolved Hide resolved
    X, Y = check_pairwise_arrays(X, Y)
    cov_inv = _mahalanobis_preprocess(cov_inv)
    dists = _mahalanobis(cell, X, Y, cov_inv)
    if not squared:
        dists **= 0.5
    return dists


def _check_dimension(X, cell):
Collaborator

Do you also want to add a test here that the cell is rectangular? Maybe you already have it and I just overlooked it.

Contributor Author

Perhaps not, because the cell parameter has no place for angles, so every number in it is interpreted as a side length of a rectangular box. I will add a note to the documentation to inform the user that only rectangular cells are supported.

Collaborator

Okay, but if the cell is rectangular, your format is a 1D array of length 3?

This should be tested, and you should write in your docs that this is the expected format. If anything else is given, raise a meaningful error message giving the actual type and the type that you expect.

Contributor Author

Ah sorry, perhaps I should not have said rectangular. The cell can be rectangular, cubic, a 4-cube or an n-cube. There is no limitation on the dimension, so I only test whether the dimension of the cell matches the dimension of the descriptors.

Collaborator

Yes, rectangular. I also commented on this in more detail in another place. Maybe it makes sense to rename cell to cell_length to make clear that this is what we want.
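
The check discussed in this thread could be sketched as follows. This is an illustration under assumptions, not the merged code: the names `check_cell_length` and `periodic_euclidean_distances` are hypothetical, and the cell is assumed to be a 1D array of side lengths with one entry per descriptor component, applied through the minimum image convention.

```python
import numpy as np


def check_cell_length(X, cell_length):
    """Validate that cell_length has one positive side length per component."""
    cell_length = np.asarray(cell_length, dtype=float)
    n_components = X.shape[1]
    if cell_length.ndim != 1 or cell_length.shape[0] != n_components:
        raise ValueError(
            "cell_length must be a 1D array with one side length per "
            f"component: expected shape ({n_components},), "
            f"got {cell_length.shape}"
        )
    if np.any(cell_length <= 0):
        raise ValueError("all entries of cell_length must be positive")
    return cell_length


def periodic_euclidean_distances(X, Y, cell_length):
    """Pairwise Euclidean distances in an orthogonal periodic box."""
    cell_length = check_cell_length(X, cell_length)
    diff = X[:, None, :] - Y[None, :, :]
    # Minimum image convention: wrap each component into [-L/2, L/2].
    diff -= cell_length * np.round(diff / cell_length)
    return np.linalg.norm(diff, axis=-1)


X = np.array([[0.1, 0.5]])
Y = np.array([[0.9, 0.5]])
d = periodic_euclidean_distances(X, Y, cell_length=[1.0, 1.0])
```

Here the two points are 0.8 apart inside the box but only 0.2 apart through the periodic boundary, and a `cell_length` of the wrong length fails early with a message that states both the expected and the actual shape.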

src/skmatter/neighbors/_sparsekde.py Outdated Show resolved Hide resolved
Collaborator

@PicoCentauri PicoCentauri left a comment

looks good. I think we are almost there from my side.

src/skmatter/clustering/_quick_shift.py Outdated Show resolved Hide resolved
src/skmatter/clustering/_quick_shift.py Outdated Show resolved Hide resolved
src/skmatter/clustering/_quick_shift.py Outdated Show resolved Hide resolved
available.

.. note::
    Currently only rectangular cells are supported.
Collaborator

As we discussed above, maybe also give the expected format here.

src/skmatter/neighbors/_sparsekde.py Show resolved Hide resolved
Collaborator

@PicoCentauri PicoCentauri left a comment

Very cool. Now I also get the "problem" with a cell. I made a suggestion to make clear what the code can handle and what not.

src/skmatter/clustering/_quick_shift.py Outdated Show resolved Hide resolved
src/skmatter/clustering/_quick_shift.py Outdated Show resolved Hide resolved
src/skmatter/clustering/_quick_shift.py Outdated Show resolved Hide resolved
src/skmatter/metrics/_pairwise.py Outdated Show resolved Hide resolved
src/skmatter/metrics/_pairwise.py Outdated Show resolved Hide resolved
Comment on lines 39 to 47
X : {array-like, sparse matrix} of shape (n_samples_X, n_components)
An array where each row is a sample and each column is a component.

Y : {array-like, sparse matrix} of shape (n_samples_Y, n_components), \
default=None
An array where each row is a sample and each column is a component.
If `None`, method uses `Y=X`.

Y_norm_squared : array-like of shape (n_samples_Y,) or (n_samples_Y, 1) \
Collaborator

You don't need empty lines between the arguments. I don't know why this was done in the past. I also removed them from other functions in #227.

Can you do this for your docstrings as well?

Suggested change
X : {array-like, sparse matrix} of shape (n_samples_X, n_components)
An array where each row is a sample and each column is a component.
Y : {array-like, sparse matrix} of shape (n_samples_Y, n_components), \
default=None
An array where each row is a sample and each column is a component.
If `None`, method uses `Y=X`.
Y_norm_squared : array-like of shape (n_samples_Y,) or (n_samples_Y, 1) \
X : {array-like, sparse matrix} of shape (n_samples_X, n_components)
An array where each row is a sample and each column is a component.
Y : {array-like, sparse matrix} of shape (n_samples_Y, n_components), \
default=None
An array where each row is a sample and each column is a component.
If `None`, method uses `Y=X`.
Y_norm_squared : array-like of shape (n_samples_Y,) or (n_samples_Y, 1) \

src/skmatter/metrics/__init__.py Outdated Show resolved Hide resolved
src/skmatter/metrics/_pairwise.py Outdated Show resolved Hide resolved

Collaborator

@PicoCentauri PicoCentauri left a comment

Very nice. I am happy. I asked other developers for their review.

@agoscinski agoscinski self-requested a review May 17, 2024 16:55
@GardevoirX
Contributor Author

Hi @agoscinski, I think this PR is ready for further review. Thank you for your time!

Collaborator

@agoscinski agoscinski left a comment

Looks mostly good. My only concern is the properties that are not recomputed when the estimator is refitted.

examples/neighbors/sparse-kde.py Outdated Show resolved Hide resolved
examples/neighbors/sparse-kde.py Outdated Show resolved Hide resolved
src/skmatter/neighbors/_sparsekde.py Outdated Show resolved Hide resolved
src/skmatter/neighbors/_sparsekde.py Outdated Show resolved Hide resolved
src/skmatter/neighbors/_sparsekde.py Outdated Show resolved Hide resolved
examples/neighbors/pamm.py Outdated Show resolved Hide resolved
examples/neighbors/pamm.py Outdated Show resolved Hide resolved
examples/neighbors/pamm.py Outdated Show resolved Hide resolved
examples/neighbors/pamm.py Outdated Show resolved Hide resolved
src/skmatter/utils/_sparsekde.py Outdated Show resolved Hide resolved
Returns the instance itself.
"""
self._bandwidth_inv_ = None
self._normkernels_ = None
Collaborator

@agoscinski agoscinski Sep 19, 2024

I think it would be more transparent to the reader of the code if you set self._bandwidth_inv = None. Otherwise in the remaining code you don't use it anymore and it becomes only clear when reading the property.

For that you need a property setter

@_bandwidth_inv.setter
def _bandwidth_inv(self, value):
    self._bandwidth_inv_ = value

Alternatively, you could use https://docs.python.org/3/library/functools.html#functools.cached_property then you don't have two variables, less code and you can reset cached_properties with del self.property_name.

Same for _normkernels_

Contributor Author

After consideration, I think adding some comments at these lines might be better. There are two reasons why I did not take either of the two approaches you mentioned. The setter approach does let me reset the value of _bandwidth_inv much more clearly, but it also exposes the logic of calculating it. With the cached_property approach, the values of these two properties won't be updated if the user fits twice.

But I am not quite sure whether hiding the logic of setting these two properties makes sense. If not, I think the setter approach is better.

Collaborator

For the cached_property way, the value of these two properties won't be updated if the user does the fitting twice.

If you set it to None at the beginning, as you already do here, they can be updated. I think that would be the better solution here, but I am fine with the comment.
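
A small sketch of the cached_property variant discussed in this thread, assuming a toy class. `ToyEstimator`, `cov_`, and `bandwidth_inv` are hypothetical names, and the inverse covariance merely stands in for whatever the real property computes. The cache is cleared at the start of every `fit`, so a second fit recomputes the value.

```python
from functools import cached_property

import numpy as np


class ToyEstimator:
    """Hypothetical estimator; only the caching pattern matters here."""

    def fit(self, X):
        # Invalidate the cache so a refit recomputes the property.
        # pop() with a default also works on the very first fit, where
        # `del self.bandwidth_inv` would raise AttributeError because
        # nothing has been cached yet.
        self.__dict__.pop("bandwidth_inv", None)
        self.cov_ = np.atleast_2d(np.cov(X, rowvar=False))
        return self

    @cached_property
    def bandwidth_inv(self):
        # Computed lazily on first access, then stored on the instance.
        return np.linalg.inv(self.cov_)


rng = np.random.default_rng(0)
est = ToyEstimator().fit(rng.normal(size=(100, 2)))
first = est.bandwidth_inv
est.fit(3.0 * rng.normal(size=(100, 2)))
second = est.bandwidth_inv
```

Compared with a private attribute plus a property, this keeps a single name per quantity; the trade-off is that the invalidation in `fit` must not be forgotten.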

CHANGELOG Outdated Show resolved Hide resolved
@agoscinski
Collaborator

agoscinski commented Oct 9, 2024

I would squash-merge this into one commit, with your PR message slightly adapted. Are you fine with that?

Implements SparseKDE, QuickShift, adds H2O-BLYP-Piglet dataset (#222)

* Add the class `SparseKDE`, located at `src/skmatter/utils/_sparsekde.py`.
  It mitigates the high cost of doing KDE for large datasets by doing KDE for
  selected data points (e.g. grid points sampled by farthest point sampling).
  This class takes the original dataset as a parameter and fits the model
  using the sampled grid points. The corresponding tests can be found in
  `tests/test_neighbors.py`.
* Add the class `QuickShift` in `src/skmatter/clustering/_quick_shift.py`
  implementing the quick shift clustering algorithm with corresponding tests in
  `tests/test_clustering.py`.
* Add the H2O-BLYP-Piglet dataset containing 27233 hydrogen bonds with 3D
  descriptors and weights. The corresponding tests can be found in
  `tests/test_datasets.py`.
* Add two auxiliary functions, `effdim` and `oas`, stored in
  `src/skmatter/utils/_sparsekde.py` with corresponding tests in
  `tests/test_neighbors.py`.
* Add two distance metrics compatible with PBC, `pairwise_euclidean_distances`
  and `pairwise_mahalanobis_distances`, stored in
  `src/skmatter/metrics/pairwise.py` with corresponding tests in
  `tests/test_metrics.py`.

@GardevoirX
Contributor Author

Totally okay, thank you for your time and consideration!

Collaborator

@agoscinski agoscinski left a comment

Thanks also for the work and the patience!

@agoscinski agoscinski merged commit ad56b1d into scikit-learn-contrib:main Oct 9, 2024
12 checks passed
3 participants