Sparse kde #222

GardevoirX · 2024-02-16T08:55:10Z

This PR introduces SparseKDE:

The class SparseKDE is located at src/skmatter/utils/_sparsekde.py. It mitigates the high cost of kernel density estimation (KDE) on large datasets by evaluating the KDE only at selected data points (e.g. grid points sampled by farthest point sampling). This class takes the original dataset as a parameter and fits the model using the sampled grid points.
Two auxiliary classes and some auxiliary functions of SparseKDE are stored in src/skmatter/utils/_sparsekde.py.
Two distance metrics compatible with periodic boundary conditions (PBC), pairwise_euclidean_distances and pairwise_mahalanobis_distances, are implemented in src/skmatter/metrics/pairwise.py.
Tests for SparseKDE and some auxiliary functions are stored in tests/test_neighbors.py. Tests for distance metrics are stored in tests/test_metrics.py.
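
The idea described above can be illustrated with a small NumPy sketch. This is not the SparseKDE implementation from this PR: the function names `farthest_point_sampling` and `gaussian_kde_at`, the isotropic Gaussian kernel, and the fixed bandwidth are all assumptions made for illustration. It only shows why evaluating the density at a few grid points is cheaper than evaluating it at every sample.

```python
import numpy as np


def farthest_point_sampling(X, n_grid, start=0):
    """Greedy farthest point sampling; returns indices of the selected rows.

    Hypothetical helper, not part of scikit-matter's API.
    """
    selected = [start]
    # Distance from every sample to its nearest already-selected point.
    min_dist = np.linalg.norm(X - X[start], axis=1)
    for _ in range(n_grid - 1):
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(X - X[nxt], axis=1))
    return np.array(selected)


def gaussian_kde_at(points, X, bandwidth):
    """Isotropic Gaussian KDE built from all samples X, evaluated at `points`."""
    n_samples, n_dim = X.shape
    # Squared distances between evaluation points and all samples.
    sq_dist = ((points[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    norm = n_samples * (2.0 * np.pi * bandwidth**2) ** (n_dim / 2.0)
    return np.exp(-0.5 * sq_dist / bandwidth**2).sum(axis=1) / norm


rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2))

# 50 grid points instead of 2000 evaluation points:
# the kernel matrix is 50 x 2000 rather than 2000 x 2000.
grid_idx = farthest_point_sampling(X, n_grid=50)
density = gaussian_kde_at(X[grid_idx], X, bandwidth=0.3)
```

The density is still estimated from the full dataset; only the number of evaluation points is reduced.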

I am not sure if the current API of SparseKDE is OK and if the auxiliary classes should be integrated into SparseKDE. Also, SparseKDE seems to be too large and complex. Perhaps it needs to be decomposed into smaller parts, but I have not figured out how.

Contributor (creator of PR) checklist

Tests updated (for new features and bugfixes)?
Documentation updated (for new features)?
Issue referenced (for PRs that solve an issue)?

For Reviewer

CHANGELOG updated if important change?

📚 Documentation preview 📚: https://scikit-matter--222.org.readthedocs.build/en/222/

PicoCentauri

Very good. Most of the comments are fairly easy to address. The first main point is that you should test whether the user provides a cell of the size you expect, and raise an error otherwise. The second main point is that, instead of requiring strings for some parameters, we can directly accept callables.

examples/neighbors/sparse-kde.py

src/skmatter/neighbors/_sparsekde.py

src/skmatter/metrics/_pairwise.py

src/skmatter/neighbors/_sparsekde.py

PicoCentauri · 2024-05-02T06:49:51Z

Also, can you please update the CHANGELOG file.

PicoCentauri

Very nice improvements.

PicoCentauri · 2024-05-03T06:09:54Z

CHANGELOG

@@ -11,8 +11,15 @@ The rules for CHANGELOG file:

 .. inclusion-marker-changelog-start

-0.3.0 (XXXX/XX/XX)
+0.2.1 (XXXX/XX/XX)


Keep the version 0.3.0.

Suggested change

-0.2.1 (XXXX/XX/XX)
+0.3.0 (XXXX/XX/XX)

In the pyproject.toml we also use version 0.3.0. We will decide once we release whether we do a major, minor, or patch release.

This looks like it has not been done yet?

examples/neighbors/sparse-kde.py

PicoCentauri · 2024-05-03T06:40:50Z

examples/neighbors/sparse-kde.py

+    labels: np.ndarray,
+    probs: np.ndarray,
+    normpks: float,
+    metric: str,


But I think passing an instance would be even easier. You don't even have to keep a dictionary and users can use their own distance measures without changing the code.
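
A minimal sketch of the suggestion above, under stated assumptions: `assign_to_nearest`, `euclidean`, and `manhattan` are hypothetical names, not functions from this PR. It shows how accepting a callable removes the need for a string-to-function dictionary and lets users plug in their own distance measure.

```python
from typing import Callable

import numpy as np

# A metric maps (n_x, d) and (n_y, d) arrays to an (n_x, n_y) distance matrix.
Metric = Callable[[np.ndarray, np.ndarray], np.ndarray]


def euclidean(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    return np.sqrt(((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1))


def manhattan(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    # A user-defined metric: no library code has to change to support it.
    return np.abs(X[:, None, :] - Y[None, :, :]).sum(axis=-1)


def assign_to_nearest(
    X: np.ndarray, centers: np.ndarray, metric: Metric = euclidean
) -> np.ndarray:
    """Label each row of X with the index of its nearest center."""
    return np.argmin(metric(X, centers), axis=1)


X = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
centers = np.array([[0.0, 0.0], [5.0, 5.0]])

labels_l2 = assign_to_nearest(X, centers)
labels_l1 = assign_to_nearest(X, centers, metric=manhattan)
```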

src/skmatter/neighbors/_sparsekde.py

src/skmatter/clustering/__init__.py

src/skmatter/clustering/_quick_shift.py

src/skmatter/metrics/__init__.py

src/skmatter/metrics/_pairwise.py

PicoCentauri · 2024-05-03T06:57:02Z

src/skmatter/metrics/_pairwise.py

    X, Y = check_pairwise_arrays(X, Y)
    cov_inv = _mahalanobis_preprocess(cov_inv)
    dists = _mahalanobis(cell, X, Y, cov_inv)
    if not squared:
        dists **= 0.5
    return dists
+
+
+def _check_dimension(X, cell):


Do you also want to add a test here that the cell is rectangular? Maybe you have it already and I just overlooked it.

Perhaps not, because I do not give a place for the angle in the cell parameter, and every number in the cell will be interpreted as a side length of a rectangular box. I will add a note in the documentation to inform the user that it only supports the rectangular cell.

Okay, but if the cell is rectangular, your format is a 1D array of length 3?

This should be tested, and you should write in your docs that this is the expected format. If anything else is given, raise a meaningful error message giving the actual type and the type that you expect.

Ah sorry, perhaps I should not say rectangular. The cell can be rectangular, cubic, a 4-cube, or an n-cube. There is no limitation on the dimension. Thus I only test whether the dimension of the cell matches the dimension of the descriptors.

Yes, rectangular. I also commented on this in another place in more detail. Maybe it makes sense to rename cell to cell_length to make clear that this is what we want.
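
The format discussed in this thread (one side length per descriptor dimension, validated with an informative error) can be sketched as follows. This is an illustration only: `check_cell_length` and `periodic_euclidean_distances` are hypothetical names, and the minimum-image wrapping shown here is an assumption about how a periodic distance for an axis-aligned box is usually computed, not a copy of the code in `_pairwise.py`.

```python
import numpy as np


def check_cell_length(X, cell_length):
    """Validate that cell_length holds one side length per feature of X."""
    cell_length = np.asarray(cell_length, dtype=float)
    if cell_length.ndim != 1 or cell_length.shape[0] != X.shape[1]:
        raise ValueError(
            f"cell_length must be a 1D array of length {X.shape[1]} "
            f"(one side length per feature), got shape {cell_length.shape}."
        )
    return cell_length


def periodic_euclidean_distances(X, Y, cell_length=None, squared=False):
    """Pairwise Euclidean distances, optionally in an axis-aligned periodic box."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    diff = X[:, None, :] - Y[None, :, :]
    if cell_length is not None:
        cell_length = check_cell_length(X, cell_length)
        # Minimum-image convention: wrap each component into [-L/2, L/2].
        diff -= cell_length * np.round(diff / cell_length)
    dists = (diff**2).sum(axis=-1)
    return dists if squared else np.sqrt(dists)


a = [[0.1, 0.1]]
b = [[0.9, 0.9]]
d_open = periodic_euclidean_distances(a, b)
d_pbc = periodic_euclidean_distances(a, b, cell_length=[1.0, 1.0])
```

With the unit box, the two points are neighbors across the boundary, so `d_pbc` is smaller than `d_open`. Because the check only compares lengths, the same code works for a box of any dimension.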

src/skmatter/neighbors/_sparsekde.py

src/skmatter/metrics/_pairwise.py

PicoCentauri

looks good. I think we are almost there from my side.

src/skmatter/clustering/_quick_shift.py

PicoCentauri · 2024-05-06T12:20:37Z

src/skmatter/metrics/__init__.py

+available.
+
+  .. note::
+    Currently only rectangular cells are supported.


as we discussed above maybe also give the expected format here.

src/skmatter/neighbors/_sparsekde.py

PicoCentauri

Very cool. Now I also get the "problem" with a cell. I made a suggestion to make clear what the code can handle and what not.

src/skmatter/clustering/_quick_shift.py

src/skmatter/metrics/_pairwise.py

PicoCentauri · 2024-05-13T13:57:23Z

src/skmatter/metrics/_pairwise.py

+    X : {array-like, sparse matrix} of shape (n_samples_X, n_components)
+        An array where each row is a sample and each column is a component.
+
+    Y : {array-like, sparse matrix} of shape (n_samples_Y, n_components), \
+            default=None
+        An array where each row is a sample and each column is a component.
+        If `None`, method uses `Y=X`.
+
+    Y_norm_squared : array-like of shape (n_samples_Y,) or (n_samples_Y, 1) \


You don't need empty lines between the arguments. I don't know why this was done in the past. I also removed them from other functions in #227.

Can you do this for your docstrings as well?

Suggested change

    X : {array-like, sparse matrix} of shape (n_samples_X, n_components)
        An array where each row is a sample and each column is a component.
    Y : {array-like, sparse matrix} of shape (n_samples_Y, n_components), \
            default=None
        An array where each row is a sample and each column is a component.
        If `None`, method uses `Y=X`.
    Y_norm_squared : array-like of shape (n_samples_Y,) or (n_samples_Y, 1) \

src/skmatter/metrics/__init__.py

src/skmatter/metrics/_pairwise.py

PicoCentauri · 2024-05-13T13:59:53Z

src/skmatter/metrics/_pairwise.py

    X, Y = check_pairwise_arrays(X, Y)
    cov_inv = _mahalanobis_preprocess(cov_inv)
    dists = _mahalanobis(cell, X, Y, cov_inv)
    if not squared:
        dists **= 0.5
    return dists
+
+
+def _check_dimension(X, cell):


Yes, rectangular. I also commented on this in another place in more detail. Maybe it makes sense to rename cell to cell_length to make clear that this is what we want.

PicoCentauri

Very nice. I am happy. I asked other developers for their review.

GardevoirX · 2024-08-09T12:49:43Z

Hi @agoscinski, I think this PR is ready for further review. Thank you for your time!

agoscinski

Looks mostly good. My only concern is the properties that are not recomputed when the model is refitted.

examples/neighbors/sparse-kde.py

src/skmatter/neighbors/_sparsekde.py

examples/neighbors/pamm.py

src/skmatter/utils/_sparsekde.py

agoscinski · 2024-09-19T02:35:18Z

src/skmatter/neighbors/_sparsekde.py

+            Returns the instance itself.
+        """
+        self._bandwidth_inv_ = None
+        self._normkernels_ = None


I think it would be more transparent to the reader of the code if you set self._bandwidth_inv = None. Otherwise in the remaining code you don't use it anymore and it becomes only clear when reading the property.

For that you need a property setter

    @_bandwidth_inv.setter
    def _bandwidth_inv(self, value):
        self._bandwidth_inv_ = value

Alternatively, you could use https://docs.python.org/3/library/functools.html#functools.cached_property then you don't have two variables, less code and you can reset cached_properties with del self.property_name.

Same for _normkernels_

After consideration, I think adding some comments at these lines might be better. There are two reasons why I did not take either of the two approaches you mentioned. The setter approach lets me reset the value of _bandwidth_inv much more clearly, but it also exposes the logic of calculating it. With the cached_property approach, the values of these two properties won't be updated if the user fits the model twice.

But I am not quite sure whether hiding the logic of setting these two properties makes sense. If not, I think the setter approach is better.

With the cached_property approach, the values of these two properties won't be updated if the user fits the model twice.

If you set it to None at the beginning, as you do it anyway here, they can be updated. I think that would be the better solution here, but I am fine with the comment.
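
A minimal sketch of the cached_property variant discussed here, assuming a toy class: `ToyEstimator`, `bandwidth_`, and `n_inverse_calls` are illustrative names, not the SparseKDE attributes. `functools.cached_property` stores the computed value in the instance `__dict__`, so dropping that entry at the start of `fit` forces a recomputation after refitting.

```python
from functools import cached_property

import numpy as np


class ToyEstimator:
    """Toy stand-in for an estimator with an expensive derived attribute."""

    def __init__(self):
        self.n_inverse_calls = 0  # only here to make the caching visible

    def fit(self, X):
        # Invalidate the cache so a second fit does not reuse a stale inverse.
        # pop() is used instead of `del` because it is safe on the first fit,
        # when nothing has been cached yet.
        self.__dict__.pop("_bandwidth_inv", None)
        self.bandwidth_ = np.cov(np.asarray(X, dtype=float), rowvar=False)
        return self

    @cached_property
    def _bandwidth_inv(self):
        # Computed lazily on first access, then stored on the instance.
        self.n_inverse_calls += 1
        return np.linalg.inv(self.bandwidth_)


rng = np.random.default_rng(0)
est = ToyEstimator().fit(rng.normal(size=(50, 2)))
first = est._bandwidth_inv
again = est._bandwidth_inv  # served from the cache, no new inversion

est.fit(rng.normal(size=(50, 2)) * 3.0)
refit = est._bandwidth_inv  # recomputed for the new bandwidth
```

Compared with the private-attribute-plus-property pattern, this keeps a single name per quantity and makes the reset explicit in `fit`.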

examples/neighbors/sparse-kde.py

CHANGELOG

agoscinski · 2024-10-09T09:10:41Z

I would squash-merge this into one commit, with your PR message slightly adapted. Are you fine with that?

Implements SparseKDE, QuickShift, adds H2O-BLYP-Piglet dataset (#222)

* Add the class `SparseKDE`, located at `src/skmatter/utils/_sparsekde.py`.
  It mitigates the high cost of doing KDE for large datasets by doing KDE for
  selected data points (e.g. grid points sampled by farthest point sampling).
  This class takes the original dataset as a parameter and fits the model
  using the sampled grid points. The corresponding tests can be found in
  `tests/test_neighbors.py`.
* Add the class `QuickShift` in `src/skmatter/clustering/_quick_shift.py`
  implementing the quick shift clustering algorithm with corresponding tests in
  `tests/test_clustering.py`.
* Add the H2O-BLYP-Piglet dataset containing 27233 hydrogen bonds with 3D
  descriptors and weights. The corresponding tests can be found in
  `tests/test_datasets.py`.
* Add two auxiliary functions, `effdim` and `oas`, stored in
  `src/skmatter/utils/_sparsekde.py` with corresponding tests in
  `tests/test_neighbors.py`.
* Add two distance metrics compatible with PBC, `pairwise_euclidean_distances`
  and `pairwise_mahalanobis_distances`, stored in
  `src/skmatter/metrics/pairwise.py` with corresponding tests in
  `tests/test_metrics.py`.

GardevoirX · 2024-10-09T10:27:53Z

Totally okay, thank you for your time and consideration!

agoscinski

Thanks also for the work and the patience!

PicoCentauri self-assigned this May 1, 2024

PicoCentauri reviewed May 1, 2024

PicoCentauri reviewed May 3, 2024

PicoCentauri reviewed May 6, 2024

PicoCentauri reviewed May 13, 2024

PicoCentauri reviewed May 14, 2024

agoscinski self-requested a review May 17, 2024 16:55

PicoCentauri and others added 21 commits May 27, 2024 13:56

Add basic API for SparseKDE and supporting functions

b6984ae

periodic euclidean distance

73d57ac

migration

12f6a53

migration2

460ff2d

migration complete

e05ee31

unittests

d625c2a

mahalanobis distance refactor

f138b05

distance tests

ac1cd93

test fix

0942b1e

sample

00577b0

minor fix

15682e4

comment update

aa55db6

format clean

ccec98f

docstring update

aea9e93

format clean

d8e248b

typing update

ed78b44

remove optional typing

f75e791

typing fix

f197629

typing for py38

88dc9bd

Add basic documentation

69a3cba

format and distance documentation

68ba548

GardevoirX added 4 commits August 7, 2024 17:01

Minor name fix

65c9636

Minor update

24b93c4

Minor fix

38c9da3

Minor test update

2098e5d

agoscinski requested changes Aug 12, 2024

GardevoirX and others added 14 commits August 12, 2024 23:08

Update examples/neighbors/sparse-kde.py

12a3590

Co-authored-by: Alexander Goscinski

Update examples/neighbors/sparse-kde.py

a049505

Co-authored-by: Alexander Goscinski

Update docs/src/references/neighbors.rst

f0c7741

Co-authored-by: Alexander Goscinski

Update examples/neighbors/pamm.py

33f09bf

Co-authored-by: Alexander Goscinski

Update examples/neighbors/pamm.py

0410832

Co-authored-by: Alexander Goscinski

Update examples/neighbors/pamm.py

284360a

Co-authored-by: Alexander Goscinski

Update src/skmatter/neighbors/_sparsekde.py

5b59abd

Co-authored-by: Alexander Goscinski

Update src/skmatter/neighbors/_sparsekde.py

7dff19f

Co-authored-by: Alexander Goscinski

Update examples/neighbors/pamm.py

9d8ad9b

Co-authored-by: Alexander Goscinski

Updates on kde attributes and example

21ff481

Merge branch 'sparse-kde' of https://github.com/GardevoirX/scikit-matter into sparse-kde

62b1e0b

Update src/skmatter/utils/_sparsekde.py

1e3d2e3

Co-authored-by: Alexander Goscinski

Update _sparsekde.py

939e0a4

Updates on the semi-positive tests

0bc1715

agoscinski reviewed Sep 19, 2024

GardevoirX and others added 2 commits September 19, 2024 10:01

Update sparse-kde.py

a4aea47

Co-authored-by: Alexander Goscinski

Add comments to _bandwidth_inv_ and _normkernels_

23043e2

agoscinski reviewed Oct 9, 2024

Update CHANGELOG

96136dc

agoscinski approved these changes Oct 9, 2024

agoscinski merged commit ad56b1d into scikit-learn-contrib:main Oct 9, 2024
12 checks passed
