
Reverse diversification is actually forward diversification again #234

Open
jlmelville opened this issue Dec 29, 2023 · 0 comments

I've noticed that no matter what dataset I try, when calling prepare with default pruning probability, the number of edges before and after the reverse diversification is always the same. Here's a small example:

```python
from sklearn.datasets import load_iris
import pynndescent

iris = load_iris().data
iris_index = pynndescent.NNDescent(iris, n_neighbors=15, random_state=42, verbose=True)
iris_query = iris_index.query(iris, k=15)
```

It doesn't matter what dataset or random_state you use; the verbose output to look for is along the lines of:

```
Thu Dec 28 21:17:48 2023 Forward diversification reduced edges from 2250 to 639
Thu Dec 28 21:17:48 2023 Reverse diversification reduced edges from 639 to 639
```

So forward diversification does reduce the number of edges, but no occlusions are found in the reverse case.

I believe the issue arises here:

```python
reverse_graph = self._search_graph.transpose()
```

The graph edges are reversed by transposing the matrix, but `transpose()` on a CSR matrix returns a CSC matrix. The next step calls `diversify_csr`, which (as its name suggests) expects CSR format. Because a transposed CSR matrix and the original share the same underlying index arrays, the wrong orientation is processed: the same edges that were just pruned in the forward diversification step are re-processed, and since all the occlusions were already removed, no further pruning occurs. If you set `prune_probability` to less than one you might see some edges removed in the reverse step, but you are still just doing forward diversification twice.
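This is easy to see directly with scipy (a small demonstration relying only on scipy's documented CSR/CSC behavior, not on any pynndescent code): the CSC transpose reuses the forward graph's raw `indptr`/`indices` arrays, so CSR-style traversal of them visits the forward neighbors again.

```python
import numpy as np
from scipy.sparse import csr_matrix, isspmatrix_csc

# Small asymmetric graph: entry (i, j) means a weighted edge i -> j.
graph = csr_matrix(np.array([[0, 1, 0],
                             [0, 0, 2],
                             [3, 0, 0]]))

reverse = graph.transpose()
print(isspmatrix_csc(reverse))  # True: transposing CSR yields CSC

# The CSC transpose has the same raw arrays as the forward graph, so
# code that walks indptr/indices as if they were CSR re-reads the
# forward edges rather than the reversed ones.
print(np.array_equal(graph.indptr, reverse.indptr))    # True
print(np.array_equal(graph.indices, reverse.indices))  # True

# An explicit conversion rebuilds the arrays in the reversed orientation.
print(np.array_equal(graph.indices, reverse.tocsr().indices))  # False
```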

Things do behave more correctly if you use self._search_graph.transpose().tocsr(), although I haven't stopped to think about whether that is the most efficient fix. At any rate...

Unfortunately I think there is another bug

There also seems to be a bug inside diversify_csr itself:

```python
j = order[idx]
for k in range(idx):
    l = order[k]
    if retained[l] == 1:
        d = dist(
            source_data[current_indices[j]], source_data[current_indices[k]]
        )
```

I think that in the distance calculation you need to fetch the data via current_indices[l] rather than current_indices[k]: k is the loop position in the order array, while l = order[k] is the actual offset into the edge arrays. k is to l as idx is to j, and everywhere else the code indexes order via idx and then uses j to look up the indices.
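To make the indexing concrete, here is a hypothetical standalone sketch of the loop with that fix applied (prune_neighbors, the occlusion criterion, and the 1-D test data are all my own illustration, not the actual pynndescent code): candidates are visited in order of increasing distance, and a candidate j is dropped when an already-retained closer neighbor l is nearer to j than the query is.

```python
import numpy as np

def prune_neighbors(current_indices, distances, source_data, dist):
    """Sketch of occlusion pruning for one node's neighbor list.

    current_indices maps edge offsets to rows of source_data, and
    distances holds the distance from the query to each neighbor.
    """
    order = np.argsort(distances)
    retained = np.ones(len(order), dtype=np.int8)
    for idx in range(1, len(order)):
        j = order[idx]
        for k in range(idx):
            l = order[k]
            if retained[l] == 1:
                # The fix: index the data via l (the edge offset),
                # not k (the position within `order`).
                d = dist(source_data[current_indices[j]],
                         source_data[current_indices[l]])
                if distances[j] > d:
                    retained[j] = 0  # j is occluded by l
                    break
    return retained

# Collinear 1-D points: the query is at 0, so each farther neighbor is
# occluded by the nearest one and only it survives.
dist = lambda a, b: float(np.linalg.norm(a - b))
source_data = np.array([[0.0], [1.0], [2.0], [10.0]])
current_indices = np.array([1, 2, 3])
distances = np.array([1.0, 2.0, 10.0])
print(prune_neighbors(current_indices, distances, source_data, dist).tolist())
# [1, 0, 0]
```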

I could have got myself very confused here, but with both changes in place I see the same behavior as my C++ implementation.

Neither of these issues actually has much effect on the performance of the search graph as it happens so I guess that's good.

Something that might not be a problem

Probably worth mentioning that with high-dimensional datasets there is an increasing probability that the reverse graph contains a hub node with a very large number of incoming edges. Could the double loop in the occlusion pruning then approach O(N^2), rather than the O(k^2) of the forward case? When I tried this with a dataset I know has a hub problem it seemed ok, but fixing this could introduce some unexpected performance changes. Edited to add: degree-pruning the reverse graph before diversification would probably avoid this.
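For what it's worth, the degree-pruning idea could be sketched like this (truncate_rows is a hypothetical helper, not an existing pynndescent function): cap each row of the reverse CSR graph at its max_degree smallest-distance edges before diversification, so the inner double loop is bounded per node even at hub nodes.

```python
import numpy as np
from scipy.sparse import csr_matrix

def truncate_rows(graph, max_degree):
    """Keep only the max_degree smallest-weight edges in each CSR row.

    Hypothetical degree-pruning pass: bounds any subsequent
    O(degree^2) diversification work to O(max_degree^2) per node.
    """
    indptr, indices, data = graph.indptr, graph.indices, graph.data
    new_indptr = [0]
    new_indices = []
    new_data = []
    for i in range(graph.shape[0]):
        start, end = indptr[i], indptr[i + 1]
        # Positions of the max_degree closest neighbors in this row.
        keep = np.argsort(data[start:end])[:max_degree]
        new_indices.extend(indices[start:end][keep])
        new_data.extend(data[start:end][keep])
        new_indptr.append(len(new_indices))
    return csr_matrix((new_data, new_indices, new_indptr), shape=graph.shape)

# A hub row with four edges, truncated to its two closest neighbors.
hub = csr_matrix((np.array([4.0, 1.0, 3.0, 2.0]),
                  np.array([1, 2, 3, 4]),
                  np.array([0, 4, 4, 4, 4, 4])), shape=(5, 5))
pruned = truncate_rows(hub, 2)
print(pruned.nnz)                          # 2
print(sorted(pruned.indices.tolist()))     # [2, 4]
```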

I haven't looked at the other diversification functions to see whether any other code paths trigger this behavior. I can volunteer a PR for this; I assume it could have some numerical effects on the unit tests though.
