
Reverse diversification is actually forward diversification again #234

Open
jlmelville opened this issue Dec 29, 2023 · 0 comments

I've noticed that no matter what dataset I try, when calling prepare with default pruning probability, the number of edges before and after the reverse diversification is always the same. Here's a small example:

```python
from sklearn.datasets import load_iris
import pynndescent

iris = load_iris().data
iris_index = pynndescent.NNDescent(iris, n_neighbors=15, random_state=42, verbose=True)
iris_query = iris_index.query(iris, k=15)
```

It doesn't matter what dataset or random_state you use; the verbose output to look for is along the lines of:

```
Thu Dec 28 21:17:48 2023 Forward diversification reduced edges from 2250 to 639
Thu Dec 28 21:17:48 2023 Reverse diversification reduced edges from 639 to 639
```

So forward diversification does reduce the number of edges, but no occlusions are found in the reverse case.

I believe the issue arises here:

```python
reverse_graph = self._search_graph.transpose()
```

The graph edges are reversed by transposing the matrix, but `transpose()` on a CSR matrix returns a CSC matrix. The next step calls `diversify_csr`, which (as its name suggests) expects CSR format. Because a transposed CSR matrix and the original share the same underlying index arrays, the wrong orientation is processed: the same edges that were just pruned in the forward diversification step are re-processed, and since all the occlusions were already removed, no further pruning occurs. If you set `prune_probability` to less than one you might see some edges removed in the reverse step, but you are still just doing forward diversification twice.
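This is easy to see directly with scipy (a small demonstration relying only on scipy's documented CSR/CSC behavior, not on any pynndescent code): the CSC transpose reuses the forward graph's raw `indptr`/`indices` arrays, so CSR-style traversal of them visits the forward neighbors again.

```python
import numpy as np
from scipy.sparse import csr_matrix, isspmatrix_csc

# Small asymmetric graph: entry (i, j) means a weighted edge i -> j.
graph = csr_matrix(np.array([[0, 1, 0],
                             [0, 0, 2],
                             [3, 0, 0]]))

reverse = graph.transpose()
print(isspmatrix_csc(reverse))  # True: transposing CSR yields CSC

# The CSC transpose has the same raw arrays as the forward graph, so
# code that walks indptr/indices as if they were CSR re-reads the
# forward edges rather than the reversed ones.
print(np.array_equal(graph.indptr, reverse.indptr))    # True
print(np.array_equal(graph.indices, reverse.indices))  # True

# An explicit conversion rebuilds the arrays in the reversed orientation.
print(np.array_equal(graph.indices, reverse.tocsr().indices))  # False
```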

Things do behave more correctly if you use self._search_graph.transpose().tocsr(), although I haven't stopped to think about whether that is the most efficient fix. At any rate...

Unfortunately I think there is another bug

There also seems to be a bug inside diversify_csr itself:

```python
j = order[idx]
for k in range(idx):
    l = order[k]
    if retained[l] == 1:
        d = dist(
            source_data[current_indices[j]], source_data[current_indices[k]]
        )
```

I think that in the distance calculation you need to fetch the data via current_indices[l] rather than current_indices[k]: k is the loop position in the order array, while l = order[k] is the actual offset into the edge arrays. k is to l as idx is to j, and everywhere else the code indexes order via idx and then uses j to look up the indices.
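To make the indexing concrete, here is a hypothetical standalone sketch of the loop with that fix applied (prune_neighbors, the occlusion criterion, and the 1-D test data are all my own illustration, not the actual pynndescent code): candidates are visited in order of increasing distance, and a candidate j is dropped when an already-retained closer neighbor l is nearer to j than the query is.

```python
import numpy as np

def prune_neighbors(current_indices, distances, source_data, dist):
    """Sketch of occlusion pruning for one node's neighbor list.

    current_indices maps edge offsets to rows of source_data, and
    distances holds the distance from the query to each neighbor.
    """
    order = np.argsort(distances)
    retained = np.ones(len(order), dtype=np.int8)
    for idx in range(1, len(order)):
        j = order[idx]
        for k in range(idx):
            l = order[k]
            if retained[l] == 1:
                # The fix: index the data via l (the edge offset),
                # not k (the position within `order`).
                d = dist(source_data[current_indices[j]],
                         source_data[current_indices[l]])
                if distances[j] > d:
                    retained[j] = 0  # j is occluded by l
                    break
    return retained

# Collinear 1-D points: the query is at 0, so each farther neighbor is
# occluded by the nearest one and only it survives.
dist = lambda a, b: float(np.linalg.norm(a - b))
source_data = np.array([[0.0], [1.0], [2.0], [10.0]])
current_indices = np.array([1, 2, 3])
distances = np.array([1.0, 2.0, 10.0])
print(prune_neighbors(current_indices, distances, source_data, dist).tolist())
# [1, 0, 0]
```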

I could have got myself very confused here, but with both changes in place I see the same behavior as my C++ implementation.

Neither of these issues actually has much effect on the performance of the search graph as it happens so I guess that's good.

Something that might not be a problem

Probably worth mentioning that with high-dimensional datasets there is an increasing probability that the reverse graph contains a hub node with a very large number of incoming edges. Could the double loop in the occlusion pruning then approach O(N^2), rather than the O(k^2) of the forward case? When I tried this with a dataset I know has a hub problem it seemed ok, but fixing this could introduce some unexpected performance changes. Edited to add: degree-pruning the reverse graph before diversification would probably avoid this.
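For what it's worth, the degree-pruning idea could be sketched like this (truncate_rows is a hypothetical helper, not an existing pynndescent function): cap each row of the reverse CSR graph at its max_degree smallest-distance edges before diversification, so the inner double loop is bounded per node even at hub nodes.

```python
import numpy as np
from scipy.sparse import csr_matrix

def truncate_rows(graph, max_degree):
    """Keep only the max_degree smallest-weight edges in each CSR row.

    Hypothetical degree-pruning pass: bounds any subsequent
    O(degree^2) diversification work to O(max_degree^2) per node.
    """
    indptr, indices, data = graph.indptr, graph.indices, graph.data
    new_indptr = [0]
    new_indices = []
    new_data = []
    for i in range(graph.shape[0]):
        start, end = indptr[i], indptr[i + 1]
        # Positions of the max_degree closest neighbors in this row.
        keep = np.argsort(data[start:end])[:max_degree]
        new_indices.extend(indices[start:end][keep])
        new_data.extend(data[start:end][keep])
        new_indptr.append(len(new_indices))
    return csr_matrix((new_data, new_indices, new_indptr), shape=graph.shape)

# A hub row with four edges, truncated to its two closest neighbors.
hub = csr_matrix((np.array([4.0, 1.0, 3.0, 2.0]),
                  np.array([1, 2, 3, 4]),
                  np.array([0, 4, 4, 4, 4, 4])), shape=(5, 5))
pruned = truncate_rows(hub, 2)
print(pruned.nnz)                          # 2
print(sorted(pruned.indices.tolist()))     # [2, 4]
```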

I haven't looked at the other diversification functions to see whether any other code paths trigger this behavior. I can volunteer a PR for this; I assume it could have some numerical effects on the unit tests though.
