I've noticed that no matter what dataset I try, when calling prepare with the default pruning probability, the number of edges before and after the reverse diversification is always the same. It doesn't matter what dataset or random_state you use; the output to look for is along the lines of:
Thu Dec 28 21:17:48 2023 Forward diversification reduced edges from 2250 to 639
Thu Dec 28 21:17:48 2023 Reverse diversification reduced edges from 639 to 639
So forward diversification does reduce the number of edges, but no occlusions are found in the reverse case.
I believe the issue arises in pynndescent/pynndescent_.py (line 1105 at d99a821), where the graph edges are reversed by transposing the matrix. But transposing also turns it into a CSC sparse matrix, while the next step, diversify_csr, expects CSR format (as its name suggests). As a result, the wrong orientation is processed: the same edges that were just handled in the forward diversification step are re-processed, and because all of their occlusions were already pruned, no further pruning occurs. If you set prune_probability to less than one you might see something happen in the reverse diversification, but you are just doing forward diversification twice.
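To see the CSR/CSC mix-up concretely, here is a small scipy.sparse sketch (independent of pynndescent): a transposed CSR matrix is a CSC matrix sharing the same index buffers, so any routine that walks its indptr/indices row-wise just revisits the forward edges.

```python
import numpy as np
import scipy.sparse as sp

graph = sp.random(6, 6, density=0.3, format="csr", random_state=0)

# Transposing a CSR matrix reinterprets the same buffers as CSC:
# no data movement happens, so indptr/indices still describe the
# *forward* rows of the original graph.
reversed_wrong = graph.transpose()
print(reversed_wrong.format)                                # csc
print(np.array_equal(reversed_wrong.indptr, graph.indptr))  # True

# Materializing the transpose as CSR gives the true reversed rows.
reversed_right = graph.transpose().tocsr()
print(reversed_right.format)                                # csr
```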
Things do behave more correctly if you use self._search_graph.transpose().tocsr(), although I haven't stopped to think whether that is the most efficient fix. At any rate...
Unfortunately I think there is another bug
I think there might be another bug in diversify_csr (pynndescent_.py, lines 443 to 449 at d99a821):
I think in the distance calculation you need to fetch the data from current_indices[l] rather than current_indices[k]. From the context, k is the loop index into the order array, and order holds the actual offsets into the edge arrays. As far as I can see, k is to l as idx is to j: you always index into order via idx and then use j to get the indices, so the inner loop should likewise index into order via k and then use l.
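A simplified, pure-Python sketch of that pruning loop may make the bookkeeping clearer (the name diversify_row and its signature are illustrative, not pynndescent's actual code): idx and k are positions in order, and only order's values are valid offsets into indices/data.

```python
import numpy as np

def diversify_row(indices, data, point_dist):
    """Prune occluded edges of one CSR row (simplified sketch).

    indices/data: neighbor ids and edge distances for one node.
    point_dist: callable giving the distance between two point ids.
    order[pos] maps a sorted position to an offset into indices/data,
    so neighbor lookups must always go through order.
    """
    order = np.argsort(data)          # process closest neighbors first
    retained = np.ones(len(order), dtype=bool)
    for idx in range(1, len(order)):
        j = order[idx]                # offset of the candidate edge
        for k in range(idx):
            l = order[k]              # offset of an already-kept edge
            if not retained[l]:
                continue
            # The occluding distance is measured to indices[l] -- the
            # neighbor at sorted position k -- never to indices[k].
            if point_dist(indices[j], indices[l]) < data[j]:
                retained[j] = False
                break
    return retained
```

On three collinear points at x = 0, 1, 2, node 0's edge to point 2 is occluded by the closer point 1, so only the first edge survives.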
I could have gotten myself very confused here, but with these changes I do see the same behavior as my C++ implementation.
Neither of these issues actually has much effect on the performance of the search graph, as it happens, so I guess that's good.
Something that might not be a problem
It's probably worth mentioning that with high-dimensional datasets there is an increasing probability that the reverse graph will contain a hub node with a very large number of connections. Could the double loop in the occlusion pruning then lead to an O(N^2) situation, rather than just the O(k^2) of the forward case? When I tried this with a dataset I know has a hub issue it seemed OK, but fixing this could introduce some unexpected performance changes. Edited to add: degree-pruning the reverse graph before diversification would probably avoid this.
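That mitigation could look something like the following sketch (illustrative code, not pynndescent's API): capping each row of the reverse CSR graph at a maximum out-degree before diversifying bounds the inner pruning loop at O(max_degree^2) per node, even for hub nodes.

```python
import numpy as np
from scipy.sparse import csr_matrix

def degree_prune(graph, max_degree):
    """Keep only the max_degree shortest edges in each row of a CSR
    distance graph (illustrative sketch for degree-pruning a reverse
    graph before occlusion pruning)."""
    indptr, indices, data = graph.indptr, graph.indices, graph.data
    new_indices, new_data, new_indptr = [], [], [0]
    for i in range(graph.shape[0]):
        row = slice(indptr[i], indptr[i + 1])
        # Offsets of the max_degree smallest distances in this row.
        keep = np.argsort(data[row])[:max_degree]
        new_indices.append(indices[row][keep])
        new_data.append(data[row][keep])
        new_indptr.append(new_indptr[-1] + len(keep))
    return csr_matrix(
        (np.concatenate(new_data), np.concatenate(new_indices), new_indptr),
        shape=graph.shape,
    )
```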
I haven't looked at any of the other diversification functions to see whether other code paths trigger this behavior. I can volunteer a PR for this; I assume it could have some numerical effects on the unit tests, though.