Fix duplicate indices in batch NN Descent #702
base: branch-25.04
Conversation
Signed-off-by: jinsolp <[email protected]>
@@ -241,7 +241,7 @@ class AnnNNDescentBatchTest : public ::testing::TestWithParam<AnnNNDescentBatchI
   index_params.metric = ps.metric;
   index_params.graph_degree = ps.graph_degree;
   index_params.intermediate_graph_degree = 2 * ps.graph_degree;
-  index_params.max_iterations = 10;
+  index_params.max_iterations = 100;
Do you still get duplicates with 10 iterations?
This is to be safe and make sure that the graph converges properly (I noticed that it sometimes takes more than 10 iterations to converge; around 15~20 for a dataset of this size).
Running enough iterations becomes important in batching because otherwise randomly initialized indices (with float-max distances) stay in the graph, which may cause duplicate indices (one with a proper distance and the other with a float-max distance).
Also, the non-batching test at the top of this file uses 100 as the max iterations, so I wanted to be consistent!
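To make the failure mode concrete, here is a tiny standalone illustration (not RAFT code; the array contents are made up): if the descent loop stops before a randomly initialized slot is ever overwritten, the same neighbor id can appear in a row twice, once with a proper distance and once still carrying the float-max initialization distance.

```cpp
// Standalone illustration of the duplicate described above (not RAFT code).
#include <cfloat>
#include <cstdio>

int main() {
  const int graph_degree = 4;
  // One row of a k-NN graph: id 42 appears twice because its randomly
  // initialized copy (distance FLT_MAX) was never updated before the loop ended.
  int   indices[graph_degree]   = {42, 7, 42, 13};
  float distances[graph_degree] = {0.12f, 0.35f, FLT_MAX, 0.48f};

  // A duplicate check like the batch test might do: compare every pair of ids.
  for (int i = 0; i < graph_degree; ++i) {
    for (int j = i + 1; j < graph_degree; ++j) {
      if (indices[i] == indices[j]) {
        std::printf("duplicate id %d: slot %d (dist %g) vs slot %d (dist %g)\n",
                    indices[i], i, distances[i], j, distances[j]);
      }
    }
  }
  return 0;
}
```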
Is that the only problem? I thought the bug was that there were duplicates because of minor differences in floating point distances.
We observed a troubling behavior in UMAP with the batching algorithm. It seems that you had to set graph_degree=64 and had to trim it down to n_neighbors=10 in the default case? However, in theory, if intermediate_graph_degree is the same then we should be able to set graph_degree=n_neighbors and still get the same answer. Doing it with a higher graph_degree and then slicing it is causing us to use more memory than necessary.
cc @jcrist
> I thought the bug was that there were duplicates because of minor differences in floating point distances.

The duplicate indices issue is solved by initializing update_counter_. It turns out that while running a few clusters after the first one, update_counter_ sometimes stays at -1 from the first iteration, so none of the iterations within NN Descent run at all. This results in returning the graph from the previous iteration, which shows up as (seemingly) different distances between the same two points.
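For readers following along, a minimal sketch of the kind of fix described above: the member name update_counter_ comes from the discussion, but the surrounding class, loop, and function names are hypothetical stand-ins, not the actual RAFT batch implementation.

```cpp
// Hypothetical sketch of the fix described above (not the actual RAFT code).
#include <atomic>
#include <cstdio>

struct NNDescentBatchRunner {
  std::atomic<int> update_counter_{0};  // tracks graph updates per iteration
  int max_iterations = 100;

  // Placeholder for one local-join pass; returns the number of updates made.
  int do_iteration() { return 0; }

  void run_one_batch(int batch_id) {
    // The fix: reset the counter before every batch. Without this, a stale
    // sentinel (e.g. -1) left over from a previous cluster makes the loop
    // below exit immediately, so no NN Descent iterations run and the graph
    // from the previous cluster is returned unchanged.
    update_counter_.store(0);

    for (int iter = 0; iter < max_iterations; ++iter) {
      update_counter_.store(do_iteration());
      if (update_counter_.load() <= 0) break;  // converged
    }
    std::printf("batch %d done\n", batch_id);
  }
};

int main() {
  NNDescentBatchRunner runner;
  for (int b = 0; b < 3; ++b) runner.run_one_batch(b);
  return 0;
}
```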
I believe the following issues are more on the cuML side of things:
> It seems that you had to set graph_degree=64 and had to trim it down to n_neighbors=10 in the default case?

The default n_neighbors=10 was not chosen by me (it was set to 10 before using NN Descent, so I just left it like that), and graph_degree=64 is there to match the RAFT-side NN Descent index initialization (which was also the default graph degree for NN Descent before linking it with UMAP).
> we should be able to set graph_degree=n_neighbors and still get the same answer. Doing it with a higher graph_degree and then slicing it is causing us to use more memory than necessary.

We do get the same answer, i.e. changing the tests in cuML like this so that nnd_graph_degree is equal to n_neighbors works fine:
cuml_model = cuUMAP(n_neighbors=10, build_algo=build_algo, build_kwds={"nnd_graph_degree": 10})
Should I change the default values in cuML's umap.pyx to match the default value of n_neighbors for memory efficiency?
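For context, a small stand-in showing the parameter relationship being discussed; the struct below only mirrors the index_params fields visible in the test diff above and is not the actual RAFT/cuVS header.

```cpp
// Stand-in for the NN Descent index_params fields seen in the diff above
// (illustrative only, not the actual RAFT/cuVS header).
#include <cstdio>

struct IndexParams {
  int graph_degree              = 64;   // width of the returned k-NN graph
  int intermediate_graph_degree = 128;  // wider working graph used during descent
  int max_iterations            = 100;
};

int main() {
  const int n_neighbors = 10;  // UMAP's default in this discussion

  // Build directly at the width UMAP needs instead of building at 64 and
  // slicing the first n_neighbors columns afterwards.
  IndexParams index_params;
  index_params.graph_degree              = n_neighbors;
  index_params.intermediate_graph_degree = 64;  // quality knob stays independent
  index_params.max_iterations            = 100;

  std::printf("graph_degree=%d, intermediate_graph_degree=%d\n",
              index_params.graph_degree, index_params.intermediate_graph_degree);
  return 0;
}
```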
> We do get the same answer, i.e. changing the tests in cuML like this so that nnd_graph_degree is equal to n_neighbors works fine:
To clarify, are you saying that after this PR you get the same result with n_neighbors=10, nnd_graph_degree=10 and with n_neighbors=10, nnd_graph_degree=64? Before, this wasn't the case (we thought this was due to the duplicates problem). If so, we should update cuML to change the defaults (and avoid the copy), but that will need to happen after the coming patch release. I also have a branch somewhere where I did this before we noticed the bug; happy to push that up once this is merged.
Ahhh I see. Yes, I checked by building cuML with the corresponding fixes in the RAFT branch (linked above in the PR). I looked into it manually and also checked that the cuML tests run properly with the changed nnd_graph_degree.
@jinsolp thanks for the explanations. In cuML, we do not need to change the default value of n_neighbors. We should just set graph_degree = n_neighbors when running UMAP, so that we can remove the unnecessary matrix slice that is causing the extra memory consumption. Can you replicate this PR in RAFT?
@jcrist can you quickly push up your branch and test if Jinsol's changes work once she has a PR up in RAFT?
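A rough back-of-the-envelope for the memory point above (the row count and 64-bit index width are assumptions for illustration, not measurements from cuML):

```cpp
// Illustrative memory comparison for build-at-64-then-slice vs. building the
// graph directly at graph_degree == n_neighbors (numbers are assumptions).
#include <cstddef>
#include <cstdint>
#include <cstdio>

int main() {
  const std::size_t n_rows      = 1000000;  // hypothetical dataset size
  const std::size_t n_neighbors = 10;
  const std::size_t wide_degree = 64;
  const std::size_t index_bytes = sizeof(int64_t);

  // Build-then-slice: the full-width graph plus the sliced copy both exist.
  std::size_t slice_path  = n_rows * (wide_degree + n_neighbors) * index_bytes;
  // Direct build: only the n_neighbors-wide graph is allocated.
  std::size_t direct_path = n_rows * n_neighbors * index_bytes;

  std::printf("build-then-slice: %zu MiB, direct: %zu MiB\n",
              slice_path >> 20, direct_path >> 20);
  return 0;
}
```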
The PR is already here : )
Purpose of this PR
Handling duplicate indices in batch NN Descent graph.
Resolves the following issues
NEIGHBORS_ANN_NN_DESCENT_TEST
raft#2450

Notes