Fix duplicate indices in batch NN Descent #702

Open · wants to merge 11 commits into base: branch-25.04

Conversation


@jinsolp jinsolp commented Feb 14, 2025

Purpose of this PR

Handling duplicate indices in the batch NN Descent graph.

Resolves the following issues

Notes

  • Also fixed in RAFT here for current use with cuML

@jinsolp jinsolp requested a review from a team as a code owner February 14, 2025 22:49
@github-actions github-actions bot added the cpp label Feb 14, 2025
@cjnolet cjnolet added the bug (Something isn't working) and non-breaking (Introduces a non-breaking change) labels Feb 18, 2025
@@ -241,7 +241,7 @@ class AnnNNDescentBatchTest : public ::testing::TestWithParam<AnnNNDescentBatchI
  index_params.metric = ps.metric;
  index_params.graph_degree = ps.graph_degree;
  index_params.intermediate_graph_degree = 2 * ps.graph_degree;
- index_params.max_iterations = 10;
+ index_params.max_iterations = 100;
Member
Do you still get duplicates with 10 iterations?

@jinsolp jinsolp (Author) Feb 19, 2025
This is to be safe and to make sure that the graph converges properly (I noticed that it sometimes takes more than 10 iterations to converge; around 15 ~ 20 for a dataset of this size).
Running enough iterations becomes especially important when batching, because otherwise randomly initialized indices (with float-max distances) stay in the graph, which may cause duplicate indices (one with a proper distance, and the other with a float-max distance).
Also, the non-batching test at the top of this file uses 100 as the max iterations, so I wanted to be consistent!
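To make that failure mode concrete, here is a minimal illustration with made-up values (not taken from the actual test):

```python
import numpy as np

FLOAT_MAX = np.finfo(np.float32).max

# One row of the k-NN graph after too few iterations (values are made up):
# (5, 0.12) is a properly computed neighbor, while (5, FLOAT_MAX) is a
# leftover random initialization -- together they form a duplicate index.
indices   = np.array([5, 7, 5], dtype=np.int32)
distances = np.array([0.12, 0.30, FLOAT_MAX], dtype=np.float32)

print(len(np.unique(indices)) < len(indices))  # True: row has a duplicate
```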

@divyegala divyegala (Member) Feb 19, 2025

Is that the only problem? I thought the bug was that there were duplicates because of minor differences in floating point distances.

We observed a troubling behavior in UMAP with the batching algorithm. It seems that you had to set graph_degree=64 and then trim it down to n_neighbors=10 in the default case? However, in theory, if intermediate_graph_degree is the same, then we should be able to set graph_degree=n_neighbors and still get the same answer. Doing it with a higher graph_degree and then slicing it causes us to use more memory than necessary.

cc @jcrist
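For scale, a back-of-envelope estimate of the graph footprint (assuming int32 indices plus float32 distances; the row count is illustrative):

```python
# Rough k-NN graph footprint: n_rows * degree * (4 B index + 4 B distance)
n_rows = 1_000_000  # illustrative row count
for degree in (10, 64):
    mib = n_rows * degree * (4 + 4) / 2**20
    print(f"graph_degree={degree}: ~{mib:.0f} MiB")
# graph_degree=10: ~76 MiB
# graph_degree=64: ~488 MiB, most of which is then sliced away again
```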

@jinsolp jinsolp (Author)

> I thought the bug was that there were duplicates because of minor differences in floating point distances.

The duplicate indices issue is solved by initializing update_counter_. It turns out that while running later clusters after the first one, update_counter_ sometimes stays at -1 from the first iteration, so none of the iterations within NN Descent run at all. This returns the graph from the previous cluster, which shows up as (seemingly) different distances between the same two points.
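Schematically (a simplified Python sketch of the control flow, not the actual RAFT C++ code):

```python
# Simplified sketch, not the actual RAFT code: if update_counter_ carries
# over between clusters, the convergence check trips immediately and the
# loop body never runs for any cluster after the first.
class NNDescentSketch:
    def __init__(self):
        self.update_counter_ = 0  # the fix: (re)initialize this per cluster

    def build(self, cluster, max_iterations):
        iterations_run = 0
        for _ in range(max_iterations):
            if self.update_counter_ == -1:  # stale "converged" marker from a
                break                       # previous cluster -> zero work
            self.update_counter_ = self.run_iteration(cluster)
            iterations_run += 1
        return iterations_run

    def run_iteration(self, cluster):
        # placeholder: the real code updates the graph and returns -1
        # once no more neighbor updates happen (converged)
        return -1


nnd = NNDescentSketch()
print(nnd.build("cluster_0", 100))  # 1 iteration, then converges
print(nnd.build("cluster_1", 100))  # bug: 0 iterations, stale graph returned
```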

I believe the following issues are more on the cuML side:

> It seems that you had to set graph_degree=64 and had to trim it down to n_neighbors=10 in the default case?

The default n_neighbors=10 was not chosen by me (it was set to 10 before we used NN Descent, so I just left it like that), and graph_degree=64 matches the RAFT-side NN Descent index initialization (which was also the default graph degree for NN Descent before linking it with UMAP).

> we should be able to set graph_degree=n_neighbors and still get the same answer. Doing it with a higher graph_degree and then slicing it is causing us to use more memory than necessary.

We do get the same answer, i.e. changing the tests in cuML like this so that nnd_graph_degree equals n_neighbors works fine:

cuml_model = cuUMAP(n_neighbors=10, build_algo=build_algo, build_kwds={"nnd_graph_degree": 10})
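For reference, a fuller version of that snippet (the random data and the import path are illustrative assumptions, and it assumes a cuML build that includes the linked RAFT fix):

```python
import numpy as np
from cuml.manifold import UMAP as cuUMAP

X = np.random.rand(50_000, 32).astype(np.float32)  # illustrative data

# nnd_graph_degree == n_neighbors: no oversized graph, no slicing afterwards
umap_small = cuUMAP(n_neighbors=10, build_algo="nn_descent",
                    build_kwds={"nnd_graph_degree": 10})
emb_small = umap_small.fit_transform(X)

# previous behavior: build at degree 64, then trim the graph to n_neighbors
umap_large = cuUMAP(n_neighbors=10, build_algo="nn_descent",
                    build_kwds={"nnd_graph_degree": 64})
emb_large = umap_large.fit_transform(X)
```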

Should I change the default values in cuML's umap.pyx to match the default value of n_neighbors for memory efficiency?

Member
> We do get the same answer, i.e. changing the tests in cuML like this so that nnd_graph_degree equals n_neighbors works fine

To clarify, are you saying that after this PR you get the same result with n_neighbors=10, nnd_graph_degree=10 and n_neighbors=10, nnd_graph_degree=64? Before, this wasn't the case (we thought this was due to the duplicates problem). If so, we should update cuML to change the defaults (and avoid the copy), but that will need to happen after the coming patch release. I also have a branch somewhere where I did this before we noticed the bug; happy to push that up once this is merged.

@jinsolp jinsolp (Author)
Ahhh I see. Yes, I checked by building cuML with the corresponding fixes in the RAFT branch (linked above in the PR).
I looked into it manually and also checked that the cuML tests run properly with the changed nnd_graph_degree.

Member
@jinsolp thanks for the explanations. In cuML, we do not need to change the default value of n_neighbors. We should just set graph_degree = n_neighbors when running UMAP, so that we can remove the unnecessary matrix slice which is causing the memory overconsumption. Can you replicate this PR in RAFT?

@jcrist can you quickly push up your branch and test if Jinsol's changes work once she has a PR up in RAFT?

@jinsolp jinsolp (Author)
The PR is already here : )

Labels: bug, cpp, non-breaking
Project status: In Progress
4 participants