Fix duplicate indices in batch NN Descent #702

Open · wants to merge 11 commits into base: branch-25.04

Conversation


@jinsolp jinsolp commented Feb 14, 2025

Purpose of this PR

Handling duplicate indices in the batch NN Descent graph.

Resolves the following issues

Notes

  • Also fixed in RAFT here for current use with cuML

@jinsolp jinsolp requested a review from a team as a code owner February 14, 2025 22:49
@github-actions github-actions bot added the cpp label Feb 14, 2025
@cjnolet cjnolet added the bug (Something isn't working) and non-breaking (Introduces a non-breaking change) labels Feb 18, 2025
@@ -241,7 +241,7 @@ class AnnNNDescentBatchTest : public ::testing::TestWithParam<AnnNNDescentBatchI
  index_params.metric = ps.metric;
  index_params.graph_degree = ps.graph_degree;
  index_params.intermediate_graph_degree = 2 * ps.graph_degree;
- index_params.max_iterations = 10;
+ index_params.max_iterations = 100;
Member
Do you still get duplicates with 10 iterations?

@jinsolp jinsolp (Author) Feb 19, 2025
This is to be safe and to make sure that the graph converges properly (I noticed that it sometimes takes more than 10 iterations to converge; around 15 ~ 20 for a dataset of this size).
Running enough iterations becomes especially important when batching, because otherwise randomly initialized indices (with float-max distances) stay in the graph, which may cause duplicate indices (one with a proper distance, and the other with a float-max distance).
Also, the non-batching test at the top of this file uses 100 as the max iterations, so I wanted to be consistent!
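To make that failure mode concrete, here is a minimal illustration with made-up values (not taken from the actual test):

```python
import numpy as np

FLOAT_MAX = np.finfo(np.float32).max

# One row of the k-NN graph after too few iterations (values are made up):
# (5, 0.12) is a properly computed neighbor, while (5, FLOAT_MAX) is a
# leftover random initialization -- together they form a duplicate index.
indices   = np.array([5, 7, 5], dtype=np.int32)
distances = np.array([0.12, 0.30, FLOAT_MAX], dtype=np.float32)

print(len(np.unique(indices)) < len(indices))  # True: row has a duplicate
```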

@divyegala divyegala (Member) Feb 19, 2025

Is that the only problem? I thought the bug was that there were duplicates because of minor differences in floating point distances.

We observed a troubling behavior in UMAP with the batching algorithm. It seems that you had to set graph_degree=64 and then trim it down to n_neighbors=10 in the default case? However, in theory, if intermediate_graph_degree is the same, then we should be able to set graph_degree=n_neighbors and still get the same answer. Doing it with a higher graph_degree and then slicing it causes us to use more memory than necessary.

cc @jcrist
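For scale, a back-of-envelope estimate of the graph footprint (assuming int32 indices plus float32 distances; the row count is illustrative):

```python
# Rough k-NN graph footprint: n_rows * degree * (4 B index + 4 B distance)
n_rows = 1_000_000  # illustrative row count
for degree in (10, 64):
    mib = n_rows * degree * (4 + 4) / 2**20
    print(f"graph_degree={degree}: ~{mib:.0f} MiB")
# graph_degree=10: ~76 MiB
# graph_degree=64: ~488 MiB, most of which is then sliced away again
```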

@jinsolp jinsolp (Author)

> I thought the bug was that there were duplicates because of minor differences in floating point distances.

The duplicate indices issue is solved by initializing update_counter_. It turns out that while running later clusters after the first one, update_counter_ sometimes stays at -1 from the first iteration, so none of the iterations within NN Descent run at all. This returns the graph from the previous cluster, which shows up as (seemingly) different distances between the same two points.
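Schematically (a simplified Python sketch of the control flow, not the actual RAFT C++ code):

```python
# Simplified sketch, not the actual RAFT code: if update_counter_ carries
# over between clusters, the convergence check trips immediately and the
# loop body never runs for any cluster after the first.
class NNDescentSketch:
    def __init__(self):
        self.update_counter_ = 0  # the fix: (re)initialize this per cluster

    def build(self, cluster, max_iterations):
        iterations_run = 0
        for _ in range(max_iterations):
            if self.update_counter_ == -1:  # stale "converged" marker from a
                break                       # previous cluster -> zero work
            self.update_counter_ = self.run_iteration(cluster)
            iterations_run += 1
        return iterations_run

    def run_iteration(self, cluster):
        # placeholder: the real code updates the graph and returns -1
        # once no more neighbor updates happen (converged)
        return -1


nnd = NNDescentSketch()
print(nnd.build("cluster_0", 100))  # 1 iteration, then converges
print(nnd.build("cluster_1", 100))  # bug: 0 iterations, stale graph returned
```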

I believe the following issues are more on the cuML side:

> It seems that you had to set graph_degree=64 and had to trim it down to n_neighbors=10 in the default case?

The default n_neighbors=10 was not chosen by me (it was set to 10 before we used NN Descent, so I just left it like that), and graph_degree=64 matches the RAFT-side NN Descent index initialization (which was also the default graph degree for NN Descent before linking it with UMAP).

> we should be able to set graph_degree=n_neighbors and still get the same answer. Doing it with a higher graph_degree and then slicing it is causing us to use more memory than necessary.

We do get the same answer, i.e. changing the tests in cuML like this so that nnd_graph_degree equals n_neighbors works fine:

cuml_model = cuUMAP(n_neighbors=10, build_algo=build_algo, build_kwds={"nnd_graph_degree": 10})
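For reference, a fuller version of that snippet (the random data and the import path are illustrative assumptions, and it assumes a cuML build that includes the linked RAFT fix):

```python
import numpy as np
from cuml.manifold import UMAP as cuUMAP

X = np.random.rand(50_000, 32).astype(np.float32)  # illustrative data

# nnd_graph_degree == n_neighbors: no oversized graph, no slicing afterwards
umap_small = cuUMAP(n_neighbors=10, build_algo="nn_descent",
                    build_kwds={"nnd_graph_degree": 10})
emb_small = umap_small.fit_transform(X)

# previous behavior: build at degree 64, then trim the graph to n_neighbors
umap_large = cuUMAP(n_neighbors=10, build_algo="nn_descent",
                    build_kwds={"nnd_graph_degree": 64})
emb_large = umap_large.fit_transform(X)
```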

Should I change the default values in cuML's umap.pyx to match the default value of n_neighbors for memory efficiency?

Member
> We do get the same answer, i.e. changing the tests in cuML like this so that nnd_graph_degree equals n_neighbors works fine

To clarify, are you saying that after this PR you get the same result with n_neighbors=10, nnd_graph_degree=10 and n_neighbors=10, nnd_graph_degree=64? Before, this wasn't the case (we thought this was due to the duplicates problem). If so, we should update cuML to change the defaults (and avoid the copy), but that will need to happen after the coming patch release. I also have a branch somewhere where I did this before we noticed the bug; happy to push that up once this is merged.

@jinsolp jinsolp (Author)
Ahhh I see. Yes, I checked by building cuML with the corresponding fixes in the RAFT branch (linked above in the PR).
I looked into it manually and also checked that the cuML tests run properly with the changed nnd_graph_degree.

Member
@jinsolp thanks for the explanations. In cuML, we do not need to change the default value of n_neighbors. We should just set graph_degree = n_neighbors when running UMAP, so that we can remove the unnecessary matrix slice which is causing the memory overconsumption. Can you replicate this PR in RAFT?

@jcrist can you quickly push up your branch and test if Jinsol's changes work once she has a PR up in RAFT?

@jinsolp jinsolp (Author)
The PR is already here : )

Labels: bug, cpp, non-breaking
Project status: In Progress
4 participants