HNSW connect components can take an inordinate amount of time #14214
Comments
This can be reproduced with either of the following tests:

public void testSameVectorIndexedMultipleTimes() throws IOException {
  try (Directory d = newDirectory()) {
    // Every document carries the identical 16-dimensional vector.
    float[] vector = new float[16];
    Arrays.fill(vector, 0.5f);
    try (IndexWriter w = new IndexWriter(d, new IndexWriterConfig())) {
      for (int j = 1; j <= 1_000_000; j++) {
        Document doc = new Document();
        doc.add(getKnnVectorField("field", vector, DOT_PRODUCT));
        w.addDocument(doc);
        // Flush every 100 documents to produce many small segments.
        if (j % 100 == 0) {
          w.flush();
        }
      }
      w.commit();
    }
  }
}

public void testFewDistinctVectors() throws IOException {
  try (Directory d = newDirectory()) {
    try (IndexWriter w = new IndexWriter(d, new IndexWriterConfig())) {
      // Only 16 distinct vectors are shared across all documents.
      float[][] f = new float[16][];
      for (int i = 0; i < 16; i++) {
        f[i] = new float[16];
        Arrays.fill(f[i], (float) i);
      }
      for (int i = 0; i < 1_000_000; i++) {
        Document doc = new Document();
        doc.add(getKnnVectorField("field", f[random().nextInt(16)], DOT_PRODUCT));
        w.addDocument(doc);
      }
    }
  }
}

They take tens of minutes, and a thread dump shows the thread stuck at ..., whereas a similar test, even with random vectors, takes 10x less time to finish.
I see 3 problems why it could be if it is poor ** Inefficient Component Exploration (
Description
Connecting components on flush or merge, while good for graphs that are "almost OK" but need to be better connected, can just destroy performance if the vector distribution is poor.
I don't readily have test data, but if you have tightly clustered vectors, or many duplicate vectors, it can take until the "heat death of the universe" to complete.
It seems to me that since connecting components is a "best effort fix up" of the graph, we should add a "cap" on the amount of work it does.
Marking as a bug as I have seen this for real users and it effectively takes a CPU hostage for hours (maybe days).
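To make the "cap" idea concrete, here is a minimal sketch of a budgeted fix-up on a plain undirected adjacency-list graph. It is not Lucene's code: the class `CappedComponentConnector`, its methods, and the `maxLinks` parameter are all hypothetical names. It also assumes the unit of work is one added link. In a real HNSW builder the expensive part would more likely be the graph search used to pick a link target, so the budget would probably count that instead.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Hypothetical sketch of a work-capped "connect components" pass; not Lucene's implementation. */
final class CappedComponentConnector {

  /** Labels every node with a component id using BFS and returns the number of components. */
  static int labelComponents(List<List<Integer>> adj, int[] label) {
    Arrays.fill(label, -1);
    int count = 0;
    Deque<Integer> queue = new ArrayDeque<>();
    for (int start = 0; start < adj.size(); start++) {
      if (label[start] != -1) {
        continue; // already reached from an earlier start node
      }
      label[start] = count;
      queue.add(start);
      while (!queue.isEmpty()) {
        int node = queue.poll();
        for (int nbr : adj.get(node)) {
          if (label[nbr] == -1) {
            label[nbr] = count;
            queue.add(nbr);
          }
        }
      }
      count++;
    }
    return count;
  }

  /**
   * Best-effort fix-up: links stray components to node 0's component, adding at most
   * maxLinks edges (the assumed "cap"). Returns the number of components that remain.
   */
  static int connectWithBudget(List<List<Integer>> adj, int maxLinks) {
    int[] label = new int[adj.size()];
    int components = labelComponents(adj, label);
    boolean[] linked = new boolean[components];
    if (components > 0) {
      linked[0] = true; // node 0 always belongs to component 0
    }
    int spent = 0;
    for (int node = 0; node < adj.size() && spent < maxLinks; node++) {
      int c = label[node];
      if (linked[c]) {
        continue;
      }
      // Simplification: link to node 0 instead of searching for a nearest neighbor.
      adj.get(node).add(0);
      adj.get(0).add(node);
      linked[c] = true;
      spent++;
    }
    return components - spent; // each added link merges exactly one component
  }
}
```

With a five-node graph whose only edge is 0-1 (four components), `connectWithBudget(adj, 2)` adds two links and stops, leaving two components rather than continuing until the graph is fully connected. The trade-off is that the graph may stay partly disconnected, which seems acceptable if the pass is only a best-effort repair.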
Version and environment details
No response