HNSW connect components can take an inordinate amount of time #14214

Open
benwtrent opened this issue Feb 7, 2025 · 2 comments
@benwtrent
Member

Description

Connect components on flush or merge, while good for graphs that are "almost OK" but need to be better connected, can destroy performance if the vector distribution is poor.

I don't readily have test data, but with tightly clustered vectors or many duplicate vectors, it can take until the "heat death of the universe" to complete.

It seems to me that since connectComponents is a "best effort fix-up" of the graph, we should add a cap on the amount of work it does.

Marking as a bug as I have seen this for real users and it effectively takes a CPU hostage for hours (maybe days).
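
A minimal sketch of what such a cap could look like, assuming a simple unit-counting budget. `WorkBudget`, `CappedFixup`, and the per-component cost are hypothetical names chosen for illustration; none of them are existing Lucene APIs, and the real cost metric (for example, number of scored vectors) would need to be decided separately.

```java
import java.util.function.IntConsumer;

/** Hypothetical helper: limits how much best-effort work is attempted. */
final class WorkBudget {
  private long remaining;

  WorkBudget(long maxUnits) {
    if (maxUnits < 0) {
      throw new IllegalArgumentException("maxUnits must be >= 0");
    }
    this.remaining = maxUnits;
  }

  /** Consumes the given units; returns false (and empties the budget) if not enough is left. */
  boolean tryConsume(long units) {
    if (units > remaining) {
      remaining = 0;
      return false;
    }
    remaining -= units;
    return true;
  }

  boolean exhausted() {
    return remaining == 0;
  }
}

/** Hypothetical driver: stops connecting components once the budget runs out. */
final class CappedFixup {
  /** Returns how many components were handed to {@code connect} before the cap was hit. */
  static int connectWithCap(
      int componentCount, long costPerComponent, WorkBudget budget, IntConsumer connect) {
    int processed = 0;
    for (int c = 0; c < componentCount; c++) {
      if (!budget.tryConsume(costPerComponent)) {
        break; // best effort: leave the remaining components unconnected
      }
      connect.accept(c);
      processed++;
    }
    return processed;
  }
}
```

With a budget of 10 units and a cost of 3 per component, only the first three components would be attempted; the rest stay as they are, which matches the "best effort" framing.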

Version and environment details

No response

@tteofili
Contributor

tteofili commented Feb 7, 2025

This can be reproduced with either of the following tests:

public void testSameVectorIndexedMultipleTimes() throws IOException {
    try (Directory d = newDirectory()) {
      float[] vector = new float[16];
      Arrays.fill(vector, 0.5f);
      try (IndexWriter w = new IndexWriter(d, new IndexWriterConfig())) {
        for (int j = 1; j <= 1_000_000; j++) {
          Document doc = new Document();
          doc.add(getKnnVectorField("field", vector, DOT_PRODUCT));
          w.addDocument(doc);
          if (j % 100 == 0) {
            w.flush();
          }
        }
        w.commit();
      }

    }
  }

  public void testFewDistinctVectors() throws IOException {
    try (Directory d = newDirectory()) {
      try (IndexWriter w = new IndexWriter(d, new IndexWriterConfig())) {
        float[][] f = new float[16][];
        for (int i = 0; i < 16; i++) {
          f[i] = new float[16];
          Arrays.fill(f[i], (float) i);
        }
        for (int i = 0; i < 1_000_000; i++) {
          Document doc = new Document();
          doc.add(getKnnVectorField("field", f[random().nextInt(16)], DOT_PRODUCT));
          w.addDocument(doc);
        }
      }
    }
  }

Both tests take tens of minutes, and a thread dump shows the thread stuck in hnsw.HnswGraphBuilder.connectComponents,

whereas a similar test with random vectors finishes about 10x faster.

@Vikasht34

private void connectComponents() {
    BitSet visited = new BitSet(graph.size());
    List<Integer> entryPoints = new ArrayList<>();
    
    for (int node = 0; node < graph.size(); node++) {
        if (!visited.get(node)) {
            List<Integer> component = new ArrayList<>();
            exploreComponent(node, visited, component);

            if (!entryPoints.isEmpty()) {
                int bestCandidate = findBestCandidate(entryPoints, component);
                connectNodes(entryPoints.get(0), bestCandidate);
            }

            entryPoints.add(component.get(0));  // Add first node as entry
        }
    }
}

I see three possible reasons why this performs poorly:

**Inefficient Component Exploration (exploreComponent)**

  • Recursively visits every unconnected node, leading to O(N²) complexity in worst-case scenarios.
  • Causes excessive CPU utilization when dealing with large numbers of unconnected components.

**Unbounded Connection Attempts (findBestCandidate)**

  • The method tries to optimally connect every component, which becomes extremely expensive for highly similar vectors.
  • Leads to severe slowdowns when vector clusters are densely packed or contain duplicates.

**Repeated Work (connectNodes)**

  • If multiple small components exist, the function makes too many unnecessary connections.
  • This results in high CPU overhead as the method attempts to fully optimize the graph, even when minimal connectivity is sufficient.
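
To make the exploration step concrete, here is a minimal sketch of component discovery over a plain adjacency array, assuming undirected edges. `ComponentFinder` and its `neighbors` layout are hypothetical and much simpler than the real HNSW graph (whose links are directed and per-level). It uses an explicit queue instead of recursion, so it cannot overflow the stack, and it visits each node and each edge once.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: finds connected components with an iterative breadth-first search. */
final class ComponentFinder {
  /**
   * Returns one list of node ids per connected component.
   * neighbors[i] holds the ids adjacent to node i (assumed undirected for this sketch).
   */
  static List<List<Integer>> components(int[][] neighbors) {
    boolean[] visited = new boolean[neighbors.length];
    List<List<Integer>> result = new ArrayList<>();
    ArrayDeque<Integer> queue = new ArrayDeque<>();
    for (int start = 0; start < neighbors.length; start++) {
      if (visited[start]) {
        continue; // already assigned to an earlier component
      }
      List<Integer> component = new ArrayList<>();
      visited[start] = true;
      queue.add(start);
      while (!queue.isEmpty()) {
        int node = queue.poll();
        component.add(node);
        for (int next : neighbors[node]) {
          if (!visited[next]) {
            visited[next] = true; // mark on enqueue so a node is queued only once
            queue.add(next);
          }
        }
      }
      result.add(component);
    }
    return result;
  }
}
```

In this simplified model the traversal itself is linear in nodes plus edges, so it may be worth profiling whether the time is really spent in exploration or in the per-component candidate search.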

The idea of a cap is a quick fix to prevent unbounded execution in connectComponents(), but it does not solve the root problem, and the work remains computationally expensive. The function still performs O(N²) connectivity checks but gets forcibly terminated instead of completing properly. This means that most of the CPU time is still wasted on redundant checks, and the graph may remain unoptimized or disconnected.

What do you think of using union-find to solve this?

We replace the brute-force merging with Union-Find, which tracks components dynamically:

  1. Reduces Complexity to O(N log N)
    • Union-Find efficiently tracks connected components, avoiding repeated work.

  2. Eliminates Unnecessary Checks
    • Instead of checking every node again, we only merge truly disconnected components.

  3. Stops Early When Graph is Mostly Connected
    • We stop merging early if 95% of the graph is already connected, preventing wasted work.

Some pseudocode:

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class HnswGraphConnectivity {
    private final int[] parent;
    private final int[] rank;
    private final int size;

    public HnswGraphConnectivity(int size) {
        this.size = size;
        parent = new int[size];
        rank = new int[size];
        for (int i = 0; i < size; i++) parent[i] = i; // Each node starts in its own component
    }

    // Path compression: combined with union by rank, find() runs in near-constant amortized time
    public int find(int x) {
        if (parent[x] != x) parent[x] = find(parent[x]); 
        return parent[x];
    }

    // Union by rank: Ensures efficient merging
    public void union(int x, int y) {
        int rootX = find(x);
        int rootY = find(y);
        if (rootX != rootY) {
            if (rank[rootX] > rank[rootY]) parent[rootY] = rootX;
            else if (rank[rootX] < rank[rootY]) parent[rootX] = rootY;
            else {
                parent[rootY] = rootX;
                rank[rootX]++;
            }
        }
    }

    public boolean isConnected(int x, int y) {
        return find(x) == find(y);
    }

    // Returns size of the largest connected component
    public int largestComponentSize() {
        Map<Integer, Integer> componentSize = new HashMap<>();
        for (int i = 0; i < size; i++) {
            int root = find(i);
            componentSize.put(root, componentSize.getOrDefault(root, 0) + 1);
        }
        return Collections.max(componentSize.values());
    }
}

private boolean connectComponents(int level) throws IOException {
        int graphSize = hnsw.size();
        HnswGraphConnectivity connectivity = new HnswGraphConnectivity(graphSize);
        
        // Step 1: Initialize connectivity tracking
        FixedBitSet notFullyConnected = new FixedBitSet(graphSize);
        int maxConn = (level == 0) ? M * 2 : M;
        List<Component> components = HnswUtil.components(hnsw, level, notFullyConnected, maxConn);

        if (infoStream.isEnabled("HNSW")) {
            infoStream.message("HNSW", "Found " + components.size() + " components on level=" + level);
        }

        // Step 2: Record each component's membership.
        // This runs sequentially because union() mutates shared arrays and is not thread-safe.
        for (Component c : components) {
            for (int neighbor : c.nodes()) {
                if (neighbor != c.start()) {
                    connectivity.union(c.start(), neighbor);
                }
            }
        }

        // Step 3: Stop early if graph is sufficiently connected (~95%)
        if (connectivity.largestComponentSize() > (CONNECTIVITY_THRESHOLD_PERCENT / 100.0) * graphSize) {
            if (infoStream.isEnabled("HNSW")) {
                infoStream.message("HNSW", "Early stopping: the largest component already holds more than " + CONNECTIVITY_THRESHOLD_PERCENT + "% of the nodes.");
            }
            return true;
        }

        // Step 4: Connect remaining components intelligently
        GraphBuilderKnnCollector beam = new GraphBuilderKnnCollector(2);
        int[] eps = new int[1];
        UpdateableRandomVectorScorer scorer = scorerSupplier.scorer();

        for (Component c : components) {
            if (!connectivity.isConnected(c.start(), eps[0])) {
                beam.clear();
                eps[0] = c.start();
                scorer.setScoringOrdinal(eps[0]);
                
                // Find the best connection candidate
                try {
                    graphSearcher.searchLevel(beam, scorer, level, eps, hnsw, notFullyConnected);
                    int bestNode = beam.popNode();
                    float score = beam.minimumScore();
                    link(level, bestNode, c.start(), score, notFullyConnected);
                    connectivity.union(bestNode, c.start());
                } catch (Exception e) {
                    if (infoStream.isEnabled("HNSW")) {
                        infoStream.message("HNSW", "Failed to connect component: " + c.start());
                    }
                }
            }
        }

        return true;
    }
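
As a variation on the pseudocode above, the early-stop check can be answered without rescanning all nodes if the union-find structure tracks component sizes as it merges. The following is a minimal, self-contained sketch under that assumption. `SizedUnionFind` and `mostlyConnected` are hypothetical names, it uses union by size rather than union by rank, and `find` is iterative to avoid deep recursion.

```java
/** Hypothetical union-find that tracks the largest component size incrementally. */
final class SizedUnionFind {
  private final int[] parent;
  private final int[] size;
  private int largest;

  SizedUnionFind(int n) {
    parent = new int[n];
    size = new int[n];
    for (int i = 0; i < n; i++) {
      parent[i] = i; // each node starts in its own component
      size[i] = 1;
    }
    largest = n == 0 ? 0 : 1;
  }

  /** Iterative find with path compression. */
  int find(int x) {
    int root = x;
    while (parent[root] != root) {
      root = parent[root];
    }
    while (parent[x] != root) { // second pass: point every node on the path at the root
      int next = parent[x];
      parent[x] = root;
      x = next;
    }
    return root;
  }

  /** Union by size: the smaller tree is attached under the larger one. */
  void union(int a, int b) {
    int ra = find(a);
    int rb = find(b);
    if (ra == rb) {
      return;
    }
    if (size[ra] < size[rb]) {
      int tmp = ra;
      ra = rb;
      rb = tmp;
    }
    parent[rb] = ra;
    size[ra] += size[rb];
    largest = Math.max(largest, size[ra]);
  }

  int largestComponentSize() {
    return largest;
  }

  /** True if the largest component holds at least the given fraction of all nodes. */
  boolean mostlyConnected(double fraction) {
    return largest >= fraction * parent.length;
  }
}
```

Whether 95% is the right threshold, and whether leaving the remaining nodes unreachable is acceptable for recall, would still need to be validated against real data.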
