Better support for storing points in multiple lists #5
Maybe the issue is that while recall goes up with more build_probes, the size of the clusters also increases, which makes each query/cluster scan slower.
There are advantages to query_probes that do not exist for build_probes. As one increases n_clusters (i.e. reduces the size of the clusters) and increases query_probes, one converges towards searching through a ball centered at the query. Thus, query_probes improves the query-adaptivity of the search process. Searching many small clusters gives better recall than searching a few large clusters: in the latter case, retrieval fails whenever the query and the true nearest neighbor are close yet on opposite sides of a boundary between clusters. Increasing build_probes does not yield the same query-adaptivity. Instead, the IVF clusters essentially cover overlapping regions. But there's no way to make a cluster region expand in the direction of a query point without also expanding it in the opposite direction, because cluster regions are expanded at build time, not query time.
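The distinction can be sketched in a few lines of plain NumPy. This is a hypothetical toy IVF, not this repository's actual code: `build_ivf` assigns each point to its `build_probes` nearest centroids (overlapping regions fixed at build time), while `query_ivf` probes the `query_probes` centroids nearest to the query (a region that adapts to the query):

```python
import numpy as np

def build_ivf(data, centroids, build_probes=1):
    # Assign every point to its `build_probes` nearest centroids.
    # (Hypothetical sketch, not the repository's actual build routine.)
    lists = [[] for _ in centroids]
    dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    for i, d in enumerate(dists):
        for c in np.argsort(d)[:build_probes]:
            lists[c].append(i)
    return lists

def query_ivf(q, data, centroids, lists, query_probes=1, k=1):
    # Probe the `query_probes` clusters nearest to the query. With many
    # small clusters and a large query_probes, the probed region
    # approaches a ball centered at q -- the query-adaptivity that
    # build_probes alone cannot provide.
    order = np.argsort(np.linalg.norm(centroids - q, axis=1))
    candidates = {i for c in order[:query_probes] for i in lists[c]}
    return sorted(candidates, key=lambda i: np.linalg.norm(data[i] - q))[:k]
```

With `build_probes > 1`, the same point appears in several of the probed lists, which is exactly where the duplicate-handling question below comes from.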
Since 2df6a42 it is possible to store every datapoint in `n` lists by building with `ivf.build(n_probes=n)`. This improves recall/qps quite a lot, but only when going from `n=1` to `n=2`, as seen in the attached figure. The problem is probably that duplicate matches aren't handled well.
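A tiny illustration of why unhandled duplicates hurt: when a point is stored in several lists, a naive merge of per-list results lets the same global id occupy multiple top-k slots, crowding out genuinely distinct candidates (hypothetical helper, not this repository's code):

```python
import heapq

def naive_topk(per_list_results, k):
    # per_list_results: one (distance, global_id) list per probed IVF list.
    # A plain k-way merge does not recognize that the same global id can
    # come back from several lists when points are stored in n > 1 lists.
    merged = [pair for lst in per_list_results for pair in lst]
    return heapq.nsmallest(k, merged)

# Point 7 is stored in both probed lists, so it fills two of three slots:
res = naive_topk([[(0.1, 7), (0.5, 3)], [(0.1, 7), (0.4, 9)]], k=3)
# res == [(0.1, 7), (0.1, 7), (0.4, 9)] -- point 3 is pushed out
```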
When calling `ctop` from ivf, we should somehow tell it about the indices we have already collected, so the distance table can focus on telling us about alternative interesting candidates. One option is to reuse the `query_pq_sse(transformed_data, self.tables, indices, values, True)` call by invoking it on multiple `(transformed_data, tables)` pairs while keeping `(indices, values)` fixed. That way we would also do the rescoring/pass-2 only a single time, on all candidates retrieved from the different lists.

The issue is that (1) the binary heap data structure we use can't recognize duplicates, and (2) the `query_pq_sse` function only knows the "local" id of a point in a list, not the global id.

To solve (2) we could pass a list with the global ids of all the points considered. This would be some extra overhead for `query_pq_sse` to pass around, but perhaps not that much, and we wouldn't have to "relabel" the returned ids afterwards. For (1) we could switch back to using insertion sort, or just try heuristically to remove some of the duplicates the heap is able to find.