Better support for storing points in multiple lists #5

Open
thomasahle opened this issue Apr 11, 2023 · 4 comments
Labels
enhancement New feature or request

Comments

@thomasahle
Owner

Since 2df6a42 it is possible to store every datapoint in n lists by building with ivf.build(n_probes=n).
This improves the recall/QPS trade-off quite a lot, but only when going from n=1 to n=2, as seen in the attached figure.
Figure_1
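
For readers unfamiliar with the idea, here is a minimal NumPy sketch of what "storing every datapoint in n lists" means. It is not the code from 2df6a42: `assign_to_lists` and the brute-force distance computation are assumptions made for illustration, and the centroids are just random data points rather than the result of k-means.

```python
import numpy as np

def assign_to_lists(data, centroids, build_probes):
    """Hypothetical helper: put each point into its `build_probes` nearest lists.

    Returns one array of global point ids per centroid.
    """
    # Squared Euclidean distance from every point to every centroid.
    dists = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    # Indices of the build_probes closest centroids for each point.
    nearest = np.argsort(dists, axis=1)[:, :build_probes]
    lists = [[] for _ in range(len(centroids))]
    for point_id, cluster_ids in enumerate(nearest):
        for c in cluster_ids:
            lists[c].append(point_id)
    return [np.array(ids, dtype=np.int64) for ids in lists]

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 8))
# Assumption: random data points stand in for trained centroids.
centroids = data[rng.choice(len(data), size=16, replace=False)]
lists = assign_to_lists(data, centroids, build_probes=2)
total = sum(len(ids) for ids in lists)
```

With `build_probes=2` every point appears in two lists, so the total number of stored entries (and the average list size) doubles. That is also why the same point can be returned by several lists at query time.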

The problem is probably that duplicate matches aren't handled well.
When calling ctop from ivf, we should somehow tell it about the indices we have already collected so the distance table can focus on telling us about alternative interesting candidates.

One option is even to reuse the query_pq_sse(transformed_data, self.tables, indices, values, True) call by calling it on multiple (transformed_data, tables) pairs while keeping (indices, values) fixed. That way we would also do rescoring/pass-2 only once, on all candidates retrieved from the different lists.

The issue is that
(1) the binary heap data structure we use can't recognize duplicates, and
(2) the query_pq_sse function only knows the "local" id of a point in a list, not the global id.

To solve (2) we could pass a list with the global ids of all the points considered. This would be some extra overhead for query_pq_sse to pass around, but perhaps not that much. And we wouldn't have to "relabel" the returned ids afterwards.

For (1) we could switch back to using insertion sort, or just try heuristically to remove some of the duplicates the heap is able to find.
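
To make the two fixes concrete, here is a rough pure-Python sketch. It assumes each probed list comes with an array of global ids, and it pairs a bounded heap with a set so duplicates are skipped. `merge_probes` and its input layout are hypothetical; this does not reflect how query_pq_sse works internally.

```python
import heapq

def merge_probes(probes, k):
    """Hypothetical merge of candidates from several lists into one top-k.

    probes: iterable of (global_ids, local_hits), where local_hits is a
            list of (distance, local_id) pairs produced for one list.
    Returns the k closest distinct points as (distance, global_id),
    sorted by increasing distance.
    """
    heap = []        # max-heap via negated distances, at most k entries
    in_heap = set()  # global ids currently held in the heap
    for global_ids, local_hits in probes:
        for dist, local_id in local_hits:
            gid = global_ids[local_id]  # solves (2): local id -> global id
            if gid in in_heap:          # solves (1): skip duplicates
                continue
            if len(heap) < k:
                heapq.heappush(heap, (-dist, gid))
                in_heap.add(gid)
            elif dist < -heap[0][0]:
                # Replace the current worst candidate with the better one.
                _, evicted = heapq.heapreplace(heap, (-dist, gid))
                in_heap.discard(evicted)
                in_heap.add(gid)
    return sorted((-neg_dist, gid) for neg_dist, gid in heap)

# Points 10 and 12 are stored in both lists.
list_a = ([10, 11, 12], [(0.1, 0), (0.4, 2)])
list_b = ([12, 10, 13], [(0.1, 1), (0.3, 2), (0.4, 0)])
top3 = merge_probes([list_a, list_b], k=3)
top2 = merge_probes([list_a, list_b], k=2)
```

Here `top3` contains points 10, 13 and 12 exactly once each, so the duplicated points no longer take up extra slots in the result. The set lookup is the extra overhead mentioned above; whether it is cheap enough inside the SSE code path would need measuring.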

@thomasahle thomasahle added the enhancement New feature or request label Apr 11, 2023
@thomasahle
Owner Author

thomasahle commented Apr 12, 2023

Improved a lot now after adding a new query function.
However, there's still an issue where later query_sse calls don't know which ids have already been found. As we increase build_probes, the duplicate points start crowding out other useful points, which actually reduces our recall.

plot

@thomasahle
Owner Author

We now use the same priority queue for all probes and check for duplicate ids.
While this has made things faster, it unfortunately hasn't changed the fact that we still don't get much of a benefit from storing points in multiple lists.
plot

@thomasahle
Owner Author

thomasahle commented Apr 19, 2023

Maybe the issue is that while the recall goes up with more build_probes, the size of the clusters also increases, which makes each cluster slower to query.
Currently we don't change the number of clusters as we increase build_probes, but we probably should make it something like sqrt(n * build_probes).
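
A quick back-of-the-envelope check of that suggestion, assuming n in the formula above is the number of datapoints (the formula does not say so explicitly). Both helper functions are hypothetical and only do the arithmetic: with a fixed cluster count the average list grows linearly in build_probes, while with sqrt(n * build_probes) clusters the average list size stays roughly equal to the number of clusters.

```python
import math

def suggested_n_clusters(n_points, build_probes):
    """Hypothetical helper implementing n_clusters = sqrt(n * build_probes)."""
    return max(1, round(math.sqrt(n_points * build_probes)))

def average_list_size(n_points, build_probes, n_clusters):
    """Each point is stored build_probes times, spread over n_clusters lists."""
    return n_points * build_probes / n_clusters

n_points = 1_000_000
fixed = round(math.sqrt(n_points))  # cluster count that ignores build_probes
for build_probes in (1, 2, 4):
    scaled = suggested_n_clusters(n_points, build_probes)
    print(
        build_probes,
        average_list_size(n_points, build_probes, fixed),   # grows with build_probes
        scaled,
        average_list_size(n_points, build_probes, scaled),  # stays close to `scaled`
    )
```

For example, with a million points and build_probes=4, keeping 1000 clusters gives lists of 4000 entries on average, whereas 2000 clusters bring that back down to 2000.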

@calvinmccarter

There are advantages to query_probes that do not exist for build_probes.

As one increases n_clusters (i.e., reduces the size of the clusters) and increases query_probes, one converges towards searching through a ball centered at the query. Thus, query_probes improves the query-adaptivity of the search process. Searching many small clusters gives better recall than searching a few large clusters: in the latter case, retrieval fails whenever the query and the true nearest neighbor are close yet on opposite sides of a boundary between clusters.

Increasing build_probes does not yield the same query-adaptivity. Instead, the IVF clusters are essentially covering overlapping regions. But there's no way to make a cluster region expand in the direction of a query-point, without also expanding in the opposite direction, because cluster regions are expanded at build-time, not query-time.
