Better support for storing points in multiple lists #5
Maybe the issue is that while recall goes up with more build_probes, the size of the clusters also increases, which makes each query/cluster scan slower.
There are advantages to query_probes that do not exist for build_probes. As one increases n_clusters (i.e. reduces the size of the clusters) and increases query_probes, one converges towards searching through a ball centered at the query. Thus, query_probes improves the query-adaptivity of the search process. Searching many small clusters gives better recall than searching a few large clusters: in the latter case, retrieval fails whenever the query and the true nearest neighbor are close yet on opposite sides of a boundary between clusters. Increasing build_probes does not yield the same query-adaptivity. Instead, the IVF clusters essentially cover overlapping regions. But there's no way to make a cluster region expand in the direction of a query point without also expanding it in the opposite direction, because cluster regions are expanded at build time, not query time.
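The distinction can be sketched in a few lines of plain NumPy. This is a hypothetical toy IVF, not this repository's actual code: `build_ivf` assigns each point to its `build_probes` nearest centroids (overlapping regions fixed at build time), while `query_ivf` probes the `query_probes` centroids nearest to the query (a region that adapts to the query):

```python
import numpy as np

def build_ivf(data, centroids, build_probes=1):
    # Assign every point to its `build_probes` nearest centroids.
    # (Hypothetical sketch, not the repository's actual build routine.)
    lists = [[] for _ in centroids]
    dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    for i, d in enumerate(dists):
        for c in np.argsort(d)[:build_probes]:
            lists[c].append(i)
    return lists

def query_ivf(q, data, centroids, lists, query_probes=1, k=1):
    # Probe the `query_probes` clusters nearest to the query. With many
    # small clusters and a large query_probes, the probed region
    # approaches a ball centered at q -- the query-adaptivity that
    # build_probes alone cannot provide.
    order = np.argsort(np.linalg.norm(centroids - q, axis=1))
    candidates = {i for c in order[:query_probes] for i in lists[c]}
    return sorted(candidates, key=lambda i: np.linalg.norm(data[i] - q))[:k]
```

With `build_probes > 1`, the same point appears in several of the probed lists, which is exactly where the duplicate-handling question below comes from.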
Since 2df6a42 it is possible to store every datapoint in `n` lists by building with `ivf.build(n_probes=n)`. This improves recall/qps quite a lot, but only when going from `n=1` to `n=2`, as seen in the attached figure. The problem is probably that duplicate matches aren't handled well.
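A tiny illustration of why unhandled duplicates hurt: when a point is stored in several lists, a naive merge of per-list results lets the same global id occupy multiple top-k slots, crowding out genuinely distinct candidates (hypothetical helper, not this repository's code):

```python
import heapq

def naive_topk(per_list_results, k):
    # per_list_results: one (distance, global_id) list per probed IVF list.
    # A plain k-way merge does not recognize that the same global id can
    # come back from several lists when points are stored in n > 1 lists.
    merged = [pair for lst in per_list_results for pair in lst]
    return heapq.nsmallest(k, merged)

# Point 7 is stored in both probed lists, so it fills two of three slots:
res = naive_topk([[(0.1, 7), (0.5, 3)], [(0.1, 7), (0.4, 9)]], k=3)
# res == [(0.1, 7), (0.1, 7), (0.4, 9)] -- point 3 is pushed out
```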
When calling `ctop` from ivf, we should somehow tell it about the indices we have already collected, so the distance table can focus on telling us about alternative interesting candidates. One option is to reuse the `query_pq_sse(transformed_data, self.tables, indices, values, True)` call by invoking it on multiple `(transformed_data, tables)` pairs while keeping `(indices, values)` fixed. That way we would also do the rescoring/pass-2 only a single time, on all candidates retrieved from the different lists.

The issue is that (1) the binary heap data structure we use can't recognize duplicates, and (2) the `query_pq_sse` function only knows the "local" id of a point in a list, not the global id.

To solve (2) we could pass a list with the global ids of all the points considered. This would be some extra overhead for `query_pq_sse` to pass around, but perhaps not that much, and we wouldn't have to "relabel" the returned ids afterwards. For (1) we could switch back to using insertion sort, or just try heuristically to remove some of the duplicates the heap is able to find.