-
Notifications
You must be signed in to change notification settings - Fork 99
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
merging: In compound shards, sort repos by id #887
base: main
Are you sure you want to change the base?
Conversation
We sort shards in compound shards by repository ID so that we can use binary search to quickly find a specific repo when calling `build.findShard`. We have seen that for very large compound shards it can take a long time to find a specific repo which can slow down initial indexing even for no-op indexes after a restart of indexserver. The sort order will impact ranking of non-bm25 search, because doc-order acts as a tie-breaker. However, the actual impact is likely very small, because "LastCommitDate" is a stronger signal. Test plan: New unit test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! No longer sorting on priority
makes sense to me. I think this will have a nice performance impact.
I didn't see an update to buildShard
... did you mean to include that in this PR?
sort.Slice(ds, func(i, j int) bool { | ||
return ds[i].repoMetaData[0].priority > ds[j].repoMetaData[0].priority | ||
return ds[i].repoMetaData[0].ID < ds[j].repoMetaData[0].ID |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'd be good to document this invariant on indexData.repoMetaData
(that this list must always be sorted by repo ID).
I first wanted to find out if there are any self-hosted customers who have enabled shard merging. If not, we don't have to worry about falling back to brute force search if binary search doesn't return a result. |
I thought a bit more about this. We have to assume that all self-hosted customers have shard merging enabled. I think the best course of action is to increment the indexFormatVersion and use that to signal when we can use binary search. I will update this PR once I have something ready. |
We sort shards in compound shards by repository ID so that we can use binary search instead of looping through the entire list of repos when when calling
build.findShard
.We have seen that for very large compound shards (5k+ repos) it can take a long time to find a specific repo which can slow down initial indexing even for no-op indexes after a restart of indexserver.
The sort order will impact ranking of non-bm25 search, because doc-order acts as a tie-breaker. However, the actual impact is very small, because "LastCommitDate" is a stronger signal.
Note:
We moved from
priority
toLatestCommitDate
as tie breaker in #832. Hence, sorting repos in compound shards by priority doesn't make much sense anymore.Test plan:
New unit test