Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

merging: In compound shards, sort repos by id #887

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

stefanhengl
Copy link
Member

We sort shards in compound shards by repository ID so that we can use binary search instead of looping through the entire list of repos when when calling build.findShard.

We have seen that for very large compound shards (5k+ repos) it can take a long time to find a specific repo which can slow down initial indexing even for no-op indexes after a restart of indexserver.

The sort order will impact ranking of non-bm25 search, because doc-order acts as a tie-breaker. However, the actual impact is very small, because "LastCommitDate" is a stronger signal.

Note:
We moved from priority to LatestCommitDate as tie breaker in #832. Hence, sorting repos in compound shards by priority doesn't make much sense anymore.

Test plan:
New unit test

We sort shards in compound shards by repository ID so that we can use
binary search to quickly find a specific repo when calling
`build.findShard`.

We have seen that for very large compound shards it can take a long time
to find a specific repo which can slow down initial indexing even for
no-op indexes after a restart of indexserver.

The sort order will impact ranking of non-bm25 search, because doc-order acts
as a tie-breaker. However, the actual impact is likely very small,
because "LastCommitDate" is a stronger signal.

Test plan:
New unit test
@stefanhengl stefanhengl requested a review from a team January 13, 2025 15:10
@cla-bot cla-bot bot added the cla-signed label Jan 13, 2025
Copy link
Member

@jtibshirani jtibshirani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! No longer sorting on priority makes sense to me. I think this will have a nice performance impact.

I didn't see an update to buildShard... did you mean to include that in this PR?

sort.Slice(ds, func(i, j int) bool {
return ds[i].repoMetaData[0].priority > ds[j].repoMetaData[0].priority
return ds[i].repoMetaData[0].ID < ds[j].repoMetaData[0].ID
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be good to document this invariant on indexData.repoMetaData (that this list must always be sorted by repo ID).

@stefanhengl
Copy link
Member Author

I didn't see an update to buildShard... did you mean to include that in this PR?

I first wanted to find out if there are any self-hosted customers who have enabled shard merging. If not, we don't have to worry about falling back to brute force search if binary search doesn't return a result.

@stefanhengl
Copy link
Member Author

I didn't see an update to buildShard... did you mean to include that in this PR?

I first wanted to find out if there are any self-hosted customers who have enabled shard merging. If not, we don't have to worry about falling back to brute force search if binary search doesn't return a result.

I thought a bit more about this. We have to assume that all self-hosted customers have shard merging enabled. I think the best course of action is to increment the indexFormatVersion and use that to signal when we can use binary search. I will update this PR once I have something ready.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants