MB-58901: Introduce support for BM25 scoring (#2113)
Introduce support for BM25 scoring.

Key stats necessary for the scoring:
- fieldLength - the number of terms in a field within a doc.
- avgDocLength - the average number of terms in a field across all the docs in the index.
- totalDocs - the total number of docs in an index.

Introduces a mechanism to maintain consistent scoring when the index is partitioned behind a `bleve.IndexAlias`. This is achieved using the existing preSearch mechanism: the first phase of the search fetches the above stats from each bleve index, aggregates them, and redistributes them back to the bleve indexes, which then use them when calculating the score for a hit.

To enable this global scoring mechanism, the user needs to set the `context` argument of SearchInContext with: `ctx = context.WithValue(ctx, search.SearchTypeKey, search.GlobalScoring)`

To actually use BM25, the user needs to explicitly select it as the scoring mechanism at the `indexMapping.ScoringModel` level. This parameter is a global setting, i.e. when performing a search on multiple fields, all the fields are scored with the same scoring model.

The storage layer exposes an API that returns the number of terms in a field's term dictionary, which is used to compute `avgDocLength`. At the indexing layer, we check whether the queried field supports BM25 scoring and whether consistent scoring is enabled. The stats are then fetched either from the local bleve index or, when consistent scoring is enabled, from the context, and used to compute the actual score.

Note: The scoring is highly dependent on the size of an individual bleve index's termDictionary (specific to a field), so there can be some discrepancies between partitions, especially given that each index is further composed of multiple 'segments'. However, in large scale use cases these discrepancies can be quite small and may not affect the order of the doc hits, in which case the user may choose to skip global scoring altogether.
---------

Co-authored-by: Aditi Ahuja <[email protected]>
Co-authored-by: Abhinav Dangeti <[email protected]>
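
The two-phase flow described in the message (collect per-partition stats, aggregate them, then score with the shared values) can be sketched in a few lines of Go. This is a minimal illustration, not the code from this commit: the `FieldStats` type and the `Aggregate` and `BM25Term` functions are hypothetical names, the parameters `k1 = 1.2` and `b = 0.75` are the conventional BM25 defaults rather than values confirmed by the commit, and the sketch assumes each partition can report the sum of field lengths so that `avgDocLength` can be derived.

```go
package main

import (
	"fmt"
	"math"
)

// Conventional BM25 parameters; assumed here, not taken from the commit.
const (
	k1 = 1.2
	b  = 0.75
)

// FieldStats is a hypothetical container for the per-partition stats
// of a single field.
type FieldStats struct {
	TotalDocs   uint64 // totalDocs: number of docs in this partition
	SumFieldLen uint64 // sum of fieldLength over all docs in this partition
}

// Aggregate merges the stats of all partitions so that every partition
// can score against the same totalDocs and avgDocLength.
func Aggregate(parts []FieldStats) (totalDocs uint64, avgDocLength float64) {
	var sumLen uint64
	for _, p := range parts {
		totalDocs += p.TotalDocs
		sumLen += p.SumFieldLen
	}
	if totalDocs == 0 {
		return 0, 0
	}
	return totalDocs, float64(sumLen) / float64(totalDocs)
}

// BM25Term scores one query term against one doc using the textbook
// BM25 formula. tf is the term frequency in the field, docFreq the
// number of docs containing the term.
func BM25Term(tf, fieldLength float64, docFreq, totalDocs uint64, avgDocLength float64) float64 {
	n, N := float64(docFreq), float64(totalDocs)
	idf := math.Log(1 + (N-n+0.5)/(n+0.5))

	// Length normalization: docs longer than average are penalized.
	norm := 1.0
	if avgDocLength > 0 {
		norm = 1 - b + b*fieldLength/avgDocLength
	}
	return idf * tf * (k1 + 1) / (tf + k1*norm)
}

func main() {
	// Phase 1: gather stats from each partition and aggregate them.
	parts := []FieldStats{
		{TotalDocs: 100, SumFieldLen: 1000},
		{TotalDocs: 300, SumFieldLen: 5000},
	}
	totalDocs, avgDocLength := Aggregate(parts)
	fmt.Printf("totalDocs=%d avgDocLength=%.2f\n", totalDocs, avgDocLength)

	// Phase 2: every partition scores hits with the shared stats.
	fmt.Printf("score=%.4f\n", BM25Term(2, 10, 25, totalDocs, avgDocLength))
}
```

Because both partitions score with the same `totalDocs` and `avgDocLength`, a given doc receives the same score regardless of which partition holds it, which is the property the global scoring mode is meant to provide.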
1 parent bd57cb6, commit cbafdca
Showing 20 changed files with 668 additions and 83 deletions.