MB-58901: Introduce support for BM25 scoring #2113
@Thejas-bhat it seems you'll need to push up the |
@Thejas-bhat Make sure to pull the latest from origin/bm25-refactor
before you push more commits here :)
@Thejas-bhat a thought regarding the "global scoring" code path for bm25 - what is the default behavior in elastic? Would you add a couple of Go benchmark tests to differentiate between bm25 with and without global scoring, and record these numbers within the commit message? I'm trying to decide whether to enable "global scoring" by default.
By default, elastic disables the feature. It'll be a bit difficult to benchmark this at the Go unit-test level here, because the latency is mainly visible when the index alias has multiple shards, each of which is spread across multiple nodes.
Minor refactor suggestion, looks good to me otherwise.
Can we also add a unit test comparing bm25 to tf-idf scoring over the same dataset?
I've added a unit test on a sample 10-doc dataset; however, I don't think there is anything concrete we can compare there apart from tf-idf scores being higher than bm25 (due to the algorithmic difference), so it's more of a placeholder, honestly. A much better test would be to have a large enough dataset and two indexes, one with bm25 and the other with tf-idf, and then compare the relevancy of the hits from each of them (https://www.elastic.co/blog/found-bm-vs-lucene-default-similarity)
Introducing support for BM25 scoring
Key stats necessary for the scoring
Introduces a mechanism to maintain consistent scoring in a situation where the index is partitioned as a `bleve.IndexAlias`. This is achieved using the existing preSearch mechanism, where the first phase of the search fetches the above-mentioned stats, aggregates them, and redistributes them back to the bleve indexes, which use them while calculating the score for a hit. To enable this global scoring mechanism, the user needs to set the `context` argument of `SearchInContext` with: `ctx = context.WithValue(ctx, search.SearchTypeKey, search.GlobalScoring)`
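To illustrate the aggregation step of the preSearch phase, here is a minimal, self-contained sketch. It is not bleve's implementation: the `FieldStats` type, its fields, and `mergeStats` are hypothetical names chosen for illustration, and the exact set of stats bleve exchanges may differ.

```go
package main

import "fmt"

// FieldStats is a HYPOTHETICAL per-partition summary for a single field.
// The real stats exchanged during preSearch may be shaped differently.
type FieldStats struct {
	DocCount       uint64            // documents in this partition
	FieldLengthSum uint64            // total number of terms in the field across those documents
	TermDocFreq    map[string]uint64 // per query term: documents containing it
}

// mergeStats sums the per-partition stats into one global view, which
// would then be handed back to every partition before scoring.
func mergeStats(parts []FieldStats) FieldStats {
	global := FieldStats{TermDocFreq: make(map[string]uint64)}
	for _, p := range parts {
		global.DocCount += p.DocCount
		global.FieldLengthSum += p.FieldLengthSum
		for term, df := range p.TermDocFreq {
			global.TermDocFreq[term] += df
		}
	}
	return global
}

// AvgDocLength returns the mean field length, or 0 for an empty index.
func (s FieldStats) AvgDocLength() float64 {
	if s.DocCount == 0 {
		return 0
	}
	return float64(s.FieldLengthSum) / float64(s.DocCount)
}

func main() {
	parts := []FieldStats{
		{DocCount: 100, FieldLengthSum: 1000, TermDocFreq: map[string]uint64{"bleve": 10}},
		{DocCount: 300, FieldLengthSum: 6000, TermDocFreq: map[string]uint64{"bleve": 5, "search": 40}},
	}
	g := mergeStats(parts)
	fmt.Println(g.DocCount, g.AvgDocLength(), g.TermDocFreq["bleve"])
}
```

Because every partition scores against the same merged counts, a term's weight no longer depends on which partition a document happens to live in.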
Implementation-wise, the user needs to explicitly set BM25 as the scoring model at the `indexMapping.ScoringModel` level to actually use it. This parameter is a global setting, i.e. when performing a search on multiple fields, all the fields are scored with the same scoring model.
The storage layer exposes an API that returns the number of terms in a field's term dictionary, which is used to compute the `avgDocLength`. At the indexing layer, we check whether the queried field supports BM25 scoring and whether consistent scoring is requested. The stats are then fetched either from the local bleve index or from the context (when consistent scoring is in use) to compute the actual score.
Note: The scoring is highly dependent on the size of an individual bleve index's termDictionary (specific to a field), so there can be some discrepancies, especially given that each index is further composed of multiple 'segments'. However, in large-scale use cases these discrepancies can be quite small and don't affect the order of the doc hits, in which case the user may choose to skip global scoring altogether.
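To make the dependence on these stats concrete, below is a small sketch of the textbook BM25 per-term score. This is an assumption-laden illustration rather than the code in this PR: the `k1 = 1.2` and `b = 0.75` values are the commonly used defaults, the IDF uses the Lucene-style smoothed form, and the function names are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// Commonly used BM25 defaults; ASSUMED here, not taken from this PR.
const (
	k1 = 1.2
	b  = 0.75
)

// bm25IDF is the Lucene-style smoothed inverse document frequency.
// docCount is the number of documents, docFreq the number containing the term.
func bm25IDF(docCount, docFreq float64) float64 {
	return math.Log(1 + (docCount-docFreq+0.5)/(docFreq+0.5))
}

// bm25TermScore scores one term in one document.
// tf is the term frequency in the field, docLen the field length in terms,
// and avgDocLen the average field length across the index.
func bm25TermScore(tf, docLen, avgDocLen, docCount, docFreq float64) float64 {
	norm := 1.0
	if avgDocLen > 0 {
		// Length normalization: longer-than-average fields are penalized.
		norm = 1 - b + b*(docLen/avgDocLen)
	}
	return bm25IDF(docCount, docFreq) * (tf * (k1 + 1)) / (tf + k1*norm)
}

func main() {
	// Same document and term, scored with one partition's local stats
	// versus stats aggregated across all partitions (values are made up).
	local := bm25TermScore(3, 20, 10, 100, 10)
	global := bm25TermScore(3, 20, 17.5, 400, 15)
	fmt.Printf("local=%.4f global=%.4f\n", local, global)
}
```

The two calls in `main` differ only in `avgDocLen`, `docCount`, and `docFreq`, which is exactly where per-partition and global stats diverge; whether that difference changes the ranking depends on the data.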