Remove special merging behavior for line matches #888
This makes sense to me. Dredging back, I believe I implemented it this way to fully preserve the pre-existing behavior of while chunk matches were still unstable.
I believe we still have some non-Sourcegraph users of Zoekt (other platform folks can speak more closely to that), and this would be an observable behavior change. I'm not really sure what our maintenance policy is here.
One request: I'm surprised this didn't require any test changes. Could you add something that exercises overlapping matches?
Will do! I'd still love @sourcegraph/search-platform feedback, in case there are concerns around changing this behavior for OSS users, or any more historical context I'm missing.
@keegancsmith I agree! I will ponder this once more and see if there's a way to preserve the scoring behavior I want, while maintaining the nice behavior where we don't drop matched ranges...
Interestingly, I think I have the opposite preference because it's more consistent. Regex matching already doesn't return overlapping matches (example), and merging or truncating matches means that some of the "matched ranges" may not actually match any of the individual terms. This is probably mostly okay for human consumption (though match counting gets complex), but the non-guarantee of "each range actually represents a full match" is pretty problematic for computer consumption, which is a significant reason why we implemented chunk matches in the first place. If we do pursue this, I'd prefer not to take the merging approach that line matches currently take, and instead just support overlapping ranges.
Okay, this still feels worth it to eliminate the difference in behavior between
Follow up to #888, where I forgot to improve the test coverage.
Usually, if there are candidate matches with overlapping ranges, then we just remove matches that overlap. However, when `opts.ChunkMatches = false`, we had special logic to merge overlapping matches.

This PR removes the overlapping logic to simplify the behavior. I couldn't see a good reason to keep this special handling. Plus, we are moving towards making `ChunkMatches` the default.

Another benefit of this change is that it makes the BM25 behavior easier to understand. If we merged together ranges, then we would be calculating term frequencies for spurious terms (like `new`, `queue`, `newqueue`, `queuenew`, etc.) Note: we currently only use BM25 with `ChunkMatches = true`, so there's not an active bug here.

Relates to SPLF-40
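
To make the difference between the two strategies concrete, here is a minimal, self-contained sketch. It is not Zoekt's actual code: the `span` type and the `dropOverlapping` and `mergeOverlapping` helpers are hypothetical names invented for illustration, and keeping the earliest-starting match when two overlap is an assumption. It shows how merging can produce a range that matches neither original term, which is where the spurious term frequencies would come from.

```go
package main

import (
	"fmt"
	"sort"
)

// span is a hypothetical half-open byte range [Start, End) within a file.
// It stands in for a candidate match; it is not a Zoekt type.
type span struct{ Start, End int }

// sortByStart returns a copy of in ordered by Start, then by End.
func sortByStart(in []span) []span {
	out := append([]span(nil), in...)
	sort.Slice(out, func(i, j int) bool {
		if out[i].Start != out[j].Start {
			return out[i].Start < out[j].Start
		}
		return out[i].End < out[j].End
	})
	return out
}

// dropOverlapping keeps a span only if it begins at or after the end of the
// previously kept span. Every surviving range is still a full match of one term.
// (Which overlapping match to keep is an assumption of this sketch.)
func dropOverlapping(in []span) []span {
	var out []span
	for _, s := range sortByStart(in) {
		if len(out) == 0 || s.Start >= out[len(out)-1].End {
			out = append(out, s)
		}
	}
	return out
}

// mergeOverlapping fuses overlapping spans into one larger span. The fused
// range may not correspond to any single term.
func mergeOverlapping(in []span) []span {
	var out []span
	for _, s := range sortByStart(in) {
		if n := len(out); n > 0 && s.Start < out[n-1].End {
			if s.End > out[n-1].End {
				out[n-1].End = s.End
			}
			continue
		}
		out = append(out, s)
	}
	return out
}

func main() {
	content := "foobar"
	// Two overlapping candidates: "foob" at [0,4) and "obar" at [2,6).
	candidates := []span{{0, 4}, {2, 6}}

	for _, s := range dropOverlapping(candidates) {
		fmt.Printf("kept:   %q\n", content[s.Start:s.End])
	}
	for _, s := range mergeOverlapping(candidates) {
		fmt.Printf("merged: %q\n", content[s.Start:s.End])
	}
}
```

With the drop strategy the only reported range is `foob`, a real match. With the merge strategy the reported range is `foobar`, which neither query term matches on its own, so any scoring that counts terms per matched range would be counting a term nobody searched for.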