Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove special merging behavior for line matches #888

Merged
merged 1 commit into from
Jan 15, 2025
Merged

Conversation

jtibshirani
Copy link
Member

@jtibshirani jtibshirani commented Jan 13, 2025

Usually, if there are candidate matches with overlapping ranges, then we just remove matches that overlap. However, when opts.ChunkMatches = false, we had special logic to merge overlapping matches.

This PR removes the overlapping logic to simplify the behavior. I couldn't see a good reason to keep this special handling. Plus, we are moving towards making ChunkMatches the default.

Another benefit of this change is that it makes the BM25 behavior easier to understand. If we merged together ranges, then we would be calculating term frequencies for spurious terms (like new, queue, newqueue, queuenew, etc.) Note: we currently only use BM25 with ChunkMatches = true, so there's not an active bug here.

Relates to SPLF-40

@cla-bot cla-bot bot added the cla-signed label Jan 13, 2025
@jtibshirani jtibshirani requested review from camdencheek and a team January 13, 2025 22:35
Copy link
Member

@camdencheek camdencheek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense to me. Dredging back, I believe I implemented it this way to fully preserve the pre-existing behavior of while chunk matches were still unstable.

I believe we still have some non-Sourcegraph users of Zoekt (other platform folks can speak more closely to that), and this would be an observable behavior change. I'm not really sure what our maintenance policy is here.

One request: I'm surprised this didn't require any test changes. Could you add something that exercises overlapping matches?

@jtibshirani
Copy link
Member Author

One request: I'm surprised this didn't require any test changes. Could you add something that exercises overlapping matches?

Will do! I'd still love @sourcegraph/search-platform feedback, in case there are concerns around changing this behavior for OSS users, or any more historical context I'm missing.

@keegancsmith
Copy link
Member

I think this is a regression in reasonable behaviour in zoekt, which we special cased away just for chunkmatches. To be honest I like it and we should probably think about how to support it in sourcegraph's keyword search? See this screenshot. On the left is zoekt before this change, right is zoekt after this change.

Screenshot 2025-01-15 at 18 14 13

Even though it is reasonable behaviour, I can get behind this PR simplifying code paths and as such is a worthy change to make. It certainly simplifies expectations on the API, but maybe not so much from how zoekt presents results.

@jtibshirani
Copy link
Member Author

@keegancsmith I agree! I will ponder this once more and see if there's a way to preserve the scoring behavior I want, while maintaining the nice behavior where we don't drop matched ranges...

@camdencheek
Copy link
Member

camdencheek commented Jan 15, 2025

Interestingly, I think I have the opposite preference becuase it's more consistent. Regex matching already doesn't return overlapping matches (example), and merging or truncating matches means that some of the "matched ranges" may not actually match any of the individual terms.

This is probably mostly okay for human consumption (though match counting gets complex), but the non-guarantee of "each range actually represents a full match" is pretty problematic for computer consumption, which is a significant reason why we implemented chunk matches in the first place.

If we do pursue this, I'd prefer not to take the merging approach that line matches currently takes, and instead just support overlapping ranges.

@jtibshirani
Copy link
Member Author

jtibshirani commented Jan 15, 2025

Okay, this still feels worth it to eliminate the difference in behavior between LineMatch and ChunkMatch. If anyone yells at me about it (and OSS users feel free to do that!), I will figure out a plan to address it. Maybe we need an option to return overlapping ranges in the API, as @camdencheek suggests.

@jtibshirani jtibshirani merged commit d301e83 into main Jan 15, 2025
12 checks passed
@jtibshirani jtibshirani deleted the jtibs/merge branch January 15, 2025 21:23
jtibshirani added a commit that referenced this pull request Jan 15, 2025
Follow up to #888, where I forgot to improve the test coverage.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants