Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix query timeout handling by checking searchContext.isSearchTimedOut() #16882

Closed
wants to merge 1 commit into from

Conversation

reta
Copy link
Collaborator

@reta reta commented Dec 18, 2024

Description

I only briefly looked at overall implementation and it looks like we may lost some timeout handling guarantees, likely while implementing the concurrent search. Fix query timeout handling by checking searchContext.isSearchTimedOut().

Related Issues

Resolves #16056 #9401

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Copy link
Contributor

✅ Gradle check result for 43e54c0: SUCCESS

Copy link

codecov bot commented Dec 18, 2024

Codecov Report

Attention: Patch coverage is 0% with 2 lines in your changes missing coverage. Please review.

Project coverage is 72.13%. Comparing base (b5f651f) to head (751ad99).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
...ensearch/search/internal/ContextIndexSearcher.java 0.00% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #16882      +/-   ##
============================================
- Coverage     72.21%   72.13%   -0.08%     
+ Complexity    65335    65325      -10     
============================================
  Files          5318     5318              
  Lines        304081   304083       +2     
  Branches      43995    43996       +1     
============================================
- Hits         219578   219365     -213     
- Misses        66541    66740     +199     
- Partials      17962    17978      +16     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

✅ Gradle check result for e47aab2: SUCCESS

@reta reta force-pushed the fix.query.timeout branch from e47aab2 to da3160c Compare December 18, 2024 21:05
@@ -305,6 +305,10 @@ protected void search(List<LeafReaderContext> leaves, Weight weight, Collector c
*/
@Override
protected void searchLeaf(LeafReaderContext ctx, Weight weight, Collector collector) throws IOException {
// If search is already flagged as timed out, do not continue
if (searchContext.isSearchTimedOut()) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jed326 @sohami @kkewwei it seems like when we were adapting to concurrent search, the exception driven flow was changed to state driven one, and the leaves are being processed even when the search request has exceeded the time budget (timed out).

Copy link
Collaborator

@jed326 jed326 Dec 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @reta, I will have more time to take a closer look at this later this week.

From what I recall off the top of my head the cancellable.checkCancelled(); right below this should already be doing the timeout checking and then throwing the QueryPhase.TimeExceededException. The searchContext.setSearchTimedOut(true); logic was needed because with concurrent search any exceptions thrown in searchLeaf would be handled by the Lucene TaskExecutor and it was difficult to figure out from the query phase thread if the search had actually timed out or not.

Copy link
Collaborator Author

@reta reta Dec 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @reta, I will have more time to take a closer look at this later this week.

Thanks @jed326 , I will look closely as well, but cancellable.checkCancelled() seems to be somehow not short-circuiting the search, I could clearly see that by checking how long tests are taking w/o this change.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cancellable.checkCancelled() seems to be somehow not short-circuiting the search

That's definitely strange. I do see that searchLeaf gets called for each leaf even post the timeout and we leave it to cancellable.checkCancelled() to throw an exception which we than catch for each leaf. I wonder if this repeated exception handling is what is causing it to take longer?

Maybe we should even have the searchContext.isSearchTimedOut() check in the search method above so we can short circuit those for loops when the timeout is hit?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should even have the searchContext.isSearchTimedOut() check in the search method above so we can short circuit those for loops when the timeout is hit?

What I've seen, the timeout is hit by first searchLeaf call (those are called in loop), so the check in search method may not need this check (I think)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@reta, In the next code: cancellable.checkCancelled()(line 316) will do the timeout check, it looks like duplicate work.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could clearly see that by checking how long tests are taking w/o this change.

Currently the only reason I can think why this would be the case is because cancellable.checkCancelled() involves throwing and catching an exception which is slowing things down. I think we may need to run this piece of code through a profiler to understand why.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jed326 @kkewwei thanks for the feedback folks, will spend more time looking into the cause

@reta reta force-pushed the fix.query.timeout branch from da3160c to 751ad99 Compare December 18, 2024 21:16
Copy link
Contributor

✅ Gradle check result for 751ad99: SUCCESS

@reta
Copy link
Collaborator Author

reta commented Dec 19, 2024

Closing, not solving the issue

@reta reta closed this Dec 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
autocut flaky-test Random test failure that succeeds on second run skip-changelog >test-failure Test failure from CI, local build, etc.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[AUTOCUT] Gradle Check Flaky Test Report for SearchTimeoutIT
3 participants