Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix racing condition in BaseQueryRewriteContext #17124

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

mingshl
Copy link
Contributor

@mingshl mingshl commented Jan 25, 2025

Description

Fixing #17103

to avoid racing in registering and execution:
use AtomicReference<List<BiConsumer<Client, ActionListener<?>>>> asyncActionsRef,

while registerAsyncAction, use updateAndGet and return newList,
while executeAsyncActions, use getAndSet and return new List before execution.

added unit tests to help verify the fix.

Related Issues

[[Resolves #[Issue number to be closed when this PR is merged]

](https://github.com/opensearch-project/OpenSearch/issues/17103)](https://github.com/opensearch-project/OpenSearch/issues/17103)

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@mingshl
Copy link
Contributor Author

mingshl commented Jan 25, 2025

@dbwiddis @msfroh @bugmakerrrrrr ready for review, thank you!!!

This change should go to 3.0 version, so I added to changelog in 3.0

Copy link
Contributor

❌ Gradle check result for 30333a7: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Comment on lines +94 to +98
asyncActionsRef.updateAndGet(list -> {
List<BiConsumer<Client, ActionListener<?>>> newList = new ArrayList<>(list);
newList.add(asyncAction);
return newList;
});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to copy the list here; this gives us no value over a CopyOnWriteArrayList which we wouldn't need the atomic reference for. Can't we simply just update the list in-place but wrapped in the atomic reference for thread safety? (CC @msfroh to confirm this suggestion):

Suggested change
asyncActionsRef.updateAndGet(list -> {
List<BiConsumer<Client, ActionListener<?>>> newList = new ArrayList<>(list);
newList.add(asyncAction);
return newList;
});
asyncActionsRef.updateAndGet(list -> {
list.add(asyncAction);
return list;
});

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried, if not returning new list, I am having test failure, so it still cause racing condition during registration.

BaseQueryRewriteContextTests > testRacingConditionFixed FAILED
    java.lang.AssertionError: expected:<5000> but was:<4874>
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.failNotEquals(Assert.java:835)
        at org.junit.Assert.assertEquals(Assert.java:647)
        at org.junit.Assert.assertEquals(Assert.java:633)
        at org.opensearch.index.query.BaseQueryRewriteContextTests.testRacingConditionFixed(BaseQueryRewriteContextTests.java:167)

BaseQueryRewriteContextTests > testConcurrentRegistrationAndExecution FAILED
    java.lang.AssertionError: expected:<10000> but was:<7802>
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.failNotEquals(Assert.java:835)
        at org.junit.Assert.assertEquals(Assert.java:647)
        at org.junit.Assert.assertEquals(Assert.java:633)
        at org.opensearch.index.query.BaseQueryRewriteContextTests.testConcurrentRegistrationAndExecution(BaseQueryRewriteContextTests.java:108)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... if we need to copy the whole list anyway, I think I like @dbwiddis's suggestion of using CopyOnWriteArrayList over my AtomicReference suggestion.

@dbwiddis dbwiddis added the v2.19.0 Issues and PRs related to version 2.19.0 label Jan 25, 2025
@bugmakerrrrrr
Copy link
Contributor

IMO, I think there is no racing condition. Before rewriting again, we should have executed all registered async actions and notifies the listener, so no new actions will be registered during rewriteAndFetch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
v2.19.0 Issues and PRs related to version 2.19.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants