Don't detect PlainActionFuture deadlock on concurrent complete #110361
// when multiple threads from the same pool are completing the future
while (isDone() == false) {
    Thread.onSpinWait();
}
I imagine we stay in this state extremely briefly (and rarely), as we're just waiting on two assignments and the release. Perhaps a busy wait is OK here?
We could also use some other synchronisation mechanism here to wait/notify (one that didn't add the waiter to the waiters for the `Sync`), but that might be overkill?
Seems ok to me; possibly this is even more efficient, not that efficiency particularly matters in these cases.
Seems fine to me as well. I think `acquire` internally also does busy waiting.
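For readers following the busy-wait discussion, here is a minimal sketch of the general idea, not the actual `PlainActionFuture` code. The class `OneShotResult` and its state constants are hypothetical names invented for illustration. It assumes a design where the first completer wins a compare-and-set, performs a couple of assignments, and then publishes the "done" state, so a thread that loses the race only has to spin for that short window.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical one-shot result holder; an illustration only, not the Elasticsearch implementation. */
final class OneShotResult<T> {
    private static final int PENDING = 0;
    private static final int COMPLETING = 1;
    private static final int DONE = 2;

    private final AtomicInteger state = new AtomicInteger(PENDING);
    // Plain field: written before state becomes DONE and read only after DONE is observed,
    // so the volatile semantics of the AtomicInteger make the write visible.
    private T value;

    boolean isDone() {
        return state.get() == DONE;
    }

    /** Returns true if this call supplied the value, false if another thread got there first. */
    boolean complete(T newValue) {
        if (state.compareAndSet(PENDING, COMPLETING)) {
            value = newValue;
            state.set(DONE);
            return true;
        }
        // Lost the race: the winner is at most a few instructions away from publishing DONE,
        // so spin briefly instead of parking this thread as a waiter.
        while (isDone() == false) {
            Thread.onSpinWait();
        }
        return false;
    }

    T get() {
        if (isDone() == false) {
            throw new IllegalStateException("not completed yet");
        }
        return value;
    }
}
```

The spin is only reasonable because the window between the compare-and-set and the final state write is tiny; if the winner could block in that window, a proper wait/notify mechanism would be the safer choice.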
    future.set(new PlainActionFuture<>());
}
future.get().onResponse(null);
}
The issue only manifests at the moment the `PlainActionFuture` is first resolved; this test reproduces it fairly quickly when we do the above. I wish we didn't have to create objects in a tight loop.
}
startBarrier.countDown();
safeAwait(startBarrier);
cs.poll(250, TimeUnit.MILLISECONDS);
This test will unfortunately run the full 250ms in the happy case. On my desktop it seemed to fail around the 80ms mark fairly consistently before the fix.
I'd rather a test which fails sometimes but always runs quickly, because we run this so often in CI that it'll catch the bug pretty soon anyway.
Cool, I reduced it to 20ms, which is still surprisingly reliable.
I'm wondering why we need such a complicated test to reproduce this. Something like this seems to work too:
diff --git a/server/src/test/java/org/elasticsearch/action/support/PlainActionFutureTests.java b/server/src/test/java/org/elasticsearch/action/support/PlainActionFutureTests.java
index 2ca914eb23c..68c706c7f3c 100644
--- a/server/src/test/java/org/elasticsearch/action/support/PlainActionFutureTests.java
+++ b/server/src/test/java/org/elasticsearch/action/support/PlainActionFutureTests.java
@@ -14,6 +14,8 @@ import org.elasticsearch.common.util.concurrent.UncategorizedExecutionException;
import org.elasticsearch.core.Assertions;
import org.elasticsearch.core.TimeValue;
import org.elasticsearch.test.ESTestCase;
+import org.elasticsearch.threadpool.TestThreadPool;
+import org.elasticsearch.threadpool.ThreadPool;
import org.elasticsearch.transport.RemoteTransportException;
import java.util.concurrent.CancellationException;
@@ -180,4 +182,19 @@ public class PlainActionFutureTests extends ESTestCase {
}
};
}
+
+    public void testConcurrentCompletion() {
+        try (var threadPool = new TestThreadPool(getTestName())) {
+            final var future = new PlainActionFuture<>();
+            final var threadCount = threadPool.info(ThreadPool.Names.GENERIC).getMax();
+            final var barrier = new CyclicBarrier(threadCount);
+            for (int taskIndex = 0; taskIndex < threadCount; taskIndex++) {
+                threadPool.generic().execute(() -> {
+                    safeAwait(barrier);
+                    future.onResponse(null);
+                });
+            }
+            assertNull(safeGet(future));
+        }
+    }
}
slightly nicer:
diff --git a/server/src/test/java/org/elasticsearch/action/support/PlainActionFutureTests.java b/server/src/test/java/org/elasticsearch/action/support/PlainActionFutureTests.java
index 2ca914eb23c..4893dfc91e8 100644
--- a/server/src/test/java/org/elasticsearch/action/support/PlainActionFutureTests.java
+++ b/server/src/test/java/org/elasticsearch/action/support/PlainActionFutureTests.java
@@ -9,11 +9,14 @@
package org.elasticsearch.action.support;
import org.elasticsearch.ElasticsearchException;
+import org.elasticsearch.action.ActionRunnable;
import org.elasticsearch.common.util.concurrent.FutureUtils;
import org.elasticsearch.common.util.concurrent.UncategorizedExecutionException;
import org.elasticsearch.core.Assertions;
import org.elasticsearch.core.TimeValue;
import org.elasticsearch.test.ESTestCase;
+import org.elasticsearch.threadpool.TestThreadPool;
+import org.elasticsearch.threadpool.ThreadPool;
import org.elasticsearch.transport.RemoteTransportException;
import java.util.concurrent.CancellationException;
@@ -180,4 +183,18 @@ public class PlainActionFutureTests extends ESTestCase {
}
};
}
+
+    public void testConcurrentCompletion() {
+        try (var threadPool = new TestThreadPool(getTestName())) {
+            final var future = new PlainActionFuture<>();
+            final var executorName = randomFrom(ThreadPool.Names.GENERIC, ThreadPool.Names.MANAGEMENT);
+            final var threadCount = threadPool.info(executorName).getMax();
+            final var executor = threadPool.executor(executorName);
+            final var barrier = new CyclicBarrier(threadCount);
+            for (int i = 0; i < threadCount; i++) {
+                executor.execute(ActionRunnable.run(future, () -> safeAwait(barrier)));
+            }
+            assertNull(safeGet(future));
+        }
+    }
}
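The barrier-based shape of the test in the diff above can also be tried outside the Elasticsearch test framework. The sketch below is an approximation using only JDK classes: `CompletableFuture` stands in for `PlainActionFuture`, a fixed thread pool stands in for the `TestThreadPool` executor, and the class and method names are made up for illustration. It shows the pattern (release all threads at once, have every one of them complete the same future) and is not expected to reproduce the Elasticsearch-specific deadlock detection.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical JDK-only stand-in for the test in the diff; names are illustrative. */
final class ConcurrentCompletionSketch {

    /** Completes one future from threadCount threads released together; returns the future's value. */
    static Object completeConcurrently(int threadCount) {
        ExecutorService executor = Executors.newFixedThreadPool(threadCount);
        try {
            CompletableFuture<Object> future = new CompletableFuture<>();
            CyclicBarrier barrier = new CyclicBarrier(threadCount);
            for (int i = 0; i < threadCount; i++) {
                executor.execute(() -> {
                    try {
                        // Every task blocks here until all threadCount tasks have arrived,
                        // so the completions below happen as close together as possible.
                        barrier.await(10, TimeUnit.SECONDS);
                    } catch (Exception e) {
                        future.completeExceptionally(e);
                        return;
                    }
                    future.complete(null);
                });
            }
            return future.get(10, TimeUnit.SECONDS);
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            executor.shutdownNow();
        }
    }
}
```

The pool size must be at least the barrier's party count, otherwise the tasks that do run wait on the barrier until the timeout expires.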
Weird, it fails pretty reliably for me. Although 5k iterations isn't very many.
Try running with `-Dtests.iters=1000 -Dtests.failfast=true`, no need to wait for seconds+ for Gradle to start up on each iteration.
I'm not sure what the difference is, but I can't get it to fail. This M3 MacBook seems to eat race conditions for breakfast (I also had trouble reproducing this one).
I ran for 800,000 iterations, which took ~48 minutes and produced zero failures, whereas the latest iteration of my longer test fails almost immediately. While I accept that CI is probably going to yield these failures more reliably, I think it's probably best if we have a test that reproduces on typical dev hardware.
Some theories on why the longer test is more reliable:
- it produces ~20 `PlainActionFuture`s, so there's 20x the opportunity for it to occur (it can only occur once per future)
- the threads hitting `onResponse` are hot, possibly running JIT-ed code, as opposed to just woken up from a wait. Perhaps this increases the likelihood of the critical piece of concurrency.
> I think it's probably best if we have a test that reproduces on typical dev hardware.
Hmm I think this is the wrong way round - if "typical dev hardware" isn't able to reproduce genuine race conditions that happen in CI/production then I don't think we should work around that by making one specific test more complex. I'd recommend using a GCP machine for long-running repro attempts rather than your laptop since it'll be closer to real production infra.
I think my version of the test is about as slender as it can be now (thanks for the feedback), but I'm OK with replacing it with the much shorter test if you'd prefer.
I'm satisfied that the issue was there and that the change fixes it, and it sounds like that test will fail in CI if we regress, so I'm happy to go with consensus on the cost/benefit of the additional complexity.
Pinging @elastic/es-delivery (Team:Delivery)
Pinging @elastic/es-distributed (Team:Distributed)
Hi @nicktindall, I've created a changelog YAML for you.
LGTM
Nice catch! TIL!
I think we should try and simplify the test
Ok I think the test is simple enough now for my tastes, although IMO it deserves a few more comments explaining why it's not any simpler.
LGTM thanks for the extra iterations Nick
Closes #110360
Closes #110181
I don't love the test for this, but it does reproduce the issue reliably. Any suggestions for making it less verbose are appreciated.