Don't detect PlainActionFuture deadlock on concurrent complete #110361

nicktindall · 2024-07-02T05:21:30Z

I don't love the test for this, but it does reproduce the issue reliably. Any suggestions for making it less verbose are appreciated.

Closes elastic#110360

nicktindall · 2024-07-02T05:26:34Z

server/src/main/java/org/elasticsearch/action/support/PlainActionFuture.java

+                // when multiple threads from the same pool are completing the future
+                while (isDone() == false) {
+                    Thread.onSpinWait();
+                }


I imagine we stay in this state extremely briefly (and rarely) as we're just waiting on two assignments and the release, perhaps busy wait is OK here?

We could also use some other synchronisation mechanism here to wait/notify (that didn't add the waiter to the waiters for the Sync), but that might be overkill?

Seems ok to me; possibly this is even more efficient, not that efficiency particularly matters in these cases.

Seems fine to me as well. I think acquire internally also does busy waiting.

nicktindall · 2024-07-02T05:29:35Z

server/src/test/java/org/elasticsearch/action/support/PlainActionFutureTests.java

+                        future.set(new PlainActionFuture<>());
+                    }
+                    future.get().onResponse(null);
+                }


The issue only manifests in the moment the PlainActionFuture is first resolved, this test reproduces fairly quickly when we do the above. I wish we didn't have to create objects in a tight loop.

nicktindall · 2024-07-02T05:38:14Z

server/src/test/java/org/elasticsearch/action/support/PlainActionFutureTests.java

+            }
+            startBarrier.countDown();
+            safeAwait(startBarrier);
+            cs.poll(250, TimeUnit.MILLISECONDS);


This test will unfortunately run the full 250ms in the happy case, on my desktop it seemed to fail around the 80ms mark fairly consistently before the fix.

I'd rather a test which fails sometimes but always runs quickly, because we run this so often in CI that it'll catch the bug pretty soon anyway.

Cool, I reduced to 20ms which is still surprisingly reliable.

I'm wondering why we need such a complicated test to reproduce this. Something like this seems to work too:

diff --git a/server/src/test/java/org/elasticsearch/action/support/PlainActionFutureTests.java b/server/src/test/java/org/elasticsearch/action/support/PlainActionFutureTests.java index 2ca914eb23c..68c706c7f3c 100644 --- a/server/src/test/java/org/elasticsearch/action/support/PlainActionFutureTests.java +++ b/server/src/test/java/org/elasticsearch/action/support/PlainActionFutureTests.java @@ -14,6 +14,8 @@ import org.elasticsearch.common.util.concurrent.UncategorizedExecutionException; import org.elasticsearch.core.Assertions; import org.elasticsearch.core.TimeValue; import org.elasticsearch.test.ESTestCase; +import org.elasticsearch.threadpool.TestThreadPool; +import org.elasticsearch.threadpool.ThreadPool; import org.elasticsearch.transport.RemoteTransportException; import java.util.concurrent.CancellationException; @@ -180,4 +182,19 @@ public class PlainActionFutureTests extends ESTestCase { } }; } + + public void testConcurrentCompletion() { + try (var threadPool = new TestThreadPool(getTestName())) { + final var future = new PlainActionFuture<>(); + final var threadCount = threadPool.info(ThreadPool.Names.GENERIC).getMax(); + final var barrier = new CyclicBarrier(threadCount); + for (int taskIndex = 0; taskIndex < threadCount; taskIndex++) { + threadPool.generic().execute(() -> { + safeAwait(barrier); + future.onResponse(null); + }); + } + assertNull(safeGet(future)); + } + } }

slightly nicer:

diff --git a/server/src/test/java/org/elasticsearch/action/support/PlainActionFutureTests.java b/server/src/test/java/org/elasticsearch/action/support/PlainActionFutureTests.java index 2ca914eb23c..4893dfc91e8 100644 --- a/server/src/test/java/org/elasticsearch/action/support/PlainActionFutureTests.java +++ b/server/src/test/java/org/elasticsearch/action/support/PlainActionFutureTests.java @@ -9,11 +9,14 @@ package org.elasticsearch.action.support; import org.elasticsearch.ElasticsearchException; +import org.elasticsearch.action.ActionRunnable; import org.elasticsearch.common.util.concurrent.FutureUtils; import org.elasticsearch.common.util.concurrent.UncategorizedExecutionException; import org.elasticsearch.core.Assertions; import org.elasticsearch.core.TimeValue; import org.elasticsearch.test.ESTestCase; +import org.elasticsearch.threadpool.TestThreadPool; +import org.elasticsearch.threadpool.ThreadPool; import org.elasticsearch.transport.RemoteTransportException; import java.util.concurrent.CancellationException; @@ -180,4 +183,18 @@ public class PlainActionFutureTests extends ESTestCase { } }; } + + public void testConcurrentCompletion() { + try (var threadPool = new TestThreadPool(getTestName())) { + final var future = new PlainActionFuture<>(); + final var executorName = randomFrom(ThreadPool.Names.GENERIC, ThreadPool.Names.MANAGEMENT); + final var threadCount = threadPool.info(executorName).getMax(); + final var executor = threadPool.executor(executorName); + final var barrier = new CyclicBarrier(threadCount); + for (int i = 0; i < threadCount; i++) { + executor.execute(ActionRunnable.run(future, () -> safeAwait(barrier))); + } + assertNull(safeGet(future)); + } + } }

Weird, it fails pretty reliably for me. Although 5k iterations isn't very many.

Try running with -Dtests.iters=1000 -Dtests.failfast=true, no need to wait for seconds+ for Gradle to start up on each iteration.

I'm not sure what the difference is, but I can't get it to fail. This M3 MacBook seems to eat race conditions for breakfast (I also had trouble reproducing this one).

I ran for 800,000 iterations which took ~48 minutes and produced zero failures, where the latest iteration of my longer test fails almost immediately. While I accept that CI Is probably going to yield these failures more reliably, I think it's probably best if we have a test that reproduces on typical dev hardware.

Some theories on why the longer test is more reliable

it produces ~20 PlainActionFutures, so there's 20x the opportunity for it to occur (it can only occur once per future)

the threads hitting onResponse are hot, possibly running JIT-ed code, as opposed to just woken up from a wait. Perhaps this increases the likelihood of the critical piece of concurrency.

I think it's probably best if we have a test that reproduces on typical dev hardware.

Hmm I think this is the wrong way round - if "typical dev hardware" isn't able to reproduce genuine race conditions that happen in CI/production then I don't think we should work around that by making one specific test more complex. I'd recommend using a GCP machine for long-running repro attempts rather than your laptop since it'll be closer to real production infra.

I think my version of the test is about as slender as it can be now (thanks for the feedback), but I'm OK with replacing it with the much shorter test if you'd prefer.
I'm satisfied that the issue was there and that the change fixes it, and it sounds like that test will fail in CI if we regress, so I'm happy to go with consensus on the cost/benefit of the additional complexity.

elasticsearchmachine · 2024-07-02T05:56:15Z

Pinging @elastic/es-delivery (Team:Delivery)

elasticsearchmachine · 2024-07-02T05:59:09Z

Pinging @elastic/es-distributed (Team:Distributed)

elasticsearchmachine · 2024-07-02T05:59:11Z

Hi @nicktindall, I've created a changelog YAML for you.

ywangd

LGTM

Nice catch! TIL!

server/src/test/java/org/elasticsearch/action/support/PlainActionFutureTests.java

ywangd · 2024-07-02T06:51:11Z

server/src/main/java/org/elasticsearch/action/support/PlainActionFuture.java

+                // when multiple threads from the same pool are completing the future
+                while (isDone() == false) {
+                    Thread.onSpinWait();
+                }


Seems fine to me as well. I think acquire internally also does busy waiting.

server/src/test/java/org/elasticsearch/action/support/PlainActionFutureTests.java

DaveCTurner

I think we should try and simplify the test

server/src/test/java/org/elasticsearch/action/support/PlainActionFutureTests.java

DaveCTurner · 2024-07-03T04:57:13Z

server/src/test/java/org/elasticsearch/action/support/PlainActionFutureTests.java

+            }
+            startBarrier.countDown();
+            safeAwait(startBarrier);
+            cs.poll(250, TimeUnit.MILLISECONDS);


I think it's probably best if we have a test that reproduces on typical dev hardware.

Hmm I think this is the wrong way round - if "typical dev hardware" isn't able to reproduce genuine race conditions that happen in CI/production then I don't think we should work around that by making one specific test more complex. I'd recommend using a GCP machine for long-running repro attempts rather than your laptop since it'll be closer to real production infra.

server/src/test/java/org/elasticsearch/action/support/PlainActionFutureTests.java

DaveCTurner

Ok I think the test is simple enough now for my tastes, although IMO it deserves a few more comments explaining why it's not any simpler.

server/src/test/java/org/elasticsearch/action/support/PlainActionFutureTests.java

DaveCTurner

LGTM thanks for the extra iterations Nick

Don't detect PlainActionFuture deadlock on concurrent complete

fe192b8

Closes elastic#110360

nicktindall requested a review from ywangd July 2, 2024 05:21

elasticsearchmachine added needs:triage Requires assignment of a team area label v8.15.0 labels Jul 2, 2024

nicktindall commented Jul 2, 2024

View reviewed changes

Tidy up

d4e6244

nicktindall commented Jul 2, 2024

View reviewed changes

nicktindall requested a review from henningandersen July 2, 2024 05:43

nicktindall added :Delivery/Build Build or test infrastructure >bug labels Jul 2, 2024

elasticsearchmachine added the Team:Delivery Meta label for Delivery team label Jul 2, 2024

elasticsearchmachine removed the needs:triage Requires assignment of a team area label label Jul 2, 2024

DaveCTurner added :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. and removed :Delivery/Build Build or test infrastructure Team:Delivery Meta label for Delivery team labels Jul 2, 2024

elasticsearchmachine added the Team:Distributed Meta label for distributed team label Jul 2, 2024

Update docs/changelog/110361.yaml

4b51721

Reduce timeout

0f0d57e

ywangd approved these changes Jul 2, 2024

View reviewed changes

nicktindall added 2 commits July 2, 2024 17:14

Improve variable names

f6ef0b5

More naming

6f82c8a

DaveCTurner requested changes Jul 2, 2024

View reviewed changes

nicktindall added 2 commits July 2, 2024 18:27

Attempt at simplifying test

9ac71cd

Further attempt at simplifying test

2138cee

DaveCTurner reviewed Jul 3, 2024

View reviewed changes

server/src/test/java/org/elasticsearch/action/support/PlainActionFutureTests.java Outdated Show resolved Hide resolved

server/src/test/java/org/elasticsearch/action/support/PlainActionFutureTests.java Outdated Show resolved Hide resolved

Further simplification

88fd0bc

DaveCTurner reviewed Jul 3, 2024

View reviewed changes

nicktindall added 2 commits July 3, 2024 18:01

Add explanation and tidy

39d7b80

Remove assumption

ef706da

DaveCTurner approved these changes Jul 3, 2024

View reviewed changes

nicktindall merged commit fcaef59 into elastic:main Jul 3, 2024
15 checks passed

nicktindall deleted the fix/110360_assertCompleteAllowed_false_positive branch July 3, 2024 09:15

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Don't detect PlainActionFuture deadlock on concurrent complete #110361

Don't detect PlainActionFuture deadlock on concurrent complete #110361

nicktindall commented Jul 2, 2024 •

edited

Loading

nicktindall Jul 2, 2024 •

edited

Loading

nicktindall Jul 2, 2024

DaveCTurner Jul 2, 2024

ywangd Jul 2, 2024

nicktindall Jul 2, 2024 •

edited

Loading

nicktindall Jul 2, 2024 •

edited

Loading

DaveCTurner Jul 2, 2024

nicktindall Jul 2, 2024

DaveCTurner Jul 2, 2024

DaveCTurner Jul 2, 2024

DaveCTurner Jul 2, 2024

DaveCTurner Jul 2, 2024

nicktindall Jul 3, 2024 •

edited

Loading

DaveCTurner Jul 3, 2024

nicktindall Jul 3, 2024 •

edited

Loading

elasticsearchmachine commented Jul 2, 2024

elasticsearchmachine commented Jul 2, 2024

elasticsearchmachine commented Jul 2, 2024

ywangd left a comment

ywangd Jul 2, 2024

DaveCTurner left a comment

DaveCTurner Jul 3, 2024

DaveCTurner left a comment

DaveCTurner left a comment

Don't detect PlainActionFuture deadlock on concurrent complete #110361

Don't detect PlainActionFuture deadlock on concurrent complete #110361

Conversation

nicktindall commented Jul 2, 2024 • edited Loading

nicktindall Jul 2, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

nicktindall Jul 2, 2024 • edited Loading

Choose a reason for hiding this comment

nicktindall Jul 2, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

nicktindall Jul 3, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

nicktindall Jul 3, 2024 • edited Loading

Choose a reason for hiding this comment

elasticsearchmachine commented Jul 2, 2024

elasticsearchmachine commented Jul 2, 2024

elasticsearchmachine commented Jul 2, 2024

ywangd left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

DaveCTurner left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

DaveCTurner left a comment

Choose a reason for hiding this comment

DaveCTurner left a comment

Choose a reason for hiding this comment

nicktindall commented Jul 2, 2024 •

edited

Loading

nicktindall Jul 2, 2024 •

edited

Loading

nicktindall Jul 2, 2024 •

edited

Loading

nicktindall Jul 2, 2024 •

edited

Loading

nicktindall Jul 3, 2024 •

edited

Loading

nicktindall Jul 3, 2024 •

edited

Loading