Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't detect PlainActionFuture deadlock on concurrent complete #110361

Conversation

nicktindall
Copy link
Contributor

@nicktindall nicktindall commented Jul 2, 2024

Closes #110360
Closes #110181

I don't love the test for this, but it does reproduce the issue reliably. Any suggestions for making it less verbose are appreciated.

@nicktindall nicktindall requested a review from ywangd July 2, 2024 05:21
@elasticsearchmachine elasticsearchmachine added needs:triage Requires assignment of a team area label v8.15.0 labels Jul 2, 2024
// when multiple threads from the same pool are completing the future
while (isDone() == false) {
Thread.onSpinWait();
}
Copy link
Contributor Author

@nicktindall nicktindall Jul 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I imagine we stay in this state extremely briefly (and rarely) as we're just waiting on two assignments and the release, perhaps busy wait is OK here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also use some other synchronisation mechanism here to wait/notify (that didn't add the waiter to the waiters for the Sync), but that might be overkill?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems ok to me; possibly this is even more efficient, not that efficiency particularly matters in these cases.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems fine to me as well. I think acquire internally also does busy waiting.

future.set(new PlainActionFuture<>());
}
future.get().onResponse(null);
}
Copy link
Contributor Author

@nicktindall nicktindall Jul 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue only manifests in the moment the PlainActionFuture is first resolved, this test reproduces fairly quickly when we do the above. I wish we didn't have to create objects in a tight loop.

}
startBarrier.countDown();
safeAwait(startBarrier);
cs.poll(250, TimeUnit.MILLISECONDS);
Copy link
Contributor Author

@nicktindall nicktindall Jul 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test will unfortunately run the full 250ms in the happy case, on my desktop it seemed to fail around the 80ms mark fairly consistently before the fix.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather a test which fails sometimes but always runs quickly, because we run this so often in CI that it'll catch the bug pretty soon anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, I reduced to 20ms which is still surprisingly reliable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering why we need such a complicated test to reproduce this. Something like this seems to work too:

diff --git a/server/src/test/java/org/elasticsearch/action/support/PlainActionFutureTests.java b/server/src/test/java/org/elasticsearch/action/support/PlainActionFutureTests.java
index 2ca914eb23c..68c706c7f3c 100644
--- a/server/src/test/java/org/elasticsearch/action/support/PlainActionFutureTests.java
+++ b/server/src/test/java/org/elasticsearch/action/support/PlainActionFutureTests.java
@@ -14,6 +14,8 @@ import org.elasticsearch.common.util.concurrent.UncategorizedExecutionException;
 import org.elasticsearch.core.Assertions;
 import org.elasticsearch.core.TimeValue;
 import org.elasticsearch.test.ESTestCase;
+import org.elasticsearch.threadpool.TestThreadPool;
+import org.elasticsearch.threadpool.ThreadPool;
 import org.elasticsearch.transport.RemoteTransportException;

 import java.util.concurrent.CancellationException;
@@ -180,4 +182,19 @@ public class PlainActionFutureTests extends ESTestCase {
             }
         };
     }
+
+    public void testConcurrentCompletion() {
+        try (var threadPool = new TestThreadPool(getTestName())) {
+            final var future = new PlainActionFuture<>();
+            final var threadCount = threadPool.info(ThreadPool.Names.GENERIC).getMax();
+            final var barrier = new CyclicBarrier(threadCount);
+            for (int taskIndex = 0; taskIndex < threadCount; taskIndex++) {
+                threadPool.generic().execute(() -> {
+                    safeAwait(barrier);
+                    future.onResponse(null);
+                });
+            }
+            assertNull(safeGet(future));
+        }
+    }
 }

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

slightly nicer:

diff --git a/server/src/test/java/org/elasticsearch/action/support/PlainActionFutureTests.java b/server/src/test/java/org/elasticsearch/action/support/PlainActionFutureTests.java
index 2ca914eb23c..4893dfc91e8 100644
--- a/server/src/test/java/org/elasticsearch/action/support/PlainActionFutureTests.java
+++ b/server/src/test/java/org/elasticsearch/action/support/PlainActionFutureTests.java
@@ -9,11 +9,14 @@
 package org.elasticsearch.action.support;

 import org.elasticsearch.ElasticsearchException;
+import org.elasticsearch.action.ActionRunnable;
 import org.elasticsearch.common.util.concurrent.FutureUtils;
 import org.elasticsearch.common.util.concurrent.UncategorizedExecutionException;
 import org.elasticsearch.core.Assertions;
 import org.elasticsearch.core.TimeValue;
 import org.elasticsearch.test.ESTestCase;
+import org.elasticsearch.threadpool.TestThreadPool;
+import org.elasticsearch.threadpool.ThreadPool;
 import org.elasticsearch.transport.RemoteTransportException;

 import java.util.concurrent.CancellationException;
@@ -180,4 +183,18 @@ public class PlainActionFutureTests extends ESTestCase {
             }
         };
     }
+
+    public void testConcurrentCompletion() {
+        try (var threadPool = new TestThreadPool(getTestName())) {
+            final var future = new PlainActionFuture<>();
+            final var executorName = randomFrom(ThreadPool.Names.GENERIC, ThreadPool.Names.MANAGEMENT);
+            final var threadCount = threadPool.info(executorName).getMax();
+            final var executor = threadPool.executor(executorName);
+            final var barrier = new CyclicBarrier(threadCount);
+            for (int i = 0; i < threadCount; i++) {
+                executor.execute(ActionRunnable.run(future, () -> safeAwait(barrier)));
+            }
+            assertNull(safeGet(future));
+        }
+    }
 }

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Weird, it fails pretty reliably for me. Although 5k iterations isn't very many.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try running with -Dtests.iters=1000 -Dtests.failfast=true, no need to wait for seconds+ for Gradle to start up on each iteration.

Copy link
Contributor Author

@nicktindall nicktindall Jul 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what the difference is, but I can't get it to fail. This M3 MacBook seems to eat race conditions for breakfast (I also had trouble reproducing this one).

I ran for 800,000 iterations which took ~48 minutes and produced zero failures, where the latest iteration of my longer test fails almost immediately. While I accept that CI Is probably going to yield these failures more reliably, I think it's probably best if we have a test that reproduces on typical dev hardware.

Some theories on why the longer test is more reliable

  • it produces ~20 PlainActionFutures, so there's 20x the opportunity for it to occur (it can only occur once per future)
  • the threads hitting onResponse are hot, possibly running JIT-ed code, as opposed to just woken up from a wait. Perhaps this increases the likelihood of the critical piece of concurrency.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's probably best if we have a test that reproduces on typical dev hardware.

Hmm I think this is the wrong way round - if "typical dev hardware" isn't able to reproduce genuine race conditions that happen in CI/production then I don't think we should work around that by making one specific test more complex. I'd recommend using a GCP machine for long-running repro attempts rather than your laptop since it'll be closer to real production infra.

Copy link
Contributor Author

@nicktindall nicktindall Jul 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think my version of the test is about as slender as it can be now (thanks for the feedback), but I'm OK with replacing it with the much shorter test if you'd prefer.
I'm satisfied that the issue was there and that the change fixes it, and it sounds like that test will fail in CI if we regress, so I'm happy to go with consensus on the cost/benefit of the additional complexity.

@nicktindall nicktindall added :Delivery/Build Build or test infrastructure >bug labels Jul 2, 2024
@elasticsearchmachine elasticsearchmachine added the Team:Delivery Meta label for Delivery team label Jul 2, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-delivery (Team:Delivery)

@elasticsearchmachine elasticsearchmachine removed the needs:triage Requires assignment of a team area label label Jul 2, 2024
@DaveCTurner DaveCTurner added :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. and removed :Delivery/Build Build or test infrastructure Team:Delivery Meta label for Delivery team labels Jul 2, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@elasticsearchmachine elasticsearchmachine added the Team:Distributed Meta label for distributed team label Jul 2, 2024
@elasticsearchmachine
Copy link
Collaborator

Hi @nicktindall, I've created a changelog YAML for you.

Copy link
Member

@ywangd ywangd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Nice catch! TIL!

// when multiple threads from the same pool are completing the future
while (isDone() == false) {
Thread.onSpinWait();
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems fine to me as well. I think acquire internally also does busy waiting.

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should try and simplify the test

}
startBarrier.countDown();
safeAwait(startBarrier);
cs.poll(250, TimeUnit.MILLISECONDS);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's probably best if we have a test that reproduces on typical dev hardware.

Hmm I think this is the wrong way round - if "typical dev hardware" isn't able to reproduce genuine race conditions that happen in CI/production then I don't think we should work around that by making one specific test more complex. I'd recommend using a GCP machine for long-running repro attempts rather than your laptop since it'll be closer to real production infra.

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I think the test is simple enough now for my tastes, although IMO it deserves a few more comments explaining why it's not any simpler.

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thanks for the extra iterations Nick

@nicktindall nicktindall merged commit fcaef59 into elastic:main Jul 3, 2024
15 checks passed
@nicktindall nicktindall deleted the fix/110360_assertCompleteAllowed_false_positive branch July 3, 2024 09:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. Team:Distributed Meta label for distributed team v8.15.0
Projects
None yet
4 participants