[CELEBORN-1643] DataPusher handle InterruptedException
### What changes were proposed in this pull request?
`DataPusher.waitIdleQueueFullWithLock` now also stops waiting when `pushThread` is null or no longer alive.

### Why are the changes needed?
Killing a task interrupts `pushThread`, which may then exit without calling `reclaimTask`. In that case `idleQueue` is never refilled and still has remaining capacity, so the task thread keeps waiting in `waitIdleQueueFullWithLock` and never exits.

This problem was introduced by CELEBORN-1544.

```java
24/10/10 15:43:43,103 [Executor task launch worker for task 356.1 in stage 4447.0 (TID 1126065)] ERROR DataPusher: DataPusher thread interrupted while adding push task.
24/10/10 15:43:43,103 [DataPusher-1126065] INFO DataPushQueue: Thread interrupted while waiting push task.
24/10/10 15:43:43,103 [DataPusher-1126065] ERROR DataPusher: DataPusher push thread interrupted while pushing data.
24/10/10 15:43:53,099 [Task reaper-6] WARN Executor: Killed task 1126065 is still running after 10000 ms
24/10/10 15:43:53,157 [Task reaper-6] WARN Executor: Thread dump from task 1126065:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2163)
org.apache.celeborn.client.write.DataPusher.waitIdleQueueFullWithLock(DataPusher.java:215)
org.apache.celeborn.client.write.DataPusher.waitOnTermination(DataPusher.java:167)
org.apache.spark.shuffle.celeborn.SortBasedPusher.close(SortBasedPusher.java:453)
org.apache.spark.shuffle.celeborn.SortBasedShuffleWriter.cleanupPusher(SortBasedShuffleWriter.java:379)
org.apache.spark.shuffle.celeborn.SortBasedShuffleWriter.write(SortBasedShuffleWriter.java:240)
org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)

```
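The hang can be illustrated with a small standalone sketch. This is not the Celeborn code: the class `IdleQueueWaitSketch`, the `run` helper, the `maxRounds` cap (a stand-in for "waits forever"), and the 10 ms wait time are assumptions made for illustration, and the `exceptionRef` check is left out for brevity. Only the shape of the loop condition follows the patch.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch; names other than idleQueue, idleFull, pushThread and
// WAIT_TIME_NANOS are invented for this example.
class IdleQueueWaitSketch {
  // Assumed value for the sketch, not taken from DataPusher.
  private static final long WAIT_TIME_NANOS = TimeUnit.MILLISECONDS.toNanos(10);

  private final ReentrantLock lock = new ReentrantLock();
  private final Condition idleFull = lock.newCondition();
  // Starts empty, so remainingCapacity() > 0: tasks were taken and never reclaimed.
  private final ArrayBlockingQueue<Object> idleQueue = new ArrayBlockingQueue<>(2);
  private final Thread pushThread;

  IdleQueueWaitSketch(Thread pushThread) {
    this.pushThread = pushThread;
  }

  // Returns true if the loop condition became false, or false if maxRounds
  // timed waits elapsed first (the real loop would keep waiting).
  boolean waitIdleQueueFull(boolean checkPushThread, int maxRounds)
      throws InterruptedException {
    lock.lock();
    try {
      for (int round = 0; round < maxRounds; round++) {
        boolean keepWaiting = idleQueue.remainingCapacity() > 0
            && (!checkPushThread || (pushThread != null && pushThread.isAlive()));
        if (!keepWaiting) {
          return true;
        }
        idleFull.await(WAIT_TIME_NANOS, TimeUnit.NANOSECONDS);
      }
      return false;
    } finally {
      lock.unlock();
    }
  }

  // Starts a push thread, interrupts it so it dies without refilling idleQueue,
  // then runs the wait loop with or without the liveness check.
  public static boolean run(boolean checkPushThread) throws InterruptedException {
    Thread pushThread = new Thread(() -> {
      try {
        Thread.sleep(60_000);
      } catch (InterruptedException e) {
        // Exit without reclaiming anything, like the interrupted push thread.
      }
    }, "DataPusher-sketch");
    pushThread.start();
    pushThread.interrupt();
    pushThread.join();
    return new IdleQueueWaitSketch(pushThread).waitIdleQueueFull(checkPushThread, 20);
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println("old condition exits: " + run(false)); // prints false
    System.out.println("new condition exits: " + run(true));  // prints true
  }
}
```

With the old condition the waiter only stops because the sketch caps the number of rounds. With the liveness check it returns on the first pass, since the push thread is already dead.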

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
GA

Closes #2805 from cxzl25/CELEBORN-1643.

Authored-by: sychen <[email protected]>
Signed-off-by: Shaoyun Chen <[email protected]>
cxzl25 committed Oct 15, 2024
1 parent 991f1c2 commit 6b3df47
Showing 1 changed file with 3 additions and 1 deletion.
@@ -213,7 +213,9 @@ protected void pushData(PushTask task) throws IOException {

   private void waitIdleQueueFullWithLock() throws InterruptedException {
     try {
-      while (idleQueue.remainingCapacity() > 0 && exceptionRef.get() == null) {
+      while (idleQueue.remainingCapacity() > 0
+          && exceptionRef.get() == null
+          && (pushThread != null && pushThread.isAlive())) {
         idleFull.await(WAIT_TIME_NANOS, TimeUnit.NANOSECONDS);
       }
     } catch (InterruptedException e) {