Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TEZ-4580: Slow preemption of new containers when re-use is enabled #374

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

himanshu-mishra
Copy link
Contributor

When container reuse is enabled, preemption of lower priority containers that are not yet assigned to task, takes long time as they are released one at a time, and not the number of containers based when tez.am.preemption.percentage is high added in https://issues.apache.org/jira/browse/TEZ-1742.

Further investigation lead to following conclusion:

  1. Warn log / Assertion error thrown because in preemptIfNeeded(), when releasing new containers, the loop counter is being decremented with each releaseUnassignedContainers, leading to looping only half number of times. By using another counter, assertion passes because of condition method returns with check if (numPendingRequestsToService < 1) {.

  2. In releaseContainer(), the container is not getting removed from delayedContainers queue and only from heldContainers map, hence same container is being picked up for release in every iteration till next cycle of DelayedContainerManager finds out that the container is not in heldContainers and skips it with log Skipping delayed container as container is no longer running, containerId=...

This change adds a method in DelayedContainerManager to allow removal of delayed container and invokes it in releaseContainer method, which so far only removed it from heldContainers map.

@tez-yetus
Copy link

🎊 +1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 0m 9s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 1 new or modified test files.
_ master Compile Tests _
+1 💚 mvninstall 15m 50s master passed
+1 💚 compile 0m 26s master passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu322.04
+1 💚 compile 0m 22s master passed with JDK Private Build-1.8.0_422-8u422-b05-1~22.04-b05
+1 💚 checkstyle 1m 21s master passed
+1 💚 javadoc 0m 31s master passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu322.04
+1 💚 javadoc 0m 16s master passed with JDK Private Build-1.8.0_422-8u422-b05-1~22.04-b05
+0 🆗 spotbugs 1m 20s Used deprecated FindBugs config; considering switching to SpotBugs.
+1 💚 findbugs 1m 19s master passed
_ Patch Compile Tests _
+1 💚 mvninstall 0m 15s the patch passed
+1 💚 compile 0m 17s the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu322.04
+1 💚 javac 0m 17s the patch passed
+1 💚 compile 0m 14s the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~22.04-b05
+1 💚 javac 0m 14s the patch passed
-0 ⚠️ checkstyle 0m 13s tez-dag: The patch generated 1 new + 146 unchanged - 0 fixed = 147 total (was 146)
+1 💚 whitespace 0m 0s The patch has no whitespace issues.
+1 💚 javadoc 0m 7s the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu322.04
+1 💚 javadoc 0m 7s the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~22.04-b05
+1 💚 findbugs 0m 46s the patch passed
_ Other Tests _
+1 💚 unit 4m 25s tez-dag in the patch passed.
+1 💚 asflicense 0m 15s The patch does not generate ASF License warnings.
27m 34s
Subsystem Report/Notes
Docker ClientAPI=1.47 ServerAPI=1.47 base: https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-374/1/artifact/out/Dockerfile
GITHUB PR #374
JIRA Issue TEZ-4580
Optional Tests dupname asflicense javac javadoc unit spotbugs findbugs checkstyle compile
uname Linux 7ae9360fbab1 5.15.0-117-generic #127-Ubuntu SMP Fri Jul 5 20:13:28 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality personality/tez.sh
git revision master / 607b2bc
Default Java Private Build-1.8.0_422-8u422-b05-1~22.04-b05
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu322.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_422-8u422-b05-1~22.04-b05
checkstyle https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-374/1/artifact/out/diff-checkstyle-tez-dag.txt
Test Results https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-374/1/testReport/
Max. process+thread count 201 (vs. ulimit of 5500)
modules C: tez-dag U: tez-dag
Console output https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-374/1/console
versions git=2.34.1 maven=3.6.3 findbugs=3.0.1
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

if (delayedContainers.remove(container)) {
LOG.info("Removed {} from delayed containers", container.getContainer().getId());
} else {
LOG.warn("Unknown container {} sent for removal. Ignoring.", container.getContainer().getId());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A valid case here is when a new container is allocated - added to delayedContainers, it is polled (removed) from queue but if there is no pending task request releaseContainer() is invoked. Please suggest if the log level should be changed in info or for both the newly added logs should be at debug.

prabhjyotsingh added a commit to acceldata-io/tez that referenced this pull request Nov 14, 2024
prabhjyotsingh added a commit to acceldata-io/tez that referenced this pull request Nov 20, 2024
prabhjyotsingh added a commit to acceldata-io/tez that referenced this pull request Nov 20, 2024
shubhluck pushed a commit to acceldata-io/tez that referenced this pull request Nov 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants