A few issues report "Failed to poll" during benchmarks, such as #136.
I have also been hitting this issue, and it seems to happen more often when I have many sites and partitions. After reading the related source code, I suspect the cause is a too-short wait before polling site processes in ProcessSetManager.
It looks like the code has a hard-coded wait time before it starts polling site processes, and after the poll it shuts down if any site process is not reachable by that time. With many sites/partitions, the processes seem to take longer and longer to respond (with some variation, which would explain why the issue only happens intermittently).
The current wait time is 2.5 sec for the initial wait and 2 sec for the polling wait. After increasing them to 25 sec/20 sec, I no longer see this error.
Of course, I might have missed something. Please check if this is the case.
If it is, I'd propose making the wait time either configurable or proportional to the number of sites or partitions.
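To make the proposal concrete, here is a minimal sketch of how a configurable or proportional wait could look. Everything in it is an assumption for illustration: the helper class `PollingDelays`, the system property name `example.polling.delay.ms`, and the 100 ms per-partition increment are all hypothetical and do not exist in the codebase. Only the 2.5 sec/2 sec base values come from the numbers quoted above.

```java
/**
 * Hypothetical helper (not part of the existing code): computes a polling
 * delay that is either overridden via a system property or scaled with the
 * number of partitions.
 */
final class PollingDelays {
    // Base values taken from the wait times quoted in this issue.
    static final long BASE_INITIAL_DELAY_MS = 2500;
    static final long BASE_POLLING_DELAY_MS = 2000;

    // Assumed increment per partition; the right value would need measuring.
    static final long PER_PARTITION_MS = 100;

    private PollingDelays() {}

    /** Proportional option: base delay plus a fixed increment per partition. */
    static long scaled(long baseMs, int partitions) {
        return baseMs + PER_PARTITION_MS * Math.max(0, partitions);
    }

    /**
     * Configurable option: a positive value in the named system property
     * wins; otherwise fall back to the proportional delay.
     */
    static long resolve(String property, long baseMs, int partitions) {
        Long override = Long.getLong(property); // null if unset or not a number
        if (override != null && override > 0) {
            return override;
        }
        return scaled(baseMs, partitions);
    }
}
```

With 64 partitions and no override, the polling delay in this sketch works out to 2000 + 64 * 100 = 8400 ms. Passing `-Dexample.polling.delay.ms=20000` (again, a made-up property name) would pin it to 20 sec instead.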
diff --git a/src/frontend/org/voltdb/processtools/ProcessSetManager.java b/src/frontend/org/voltdb/processtools/ProcessSetManager.java
index 34ad279..4b70482 100644
--- a/src/frontend/org/voltdb/processtools/ProcessSetManager.java
+++ b/src/frontend/org/voltdb/processtools/ProcessSetManager.java
@@ -74,7 +74,7 @@ public class ProcessSetManager implements Shutdownable {
* How long to wait after a process starts before we will check whether
* it's still alive.
*/
-    private static final int POLLING_DELAY = 2000; // ms
+    private static final int POLLING_DELAY = 60000; // ms
/**
Regular expressions of strings that we want to exclude from the remote
@@ -326,7 +326,7 @@ public class ProcessSetManager implements Shutdownable {
}