Benchmark 'Failed to poll' errors, probably figured out the cause #180

hkimura · 2014-10-20T03:00:54Z

A few issues report "Failed to poll" errors during benchmarks, such as #136.

I have also been hitting this issue, and it seems to happen more often when I have many sites and partitions. After reading the related source code, I suspect the cause is a too-short wait before polling site processes in ProcessSetManager.

It looks like the code has a hard-coded wait time before it starts polling site processes, and after the poll it shuts down if any site process is still unreachable by that time. With many sites/partitions, they seem to take longer and longer to respond (with some variation, which would explain why the issue happens only intermittently).

The current wait times are 2.5 sec for the initial wait and 2 sec for the polling wait. After I increased them to 25 sec/20 sec, I no longer saw this error.

Of course, I might have missed something. Please check if this is the case.
If it is, I'd propose either making the wait time configurable or proportional to the number of sites or partitions.
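
A rough sketch of the "proportional" option, with an optional override to cover the "configurable" one. This is only an illustration: the class, the method names, the linear formula, the cap, and the system property name are all hypothetical and are not taken from ProcessSetManager.

```java
// Hypothetical helper, not part of the existing code base.
final class PollingDelaySketch {
    private PollingDelaySketch() {}

    /**
     * Returns baseMs plus perPartitionMs for each partition, capped at maxMs.
     * The linear scaling is an assumption; the real start-up cost may grow differently.
     */
    static long scaledDelayMs(long baseMs, long perPartitionMs, int numPartitions, long maxMs) {
        if (baseMs < 0 || perPartitionMs < 0 || numPartitions < 0 || maxMs < 0) {
            throw new IllegalArgumentException("arguments must be non-negative");
        }
        long scaled = baseMs + perPartitionMs * numPartitions;
        return Math.min(scaled, maxMs);
    }

    /**
     * Lets a JVM system property (name chosen by the caller, hypothetical here)
     * override the computed delay; falls back to computedMs if it is unset or unparsable.
     */
    static long resolveDelayMs(String propertyName, long computedMs) {
        return Long.getLong(propertyName, computedMs);
    }
}
```

For example, `scaledDelayMs(2000, 100, 8, 60000)` keeps the delay close to the current 2 sec for a small cluster, while a run with hundreds of partitions would be pushed up to the cap.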

apavlo · 2014-10-20T19:10:54Z

This sounds reasonable. Do you want to send a pull request?

hkimura · 2014-10-20T19:57:17Z

My local fix is just a hack (more hard-coded values), so I guess you need to come up with a "correct" fix.

hkimura · 2014-10-23T20:11:45Z

In case you'd like to try the hack solution:

```diff
diff --git a/src/frontend/org/voltdb/processtools/ProcessSetManager.java b/src/frontend/org/voltdb/processtools/ProcessSetManager.java
index 34ad279..4b70482 100644
--- a/src/frontend/org/voltdb/processtools/ProcessSetManager.java
+++ b/src/frontend/org/voltdb/processtools/ProcessSetManager.java
@@ -74,7 +74,7 @@ public class ProcessSetManager implements Shutdownable {
      * How long to wait after a process starts before we will check whether
      * it's still alive.
      */
-    private static final int POLLING_DELAY = 2000; // ms
+    private static final int POLLING_DELAY = 60000; // ms
 
     /**
      * Regular expressions of strings that we want to exclude from the remote
@@ -326,7 +326,7 @@ public class ProcessSetManager implements Shutdownable {
     }
 
     public ProcessSetManager() {
-        this(null, false, 10000, null);
+        this(null, false, 60000, null);
     }
 
     // ============================================================================
```

1 minute is a long wait, but a TPC-C run with many partitions takes several minutes to populate the initial data anyway, so it shouldn't be a big issue.
