Benchmark 'Failed to poll' errors, probably figured out the cause #180

Open
hkimura opened this issue Oct 20, 2014 · 3 comments

Comments


hkimura commented Oct 20, 2014

A few issues report "Failed to poll" errors during benchmarks, such as #136.

I have also been hitting this issue, and it seems to happen more often when I have many sites and partitions. After reading the related source code, I suspect the cause is a too-short wait before polling site processes in ProcessSetManager.

The code appears to have a hard-coded wait time before it starts polling site processes, and after the poll it shuts down if any site process is still unreachable. When there are many sites/partitions, they seem to take longer and longer to respond (with some variation, which seems to be why this issue happens only some of the time).
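To make the suspected failure mode concrete, here is a minimal simulation of "wait a fixed time, poll once, give up if anything is unreachable". It is only a sketch of the behavior as I understand it, not the actual ProcessSetManager code: the class name, method names, and per-site startup times are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical simulation, not taken from ProcessSetManager.
 * Models a single poll performed after a fixed wait.
 */
final class FixedWaitPollSketch {
    private FixedWaitPollSketch() {}

    /**
     * Returns the indexes of sites that are still unreachable when the
     * single poll happens after waitMs.
     *
     * @param waitMs    fixed wait before polling, in milliseconds
     * @param startupMs simulated time each site needs before it can respond
     */
    static List<Integer> unreachableAfter(long waitMs, long[] startupMs) {
        List<Integer> unreachable = new ArrayList<>();
        for (int i = 0; i < startupMs.length; i++) {
            // A site that needs longer than the wait has not come up yet.
            if (startupMs[i] > waitMs) {
                unreachable.add(i);
            }
        }
        return unreachable;
    }

    /** The poll succeeds only if every site is reachable by waitMs. */
    static boolean pollSucceeds(long waitMs, long[] startupMs) {
        return unreachableAfter(waitMs, startupMs).isEmpty();
    }
}
```

With assumed startup times of 500, 1800, and 3200 ms, `pollSucceeds(2000, ...)` is false because the third site is still starting, while `pollSucceeds(20000, ...)` is true. One slow site out of many is enough to fail the whole run, which would match the intermittent nature of the error.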

The current wait time is 2.5 sec for the initial wait and 2 sec for the polling wait. After increasing them to 25 sec/20 sec, I no longer see this error.

Of course, I might have missed something. Please check if this is the case.
If it is, I'd propose making the wait time either configurable or proportional to the number of sites or partitions.
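A rough sketch of what "configurable or proportional" could look like is below. Everything in it is an assumption for illustration: the helper class, the `processset.polling.delay.ms` property name, and the 100 ms per-partition increment do not exist in the code base, and the increment in particular would need tuning against real runs. Only the 2 sec base value comes from the current polling wait.

```java
/**
 * Hypothetical helper, not part of ProcessSetManager.
 * Picks a polling delay from an explicit override or from the partition count.
 */
final class PollingDelays {
    /** Assumed system property name for an explicit override, in ms. */
    static final String OVERRIDE_PROPERTY = "processset.polling.delay.ms";

    /** Base delay, matching the current 2 sec polling wait. */
    static final long BASE_DELAY_MS = 2000;

    /** Assumed extra wait per partition; a guess that would need tuning. */
    static final long PER_PARTITION_MS = 100;

    private PollingDelays() {}

    /** Delay that grows linearly with the number of partitions. */
    static long scaledDelayMs(int numPartitions) {
        if (numPartitions < 0) {
            throw new IllegalArgumentException("numPartitions must be >= 0");
        }
        return BASE_DELAY_MS + PER_PARTITION_MS * numPartitions;
    }

    /**
     * Uses the override when it parses as a positive number of milliseconds,
     * otherwise falls back to the scaled default.
     */
    static long resolveDelayMs(String override, int numPartitions) {
        if (override != null) {
            try {
                long value = Long.parseLong(override.trim());
                if (value > 0) {
                    return value;
                }
            } catch (NumberFormatException e) {
                // Ignore a malformed override and use the scaled default.
            }
        }
        return scaledDelayMs(numPartitions);
    }

    /** Convenience overload that reads the override from a system property. */
    static long resolveDelayMs(int numPartitions) {
        return resolveDelayMs(System.getProperty(OVERRIDE_PROPERTY), numPartitions);
    }
}
```

Under these assumed constants, 64 partitions would give 2000 + 64 * 100 = 8400 ms, and running with `-Dprocessset.polling.delay.ms=25000` would force a 25 sec wait regardless of the partition count.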

Owner

apavlo commented Oct 20, 2014

This sounds reasonable. Do you want to send a pull request?

Author

hkimura commented Oct 20, 2014

My local fix is just a hack (more hard-coded values), so I guess you need to come up with a "correct" fix.

Author

hkimura commented Oct 23, 2014

In case you'd like to try the hack solution:

    diff --git a/src/frontend/org/voltdb/processtools/ProcessSetManager.java b/src/frontend/org/voltdb/processtools/ProcessSetManager.java
    index 34ad279..4b70482 100644
    --- a/src/frontend/org/voltdb/processtools/ProcessSetManager.java
    +++ b/src/frontend/org/voltdb/processtools/ProcessSetManager.java
    @@ -74,7 +74,7 @@ public class ProcessSetManager implements Shutdownable {
          * How long to wait after a process starts before we will check whether
          * it's still alive.
          */
    -    private static final int POLLING_DELAY = 2000; // ms
    +    private static final int POLLING_DELAY = 60000; // ms
     
         /**
          * Regular expressions of strings that we want to exclude from the remote
    @@ -326,7 +326,7 @@ public class ProcessSetManager implements Shutdownable {
         }
     
         public ProcessSetManager() {
    -        this(null, false, 10000, null);
    +        this(null, false, 60000, null);
         }
     
         // ============================================================================

One minute is a long wait, but a TPC-C run with many partitions takes several minutes to populate the initial data anyway, so it shouldn't be a big issue.
