Purpose
Certain commands poll the server as often as every 5 seconds. During periods of high server load and/or long queues, this can effectively turn imputationbot into a denial-of-service (DoS) attack on the server.
Suggested changes
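Reduce the polling frequency for "download once job finished" and "retry after 500 error" to once every 60 seconds. This should give the server more time to recover from temporary downtime caused by slow queries, GC pauses, or connection pool saturation.
In some cases, this will reduce routine polling load on the server by more than 90%. For example, a loop that polls every 5 seconds sends 12 requests per minute, while one that polls every 60 seconds sends 1.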
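The size of the reduction depends on which timer is being changed. The following is a minimal sketch of that arithmetic, assuming each loop polls at a fixed interval and that load scales with request count; the class and method names are hypothetical and are not part of imputationbot.

```java
/** Hypothetical helper for illustration only; not part of imputationbot. */
public final class PollingReduction {

    private PollingReduction() {
    }

    /**
     * Fraction of polling requests eliminated when a fixed polling interval
     * grows from oldSeconds to newSeconds (assumes one request per interval).
     */
    public static double reduction(double oldSeconds, double newSeconds) {
        if (oldSeconds <= 0 || newSeconds < oldSeconds) {
            throw new IllegalArgumentException("expected 0 < oldSeconds <= newSeconds");
        }
        return 1.0 - oldSeconds / newSeconds;
    }
}
```

Under these assumptions, `PollingReduction.reduction(5, 60)` is about 0.917, `reduction(10, 60)` is about 0.833, and `reduction(30, 60)` is 0.5, so only the 5-second timer accounts for a reduction above 90%.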
Notes
Open questions
A proof-of-concept patch is below. It has had very limited testing, so please verify that these are the correct constants for the intended behaviors.
In particular, I'm not sure about the organizational level of `project`, and want to be sure that sleep timers aren't being double counted. It is unclear whether these can be abstracted into a new shared constant, or if there's a reason for the different timers.
Future changes
The "download when job complete" check could be extended past 1 minute.
For 500 errors, a capped exponential backoff might be preferred, but would entail larger refactoring.
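For reference, a capped exponential backoff could look roughly like the sketch below. It is not part of the proposed patch, and the `Backoff` class, its method, and the 5-second base and 60-second cap used in the usage note are assumptions chosen for illustration, not existing imputationbot code.

```java
/** Hypothetical helper for illustration only; not part of imputationbot. */
public final class Backoff {

    private Backoff() {
    }

    /**
     * Returns the wait before retry number attempt (0-based):
     * baseMillis doubled once per previous attempt, never exceeding maxMillis.
     */
    public static long delayMillis(int attempt, long baseMillis, long maxMillis) {
        if (attempt < 0 || baseMillis <= 0 || maxMillis < baseMillis) {
            throw new IllegalArgumentException("expected attempt >= 0 and 0 < baseMillis <= maxMillis");
        }
        long delay = baseMillis;
        // Stop doubling as soon as the cap is reached; comparing against
        // maxMillis / 2 avoids overflow when doubling large values.
        for (int i = 0; i < attempt && delay < maxMillis; i++) {
            delay = (delay > maxMillis / 2) ? maxMillis : delay * 2;
        }
        return delay;
    }
}
```

With a 5-second base and a 60-second cap, `Backoff.delayMillis(attempt, 5_000, 60_000)` yields waits of 5, 10, 20, 40, and 60 seconds for attempts 0 through 4, and stays at 60 seconds afterwards. A retry loop would sleep for that long before each new attempt; adding random jitter is a common refinement but is left out here to keep the sketch small.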
Proposed patch
diff --git a/src/main/java/genepi/imputationbot/client/CloudgeneClient.java b/src/main/java/genepi/imputationbot/client/CloudgeneClient.java
index d03dd80..1479016 100644
--- a/src/main/java/genepi/imputationbot/client/CloudgeneClient.java
+++ b/src/main/java/genepi/imputationbot/client/CloudgeneClient.java
@@ -30,7 +30,7 @@ public class CloudgeneClient {
     public static int MAX_ATTEMPTS = 5;
-    public static int WAIT_BETWEEN_ATTEMPTS_SEC = 5;
+    public static int WAIT_BETWEEN_ATTEMPTS_SEC = 60;
     public static final String USER_AGENT = "imputation-bot " + App.VERSION + " (OS: " + OperatingSystem.NAME
             + ", Java: " + System.getProperty("java.version") + ")";
@@ -134,7 +134,7 @@ public class CloudgeneClient {
     }
     public void waitForProject(Project project) throws CloudgeneException, InterruptedException {
-        waitForProject(project, 10000);
+        waitForProject(project, 60000);
     }
     public void waitForProject(Project project, int pollingTime) throws CloudgeneException, InterruptedException {
@@ -152,7 +152,7 @@ public class CloudgeneClient {
     public void waitForJob(String id) throws CloudgeneException, InterruptedException {
         // noinspection StatementWithEmptyBody
-        while (!waitForJob(id, 10000)) {
+        while (!waitForJob(id, 60000)) {
         }
     }
diff --git a/src/main/java/genepi/imputationbot/commands/DownloadResults.java b/src/main/java/genepi/imputationbot/commands/DownloadResults.java
index ece3a64..452de32 100644
--- a/src/main/java/genepi/imputationbot/commands/DownloadResults.java
+++ b/src/main/java/genepi/imputationbot/commands/DownloadResults.java
@@ -62,7 +62,7 @@ public class DownloadResults extends BaseCommand {
             if (job.isRunning()) {
                 println("Job " + job.getId() + " is running. Download starts automatically when job is finished...");
                 // noinspection StatementWithEmptyBody
-                while(!client.waitForJob(job.getId(), 30 * 1000)){}
+                while(!client.waitForJob(job.getId(), 60 * 1000)){}
                 job = client.getJobDetails(job.getId());