Add retryIntervalJitter option (follow-on from #163) #212

timsutton · 2020-05-25T17:59:52Z

👋 This is a follow-up to #163 for the same use case of adding a random jitter interval to the retry wait period.

I tried to follow the suggested approach of taking the value as an integer and simply adding it to the retry wait period. I opted to name the option -retryIntervalJitter, but open to any suggestions for naming or any other improvements here. Thanks!

Update to latest upstream swarm-client

…base * origin/master: (45 commits) Remove usage of deprecated buildPlugin.recommendedConfigurations() Bump oshi-core from 5.0.1 to 5.0.2 Adapt to oshi/oshi#1191 Bump oshi-core from 4.7.1 to 5.0.1 Bump oshi-core from 4.6.1 to 4.7.1 Update badges Bump plugin dependencies Update parent POMs [JENKINS-62033] - Swarm client -disableSslVerification option does not disable SSL hostname verification Bump maven-shade-plugin from 3.2.2 to 3.2.3 Bump oshi-core from 4.5.2 to 4.6.1 [maven-release-plugin] prepare for next development iteration [maven-release-plugin] prepare release swarm-plugin-3.19 Bump remoting from 4.2 to 4.3 Bump oshi-core from 4.5.0 to 4.5.2 Bump oshi-core from 4.4.2 to 4.5.0 Bump httpclient from 4.5.11 to 4.5.12 Lost test cases added (jenkinsci#176). Fix double-merge [JENKINS-44115] Enable Remoting working directories by default in Swarm plugin (jenkinsci#176) ...

- also a readability adjustment

- addressing feedback in jenkinsci#163

justinseanmartin · 2020-05-28T18:53:32Z

@basil - Any chance we could get this looked at?

basil · 2020-06-04T20:07:34Z

Hey @timsutton, thanks for the PR! The implementation looks good, but I need to think a little bit more about the API. I need to make sure we're designing an API that is supportable and extensible in the long run, given all the different types of retry algorithms and strategies. I'll try and take a closer look at this in the next week or two.

timsutton · 2020-06-04T23:05:50Z

@basil Thanks! Happy to chat meanwhile about the use case, semantics, etc.

basil · 2020-06-05T19:46:20Z

Rebuilding to hopefully get a clean test run.

basil

Hey @timsutton, first of all, thank you again for contributing this PR! This looks great overall. The devil is in the details, and I have spent some time preparing feedback below on the API, implementation, use cases, documentation, and defaults. This might seem like a lot of feedback, but once it is addressed I think this plugin's retrying support will be in great shape. Let me know if I can help explain any of this further.

API

I spent some time thinking about the API and comparing it to the APIs available in Tenacity and Failsafe. I came to a few conclusions:

The new -retryIntervalJitter option makes sense and matches the jitter API available in Tenacity and Failsafe. Combining this jitter implementation with the existing NONE (i.e., fixed-wait delay) strategy should solve the thundering herd problem! :party:
The existing -retryInterval option is confusingly-named. For the NONE (i.e., fixed-wait delay) strategy, it makes sense. But for the LINEAR (i.e., incremental wait) strategy, it is used as the amount by which delays are incremented, and for the EXPONENTIAL (i.e., exponential backoff) strategy it is used as the factor by which delays are multiplied. The command-line help is not clear in this regard.
Tenacity and Failsafe both offer options to wait a random amount of time between a minimum and maximum value. This seems like a generally desirable feature to have along with exponential backoff. This PR comes very close to providing that in Swarm, but it falls a little short. With this PR, if one sets the strategy to NONE, -retryInterval to 0, and -retryIntervalJitter to the desired maximum value, one can effectively implement waiting a random amount of time between 0 and the value of -retryIntervalJitter. This would technically satisfy Kirk's use case from Support jitter in client reconnect logic #163, but this API is non-obvious if one isn't knowledgeable about the implementation. It also doesn't allow one to implement waiting a random amount of time between an arbitrary minimum (not 0) and a maximum.

To deal with the last point, may I suggest adding an explicit backoff strategy for random waits? For example, between NONE and LINEAR we could add a new RANDOM strategy. This would be trivial to implement with your new jitter(int, int) method, e.g.:

/** Wait a random amount of time between the retry interval and the maximum retry interval. */
RANDOM {
    /** @param interval The minimum amount of time to wait between each attempt. */
    @Override
    int waitForRetry(int retry, int interval, int maxTime) {
        return Math.min(maxTime, jitter(interval, retryIntervalJitter));
    }
},

One advantage of this is that it is a more straightforward way to satisfy Kirk's use case from #163. Another advantage is that it allows the user to implement random waits with a configurable minimum value. One disadvantage of this is that it worsens the (already bad) overloading of the -retryInterval parameter with yet another semantic. But this is already a problem that I do not think can be solved in this project without breaking backward compatibility, which I am not eager to do. So I suggest that we just update the command-line help for the -retryInterval option to clarify the multitude of meanings -retryInterval can have:

For the NONE (i.e., fixed wait) strategy, it is the fixed amount of time to wait between each attempt, bounded by the value of -maxRetryInterval.
For the (proposed) RANDOM (i.e., random wait) strategy, it is the minimum amount of time to wait between each attempt.
For the LINEAR (i.e., incremental wait) strategy, it is the amount by which the delay is incremented.
For the EXPONENTIAL (i.e., exponential backoff) strategy, it the factor by which delays are multiplied.

Please update the command-line help text for the -retryInterval parameter to include this information, and please also add the same text as a @param interval Javadoc comment above each waitForRetry(int, int, int) override in hudson.plugins.swarm.RetryBackOffStrategy.

The command-line help text for the -retryBackOffStrategy parameter could also use clarification. Here are my suggestions:

For the NONE (i.e., fixed wait) strategy: Wait a fixed amount of time between each attempt as given by the value of -retryInterval, bounded by the value of -maxRetryInterval.
For the (proposed) RANDOM (i.e., random wait) strategy: Wait a random amount of time between the values of -retryInterval and -maxRetryInterval.
For the LINEAR (i.e., incremental wait) strategy: Wait an incremental amount of time after each attempt, starting at the value of -retryInterval and incrementing by the value of -retryInterval for each attempt.
For the EXPONENTIAL (i.e., exponential backoff): Wait for the value of -retryInterval after each attempt, exponentially backing off to the value of -maxRetryInterval and multiplying successive delays by a factor of the value of -retryInterval.

Please update the command-line help text for the -retryBackOffStrategy parameter to include this information, and please also add it as a Javadoc to enumeration constant in hudson.plugins.swarm.RetryBackOffStrategy.

Implementation

The implementation looks good overall. I just left three code review comments. They should be trivial to address. Let me know if anything is unclear or if you need any help addressing this feedback.

Use Cases and Documentation

This PR will open up a lot of new use cases, effectively doubling the complexity of each existing strategy (by adding the ability to set a jitter amount) as well as adding a new random wait strategy (if you take my suggestion above). This is a bewildering array of possibilities, so I think we need to provide users with documentation explaining the common use cases and recommended configuration, with examples. I suggest creating a new file called docs/retry.md and linking to it from README.md. The document should cover:

How to call each strategy. This should be a duplicate of the information already in the command-line help.
The high-level use cases for each strategy in the context of Jenkins and the Swarm plugin, including the case where Jenkins is restarted and Swarm clients have to reconnect. The document should cover what are the recommended options for use with Jenkins to avoid thundering herds.
For each recommended option, examples of reasonable command-line arguments and illustrations of the resulting wait intervals.

This might actually be the hardest part of this PR. I admittedly haven't thought about this much other than in the generic context of fixed waits and jitter helping to avoid thundering herds, but why don't you try writing something up that covers at least your use case and Kirk's use case from #163. Then I will read through it and chime in with any additional thoughts I might have from my use case at Delphix.

Defaults

Whatever we decide is the recommended approach in the documentation, the CLI should default to it. In the current PR, the CLI is defaulting to no fixed wait delays with no jitter, which I doubt is going to end up as the recommended approach in the documentation.

Finally, we need to update README.md with any changes made to the command-line options.

basil · 2020-06-06T19:14:30Z

client/src/main/java/hudson/plugins/swarm/Options.java

+    @Option(name = "-retryIntervalJitter", usage = "Apply a random jitter amount to retryInterval in seconds, "+
+                                                   "still subject to retryBackOffStrategy and maxRetryInterval. "+
+                                                   "Default is 0 seconds")
+    public int retryIntervalJitter = 0;


If this number is set to a negative value, unexpected results could ensue. Can we add a check to hudson.plugins.swarm.Client to fail fast if the user sets it to a negative value?

basil · 2020-06-06T19:26:22Z

client/src/main/java/hudson/plugins/swarm/RetryBackOffStrategy.java

+        return Math.round(min + (rng.nextFloat() * (max - min)));
+    };
+
+    abstract int waitForRetry(int retry, int interval, int maxTime, int retryIntervalJitter);


The existing logic already violated the DRY principle by repeating the Math.min(maxTime, <result>) call in each method, but this PR makes the situation worse by also repeating the jitter(int, int) call in each method. I think the code would be more maintainable if we factored out the common logic into its own method, e.g.:

/** * @param retry One less than the number of attempts that have been performed. * @return The amount of time to wait before retrying, in seconds. */ abstract int waitForRetry(int retry, int interval, int maxTime); /** * @param retry One less than the number of attempts that have been performed. * @param maxTime The maximum amount of time to delay each retry attempt. * @param retryIntervalJitter The jitter to randomly vary retry delays by. For each retry delay, * a random portion of the jitter will be added to the delay. For example, a jitter of 5 * seconds will randomly add between 0 and 5 seconds to each retry delay. * @return The amount of time to wait before retrying, in seconds. */ int waitForRetry(int retry, int interval, int maxTime, int retryIntervalJitter) { int jitterAmount = jitter(0, retryIntervalJitter); int result = waitForRetry(retry, interval, maxTime) + jitterAmount; return Math.min(result, maxTime); }

This way we can rip out all the Math.min and jitter-related calls and arguments from the individual strategy waitForRetry methods.

basil · 2020-06-06T19:32:59Z

client/src/main/java/hudson/plugins/swarm/RetryBackOffStrategy.java

        }
    };

-    abstract int waitForRetry(int retry, int interval, int maxTime);
+    private static int jitter(int min, int max) {
+        SecureRandom rng = new SecureRandom();


I suggest pulling this out into a field so that we don't have to instantiate it every time we compute a retry interval, e.g.:

private static final Random rng = new SecureRandom();

timsutton · 2020-06-17T15:01:13Z

Thanks @basil for all this detailed feedback. Going to digest it and send you back some comments to make sure we're on the right page for how to implement the feature :)

KirkMartinez and others added 6 commits November 6, 2019 11:22

Merge pull request jenkinsci#1 from jenkinsci/master

6f2076f

Update to latest upstream swarm-client

Add jitter option to reconnect logic

c5b2ae1

Fix PREDICTABLE_RANDOM from spotbugs by using SecureRandom

ee52b7d

Fix SecureRandom var

e08d260

- also a readability adjustment

useJitter bool -> retryIntervalJitter int

f3efcfe

- addressing feedback in jenkinsci#163

Merge branch 'master' into timsutton/jitter-rebase

7a204b8

basil closed this Jun 5, 2020

basil reopened this Jun 5, 2020

basil reviewed Jun 6, 2020

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add retryIntervalJitter option (follow-on from #163) #212

Add retryIntervalJitter option (follow-on from #163) #212

timsutton commented May 25, 2020

justinseanmartin commented May 28, 2020

basil commented Jun 4, 2020

timsutton commented Jun 4, 2020

basil commented Jun 5, 2020

basil left a comment

basil Jun 6, 2020

basil Jun 6, 2020

basil Jun 6, 2020

timsutton commented Jun 17, 2020

Add retryIntervalJitter option (follow-on from #163) #212

Are you sure you want to change the base?

Add retryIntervalJitter option (follow-on from #163) #212

Conversation

timsutton commented May 25, 2020

justinseanmartin commented May 28, 2020

basil commented Jun 4, 2020

timsutton commented Jun 4, 2020

basil commented Jun 5, 2020

basil left a comment

Choose a reason for hiding this comment

API

Implementation

Use Cases and Documentation

Defaults

basil Jun 6, 2020

Choose a reason for hiding this comment

basil Jun 6, 2020

Choose a reason for hiding this comment

basil Jun 6, 2020

Choose a reason for hiding this comment

timsutton commented Jun 17, 2020