-
Notifications
You must be signed in to change notification settings - Fork 98
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add retryIntervalJitter option (follow-on from #163) #212
base: master
Are you sure you want to change the base?
Conversation
Update to latest upstream swarm-client
…base * origin/master: (45 commits) Remove usage of deprecated buildPlugin.recommendedConfigurations() Bump oshi-core from 5.0.1 to 5.0.2 Adapt to oshi/oshi#1191 Bump oshi-core from 4.7.1 to 5.0.1 Bump oshi-core from 4.6.1 to 4.7.1 Update badges Bump plugin dependencies Update parent POMs [JENKINS-62033] - Swarm client -disableSslVerification option does not disable SSL hostname verification Bump maven-shade-plugin from 3.2.2 to 3.2.3 Bump oshi-core from 4.5.2 to 4.6.1 [maven-release-plugin] prepare for next development iteration [maven-release-plugin] prepare release swarm-plugin-3.19 Bump remoting from 4.2 to 4.3 Bump oshi-core from 4.5.0 to 4.5.2 Bump oshi-core from 4.4.2 to 4.5.0 Bump httpclient from 4.5.11 to 4.5.12 Lost test cases added (jenkinsci#176). Fix double-merge [JENKINS-44115] Enable Remoting working directories by default in Swarm plugin (jenkinsci#176) ...
- also a readability adjustment
- addressing feedback in jenkinsci#163
@basil - Any chance we could get this looked at? |
Hey @timsutton, thanks for the PR! The implementation looks good, but I need to think a little bit more about the API. I need to make sure we're designing an API that is supportable and extensible in the long run, given all the different types of retry algorithms and strategies. I'll try and take a closer look at this in the next week or two. |
@basil Thanks! Happy to chat meanwhile about the use case, semantics, etc. |
Rebuilding to hopefully get a clean test run. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @timsutton, first of all, thank you again for contributing this PR! This looks great overall. The devil is in the details, and I have spent some time preparing feedback below on the API, implementation, use cases, documentation, and defaults. This might seem like a lot of feedback, but once it is addressed I think this plugin's retrying support will be in great shape. Let me know if I can help explain any of this further.
API
I spent some time thinking about the API and comparing it to the APIs available in Tenacity and Failsafe. I came to a few conclusions:
- The new
-retryIntervalJitter
option makes sense and matches the jitter API available in Tenacity and Failsafe. Combining this jitter implementation with the existingNONE
(i.e., fixed-wait delay) strategy should solve the thundering herd problem! :party: - The existing
-retryInterval
option is confusingly-named. For theNONE
(i.e., fixed-wait delay) strategy, it makes sense. But for theLINEAR
(i.e., incremental wait) strategy, it is used as the amount by which delays are incremented, and for theEXPONENTIAL
(i.e., exponential backoff) strategy it is used as the factor by which delays are multiplied. The command-line help is not clear in this regard. - Tenacity and Failsafe both offer options to wait a random amount of time between a minimum and maximum value. This seems like a generally desirable feature to have along with exponential backoff. This PR comes very close to providing that in Swarm, but it falls a little short. With this PR, if one sets the strategy to
NONE
,-retryInterval
to 0, and-retryIntervalJitter
to the desired maximum value, one can effectively implement waiting a random amount of time between 0 and the value of-retryIntervalJitter
. This would technically satisfy Kirk's use case from Support jitter in client reconnect logic #163, but this API is non-obvious if one isn't knowledgeable about the implementation. It also doesn't allow one to implement waiting a random amount of time between an arbitrary minimum (not 0) and a maximum.
To deal with the last point, may I suggest adding an explicit backoff strategy for random waits? For example, between NONE
and LINEAR
we could add a new RANDOM
strategy. This would be trivial to implement with your new jitter(int, int)
method, e.g.:
/** Wait a random amount of time between the retry interval and the maximum retry interval. */
RANDOM {
/** @param interval The minimum amount of time to wait between each attempt. */
@Override
int waitForRetry(int retry, int interval, int maxTime) {
return Math.min(maxTime, jitter(interval, retryIntervalJitter));
}
},
One advantage of this is that it is a more straightforward way to satisfy Kirk's use case from #163. Another advantage is that it allows the user to implement random waits with a configurable minimum value. One disadvantage of this is that it worsens the (already bad) overloading of the -retryInterval
parameter with yet another semantic. But this is already a problem that I do not think can be solved in this project without breaking backward compatibility, which I am not eager to do. So I suggest that we just update the command-line help for the -retryInterval
option to clarify the multitude of meanings -retryInterval
can have:
- For the
NONE
(i.e., fixed wait) strategy, it is the fixed amount of time to wait between each attempt, bounded by the value of-maxRetryInterval
. - For the (proposed)
RANDOM
(i.e., random wait) strategy, it is the minimum amount of time to wait between each attempt. - For the
LINEAR
(i.e., incremental wait) strategy, it is the amount by which the delay is incremented. - For the
EXPONENTIAL
(i.e., exponential backoff) strategy, it the factor by which delays are multiplied.
Please update the command-line help text for the -retryInterval
parameter to include this information, and please also add the same text as a @param interval
Javadoc comment above each waitForRetry(int, int, int)
override in hudson.plugins.swarm.RetryBackOffStrategy
.
The command-line help text for the -retryBackOffStrategy
parameter could also use clarification. Here are my suggestions:
- For the
NONE
(i.e., fixed wait) strategy: Wait a fixed amount of time between each attempt as given by the value of-retryInterval
, bounded by the value of-maxRetryInterval
. - For the (proposed)
RANDOM
(i.e., random wait) strategy: Wait a random amount of time between the values of-retryInterval
and-maxRetryInterval
. - For the
LINEAR
(i.e., incremental wait) strategy: Wait an incremental amount of time after each attempt, starting at the value of-retryInterval
and incrementing by the value of-retryInterval
for each attempt. - For the
EXPONENTIAL
(i.e., exponential backoff): Wait for the value of-retryInterval
after each attempt, exponentially backing off to the value of-maxRetryInterval
and multiplying successive delays by a factor of the value of-retryInterval
.
Please update the command-line help text for the -retryBackOffStrategy
parameter to include this information, and please also add it as a Javadoc to enumeration constant in hudson.plugins.swarm.RetryBackOffStrategy
.
Implementation
The implementation looks good overall. I just left three code review comments. They should be trivial to address. Let me know if anything is unclear or if you need any help addressing this feedback.
Use Cases and Documentation
This PR will open up a lot of new use cases, effectively doubling the complexity of each existing strategy (by adding the ability to set a jitter amount) as well as adding a new random wait strategy (if you take my suggestion above). This is a bewildering array of possibilities, so I think we need to provide users with documentation explaining the common use cases and recommended configuration, with examples. I suggest creating a new file called docs/retry.md
and linking to it from README.md
. The document should cover:
- How to call each strategy. This should be a duplicate of the information already in the command-line help.
- The high-level use cases for each strategy in the context of Jenkins and the Swarm plugin, including the case where Jenkins is restarted and Swarm clients have to reconnect. The document should cover what are the recommended options for use with Jenkins to avoid thundering herds.
- For each recommended option, examples of reasonable command-line arguments and illustrations of the resulting wait intervals.
This might actually be the hardest part of this PR. I admittedly haven't thought about this much other than in the generic context of fixed waits and jitter helping to avoid thundering herds, but why don't you try writing something up that covers at least your use case and Kirk's use case from #163. Then I will read through it and chime in with any additional thoughts I might have from my use case at Delphix.
Defaults
Whatever we decide is the recommended approach in the documentation, the CLI should default to it. In the current PR, the CLI is defaulting to no fixed wait delays with no jitter, which I doubt is going to end up as the recommended approach in the documentation.
Finally, we need to update README.md
with any changes made to the command-line options.
@Option(name = "-retryIntervalJitter", usage = "Apply a random jitter amount to retryInterval in seconds, "+ | ||
"still subject to retryBackOffStrategy and maxRetryInterval. "+ | ||
"Default is 0 seconds") | ||
public int retryIntervalJitter = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this number is set to a negative value, unexpected results could ensue. Can we add a check to hudson.plugins.swarm.Client
to fail fast if the user sets it to a negative value?
return Math.round(min + (rng.nextFloat() * (max - min))); | ||
}; | ||
|
||
abstract int waitForRetry(int retry, int interval, int maxTime, int retryIntervalJitter); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The existing logic already violated the DRY principle by repeating the Math.min(maxTime, <result>)
call in each method, but this PR makes the situation worse by also repeating the jitter(int, int)
call in each method. I think the code would be more maintainable if we factored out the common logic into its own method, e.g.:
/**
* @param retry One less than the number of attempts that have been performed.
* @return The amount of time to wait before retrying, in seconds.
*/
abstract int waitForRetry(int retry, int interval, int maxTime);
/**
* @param retry One less than the number of attempts that have been performed.
* @param maxTime The maximum amount of time to delay each retry attempt.
* @param retryIntervalJitter The jitter to randomly vary retry delays by. For each retry delay,
* a random portion of the jitter will be added to the delay. For example, a jitter of 5
* seconds will randomly add between 0 and 5 seconds to each retry delay.
* @return The amount of time to wait before retrying, in seconds.
*/
int waitForRetry(int retry, int interval, int maxTime, int retryIntervalJitter) {
int jitterAmount = jitter(0, retryIntervalJitter);
int result = waitForRetry(retry, interval, maxTime) + jitterAmount;
return Math.min(result, maxTime);
}
This way we can rip out all the Math.min
and jitter-related calls and arguments from the individual strategy waitForRetry
methods.
} | ||
}; | ||
|
||
abstract int waitForRetry(int retry, int interval, int maxTime); | ||
private static int jitter(int min, int max) { | ||
SecureRandom rng = new SecureRandom(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest pulling this out into a field so that we don't have to instantiate it every time we compute a retry interval, e.g.:
private static final Random rng = new SecureRandom();
Thanks @basil for all this detailed feedback. Going to digest it and send you back some comments to make sure we're on the right page for how to implement the feature :) |
👋 This is a follow-up to #163 for the same use case of adding a random jitter interval to the retry wait period.
I tried to follow the suggested approach of taking the value as an integer and simply adding it to the retry wait period. I opted to name the option
-retryIntervalJitter
, but open to any suggestions for naming or any other improvements here. Thanks!