Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add retryIntervalJitter option (follow-on from #163) #212

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

timsutton
Copy link

👋 This is a follow-up to #163 for the same use case of adding a random jitter interval to the retry wait period.

I tried to follow the suggested approach of taking the value as an integer and simply adding it to the retry wait period. I opted to name the option -retryIntervalJitter, but open to any suggestions for naming or any other improvements here. Thanks!

KirkMartinez and others added 6 commits November 6, 2019 11:22
Update to latest upstream swarm-client
…base

* origin/master: (45 commits)
  Remove usage of deprecated buildPlugin.recommendedConfigurations()
  Bump oshi-core from 5.0.1 to 5.0.2
  Adapt to oshi/oshi#1191
  Bump oshi-core from 4.7.1 to 5.0.1
  Bump oshi-core from 4.6.1 to 4.7.1
  Update badges
  Bump plugin dependencies
  Update parent POMs
  [JENKINS-62033] - Swarm client -disableSslVerification option does not disable SSL hostname verification
  Bump maven-shade-plugin from 3.2.2 to 3.2.3
  Bump oshi-core from 4.5.2 to 4.6.1
  [maven-release-plugin] prepare for next development iteration
  [maven-release-plugin] prepare release swarm-plugin-3.19
  Bump remoting from 4.2 to 4.3
  Bump oshi-core from 4.5.0 to 4.5.2
  Bump oshi-core from 4.4.2 to 4.5.0
  Bump httpclient from 4.5.11 to 4.5.12
  Lost test cases added (jenkinsci#176).
  Fix double-merge
  [JENKINS-44115] Enable Remoting working directories by default in Swarm plugin (jenkinsci#176)
  ...
- also a readability adjustment
@justinseanmartin
Copy link

@basil - Any chance we could get this looked at?

@basil
Copy link
Member

basil commented Jun 4, 2020

Hey @timsutton, thanks for the PR! The implementation looks good, but I need to think a little bit more about the API. I need to make sure we're designing an API that is supportable and extensible in the long run, given all the different types of retry algorithms and strategies. I'll try and take a closer look at this in the next week or two.

@timsutton
Copy link
Author

@basil Thanks! Happy to chat meanwhile about the use case, semantics, etc.

@basil
Copy link
Member

basil commented Jun 5, 2020

Rebuilding to hopefully get a clean test run.

@basil basil closed this Jun 5, 2020
@basil basil reopened this Jun 5, 2020
Copy link
Member

@basil basil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @timsutton, first of all, thank you again for contributing this PR! This looks great overall. The devil is in the details, and I have spent some time preparing feedback below on the API, implementation, use cases, documentation, and defaults. This might seem like a lot of feedback, but once it is addressed I think this plugin's retrying support will be in great shape. Let me know if I can help explain any of this further.

API

I spent some time thinking about the API and comparing it to the APIs available in Tenacity and Failsafe. I came to a few conclusions:

  • The new -retryIntervalJitter option makes sense and matches the jitter API available in Tenacity and Failsafe. Combining this jitter implementation with the existing NONE (i.e., fixed-wait delay) strategy should solve the thundering herd problem! :party:
  • The existing -retryInterval option is confusingly-named. For the NONE (i.e., fixed-wait delay) strategy, it makes sense. But for the LINEAR (i.e., incremental wait) strategy, it is used as the amount by which delays are incremented, and for the EXPONENTIAL (i.e., exponential backoff) strategy it is used as the factor by which delays are multiplied. The command-line help is not clear in this regard.
  • Tenacity and Failsafe both offer options to wait a random amount of time between a minimum and maximum value. This seems like a generally desirable feature to have along with exponential backoff. This PR comes very close to providing that in Swarm, but it falls a little short. With this PR, if one sets the strategy to NONE, -retryInterval to 0, and -retryIntervalJitter to the desired maximum value, one can effectively implement waiting a random amount of time between 0 and the value of -retryIntervalJitter. This would technically satisfy Kirk's use case from Support jitter in client reconnect logic #163, but this API is non-obvious if one isn't knowledgeable about the implementation. It also doesn't allow one to implement waiting a random amount of time between an arbitrary minimum (not 0) and a maximum.

To deal with the last point, may I suggest adding an explicit backoff strategy for random waits? For example, between NONE and LINEAR we could add a new RANDOM strategy. This would be trivial to implement with your new jitter(int, int) method, e.g.:

/** Wait a random amount of time between the retry interval and the maximum retry interval. */
RANDOM {
    /** @param interval The minimum amount of time to wait between each attempt. */
    @Override
    int waitForRetry(int retry, int interval, int maxTime) {
        return Math.min(maxTime, jitter(interval, retryIntervalJitter));
    }
},

One advantage of this is that it is a more straightforward way to satisfy Kirk's use case from #163. Another advantage is that it allows the user to implement random waits with a configurable minimum value. One disadvantage of this is that it worsens the (already bad) overloading of the -retryInterval parameter with yet another semantic. But this is already a problem that I do not think can be solved in this project without breaking backward compatibility, which I am not eager to do. So I suggest that we just update the command-line help for the -retryInterval option to clarify the multitude of meanings -retryInterval can have:

  • For the NONE (i.e., fixed wait) strategy, it is the fixed amount of time to wait between each attempt, bounded by the value of -maxRetryInterval.
  • For the (proposed) RANDOM (i.e., random wait) strategy, it is the minimum amount of time to wait between each attempt.
  • For the LINEAR (i.e., incremental wait) strategy, it is the amount by which the delay is incremented.
  • For the EXPONENTIAL (i.e., exponential backoff) strategy, it the factor by which delays are multiplied.

Please update the command-line help text for the -retryInterval parameter to include this information, and please also add the same text as a @param interval Javadoc comment above each waitForRetry(int, int, int) override in hudson.plugins.swarm.RetryBackOffStrategy.

The command-line help text for the -retryBackOffStrategy parameter could also use clarification. Here are my suggestions:

  • For the NONE (i.e., fixed wait) strategy: Wait a fixed amount of time between each attempt as given by the value of -retryInterval, bounded by the value of -maxRetryInterval.
  • For the (proposed) RANDOM (i.e., random wait) strategy: Wait a random amount of time between the values of -retryInterval and -maxRetryInterval.
  • For the LINEAR (i.e., incremental wait) strategy: Wait an incremental amount of time after each attempt, starting at the value of -retryInterval and incrementing by the value of -retryInterval for each attempt.
  • For the EXPONENTIAL (i.e., exponential backoff): Wait for the value of -retryInterval after each attempt, exponentially backing off to the value of -maxRetryInterval and multiplying successive delays by a factor of the value of -retryInterval.

Please update the command-line help text for the -retryBackOffStrategy parameter to include this information, and please also add it as a Javadoc to enumeration constant in hudson.plugins.swarm.RetryBackOffStrategy.

Implementation

The implementation looks good overall. I just left three code review comments. They should be trivial to address. Let me know if anything is unclear or if you need any help addressing this feedback.

Use Cases and Documentation

This PR will open up a lot of new use cases, effectively doubling the complexity of each existing strategy (by adding the ability to set a jitter amount) as well as adding a new random wait strategy (if you take my suggestion above). This is a bewildering array of possibilities, so I think we need to provide users with documentation explaining the common use cases and recommended configuration, with examples. I suggest creating a new file called docs/retry.md and linking to it from README.md. The document should cover:

  • How to call each strategy. This should be a duplicate of the information already in the command-line help.
  • The high-level use cases for each strategy in the context of Jenkins and the Swarm plugin, including the case where Jenkins is restarted and Swarm clients have to reconnect. The document should cover what are the recommended options for use with Jenkins to avoid thundering herds.
  • For each recommended option, examples of reasonable command-line arguments and illustrations of the resulting wait intervals.

This might actually be the hardest part of this PR. I admittedly haven't thought about this much other than in the generic context of fixed waits and jitter helping to avoid thundering herds, but why don't you try writing something up that covers at least your use case and Kirk's use case from #163. Then I will read through it and chime in with any additional thoughts I might have from my use case at Delphix.

Defaults

Whatever we decide is the recommended approach in the documentation, the CLI should default to it. In the current PR, the CLI is defaulting to no fixed wait delays with no jitter, which I doubt is going to end up as the recommended approach in the documentation.

Finally, we need to update README.md with any changes made to the command-line options.

Comment on lines +63 to +66
@Option(name = "-retryIntervalJitter", usage = "Apply a random jitter amount to retryInterval in seconds, "+
"still subject to retryBackOffStrategy and maxRetryInterval. "+
"Default is 0 seconds")
public int retryIntervalJitter = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this number is set to a negative value, unexpected results could ensue. Can we add a check to hudson.plugins.swarm.Client to fail fast if the user sets it to a negative value?

return Math.round(min + (rng.nextFloat() * (max - min)));
};

abstract int waitForRetry(int retry, int interval, int maxTime, int retryIntervalJitter);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The existing logic already violated the DRY principle by repeating the Math.min(maxTime, <result>) call in each method, but this PR makes the situation worse by also repeating the jitter(int, int) call in each method. I think the code would be more maintainable if we factored out the common logic into its own method, e.g.:

/**
 * @param retry One less than the number of attempts that have been performed.
 * @return The amount of time to wait before retrying, in seconds.
 */
abstract int waitForRetry(int retry, int interval, int maxTime);

/**
 * @param retry One less than the number of attempts that have been performed.
 * @param maxTime The maximum amount of time to delay each retry attempt.
 * @param retryIntervalJitter The jitter to randomly vary retry delays by. For each retry delay,
 *     a random portion of the jitter will be added to the delay. For example, a jitter of 5
 *     seconds will randomly add between 0 and 5 seconds to each retry delay.
 * @return The amount of time to wait before retrying, in seconds.
 */
int waitForRetry(int retry, int interval, int maxTime, int retryIntervalJitter) {
    int jitterAmount = jitter(0, retryIntervalJitter);
    int result = waitForRetry(retry, interval, maxTime) + jitterAmount;
    return Math.min(result, maxTime);
}

This way we can rip out all the Math.min and jitter-related calls and arguments from the individual strategy waitForRetry methods.

}
};

abstract int waitForRetry(int retry, int interval, int maxTime);
private static int jitter(int min, int max) {
SecureRandom rng = new SecureRandom();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest pulling this out into a field so that we don't have to instantiate it every time we compute a retry interval, e.g.:

private static final Random rng = new SecureRandom();

@timsutton
Copy link
Author

Thanks @basil for all this detailed feedback. Going to digest it and send you back some comments to make sure we're on the right page for how to implement the feature :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants