Error rate circuit breaker #264
base: main
Conversation
Force-pushed from f17bd19 to bad8a75.
I really like the concept of this PR at a high level. It would be amazing to have a more intuitive way to configure the circuit breaker.
However, I don't think this change actually makes it more intuitive (see comments below). I'm open to alternative approaches or suggestions. Please let me know what you think.
def maybe_with_half_open_resource_timeout(resource, &block)
  result =
    if half_open? && @half_open_resource_timeout && resource.respond_to?(:with_resource_timeout)
nit: use two spaces for indentation on this block (rather than 4)
end
def error_threshold_reached?
  return false if @window.empty? or @window.length < @request_volume_threshold
request_volume_threshold is an interesting parameter because its desired value varies depending on the volume of requests. If the volume of requests happening in window_size ever drops below request_volume_threshold, the circuit will never open.
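To make that concrete, a small illustration with hypothetical numbers (not taken from the PR):

request_volume_threshold = 20
window = Array.new(5, :error)                 # only 5 requests in the window, all failures
window.length < request_volume_threshold      # => true, so error_threshold_reached? returns false
# The circuit therefore stays closed even at a 100% error rate until volume recovers.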
That's intentional; I modeled it after this property in Hystrix: https://github.com/Netflix/Hystrix/wiki/Configuration#circuitBreaker.requestVolumeThreshold
Interesting. I can see how this is useful now. If you configure it to be above time_window / (timeout * request_volume_threshold), then it acts as a damper for the problem I was describing in #264 (comment).
I was talking about this with a colleague, and we both agreed it would be better to explicitly specify which circuit breaker implementation you want instead of determining it based on the options passed in. I plan on making that change. I'm currently working on other things at the moment, but I should have time to cycle back to this in the next week.
Thank you for the suggestions. I just pushed an implementation that calculates the time spent in error vs. success. Please let me know your thoughts.
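Roughly, the idea behind that calculation is the following sketch (for illustration only; the method name, window entry shape, and fields are assumptions, not the actual diff):

# Sketch: error rate computed as time spent in error vs. time spent in success
# over the entries currently in the sliding window.
def error_percentage
  error_time = 0.0
  success_time = 0.0
  @window.each do |duration_ms, success|   # assumed entry shape: [duration_ms, success_flag]
    if success
      success_time += duration_ms
    else
      error_time += duration_ms
    end
  end
  total = error_time + success_time
  return 0.0 if total.zero?
  error_time / total
end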
Really nice work. I like the direction this is heading. Just a few minor comments.
Looks good to me. Thank you for adding this feature! Feel free to keep me posted on how it works out for you in a live production environment. 👍
Hi @RyanAD, any progress on this one?
We've been successfully using this within Instacart for a while now. Let me confirm we don't have any changes that should be incorporated here. I'll follow up by the end of next week.
I pushed two commits for code cleanup and a small thread safety fix in TimeSlidingWindow. This code has been running successfully within Instacart for several months. What are the next steps for merging?
@RyanAD next steps:
@miry is this still valid? What's missing, besides the conflict, to push this out?
end
def disabled?
  ENV['SEMIAN_CIRCUIT_BREAKER_DISABLED'] || ENV['SEMIAN_DISABLED']
This needs rethinking given changes in the master branch. The idea there is to not use ENV inside business logic, only during the configuration phase.
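Something along these lines, perhaps (a sketch of the suggested direction; the disabled: option and the simplified class are assumptions, not existing Semian API):

# Sketch: read ENV once during configuration and pass the result in,
# so the breaker itself never touches ENV. Illustrative only.
disabled_from_env = ENV.key?('SEMIAN_CIRCUIT_BREAKER_DISABLED') || ENV.key?('SEMIAN_DISABLED')

class ErrorRateCircuitBreaker   # simplified stand-in for the PR's class
  def initialize(disabled: false)
    @disabled = disabled
  end

  def disabled?
    @disabled
  end
end

breaker = ErrorRateCircuitBreaker.new(disabled: disabled_from_env)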
success_threshold: 2,
half_open_resource_timeout: nil,
time_source: -> {Time.now.to_f * 1000})
Timecop.travel(-1.1) do
Timecop does not work well with monotonic clocks, so I replaced it with a custom-made helper to travel in time.
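Given the time_source: option shown in the snippet above, such a helper presumably works by shifting an injected clock rather than the process clock. A sketch of that approach (the helper name and API are assumptions):

# Sketch: a fake time source that the breaker reads instead of the real clock,
# so tests can travel in time without Timecop. Illustrative only.
class FakeTimeSource
  def initialize
    @offset_ms = 0.0
  end

  def now_ms
    Time.now.to_f * 1000 + @offset_ms
  end

  # Shift the clock by `seconds` (negative values travel backwards) for the
  # duration of the block, then restore it.
  def travel(seconds)
    @offset_ms += seconds * 1000
    yield
  ensure
    @offset_ms -= seconds * 1000
  end
end

# Usage in a test (breaker construction elided):
#   clock = FakeTimeSource.new
#   breaker would be built with time_source: -> { clock.now_ms }
#   clock.travel(-1.1) { ...assertions about the window... }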
This PR adds a circuit breaker that opens when the error percentage crosses some threshold (e.g., 25% of calls failed in the last 10 seconds), also discussed here: #245. It is primarily trying to solve the problem mentioned in this comment: #245 (comment). We have unpredictable spikes in traffic, which make it difficult to tune an absolute number of errors in a time window.
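For concreteness, the kind of configuration being aimed at might look roughly like this (option names are illustrative assumptions, not the final API):

# Illustrative only: hypothetical option names for an error-rate breaker.
options = {
  error_percent_threshold: 0.25,   # open when >= 25% of recent calls failed
  time_window: 10,                 # measured over the last 10 seconds
  request_volume_threshold: 5,     # but only once at least 5 calls were seen
  success_threshold: 2,            # successes needed in half-open to close again
  error_timeout: 5,                # seconds to stay open before going half-open
}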
There are some methods that technically violate the DRY principle; in particular, the acquire and maybe_with_half_open_resource_timeout methods in lib/semian/error_rate_circuit_breaker.rb and lib/semian/circuit_breaker.rb are the same. I'm open to suggestions on whether this should be fixed and the best way to do so. One possible direction is sketched below.

This has not been battle-tested in a high-volume production environment yet, and I'm still working on documentation. I'm also working on using Timecop in the new unit tests.
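A sketch of that possible direction (module name and method body are illustrative, based on the duplicated method shown in the diff above):

# Sketch: a shared mixin that both CircuitBreaker and ErrorRateCircuitBreaker
# could include to avoid duplicating these methods. Illustrative only.
module CircuitBreakerSharedBehavior
  def maybe_with_half_open_resource_timeout(resource, &block)
    if half_open? && @half_open_resource_timeout && resource.respond_to?(:with_resource_timeout)
      resource.with_resource_timeout(@half_open_resource_timeout, &block)
    else
      block.call
    end
  end
end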