[DRAFT] Share circuit breakers between workers #227
base: implement_lru_cache
Conversation
Some high-level first thoughts.
  }

  return (int*)val;
}
@name doesn't look to be changing after initialization. Could we cache the shmid in `semian_simple_integer_t`? How about the `void *val`? Similarly for `circuit_breaker.c` and `sliding_window.c`.
Yes, I actually addressed this in a later commit. Basically, we want to store the `key` generated from the name on the object itself. I'll double check I've done this everywhere.
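For illustration, a minimal sketch of that caching. The struct fields and helper functions mirror names that appear in the PR's own diffs; the setup function name is made up here:

```c
#include <stdint.h>

/* Sketch only: compute the SysV IPC key once at initialization and
 * cache it (plus the ids derived from it) on the object, instead of
 * re-deriving them from the name on every operation. */
typedef struct {
  uint64_t key;   /* cached key generated from the resource name */
  int sem_id;     /* cached semaphore id */
  void *shmem;    /* cached pointer to the attached shared memory */
} semian_simple_integer_t;

static void
semian_simple_integer_setup(semian_simple_integer_t *res, const char *name)
{
  res->key = generate_key(name);  /* derived once; the name never changes */
  res->sem_id = initialize_single_semaphore(res->key, SEM_DEFAULT_PERMISSIONS);
  res->shmem = get_or_create_shared_memory(res->key, &init_fn);
}
```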
If we have this, I'm fairly sure we should also decouple bulkheads from circuits. At the moment (or at least as of the end of 2018), a bulkhead acquisition failure marks an error in the same resource's circuit. There's a lengthy discussion in https://github.com/Shopify/shopify/pull/178393. I'll reproduce the big points here for the sake of the open source discussion. At the end, my conclusion was that sharing circuits between processes and decoupling bulkheads would be an attractive option, which is possible after this PR!
My big concern with this implementation is the number of shared memory segments that will be created. It looks like we're using distinct shared memory segments for each instance of …
@csfrancis: The data I have for core suggests that it's hundreds, not thousands: Our …
Did a first high-level pass. :)
You describe the problem in the opening comment, but not why you think or know this approach will work. Why do we think this will solve it (did simulations prove it)? What alternatives were considered? How will we prove whether it works?
README.md (Outdated)

sensitive.

To disable host-based circuits, set the environment variable
`SEMIAN_CIRCUIT_BREAKER_IMPL` to `ruby`.
I think these are disabled by default for backwards compatibility; that might be worth noting.
Done.
test/simple_integer_test.rb (Outdated)

@@ -41,6 +42,24 @@ def test_reset
    @integer.reset
    assert_equal(0, @integer.value)
  end

  if ENV['SEMIAN_CIRCUIT_BREAKER_IMPL'] != 'ruby'
An example of where you missed the `ruby` / `worker` duality.
Done. This test can run for both implementations.
this configuration, we can reduce the time-to-open for a circuit from _E * T_
to simply _T_ (provided that _N_ is greater than _E_).

You should run a simulation with your workloads to determine an efficient …
This is really tough advice to give someone. That can be days to weeks of work. Can we, with our simulation tooling, provide some sample values that we can include here? I imagine we can generate a somewhat magic constant based on simulations on (1) `ss_error_rate`, the steady-state error rate; (2) `n`, the number of Semian consumers (threads * workers); (3) `error_threshold`; (4) `resource_timeout`; (5) whatever other inputs you all use?
I'd like the advice to then be something along the lines of...
While your own simulation will likely find you a superior value for the scale factor, we recognize how much work this is. We've done simulations internally and have found that a scale factor that is `<magic constant> * <threads>` works fairly well and use that for almost all services.
@@ -64,4 +66,30 @@ void Init_semian()

  /* Maximum number of tickets available on this system. */
  rb_define_const(cSemian, "MAX_TICKETS", INT2FIX(system_max_semaphore_count));

  if (use_c_circuits()) {
    Init_SimpleInteger();
Why did you choose to override them rather than define them and switch at the Ruby layer instead? The latter seems a bit cleaner to me.
Mostly because that's roughly the way the bulkhead was implemented, and I wanted to keep the codebase familiar. One could argue that the Ruby implementation of the bulkhead is actually just a stub, but then those functions should have `NotImplemented` assertions in them.
Fine with me. 👍
ext/semian/simple_integer.c (Outdated)

  dprintf("Initializing simple integer '%s' (key: %lu)", to_s(name), res->key);
  res->sem_id = initialize_single_semaphore(res->key, SEM_DEFAULT_PERMISSIONS);
  res->shmem = get_or_create_shared_memory(res->key, &init_fn);
Why did you decide on shared memory rather than the existing semaphore wrappers for storing this number? (No strong opinion, but curious).
There was a brief period where there was something else in the shared memory (I can't really remember now). There's no real issue with using a semaphore, and it would get rid of the need for locks, assuming we could guarantee that the operations would not block.
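For what it's worth, a sketch of what a semaphore-backed counter could look like (illustrative names; this assumes a single-semaphore set, and avoids blocking by only ever incrementing, reading, and resetting):

```c
#include <sys/ipc.h>
#include <sys/sem.h>

/* Caller-defined union required by semctl(2) on Linux. */
union semun { int val; struct semid_ds *buf; unsigned short *array; };

/* Increment is a semop of +1, which never blocks. */
static int
counter_increment(int sem_id)
{
  struct sembuf op = { .sem_num = 0, .sem_op = 1, .sem_flg = 0 };
  return semop(sem_id, &op, 1);
}

/* Reads don't block either. */
static int
counter_value(int sem_id)
{
  return semctl(sem_id, 0, GETVAL);
}

static int
counter_reset(int sem_id)
{
  union semun arg = { .val = 0 };
  return semctl(sem_id, 0, SETVAL, arg);
}
```

Only a decrement below zero can block with SysV semaphores, so a counter that only increments, reads, and resets would indeed sidestep the need for separate locks.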
I'll think more about this when I review the actual code and not just the approach.
}

VALUE
semian_simple_sliding_window_push(VALUE self, VALUE value)
Many other circuits just use a number and not a sliding window. What are the advantages of either approach? One huge advantage would be that you could just use the Integer class instead of a 450 line sliding window implementation.
To do circuit breaking without a sliding window, you'd have to use fixed time intervals for error counts, and reset the counter at the end of each fixed-time interval. There are several issues with this, off the top of my head:
- That's a breaking change, compared to the current implementation.
- You don't open circuits as fast if a series of failures straddles a reset boundary. There can be cases where the circuit never opens at all (which you could argue is a benefit).
- A master/slave election might be necessary to determine which worker resets the counter, which could be more complex than the sliding window implementation.
This approach is admittedly more complex than a single integer counter but I'm not sure that it's worth sacrificing the benefits of a sliding window.
It's hard to reason about the effect of this without millisecond-accurate production data.
Just to ensure we're on the same page, the alternative approach I'm considering is that you have two integers: (1) Number of failures, (2) Timestamp of when to reset the number of failures. This makes the code significantly simpler (in Ruby, it's a few lines, but it's quite a bit more here).
I changed this a few years ago in #24. It's actually addressing a separate point, which was false positives. I think the reason why you didn't arrive at that conclusion was that the previous solution would refresh the TTL every time it was modified.
Overall, likely the way we should look at circuits (I believe @hkdsun suggested this? Or maybe @csfrancis? Not my idea) is to talk about an "error percentage", i.e. if 10% of requests were errors in some window of time, open the circuits. That model is compatible with both, I think.
Again, I have not reviewed the code in detail here (just the approach), but it's a lot of code to maintain for what might appear to be a somewhat marginal advantage. I think my concern from 2015 was more a matter of how easy it was in that approach than real value.
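For concreteness, a rough sketch of that two-integer alternative (illustrative names; both fields would live in shared memory, and cross-worker races on the reset are ignored here):

```c
#include <time.h>

typedef struct {
  int failures;     /* error count in the current fixed window */
  time_t reset_at;  /* when the current count expires */
} fixed_window_t;

/* Record a failure, lazily resetting the count when the window has
 * expired. One way to avoid the reset election mentioned above is
 * this lazy reset: whichever worker first observes an expired window
 * starts the next one (racy without locking). */
static int
mark_failure(fixed_window_t *w, time_t window_length)
{
  time_t now = time(NULL);
  if (now >= w->reset_at) {
    w->failures = 0;
    w->reset_at = now + window_length;
  }
  return ++w->failures;
}
```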
@pushrax suggested this in our shared doc as well.
My primary motivation for host-based circuits is to address the variable latency problem, where a single worker doesn't see error rates above a threshold but the superposition of all errors on a host shows an anomaly.
I'm not sure it works without some sort of window, because in the implementation before #24, previous errors were only purged once a period of time had passed since the last observed error with no new errors occurring. If new errors arrive more often than once per `error_timeout`, the pseudo-window grows without bound and the circuit eventually opens. So if I'm reading that code correctly, the circuit opens quickly when there are a lot of errors, slowly-but-surely when there are some errors, and only remains closed when there is less than one error per `error_timeout`.

Using that sort of approach in a host-based circuit implementation likely regresses to the slowly-but-surely case, where the circuit constantly opens at a low frequency. Combined with the capacity loss during the `half_open_resource_timeout` period, this would be an overall drop in available capacity.
@csfrancis has some legitimate concerns about the number of shared memory segments this PR would create. Scott suggested storing all the sliding windows in a single shared memory segment, but I have concerns with this approach: specifically, the complexity that would be required to manage the indexing into that data structure. Right now, we generate a key and the IPC subsystem handles the lookup. Another difficulty is cleanup. Given that we set …
@csfrancis was concerned last week that the host-based circuits PR would require too many shared memory segments. The default on Linux is 4,096 and we were going to be using somewhere in the neighborhood of 1,500 with a …

I took a long stab at it, but ultimately concluded that it was too hard to do garbage collection. With semaphores, it's easy to do garbage collection with …

To alleviate concerns about the number of shared memory segments, I converted the …

As far as shipping this is concerned, I have https://github.com/Shopify/shopify/pull/206065 to bump core to use the new branch, with the previous behaviour. Then #243 adds support for an environment variable …
…etter-reject Don't force sliding window entries to be monotonic
I'm not finished reviewing this, but I'm leaving the comments that I've left so far.
ext/semian/sysv_shared_memory.c (Outdated)

wait_for_shared_memory(uint64_t key)
{
  for (int i = 0; i < RETRIES; ++i) {
    int shmid = shmget(key, SHM_DEFAULT_SIZE, SHM_DEFAULT_PERMISSIONS);
This retry logic is interesting. From looking at the docs, is there a case where `shmget` will fail and you expect that the retry will succeed?
The first try uses `IPC_CREAT | IPC_EXCL`. If that fails, then we assume it's because another process has created the segment. I think my intention was to check if that other process had run the `shared_memory_init_fn`, but I've tried to optimize that out by building data structures that reset to zeroed-out memory.

I removed the wait loop; we can add that feature if we need it.
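A condensed sketch of that create-or-attach flow (function name and permissions are illustrative, not the PR's final code):

```c
#include <errno.h>
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* First try to create the segment exclusively; EEXIST means another
 * process won the race, so just attach to its segment. Freshly
 * created SysV segments are zero-filled by the kernel, which is why
 * structures that are valid as all-zeroes avoid having to wait for
 * the creator's init function to run. */
static void *
get_or_create_segment(key_t key, size_t size, void (*init_fn)(void *))
{
  int created = 1;
  int shmid = shmget(key, size, IPC_CREAT | IPC_EXCL | 0660);
  if (shmid == -1 && errno == EEXIST) {
    created = 0;
    shmid = shmget(key, size, 0660);
  }
  if (shmid == -1) return NULL;

  void *mem = shmat(shmid, NULL, 0);
  if (mem == (void *)-1) return NULL;

  if (created && init_fn) init_fn(mem);  /* only the creator initializes */
  return mem;
}
```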
// or else sem_get will complain that we have requested an incorrect number of sems
// for the desired key, and have changed the number of semaphores for a given key
const int NUM_SEMAPHORES = 4;
sprintf(semset_size_key, "_NUM_SEMS_%d", NUM_SEMAPHORES);
I know you didn't add this, but this logic is just weird to me. If `NUM_SEMAPHORES` is a constant, why are we using `sprintf` here? Couldn't this be simplified to:
char semset_size_key[] = "_NUM_SEMS_4";
...
uniq_id_str = malloc(strlen(name) + strlen(semset_size_key) + 1);
sprintf(uniq_id_str, "%s%s", name, semset_size_key);
Should also probably be checking for `malloc` failures and raising.
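Something along these lines, perhaps (a sketch wrapping the snippet above in a helper; the function name is made up):

```c
#include <ruby.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Sketch: build the unique id string, raising instead of writing
 * through a NULL pointer when allocation fails. */
static char *
build_uniq_id_str(const char *name)
{
  static const char semset_size_key[] = "_NUM_SEMS_4";
  char *uniq_id_str = malloc(strlen(name) + sizeof(semset_size_key));
  if (uniq_id_str == NULL) {
    rb_raise(rb_eNoMemError, "could not allocate semaphore key for '%s'", name);
  }
  sprintf(uniq_id_str, "%s%s", name, semset_size_key);
  return uniq_id_str;
}
```

(Allocating with `ruby_xmalloc` would also work, since it raises `NoMemoryError` itself on failure.)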
static VALUE
resize_window(int sem_id, semian_simple_sliding_window_shared_t* window, int new_max_size)
{
  if (new_max_size > SLIDING_WINDOW_MAX_SIZE) return Qnil;
Should this be raising?
Done.
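For reference, raising there could look something like this (a sketch against the names in the diff above, not the merged code):

```c
#include <ruby.h>

static VALUE
resize_window(int sem_id, semian_simple_sliding_window_shared_t *window, int new_max_size)
{
  /* Raise instead of silently returning nil on an invalid size. */
  if (new_max_size > SLIDING_WINDOW_MAX_SIZE) {
    rb_raise(rb_eArgError, "sliding window size %d exceeds the maximum of %d",
             new_max_size, SLIDING_WINDOW_MAX_SIZE);
  }
  /* ... perform the resize as before ... */
  return Qnil;
}
```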
ext/semian/sysv_semaphores.c (Outdated)

  int sem_id = semget(key, 1, permissions);
Does this produce a warning? I thought that `key_t` was a 32-bit value and you've changed the type to 64 bits.
I didn't see one, and I compile with `-Wall`. All the values are generated with:

key_t generate_key(const char *name);

I was following what was done in `semian_resource_t`.
…imple-integer Force-enable host-based circuits
This PR fixes #27.
What
Moves the implementation of circuit breakers to shared memory so circuit errors can be shared between workers on the same host.
Why
Currently, for single-threaded workers, we need to see `error_threshold` consecutive failures in order to open a circuit. For applications where timeouts are necessarily high (e.g. 25 seconds for MySQL), that translates into up to 75 seconds of blocking behaviour when a resource is unhealthy.

If all workers on a host experience the outage simultaneously, that information can be shared and the circuit can be tripped in a single timeout iteration, increasing the speed of detecting the unhealthy resource and opening the circuit.
In addition, in the case of a variable latency outage of an upstream resource, the collective hivemind can share information about the increased latency and detect trends earlier and more reliably. This feature is not implemented yet, but becomes possible after this change.
How
Note: This is very much a draft, and needs a lot of refactoring, but it's a start.
The main idea is to move the data for `Simple::Integer` to a semaphore and `Simple::SlidingWindow` to shared memory. `Simple::State` simply uses the shared `Simple::Integer` as its backing store to easily share state between workers on the same host.

Feel free to take a look around, but I've still got some refactoring to do.
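To make the shared layout concrete, here is a rough sketch of the kind of fixed-capacity structure that can live in a shared segment (field names and the capacity constant are assumptions, not the final layout):

```c
#define SLIDING_WINDOW_CAPACITY 128

/* All state lives inside the segment itself; max_size is assumed to
 * be set, and cross-process locking handled, by the caller. */
typedef struct {
  int max_size;                        /* configured window size */
  int length;                          /* entries currently stored */
  int start;                           /* index of the oldest entry */
  long data[SLIDING_WINDOW_CAPACITY];  /* e.g. error timestamps */
} sliding_window_shared_t;

static void
sliding_window_push(sliding_window_shared_t *w, long value)
{
  if (w->length == w->max_size) {      /* full: evict the oldest entry */
    w->start = (w->start + 1) % w->max_size;
    w->length--;
  }
  w->data[(w->start + w->length) % w->max_size] = value;
  w->length++;
}
```

Coordinating concurrent pushes from multiple workers is exactly where the locking mentioned below comes in.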
In particular, this implementation isn't thread-safe yet and requires some serious locking.

cc: @Shopify/servcomm
cc: @sirupsen, @csfrancis, @byroot