Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DRAFT] Share circuit breakers between workers #227

Open
wants to merge 51 commits into
base: implement_lru_cache
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 36 commits
Commits
Show all changes
51 commits
Select commit Hold shift + click to select a range
e74c999
WIP: Move CircuitBreaker implementation to shared memory
Jun 11, 2019
e89c2c5
WIP: Move Simple::State to shared memory
Jun 12, 2019
bb0e96e
WIP: Working Simple::Integer and Simple::State
Jun 12, 2019
1d04c0c
WIP: Working Simple::SlidingWindow in shared memory
Jun 12, 2019
3d78bbe
Compile in C99 mode
Jun 12, 2019
e20c078
Fixes from code review
Jun 12, 2019
569a551
Added a dprintf function
Jun 13, 2019
f55e521
Write failing race test
spike01 Jun 13, 2019
f517f75
Allow fallback to Ruby to be toggled by env var
spike01 Jun 13, 2019
7c0fc80
Set up single semaphore
spike01 Jun 13, 2019
7325b47
Move default permissions into sysv_semaphore header
spike01 Jun 13, 2019
3c45b3f
Clarify critical sections
spike01 Jun 13, 2019
35217ae
Rename rb_val->retval
spike01 Jun 14, 2019
7ba1e9e
Change flag for choosing circuit implementation
spike01 Jun 14, 2019
8e64bd0
Clarify flag scope
spike01 Jun 14, 2019
def383d
Use semaphore locks around sliding window
spike01 Jun 14, 2019
efb62a9
Stylistic changes
Jun 14, 2019
7474bed
Quick script to reset IPC state
Jun 17, 2019
203cdfd
More locks around shared data structures
Jun 17, 2019
f251c8f
Reduce test flakiness with unique resource names
Jun 17, 2019
128788f
Grow/shrink the sliding window's max_size
Jun 18, 2019
5bb4d65
Ignore CLion project files
Jun 18, 2019
f6bcf7f
WIP: Registered worker based scaling factor for error_threshold
Jun 18, 2019
3498109
Fixed compile error
Jun 19, 2019
e4631b0
Removed spammy debug message
Jun 24, 2019
effaea6
Made the default for SEMIAN_CIRCUIT_BREAKER_IMPL to be worker/Ruby based
Jun 24, 2019
bff9225
Run both worker/host based circuit breaker implementations in CI
Jun 24, 2019
0ae9c55
Allow SEMIAN_DEBUG to enable debug messages
Jun 25, 2019
26aa0dd
Ignore log files
Jun 25, 2019
43f6a57
Review comments
Jun 25, 2019
dafe362
new build matrix
epk Jun 25, 2019
63495a9
make workdir similar to travis path
epk Jun 25, 2019
13763b1
Exclude running tests with host-based circuits for gemfiles
epk Jun 26, 2019
e2f8a5d
fix typo
epk Jun 26, 2019
efd72d0
Ignore BeyondCompare merge files
Jun 26, 2019
d16c6ab
Implement the scale_factor for host-based sliding windows
Jun 26, 2019
ddd7268
Enabled debug messages with SEMIAN_DEBUG
Jun 30, 2019
0a00e94
Store SimpleInteger in semaphores, not shared memory
Jun 30, 2019
4682bed
Print debug messages with timestamp and process ID
Jul 2, 2019
f43ccbf
Tests are too noisy
Jul 3, 2019
af854a5
Don't force sliding window entries to be monotonic
Jul 3, 2019
6445222
Review comments
Jul 3, 2019
3e04a43
Merge pull request #246 from Shopify/mkipper/global-circuit-breaker-b…
Jul 3, 2019
3655ae8
Allow SEMIAN_CIRCUIT_BREAKER_FORCE_HOST to force-enable host-based ci…
Jul 1, 2019
926074d
Build semian libraries before unit tests
Jul 2, 2019
05796e7
Force circuits in Ruby
Jul 3, 2019
be44b32
Review fixes
Jul 4, 2019
d3f252c
Review fixes
Jul 4, 2019
ea78022
Review comments
Jul 4, 2019
9fb115a
Merge pull request #243 from Shopify/mkipper/global-circuit-breaker-s…
Jul 4, 2019
286b6c1
Skip test
Jul 8, 2019
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 4 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,8 @@
/tmp
/pkg
*.gem
*.log
*.orig
/html/
Gemfile.lock
vendor/
Expand All @@ -18,3 +20,5 @@ nohup.out

# IntelliJ/RubyMine/CLion project files
.idea
CMakeLists.txt
cmake-build-debug
29 changes: 23 additions & 6 deletions .travis.yml
Original file line number Diff line number Diff line change
Expand Up @@ -13,15 +13,32 @@ before_install:
services:
- docker

gemfile:
- Gemfile
- gemfiles/hiredis-0-6.gemfile
- gemfiles/mysql2-0-4-10.gemfile
- gemfiles/mysql2-0-5-0.gemfile

env:
- SEMIAN_CIRCUIT_BREAKER_IMPL=worker
- SEMIAN_CIRCUIT_BREAKER_IMPL=host

matrix:
include:
- env: GEMFILE=gemfiles/hiredis-0-6.gemfile
- env: GEMFILE=gemfiles/mysql2-0-4-10.gemfile
- env: GEMFILE=gemfiles/mysql2-0-5-0.gemfile
- env: GEMFILE=Gemfile
exclude:
- gemfile: gemfiles/hiredis-0-6.gemfile
env: SEMIAN_CIRCUIT_BREAKER_IMPL=host
- gemfile: gemfiles/mysql2-0-4-10.gemfile
env: SEMIAN_CIRCUIT_BREAKER_IMPL=host
- gemfile: gemfiles/mysql2-0-5-0.gemfile
env: SEMIAN_CIRCUIT_BREAKER_IMPL=host

script:
- docker build --build-arg BUNDLE_GEMFILE="$GEMFILE" -t shopify/semian-ci:latest -f dockerfiles/semian-ci .
- |
docker build \
--build-arg BUNDLE_GEMFILE="$BUNDLE_GEMFILE" \
--build-arg SEMIAN_CIRCUIT_BREAKER_IMPL="$SEMIAN_CIRCUIT_BREAKER_IMPL" \
-t shopify/semian-ci:latest \
-f dockerfiles/semian-ci .
- travis_retry docker-compose -f docker-compose.ci.yml up --force-recreate --exit-code-from semian

notifications:
Expand Down
27 changes: 27 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -448,6 +448,33 @@ There are three configuration parameters for circuit breakers in Semian:
* **success_threshold**. The amount of successes on the circuit until closing it
again, that is to start accepting all requests to the circuit.
* **half_open_resource_timeout**. Timeout for the resource in seconds when the circuit is half-open (only supported for MySQL and Net::HTTP).
* **scale_factor**. When using [host-based circuits](#host-based-circuits), the
scaling factor to determine how to scale `error_threshold * num_workers` to
achieve faster circuit opens.

#### Host-Based Circuits

On systems with [SysV support][sysv], we can share circuit error information
between processes on a server. This means that those processes can effectively
share information between each other about resource health, leading to faster,
more efficient opening of circuits.

As an example, imagine a system with _N_ processes on a host with an error
threshold of _E_, and a client timeout of _T_. By default, the system needs to
see _N * E_ errors to open the circuit. We can reduce this by using the
`scale_factor` configuration parameter. If we set `scale_factor` to _1 / N_,
the total number of errors we'd need to see server-wide is still _E_. In
this configuration, we can reduce the time-to-open for a circuit from _E * T_
to simply _T_ (provided that _N_ is greater than _E_).

You should run a simulation with your workloads to determine an efficient
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really tough advice to give someone. That can be days to weeks of work. Can we, with our simulation tooling, provide some sample values that we can include here? I imagine we can generate a somewhat magic constant based on simulations on (1) ss_error_rate steady-state error rates, (2) n number of Semian consumers (threads * workers), (3) error_threshold, (4) resource_timeout, (5) whatever other inputs you all use?

I'd like the advice to then be something along the lines of...

While your own simulation will likely find you a superior value for the scale factor, we recognize how much work this is. We've done simulations internally and have found that a scale factor that is `<magic constant> * <threads>` works fairly well and use that for almost all services.

scaling factor that will produce a time-to-open reduction but isn't too
sensitive.

The circuit breaker implementation is based on the environment variable
`SEMIAN_CIRCUIT_BREAKER_IMPL`. Set it to `worker` to disable sharing circuit
state and `host` to enable host-based circuits. The default is `worker` for
backward compatibility.

### Bulkheading

Expand Down
5 changes: 4 additions & 1 deletion dockerfiles/semian-ci
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,10 @@ RUN apt-get update \

RUN gem install bundler

WORKDIR /app
WORKDIR /home/travis/build/Shopify/semian
COPY . .
ARG BUNDLE_GEMFILE
ENV BUNDLE_GEMFILE="${BUNDLE_GEMFILE}"

ARG SEMIAN_CIRCUIT_BREAKER_IMPL
ENV SEMIAN_CIRCUIT_BREAKER_IMPL="${SEMIAN_CIRCUIT_BREAKER_IMPL}"
67 changes: 67 additions & 0 deletions ext/semian/circuit_breaker.c
Original file line number Diff line number Diff line change
@@ -0,0 +1,67 @@
#include "circuit_breaker.h"

#include "sysv_semaphores.h"
#include "sysv_shared_memory.h"
#include "types.h"
#include "util.h"

void
semian_circuit_breaker_free(void* ptr)
{
semian_circuit_breaker_t* res = (semian_circuit_breaker_t*)ptr;
free_shared_memory(res->shmem);
}

size_t
semian_circuit_breaker_size(const void* ptr)
{
return sizeof(semian_circuit_breaker_t);
}

static const rb_data_type_t semian_circuit_breaker_type = {
.wrap_struct_name = "semian_circuit_breaker",
.function = {
.dmark = NULL,
.dfree = semian_circuit_breaker_free,
.dsize = semian_circuit_breaker_size,
},
.data = NULL,
.flags = RUBY_TYPED_FREE_IMMEDIATELY,
};

void
Init_CircuitBreaker() {
dprintf("Init_CircuitBreaker");

VALUE cSemian = rb_const_get(rb_cObject, rb_intern("Semian"));
VALUE cCircuitBreaker = rb_const_get(cSemian, rb_intern("CircuitBreaker"));

rb_define_alloc_func(cCircuitBreaker, semian_circuit_breaker_alloc);
rb_define_method(cCircuitBreaker, "initialize_circuit_breaker", semian_circuit_breaker_initialize, 2);
}

VALUE
semian_circuit_breaker_alloc(VALUE klass)
{
semian_circuit_breaker_t *res;
VALUE obj = TypedData_Make_Struct(klass, semian_circuit_breaker_t, &semian_circuit_breaker_type, res);
return obj;
}

VALUE
semian_circuit_breaker_initialize(VALUE self, VALUE name)
{
semian_circuit_breaker_t *res = NULL;
TypedData_Get_Struct(self, semian_circuit_breaker_t, &semian_circuit_breaker_type, res);

char buffer[1024];
strcpy(buffer, to_s(name));
strcat(buffer, "_circuit_breaker");
res->key = generate_key(buffer);

dprintf("Initializing circuit breaker '%s' (key: %lu)", buffer, res->key);
res->sem_id = initialize_single_semaphore(res->key, SEM_DEFAULT_PERMISSIONS);
res->shmem = get_or_create_shared_memory(res->key, NULL);

return self;
}
12 changes: 12 additions & 0 deletions ext/semian/circuit_breaker.h
Original file line number Diff line number Diff line change
@@ -0,0 +1,12 @@
#ifndef CIRCUIT_BREAKER_H
#define CIRCUIT_BREAKER_H

#include <ruby.h>

void Init_CircuitBreaker();

VALUE semian_circuit_breaker_alloc(VALUE klass);
VALUE semian_circuit_breaker_initialize(VALUE self, VALUE id);
VALUE semian_circuit_breaker_successes(VALUE self);

#endif // CIRCUIT_BREAKER_H
2 changes: 1 addition & 1 deletion ext/semian/extconf.rb
Original file line number Diff line number Diff line change
Expand Up @@ -23,7 +23,7 @@
have_func 'rb_thread_blocking_region'
have_func 'rb_thread_call_without_gvl'

$CFLAGS = "-D_GNU_SOURCE -Werror -Wall "
$CFLAGS = "-D_GNU_SOURCE -Werror -Wall -std=gnu99 "
if ENV.key?('DEBUG')
$CFLAGS << "-O0 -g -DDEBUG"
else
Expand Down
42 changes: 10 additions & 32 deletions ext/semian/resource.c
Original file line number Diff line number Diff line change
@@ -1,5 +1,7 @@
#include "resource.h"

#include "util.h"

static VALUE
cleanup_semian_resource_acquire(VALUE self);

Expand All @@ -15,9 +17,6 @@ check_tickets_arg(VALUE tickets);
static long
check_permissions_arg(VALUE permissions);

static const
char *check_id_arg(VALUE id);

static double
check_default_timeout_arg(VALUE default_timeout);

Expand Down Expand Up @@ -202,22 +201,16 @@ semian_resource_key(VALUE self)
VALUE
semian_resource_initialize(VALUE self, VALUE id, VALUE tickets, VALUE quota, VALUE permissions, VALUE default_timeout)
{
long c_permissions;
double c_timeout;
double c_quota;
int c_tickets;
semian_resource_t *res = NULL;
const char *c_id_str = NULL;

// Check and cast arguments
check_tickets_xor_quota_arg(tickets, quota);
c_quota = check_quota_arg(quota);
c_tickets = check_tickets_arg(tickets);
c_permissions = check_permissions_arg(permissions);
c_id_str = check_id_arg(id);
c_timeout = check_default_timeout_arg(default_timeout);
double c_quota = check_quota_arg(quota);
int c_tickets = check_tickets_arg(tickets);
long c_permissions = check_permissions_arg(permissions);
const char *c_id_str = check_id_arg(id);
double c_timeout = check_default_timeout_arg(default_timeout);

// Build semian resource structure
semian_resource_t *res = NULL;
TypedData_Get_Struct(self, semian_resource_t, &semian_resource_type, res);

// Populate struct fields
Expand Down Expand Up @@ -314,23 +307,6 @@ check_tickets_arg(VALUE tickets)
return c_tickets;
}

static const char*
check_id_arg(VALUE id)
{
const char *c_id_str = NULL;

if (TYPE(id) != T_SYMBOL && TYPE(id) != T_STRING) {
rb_raise(rb_eTypeError, "id must be a symbol or string");
}
if (TYPE(id) == T_SYMBOL) {
c_id_str = rb_id2name(rb_to_id(id));
} else if (TYPE(id) == T_STRING) {
c_id_str = RSTRING_PTR(id);
}

return c_id_str;
}

static double
check_default_timeout_arg(VALUE default_timeout)
{
Expand Down Expand Up @@ -361,6 +337,8 @@ static inline void
semian_resource_free(void *ptr)
{
semian_resource_t *res = (semian_resource_t *) ptr;
dprintf("Freeing resource sem_id:%d", res->sem_id);

if (res->name) {
free(res->name);
res->name = NULL;
Expand Down
28 changes: 28 additions & 0 deletions ext/semian/semian.c
Original file line number Diff line number Diff line change
@@ -1,5 +1,7 @@
#include "semian.h"

static int use_c_circuits();

void Init_semian()
{
VALUE cSemian, cResource;
Expand Down Expand Up @@ -64,4 +66,30 @@ void Init_semian()

/* Maximum number of tickets available on this system. */
rb_define_const(cSemian, "MAX_TICKETS", INT2FIX(system_max_semaphore_count));

if (use_c_circuits()) {
Init_SimpleInteger();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you choose to override them rather than define them and switch at the Ruby-layer instead? The latter seems a bit cleaner to me.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly because that's roughly the way the bulkhead was implemented, and I wanted to keep the codebase familiar. One could argue that the Ruby implementation of the bulkhead is actually just a stub, but then those functions should have NotImplemented assertions in them.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine with me. 👍

Init_SlidingWindow();
Init_CircuitBreaker();
}
}

static int
use_c_circuits() {
char *circuit_impl = getenv("SEMIAN_CIRCUIT_BREAKER_IMPL");
if (circuit_impl == NULL) {
fprintf(stderr, "Warning: Defaulting to worker-based circuit breaker implementation\n");
return 0;
} else {
if (!strcmp(circuit_impl, "worker")) {
return 0;
} else if (!strcmp(circuit_impl, "host")) {
return 1;
} else {
fprintf(stderr, "Warning: Unknown circuit breaker implementation: '%s'\n", circuit_impl);
return 0;
}
}

rb_raise(rb_eArgError, "Unknown circuit breaker implementation");
}
3 changes: 3 additions & 0 deletions ext/semian/semian.h
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,10 @@ Implements Init_semian, which is used as C/Ruby entrypoint.
#ifndef SEMIAN_H
#define SEMIAN_H

#include "circuit_breaker.h"
#include "resource.h"
#include "simple_integer.h"
#include "sliding_window.h"

void Init_semian();

Expand Down
Loading