-
Notifications
You must be signed in to change notification settings - Fork 29
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Async Healthcheck #301
Async Healthcheck #301
Conversation
I'm not entirely sure we need a new dependency that will span through all components just for its <100 rows of async collection of healths. Let's discuss this. |
If we use more than the 100 lines of async collection of healths its worth using it - otherwise not. If we decide to use it we should make sure the pkg package does not expose any imports to this library (e.g. its internal implementation detail of the sm framework) My idea was apart from the async collection of healths to also configure the library logging to use our logger so that we get health logging. Also the library stores health failure timestamps and consecutive number of failures and other details which we could look at and decide if we want to enrich our health response with these extra details. |
pkg/sm/sm.go
Outdated
} | ||
smb.RegisterControllers(healthcheck.NewController(healthz, smb.HealthAggregationPolicy, smb.health.FailuresTreshold)) | ||
|
||
err := healthz.Start() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
somewhere you should listen for ctx.Done()
and call healthz.Stop()
pkg/sm/sm.go
Outdated
smb.RegisterControllers(healthcheck.NewController(healthz, smb.HealthAggregationPolicy, smb.health.FailuresTreshold)) | ||
|
||
err := healthz.Start() | ||
if err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
inline
@@ -182,7 +175,6 @@ type TransactionalRepositoryDecorator func(TransactionalRepository) (Transaction | |||
//go:generate counterfeiter . Storage | |||
type Storage interface { | |||
OpenCloser | |||
Pinger |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This works just because the storage implementation is hardcoded in sm.New
to use postgres.Storage
. Assuming it is not, this should be reverted.
pkg/health/types.go
Outdated
// Settings type to be loaded from the environment | ||
type Settings struct { | ||
FailuresTreshold int64 `mapstructure:"failures_treshold" description:"maximum failures in a row until component is considered down"` | ||
Interval int64 `description:"seconds between health checks of components"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change this to time.Duration
and add mapstructure
. if time.Duration
adjust the description
pkg/health/types.go
Outdated
if s.FailuresTreshold < 0 { | ||
return fmt.Errorf("validate Settings: FailuresTreshold must be >= 0") | ||
} | ||
if s.Interval < 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose this should be something larger. Like 30s/60s at least?
pkg/health/aggregation_policy.go
Outdated
} | ||
return New().WithStatus(overallStatus).WithDetails(details) | ||
} | ||
|
||
// ConvertStatus converts go-health status to Status | ||
func ConvertStatus(status string) Status { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't decide if it would be better to change our statuses from UP and Down to OK and Failed to avoid this.
pkg/sm/sm.go
Outdated
<-c.Done() | ||
log.C(c).Debug("Context cancelled. Stopping health checks...") | ||
if err := healthz.Stop(); err != nil { | ||
log.D().Error(err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
log.C(c)
pkg/health/types.go
Outdated
FailuresTreshold int64 `mapstructure:"failures_treshold" description:"maximum failures in a row until component is considered down"` | ||
Interval int64 `description:"seconds between health checks of components"` | ||
FailuresTreshold int64 `mapstructure:"failures_treshold" description:"maximum failures in a row until component is considered down"` | ||
Interval time.Duration `mapstructure:"interval" description:"seconds between health checks of components"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
description should be time between...
pkg/health/types.go
Outdated
|
||
// Validate validates the health settings | ||
func (s *Settings) Validate() error { | ||
if s.FailuresTreshold < 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add tests for these
} | ||
return healthz.Up() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
now that these are not used, we might remove our health.Health
altogether because it seems unnecessary conversion from one type to another
pkg/health/aggregation_policy.go
Outdated
if len(healths) == 0 { | ||
return New().WithDetail("error", "no health indicators registered").Unknown() | ||
} | ||
overallStatus := StatusUp | ||
for _, health := range healths { | ||
if health.Status == StatusDown { | ||
if health.Status == "failed" && health.ContiguousFailures > failureTreshold { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
health.ContiguousFailures > failureTreshold
introduces behavior, which I'm not sure we want for all health indicators.
If the storage is down when healthcheck call comes, you might get a response with body that says failed
, but you'll get status 200 OK
. This will happen if the time between the detection of storage down and the call is less than failureThreshold * interval
.
Assuming we configure interval=60s
and /v1/monitor/health
is called every 5 minutes - on the 4th minute the storage fails and now on the health call we get 200 OK
but with failure response. Assuming there is a storage outage for 4 minutes (1 minute before the next health call) then the next health call will report UP again. However, the service has not been operational for 4 minutes and all storage operations were failing.
Does it make sense to have the health configuration be a map[string]Settings
where each health indicator can be separately configured with failureThreshold
and interval
? Then we might configure the storage healthcheck to report down even after 1 failure.
} | ||
|
||
// NewController returns a new healthcheck controller with the given indicators and aggregation policy | ||
func NewController(indicators []health.Indicator, aggregator health.AggregationPolicy) web.Controller { | ||
func NewController(health h.IHealth, aggPolicy health.AggregationPolicy, failuresTreshold int64) web.Controller { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
its not a good idea (for projects with vendor folders) for pkg methods that will be used in other modules to expect input or return output that requires the caller of this to also import this dependency. this may complicate dependency management for any other project (proxy for example) - ideally creating a controller should require only imports from github.com/Peripli/service-manager
pkg/sm/sm.go
Outdated
err := healthz.AddCheck(&h.Config{ | ||
Name: indicator.Name(), | ||
Checker: indicator, | ||
Interval: smb.health.Interval * time.Second, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should probably be configurable per HealthIndicator as the library allows - the way you implemented it limits us to having to use the same interval for all indicators which might be ok if the indicator didnt specify its own value
storage/healthcheck_test.go
Outdated
healthIndicator = &storage.HealthIndicator{ | ||
Pinger: pinger, | ||
} | ||
healthIndicator, _ = storage.NewStorageHealthIndicator(storage.PingFunc(ping)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not a good practice to skip errors - even if its tests - normally you should handle the err by doing Expect(err).ShouldNot(HaveOccured()) - otherwise in case someone changed the logic in NewStorageHealthIndicator and the err occurred and was ignored here troubleshooting the failing test would be harder
pkg/health/aggregation_policy.go
Outdated
// DefaultAggregationPolicy aggregates the healths by constructing a new Health based on the given | ||
// where the overall health status is negative if one of the healths is negative and positive if all are positive | ||
type DefaultAggregationPolicy struct { | ||
} | ||
|
||
// Apply aggregates the given healths | ||
func (*DefaultAggregationPolicy) Apply(healths map[string]*Health) *Health { | ||
func (*DefaultAggregationPolicy) Apply(healths map[string]health.State, failureTreshold int64) *Health { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The State also includes details such as
CheckTime Time "json:\"check_time\""
ContiguousFailures int64 "json:\"num_failures\""
TimeOfFirstFailure Time "json:\"first_failure_at\""
It would be useful to have these in the json response for each of the indicators
} | ||
|
||
healthz := h.New() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The library allows registering IStatusListener
that gets called back when the health status changes. We could register one so that we can log health status changes - rough example here https://github.com/InVisionApp/go-health/tree/master/examples/status-listener
Since Recovered can also log the times failures happened before recovery this could be useful information in logs
pkg/sm/sm.go
Outdated
@@ -167,9 +178,12 @@ func New(ctx context.Context, cancel context.CancelFunc, cfg *config.Settings) ( | |||
|
|||
// Build builds the Service Manager | |||
func (smb *ServiceManagerBuilder) Build() *ServiceManager { | |||
// setup server and add relevant global middleware | |||
smb.installHealth() | |||
err := smb.installHealth() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if err := ..; err != nil { ... }
pkg/sm/sm.go
Outdated
healthz.Logger = l.New(logger) | ||
|
||
for _, indicator := range smb.HealthIndicators { | ||
err := healthz.AddCheck(&h.Config{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if err := ..; err != nil { ... }
pkg/health/aggregation_policy.go
Outdated
overallStatus = StatusDown | ||
break | ||
} | ||
} | ||
details := make(map[string]interface{}) | ||
for k, v := range healths { | ||
details[k] = v | ||
details[k] = ConvertStatus(v.Status) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you include the name in the details ? it would probably make sense to know what status comes from what indicator
@@ -17,27 +17,22 @@ | |||
package storage_test | |||
|
|||
import ( | |||
"context" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
coverage has dropped with 0.4% which seems a bit too much - could you have a look at new missed lines and see if they can be covered with meaningful tests?
pkg/sm/sm.go
Outdated
return err | ||
} | ||
|
||
// Handles safe termination of sm |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe something like // Gracefully stop health checks or no comment ?
85c9baf
to
b0e93e9
Compare
9594304
to
f197ae0
Compare
pkg/health/types.go
Outdated
} | ||
|
||
// ConfigureIndicators configures registry's indicators with provided settings | ||
func (r *Registry) ConfigureIndicators() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a hidden extra step that needs to be performed. What do you think about hiding the indicators and add a func (Registry) AddHealthIndicator(Inidicator)
that configures it?
pkg/sm/sm.go
Outdated
// setup server and add relevant global middleware | ||
smb.installHealth() | ||
if err := smb.installHealth(); err != nil { | ||
panic(err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
log.C(smb.ctx).Panic()
// NewController returns a new healthcheck controller with the given indicators and aggregation policy | ||
func NewController(indicators []health.Indicator, aggregator health.AggregationPolicy) web.Controller { | ||
// NewController returns a new healthcheck controller with the given health and tresholds | ||
func NewController(health h.IHealth, indicators []health.Indicator) web.Controller { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pass only the minimum that is required. you don't need the whole indicators slice.
pkg/sm/sm.go
Outdated
@@ -183,7 +183,7 @@ func New(ctx context.Context, cancel context.CancelFunc, cfg *config.Settings) ( | |||
// Build builds the Service Manager | |||
func (smb *ServiceManagerBuilder) Build() *ServiceManager { | |||
if err := smb.installHealth(); err != nil { | |||
panic(err) | |||
log.C(smb.ctx).Panic() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Panic(err)
88129ba
to
ce4d4bb
Compare
pkg/sm/sm.go
Outdated
@@ -132,7 +134,13 @@ func New(ctx context.Context, cancel context.CancelFunc, cfg *config.Settings) ( | |||
return nil, fmt.Errorf("error creating core api: %s", err) | |||
} | |||
|
|||
API.HealthIndicators = append(API.HealthIndicators, &storage.HealthIndicator{Pinger: storage.PingFunc(smStorage.Ping)}) | |||
storageHealthIndicator, err := storage.NewStorageHealthIndicator(storage.PingFunc(smStorage.PingContext)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
storage.NewHealthIndicator (do not repeat storage)
pkg/sm/sm.go
Outdated
} | ||
|
||
API.HealthIndicators = append(API.HealthIndicators, storageHealthIndicator) | ||
API.HealthSettings = cfg.Health.IndicatorsSettings |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure if its necessary to copy the settings over in the registry as the cfg is acessible in most places anyway?
pkg/sm/sm.go
Outdated
for _, indicator := range smb.HealthIndicators { | ||
if configurableIndicator, ok := indicator.(health.ConfigurableIndicator); ok { | ||
if settings, ok := smb.HealthSettings[configurableIndicator.Name()]; ok { | ||
configurableIndicator.Configure(settings) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dont think its necessary to let the indicators know about these settings - Indicator can have .State() and .Name() and no other interfaced methods and the first touchpoint between the settings and the actual indicator will be when calling healthz.AddCheck
storage/healthcheck.go
Outdated
return i.settings.Fatal | ||
} | ||
|
||
func NewStorageHealthIndicator(pingFunc PingFunc) (health.Indicator, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
file order - constructors that return concrete types (typically) go right before the struct they return (readability)
pkg/health/types.go
Outdated
FailuresTreshold() int64 | ||
|
||
// Fatal returns if the health indicator is fatal for the overall status | ||
Fatal() bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what does a fatal = false with failureTreshold = 5 indicator mean ? Meaning do we need both or does having both just bring confusion
} | ||
overallStatus := health.StatusUp | ||
for i, v := range state { | ||
if v.Fatal && v.ContiguousFailures >= c.tresholds[i] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should it be >= or > (if threshold is the maximum allowed failures, the = should be removed)
return health.New().WithDetail("error", "no health indicators registered") | ||
} | ||
overallStatus := health.StatusUp | ||
for i, v := range state { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you dont really use i
so put _ instead. Also rename state
to overrallState
and v
to state
return nil, fmt.Errorf("error creating storage health indicator: %s", err) | ||
} | ||
|
||
API.HealthIndicators = append(API.HealthIndicators, storageHealthIndicator) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could wrap the access to the HealthIndicators in a register method that also adds defaultsettings for this indicator if they are missing and "completes" any incomplete settings for the indicator that is being registered - this will help get rid of the Configurable interface
pkg/health/types.go
Outdated
Apply(healths map[string]*Health) *Health | ||
// ConfigurableIndicator is an interface to provide configurable health of a component | ||
//go:generate counterfeiter . ConfigurableIndicator | ||
type ConfigurableIndicator interface { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest tey to get rid of this interface. - thereare some ideas in the other comments
// NewController returns a new healthcheck controller with the given indicators and aggregation policy | ||
func NewController(indicators []health.Indicator, aggregator health.AggregationPolicy) web.Controller { | ||
// NewController returns a new healthcheck controller with the given health and tresholds | ||
func NewController(health h.IHealth, tresholds map[string]int64) web.Controller { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NewController is a public api that will be used in other projects- so you have two options - go for a stable api (pass settings and know that when new settings are added the controller public api wont change) or pass minimal and risk having to refactor a public api afterwards. Both approaches are acceptable.
} else { | ||
configurableIndicator.Configure(health.DefaultIndicatorSettings()) | ||
} | ||
settings, ok := smb.cfg.Health.Indicators[indicator.Name()] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you should probably also handle partcially configured settings that are present in the map - meaning "complete" them with default values
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed and it seems that now partically configured indicator will result in confiuration validation error - which is also ok
sqlConfig := &checkers.SQLConfig{ | ||
Pinger: pingFunc, | ||
} | ||
sqlChecker, err := checkers.NewSQL(sqlConfig) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is implementation specific in the storage package. Move this to postgres package or rename this to SQLHealthIndicator
/ PostgresHealthIndicator
.
Motivation
Currently SM healthcheck works synchronously (calls all dependant components synchronously every time a call to /v1/health comes). This is not optimal and introduces unnecessary load both on SM and the dependencies.
Using https://github.com/InVisionApp/go-health healthchecks can happen asynchronously from the API call. We could remember the results of the last few executions so that we can return 500 on /v1/health only if the pg health indicator has failed 3-4 times in a row. This way we can avoid micro outages (e.g. if we couldnt reach the db for 5min straight, then we report status 500) otherwise we report 200 OK with body that mentions that db is down. This way hopefully we avoid microoutages reports caused by other components. We can leverage logic from the dependency in 1. that would help implement this
Approach
Each indicator could be configured separately in application.yml. The properties which could be configured are if the indicator is fatal or not, which says if the health of this indicator will affect the overall health. Failures treshold which is how much times in a row an indicator must report down until the component is considered down. And last is the interval time between each check for the component.
Example structure is:
Positive response sample
Negative response sample (Treshold not exceeded and overall status is UP)
I made a PR in the library to make some of the fields in response optional since some of them are most of the time zero values like
first_failure_at
andnum_failures
when the component is UP, but still no response.InVisionApp/go-health#64