Howdy, we have a custom check that retrieves a metric value from Prometheus using curl.
Edit: we are using Slurm as our resource manager.
The check works great; however, I need to add code to it to prevent NHC from changing the state of the node (draining or un-draining it) if the curl command fails. Examples:
The Prometheus server is not responding
The query doesn't return any metric (could happen if node_exporter died on the node)
Is there a way to return from the function where NHC would not make any changes to the node?
return 0 indicates no failure and triggers an un-drain if the node is already drained, so I can't use that.
return 1 (or any other nonzero value) indicates failure and drains the node.
Thanks,
Mike Hanby
UAB IT Research Computing
I ran a few tests, and it appears that calling nhcmain_finish bypasses the code that drains or un-drains the node. However, I believe this would also skip any checks further down the line.
I guess putting this particular check at the end of nhc.conf would mitigate this, but it's still hacky.
So to make sure I understand... You want the check to fail if the correctly curl'd metric is above a certain threshold, but you want it to pass if it can't obtain a valid metric to test against, though in this case you don't want the node put back into service either?
At present, NHC doesn't really have a "soft fail" or a concept of a partially (un)healthy node, and that was really by design. You can, however, make changes to existing configuration values from within the code for your check. So if you wanted the check to pass but disallow "undraining" of the node, you can do something like ONLINE_NODE=: and then return 0 from your check. This would still allow subsequent checks to drain the node if they failed but keep an otherwise healthy node from being returned to service. Is that what you're wanting?
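A minimal sketch of that approach, in case it helps. Everything named here is an assumption rather than your actual code: the function names check_prom_metric and prom_query, the PROM_URL address, the sed-based JSON extraction, and the threshold argument are all hypothetical. The only NHC-specific piece is setting ONLINE_NODE=: before returning 0 when no usable metric comes back.

```shell
# Hypothetical check: names, URL, and parsing are illustrative assumptions.
PROM_URL="${PROM_URL:-http://prometheus.example.com:9090}"  # assumed address

# Fetch the raw JSON for an instant query. Kept separate so it can be stubbed.
prom_query() {
    curl -sf --max-time 10 -G "$PROM_URL/api/v1/query" \
        --data-urlencode "query=$1"
}

# Usage (hypothetical): check_prom_metric '<promql query>' <max value>
check_prom_metric() {
    local query="$1" max="$2" json value

    # Server unreachable or HTTP error: pass, but disable un-draining.
    if ! json=$(prom_query "$query"); then
        ONLINE_NODE=:
        return 0
    fi

    # Pull the sample out of "value":[<timestamp>,"<sample>"] (single result).
    value=$(sed -n 's/.*"value":\[[^,]*,"\([^"]*\)"].*/\1/p' <<<"$json")

    # Empty result or non-numeric sample: same soft pass as above.
    if [[ ! "$value" =~ ^-?[0-9]+(\.[0-9]+)?$ ]]; then
        ONLINE_NODE=:
        return 0
    fi

    # Valid metric above the threshold: a real failure, so the node drains.
    if awk -v v="$value" -v m="$max" 'BEGIN { exit !(v > m) }'; then
        echo "check_prom_metric: value $value exceeds $max" >&2
        return 1
    fi
    return 0
}
```

With this shape, a later check that fails can still drain the node, since only the online helper is replaced with the no-op `:`. Inside NHC you would probably report the failure with whatever mechanism your other checks use rather than a plain echo.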
Feel free to share the code in question if that might help clarify what you're shooting for here! 😀