Question: Custom Check, How to exit without any changes, i.e. leave node in current state? #139

flakrat · 2023-08-15T17:51:35Z

Howdy, we have a custom check that retrieves a metric value from Prometheus using curl.

Edit: we are using Slurm as our resource manager.

The check works great. However, I need to add code to it that prevents NHC from changing the state of the node (draining or un-draining it) if the curl command fails, for example:

The Prometheus server is not responding
The query doesn't return any metric (could happen if node_exporter died on the node)

Is there a way to return from the function such that NHC does not make any changes to the node?

return 0 indicates no failure and triggers an un-drain if the node is already drained, so I can't use that.
return 1 (or any other non-zero value) indicates failure and drains the node.

Thanks,

Mike Hanby
UAB IT Research Computing


flakrat · 2023-08-16T20:31:07Z

I ran a few tests, and it appears that calling nhcmain_finish bypasses the code that drains/un-drains the node. However, I believe this would also skip any checks further down the line.

I guess putting this particular check at the end of nhc.conf would mitigate this, but it's still hacky.

mej · 2023-08-29T00:06:35Z

So to make sure I understand... You want the check to fail if the correctly curl'd metric is above a certain threshold, but you want it to pass if it can't obtain a valid metric to test against, though in this case you don't want the node put back into service either?

At present, NHC doesn't really have a "soft fail" or a concept of a partially (un)healthy node, and that was really by design. You can, however, make changes to existing configuration values from within the code for your check. So if you wanted the check to pass but disallow "undraining" of the node, you can do something like ONLINE_NODE=: and then return 0 from your check. This would still allow subsequent checks to drain the node if they failed but keep an otherwise healthy node from being returned to service. Is that what you're wanting?
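A minimal sketch of that approach, assuming a bash check function loaded by NHC in the usual way. The names check_prom_metric, fetch_metric, and PROM_URL are hypothetical, as is the use of jq to pull the value out of the Prometheus response; the check in question may parse the result differently. To keep the comparison in plain bash arithmetic, the sketch only accepts integer metric values.

```shell
#!/bin/bash
# Hypothetical custom NHC check. All function and variable names here except
# die and ONLINE_NODE are illustrative and not part of NHC.

# Fallback so the sketch can run outside NHC; NHC supplies its own die().
if ! declare -F die >/dev/null; then
    die() { local rc=$1; shift; echo "ERROR: $*" >&2; return "$rc"; }
fi

# Print the first sample value for a query; return non-zero if unavailable.
# Assumes the Prometheus instant-query endpoint and that jq is installed.
fetch_metric() {
    local query=$1 value
    value=$(curl -fsS --max-time 5 --get \
                --data-urlencode "query=${query}" \
                "${PROM_URL:-http://localhost:9090}/api/v1/query" \
            | jq -er '.data.result[0].value[1]') || return 1
    printf '%s\n' "$value"
}

# Usage: check_prom_metric <query> <max-integer-value>
check_prom_metric() {
    local query=$1 max=$2 value

    if ! value=$(fetch_metric "$query") || [[ ! $value =~ ^[0-9]+$ ]]; then
        # No usable metric (server down, empty result, non-integer value):
        # pass the check, but replace the "mark online" helper with the
        # no-op builtin so this run does not return the node to service.
        ONLINE_NODE=:
        return 0
    fi

    if (( value > max )); then
        # A real failure: report it so the node is drained.
        die 1 "$FUNCNAME: $query is $value (limit $max)"
        return 1
    fi

    return 0
}
```

In nhc.conf this would be referenced like any other check, for example `* || check_prom_metric some_metric 90`, where some_metric is a placeholder. Because ONLINE_NODE is changed only on the "no data" path, a later check that fails can still drain the node.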

Feel free to share the code in question if that might help clarify what you're shooting for here! 😀

flakrat · 2023-08-31T15:23:51Z

> So if you wanted the check to pass but disallow "undraining" of the node, you can do something like ONLINE_NODE=: and then return 0 from your check. This would still allow subsequent checks to drain the node if they failed but keep an otherwise healthy node from being returned to service.

This is what I'm after, thanks.

Here's the code: https://gitlab.rc.uab.edu/rc/rc-nhc/-/blob/main/uabrc_hw.nhc

mej self-assigned this Aug 29, 2023

mej added the question label Aug 29, 2023
