slingshot_metrics sample function returns an error causing ldmsd to stop the sampler #1533
Comments
Every time we see slingshot_metrics stop, the only kernel message I see is for stopping the automatic telemetry cache refresh.
@johnstile we probably need to get HPE involved to explain the kernel message: what it means and what causes it. Obviously it's not terminal, or restarting the sampler would not have worked, i.e. the restarted sampler would have stopped right away with another error.
This is a graph showing the rate at which we see oversampling messages for this sampler, which are caused by the slingshot_metrics sampler deactivating; the linear growth shows the impact on monitoring the system. The point where the slope restarts is where I restarted ldmsd on all the nodes that had deactivated the sampler, so this is recurring.
Wow, this sure looks like a cxil library bug.
@johnstile is that because the period during which the cxil library actually works is getting shorter? Is the Y axis oversampled messages per second?
@morrone, are you going to make the log-message/return code changes to slingshot_metrics? I can do it, but I don't want to duplicate the work.
@tom95858 I am looking at it now. One question in my mind is: when is it appropriate for a sampler plugin to return non-zero from sample()? And there would seem to be a problem elsewhere in the ldms code, is there not? The aggregator shouldn't flood the logs when a sampler returns non-zero from sample(), right? Do we know what is causing the "there is an outstanding update" message?
@morrone, in this context, a return code of zero just means "Don't stop the sampler." I think we need to do some discovery and interaction with the cxil authors to properly know what the right thing to do is: which errors are critical, which are retry errors, etc.
A non-zero return is appropriate when there is a critical error that can't be resolved by a retry.
I don't think this is an infrastructure error. But this is a valid question: should the log be flooded with messages? What happens at midnight when the logs wrap and the first error is now gone? The admin comes in at 8AM and everything looks great, except that you're no longer collecting data in one or more subsystems. In this particular investigation the cxil error (and loss of data collection) would not have been discovered unless the logs were flooded. As an aside, in 4.5 we have sophisticated message filtering that gives the administrator greater control over the messages that appear in the log, so that they can effectively filter chatty plugins, and also update the log level for plugins that are already running and suspected of having issues.
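A minimal sketch of that return-code policy (the error classification, read_counters(), and log_fn() are hypothetical stand-ins, not the actual plug-in code or real cxil error codes):

```c
#include <errno.h>
#include <stdio.h>

/* Stand-in for the real ldmsd logging call. */
#define log_fn(...) fprintf(stderr, __VA_ARGS__)

/* Hypothetical: which errors are worth retrying on the next interval. */
static int error_is_retryable(int rc)
{
        switch (rc) {
        case EAGAIN:
        case EBUSY:
        case ENODEV:    /* the device may come back after an interface bounce */
                return 1;
        default:
                return 0;
        }
}

/* Hypothetical: performs the actual counter reads; stubbed here. */
static int read_counters(void) { return 0; }

/* Called once per interval; a non-zero return tells ldmsd to stop the sampler. */
static int sample(void)
{
        int rc = read_counters();
        if (!rc)
                return 0;
        log_fn("slingshot_metrics: counter read failed, rc=%d\n", rc);
        if (error_is_retryable(rc))
                return 0;       /* transient: keep the sampler scheduled */
        return rc;              /* critical: let ldmsd stop the plug-in */
}
```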
This is the number of oversampling messages per 2 minutes across all CPU compute nodes. It is a side effect that I can measure pretty easily; it is harder to search through all the logs on all the computes for some string.
@johnstile, @morrone We should probably open an issue on this for LDMS internals. For example, should an updater that is getting 350k over samples a second not be stopped (assuming the sampler is dead), and periodically restarted?
I don't know what "over samples" are. Or why there would be hundreds of thousands of them a second. So maybe? :) |
I imagine that means a sampler should (almost?) never return an error, because short of a kernel failure, almost everything else on the node could resolve itself with time. Do we document the semantics of sample() error codes anywhere (the fact that ldmsd will stop the sampler when an error is returned, rather than just treating that round of sampling as failed)? Is there any way to communicate to ldmsd that just this one round of sampling failed? Should we leave the metric set in "inconsistent" mode or something, instead of telling ldmsd that more directly?
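A minimal sketch of the "leave the set inconsistent" idea, assuming the usual ldms_transaction_begin()/ldms_transaction_end() pattern (as I understand it, the set is flagged inconsistent while a transaction is open); read_counter() and the metric index are made up:

```c
#include <stdint.h>
#include "ldms.h"       /* ldms_set_t, transactions, metric setters */

/* Hypothetical counter read; returns non-zero on failure. */
static int read_counter(uint64_t *value);

static int sample_one_round(ldms_set_t set, int metric_idx)
{
        uint64_t v;

        ldms_transaction_begin(set);
        if (read_counter(&v)) {
                /* Skip ldms_transaction_end(): the set stays flagged
                 * inconsistent, so consumers can see that this round never
                 * completed, without returning an error to ldmsd. */
                return 0;
        }
        ldms_metric_set_u64(set, metric_idx, v);
        ldms_transaction_end(set);
        return 0;
}
```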
@johnstile I could be wrong, but the kernel message associated with these issues makes me think that your slingshot interface is bouncing (tearing down, and maybe coming back?). Could that be what is going on? I am thinking I should flush slingshot_metric's device list cache too when an error is encountered, so it doesn't potentially keep trying a device that isn't around. |
Oh, cache_cxil_dev_close() removes it from the cache already. Right. |
The ldmsd log shows the sampler was stopped, but the other samplers continue working:
Nov 19 10:19:53 nidXXXXXX ldmsd[NNNNN]: Tue Nov 19 10:19:53 2024: ERROR : 'slingshot_metrics': failed to sample. Stopping the plug-in.
When cxi_read_counter fails, it closes the device and returns an error.
The sample function in slingshot_metrics returns the error.
ldmsd disables the sampler.
The generation number stops incrementing.
ldmsd aggregators start generating thousands of "there is an outstanding update" messages due to the producer's set no longer updating.
I was able to use ldmsd_controller to resume the sampler, so I'm not sure why it had an error (this needs console investigation, but the NIC is alive).
Proposed fix discussion for @morrone or @tom95858
Never return an error from the sample function in the slingshot_info or slingshot_metrics samplers;
instead, log the error and return zero from the sample function to prevent ldmsd from stopping the sampler.
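A minimal sketch of that behaviour (not a patch against the actual source; collect_counters() and log_error() are placeholders for the real counter reads and the real ldmsd logging call):

```c
#include <stdio.h>

/* Placeholder for the real counter reads; stubbed here. */
static int collect_counters(void) { return 0; }

/* Placeholder for the real ldmsd logging call. */
#define log_error(...) fprintf(stderr, __VA_ARGS__)

/* Log any failure, but always return 0 so ldmsd keeps the sampler scheduled. */
static int sample(void)
{
        int rc = collect_counters();
        if (rc)
                log_error("slingshot_metrics: sample failed (rc=%d), "
                          "will retry next interval\n", rc);
        return 0;
}
```

The drawback, as noted above, is that a persistent failure then only shows up in the log, so the log messages (and the 4.5 message filtering) become the main way to notice that collection has stopped.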