
slingshot_metrics sampler function returns an error, causing ldmsd to stop the sampler #1533

Open
johnstile opened this issue Nov 22, 2024 · 14 comments

Comments

@johnstile

The ldmsd log shows the sampler was stopped, while the other samplers continue working:

Nov 19 10:19:53 nidXXXXXX ldmsd[NNNNN]: Tue Nov 19 10:19:53 2024: ERROR : 'slingshot_metrics': failed to sample. Stopping the plug-in.

When cxi_read_counter fails, it closes the device and returns an error.
The sample function in slingshot_metrics passes that error back to ldmsd.
ldmsd disables the sampler.
The generation number stops incrementing.
The ldmsd aggregators start generating thousands of "there is an outstanding update" messages because the producer is no longer updating the set.
I was able to use ldmsd_controller to resume the sampler, so I'm not sure why it hit an error in the first place (this needs console investigation, but the NIC is alive).

Proposed fix discussion for @morrone or @tom95858

Never return an error from the sample function in the slingshot_info or slingshot_metrics samplers;
instead, log the error and return zero from the sample function so that ldmsd does not stop the sampler.
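For discussion, a minimal sketch of that pattern. This is not the actual slingshot_metrics code: read_slingshot_counters() and log_sampler_error() are hypothetical stand-ins for the real cxil read path and the ldmsd logging call, and the real plugin's sample() signature differs.

```c
#include <stdio.h>

/* Hypothetical stand-in for the real read path (e.g. cxi_read_counter());
 * it always fails here so the error handling can be exercised. */
static int read_slingshot_counters(void)
{
	return -1;
}

/* Hypothetical stand-in for the ldmsd log call used by the plugin. */
static void log_sampler_error(const char *msg)
{
	fprintf(stderr, "ERROR : 'slingshot_metrics': %s\n", msg);
}

/* The proposed pattern: log the failure, but return 0 so ldmsd keeps
 * scheduling the sampler and the error can clear on a later pass. */
static int sample(void)
{
	if (read_slingshot_counters() != 0) {
		log_sampler_error("failed to sample; will retry on the next interval");
		return 0; /* non-fatal: do not let ldmsd stop the plug-in */
	}
	return 0;
}

int main(void)
{
	return sample();
}
```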

@johnstile
Author

Every time we see slingshot_metrics stop, the only kernel message I see is the one about stopping the automatic telemetry cache refresh.
Nov 22 03:28:55 nidXXXXXX ldmsd[92889]: Fri Nov 22 03:28:55 2024: ERROR : 'slingshot_metrics': failed to sample. Stopping the plug-in.
There is a corresponding kernel message:
Nov 22 03:28:55 nidXXXXXX kernel: cxi_core 0000:21:00.0: cxi0[hsn0] stopping automatic telemetry cache refresh

@tom95858
Collaborator

@johnstile we probably need to get HPE involved to explain the kernel message: what it means and what causes it. Obviously it's not a terminal condition, or restarting the sampler would not have worked; i.e., we would have restarted the sampler and it would have stopped right away with another error.

@johnstile
Author

johnstile commented Nov 22, 2024

This graph shows the rate at which we see oversampling messages caused by the slingshot_metrics sampler deactivating; the impact on monitoring the system grows linearly. The point where the slope resets is where I restarted ldmsd on all of the nodes that had deactivated the sampler, so the problem is recurring.

[Screenshot_20241122_094408: graph of the oversampling message rate over time]

@tom95858
Collaborator

> This graph shows the rate at which we see oversampling messages caused by the slingshot_metrics sampler deactivating; the impact on monitoring the system grows linearly. The point where the slope resets is where I restarted ldmsd on all of the nodes that had deactivated the sampler, so the problem is recurring.

Wow, this sure looks like a cxil library bug.

@tom95858
Collaborator

@johnstile is that because the period during which the cxil library actually works is getting shorter? Is the Y-axis oversampled messages per second?

@tom95858
Collaborator

@morrone, are you going to make the log-message/return code changes to slingshot_metrics? I can do it, but I don't want to duplicate the work.

@morrone
Collaborator

morrone commented Nov 22, 2024

@tom95858 I am looking at it now. One question in my mind is: when is it appropriate for a sampler plugin to return non-zero from sample()?

And there would seem to be a problem elsewhere in the ldms code, is there not? The aggregator shouldn't flood the logs when a sampler returns non-zero from sample(), right? Do we know what is causing the "there is an outstanding update" message?

@tom95858
Collaborator

@morrone, in this context, a return code of zero just means "Don't stop the sampler."

I think we need to do some discovery and interaction with the cxil authors to properly understand what the right thing to do is: which errors are critical, which are retryable, etc. (see the sketch at the end of this comment).

> @tom95858 I am looking at it now. One question in my mind is: when is it appropriate for a sampler plugin to return non-zero from sample()?

When there is a critical error that can't be resolved by a retry.

> And there would seem to be a problem elsewhere in the ldms code, is there not? The aggregator shouldn't flood the logs when a sampler returns non-zero from sample(), right? Do we know what is causing the "there is an outstanding update" message?

I don't think this is an infrastructure error. But it is a valid question: should the log be flooded with messages? What happens at midnight when the logs wrap and the first error is gone? The admin comes in at 8 AM and everything looks great, except that you're no longer collecting data in one or more subsystems.

In this particular investigation, the cxil error (and the loss of data collection) would not have been discovered unless the logs were flooded.

As an aside, in 4.5 we have more sophisticated message filtering that gives the administrator greater control over the messages that appear in the log, so they can effectively filter chatty plugins; it also lets them update the log level for plugins that are already running and suspected of having issues.
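To make the critical-versus-retryable distinction above concrete, here is an illustrative sketch. The errno-to-bucket mapping is guesswork, not anything the cxil library documents; settling that mapping is exactly the discovery work described above.

```c
#include <errno.h>
#include <stdbool.h>

/* Illustrative only: decide whether a sample() failure is worth stopping
 * the plug-in over.  Which cxil errors belong in which bucket is exactly
 * the open question in this thread. */
static bool error_is_critical(int err)
{
	switch (err) {
	case EINVAL:  /* bad configuration or a plugin bug; a retry won't fix it */
	case ENOMEM:  /* allocation failure inside the plugin */
		return true;
	default:
		/* Device busy, timed out, interface bouncing, etc.:
		 * log it and keep sampling. */
		return false;
	}
}
```

With a helper along these lines, sample() could log every failure but only return the error for the genuinely unrecoverable cases, e.g. `return error_is_critical(rc) ? rc : 0;`.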

@johnstile
Author

> @johnstile is that because the period during which the cxil library actually works is getting shorter? Is the Y-axis oversampled messages per second?

This is the number of oversampling messages per 2 minutes across all CPU compute nodes. It is a side effect that I can measure pretty easily; it is harder to search through all of the logs on all of the compute nodes for a particular string.

@tom95858
Collaborator

@johnstile, @morrone We should probably open an issue on this for LDMS internals. For example, shouldn't an updater that is getting 350k oversamples a second be stopped (assuming the sampler is dead) and periodically restarted?

@morrone
Collaborator

morrone commented Nov 23, 2024

> @johnstile, @morrone We should probably open an issue on this for LDMS internals. For example, shouldn't an updater that is getting 350k oversamples a second be stopped (assuming the sampler is dead) and periodically restarted?

I don't know what "oversamples" are, or why there would be hundreds of thousands of them a second. So maybe? :)

@morrone
Collaborator

morrone commented Nov 23, 2024

> When there is a critical error that can't be resolved by a retry.

I imagine that means a sampler should (almost?) never return an error, because short of a kernel failure, almost everything else on the node could resolve itself with time.

Do we document the semantics of sample() error codes anywhere (the fact that ldmsd will stop the sampler when an error is returned, rather than just noting that that round of sampling failed)? Is there any way to communicate to ldmsd that just this one round of sampling failed? Should we leave the metric set in "inconsistent" mode or something, instead of telling ldmsd that more directly?
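One possible reading of the "inconsistent" idea, sketched against the ldms.h transaction calls. Whether this is an acceptable way to signal a failed round to downstream consumers is exactly what is being asked here, and read_counters_into() is a hypothetical stand-in for the plugin's metric update loop.

```c
#include "ldms.h"

/* Hypothetical stand-in for the plugin's metric update loop. */
extern int read_counters_into(ldms_set_t set);

/* Sketch: if the read fails mid-sample, skip ldms_transaction_end() so the
 * set stays marked inconsistent for this round, instead of returning an
 * error code that makes ldmsd stop the plug-in. */
static int sample_one_set(ldms_set_t set)
{
	ldms_transaction_begin(set);
	if (read_counters_into(set))
		return 0; /* leave the transaction open: the set reads as
			   * inconsistent until a later round succeeds */
	ldms_transaction_end(set);
	return 0;
}
```

Consumers would then look at the set's consistency flag rather than the plugin's return code; whether the aggregator and stores handle that gracefully is part of the open question above.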

@morrone
Collaborator

morrone commented Nov 23, 2024

@johnstile I could be wrong, but the kernel message associated with these issues makes me think that your slingshot interface is bouncing (tearing down, and maybe coming back?). Could that be what is going on?

I am thinking I should also flush the slingshot_metrics device list cache when an error is encountered, so it doesn't keep trying a device that is no longer around.

@morrone
Collaborator

morrone commented Nov 23, 2024

> I am thinking I should also flush the slingshot_metrics device list cache when an error is encountered, so it doesn't keep trying a device that is no longer around.

Oh, cache_cxil_dev_close() removes it from the cache already. Right.
