
slingshot_metrics sampler function returns an error, causing ldmsd to stop the sampler #1533

Open
johnstile opened this issue Nov 22, 2024 · 14 comments

Comments

@johnstile

The ldmsd log shows the sampler was stopped, while the other samplers continue working:

Nov 19 10:19:53 nidXXXXXX ldmsd[NNNNN]: Tue Nov 19 10:19:53 2024: ERROR : 'slingshot_metrics': failed to sample. Stopping the plug-in.

When cxi_read_counter fails, it closes the device and returns an error.
The sample function in slingshot_metrics passes that error back to ldmsd.
ldmsd disables the sampler.
The generation number stops incrementing.
The ldmsd aggregators start generating thousands of "there is an outstanding update" messages because the producer is no longer updating the set.
I was able to use ldmsd_controller to resume the sampler, so I'm not sure why it hit an error in the first place (this needs console investigation, but the NIC is alive).

Proposed fix discussion for @morrone or @tom95858

Never return an error from the sample function in the slingshot_info or slingshot_metrics samplers;
instead, log the error and return zero from the sample function so that ldmsd does not stop the sampler.
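For discussion, a minimal sketch of that pattern. This is not the actual slingshot_metrics code: read_slingshot_counters() and log_sampler_error() are hypothetical stand-ins for the real cxil read path and the ldmsd logging call, and the real plugin's sample() signature differs.

```c
#include <stdio.h>

/* Hypothetical stand-in for the real read path (e.g. cxi_read_counter());
 * it always fails here so the error handling can be exercised. */
static int read_slingshot_counters(void)
{
	return -1;
}

/* Hypothetical stand-in for the ldmsd log call used by the plugin. */
static void log_sampler_error(const char *msg)
{
	fprintf(stderr, "ERROR : 'slingshot_metrics': %s\n", msg);
}

/* The proposed pattern: log the failure, but return 0 so ldmsd keeps
 * scheduling the sampler and the error can clear on a later pass. */
static int sample(void)
{
	if (read_slingshot_counters() != 0) {
		log_sampler_error("failed to sample; will retry on the next interval");
		return 0; /* non-fatal: do not let ldmsd stop the plug-in */
	}
	return 0;
}

int main(void)
{
	return sample();
}
```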

@johnstile
Author

Every time we see slingshot_metrics stop, the only kernel message I see is the one about stopping the automatic telemetry cache refresh.
Nov 22 03:28:55 nidXXXXXX ldmsd[92889]: Fri Nov 22 03:28:55 2024: ERROR : 'slingshot_metrics': failed to sample. Stopping the plug-in.
There is a corresponding kernel message:
Nov 22 03:28:55 nidXXXXXX kernel: cxi_core 0000:21:00.0: cxi0[hsn0] stopping automatic telemetry cache refresh

@tom95858
Collaborator

@johnstile we probably need to get HPE involved to explain the kernel message: what it means and what causes it. Obviously it's not a terminal condition, or restarting the sampler would not have worked; i.e., we would have restarted the sampler and it would have stopped right away with another error.

@johnstile
Author

johnstile commented Nov 22, 2024

This graph shows the rate at which we see oversampling messages caused by the slingshot_metrics sampler deactivating; the impact on monitoring the system grows linearly. The point where the slope resets is where I restarted ldmsd on all of the nodes that had deactivated the sampler, so the problem is recurring.

[Screenshot_20241122_094408: graph of the oversampling message rate over time]

@tom95858
Collaborator

> This graph shows the rate at which we see oversampling messages caused by the slingshot_metrics sampler deactivating; the impact on monitoring the system grows linearly. The point where the slope resets is where I restarted ldmsd on all of the nodes that had deactivated the sampler, so the problem is recurring.

Wow, this sure looks like a cxil library bug.

@tom95858
Collaborator

@johnstile is that because the period during which the cxil library actually works is getting shorter? Is the Y-axis oversampled messages per second?

@tom95858
Collaborator

@morrone, are you going to make the log-message/return code changes to slingshot_metrics? I can do it, but I don't want to duplicate the work.

@morrone
Collaborator

morrone commented Nov 22, 2024

@tom95858 I am looking at it now. One question in my mind is: when is it appropriate for a sampler plugin to return non-zero from sample()?

And there would seem to be a problem elsewhere in the ldms code, is there not? The aggregator shouldn't flood the logs when a sampler returns non-zero from sample(), right? Do we know what is causing the "there is an outstanding update" message?

@tom95858
Collaborator

@morrone, in this context, a return code of zero just means "Don't stop the sampler."

I think we need to do some discovery and interaction with the cxil authors to properly understand what the right thing to do is: which errors are critical, which are retryable, etc. (see the sketch at the end of this comment).

> @tom95858 I am looking at it now. One question in my mind is: when is it appropriate for a sampler plugin to return non-zero from sample()?

When there is a critical error that can't be resolved by a retry.

> And there would seem to be a problem elsewhere in the ldms code, is there not? The aggregator shouldn't flood the logs when a sampler returns non-zero from sample(), right? Do we know what is causing the "there is an outstanding update" message?

I don't think this is an infrastructure error. But it is a valid question: should the log be flooded with messages? What happens at midnight when the logs wrap and the first error is gone? The admin comes in at 8 AM and everything looks great, except that you're no longer collecting data in one or more subsystems.

In this particular investigation, the cxil error (and the loss of data collection) would not have been discovered unless the logs were flooded.

As an aside, in 4.5 we have more sophisticated message filtering that gives the administrator greater control over the messages that appear in the log, so they can effectively filter chatty plugins; it also lets them update the log level for plugins that are already running and suspected of having issues.
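To make the critical-versus-retryable distinction above concrete, here is an illustrative sketch. The errno-to-bucket mapping is guesswork, not anything the cxil library documents; settling that mapping is exactly the discovery work described above.

```c
#include <errno.h>
#include <stdbool.h>

/* Illustrative only: decide whether a sample() failure is worth stopping
 * the plug-in over.  Which cxil errors belong in which bucket is exactly
 * the open question in this thread. */
static bool error_is_critical(int err)
{
	switch (err) {
	case EINVAL:  /* bad configuration or a plugin bug; a retry won't fix it */
	case ENOMEM:  /* allocation failure inside the plugin */
		return true;
	default:
		/* Device busy, timed out, interface bouncing, etc.:
		 * log it and keep sampling. */
		return false;
	}
}
```

With a helper along these lines, sample() could log every failure but only return the error for the genuinely unrecoverable cases, e.g. `return error_is_critical(rc) ? rc : 0;`.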

@johnstile
Author

> @johnstile is that because the period during which the cxil library actually works is getting shorter? Is the Y-axis oversampled messages per second?

This is the number of oversampling messages per 2 minutes across all CPU compute nodes. It is a side effect that I can measure pretty easily; it is harder to search through all of the logs on all of the compute nodes for a particular string.

@tom95858
Collaborator

@johnstile, @morrone We should probably open an issue on this for LDMS internals. For example, shouldn't an updater that is getting 350k oversamples a second be stopped (assuming the sampler is dead) and periodically restarted?

@morrone
Collaborator

morrone commented Nov 23, 2024

> @johnstile, @morrone We should probably open an issue on this for LDMS internals. For example, shouldn't an updater that is getting 350k oversamples a second be stopped (assuming the sampler is dead) and periodically restarted?

I don't know what "oversamples" are, or why there would be hundreds of thousands of them a second. So maybe? :)

@morrone
Collaborator

morrone commented Nov 23, 2024

> When there is a critical error that can't be resolved by a retry.

I imagine that means a sampler should (almost?) never return an error, because short of a kernel failure, almost everything else on the node could resolve itself with time.

Do we document the semantics of sample() error codes anywhere (the fact that ldmsd will stop the sampler when an error is returned, rather than just noting that that round of sampling failed)? Is there any way to communicate to ldmsd that just this one round of sampling failed? Should we leave the metric set in "inconsistent" mode or something, instead of telling ldmsd that more directly?
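One possible reading of the "inconsistent" idea, sketched against the ldms.h transaction calls. Whether this is an acceptable way to signal a failed round to downstream consumers is exactly what is being asked here, and read_counters_into() is a hypothetical stand-in for the plugin's metric update loop.

```c
#include "ldms.h"

/* Hypothetical stand-in for the plugin's metric update loop. */
extern int read_counters_into(ldms_set_t set);

/* Sketch: if the read fails mid-sample, skip ldms_transaction_end() so the
 * set stays marked inconsistent for this round, instead of returning an
 * error code that makes ldmsd stop the plug-in. */
static int sample_one_set(ldms_set_t set)
{
	ldms_transaction_begin(set);
	if (read_counters_into(set))
		return 0; /* leave the transaction open: the set reads as
			   * inconsistent until a later round succeeds */
	ldms_transaction_end(set);
	return 0;
}
```

Consumers would then look at the set's consistency flag rather than the plugin's return code; whether the aggregator and stores handle that gracefully is part of the open question above.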

@morrone
Collaborator

morrone commented Nov 23, 2024

@johnstile I could be wrong, but the kernel message associated with these issues makes me think that your slingshot interface is bouncing (tearing down, and maybe coming back?). Could that be what is going on?

I am thinking I should also flush the slingshot_metrics device list cache when an error is encountered, so it doesn't keep trying a device that is no longer around.

@morrone
Collaborator

morrone commented Nov 23, 2024

> I am thinking I should also flush the slingshot_metrics device list cache when an error is encountered, so it doesn't keep trying a device that is no longer around.

Oh, cache_cxil_dev_close() removes it from the cache already. Right.
