MSOutput Should not silently tolerate exceptions during rule creation #11941

klannon · 2024-03-20T19:16:15Z

Impact of the bug
MSOutput

Describe the bug
A change in Rucio caused tape output rule creation to fail, and we missed this for 3 weeks, causing 7 PB of tape transfers to pile up.

How to reproduce it
Break Rucio

Expected behavior
If MSOutput fails to create a rule, it should trigger an alarm, at least if it fails for multiple cycles.

Additional context and error message
None

amaltaro · 2024-07-23T11:09:10Z

And I just stumbled upon this issue after resolving another 3-4 week outage of rule creation in MSOutput, addressed in this ticket: #12044

I am setting this ticket to Q4 so that we can at least implement an alarm and get notified when the whole MSOutputConsumer cycle is skipped.

mapellidario · 2024-10-08T15:37:36Z

After a private discussion with Alan, we decided that I can start working on this issue. So far, we have agreed on a two-pronged approach:

1. Make sure that if MSOutput has trouble processing a single workflow, it continues with the next workflow and does not break out of the loop.
2. Send monitoring data to a new AMQ topic (or re-use an existing topic with a new document type).
   - The idea is to have a single plot with a time series of the number of workflows that MSOutput needs to process.
   - If MSOutput fails to process one or more workflows, it will be visible as a non-zero baseline in this new plot. I like this approach because it worked reliably for CRAB's Publisher monitoring.
   - We can set up a Grafana alert on this plot (send a message if the number of workflows is above a threshold for longer than X days).
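
To make the two points above concrete, here is a minimal sketch, not the actual MSOutput code. The names `process_workflows`, `make_rules`, `build_monitoring_doc` and the document fields are hypothetical and only illustrate the idea: catch the exception per workflow, keep looping, and report counts that a dashboard can plot.

```python
import logging
import time

logger = logging.getLogger("msoutput_sketch")


def process_workflows(workflows, make_rules):
    """Process every workflow; a failure on one does not stop the rest.

    `make_rules` is a hypothetical callable standing in for whatever
    creates the output rules for a single workflow.
    """
    failed = []
    for workflow in workflows:
        try:
            make_rules(workflow)
        except Exception:
            # Log with the traceback, then continue with the next workflow
            # instead of breaking out of the loop.
            logger.exception("Failed to process workflow %s", workflow)
            failed.append(workflow)
    return failed


def build_monitoring_doc(total, failed, now=None):
    """Build a (hypothetical) document for the monitoring topic."""
    return {
        "type": "msoutput_backlog",  # assumed document type, not a real one
        "timestamp": int(now if now is not None else time.time()),
        "workflows_to_process": total,
        "workflows_failed": len(failed),
    }
```

If failures persist across cycles, `workflows_to_process` never drops back to zero, which is the non-zero baseline described above.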

We considered having MSOutput send an alert if it fails to process a workflow N times, but that would require implementing new logic to keep track of past attempts. That is too much effort spent on new code when we can achieve the same result with the existing monitoring infrastructure.
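
For illustration, the alert condition on the time series ("above a threshold for longer than X days") needs only the plotted samples, with no per-workflow attempt counters. The sketch below is an assumption about how such a rule could be evaluated, not how Grafana implements it; `should_alert` and its parameters are hypothetical.

```python
def should_alert(samples, threshold, max_duration):
    """Return True if the count has stayed above `threshold` for longer
    than `max_duration` seconds, up to and including the latest sample.

    `samples` is a list of (timestamp_seconds, count) sorted by time.
    """
    above_since = None
    for timestamp, count in samples:
        if count > threshold:
            # Remember when the current above-threshold streak started.
            if above_since is None:
                above_since = timestamp
        else:
            # Any sample at or below the threshold resets the streak.
            above_since = None
    if above_since is None:
        return False
    return samples[-1][0] - above_since > max_duration
```

A single cycle that clears the backlog resets the streak, so only a persistent baseline triggers the message.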

amaltaro added this to WMCore quarterly developments Jul 23, 2024

amaltaro moved this to ToDo in WMCore quarterly developments Jul 23, 2024

amaltaro assigned mapellidario Oct 8, 2024

amaltaro moved this from ToDo to In Progress in WMCore quarterly developments Oct 8, 2024

amaltaro added Monitoring MSOutput BUG labels Oct 8, 2024

mapellidario linked a pull request Dec 4, 2024 that will close this issue

ms-output - do not break out producer and consumer loops #12194

Open
