WMAgent - cronjobs spam #12184

Closed
todor-ivanov opened this issue Nov 27, 2024 · 6 comments · Fixed by dmwm/CMSKubernetes#1569, #12185, #12189 or #12193
Labels: BUG, deployment (Issue related to deployment of the services), WMAgent

Comments

@todor-ivanov
Contributor

todor-ivanov commented Nov 27, 2024

Impact of the bug
WMAgent

Describe the bug
While fixing #12166, people reported spam behavior: the component-restart cronjob generates a lot of useless emails. The reason is that the crontab's stdout gets emailed to the user cmst1 and, consequently, to anybody to whom that mailbox has been redirected as well. One of the cronjob records in question is as follows:

*/15 * * * * source /data/WMAgent.venv3/bin/activate > /dev/null &&  source /data/WMAgent.venv3/deploy/restartComponent.sh 2>&1 >> /data/WMAgent.venv3/srv/wmagent/2.3.7/logs/component-restart.log
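
For illustration only (the actual fix went through the pull requests referenced above), one way to keep both stdout and stderr of such an entry out of cron's mail pipe is to place the file redirection before the 2>&1 merge, so that both streams end up in the log:

    # sketch of a quieter crontab entry; paths are the ones from the record above
    */15 * * * * source /data/WMAgent.venv3/bin/activate > /dev/null 2>&1 && source /data/WMAgent.venv3/deploy/restartComponent.sh >> /data/WMAgent.venv3/srv/wmagent/2.3.7/logs/component-restart.log 2>&1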

The situation becomes even worse when a machine deployment goes wrong and the component restarts intensify. We noticed this in testbed only because, with the virtualenv setup, we fixed the cronjob mailing mechanism first. Once the solution of #12159 (through dmwm/CMSKubernetes#1566) gets propagated to the Docker containers we run in production, we would be hit by an e-mail storm much bigger than what we noticed in testbed from just a single agent: one e-mail per run of every cronjob from all of the Docker containers in production.

And things do not stop there. In one of those cronjobs, we have a hardwired absolute path that is not coupled to the environment or the deployment root:

install="/data/srv/wmagent/current/install/"

This additionally leads to constantly scanning the wrong component logs for errors, especially in the virtualenv setup, and to indefinite restarts of perfectly healthy components. This path should be tied to the environment variable WMA_INSTALL_DIR.
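
A minimal sketch of how that path could be tied to the environment instead (keeping the current hardwired value only as a last-resort fallback):

    # prefer the deployment's own variable over a hardwired absolute path
    install="${WMA_INSTALL_DIR:-/data/srv/wmagent/current/install}"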

How to reproduce it
Just run the crontabs as currently generated at the agents
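
For instance, to inspect the generated entries on an agent (run as the agent user, e.g. cmst1 or cmst0):

    crontab -l | grep restartComponent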

Expected behavior

No e-mails to be generated by cronjobs' stdout.

Additional context and error message
None

@todor-ivanov
Contributor Author

OK ... I am reopening this issue because two problems were reported by the T0 team:

  • They found out that the component restarts failed due to wrong component name identification. It has been there before; I just happened to miss it myself. At this line:
    comps=$(ls $WMA_INSTALL_DIR)
(WMAgent.venv3) cmst0@vocms0502:~ $ comps=$(ls $WMA_INSTALL_DIR)
(WMAgent.venv3) cmst0@vocms0502:~ $ echo $comps 
AgentStatusWatcher/ AnalyticsDataCollector/ DBS3Upload/ ErrorHandler/ JobAccountant/ JobArchiver/ JobCreator/ JobStatusLite/ JobSubmitter/ JobTracker/ RetryManager/ RucioInjector/ TaskArchiver/ Tier0Feeder/

thanks @LinaresToine for reporting this

  • They have tested this mechanism in two of their production agents, and it turned out that the hardcoded grace period of 30 minutes before considering a component stale is far from enough, especially when it comes to HI runs:
    if (("$INTERVAL" >= 1800)); then

    As a result, the mechanism triggered on every run of the cronjob command, and only the component-naming error explained above saved them from entering a spiral of component restarts.

So this should be configurable!!!
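
As a rough illustration of what "configurable" could look like at the shell level (the variable name WMA_RESTART_GRACE_PERIOD is hypothetical, not an existing agent setting), the hardcoded threshold could become an overridable default:

    # hypothetical: allow the grace period to be overridden per deployment, defaulting to 30min
    GRACE_PERIOD=${WMA_RESTART_GRACE_PERIOD:-1800}
    if (("$INTERVAL" >= GRACE_PERIOD)); then
        echo "Component log quiet for more than ${GRACE_PERIOD}s, flagging for restart"
    fi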

@todor-ivanov todor-ivanov reopened this Nov 28, 2024
@LinaresToine

LinaresToine commented Nov 28, 2024

Thank you @todor-ivanov. Making the cron job configurable would be useful for us. After what we have seen this year, with the DBS3Uploader taking 2+ hours under heavy load, and now a similar symptom in the RucioInjector during the HI run, I believe that restarting these components after 30 minutes is not necessary. However, this does not mean we don't want to be warned about it, which WMStats already does. For other components, an automatic restart after more than 30 minutes may also be too much.

Is it possible to decouple the components from the cron job by creating one cron job per component? And have a configurable attribute in each component configuration that defines the grace period before an automatic restart?

For example (name can be improved):

config.RucioInjector.secondsBeforeRestart = 7200
config.JobCreator.secondsBeforeRestart = 1800
...

With this setup we would be able to modify the config according to the issue at hand.

@amaltaro @todor-ivanov would something like this be possible?
Thanks again for looking into this

@todor-ivanov
Contributor Author

todor-ivanov commented Nov 28, 2024

To add something I just noticed: I happened to miss the reported error about the bad component name resolution simply because the T0 and Production agents behave differently.

  • A T0 agent:
(WMAgent.venv3) cmst0@vocms0502:/data/tier0/WMAgent.venv3 $ ls $WMA_INSTALL_DIR/
AgentStatusWatcher/  AnalyticsDataCollector/  DBS3Upload/   ErrorHandler/  JobAccountant/  JobArchiver/  JobCreator/  JobStatusLite/  JobSubmitter/  JobTracker/  RetryManager/
RucioInjector/	     TaskArchiver/	      Tier0Feeder/

vs.

  • A Production agent:
(WMAgent.venv3) cmst1@vocms0260:WMAgent.venv3 $ ls $WMA_INSTALL_DIR/
AgentStatusWatcher      ArchiveDataReporter  ErrorHandler   JobArchiver  JobStatusLite  JobTracker  RetryManager   TaskArchiver     WorkQueueManager
AnalyticsDataCollector  DBS3Upload           JobAccountant  JobCreator   JobSubmitter   JobUpdater  RucioInjector  WorkflowUpdater

So this is definitely not the optimal way of identifying the list of enabled components at the agent.

P.S. And here is the reason why: the T0 account has the alias alias ls='/bin/ls -G -oFxb', and its -F flag appends a trailing "/" to directory names, which is what produces the component names shown above.
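
For illustration only (not necessarily what #12189 does), the listing can be made independent of how ls is aliased or configured by using a shell glob and basename instead:

    # enumerate component directories without relying on ls output formatting
    comps=""
    for dir in "$WMA_INSTALL_DIR"/*/; do
        comps="$comps $(basename "$dir")"
    done
    echo $comps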

@todor-ivanov
Contributor Author

todor-ivanov commented Nov 28, 2024

Hi @LinaresToine

  • About the reported issues:

Here is the first fix, for properly identifying the currently running components at the agent: #12189

I think this is possible, but it would require rewriting the whole restartComponent.sh script in Python. It definitely deserves a separate WMCore issue, which should also address the possibility of adding a per-component configurable parameter through the relevant component config section. That said, @LinaresToine, could you please create this issue and put this explanation there, so that we can properly plan, work on, and account for it? Thank you in advance!

In the meantime, the temporary solution for increasing the grace period before component restarts just for T0 agents is, unfortunately, again tweaks and sed-based shell hacks, as is done in other parts of the T0 agent setup process. I personally strongly dislike such an approach and would prefer this to be addressed the proper way, aside from the eventual discussion we must have in the team on two topics:

  • What should the exact logic of this component restart mechanism be, such that we properly address purely transient errors in the components and avoid an eventual spiral of component restarts triggered by externally driven load on the system?

  • Is the "naked" and unbuffered mailing mechanism the best way for alarming the audience about component misbehavior (not only restarts) ?

FYI: @vkuznet @anpicci @amaltaro @mapellidario @khurtado @klannon @d-ylee

@amaltaro
Contributor

amaltaro commented Dec 2, 2024

From this comment: #12184 (comment)

I think there is a misunderstanding about this functionality. This script will not restart a component that is running a long cycle (and/or under heavy load). It actually relies on the modification time of the component log (via the stat command). Provided we are logging something to the component log within 30 minutes, there should be no restart of the component (hence protecting slow components from being restarted merely for slow execution).
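
For context, a minimal illustration of such a stat-based freshness check (variable and file names here are assumptions, not the script's exact contents):

    # seconds since the component log was last written to
    LOG="$WMA_INSTALL_DIR/$comp/ComponentLog"
    INTERVAL=$(( $(date +%s) - $(stat -c %Y "$LOG") ))
    # only if INTERVAL exceeds the grace period is the component considered stale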

About making restarts configurable: if we are really serious about such a mechanism, then I think we need to take a step back and implement it properly, either with another component or with an actual supervisor application.
Or, even better, make components more resilient such that they (almost) never crash.

@amaltaro
Contributor

amaltaro commented Dec 4, 2024

@todor-ivanov as mentioned in my previous comment, and now also tested in an agent: indeed, with the latest changes the script no longer looks into components that are down. Can you please follow this up? I am reopening this issue.

UPDATE: see this comment please #12189 (comment)
