WMAgent - cronjobs spam #12184

Closed
todor-ivanov opened this issue Nov 27, 2024 · 6 comments · Fixed by dmwm/CMSKubernetes#1569, #12185, #12189 or #12193
Labels: BUG, deployment (Issue related to deployment of the services), WMAgent

Comments

@todor-ivanov
Contributor

todor-ivanov commented Nov 27, 2024

Impact of the bug
WMAgent

Describe the bug
While fixing #12166, people reported spam behavior: the component-restart cronjob generates a lot of useless emails. The reason is that the crontab's stdout gets emailed to the user cmst1 and, consequently, to anybody to whom that mailbox has been redirected as well. One of the cronjob records in question is as follows:

*/15 * * * * source /data/WMAgent.venv3/bin/activate > /dev/null &&  source /data/WMAgent.venv3/deploy/restartComponent.sh 2>&1 >> /data/WMAgent.venv3/srv/wmagent/2.3.7/logs/component-restart.log
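
For illustration only (the actual fix went through the pull requests referenced above), one way to keep both stdout and stderr of such an entry out of cron's mail pipe is to place the file redirection before the 2>&1 merge, so that both streams end up in the log:

    # sketch of a quieter crontab entry; paths are the ones from the record above
    */15 * * * * source /data/WMAgent.venv3/bin/activate > /dev/null 2>&1 && source /data/WMAgent.venv3/deploy/restartComponent.sh >> /data/WMAgent.venv3/srv/wmagent/2.3.7/logs/component-restart.log 2>&1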

The situation becomes even worse when a machine deployment goes wrong and the component restarts intensify. We noticed this in testbed only because, with the virtualenv setup, we fixed the cronjob mailing mechanism first. Once the solution of #12159 (through dmwm/CMSKubernetes#1566) gets propagated to the Docker containers we run in production, we would be hit by an e-mail storm much bigger than what we noticed in testbed from just a single agent: one e-mail per run of every cronjob from all of the Docker containers in production.

And things do not stop there. In one of those cronjobs, we have a hardwired absolute path that is not coupled to the environment or the deployment root:

install="/data/srv/wmagent/current/install/"

This additionally leads to constantly scanning the wrong component logs for errors, especially in the virtualenv setup, and to indefinite restarts of perfectly healthy components. This path should be tied to the environment variable WMA_INSTALL_DIR.
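
A minimal sketch of how that path could be tied to the environment instead (keeping the current hardwired value only as a last-resort fallback):

    # prefer the deployment's own variable over a hardwired absolute path
    install="${WMA_INSTALL_DIR:-/data/srv/wmagent/current/install}"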

How to reproduce it
Just run the crontabs as currently generated at the agents
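
For instance, to inspect the generated entries on an agent (run as the agent user, e.g. cmst1 or cmst0):

    crontab -l | grep restartComponent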

Expected behavior

No e-mails to be generated by cronjobs' stdout.

Additional context and error message
None

@todor-ivanov
Contributor Author

OK ... I am reopening this issue because two problems were reported by the T0 team:

  • They found out that the component restarts failed due to wrong component name identification. It has been there before; I just happened to miss it myself. At this line:
    comps=$(ls $WMA_INSTALL_DIR)
(WMAgent.venv3) cmst0@vocms0502:~ $ comps=$(ls $WMA_INSTALL_DIR)
(WMAgent.venv3) cmst0@vocms0502:~ $ echo $comps 
AgentStatusWatcher/ AnalyticsDataCollector/ DBS3Upload/ ErrorHandler/ JobAccountant/ JobArchiver/ JobCreator/ JobStatusLite/ JobSubmitter/ JobTracker/ RetryManager/ RucioInjector/ TaskArchiver/ Tier0Feeder/

thanks @LinaresToine for reporting this

  • They have tested this mechanism in two of their production agents, and it turned out that the hardcoded grace period of 30 minutes before considering a component stale is far from enough, especially when it comes to HI runs:
    if (("$INTERVAL" >= 1800)); then

    As a result, the mechanism triggered on every run of the cronjob command, and only the component-naming error explained above saved them from entering a spiral of component restarts.

So this should be configurable!!!
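
As a rough illustration of what "configurable" could look like at the shell level (the variable name WMA_RESTART_GRACE_PERIOD is hypothetical, not an existing agent setting), the hardcoded threshold could become an overridable default:

    # hypothetical: allow the grace period to be overridden per deployment, defaulting to 30min
    GRACE_PERIOD=${WMA_RESTART_GRACE_PERIOD:-1800}
    if (("$INTERVAL" >= GRACE_PERIOD)); then
        echo "Component log quiet for more than ${GRACE_PERIOD}s, flagging for restart"
    fi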

@todor-ivanov todor-ivanov reopened this Nov 28, 2024
@LinaresToine

LinaresToine commented Nov 28, 2024

Thank you @todor-ivanov. Making the cron job configurable would be useful for us. After what we have seen this year, with the DBS3Uploader taking 2+ hours under heavy load, and now a similar symptom in the RucioInjector during the HI run, I believe that restarting these components after 30 minutes is not necessary. However, this does not mean we don't want to be warned about it, which WMStats already does. For other components, an automatic restart after more than 30 minutes may also be too much.

Is it possible to decouple the components from the cron job by creating one cron job per component? And have a configurable attribute in each component configuration that defines the grace period before an automatic restart?

For example (name can be improved):

config.RucioInjector.secondsBeforeRestart = 7200
config.JobCreator.secondsBeforeRestart = 1800
...

With this setup we would be able to modify the config according to the issue at hand.

@amaltaro @todor-ivanov would something like this be possible?
Thanks again for looking into this

@todor-ivanov
Contributor Author

todor-ivanov commented Nov 28, 2024

To add something I just noticed: I happened to miss the reported error about the bad component name resolution simply because the T0 and Production agents behave differently.

  • A T0 agent:
(WMAgent.venv3) cmst0@vocms0502:/data/tier0/WMAgent.venv3 $ ls $WMA_INSTALL_DIR/
AgentStatusWatcher/  AnalyticsDataCollector/  DBS3Upload/   ErrorHandler/  JobAccountant/  JobArchiver/  JobCreator/  JobStatusLite/  JobSubmitter/  JobTracker/  RetryManager/
RucioInjector/	     TaskArchiver/	      Tier0Feeder/

vs.

  • A Production agent:
(WMAgent.venv3) cmst1@vocms0260:WMAgent.venv3 $ ls $WMA_INSTALL_DIR/
AgentStatusWatcher      ArchiveDataReporter  ErrorHandler   JobArchiver  JobStatusLite  JobTracker  RetryManager   TaskArchiver     WorkQueueManager
AnalyticsDataCollector  DBS3Upload           JobAccountant  JobCreator   JobSubmitter   JobUpdater  RucioInjector  WorkflowUpdater

So this is definitely not the optimal way of identifying the list of enabled components at the agent.

P.S. And here is the reason why: the T0 account has the alias alias ls='/bin/ls -G -oFxb', and its -F flag appends a trailing "/" to directory names, which is what produces the component names shown above.
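
For illustration only (not necessarily what #12189 does), the listing can be made independent of how ls is aliased or configured by using a shell glob and basename instead:

    # enumerate component directories without relying on ls output formatting
    comps=""
    for dir in "$WMA_INSTALL_DIR"/*/; do
        comps="$comps $(basename "$dir")"
    done
    echo $comps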

@todor-ivanov
Contributor Author

todor-ivanov commented Nov 28, 2024

Hi @LinaresToine

  • About the reported issues:

Here is the first fix, for properly identifying the currently running components at the agent: #12189

I think this is possible, but it would require rewriting the whole restartComponent.sh script in Python. It definitely deserves a separate WMCore issue, which should also address the possibility of adding a per-component configurable parameter through the relevant component config section. That said, @LinaresToine, could you please create this issue and put this explanation there, so that we can properly plan, work on, and account for it? Thank you in advance!

In the meantime, the temporary solution for increasing the grace period before component restarts just for T0 agents is, unfortunately, again tweaks and sed-based shell hacks, as is done in other parts of the T0 agent setup process. I personally strongly dislike such an approach and would prefer this to be addressed the proper way, aside from the eventual discussion we must have in the team on two topics:

  • What should the exact logic of this component restart mechanism be, such that we properly address purely transient errors in the components and avoid an eventual spiral of component restarts triggered by externally driven load on the system?

  • Is the "naked" and unbuffered mailing mechanism the best way for alarming the audience about component misbehavior (not only restarts) ?

FYI: @vkuznet @anpicci @amaltaro @mapellidario @khurtado @klannon @d-ylee

@amaltaro
Contributor

amaltaro commented Dec 2, 2024

From this comment: #12184 (comment)

I think there is a misunderstanding about this functionality. This script will not restart a component that is running a long cycle (and/or under heavy load). It actually relies on the modification time of the component log (via the stat command). Provided we are logging something to the component log within 30 minutes, there should be no restart of the component (hence protecting slow components from being restarted merely for slow execution).
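
For context, a minimal illustration of such a stat-based freshness check (variable and file names here are assumptions, not the script's exact contents):

    # seconds since the component log was last written to
    LOG="$WMA_INSTALL_DIR/$comp/ComponentLog"
    INTERVAL=$(( $(date +%s) - $(stat -c %Y "$LOG") ))
    # only if INTERVAL exceeds the grace period is the component considered stale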

About making restarts configurable: if we are really serious about such a mechanism, then I think we need to take a step back and implement it properly, either with another component or with an actual supervisor application.
Or, even better, make components more resilient such that they (almost) never crash.

@amaltaro
Contributor

amaltaro commented Dec 4, 2024

@todor-ivanov as mentioned in my previous comment, and now also tested in an agent: indeed, with the latest changes the script no longer looks into components that are down. Can you please follow this up? I am reopening this issue.

UPDATE: see this comment please #12189 (comment)
