WMAgent - cronjobs spam #12184
OK ... I am reopening this issue because two problems were reported by the T0 team:
thanks @LinaresToine for reporting this
So this should be configurable!
Thank you @todor-ivanov. Having the cron job configurable would be useful for us. Given what we have seen this year, with DBS3Uploader taking 2+ hours under heavy load, and now a similar symptom in RucioInjector during the HI run, I believe that restarting these components after 30 minutes is not necessary. That does not mean we don't want to be warned about it, which WMStats already does. For other components, an automatic restart after more than 30 minutes may be too aggressive. Would it be possible to decouple the components from the cron job by creating one cron job per component, with a configurable attribute in each component's configuration that defines the grace period before an automatic restart? For example (the name can be improved):
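A hypothetical sketch of what such a per-component attribute could look like. Note that `restartGracePeriod` and the `ComponentConfig` class are illustrative assumptions, not existing WMCore configuration parameters:

```python
# Hypothetical sketch only: "restartGracePeriod" is NOT an existing WMCore
# configuration attribute; it illustrates a per-component grace period
# before the watchdog cron job would restart the component.

class ComponentConfig:
    """Minimal stand-in for one component's section of the agent config."""
    def __init__(self, name, restartGracePeriod=30 * 60):
        self.name = name
        # Seconds of component-log silence tolerated before a restart.
        self.restartGracePeriod = restartGracePeriod

def should_restart(component, seconds_since_last_log_update):
    """Restart only when the log has been silent past the grace period."""
    return seconds_since_last_log_update > component.restartGracePeriod

# Slow-but-healthy components could get a longer grace period:
dbs3Upload = ComponentConfig("DBS3Upload", restartGracePeriod=3 * 3600)
rucioInjector = ComponentConfig("RucioInjector", restartGracePeriod=2 * 3600)
errorHandler = ComponentConfig("ErrorHandler")  # keeps the default 30 minutes

print(should_restart(dbs3Upload, 2 * 3600))   # 2h silence, within 3h grace -> False
print(should_restart(errorHandler, 45 * 60))  # 45min exceeds 30min grace -> True
```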
With this setup we would be able to modify the config according to the issue at hand. @amaltaro @todor-ivanov would something like this be possible?
To add something I just noticed: I happened to miss the error reported about the bad component-name resolution, simply because the T0 and Production agents behave differently.
vs.
So this is definitely not the optimal way of identifying the list of enabled components at the agent.
Here is the first solution for properly estimating the currently running components at the agent: #12189
I think this is possible, but it would require rewriting the whole script. In the meantime, the temporary solution for increasing the grace period before component restarts, just for the T0 agents, is unfortunately manual tweaks again.
FYI: @vkuznet @anpicci @amaltaro @mapellidario @khurtado @klannon @d-ylee
From this comment: #12184 (comment) I think there is a misunderstanding of this functionality. The script will not restart a component that is running a long cycle (and/or under heavy load). It actually relies on the last-modification time of the component log (via the stat command). Provided we log something to the component log within 30 minutes, there should be no restart of the component (hence protecting slow components against slow execution). About making restarts configurable: if we are really serious about such a mechanism, then I think we need to take a step back and implement it properly, either with another component or with an actual supervisor application.
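A minimal sketch of the staleness check described above, assuming GNU `stat` (Linux); the path and the 30-minute threshold are illustrative, not the exact contents of the script:

```shell
#!/bin/bash
# Illustrative only: decide whether a component log has gone silent for more
# than 30 minutes. A freshly modified log means the component is alive,
# however slowly its cycle may be running.
LOG=/tmp/demo-component/ComponentLog
mkdir -p "$(dirname "$LOG")"
touch "$LOG"                    # freshly touched, so it counts as "recent"

NOW=$(date +%s)
LAST_MOD=$(stat -c %Y "$LOG")   # last-modification time, GNU stat syntax
AGE=$(( NOW - LAST_MOD ))

if [ "$AGE" -gt 1800 ]; then
    echo "stale: candidate for restart"
else
    echo "fresh: no restart needed"
fi
```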
@todor-ivanov as mentioned in my previous comment, and now also tested in an agent, with the latest changes the script indeed no longer looks into components that are down. Can you please follow this up? I am reopening this issue. UPDATE: please see this comment: #12189 (comment)
Impact of the bug
WMAgent
Describe the bug
Upon fixing #12166, spam behavior was reported: the restart-component cronjob is generating a lot of useless e-mails. The reason is that crontab's stdout gets e-mailed to the user
cmst1
and consequently to anybody whose mailbox it has been redirected to as well. One of the cronjob records in question is as follows: The situation becomes even worse when a machine deployment goes wrong and the component restarts intensify. We happened to notice this in testbed only because, with the virtualenv setup, we fixed the cronjobs' mailing mechanism first. Once the solution of #12159 (through dmwm/CMSKubernetes#1566) gets propagated to the Docker containers we run in production, we would have been hit by an e-mail storm much bigger than what we noticed in testbed from a single agent: one e-mail for every run of every cronjob from all of the Docker containers in production.
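One common mitigation, shown here as an illustrative crontab line (the script path and schedule are examples, not the production entry), is to redirect the job's stdout and stderr to a log file so cron has nothing to e-mail:

```shell
# Illustrative crontab entry; the path and the */15 schedule are examples.
# Redirecting both stdout and stderr means cron never mails the output.
CRON_LINE='*/15 * * * * /data/admin/wmagent/restartComponent.sh >> /data/admin/wmagent/logs/restartComponent.log 2>&1'
echo "$CRON_LINE"
```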
And things do not stop there. In one of those cronjobs we have a hardwired absolute path, coupled to neither the environment nor the deployment root:
WMCore/deploy/restartComponent.sh
Line 15 in 90a14e5
This additionally leads to constantly scanning the wrong component logs for errors, especially in the virtual-env setup, and causes indefinite restarts of perfectly healthy components. The path should be tied to the environment variable:
WMA_INSTALL_DIR
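A sketch of the suggested fix, assuming the agent environment exports `WMA_INSTALL_DIR` (the fallback value and component name below are illustrative):

```shell
#!/bin/bash
# Illustrative: derive the component log path from the deployment environment
# instead of hardwiring an absolute path. The fallback directory is an example.
export WMA_INSTALL_DIR="${WMA_INSTALL_DIR:-/data/srv/wmagent/current/install}"
COMPONENT="DBS3Upload"
LOGFILE="${WMA_INSTALL_DIR}/${COMPONENT}/ComponentLog"
echo "$LOGFILE"
```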
How to reproduce it
Just run the crontabs as currently generated at the agents.
Expected behavior
No e-mails should be generated from cronjobs' stdout.
Additional context and error message
None