Publisher stop.sh fails if the Publisher process stops for longer than the next cycle #8442
Comments
Simply checking whether the Publisher process is still running, before checking the time, might resolve the issue.
That's surely good! But maybe we can also introduce a timeout. Originally the idea was that at times a single Publisher iteration could take hours, and we did not want a timeout plus hard kill. Publisher was not designed to be stateless and idempotent, so I opted for "safety". But we were only using stop when issued by an operator. Now that we are trying to have something automatic, and have enough experience, we can probably try a timeout of 2h.
After #8561, it rarely happens on prod/preprod, but it happens quite often on the test env and is super annoying. My new idea is to use flock, as in CRABServer/src/python/TaskWorker/Actions/PostJob.py Lines 426 to 428 in ce559a1
and wrap it around CRABServer/src/python/Publisher/Main.py Line 69 in ce559a1
In the process controller script, just check whether the flock file exists. Is this over-engineering, @belforte?
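The flock idea above could be sketched roughly as follows. This is only an illustration, not the code at the referenced PostJob.py or Main.py lines: the function names `acquire_lock` and `is_locked` and the lock file path are hypothetical. Publisher would hold the lock for as long as it runs, and the controller script would probe it.

```python
import fcntl

def acquire_lock(path):
    """Take an exclusive, non-blocking lock on path.

    Returns the open file object (keep it open to keep the lock),
    or None if the lock is already held through another open file.
    """
    handle = open(path, "a")
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle

def is_locked(path):
    """Controller-side probe (hypothetical): True if someone holds the lock."""
    probe = acquire_lock(path)
    if probe is None:
        return True
    # Nobody held it: release our probe lock immediately.
    fcntl.flock(probe, fcntl.LOCK_UN)
    probe.close()
    return False
```

Probing the lock itself, rather than only checking that the file exists, avoids a stale file being mistaken for a running Publisher, because the kernel drops a flock lock when the holding process exits.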
#856 does not exist. Which issue did you want to point to? PostJob has locks to prevent multiple PJs running at the same time from trying to update the same file at the same time. But here we should not need such locks. Or maybe I am confused :-), as usual.
Modify isPublisherBusy to do a "ps"?
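A minimal sketch of such a liveness check, assuming the Publisher PID is known to the controller (for example from a PID file, which is an assumption and not something this thread confirms). The helper name `is_process_alive` is hypothetical; it uses signal 0, which performs the existence check without delivering a signal.

```python
import os

def is_process_alive(pid):
    """Return True if a process with this PID currently exists."""
    try:
        os.kill(pid, 0)  # signal 0: error checking only, nothing is sent
    except ProcessLookupError:
        return False     # no such process
    except PermissionError:
        return True      # process exists but belongs to another user
    return True
```

A busy check could then return "not busy" right away when the process is gone, instead of waiting on a log line that nobody will update.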
Ok, I remember why I need But...The real problem is that The real(?) solution would be to set a timeout, as you said in #8442 (comment) . |
But still... wouldn't your original idea #8442 (comment) be enough?
If you mean to kill Publisher 5min after the "next iteration will start at ..." time, I am kind of lost.
Yes, and it is already implemented in #8562.
No. Change stop.sh from "waiting forever" to "exit with error if waiting longer than 5mins".
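The "exit with error if waiting longer than 5mins" behaviour can be sketched as a bounded polling loop. This is an illustration in Python, not the change made to stop.sh; `wait_until_idle` and its parameters are hypothetical names, and `is_busy` stands in for whatever busy check the script uses.

```python
import time

def wait_until_idle(is_busy, timeout_s=300, poll_s=5,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll is_busy() until it returns False.

    Returns True once idle, or False if timeout_s elapses first.
    clock and sleep are injectable so the loop can be tested quickly.
    """
    deadline = clock() + timeout_s
    while is_busy():
        if clock() >= deadline:
            return False
        sleep(poll_s)
    return True
```

A caller would exit with a non-zero status when this returns False, so that the deployment job fails visibly instead of hanging.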
😲
This is done in #8737
The stop.sh script (or manage.sh stop in the new PyPI image) will wait forever if it is run after the "Next cycle" time has passed and no Publisher process is running.
The script parses the timestamp from the last log line of logs/log.txt:
CRABServer/src/script/Deployment/Publisher/stop.sh
Lines 19 to 23 in 97bf591
The delta keeps decreasing as the current time (epoch) increases. If no process is running, nothing updates the log line and the script is stuck forever.
The workaround is to exec into the container and kill /bin/bash ./stop.sh to allow Deploy_TW to continue.
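The arithmetic behind the hang can be shown with a small sketch. It does not reproduce the referenced stop.sh lines; the function name and the epoch values are made up for illustration. It only assumes that the delta is the "Next cycle" time from the log minus the current time.

```python
def seconds_to_next_cycle(next_cycle_epoch, now_epoch):
    """Delta between the logged "Next cycle" time and the current time."""
    return next_cycle_epoch - now_epoch

# Illustrative values: the log line is stale, so next_cycle_epoch never changes.
stale_next_cycle = 1_700_000_000
deltas = [seconds_to_next_cycle(stale_next_cycle, stale_next_cycle + elapsed)
          for elapsed in (60, 120, 180)]
# The delta is already negative and only moves further from zero over time.
```

With no Publisher process left to write a fresh "Next cycle" line, the value the script is waiting on never recovers.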