Publisher stop.sh fails if the Publisher process stops for longer than the next cycle #8442

Closed
novicecpp opened this issue May 29, 2024 · 11 comments

Comments

@novicecpp
Contributor

The stop.sh script (or manage.sh stop in the new PyPI image) waits forever if it is run after the "Next cycle" time has passed and no Publisher process is running.

The script parses the timestamp from the last log line of logs/log.txt:

2024-05-29 11:13:12,912:INFO:PublisherMaster,296:Next cycle will start at 11:39:01

start=`echo $lastLine|awk '{print $NF}'`
startTime=`date -d ${start} +%s` # in seconds from Epoch
now=`date +%s` # in seconds from Epoch
delta=$((${startTime}-${now}))
if [ $delta -gt 60 ]; then

The delta keeps decreasing as the current time (epoch) increases. No process running means nothing will update the log line, and the script will be stuck forever.

The workaround is to exec into the container and kill /bin/bash ./stop.sh so that Deploy_TW can continue.
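A small sketch of that arithmetic, assuming GNU date (for date -d) and a hypothetical helper name, seconds_until_next_cycle, which is not part of stop.sh. "now" is passed in as an argument so that both the "before" and "after" cases can be shown:

```shell
#!/bin/bash
# Hypothetical helper (not from stop.sh): seconds from "now" until the
# "Next cycle" time found in the last log line. Requires GNU date.
seconds_until_next_cycle() {
    local last_line="$1"   # last line of the Publisher log
    local now="$2"         # current time, in seconds since the Epoch
    local start start_time
    start=$(echo "$last_line" | awk '{print $NF}')   # e.g. 11:39:01
    start_time=$(date -d "$start" +%s)               # resolved against today's date
    echo $(( start_time - now ))
}

line='2024-05-29 11:13:12,912:INFO:PublisherMaster,296:Next cycle will start at 11:39:01'
cycle=$(date -d "11:39:01" +%s)

# Five minutes before the announced time the delta is positive...
seconds_until_next_cycle "$line" $(( cycle - 300 ))
# ...and five minutes after it the delta is negative and keeps shrinking.
seconds_until_next_cycle "$line" $(( cycle + 300 ))
```

Once the announced time has passed and nothing rewrites the log line, the value only moves further below zero for the rest of the day.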

@novicecpp novicecpp self-assigned this May 29, 2024
@novicecpp
Contributor Author

Simply checking whether the Publisher process is still running before checking the time may resolve the issue.

@belforte
Member

That's surely good! But maybe we can also introduce a timeout. Originally the idea was that at times a single Publisher iteration could take hours and we did not want to have a timeout + hard kill. Publisher was not designed with the idea of being stateless and idempotent, and I opted for "safety". But we were only using stop as issued by an operator. As we try to have something automatic, and have enough experience, we can probably try a timeout at 2h.

@novicecpp
Contributor Author

novicecpp commented Sep 13, 2024

After #8561, it rarely happens on prod/preprod, but it happens quite a lot on the test env and is super annoying.

My new idea is to do an flock using getLock(), as in PostJob.py:

with getLock('get_transfers_statuses'):
    # Get the transfer status in all documents listed in self.docs_in_transfer.
    transfers_statuses = self.get_transfers_statuses()

Wrap it around
master.algorithm()

In the process controller script, just check whether the flock file exists.

Is this over-engineering, @belforte?

@belforte
Member

belforte commented Sep 13, 2024

#856 does not exist. Which issue did you want to point to ?
Maybe we can simply use a file to record the PID of the running Publisher, as is done for the REST?
IIUC the problem is that we end up waiting forever while there is nothing running.
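A minimal sketch of the PID-file idea, under assumptions: the function name publisher_is_running and the PID file location are hypothetical, and nothing here is taken from the actual Publisher or REST scripts. A background sleep stands in for the Publisher:

```shell
#!/bin/bash
# Hypothetical helper: succeed only if the PID recorded in a file is alive.
publisher_is_running() {
    local pid_file="$1"
    local pid
    [ -r "$pid_file" ] || return 1     # no PID file: nothing to wait for
    pid=$(cat "$pid_file")
    [ -n "$pid" ] || return 1          # empty PID file
    # Signal 0 sends nothing; it only checks that the process exists.
    # It also fails if the process belongs to another user.
    kill -0 "$pid" 2>/dev/null
}

# Demo with a throwaway background process standing in for the Publisher.
pid_file=$(mktemp)
sleep 30 &
echo $! > "$pid_file"
publisher_is_running "$pid_file" && echo "running"

kill $!
wait $! 2>/dev/null || true            # reap the killed process
publisher_is_running "$pid_file" || echo "not running, nothing to wait for"
rm -f "$pid_file"
```

A stop script could call such a check first and return immediately when nothing is running, before looking at the log timestamp at all.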

PostJob has locks to prevent multiple PJs running at the same time from trying to update the same file at the same time. But here we should not need such locks.

Or maybe I am confused :-), as usual.

@belforte
Member

modify isPublisherBusy to do a "ps" ?

@novicecpp
Contributor Author

Ok, I remember why I need getLock().
Parsing logs.txt to find the time to execute the next round is unreliable, but I forgot which case makes stop.sh stall.

But... the real problem is that stop.sh stalls forever in CI.
When it happens, it feels like "everything is fine" because you do not get a failure message within 15 mins, but actually the Publisher deployment has been in progress for hours.

The real(?) solution would be to set a timeout, as you said in #8442 (comment) .
2 hr is too much; 5 mins should be enough for every env.

@belforte
Member

but still.. wouldn't your original idea #8442 (comment) be enough ?

@belforte
Member

if you mean to kill Publisher 5min after "next iteration will start at ..." time, I am kind of lost.

@novicecpp
Contributor Author

but still.. wouldn't your original idea #8442 (comment) be enough ?

Yes, and it is already implemented in #8562 .
But... there is a case where the Publisher process is running (publishing/sleeping/whatever) but stop.sh never manages to terminate it and is stuck forever. Sorry, I am not sure "how" until I see it again.

if you mean to kill Publisher 5min after "next iteration will start at ..." time, I am kind of lost.

No. Change stop.sh from "waiting forever" to "exit with an error if waiting longer than 5 mins".
Then CI will raise an error, which means manual intervention is required, instead of just staying silent...
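A rough sketch of that behaviour, not the actual change: wait_with_timeout and publisher_is_idle are hypothetical names, and the 300 s / 10 s values are only illustrative. The helper polls a command until it succeeds and returns non-zero once the deadline has passed:

```shell
#!/bin/bash
# Hypothetical helper: poll a command until it succeeds, or fail after a timeout.
# Usage: wait_with_timeout <timeout_seconds> <poll_interval_seconds> <command...>
wait_with_timeout() {
    local timeout="$1"
    local interval="$2"
    shift 2
    local deadline=$(( $(date +%s) + timeout ))
    until "$@"; do
        if [ "$(date +%s)" -ge "$deadline" ]; then
            echo "Timed out after ${timeout}s waiting for: $*" >&2
            return 1
        fi
        sleep "$interval"
    done
}

# In a stop script this might look like (publisher_is_idle is a placeholder):
#   wait_with_timeout 300 10 publisher_is_idle || exit 1
# Short demo: "false" never succeeds, so the helper gives up after about 2 s.
wait_with_timeout 2 1 false || echo "stop failed, manual intervention needed"
```

With a non-zero exit status, the CI job fails visibly instead of hanging for hours.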

@belforte
Member

belforte commented Sep 18, 2024

😲
wow, odd things are going on. Yup, let's try to catch 'em, or at least live with them

@novicecpp
Contributor Author

This is done in #8737
