Publisher stop.sh fails if the Publisher process stops for longer than the next cycle #8442

novicecpp · 2024-05-29T09:27:24Z

The stop.sh script (or manage.sh stop in the new PyPI image) waits forever if it is run after the "Next cycle" time has passed and no Publisher process is running.

The script parses the timestamp from the last line of logs/log.txt:

2024-05-29 11:13:12,912:INFO:PublisherMaster,296:Next cycle will start at 11:39:01

CRABServer/src/script/Deployment/Publisher/stop.sh

Lines 19 to 23 in 97bf591

    
    start=`echo $lastLine|awk '{print $NF}'`
    startTime=`date -d ${start} +%s`  # in seconds from Epoch
    now=`date +%s` # in seconds from Epoch
    delta=$((${startTime}-${now}))
    if [ $delta -gt 60 ]; then

The delta keeps decreasing as the current time (epoch) increases. If no process is running, nothing updates the log line, so the script is stuck forever.

The workaround is to exec into the container and kill /bin/bash ./stop.sh so that Deploy_TW can continue.
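
A minimal Python sketch of the same computation may make the stall easier to see. It assumes, as `date -d` does for a bare time, that the "Next cycle" time refers to the current day; the function name `seconds_until_next_cycle` is hypothetical and is not part of stop.sh.

```python
from datetime import datetime


def seconds_until_next_cycle(last_line: str, now: datetime) -> float:
    """Seconds from `now` until the 'Next cycle' time found in `last_line`."""
    # The last whitespace-separated field is the HH:MM:SS start time,
    # mirroring awk '{print $NF}' in stop.sh.
    hhmmss = last_line.split()[-1]
    start_time = datetime.strptime(hhmmss, "%H:%M:%S").time()
    # Assumption: the time belongs to the same day as `now`.
    start = datetime.combine(now.date(), start_time)
    return (start - now).total_seconds()


line = "2024-05-29 11:13:12,912:INFO:PublisherMaster,296:Next cycle will start at 11:39:01"

# Before the announced start the delta is positive.
print(seconds_until_next_cycle(line, datetime(2024, 5, 29, 11, 20, 0)))  # 1141.0
# After it, with no Publisher left to write a new log line, the delta is
# negative and only shrinks further as time passes.
print(seconds_until_next_cycle(line, datetime(2024, 5, 29, 12, 0, 0)))   # -1259.0
```

Once the delta has dropped below the threshold it can never rise above it again unless a running Publisher writes a new "Next cycle" line.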


novicecpp · 2024-05-29T09:28:47Z

Simply checking whether the Publisher process is still running before checking the time may resolve the issue.

belforte · 2024-05-29T12:42:08Z

That's surely good! But maybe we can also introduce a timeout. Originally the idea was that at times a single Publisher iteration could take hours, and we did not want a timeout plus a hard kill. Publisher was not designed to be stateless and idempotent, so I opted for "safety". But back then stop was only issued by an operator. Now that we are trying to have something automatic, and have enough experience, we can probably try a timeout of 2h.

novicecpp · 2024-09-13T11:06:17Z

After #8561, it rarely happens on prod/preprod, but it happens quite often in the test env and is super annoying.

My new idea is to use flock via getLock(), like PostJob.py does:

CRABServer/src/python/TaskWorker/Actions/PostJob.py

Lines 426 to 428 in ce559a1

    
    with getLock('get_transfers_statuses'):
        # Get the transfer status in all documents listed in self.docs_in_transfer.
        transfers_statuses = self.get_transfers_statuses()

Wrap it around

CRABServer/src/python/Publisher/Main.py

Line 69 in ce559a1

master.algorithm()

In the process controller script, just check whether the flock file exists.

Is this over-engineering, @belforte?

belforte · 2024-09-13T19:31:22Z

#856 does not exist. Which issue did you want to point to?
Maybe we can simply use a file to record the PID of the running Publisher, as is done for the REST?
IIUC the problem is that we end up waiting forever while there is nothing running.

PostJob has locks to prevent multiple PJs running at the same time from trying to update the same file at the same time. But here we should not need such locks.

Or maybe I am confused :-), as usual.

belforte · 2024-09-15T11:05:46Z

Modify isPublisherBusy to do a "ps"?

novicecpp · 2024-09-17T18:07:59Z

Ok, I remember why I need getLock().
Parsing logs/log.txt to find the time of the next round is unreliable. But I forgot which case makes stop.sh stall.

But... the real problem is that stop.sh stalls forever in CI.
When it happens, it feels like "everything is fine" because you do not get a failure message within 15 mins, but actually the Publisher deployment has been in progress for hours.

The real(?) solution would be to set a timeout, as you said in #8442 (comment).
2 hr is too much; 5 mins should be enough for every env.

belforte · 2024-09-17T20:00:30Z

But still... wouldn't your original idea #8442 (comment) be enough?

belforte · 2024-09-17T20:07:43Z

If you mean to kill Publisher 5 min after the "next iteration will start at ..." time, I am kind of lost.

novicecpp · 2024-09-17T20:43:10Z

> But still... wouldn't your original idea #8442 (comment) be enough?

Yes, and it is already implemented in #8562.
But... there is a case where the Publisher process is running (publishing/sleeping/whatever) but stop.sh never manages to terminate it and is stuck forever. Sorry, I am not sure "how" until I see it again.

> If you mean to kill Publisher 5 min after the "next iteration will start at ..." time, I am kind of lost.

No. Change stop.sh from "waiting forever" to "exit with error if waiting longer than 5 mins".
Then CI will raise an error, which means manual intervention is required, instead of staying silent.
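
The "exit with error after 5 mins" behaviour could be sketched as a bounded polling loop. This is only an illustration under assumptions: `wait_until_stopped` and its parameters are hypothetical, and `is_running` stands for whatever check the controller script uses (ps, a PID file, a lock).

```python
import time


def wait_until_stopped(is_running, timeout=300.0, poll=5.0):
    """Poll `is_running()` until it returns False or `timeout` seconds elapse.

    Returns True if the process stopped in time, False on timeout.
    """
    # A monotonic clock is unaffected by wall-clock adjustments.
    deadline = time.monotonic() + timeout
    while is_running():
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
    return True
```

A controller script would then call something like `sys.exit(1)` when this returns False, so that CI reports a failure rather than hanging for hours.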

belforte · 2024-09-18T00:18:23Z

😲
Wow, odd things are going on. Yup, let's try to catch them, or at least live with them.

novicecpp · 2024-10-31T16:21:52Z

This is done in #8737

novicecpp self-assigned this May 29, 2024

novicecpp added OPERATION Publisher labels Jun 25, 2024

novicecpp added the PyPI label Jul 4, 2024

novicecpp mentioned this issue Jul 25, 2024

Change entrypoint of Publisher process to simple binary script #8561

Closed

novicecpp mentioned this issue Oct 11, 2024

Various fix for process controller scripts (manage.sh/manage.py/etc.) #8737

Merged

novicecpp closed this as completed Oct 31, 2024
