backend: implement timeout fail-safe in case copr-rpmbuild hangs up
Fix #3343

I am not 100% sure this PR will fully prevent issues like #3343, but
it tries to address cases like that one.

Normally, copr-rpmbuild is responsible for terminating itself when the
build timeout is reached. This is indicated by the "Copr timeout =>
sending INT" message in the builder-live.log.

This has proven not to be reliable enough: sometimes the whole
copr-rpmbuild process hangs, doesn't terminate itself, and the build
stays stuck in the running state for months. This PR implements a
fail-safe mechanism on the backend, which waits one hour after the
configured timeout is reached and then terminates the build.

The backend terminates timed-out builds by connecting to the broken
builders and running `copr-rpmbuild-cancel` on them. This successfully
terminated the currently stuck builds mentioned in #3343.
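
The deadline check described above can be sketched in isolation. This is
a minimal, standalone illustration and not the backend code itself: the
function name `build_timed_out`, the `grace_seconds` parameter, and the
injectable `now` argument are assumptions made here so the logic can be
exercised without a real job object. It assumes `started_on` is a UNIX
timestamp and `timeout` is given in seconds, as the diff suggests.

```python
from datetime import datetime

# Assumption: one hour of grace, mirroring the `60 * 60` in the diff.
GRACE_SECONDS = 60 * 60


def build_timed_out(started_on, timeout, now=None, grace_seconds=GRACE_SECONDS):
    """
    Return True when the build is past its timeout plus the grace period.

    started_on    -- UNIX timestamp (seconds) of the build start
    timeout       -- configured build timeout in seconds
    now           -- hypothetical hook for tests; defaults to datetime.now()
    grace_seconds -- extra time the builder gets to terminate itself
    """
    if now is None:
        now = datetime.now()
    # The builder should stop on its own at started_on + timeout; the
    # backend only steps in once the grace period has also elapsed.
    limit = datetime.fromtimestamp(started_on + timeout + grace_seconds)
    return now > limit
```

With a one-hour timeout, a build is left alone until two hours after its
start and is treated as timed out only after that point.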
FrostyX authored and praiskup committed Aug 19, 2024
1 parent c7dc2f3 commit 04d375b
Showing 2 changed files with 22 additions and 7 deletions.
25 changes: 19 additions & 6 deletions backend/copr_backend/background_worker_build.py
@@ -404,20 +404,34 @@ def _proctitle(self, text):

     def _check_build_interrupted(self):
         """
-        Should we interrupet a running worker?
+        Should we interrupt a running worker?
         """
         return (self._check_failed_resalloc_ticket()
-                or self._cancel_task_check_request())
+                or self._cancel_task_check_request()
+                or self._build_timeouted())

     @ttl_cache(ttl=10*60)
     def _check_failed_resalloc_ticket(self):
         """
         Did the resalloc ticket fail?
         """
         self.host.ticket.collect()
-        self.log.info("Status: %s", self.host.ticket.failed)
+        self.log.info("Failed resalloc ticket: %s", self.host.ticket.failed)
         return self.host.ticket.failed

+    def _build_timeouted(self):
+        """
+        When build timeouts, it should be handled by `copr-rpmbuild` and the
+        builder machine should mark itself as finished. When it fails to do so,
+        we have this fail-safe to know if a build timeouted and should be
+        terminated.
+        """
+        # Wait some time (1 hour) after the configured timeout for the builder
+        # to terminate itself.
+        timestamp = self.job.started_on + self.job.timeout + 60 * 60
+        limit = datetime.fromtimestamp(timestamp)
+        return datetime.now() > limit
+
     def _cancel_task_check_request(self):
         """
         Was the build canceled by the user?
@@ -482,9 +496,6 @@ def _discard_running_worker(self):
         raise any exception. The worst case scenario is that nothing is
         canceled.
         """
-        if not self.canceled:
-            return
-
         self._proctitle("Canceling running task...")
         self.redis_set_worker_flag("canceling", 1)
         try:
@@ -994,6 +1005,8 @@ def build(self, attempt):
             ).run()
             if self.canceled:
                 raise BuildCanceled
+            if self._build_timeouted():
+                raise BackendError("Build timeouted")
             if self.host.ticket.failed:
                 transfer_failure = "Resalloc ticket FAILED"
             if transfer_failure:
4 changes: 3 additions & 1 deletion backend/tests/test_background_worker_build.py
@@ -716,8 +716,10 @@ def test_average_step():

 @_patch_bwbuild_object("time.sleep", mock.MagicMock())
 @_patch_bwbuild_object("time.time")
-def test_retry_for_ssh_tail_failure(mc_time, f_build_rpm_case,
+@_patch_bwbuild_object("BuildBackgroundWorker._build_timeouted")
+def test_retry_for_ssh_tail_failure(mc_timeouted, mc_time, f_build_rpm_case,
                                     caplog):
+    mc_timeouted.return_value = False
     mc_time.side_effect = list(range(500))
     class _SideEffect:
         counter = 0
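
The test change above pins `_build_timeouted` to `False`, presumably so
that the new wall-clock fail-safe cannot fire while the test drives a
mocked `time.time`. The same idea can be sketched with plain
`unittest.mock`. The `Worker` class and the
`run_build_with_failsafe_disabled` helper below are hypothetical
stand-ins for illustration only; they are not the project's
`BuildBackgroundWorker` or its `_patch_bwbuild_object` helper.

```python
from unittest import mock


class Worker:
    """Hypothetical stand-in for the backend worker, not the real class."""

    def _build_timeouted(self):
        # The real method compares wall-clock time against a deadline; here
        # it raises, so the example only passes when the patch is in effect.
        raise RuntimeError("expected to be patched")

    def build(self):
        # Mirrors the shape of the check added to build() in the diff.
        if self._build_timeouted():
            raise RuntimeError("Build timeouted")
        return "built"


def run_build_with_failsafe_disabled():
    # Replace the method on the class so every instance sees the mock,
    # and force it to report "not timed out".
    with mock.patch.object(Worker, "_build_timeouted",
                           return_value=False) as mc_timeouted:
        result = Worker().build()
    return result, mc_timeouted.call_count
```

Patching the method on the class rather than on an instance keeps the
sketch close to the decorator-based patching used in the test.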
