fix: timeout in tform #672

Open: wants to merge 1 commit into base: master

Conversation

jodavies
Collaborator

The CI often fails on the timeout test and has to be re-run, after which it usually passes. There are two issues:

  • Sometimes FORM will terminate after 1s, but not print an error
  • In TFORM, particularly under valgrind, SIGALRM is delivered to the wrong thread. Resolve this by specifically blocking SIGALRM on thread creation.

This resolves #612

@coveralls

coveralls commented Jun 19, 2025

Coverage Status

coverage: 50.606% (+0.003%) from 50.603%
when pulling a63e686 on jodavies:timeout
into 21f29a8 on form-dev:master.

Collaborator

@tueda tueda left a comment

Another option is to temporarily block SIGALRM on the main thread during worker thread creation (the signal mask of created threads is inherited from the parent). In your implementation, there is a dangerous timing window in which the signal can be delivered to worker threads (though it probably never occurs).
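
The suggestion above can be sketched as follows. This is a minimal illustration, not the code in this PR: `NUM_WORKERS`, `worker_main`, and `start_workers_with_alarm_blocked` are hypothetical names, and the workers do nothing except report whether SIGALRM is blocked in the mask they inherited.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <signal.h>
#include <stddef.h>

#define NUM_WORKERS 4 /* hypothetical worker count */

/* Each worker records whether SIGALRM is blocked in its inherited mask. */
static void *worker_main(void *arg)
{
    int *blocked = arg;
    sigset_t current;

    /* With a NULL new set, pthread_sigmask only queries the current mask. */
    pthread_sigmask(SIG_BLOCK, NULL, &current);
    *blocked = sigismember(&current, SIGALRM);
    return NULL;
}

/* Returns the number of workers that started with SIGALRM blocked,
 * or -1 if the mask could not be changed or a thread could not be created. */
int start_workers_with_alarm_blocked(void)
{
    pthread_t threads[NUM_WORKERS];
    int blocked[NUM_WORKERS] = {0};
    sigset_t alarm_set, old_set;
    int i, created = 0, count = 0;

    sigemptyset(&alarm_set);
    sigaddset(&alarm_set, SIGALRM);

    /* Block SIGALRM in the creating thread first, so that every new
     * thread starts with the signal already blocked. */
    if (pthread_sigmask(SIG_BLOCK, &alarm_set, &old_set) != 0)
        return -1;

    for (i = 0; i < NUM_WORKERS; i++) {
        if (pthread_create(&threads[i], NULL, worker_main, &blocked[i]) != 0)
            break;
        created++;
    }

    /* Restore the creating thread's original mask only after all
     * threads exist. */
    pthread_sigmask(SIG_SETMASK, &old_set, NULL);

    for (i = 0; i < created; i++) {
        pthread_join(threads[i], NULL);
        count += blocked[i];
    }
    return created == NUM_WORKERS ? count : -1;
}
```

Because the mask is inherited at `pthread_create`, no worker ever runs with SIGALRM unblocked, which closes the timing window that exists when each worker blocks the signal itself after it has started.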

@jodavies
Collaborator Author

> Another option is to temporarily block SIGALRM on the main thread during worker thread creation (the signal mask of created threads is inherited from the parent). In your implementation, there is a dangerous timing window in which the signal can be delivered to worker threads (though it probably never occurs).

In this case, what happens if SIGALRM comes while the main thread is also blocking it?

@jodavies
Collaborator Author

OK, as I now understand it, the signal will be queued until the main thread unblocks it. So the latest commit should avoid the potential timing problem, and still works properly under valgrind. If you're happy I will squash these.

The only remaining possible issue is whether the valgrind report of #612 can occur at the same time as #612 (comment). In that case I think the CI fails, since the test succeeds but still has a valgrind error?

@@ -306,7 +306,8 @@ assert succeeded?
# ParFORM may terminate without printing the error message,
# depending on the MPI environment.
#pend_if mpi?
assert runtime_error?
# Sometimes, FORM will terminate after 1s without a runtime error.
Collaborator

Stopping without printing a runtime error is a bug. Maybe we can put a "TODO" comment here?

@tueda
Collaborator

tueda commented Jun 24, 2025

> OK, as I now understand it, the signal will be queued until the main thread unblocks it.

Yes, POSIX says:

https://pubs.opengroup.org/onlinepubs/9799919799/functions/V2_chap02.html#tag_16_04_01

If there are no threads in a call to a sigwait() function selecting that signal, and if all threads within the process block delivery of the signal, the signal shall remain pending on the process until a thread calls a sigwait() function selecting that signal, a thread unblocks delivery of the signal, or the action associated with the signal is set to ignore the signal.
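
The quoted rule can be checked with a small experiment. The sketch below is illustrative only and assumes the calling thread is the only thread in the process; `on_alarm` and `blocked_alarm_stays_pending` are hypothetical names. It sends a process-directed SIGALRM with `kill` while the signal is blocked, confirms via `sigpending` that the signal is pending and the handler has not run, and then unblocks it.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t alarm_handled = 0;

static void on_alarm(int signo)
{
    (void)signo;
    alarm_handled = 1;
}

/* Returns 0 if SIGALRM stayed pending while blocked and was delivered
 * once unblocked, 1 if the behaviour differed, -1 on setup failure.
 * Assumes a single-threaded process. */
int blocked_alarm_stays_pending(void)
{
    struct sigaction sa, old_sa;
    sigset_t alarm_set, old_set, pending;
    int was_pending, handled_early, handled_late;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGALRM, &sa, &old_sa) != 0)
        return -1;

    sigemptyset(&alarm_set);
    sigaddset(&alarm_set, SIGALRM);
    if (pthread_sigmask(SIG_BLOCK, &alarm_set, &old_set) != 0)
        return -1;

    alarm_handled = 0;
    kill(getpid(), SIGALRM); /* process-directed, like a signal from alarm() */

    /* While every thread blocks SIGALRM, it must remain pending. */
    sigpending(&pending);
    was_pending = sigismember(&pending, SIGALRM);
    handled_early = alarm_handled;

    /* Unblocking delivers the pending signal before the call returns. */
    pthread_sigmask(SIG_UNBLOCK, &alarm_set, NULL);
    handled_late = alarm_handled;

    /* Restore the original mask and handler. */
    pthread_sigmask(SIG_SETMASK, &old_set, NULL);
    sigaction(SIGALRM, &old_sa, NULL);

    return (was_pending == 1 && handled_early == 0 && handled_late == 1) ? 0 : 1;
}
```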

> The only remaining possible issue is whether the valgrind report of #612 can occur at the same time as #612 (comment). In that case I think the CI fails, since the test succeeds but still has a valgrind error?

I think this Valgrind error won't happen because worker threads don't get SIGALRM.

sigemptyset(&sig_set);
sigaddset(&sig_set, SIGALRM);
pthread_sigmask(SIG_BLOCK, &sig_set, NULL);

Collaborator

Do we need such code also for RunSortBot?

Collaborator Author

Yes, here the signal is only unblocked below after we also have started the sort bots.

Collaborator

Another thing: WITH_ALARM should be checked. Otherwise, the TFORM build fails on Windows.
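
One way to apply such a guard is sketched below. It assumes that `WITH_ALARM` is a preprocessor macro defined only in builds where alarm support is available (how the build system sets it is not shown in this thread), and `block_alarm_in_this_thread` is a hypothetical helper name, not a function from the FORM sources.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>

#ifdef WITH_ALARM
#include <pthread.h>
#include <signal.h>
#endif

/* Blocks SIGALRM in the calling thread when alarm support is compiled in.
 * Without WITH_ALARM this is a no-op, so platforms that lack SIGALRM
 * never see the signal-related calls. Returns 0 on success. */
int block_alarm_in_this_thread(void)
{
#ifdef WITH_ALARM
    sigset_t sig_set;

    sigemptyset(&sig_set);
    sigaddset(&sig_set, SIGALRM);
    return pthread_sigmask(SIG_BLOCK, &sig_set, NULL);
#else
    return 0;
#endif
}
```

Keeping the `#ifdef` inside a single helper means the call sites stay identical on every platform.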

The CI often fails due to the timeout test, and needs to be re-run, when it usually
will pass. There are two issues:
- Sometimes FORM will terminate after 1s, but not print an error. This is a bug
  really, but for now allow it to pass the CI.
- In TFORM, particularly under valgrind, SIGALRM may be delivered to the wrong
  thread. Resolve this by blocking SIGALRM in the worker and sortbot threads.
Development

Successfully merging this pull request may close these issues.

TimeoutAfter_2 test failure
3 participants