-
Notifications
You must be signed in to change notification settings - Fork 144
fix: timeout in tform #672
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another option is to temporarily block SIGALRM
on the main thread during worker thread creation (the signal mask of created threads is inherited from the parent). In your implementation, there is a dangerous timing window in which the signal can be delivered to worker threads (though it probably never occurs).
In this case, what happens if SIGALRM comes while the main thread is also blocking it? |
OK, as I now understand it, the signal will be queued until the main thread unblocks it. So the latest commit should avoid the potential timing problem, and still works properly under valgrind. If you're happy I will squash these. The only remaining possible issue, is if the valgrind report of #612 can happen at the same time as #612 (comment) -- in this case I think the CI fails, since the test succeeds but still has a valgrind error? |
@@ -306,7 +306,8 @@ assert succeeded? | |||
# ParFORM may terminate without printing the error message, | |||
# depending on the MPI environment. | |||
#pend_if mpi? | |||
assert runtime_error? | |||
# Sometimes, FORM will terminate after 1s without a runtime error. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stopping without printing a runtime error is a bug. Maybe we can put a "TODO" comment here?
Yes, POSIX says: https://pubs.opengroup.org/onlinepubs/9799919799/functions/V2_chap02.html#tag_16_04_01
I think this Valgrind error won't happen because worker threads don't get |
sigemptyset(&sig_set); | ||
sigaddset(&sig_set, SIGALRM); | ||
pthread_sigmask(SIG_BLOCK, &sig_set, NULL); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need such code also for RunSortBot
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, here the signal is only unblocked below after we also have started the sort bots.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another thing: WITH_ALARM
should be checked. Otherwise, the TFORM build fails on Windows.
The CI often fails due to the timeout test, and needs to be re-run, when it usually will pass. There are two issues: - Sometimes FORM will terminate after 1s, but not print an error. This is a bug really, but for now allow it to pass the CI. - In TFORM, particularly under valgrind, SIGALRM may be delivered to the wrong thread. Resolve this by blocking SIGALRM in the worker and sortbot threads.
The CI often fails due to the timeout test, and needs to be re-run, when it usually will pass. There are two issues:
This resolves #612