-
Notifications
You must be signed in to change notification settings - Fork 453
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add check for stuck jobs in poll() #5451
base: master
Are you sure you want to change the base?
Conversation
This might be an issue for really big tasks (e.g. ClimatePrediction.net) |
Would this be more of an issue of an hour is too short of time frame, or more with the implementation? |
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## master #5451 +/- ##
=========================================
Coverage 10.84% 10.84%
Complexity 1068 1068
=========================================
Files 279 279
Lines 36156 36156
Branches 8355 8355
=========================================
Hits 3920 3920
Misses 31842 31842
Partials 394 394 |
@FTang21, ah, no, my bad: in the original proposal there was an additional verification method for long running jobs: CPU time, and I completely missed that this was the part of the implementation in this PR. @davidpanderson, could you please review this and verify that this is a desired implementation of the original proposal? |
There are various problems with this. |
@davidpanderson I updated the implementation based on the updated issue. Lmk if this is ok. Should I add the abort on its own after some time or this is fine for now? Would this also be preferable as it function? |
For now, let's just show a message using ... telling the user which job is stuck, This will be useful for testing because we can see the stuck job |
Gotcha, I updated it to MSG_USER_ALERT |
Almost but not quite. Please review my pseudo-code. |
Ah I see what I missed, it should match the order provided. |
(atp->current_cpu_time - atp->stuck_check_cpu_time) < 10) { | ||
// if fraction done does not change and cpu time is <10, message the user | ||
msg_printf(atp->result->project, MSG_USER_ALERT, | ||
"[task] has not made progress in last hour, consider aborting task %s", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe the better working should be
Task has not made any progress within the last hour, consider aborting it: %s
@davidpanderson, could you please verify the spelling?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm worried about the direction this might take the BOINC client (eventually), or might lead the user.
Not every wrapper-using project successfully manages to implement a realistic measure of progress made. This direction of thought changes the progress report from 'nice to have' towards 'essential'. It needs to be explained carefully.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A job is regarded as stuck if, in the last hour of running,
a) it accumulates less than 10 sec of CPU time
AND
b) its fraction done is unchanged.
b) by itself doesn't make it stuck.
Fixes #5352
Description of the Change
Add a way to check if jobs were stuck by adding an addition poll that occurs every hour, it checks if an active_tasks fraction_done doesn't change and current_cpu_time < 10s.
Alternate Designs
Release Notes