There is one scenario in the websocket version of the protocol that's currently problematic:
When the API restarts, graders are interrupted with a disconnection exception. If a grader has an ongoing job during the restart, the API assumes the grader is still working on the original job (since in the HTTP protocol graders would continue the job and submit it in this case). The "running" flag of such a grader is not cleared in the API once the restart is done.
I think we should probably:
- add a reconnection mechanism to the websocket version of the grader instead of relying entirely on Docker (and try to preserve the status of an ongoing job for as long as possible)
- add some kind of draining mechanism to the API to temporarily block incoming grading requests when we expect a restart (e.g. when deploying a new version)
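
To make the two proposals concrete, here is a rough sketch of what they might look like. Everything in it is hypothetical: `DrainGate`, `reconnect_with_backoff`, and the `connect` callable are placeholder names, not existing code in this project, and the real websocket client would replace the injected `connect` function.

```python
import time
from typing import Callable, Optional


class DrainGate:
    """Hypothetical API-side switch: while draining, new grading requests are refused."""

    def __init__(self) -> None:
        self._draining = False

    def start_draining(self) -> None:
        # Call this shortly before a planned restart (e.g. a deploy).
        self._draining = True

    def stop_draining(self) -> None:
        self._draining = False

    def accepts_new_jobs(self) -> bool:
        return not self._draining


def reconnect_with_backoff(
    connect: Callable[[], object],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[object]:
    """Call connect() until it succeeds, doubling the delay after each failure.

    Returns the connection, or None once max_attempts is exhausted.
    `connect` is assumed to raise ConnectionError while the API is down.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts:
                return None
            sleep(delay)
            delay = min(delay * 2, max_delay)
    return None
```

The grader would keep its in-progress job in memory while `reconnect_with_backoff` runs, and only fall back to a container restart if it returns `None`. On the API side, the request handler would check `accepts_new_jobs()` before dispatching a job to a grader.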
We should test that when a node reconnects, its info is preserved and it can continue from where it left off.
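
A test along these lines could check that behaviour. This is only a sketch under assumptions: `GraderRegistry` and `GraderState` are made-up stand-ins for however the API tracks graders, and it assumes the grader reports its current job when it reconnects, which the current protocol may not do.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class GraderState:
    connected: bool = True
    current_job: Optional[str] = None  # id of the job in progress, if any


class GraderRegistry:
    """Hypothetical API-side record of graders, keyed by a stable grader id."""

    def __init__(self) -> None:
        self._graders: Dict[str, GraderState] = {}

    def connect(self, grader_id: str, reported_job: Optional[str] = None) -> GraderState:
        # Reuse the existing record on reconnect instead of creating a new one.
        state = self._graders.setdefault(grader_id, GraderState())
        state.connected = True
        # Trust what the grader reports, so a stale "running" flag gets cleared.
        state.current_job = reported_job
        return state

    def assign(self, grader_id: str, job_id: str) -> None:
        self._graders[grader_id].current_job = job_id

    def disconnect(self, grader_id: str) -> None:
        # Keep current_job: the grader may still be working and reconnect later.
        self._graders[grader_id].connected = False

    def get(self, grader_id: str) -> GraderState:
        return self._graders[grader_id]


def test_reconnect_preserves_job() -> None:
    registry = GraderRegistry()
    registry.connect("grader-1")
    registry.assign("grader-1", "job-42")
    registry.disconnect("grader-1")

    # The job is still recorded while the grader is away.
    assert registry.get("grader-1").current_job == "job-42"

    # The grader comes back and reports the same job: it continues where it left off.
    state = registry.connect("grader-1", reported_job="job-42")
    assert state.connected and state.current_job == "job-42"
```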