During multi-hour jobs, Inspect will sometimes hang. The user-visible manifestation of this bug is that the timers will stop. Anecdotally, memory and CPU usage do not appear to be high after the hang begins.
We currently do not have a consistent repro case of this behavior that doesn't involve running a real multi-hour job. In addition, when the issue happened, we were not able to diagnose the root cause with basic profiling steps.
Therefore, we'll need to:
Create a consistent repro of the behavior
Ideally we can do this without spending real compute (though that isn't guaranteed if the root cause is complex). We could build fairly artificial tasks using MockLLM or something similar and run them for a very long time to see if the bug reproduces.
More sophisticated mock tasks may be required to trigger the bug, such as ones that use Docker sandboxes.
If the above steps still don't reproduce the bug, it might be worth doing an actual run as a last resort to see if it can be triggered.
Diagnose the issue once we can reproduce it at least occasionally
We will likely need to attach debugging tools to the stuck process to inspect what went wrong.
We can also look at things like resource usage before the hang, which task it was on, etc., which might make repros easier or give us some idea of the root cause.
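One low-cost way to prepare for the diagnosis step is to install stack-dump hooks before the long run starts, so a hung process can be asked where it is stuck. The sketch below is generic Python, not part of Inspect: it assumes the hung process is a Python asyncio program on a Unix-like system, and the helper names (`install_stack_dump_signal`, `format_task_stacks`, `demo`) are hypothetical. It uses only the standard library (`faulthandler`, `asyncio`).

```python
import asyncio
import faulthandler
import io
import signal
import sys


def install_stack_dump_signal(stream=sys.stderr) -> bool:
    """Dump every thread's Python stack to `stream` on SIGUSR1.

    Returns False on platforms without SIGUSR1 (e.g. Windows).
    """
    if not hasattr(signal, "SIGUSR1"):
        return False
    # `stream` must be backed by a real file descriptor (fileno()).
    faulthandler.register(signal.SIGUSR1, file=stream, all_threads=True)
    return True


def format_task_stacks() -> str:
    """Return the stack of every unfinished asyncio task.

    Must be called from inside the running event loop.
    """
    buf = io.StringIO()
    for task in asyncio.all_tasks():
        buf.write(f"--- task: {task.get_name()} ---\n")
        task.print_stack(file=buf)
    return buf.getvalue()


async def demo() -> str:
    """Hypothetical stand-in for a hang: a worker awaiting an event that is never set."""
    never_set = asyncio.Event()

    async def stuck_worker():
        await never_set.wait()

    worker = asyncio.create_task(stuck_worker(), name="stuck_worker")
    await asyncio.sleep(0)  # yield once so the worker starts and blocks
    report = format_task_stacks()
    worker.cancel()
    return report


if __name__ == "__main__":
    install_stack_dump_signal()
    print(asyncio.run(demo()))
```

With the signal hook installed, running `kill -USR1 <pid>` from another shell makes the process print its thread stacks without terminating it. That covers the case where the event loop itself is blocked; `format_task_stacks` only helps when the loop is still running and some tasks are waiting forever, so the two are complementary.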
Originally discussed on Slack here.