Inspect sometimes hangs on extremely large runs #849

MSchmatzAISI · 2024-11-15T09:20:16Z

During multi-hour jobs, Inspect will sometimes hang. The user-visible symptom is that the timers stop advancing. Anecdotally, memory and CPU usage do not appear to be high after the hang begins.

We currently do not have a consistent repro case of this behavior that doesn't involve running a real multi-hour job. In addition, when the issue happened, we were not able to diagnose the root cause with basic profiling steps.

Therefore, we'll need to:

Create a consistent repro of the behavior
- Ideally we can do this without wasting compute (though that isn't guaranteed if the root cause is complex). We could try making fairly artificial tasks using MockLLM or something similar and running them for a very long time to see if the bug repros.
- More sophisticated mock tasks may be required to trigger the bug, such as ones that use Docker sandboxes.
- If the steps above still don't repro the bug, it might be worth doing an actual run as a last resort to see if it can be triggered.

Diagnose the issue once we can reproduce it at least occasionally
- @JJ Allaire recommends file logging for this so we can observe what it was doing right before it hung.
- We will likely need to attach debugging tools to the stuck process to inspect what went wrong.
- We can also look at things like resource usage before the hang, which task it was on, etc., which might make repros easier or give us some idea of the root cause.
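
As a starting point for the diagnosis steps above, here is a minimal sketch using only the Python standard library. It is not part of Inspect: `install_hang_diagnostics`, the logger name, and the file names `run.log` and `stacks.log` are hypothetical, and it assumes we can run this setup code in the same process before the long job starts. It combines file logging with periodic and on-demand stack dumps via `faulthandler`, so that a stuck process leaves behind a record of what each thread was doing.

```python
import faulthandler
import logging
import signal
from pathlib import Path


def install_hang_diagnostics(log_dir: Path, dump_interval_s: float = 600.0):
    """Hypothetical helper: set up file logging and stack dumps for a long run."""
    log_dir.mkdir(parents=True, exist_ok=True)

    # File logging, so the last messages before a hang survive on disk.
    logger = logging.getLogger("hang_diagnostics")  # assumed logger name
    logger.setLevel(logging.DEBUG)
    handler = logging.FileHandler(log_dir / "run.log")
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)

    # faulthandler writes to the file descriptor directly, so the file object
    # must stay open for the lifetime of the process.
    trace_file = open(log_dir / "stacks.log", "a")
    faulthandler.enable(file=trace_file, all_threads=True)

    # Dump every thread's stack at a fixed interval. The dumps taken after the
    # timers stop show where the process is stuck.
    faulthandler.dump_traceback_later(dump_interval_s, repeat=True, file=trace_file)

    # On POSIX, also allow an on-demand dump with `kill -USR1 <pid>`.
    if hasattr(signal, "SIGUSR1"):
        faulthandler.register(signal.SIGUSR1, file=trace_file, all_threads=True)

    return logger, trace_file
```

Calling `install_hang_diagnostics(Path("hang-logs"))` at startup would be enough to try this on a mock run. Note that `faulthandler` reports thread stacks only, so for code that is stuck inside an event loop it will show the loop's thread rather than individual coroutines. Attaching an external tool to the stuck process may still be needed for that.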

MSchmatzAISI added the bug (Something isn't working) label Nov 15, 2024

MSchmatzAISI self-assigned this Nov 15, 2024
