Inspect sometimes hangs on extremely large runs #849

Open
MSchmatzAISI opened this issue Nov 15, 2024 · 0 comments
Originally discussed on Slack here.

During multi-hour jobs, Inspect will sometimes hang. The user-visible symptom is that the timers stop advancing. Anecdotally, memory and CPU usage do not appear to be high after the hang begins.

We currently do not have a consistent repro case of this behavior that doesn't involve running a real multi-hour job. In addition, when the issue happened, we were not able to diagnose the root cause with basic profiling steps.

Therefore, we'll need to:

  • Create a consistent repro of the behavior
    • Ideally we can do this without spending real compute (though that isn't guaranteed if the root cause is complex). We could try to build fairly artificial tasks using MockLLM or something similar and run them for a very long time to see if the bug repros.
    • More sophisticated mock tasks may be required to trigger the bug, such as ones that use Docker sandboxes.
    • If we try the above steps and still can't repro the bug, it might be worth doing an actual run to see if it can be triggered as a last resort.
  • Diagnose the issue once we can reproduce it at least occasionally
    • @JJ Allaire recommends file logging for this so we can observe what it was doing right before it hung.
    • We will likely need to attach debugging tools to the stuck process to inspect what went wrong.
    • We can also look at things like what resource usage was like before the hang, what task it was on, etc., which might help make repros easier or give us some idea of the root cause.
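
As a starting point for the diagnosis steps above, here is a minimal sketch using only the Python standard library. It is not part of Inspect: `install_hang_diagnostics`, `dump_asyncio_tasks`, the logger name `hang_debug`, and the file paths are all hypothetical names chosen for illustration, and the sketch assumes the run is driven by an asyncio event loop. It sets up file logging and registers a signal handler that writes every thread's stack to a file on demand, so a stuck process can be examined without restarting it.

```python
import asyncio
import faulthandler
import logging
import signal


def install_hang_diagnostics(log_path, dump_path):
    """Hypothetical helper: file logging plus an on-demand stack dump."""
    # File logging, so the last events before a hang survive on disk.
    logger = logging.getLogger("hang_debug")
    logger.setLevel(logging.DEBUG)
    handler = logging.FileHandler(log_path)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)

    # faulthandler writes to the file descriptor directly, so the file
    # object must stay open for the lifetime of the process.
    dump_file = open(dump_path, "w")
    if hasattr(signal, "SIGUSR1"):  # SIGUSR1 does not exist on Windows
        # After this, `kill -USR1 <pid>` appends all thread stacks to dump_path.
        faulthandler.register(signal.SIGUSR1, file=dump_file, all_threads=True)
    return logger, dump_file


def dump_asyncio_tasks(loop, out):
    """Write the stack of every unfinished task on `loop` to `out`.

    Returns the number of tasks written. Intended to be called from code
    running on that loop, for example a periodic watchdog coroutine.
    """
    tasks = asyncio.all_tasks(loop)
    for task in tasks:
        out.write(f"--- {task.get_name()} ---\n")
        task.print_stack(file=out)
    return len(tasks)
```

With this installed at startup, sending `SIGUSR1` to the hung process should show where each thread is blocked, and the log file shows the last events recorded before the timers stopped. If the event loop itself is blocked, `dump_asyncio_tasks` will not get a chance to run, which is itself a useful signal; in that case the thread dump or an external tool such as `py-spy dump --pid <pid>` is the better option.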
@MSchmatzAISI MSchmatzAISI added the bug Something isn't working label Nov 15, 2024
@MSchmatzAISI MSchmatzAISI self-assigned this Nov 15, 2024