FailOnRecoverableError mode #5240
Comments
@li-boxuan I believe we have a similar mechanism implemented in OpenHands/evaluation/utils/shared.py, lines 300 to 364 (at commit eb2a0b1).
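(For context, a rough sketch of what such a try/except-and-rerun wrapper looks like; the function and parameter names below are illustrative placeholders, not the exact code in shared.py.)

```python
# Illustrative retry wrapper around a single evaluation instance.
# Names (process_instance, instance) are hypothetical placeholders.
import logging
import time

logger = logging.getLogger(__name__)


def run_with_retries(process_instance, instance, max_retries=3, backoff_seconds=30):
    """Rerun `process_instance` when the OpenHands process itself crashes.

    This only covers failures that surface as Python exceptions (e.g. a runtime
    crash). It never sees provider errors that end the run "successfully" with
    an error message recorded in the final state.
    """
    for attempt in range(1, max_retries + 1):
        try:
            return process_instance(instance)
        except Exception as exc:  # deliberately broad: any crash triggers a rerun
            logger.warning('Attempt %d/%d failed: %s', attempt, max_retries, exc)
            if attempt == max_retries:
                raise
            time.sleep(backoff_seconds)
```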
@xingyaoww This wrapper is useful for capturing OpenHands' instability errors (e.g., a runtime crash), but not the external LLM provider errors I mentioned above (out of budget, rate limiting, etc.). In those cases, the OpenHands main process doesn't throw any exception; instead, it finishes the execution with an error message. Consequently, the benchmarking code treats external LLM provider errors as OpenHands' incompetence.
I'm trying to understand how to separate these, because there is also a third kind: application errors of all kinds. Or I guess those are more like "incompetence". Is it fair to say that your proposal of fail-on-recoverable-error is a kind of exit that would happen:
Yeah, they sound like "incompetence" to me. Actually, I do feel the current wrapper approach (try/except and rerun) is unfair to some extent, since software stability should definitely be part of the benchmark target itself, but I guess that's not how research works, so let it be.
Not really. An LLM out-of-context error is a counterexample: that's categorized as OpenHands' incompetence (as in, it cannot condense its prompt, or fall back to a different LLM with longer context support, etc.).
Docker errors (I sometimes do see Docker crash on my laptop) or cloud provider errors (if ever running on a cloud-based sandbox) fall into this category too. Basically, any error that's out of OpenHands' control.
Ah of course, absolutely. I wasn't counting it 😅 Then, 400 errors from the LLM API are probably not included, while 502 and 500 are. Sorry, I'm rephrasing just so you can correct me and I make sure I get the point. In my understanding, then, what you're proposing is like: there are all kinds of actors, users, third parties, that
Yep, that's what I am thinking. Apparently those 3rd-party errors, along with OpenHands errors, look the same to users: "this thing just doesn't work". But from a benchmarking perspective, the 3rd-party errors should be reported differently, in my opinion.
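(To make this concrete, here is a rough sketch of how an evaluator might separate provider-side failures from agent-side ones. The specific litellm exception types listed are an assumption about what counts as "external", not an agreed policy.)

```python
# Illustrative-only classifier: which exceptions should be blamed on the
# provider (fix and rerun later) rather than on the agent (graded as a failure).
import litellm

# Assumed set of provider-side errors; adjust to the actual policy.
EXTERNAL_PROVIDER_ERRORS = (
    litellm.exceptions.AuthenticationError,      # bad/expired key, out of budget
    litellm.exceptions.RateLimitError,           # HTTP 429
    litellm.exceptions.ServiceUnavailableError,  # HTTP 5xx-style outages
    litellm.exceptions.APIConnectionError,       # network / provider unreachable
)


def is_external_error(exc: Exception) -> bool:
    """True if the failure is the provider's fault, not the agent's.

    Context-window overflows, budget limits, and max-iteration limits are
    deliberately excluded: per the discussion above, those count as agent
    "incompetence" and should still be graded as failures.
    """
    return isinstance(exc, EXTERNAL_PROVIDER_ERRORS)
```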
What problem or use case are you trying to solve?
This is mostly for evaluation purposes. I am not sure what this mode should properly be named, so let me describe the scenario here:
Sometimes, OpenHands fails to complete a task in a benchmark because it's incompetent: it runs out of a predefined budget or max iterations, reaches the LLM context limit in a single conversation, or a loop is detected. In this case, OpenHands fails the task and ends the session. Benchmark evaluators, which are outside of OpenHands, would usually then declare the task/challenge failed/incomplete. This makes sense.
At other times, OpenHands fails to complete a task because of a recoverable error such as LLM rate limiting, the LLM key running out of budget, or even the LLM provider being down. This may not be that "transient", since an immediate retry usually doesn't work. It is, however, recoverable, and is not really OpenHands' fault. Benchmark evaluators, which are outside of OpenHands, would still grade the task, because OpenHands declares that it completed, albeit with an error.
My proposal is to introduce a new mode, say, fail-on-recoverable-error. When OpenHands sees litellm.exceptions.AuthenticationError, a rate-limiting error, or similar, it should just fail the entire program. Alternatively, it needs to indicate something in the final state so that outside evaluation code can differentiate between OpenHands' incompetence and LLM issues.
The benefit is that we can fix the LLM problem and rerun OpenHands for those tasks. Otherwise, we risk under-evaluating OpenHands' score in benchmarks.
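(A minimal sketch of what such a mode could look like, under stated assumptions: `run_agent` is a hypothetical callable that drives one OpenHands session, and the exit code is an arbitrary placeholder. This is not the actual OpenHands entry point or CLI.)

```python
# Hypothetical fail-on-recoverable-error mode: abort the whole run with a
# distinct exit code on provider errors, so the outer benchmark harness can
# tell "fix the LLM setup and rerun" apart from "the agent failed the task".
import sys

import litellm

RECOVERABLE_EXIT_CODE = 42  # arbitrary, just needs to differ from 0 and 1


def run_task(run_agent, task, fail_on_recoverable_error=False):
    """`run_agent` is a hypothetical callable driving one OpenHands session."""
    try:
        return run_agent(task)
    except (litellm.exceptions.AuthenticationError,
            litellm.exceptions.RateLimitError) as exc:
        if fail_on_recoverable_error:
            print(f'Recoverable provider error, aborting: {exc}', file=sys.stderr)
            sys.exit(RECOVERABLE_EXIT_CODE)
        # Today's behavior (as described above): finish "normally" but record
        # the error message in the final state, which evaluators then grade.
        return {'status': 'error', 'error': str(exc)}
```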
cc @xingyaoww