
FailOnRecoverableError mode #5240

Open
li-boxuan opened this issue Nov 24, 2024 · 6 comments
Labels
enhancement (New feature or request), evaluation (Related to running evaluations with OpenHands)

Comments

@li-boxuan
Collaborator

li-boxuan commented Nov 24, 2024

What problem or use case are you trying to solve?

This is mostly for evaluation purposes. I am not sure what this mode should properly be named, so let me describe the scenario here:

  1. Sometimes, OpenHands fails to complete a task in a benchmark because it's incompetent, runs out of its predefined budget or max iterations, reaches the LLM context limit in a single conversation, or because a loop is detected. In this case, OpenHands fails the task and ends the session. Benchmark evaluators, which are outside of OpenHands, would usually then declare the task/challenge as failed/incomplete. This makes sense.

  2. At other times, OpenHands fails to complete a task because there's a recoverable error such as LLM rate limiting, the LLM key running out of budget, or even the LLM provider being down. This may not be that "transient", since an immediate retry usually doesn't work. It is, however, recoverable, and it is not really OpenHands' fault. Benchmark evaluators, which are outside of OpenHands, would still grade the task because OpenHands declares that it completed with an error.

My proposal is to introduce a new mode, say, fail-on-recoverable-error. When OpenHands sees litellm.exceptions.AuthenticationError, a rate limiting error, or similar, it should just fail the entire program. Alternatively, it needs to indicate something in the final state so that outside evaluation code can differentiate between OpenHands' incompetence and LLM issues.

The benefit is that we can fix the LLM problem and rerun OpenHands for those tasks. Otherwise, we might under-evaluate OpenHands' score in benchmarks.
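To make this concrete, here is a minimal sketch of what the fail-fast path could look like. Everything here is hypothetical, not an existing OpenHands API, and the litellm exception names other than AuthenticationError are my assumption about what we'd want to catch:

    import sys

    import litellm

    # Hypothetical sketch, not existing OpenHands code: classify provider-side
    # exceptions and abort the whole run instead of ending the session with an
    # error message in the final state.
    RECOVERABLE_LLM_ERRORS = (
        litellm.exceptions.AuthenticationError,      # bad key / key out of budget
        litellm.exceptions.RateLimitError,           # provider rate limiting
        litellm.exceptions.ServiceUnavailableError,  # provider is down
    )

    def handle_llm_error(e: Exception, fail_on_recoverable_error: bool) -> None:
        """Abort immediately on errors that are the provider's fault, not the agent's."""
        if fail_on_recoverable_error and isinstance(e, RECOVERABLE_LLM_ERRORS):
            # Exit with a distinctive (arbitrary) code so outside evaluation code
            # can tell "provider problem, fix and rerun" apart from "agent failed".
            sys.exit(42)
        # Otherwise fall through to the normal handling: the session ends and the
        # error is recorded in the final state.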

C.C. @xingyaoww

@li-boxuan li-boxuan added the enhancement (New feature or request) label Nov 24, 2024
@xingyaoww
Collaborator

@li-boxuan I believe we have a similar mechanism implemented in our run_evaluation function:

def _process_instance_wrapper(
    process_instance_func: Callable[[pd.Series, EvalMetadata, bool], EvalOutput],
    instance: pd.Series,
    metadata: EvalMetadata,
    use_mp: bool,
    max_retries: int = 5,
    timeout_seconds: int | None = None,
) -> EvalOutput:
    """Wrap the process_instance_func to handle retries and errors."""
    for attempt in range(max_retries + 1):
        try:
            if timeout_seconds is not None:
                with timeout(timeout_seconds):
                    result = process_instance_func(instance, metadata, use_mp)
            else:
                result = process_instance_func(instance, metadata, use_mp)
            return result
        except EvalTimeoutException as e:
            error = f'Timeout after {timeout_seconds} seconds'
            stacktrace = traceback.format_exc()
            msg = (
                '-' * 10
                + '\n'
                + f'Timeout ({timeout_seconds} seconds) in instance [{instance.instance_id}], Stopped evaluation for this instance.'
                + '\n'
                + '-' * 10
            )
            logger.exception(e)
            return EvalOutput(
                instance_id=instance.instance_id,
                test_result={},
                error=error,
            )
        except Exception as e:
            error = str(e)
            stacktrace = traceback.format_exc()
            if attempt == max_retries:
                logger.exception(e)
                msg = (
                    '-' * 10
                    + '\n'
                    + f'Error in instance [{instance.instance_id}]: {error}. Stacktrace:\n{stacktrace}'
                    + '\n'
                    + f'[Encountered after {max_retries} retries. Please check the logs and report the issue.]'
                    + '-' * 10
                )
                # Raise an error after all retries & stop the evaluation
                logger.exception(e)
                raise RuntimeError(
                    f'Maximum error retries reached for instance {instance.instance_id}'
                ) from e
            msg = (
                '-' * 10
                + '\n'
                + f'Error in instance [{instance.instance_id}]: {error}. Stacktrace:\n{stacktrace}'
                + '\n'
                + '-' * 10
                + f'[The above error occurred. Retrying... (attempt {attempt + 1} of {max_retries})]'
                + '-' * 10
                + '\n'
            )
            logger.error(msg)
            if use_mp:
                print(msg)  # use print to directly print to console
            time.sleep(5)

@li-boxuan
Collaborator Author

li-boxuan commented Nov 24, 2024

@li-boxuan I believe we have a similar mechanism implemented in our run_evaluation function: [...]

@xingyaoww This wrapper is useful for capturing OpenHands' instability errors (e.g. a runtime crash), but not the external LLM provider errors (out of budget, rate limiting, etc.) I mentioned above. In those cases, the OpenHands main process doesn't throw any exception. Instead, it finishes execution with an error message. Consequently, the benchmarking code treats external LLM provider errors as OpenHands' incompetence.
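To illustrate the gap, here is a rough sketch, assuming the error message ends up in EvalOutput.error / the final state, of how the wrapper could reclassify such runs. The marker list and the is_provider_error helper are hypothetical, not existing code:

    # Hypothetical helper, not existing OpenHands code: decide whether an error
    # message recorded in the final state / EvalOutput looks provider-side.
    PROVIDER_ERROR_MARKERS = (
        'AuthenticationError',
        'RateLimitError',
        'ServiceUnavailableError',
    )

    def is_provider_error(error_message: str | None) -> bool:
        """Return True if the recorded error looks like an LLM/provider issue."""
        if not error_message:
            return False
        return any(marker in error_message for marker in PROVIDER_ERROR_MARKERS)

    # Inside _process_instance_wrapper, something like this could turn a
    # "completed with error" result back into an exception so the retry loop
    # above kicks in instead of the run being graded as a failure:
    #
    #     result = process_instance_func(instance, metadata, use_mp)
    #     if is_provider_error(result.error):
    #         raise RuntimeError(f'Provider error: {result.error}')
    #     return result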

@enyst
Collaborator

enyst commented Nov 25, 2024

I'm trying to understand how to separate these, because there is also a third kind: application errors of all kinds. Or I guess those are more like "incompetence".

Is it fair to say that your proposal of fail-on-recoverable-error is a kind of exit that would happen:

  • for all LLM errors
  • and only LLM errors?

@li-boxuan
Collaborator Author

li-boxuan commented Nov 25, 2024

application errors of all kinds. Or I guess those are more like "incompetence".

Yeah, they sound like "incompetence" to me. Actually, I do feel the current wrapper approach (try/except and rerun) is unfair to some extent, since software stability should definitely be part of the benchmark target itself, but I guess that's not how research works, so let it be.

Is it fair to say that your proposal of fail-on-recoverable-error is a kind of exit that would happen: for all LLM errors

Not really. An LLM out-of-context error is a counterexample. That's categorized as OpenHands' incompetence (as in, it cannot condense its prompt, OR fall back to a different LLM with longer context support, etc.).

Is it fair to say that your proposal of fail-on-recoverable-error is a kind of exit that would happen: and only LLM errors

Docker errors (I sometimes do see Docker crash on my laptop) or cloud provider errors (if ever running on a cloud-based sandbox) fall into this category too. Basically, any error that's out of OpenHands' control.

@enyst
Collaborator

enyst commented Nov 25, 2024

Not really. An LLM out-of-context error is a counterexample. That's categorized as OpenHands' incompetence (as in, it cannot condense its prompt, OR fall back to a different LLM with longer context support, etc.).

Ah of course, absolutely. I wasn't counting it 😅

Then,

  • "it didn't hit the LLM API for some temporary reason" kind of LLM errors (aka provider related, retryable, but retries failed; includes rate limits)
  • auth errors (they're permanent, but under user's control; includes out of budget)
  • maybe docker errors; sandbox crashes.

400 errors from the LLM API are probably not included, 502 are, 500... are.

Sorry, I'm rephrasing just so you can correct me and I can make sure I get the point.

In my understanding, then, what you're proposing is: there are all kinds of actors (users, third parties) that OpenHands is working with. Errors that come from those could be reported differently, like an extra State note.

@li-boxuan
Collaborator Author

In my understanding, then, what you're proposing is: there are all kinds of actors (users, third parties) that OpenHands is working with. Errors that come from those could be reported differently, like an extra State note.

Yep, that's what I am thinking. To users, those 3rd-party errors, along with OpenHands errors, are all equivalent: "this thing just doesn't work". But from a benchmarking perspective, the 3rd-party errors should be reported differently, in my opinion.
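For concreteness, a rough sketch of the "extra State note" idea; ErrorSource, EvalErrorNote, and the helpers in the comments are all hypothetical, not existing OpenHands fields:

    from dataclasses import dataclass
    from enum import Enum

    # Hypothetical sketch of the "extra State note" idea, not existing OpenHands code.
    class ErrorSource(str, Enum):
        AGENT = 'agent'              # OpenHands' own failures: loops, max iterations, context overflow
        THIRD_PARTY = 'third_party'  # provider down, rate limits, auth/budget, Docker/cloud crashes

    @dataclass
    class EvalErrorNote:
        message: str
        source: ErrorSource

    # Benchmark code could then grade only agent-side failures and rerun the rest:
    #
    #     if note is not None and note.source == ErrorSource.THIRD_PARTY:
    #         mark_for_rerun(instance_id)  # hypothetical helper
    #     else:
    #         grade(instance_id)           # hypothetical helper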

@mamoodi mamoodi added the evaluation (Related to running evaluations with OpenHands) label Nov 25, 2024