FailOnRecoverableError mode #5240
Comments
@li-boxuan I believe we have a similar mechanism implemented in OpenHands/evaluation/utils/shared.py, lines 300 to 364 (at commit eb2a0b1).
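(For context, a rough sketch of what such a try/except-and-rerun wrapper looks like; the function and parameter names below are illustrative placeholders, not the exact code in shared.py.)

```python
# Illustrative retry wrapper around a single evaluation instance.
# Names (process_instance, instance) are hypothetical placeholders.
import logging
import time

logger = logging.getLogger(__name__)


def run_with_retries(process_instance, instance, max_retries=3, backoff_seconds=30):
    """Rerun `process_instance` when the OpenHands process itself crashes.

    This only covers failures that surface as Python exceptions (e.g. a runtime
    crash). It never sees provider errors that end the run "successfully" with
    an error message recorded in the final state.
    """
    for attempt in range(1, max_retries + 1):
        try:
            return process_instance(instance)
        except Exception as exc:  # deliberately broad: any crash triggers a rerun
            logger.warning('Attempt %d/%d failed: %s', attempt, max_retries, exc)
            if attempt == max_retries:
                raise
            time.sleep(backoff_seconds)
```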
@xingyaoww This wrapper is useful for capturing OpenHands' instability errors (e.g., a runtime crash), but not the external LLM provider errors I mentioned above (out of budget, rate limiting, etc.). In those cases, the OpenHands main process doesn't throw any exception; instead, it finishes the execution with an error message. Consequently, the benchmarking code treats external LLM provider errors as OpenHands' incompetence.
I'm trying to understand how to separate these, because there is also a third kind: application errors of all kinds. Or I guess those are more like "incompetence". Is it fair to say that your proposal of fail-on-recoverable-error is a kind of exit that would happen:
Yeah, they sound like "incompetence" to me. Actually, I do feel the current wrapper approach (try/except and rerun) is unfair to some extent, since software stability should definitely be part of the benchmark target itself, but I guess that's not how research works, so let it be.
Not really. An LLM out-of-context error is a counterexample: that's categorized as OpenHands' incompetence (as in, it cannot condense its prompt, or fall back to a different LLM with longer context support, etc.).
Docker errors (I sometimes do see Docker crash on my laptop) or cloud provider errors (if ever running on a cloud-based sandbox) fall into this category too. Basically, any error that's out of OpenHands' control.
Ah of course, absolutely. I wasn't counting it 😅 Then, 400 errors from the LLM API are probably not included, while 502 and 500 are. Sorry, I'm rephrasing just so you can correct me and I make sure I get the point. In my understanding, then, what you're proposing is like: there are all kinds of actors, users, third parties, that
Yep, that's what I am thinking. Apparently those 3rd-party errors, along with OpenHands errors, look the same to users: "this thing just doesn't work". But from a benchmarking perspective, the 3rd-party errors should be reported differently, in my opinion.
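(To make this concrete, here is a rough sketch of how an evaluator might separate provider-side failures from agent-side ones. The specific litellm exception types listed are an assumption about what counts as "external", not an agreed policy.)

```python
# Illustrative-only classifier: which exceptions should be blamed on the
# provider (fix and rerun later) rather than on the agent (graded as a failure).
import litellm

# Assumed set of provider-side errors; adjust to the actual policy.
EXTERNAL_PROVIDER_ERRORS = (
    litellm.exceptions.AuthenticationError,      # bad/expired key, out of budget
    litellm.exceptions.RateLimitError,           # HTTP 429
    litellm.exceptions.ServiceUnavailableError,  # HTTP 5xx-style outages
    litellm.exceptions.APIConnectionError,       # network / provider unreachable
)


def is_external_error(exc: Exception) -> bool:
    """True if the failure is the provider's fault, not the agent's.

    Context-window overflows, budget limits, and max-iteration limits are
    deliberately excluded: per the discussion above, those count as agent
    "incompetence" and should still be graded as failures.
    """
    return isinstance(exc, EXTERNAL_PROVIDER_ERRORS)
```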
What problem or use case are you trying to solve?
This is mostly for evaluation purposes. I am not sure what this mode should properly be named, so let me describe the scenario here:
Sometimes, OpenHands fails to complete a task in a benchmark because it's incompetent: it runs out of a predefined budget or max iterations, reaches the LLM context limit in a single conversation, or a loop is detected. In this case, OpenHands fails the task and ends the session. Benchmark evaluators, which are outside of OpenHands, would usually then declare the task/challenge failed/incomplete. This makes sense.
At other times, OpenHands fails to complete a task because of a recoverable error such as LLM rate limiting, the LLM key running out of budget, or even the LLM provider being down. This may not be that "transient", since an immediate retry usually doesn't work. It is, however, recoverable, and is not really OpenHands' fault. Benchmark evaluators, which are outside of OpenHands, would still grade the task, because OpenHands declares that it completed, albeit with an error.
My proposal is to introduce a new mode, say, fail-on-recoverable-error. When OpenHands sees litellm.exceptions.AuthenticationError, a rate-limiting error, or similar, it should just fail the entire program. Alternatively, it needs to indicate something in the final state so that outside evaluation code can differentiate between OpenHands' incompetence and LLM issues.
The benefit is that we can fix the LLM problem and rerun OpenHands for those tasks. Otherwise, we risk under-evaluating OpenHands' score in benchmarks.
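(A minimal sketch of what such a mode could look like, under stated assumptions: `run_agent` is a hypothetical callable that drives one OpenHands session, and the exit code is an arbitrary placeholder. This is not the actual OpenHands entry point or CLI.)

```python
# Hypothetical fail-on-recoverable-error mode: abort the whole run with a
# distinct exit code on provider errors, so the outer benchmark harness can
# tell "fix the LLM setup and rerun" apart from "the agent failed the task".
import sys

import litellm

RECOVERABLE_EXIT_CODE = 42  # arbitrary, just needs to differ from 0 and 1


def run_task(run_agent, task, fail_on_recoverable_error=False):
    """`run_agent` is a hypothetical callable driving one OpenHands session."""
    try:
        return run_agent(task)
    except (litellm.exceptions.AuthenticationError,
            litellm.exceptions.RateLimitError) as exc:
        if fail_on_recoverable_error:
            print(f'Recoverable provider error, aborting: {exc}', file=sys.stderr)
            sys.exit(RECOVERABLE_EXIT_CODE)
        # Today's behavior (as described above): finish "normally" but record
        # the error message in the final state, which evaluators then grade.
        return {'status': 'error', 'error': str(exc)}
```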
cc @xingyaoww