Enabled native function calling for o1 + added support for a reasoning_effort option in the config. #6256

Open · wants to merge 12 commits into base: main
3 changes: 3 additions & 0 deletions config.template.toml
@@ -23,6 +23,9 @@ workspace_base = "./workspace"
# Cache directory path
#cache_dir = "/tmp/cache"

# Reasoning effort for o1 models (low, medium, high, or not set)
#reasoning_effort = "medium"

# Debugging enabled
#debug = false

2 changes: 2 additions & 0 deletions openhands/core/config/llm_config.py
@@ -40,6 +40,7 @@ class LLMConfig:
drop_params: Drop any unmapped (unsupported) params without causing an exception.
modify_params: Modify params allows litellm to do transformations like adding a default message, when a message is empty.
disable_vision: If model is vision capable, this option allows to disable image processing (useful for cost reduction).
reasoning_effort: The effort to put into reasoning. One of 'low', 'medium', or 'high', or unset (the default). Only applies to o1 models.
caching_prompt: Use the prompt caching feature if provided by the LLM and supported by the provider.
log_completions: Whether to log LLM completions to the state.
log_completions_folder: The folder to log LLM completions to. Required if log_completions is True.
@@ -79,6 +80,7 @@ class LLMConfig:
# Note: this setting is actually global, unlike drop_params
modify_params: bool = True
disable_vision: bool | None = None
reasoning_effort: str | None = None
caching_prompt: bool = True
log_completions: bool = False
log_completions_folder: str = os.path.join(LOG_DIR, 'completions')
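For illustration, a minimal sketch of setting the new field programmatically, assuming LLMConfig can be constructed directly with keyword arguments (import path inferred from the file path above, not verified here):

from openhands.core.config.llm_config import LLMConfig

# reasoning_effort accepts 'low', 'medium', or 'high'; leaving it as None
# (the default) omits the setting, matching "not set" in config.template.toml.
config = LLMConfig(
    model='o1',
    reasoning_effort='high',
)
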
10 changes: 9 additions & 1 deletion openhands/llm/async_llm.py
@@ -6,7 +6,11 @@

from openhands.core.exceptions import UserCancelledError
from openhands.core.logger import openhands_logger as logger
from openhands.llm.llm import LLM, LLM_RETRY_EXCEPTIONS
from openhands.llm.llm import (
LLM,
LLM_RETRY_EXCEPTIONS,
REASONING_EFFORT_SUPPORTED_MODELS,
)
from openhands.utils.shutdown_listener import should_continue


@@ -55,6 +59,10 @@ async def async_completion_wrapper(*args, **kwargs):
elif 'messages' in kwargs:
messages = kwargs['messages']

# Set reasoning effort for models that support it
if self.config.model.lower() in REASONING_EFFORT_SUPPORTED_MODELS:
kwargs['reasoning_effort'] = self.config.reasoning_effort

# ensure we work with a list of messages
messages = messages if isinstance(messages, list) else [messages]
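The same guard is repeated in the synchronous wrapper (llm.py) and the streaming wrapper (streaming_llm.py). A stripped-down sketch of the pattern outside the OpenHands classes, with hypothetical names except for reasoning_effort itself:

from functools import wraps

REASONING_EFFORT_SUPPORTED_MODELS = ['o1']

def with_reasoning_effort(completion_fn, model: str, effort: str | None):
    """Wrap a completion callable so reasoning_effort is injected only for
    models in the supported list (exact match on the lower-cased model name)."""
    @wraps(completion_fn)
    def wrapper(*args, **kwargs):
        if model.lower() in REASONING_EFFORT_SUPPORTED_MODELS:
            kwargs['reasoning_effort'] = effort
        return completion_fn(*args, **kwargs)
    return wrapper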

9 changes: 9 additions & 0 deletions openhands/llm/llm.py
@@ -71,6 +71,11 @@
'claude-3-5-haiku-20241022',
'gpt-4o-mini',
'gpt-4o',
'o1',
Review thread on this line:

Collaborator: Have we tried without native function calling, to compare results with it enabled versus disabled (prompting-based replacement)?

Collaborator: Just to note, strictly speaking, using native function calling is already supported; it's just not enabled by default. There's a native_function_calling setting to enable it.

Contributor Author: With native function calling the model solves 48% of the issues; with simulated function calling, 30%.

Contributor Author: I will make the results available soon; I still need to finish running SWE-Bench Verified (the result above is preliminary, after 300/500 issues).

Collaborator: That's a good result! I'm surprised; I'm losing track of our current evals, and I thought it was much lower last time.

Contributor Author: When using the current simulated tools from OH, O1's performance degrades significantly. It is quite interesting, because 4o's performance is not impacted as much (19% vs 12%).

Collaborator: That makes sense to me actually! We have seen significant differences before. That might include even Sonnet 3.5; I just think we don't know for sure why, because when it jumped from something like ~26% to over 50%, three things happened:

  • switched from simulated "actions" to native tool calling
  • also redefined the prompts/tools to be very, very close to Anthropic's tools
  • also went from Sonnet 3.5 (old) to Sonnet 3.5 (new) 😂

I'm not sure that we know which factor mattered how much on that one. 😅

Are these preliminary results from this branch, or the supervisor branch?

Contributor Author: Interesting! O1_native_tool_calls gets a higher score than Sonnet 3.5 (but not way higher, in no way enough to justify its price), so being close to Anthropic's tools might matter, but not that much.

The results will be shared today on Hugging Face; I am currently evaluating them using the harness. The supervisor branch will be done soon, but I will run the experiments first and then update the branch before or after the ICML deadline (30 Jan), depending on how much work I have left 😅

]

REASONING_EFFORT_SUPPORTED_MODELS = [
'o1',
]
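Note that membership here is an exact match on the lower-cased model name, so provider-prefixed or versioned identifiers would not pick up the setting as written. A quick illustration with hypothetical model strings:

REASONING_EFFORT_SUPPORTED_MODELS = ['o1']

for model in ['o1', 'O1', 'openai/o1', 'o1-2024-12-17']:
    print(model, model.lower() in REASONING_EFFORT_SUPPORTED_MODELS)
# o1 True
# O1 True
# openai/o1 False
# o1-2024-12-17 False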


@@ -205,6 +210,10 @@ def wrapper(*args, **kwargs):
'anthropic-beta': 'prompt-caching-2024-07-31',
}

# Set reasoning effort for models that support it
if self.config.model.lower() in REASONING_EFFORT_SUPPORTED_MODELS:
kwargs['reasoning_effort'] = self.config.reasoning_effort

# set litellm modify_params to the configured value
# True by default to allow litellm to do transformations like adding a default message, when a message is empty
# NOTE: this setting is global; unlike drop_params, it cannot be overridden in the litellm completion partial
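Downstream, the injected keyword is simply forwarded to litellm along with the other completion arguments. A hedged sketch of the resulting call, assuming the installed litellm version accepts reasoning_effort for o-series models (worth confirming against the pinned litellm release):

import litellm

# Roughly what the wrapped completion issues for an o1 config with
# reasoning_effort='high'; the message content is only a placeholder.
response = litellm.completion(
    model='o1',
    messages=[{'role': 'user', 'content': 'Summarize the repository layout.'}],
    reasoning_effort='high',
)
print(response.choices[0].message.content)
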
5 changes: 5 additions & 0 deletions openhands/llm/streaming_llm.py
@@ -5,6 +5,7 @@
from openhands.core.exceptions import UserCancelledError
from openhands.core.logger import openhands_logger as logger
from openhands.llm.async_llm import LLM_RETRY_EXCEPTIONS, AsyncLLM
from openhands.llm.llm import REASONING_EFFORT_SUPPORTED_MODELS


class StreamingLLM(AsyncLLM):
@@ -61,6 +62,10 @@ async def async_streaming_completion_wrapper(*args, **kwargs):
'The messages list is empty. At least one message is required.'
)

# Set reasoning effort for models that support it
if self.config.model.lower() in REASONING_EFFORT_SUPPORTED_MODELS:
kwargs['reasoning_effort'] = self.config.reasoning_effort

self.log_prompt(messages)

try: