Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enabled native function calling for O1 + added support for reasoning_effort config in the config. #6256
base: main
Are you sure you want to change the base?
Enabled native function calling for O1 + added support for reasoning_effort config in the config. #6256
Changes from all commits
715cb87
a9b6554
f47ff60
52ba208
1f1a76c
a9eef0b
3dc1161
40202e3
8517ac4
e33ecb4
87c68a8
567221e
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have we tried without native function calling, to compare results between with it enabled and it disabled (prompting-based replacement)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to note, strictly speaking using native is already supported, it's just not enabled by default. But there's a native_function_calling setting to enable it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With native function calling the model solves 48% of the issues, with simulated function calling, 30%
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will make the results available soon, I still need to finish running SWE-Bench Verified (the result above is preliminary after running 300/500 issues)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good result! I'm surprised, I'm losing track of our current evals, I thought it was much lower last time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When using the current simulated tools from OH, O1's performance degrades significantly. It is quite interesting because 4o's performance is not impacted as much (19% vs 12%)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense to me actually! We have seen significant differences before. That might include even Sonnet 3.5, I just think we don't know for sure why, because when it jumped from something like ~26% to over 50%, three things happened:
I'm not sure that we know which factor mattered how much on that one. 😅
These preliminary results are on this branch, or the supervisor branch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting! O1_native_tool_calls gets a higher score than Sonnet 3.5 (but not way higher, in no way enough to justify its price), so being close to Anthrotopic tools might matter but not that much.
The results will be shared today in Huggingface, I am currently evaluating them using the harness.
The supervisor branch will be done soon, but I will run the experiments first and then update the branch before or after ICML deadline (30 Jan), depending on how much work left I have 😅