Adjust prompt to use view command #5506
There we go:
|
I ran 13 instances that are unresolved (0/13) in the CodeAct 2.2 results. They're all on django, and all part of the intersection of Lite with Verified. CodeAct2.2: 0/13 Too little to matter, but FWIW! @xingyaoww |
I'm thinking about whether we should still make this change in the prompt, as encouraging the agent to use |
Running evaluation on the PR. Once eval is done, the results will be posted. |
Evaluation results: ## Summary
Empty patches were from the litellm proxy error:
|
Haven't automated this part yet so here ya go: |
@openhands-agent Your last attempt to fix the conflicts didn't work. Please do this again: pull main into this branch and fix the conflicts. |
@xingyaoww What are your thoughts on this one?
In regular use the past month, with the resolver, the llm asks for the |
@enyst hmm - i can probably run a larger-scale (100 instance) one later today? |
OK, but I can do that, if the remote runtime cooperates today. Or can we sweet-talk Mamoodi to help? ❤️ |
❤️ if it is easy, could you run one? :D LMK if you need more LLM credits and/or remote runtime concurrency. Otherwise let's see if @mamoodi has the bandwidth to help 🙏 |
I'll give it a go! |
This PR branch: Summary
Best from another PR: Summary
Last known main: 41 / 100 It looks good! @xingyaoww full archive is on slack |
@enyst are you running with max iteration of 100 or 30? |
30: > claude-3-5-sonnet-20241022_maxiter_30_N_v0.20.0-no-hint-run_1 |
48/100 for max 30 turns looks great! This LGTM
Remove the NOT FOR MERGE before merging? :) |
Oh, indeed Django made the difference! It's by far the largest repo:
|
Very weird... after merging this into one of my branches and running a full SWE-Bench Verified (compared to our previous 53% run), django actually got a lot of failures :( I suspect it is because "view" only goes up to two levels deep. And at two levels, it doesn't show the agent which folders are expandable. I'd suggest we show the type of each file/folder in the output of
|
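The suggestion above (marking which entries are folders in a depth-limited listing) could look roughly like the sketch below. This is not the actual `view` implementation: `view_directory` and its `max_depth` parameter are hypothetical names, and the trailing `/` as a folder marker is just one possible convention.

```python
import os


def view_directory(root: str, max_depth: int = 2) -> list[str]:
    """List entries under `root` up to `max_depth` levels deep.

    Hypothetical helper: directories get a trailing "/" so a reader can
    tell that an entry at the depth limit can still be expanded.
    """
    entries: list[str] = []

    def walk(path: str, depth: int) -> None:
        if depth > max_depth:
            return
        for name in sorted(os.listdir(path)):
            full = os.path.join(path, name)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            if os.path.isdir(full):
                entries.append(rel + "/")  # folder marker (assumed convention)
                walk(full, depth + 1)
            else:
                entries.append(rel)

    walk(root, 1)
    return entries
```

With this shape, a folder sitting exactly at the depth limit still shows up as `name/`, so the agent has a hint that calling the listing again on that path would reveal more.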
That is very weird, it doesn't list a directory? How exactly does it get confused? I would love to look into the The closest I've seen in the previous run looked OK actually, when the LLM needed more depth it did something like this:
|
Give a summary of what the PR does, explaining any non-trivial design decisions
This is the prompt adjustment I used, with the purpose that the LLM uses the `view` tool for directories, which is part of its `file_editor` tool, more than other options (`ls -R /workspace` or `ls -la /workspace`).

I think it would be interesting to eval this after Ryan's fix is merged in `main`.

Reason for this experiment:
I was surprised to see in the event stream of CodeAct 2.2 swe-bench run:
- `ls -R /workspace` a lot
- `ls -la` sometimes

`ls -R /workspace` is tough, on large repos. On the django repo it overflows the observation limit by a lot (over 100k tokens!), so we truncate it to ~10k tokens. But that difference also means we truncate it to less than 10%, taking only the beginning and the end, so the information the agent gets is very sparse and lopsided. So it ends up messing around in the repo a lot more than if it had used its `view` with depth 2.

To run this PR locally, use the following command:
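Separately, to make the truncation problem described above concrete, here is a minimal sketch of head-and-tail truncation. It is an illustration under assumptions, not the project's actual truncation code: `truncate_middle`, the marker text, and the synthetic listing are all hypothetical, and characters stand in for tokens.

```python
def truncate_middle(text: str, max_chars: int) -> str:
    """Keep only the beginning and end of `text`, dropping the middle.

    Hypothetical helper; characters are used as a stand-in for tokens.
    """
    if len(text) <= max_chars:
        return text
    half = max(1, max_chars // 2)
    marker = "\n[... observation truncated ...]\n"  # assumed marker text
    return text[:half] + marker + text[-half:]


# Synthetic stand-in for a large recursive listing (not real django paths).
listing = "\n".join(f"/workspace/pkg_{i:05d}/module.py" for i in range(20_000))

# Budget of roughly 10% of the original, mirroring ~10k out of 100k+ tokens.
shown = truncate_middle(listing, max_chars=len(listing) // 10)

print(f"kept about {len(shown) / len(listing):.0%} of the listing")
print("middle entry visible:", "pkg_10000" in shown)
```

Everything between the first and last few percent of the listing disappears, which is why the agent sees only a sparse, lopsided picture of a large repo.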