
Adjust prompt to use view command #5506

Merged: 15 commits from enyst/test_view_depth into main on Jan 21, 2025
Conversation

@enyst (Collaborator) commented Dec 10, 2024

Give a summary of what the PR does, explaining any non-trivial design decisions

This is the prompt adjustment I used, with the goal of getting the LLM to use the view command of its file_editor tool for directories more often than the alternatives (ls -R /workspace or ls -la /workspace).

I think it would be interesting to eval this after Ryan's fix is merged into main.

Reason for this experiment:
I was surprised to see, in the event stream of the CodeAct 2.2 SWE-bench run, that:

  • the agent uses ls -R /workspace a lot
  • and ls -la sometimes.

ls -R /workspace is rough on large repos. On the django repo it overflows the observation limit by a lot (over 100k tokens!), so we truncate it to ~10k tokens. That also means the agent sees less than 10% of the output, only the beginning and the end, so the information it gets is very sparse and lopsided. As a result it ends up messing around in the repo a lot more than if it had used its view with depth 2.
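For intuition, here is a minimal sketch of that kind of head-and-tail truncation, using a character budget as a stand-in for the token limit (the helper and its limits are illustrative, not the actual OpenHands truncation code):

def truncate_middle(text: str, max_chars: int = 40_000) -> str:
    """Keep only the beginning and the end of an oversized observation."""
    # Illustrative sketch: the real limits and truncation marker may differ.
    if len(text) <= max_chars:
        return text
    half = max_chars // 2
    # Everything in the middle of a huge `ls -R` listing is dropped, which is
    # why the agent is left with sparse, lopsided information about the repo.
    return text[:half] + "\n... [observation truncated] ...\n" + text[-half:]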


To run this PR locally, use the following command:

docker run -it --rm \
    -p 3000:3000 \
    -v /var/run/docker.sock:/var/run/docker.sock \
    --add-host host.docker.internal:host-gateway \
    -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:23bbd9c-nikolaik \
    --name openhands-app-23bbd9c \
    docker.all-hands.dev/all-hands-ai/openhands:23bbd9c

@enyst enyst marked this pull request as draft December 10, 2024 07:12
@enyst (Collaborator, Author) commented Dec 10, 2024

There we go:

THOUGHT: I'll help you implement the necessary changes to fix the issue with `django.utils.http.parse_http_date`. Let's follow the steps you outlined.

1. First, let's explore the repository structure:
CODE:
print(file_editor(**{'command': 'view', 'path': '/workspace/django__django__3.1'}))

OBSERVATION
[Agent Controller default] **IPythonRunCellObservation**
Here's the files and directories up to 2 levels deep in /workspace/django__django__3.1, excluding hidden items:
/workspace/django__django__3.1
/workspace/django__django__3.1/AUTHORS
/workspace/django__django__3.1/CONTRIBUTING.rst
/workspace/django__django__3.1/FUNDING.yml
/workspace/django__django__3.1/Gruntfile.js
/workspace/django__django__3.1/INSTALL
/workspace/django__django__3.1/LICENSE
/workspace/django__django__3.1/LICENSE.python
...

@enyst (Collaborator, Author) commented Dec 10, 2024

I ran 13 instances that are unresolved (0/13) in the CodeAct 2.2 results. They're all on django, and all part of the intersection of Lite with Verified.

CodeAct 2.2: 0/13
Branch: 1/13.

Too little to matter, but FWIW! @xingyaoww

@ryanhoangt (Contributor) commented Dec 11, 2024

I'm thinking about whether we should still make this change in the prompt, as encouraging the agent to use view over ls -R can save us on tokens, hence allowing the agent to execute more steps before reaching the context limit 🤔

@enyst enyst added the run-eval-m Runs evaluation with 30 instances label Dec 13, 2024
@enyst enyst added run-eval-m Runs evaluation with 30 instances and removed run-eval-m Runs evaluation with 30 instances labels Dec 13, 2024
Running evaluation on the PR. Once eval is done, the results will be posted.

@openhands-agent (Contributor) commented Dec 13, 2024

Evaluation results summary:

  • submitted instances: 30
  • empty patch instances: 12
  • resolved instances: 8
  • unresolved instances: 22
  • error instances: 0

Empty patches were from the litellm proxy error:

2024-12-13 11:47:01,561 - ERROR - [Agent Controller default] Error while running the agent: litellm.NotFoundError: NotFoundError: OpenAIException - Error code: 404 - {'error': {'message': 'litellm.NotFoundError: AnthropicException - {"type":"error","error":{"type":"not_found_error","message":"model: *"}}\nReceived Model Group=claude-3-5-sonnet-20241022.......
 'code': '404'}}

@mamoodi (Collaborator) commented Dec 13, 2024

Haven't automated this part yet so here ya go:
evaluation.zip

@enyst (Collaborator, Author) commented Jan 5, 2025

@openhands-agent Your last attempt to fix the conflicts didn't work. Please do this again: pull main into this branch and fix the conflicts.

@openhands-agent (Contributor) commented:

OpenHands started fixing the PR! You can monitor the progress here.

@All-Hands-AI All-Hands-AI deleted a comment from openhands-agent Jan 5, 2025
@All-Hands-AI All-Hands-AI deleted a comment from openhands-agent Jan 5, 2025
@enyst enyst added lint-fix and removed run-eval-m Runs evaluation with 30 instances labels Jan 5, 2025
@enyst (Collaborator, Author) commented Jan 21, 2025

@xingyaoww What are your thoughts on this one?

  • the 13 instances eval got a small improvement
  • the 30 instances eval is inconclusive (12 instances ran into a litellm error, bad day).

In regular use over the past month, with the resolver, the LLM asks for the view command quite regularly. But in the last official eval (CodeAct 2.2) it didn't; it mostly used ls -R /workspace, and I think view really should be better, at least on a large repo like django.

@xingyaoww (Collaborator) commented:

@enyst hmm - i can probably run a larger-scale (100 instance) one later today?

@enyst (Collaborator, Author) commented Jan 21, 2025

> @enyst hmm - i can probably run a larger-scale (100 instance) one later today?

OK, but I can do that, if the remote runtime cooperates today. Or can we sweet-talk Mamoodi to help? ❤️

@xingyaoww (Collaborator) commented:

❤️ If it is easy, could you run one? :D LMK if you need more LLM credits and/or remote runtime concurrency. Otherwise let's see if @mamoodi has the bandwidth to help 🙏

@enyst (Collaborator, Author) commented Jan 21, 2025

I'll give it a go!

@enyst (Collaborator, Author) commented Jan 21, 2025

This PR branch:

Summary

  • submitted instances: 100
  • empty patch instances: 5
  • resolved instances: 48
  • unresolved instances: 52
  • error instances: 0

Best from another PR:

Summary

  • submitted instances: 100
  • empty patch instances: 14
  • resolved instances: 43
  • unresolved instances: 57
  • error instances: 0

Last known main: 41 / 100

It looks good! @xingyaoww, the full archive is on Slack.

@xingyaoww (Collaborator) commented:

@enyst are you running with a max iteration of 100 or 30?

@enyst (Collaborator, Author) commented Jan 21, 2025

30: > claude-3-5-sonnet-20241022_maxiter_30_N_v0.20.0-no-hint-run_1

@xingyaoww (Collaborator) left a review comment:

48/100 for max 30 turns looks great! This LGTM

@enyst enyst marked this pull request as ready for review January 21, 2025 21:46
@mamoodi (Collaborator) commented Jan 21, 2025

Remove the NOT FOR MERGE before merging? :)

@enyst enyst changed the title [NOT FOR MERGE] Adjust prompt to use view command Adjust prompt to use view command Jan 21, 2025
@enyst (Collaborator, Author) commented Jan 21, 2025

Oh, indeed Django made the difference! It's by far the largest repo:
(x = branch, y = main)

django:
Difference: 8 instances!
X resolved but Y failed: (12 instances)
  ['django__django-11066', 'django__django-11179', 'django__django-11265', 'django__django-11276', 'django__django-12155', 'django__django-12262', 'django__django-12276', 'django__django-12304', 'django__django-12708', 'django__django-12858', 'django__django-13028', 'django__django-13112']
Y resolved but X failed: (4 instances)
  ['django__django-11815', 'django__django-12039', 'django__django-12273', 'django__django-13033']

astropy:
Difference: 3 instances!
X resolved but Y failed: (4 instances)
  ['astropy__astropy-12907', 'astropy__astropy-14096', 'astropy__astropy-14539', 'astropy__astropy-14995']
Y resolved but X failed: (1 instance)
  ['astropy__astropy-14365']
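
For reference, the per-repo breakdown above can be computed from the two runs' resolved-instance lists with something like this (a hypothetical helper, not the actual eval tooling; the difference is simply 12 - 4 = 8 for django and 4 - 1 = 3 for astropy):

def compare_runs(x_resolved: set[str], y_resolved: set[str]) -> None:
    """Compare resolved instances of two eval runs (x = branch, y = main)."""
    only_x = sorted(x_resolved - y_resolved)  # X resolved but Y failed
    only_y = sorted(y_resolved - x_resolved)  # Y resolved but X failed
    print(f"Difference: {len(only_x) - len(only_y)} instances!")
    print(f"X resolved but Y failed: ({len(only_x)} instances)")
    print(f"  {only_x}")
    print(f"Y resolved but X failed: ({len(only_y)} instances)")
    print(f"  {only_y}")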

@enyst enyst merged commit f0dbb02 into main Jan 21, 2025
13 checks passed
@enyst enyst deleted the enyst/test_view_depth branch January 21, 2025 22:50
@xingyaoww (Collaborator) commented:

[screenshot: SWE-Bench Verified results]

Very weird... after merging this into one of my branches and running a full SWE-Bench Verified eval (compared to our prev 53% run) -- django actually got a lot more failures :(

I suspect it is because "view" only goes up to two levels of depth, and at two levels it doesn't show the agent which folders are expandable.

I'd suggest we could show the type of each file/folder in the output of the view command:

/workspace/django__django__3.0/django/middleware # folder:
/workspace/django__django__3.0/django/shortcuts.py # file
/workspace/django__django__3.0/django/template/ # folder: X files under this directory
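
A minimal sketch of how such annotated output could be produced (a hypothetical helper, not the actual file_editor view implementation):

import os

def annotated_listing(root: str, max_depth: int = 2) -> list[str]:
    """Walk `root` up to max_depth levels, marking folders and their entry counts."""
    # Hypothetical sketch of the suggestion above; the real view output may differ.
    lines = []
    root = root.rstrip(os.sep)
    base_depth = root.count(os.sep)
    for dirpath, dirnames, filenames in os.walk(root):
        depth = dirpath.count(os.sep) - base_depth
        if depth >= max_depth:
            dirnames[:] = []  # do not descend further
            continue
        for name in sorted(dirnames):
            full = os.path.join(dirpath, name)
            n_entries = len(os.listdir(full))
            lines.append(f"{full}/  # folder: {n_entries} entries under this directory")
        for name in sorted(filenames):
            lines.append(f"{os.path.join(dirpath, name)}  # file")
    return lines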

@enyst (Collaborator, Author) commented Jan 23, 2025

That is very weird. It doesn't list a directory? How exactly does it get confused? I would love to look into the llm_completions of the failed instances.

The closest thing I've seen in the previous run actually looked OK: when the LLM needed more depth, it did something like this:

let's explore ... view /workspace/django
I see our problem is in <subdirectory>, so let's explore it ... view /workspace/django/subdirectory
Now I understand what happens.
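
In tool-call form, that drill-down looks roughly like this (the paths and subdirectory placeholder are illustrative):

# First pass: view the repository root, which lists up to two levels deep
print(file_editor(**{'command': 'view', 'path': '/workspace/django'}))
# After spotting the relevant subdirectory, view it to get more depth there
print(file_editor(**{'command': 'view', 'path': '/workspace/django/subdirectory'}))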
