Adjust prompt to use view command #5506

enyst · 2024-12-10T07:12:23Z

Give a summary of what the PR does, explaining any non-trivial design decisions

This is the prompt adjustment I used to get the LLM to use the `view` command for directories (part of its file_editor tool) more often than other options, such as `ls -R /workspace` or `ls -la /workspace`.

I think it would be interesting to eval this after Ryan's fix is merged into main.

Reason for this experiment:
I was surprised to see the following in the event stream of the CodeAct 2.2 SWE-bench run:

- the agent uses `ls -R /workspace` a lot
- and `ls -la` sometimes.

`ls -R /workspace` is a problem on large repos. On the django repo its output exceeds the observation limit by a lot (over 100k tokens!), so we truncate it to ~10k tokens. That means the agent sees less than 10% of the listing, only the beginning and the end, so the information it gets is very sparse and lopsided. As a result it ends up poking around the repo a lot more than if it had used `view`, which lists only 2 levels deep.
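To make the "beginning and end only" point concrete, here is a minimal sketch of head-and-tail truncation applied to a synthetic listing of roughly the django scale. It assumes characters as a stand-in for tokens; `truncate_middle`, the marker text, and the fake paths are all hypothetical, not OpenHands' actual truncation code.

```python
def truncate_middle(text: str, max_chars: int) -> str:
    """Keep the beginning and end of `text`, dropping the middle.

    Illustrative only: characters stand in for tokens, and the marker text is
    made up; this is not OpenHands' actual truncation code.
    """
    if len(text) <= max_chars:
        return text
    marker = "\n[... truncated ...]\n"
    keep = max(max_chars - len(marker), 0)  # budget left for real content
    head = keep // 2
    tail = keep - head
    # Slice the tail with an explicit start index so tail == 0 yields "".
    return text[:head] + marker + text[len(text) - tail:]


# A fake recursive listing of 100,000 paths, truncated to ~10% of its size.
listing = "\n".join(f"django/module_{i:05d}/file.py" for i in range(100_000))
clipped = truncate_middle(listing, len(listing) // 10)

shown = clipped.count("file.py")
print(f"paths shown: {shown} of 100000")
```

Only roughly the first and last few thousand paths survive; everything in the middle of the tree is simply gone, which is the lopsided picture described above.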

To run this PR locally, use the following command:

docker run -it --rm   -p 3000:3000   -v /var/run/docker.sock:/var/run/docker.sock   --add-host host.docker.internal:host-gateway   -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:23bbd9c-nikolaik   --name openhands-app-23bbd9c   docker.all-hands.dev/all-hands-ai/openhands:23bbd9c

enyst · 2024-12-10T14:38:43Z

There we go:

THOUGHT: I'll help you implement the necessary changes to fix the issue with `django.utils.http.parse_http_date`. Let's follow the steps you outlined.

1. First, let's explore the repository structure:
CODE:
print(file_editor(**{'command': 'view', 'path': '/workspace/django__django__3.1'}))

OBSERVATION
[Agent Controller default] **IPythonRunCellObservation**
Here's the files and directories up to 2 levels deep in /workspace/django__django__3.1, excluding hidden items:
/workspace/django__django__3.1
/workspace/django__django__3.1/AUTHORS
/workspace/django__django__3.1/CONTRIBUTING.rst
/workspace/django__django__3.1/FUNDING.yml
/workspace/django__django__3.1/Gruntfile.js
/workspace/django__django__3.1/INSTALL
/workspace/django__django__3.1/LICENSE
/workspace/django__django__3.1/LICENSE.python
...
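For comparison, the `view` output above lists entries up to 2 levels deep and excludes hidden items. A minimal sketch of that behaviour, assuming a plain recursive walk over the local filesystem; `view_dir` is a hypothetical name, and the real `view` command lives in openhands-aci and may differ.

```python
from pathlib import Path


def view_dir(root: str, max_depth: int = 2) -> list[str]:
    """List non-hidden files and directories up to `max_depth` levels below `root`.

    Hypothetical re-creation of the behaviour shown above, not the
    openhands-aci implementation.
    """
    root_path = Path(root)
    lines = [str(root_path)]

    def walk(path: Path, depth: int) -> None:
        for entry in sorted(path.iterdir()):
            if entry.name.startswith("."):  # "excluding hidden items"
                continue
            lines.append(str(entry))
            # Only descend while we are still above the depth limit.
            if entry.is_dir() and depth < max_depth:
                walk(entry, depth + 1)

    walk(root_path, 1)
    return lines
```

For example, `print("\n".join(view_dir("/workspace/django__django__3.1")))` would produce a listing shaped like the observation above. Its size grows with the top two levels of the tree rather than with the whole repo, which is why it fits in the observation limit where `ls -R` does not.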

enyst · 2024-12-10T16:27:48Z

I ran 13 instances that were unresolved (0/13) in the CodeAct 2.2 results. They're all from django, and all in the intersection of SWE-bench Lite and Verified.

CodeAct2.2: 0/13
Branch: 1/13.

Too little to matter, but FWIW! @xingyaoww

ryanhoangt · 2024-12-11T14:43:31Z

I'm wondering whether we should make this prompt change anyway, since encouraging the agent to use `view` over `ls -R` saves tokens, which lets the agent execute more steps before reaching the context limit 🤔

github-actions · 2024-12-13T11:34:06Z

Running evaluation on the PR. Once eval is done, the results will be posted.

openhands-agent · 2024-12-13T12:12:16Z

Evaluation results:

Summary

submitted instances: 30
empty patch instances: 12
resolved instances: 8
unresolved instances: 22
error instances: 0

The empty patches were caused by a litellm proxy error:

2024-12-13 11:47:01,561 - ERROR - [Agent Controller default] Error while running the agent: litellm.NotFoundError: NotFoundError: OpenAIException - Error code: 404 - {'error': {'message': 'litellm.NotFoundError: AnthropicException - {"type":"error","error":{"type":"not_found_error","message":"model: *"}}\nReceived Model Group=claude-3-5-sonnet-20241022.......
 'code': '404'}}

mamoodi · 2024-12-13T13:28:14Z

Haven't automated this part yet so here ya go:
evaluation.zip

enyst · 2025-01-05T02:59:08Z

@openhands-agent Your last attempt to fix the conflicts didn't work. Please do this again: pull main into this branch and fix the conflicts.

openhands-agent · 2025-01-05T02:59:27Z

OpenHands started fixing the pr! You can monitor the progress here.

enyst · 2025-01-21T14:03:25Z

@xingyaoww What are your thoughts on this one?

- the 13-instance eval showed a small improvement
- the 30-instance eval is inconclusive (12 instances ran into a litellm error, bad day).
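A quick back-of-the-envelope check of why the 30-instance run is hard to read, assuming (as stated in the eval comment) that all 12 empty patches came from the litellm proxy error rather than from the agent:

```python
# Figures from the 30-instance eval summary; the assumption is that every
# empty patch was caused by the litellm proxy 404, not by the agent itself.
submitted = 30
empty_patch = 12
resolved = 8

ran_cleanly = submitted - empty_patch  # instances that actually got an agent run
print(f"resolved / submitted:   {resolved}/{submitted} = {resolved / submitted:.1%}")
print(f"resolved / ran cleanly: {resolved}/{ran_cleanly} = {resolved / ran_cleanly:.1%}")
```

This prints 26.7% and 44.4%. The two denominators give very different pictures, and with only 18 clean runs neither number says much on its own.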

In regular use over the past month, with the resolver, the LLM asks for the `view` command quite regularly. But in the last official eval (CodeAct 2.2) it doesn't: it mostly uses `ls -R /workspace`, and I think `view` really should be better, at least on a large repo like django.

xingyaoww · 2025-01-21T14:16:06Z

@enyst hmm - i can probably run a larger-scale (100 instance) one later today?

enyst · 2025-01-21T14:42:15Z

> @enyst hmm - i can probably run a larger-scale (100 instance) one later today?

OK, but I can do that too, if the remote runtime cooperates today. Or can we sweet-talk Mamoodi into helping? ❤️

xingyaoww · 2025-01-21T14:49:02Z

❤️ if it is easy, could you run one? :D LMK if you need more LLM credits and/or remote runtime concurrency. Otherwise let's see if @mamoodi has the bandwidth to help 🙏

enyst · 2025-01-21T14:54:31Z

I'll give it a go!

enyst · 2025-01-21T21:01:00Z

This PR branch:

Summary

submitted instances: 100
empty patch instances: 5
resolved instances: 48
unresolved instances: 52
error instances: 0

Best from another PR:

Summary

submitted instances: 100
empty patch instances: 14
resolved instances: 43
unresolved instances: 57
error instances: 0

Last known main: 41 / 100

It looks good! @xingyaoww the full archive is on Slack.

xingyaoww · 2025-01-21T21:16:53Z

@enyst are you running with max iterations of 100 or 30?

enyst · 2025-01-21T21:24:38Z

30: `claude-3-5-sonnet-20241022_maxiter_30_N_v0.20.0-no-hint-run_1`

xingyaoww

48/100 for max 30 turns looks great! This LGTM

mamoodi · 2025-01-21T21:50:51Z

Remove the [NOT FOR MERGE] from the title before merging? :)

enyst · 2025-01-21T22:37:16Z

Oh, indeed Django made the difference! It's by far the largest repo:
(X = this branch, Y = main)

django:
Difference: 8 instances!
X resolved but Y failed: (12 instances)
  ['django__django-11066', 'django__django-11179', 'django__django-11265', 'django__django-11276', 'django__django-12155', 'django__django-12262', 'django__django-12276', 'django__django-12304', 'django__django-12708', 'django__django-12858', 'django__django-13028', 'django__django-13112']
Y resolved but X failed: (4 instances)
  ['django__django-11815', 'django__django-12039', 'django__django-12273', 'django__django-13033']

astropy:
Difference: 3 instances!
X resolved but Y failed: (4 instances)
  ['astropy__astropy-12907', 'astropy__astropy-14096', 'astropy__astropy-14539', 'astropy__astropy-14995']
Y resolved but X failed: (1 instance)
  ['astropy__astropy-14365']
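The per-repo breakdown above amounts to a set difference on resolved instance IDs, grouped by repository. A minimal sketch, assuming each run's resolved IDs are available as a Python set; `repo_of` and `diff_by_repo` are hypothetical helpers, not the script that produced the numbers above.

```python
from collections import defaultdict


def repo_of(instance_id: str) -> str:
    """'django__django-11066' -> 'django' (IDs look like owner__repo-number)."""
    return instance_id.rsplit("-", 1)[0].split("__", 1)[1]


def diff_by_repo(resolved_x: set[str], resolved_y: set[str]) -> dict[str, dict[str, list[str]]]:
    """Per repo: instances only run X resolved, and instances only run Y resolved."""
    out: dict[str, dict[str, list[str]]] = defaultdict(lambda: {"x_only": [], "y_only": []})
    for iid in sorted(resolved_x - resolved_y):
        out[repo_of(iid)]["x_only"].append(iid)
    for iid in sorted(resolved_y - resolved_x):
        out[repo_of(iid)]["y_only"].append(iid)
    return dict(out)


# Only the IDs listed above; instances resolved by both runs would cancel out anyway.
branch_resolved = {
    "django__django-11066", "django__django-11179", "django__django-11265",
    "django__django-11276", "django__django-12155", "django__django-12262",
    "django__django-12276", "django__django-12304", "django__django-12708",
    "django__django-12858", "django__django-13028", "django__django-13112",
    "astropy__astropy-12907", "astropy__astropy-14096",
    "astropy__astropy-14539", "astropy__astropy-14995",
}
main_resolved = {
    "django__django-11815", "django__django-12039",
    "django__django-12273", "django__django-13033",
    "astropy__astropy-14365",
}

for repo, d in sorted(diff_by_repo(branch_resolved, main_resolved).items()):
    print(f"{repo}: Difference: {len(d['x_only']) - len(d['y_only'])} instances")
```

With the IDs above this prints `astropy: Difference: 3 instances` and `django: Difference: 8 instances`, matching the counts in the comment.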

xingyaoww · 2025-01-22T21:28:36Z

Very weird... after merging this into one of my branches and running a full SWE-Bench Verified eval (compared to our previous 53% run), django actually got a lot of failures :(

I suspect it is because `view` only goes up to two levels deep, and at two levels it doesn't show the agent which folders can be expanded further.

I'd suggest showing the type of each entry (file or folder) in the output of the `view` command:

/workspace/django__django__3.0/django/middleware # folder:
/workspace/django__django__3.0/django/shortcuts.py # file
/workspace/django__django__3.0/django/template/ # folder: X files under this directory
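The proposed format could come from a depth-limited listing that tags each entry as it is emitted. A minimal sketch, assuming a local filesystem walk; `view_annotated` and `visible_children` are hypothetical names, not the openhands-aci code. The proposal doesn't say whether "X files" means direct children or a recursive count, so this sketch assumes a count of immediate non-hidden entries.

```python
from pathlib import Path


def visible_children(path: Path) -> list[Path]:
    """Non-hidden entries directly inside `path`, sorted by name."""
    return sorted(p for p in path.iterdir() if not p.name.startswith("."))


def view_annotated(root: str, max_depth: int = 2) -> list[str]:
    """Depth-limited listing where each line says whether it is a file or a folder.

    Folders get a trailing '/' and a count of their immediate non-hidden entries,
    so the agent can tell which ones are worth expanding with another view call.
    """
    lines: list[str] = []

    def walk(path: Path, depth: int) -> None:
        for entry in visible_children(path):
            if entry.is_dir():
                n = len(visible_children(entry))
                noun = "entry" if n == 1 else "entries"
                lines.append(f"{entry}/ # folder: {n} {noun} under this directory")
                if depth < max_depth:
                    walk(entry, depth + 1)
            else:
                lines.append(f"{entry} # file")

    walk(Path(root), 1)
    return lines
```

The count on a folder at the depth limit is the signal the comment asks for: a non-zero count tells the agent there is more to see there, without listing it.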

enyst · 2025-01-23T07:06:57Z

That is very weird, it doesn't list a directory? How exactly does it get confused? I would love to look into the llm_completions of the failed instances.

The closest case I saw in the previous run actually looked OK: when the LLM needed more depth, it did something like this:

- let's explore ... view /workspace/django
- I see our problem is in <subdirectory>, so let's explore it ... view /workspace/django/subdirectory
- Now I understand what happens.

prompt to use view command

925eefe

enyst marked this pull request as draft December 10, 2024 07:12

enyst mentioned this pull request Dec 10, 2024

[Bug]: The 'view' tool command doesn't work on /workspace #5497

Closed


use aci main

12be7fc

enyst added the run-eval-m Runs evaluation with 30 instances label Dec 13, 2024

enyst added 2 commits December 13, 2024 11:20

Merge branch 'main' into enyst/test_view_depth

832311c

poetry lock

bff64d8

enyst added run-eval-m Runs evaluation with 30 instances and removed run-eval-m Runs evaluation with 30 instances labels Dec 13, 2024

Fix pr #5506: [NOT FOR MERGE] Adjust prompt to use view command

e2c6a1f

Resolve merge conflicts with main branch

0ce0ce2

All-Hands-AI deleted a comment from openhands-agent Jan 5, 2025

enyst added lint-fix and removed run-eval-m Runs evaluation with 30 instances labels Jan 5, 2025

enyst added 5 commits January 5, 2025 19:30

restore poetry lock

27d70d0

restore perms

44a6bd9

Merge branch 'main' into enyst/test_view_depth

8f86073

Merge branch 'main' into enyst/test_view_depth

6a7c3bc

Merge branch 'main' into enyst/test_view_depth

3b064ab

enyst added 3 commits January 21, 2025 16:30

Merge branch 'main' into enyst/test_view_depth

e055bb5

disable prompt extensions in eval

6701a84

fix eval on remote runtime

54773e0

xingyaoww approved these changes Jan 21, 2025


Merge branch 'main' into enyst/test_view_depth

23bbd9c

enyst marked this pull request as ready for review January 21, 2025 21:46

enyst changed the title ~~[NOT FOR MERGE] Adjust prompt to use view command~~ Adjust prompt to use view command Jan 21, 2025

enyst merged commit f0dbb02 into main Jan 21, 2025
13 checks passed

enyst deleted the enyst/test_view_depth branch January 21, 2025 22:50

xingyaoww mentioned this pull request Jan 22, 2025

Add file/folder info in view command All-Hands-AI/openhands-aci#54

Open
