Added VQA as an evaluator. #52

Open · wants to merge 16 commits into dev

Conversation

ruhanaazam

VQA evaluates whether a task was completed, given screenshots taken throughout the execution. The evaluator requires that --take_screenshot true is set.

@deepak-akkil (Collaborator) commented Aug 14, 2024

Please add a section in the readme on VQA (and, more generally, on evaluations). I am thinking of a new subsection in the test-related part of the readme. Something like:

Automatic Evaluation Types:

We support different evaluation types which enable users to easily build and use their own custom evaluation datasets. The supported evaluation types are:
- string_match: compares the text response from Agent-E with a reference answer. Several variants of string_match are supported: exact_match | must_include | some_matches | fuzzy_match
- url_match: compares the final URL to the reference URL.
- program_html: uses JS code to retrieve a value from the HTML page and compares it with a reference answer.
- manual: pauses the test execution for manual pass/fail annotation.
- vqa: uses GPT-4 vision to evaluate task success based on a sequence of screenshots taken at each step. If you use vqa, you need to pass --take_screenshots true when running the test.

Examples of each can be found in "test/tasks/test.json".

Ensure that there is at least one vqa example in test.json.
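
For illustration only, a vqa entry might look roughly like the sketch below (rendered as a Python dict so the guesses can be commented; every field name here is an assumption about the schema, not the repo's confirmed format — copy the real shape from the existing entries in test/tasks/test.json):

```python
# Hypothetical shape of one vqa task in test/tasks/test.json;
# all keys below are illustrative guesses, not the actual schema.
vqa_task = {
    "task_id": 1,                                  # illustrative id
    "intent": "Add a wireless mouse to the cart",  # the task given to Agent-E
    "eval": {
        "eval_types": ["vqa"],                     # select the VQA evaluator
        "reference_answer": "The cart shows a wireless mouse",
    },
}
```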

@deepak-akkil (Collaborator)

Another issue I found when running this version: since we never remove screenshots from previous runs, simply repeating the same task picks up all the screenshots in the snapshot folder. This adds to the cost and can also lead to incorrect vqa evaluation (imagine the task succeeded in the earlier run but failed in the next one).

Perhaps an easy solution is to clear the snapshots folder whenever a new run starts?
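
A minimal sketch of that idea, assuming screenshots for a run land in a per-test folder (the helper name and path layout are hypothetical, not the repo's actual code):

```python
import shutil
from pathlib import Path

def clear_previous_snapshots(snapshot_dir: str) -> None:
    """Remove screenshots left over from an earlier run of the same test.

    `snapshot_dir` is a hypothetical per-test folder, e.g. "snapshots/my_test";
    the real layout in Agent-E may differ.
    """
    path = Path(snapshot_dir)
    if path.exists():
        shutil.rmtree(path)  # drop all stale screenshots from the previous run
    path.mkdir(parents=True, exist_ok=True)  # start the new run with an empty folder
```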

@deepak-akkil (Collaborator)

Lastly, there are a bunch of places where the ruff linter complains about unnecessary newlines, unwanted trailing whitespace, unsorted import statements, etc. If you install the dev dependencies (as mentioned in the readme), they include 'ruff', which should highlight the issues in VSCode.
You can also auto-fix them by running 'ruff format' in the parent folder.
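
Concretely, from the repository root, something like the following should do it (assuming a reasonably recent ruff release; which issues get fixed depends on the project's ruff configuration):

```
ruff check --fix .   # report lint issues (e.g. unsorted imports) and auto-fix the fixable ones
ruff format .        # fix formatting issues such as trailing whitespace and stray newlines
```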

@deepak-akkil (Collaborator)

Otherwise looks good to me!!

@deepak-akkil self-requested a review August 14, 2024 16:48

@deepak-akkil (Collaborator) left a review:

A few minor comments added; otherwise looks good.
Tested to work fine as well. The only error was when the snapshot folder contained images from a previous run.

@deepak-akkil (Collaborator) commented Aug 14, 2024

Also add some default behavior (it may fail the test and continue, or return a useful error message and crash) when the snapshots folder is empty (e.g. if a screenshot could not be taken for some reason, or the screenshot flag was not enabled).
Today it crashes with the following print statement:
VQA score is None becauase None
KeyError: 'score'
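
One way to get that default behavior, sketched under the assumption that the evaluator returns a dict with a 'score' key (the function name and the injected run_vqa_model callable are illustrative, not the PR's actual code):

```python
import glob
from typing import Callable

def evaluate_with_vqa(snapshot_dir: str,
                      run_vqa_model: Callable[[list[str]], dict]) -> dict:
    """Hypothetical wrapper around the VQA call with defensive defaults."""
    screenshots = sorted(glob.glob(f"{snapshot_dir}/*.png"))
    if not screenshots:
        # Fail soft instead of crashing later with KeyError: 'score'.
        return {"score": None,
                "reason": "no screenshots found; was --take_screenshots true set?"}
    response = run_vqa_model(screenshots)  # assumed call into the GPT-4 vision evaluator
    # Guard against a malformed model response instead of indexing response["score"] blindly.
    return {"score": response.get("score"), "reason": response.get("reason", "")}
```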

@ruhanaazam (Author)

@deepak-akkil I addressed your comments:

  1. Improved error handling when exceptions are thrown by VQA: exceptions are more descriptive now, and VQA evaluation failures are treated as skips, so evaluation continues even if there is an issue with the VQA evaluator.
  2. Fixed the ruff linter complaints.
  3. Added a warning to users when no screenshots are found for VQA evaluation.
  4. Deleted all screenshots from prior runs before starting a new run (with the same test name).

Let me know if there are any other issues.
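
For reference, the skip-on-failure behavior from point 1 could look roughly like this (the function shape and return values are assumptions, not the PR's exact code):

```python
from typing import Callable

def safe_vqa_eval(task_id: str, run_vqa: Callable[[], dict]) -> str:
    """Run the (injected) VQA call and downgrade evaluator errors to a skip."""
    try:
        result = run_vqa()  # the actual VQA evaluation, passed in as a callable
        return "pass" if result.get("score") == 1 else "fail"
    except Exception as exc:
        # A failure inside the VQA evaluator should not abort the whole test run.
        print(f"Skipping VQA evaluation for task {task_id}: {exc}")
        return "skip"
```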

@ruhanaazam requested a review from teaxio November 6, 2024 22:06
@deepak-akkil (Collaborator)

Looks good to me. @teaxio, if you did not catch anything else, let's merge this.
