Added VQA as an evaluator. #52

Open · wants to merge 16 commits into dev

Conversation

ruhanaazam

VQA evaluates whether a task was completed, given screenshots taken throughout the execution. The evaluator requires that --take_screenshot true is set.

@deepak-akkil (Collaborator) commented Aug 14, 2024

Please add a section in the readme on VQA (and, more generally, on evaluations). I am thinking of a new subsection in the test-related part of the readme. Something like:

Automatic Evaluation Types:

We support different evaluation types which enable users to easily build and use their own custom evaluation datasets. The supported evaluation types are:
- string_match: compares the text response from Agent-E with a reference answer. Several variants of string_match are supported: exact_match | must_include | some_matches | fuzzy_match
- url_match: compares the final URL to the reference URL.
- program_html: uses JS code to retrieve a value from the HTML page and compares it with a reference answer.
- manual: pauses the test execution for manual pass/fail annotation.
- vqa: uses GPT-4 vision to evaluate task success based on a sequence of screenshots taken at each step. If you use vqa, you need to pass --take_screenshots true when running the test.

Examples of each can be found in "test/tasks/test.json".

Ensure that there is at least one vqa example in test.json.
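
For illustration only, a vqa entry might look roughly like the sketch below (rendered as a Python dict so the guesses can be commented; every field name here is an assumption about the schema, not the repo's confirmed format — copy the real shape from the existing entries in test/tasks/test.json):

```python
# Hypothetical shape of one vqa task in test/tasks/test.json;
# all keys below are illustrative guesses, not the actual schema.
vqa_task = {
    "task_id": 1,                                  # illustrative id
    "intent": "Add a wireless mouse to the cart",  # the task given to Agent-E
    "eval": {
        "eval_types": ["vqa"],                     # select the VQA evaluator
        "reference_answer": "The cart shows a wireless mouse",
    },
}
```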

@deepak-akkil (Collaborator)

Another issue I found when running this version: since we never remove screenshots from previous runs, simply repeating the same task picks up all the screenshots in the snapshot folder. This adds to the cost and can also lead to incorrect vqa evaluation (imagine the task succeeded in the earlier run but failed in the next one).

Perhaps an easy solution is to clear the snapshots folder whenever a new run starts?
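
A minimal sketch of that idea, assuming screenshots for a run land in a per-test folder (the helper name and path layout are hypothetical, not the repo's actual code):

```python
import shutil
from pathlib import Path

def clear_previous_snapshots(snapshot_dir: str) -> None:
    """Remove screenshots left over from an earlier run of the same test.

    `snapshot_dir` is a hypothetical per-test folder, e.g. "snapshots/my_test";
    the real layout in Agent-E may differ.
    """
    path = Path(snapshot_dir)
    if path.exists():
        shutil.rmtree(path)  # drop all stale screenshots from the previous run
    path.mkdir(parents=True, exist_ok=True)  # start the new run with an empty folder
```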

@deepak-akkil (Collaborator)

Lastly, there are a bunch of places where the ruff linter complains about unnecessary newlines, unwanted trailing whitespace, unsorted import statements, etc. If you install the dev dependencies (as mentioned in the readme), they include 'ruff', which should highlight the issues in VSCode.
You can also auto-fix them by running 'ruff format' in the parent folder.
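
Concretely, from the repository root, something like the following should do it (assuming a reasonably recent ruff release; which issues get fixed depends on the project's ruff configuration):

```
ruff check --fix .   # report lint issues (e.g. unsorted imports) and auto-fix the fixable ones
ruff format .        # fix formatting issues such as trailing whitespace and stray newlines
```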

@deepak-akkil (Collaborator)

Otherwise looks good to me!!

@deepak-akkil self-requested a review August 14, 2024 16:48

@deepak-akkil (Collaborator) left a review:

A few minor comments added; otherwise looks good.
Tested to work fine as well. The only error was when the snapshot folder contained images from a previous run.

@deepak-akkil (Collaborator) commented Aug 14, 2024

Also add some default behavior (it may fail the test and continue, or return a useful error message and crash) when the snapshots folder is empty (e.g. if a screenshot could not be taken for some reason, or the screenshot flag was not enabled).
Today it crashes with the following print statement:
VQA score is None becauase None
KeyError: 'score'
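
One way to get that default behavior, sketched under the assumption that the evaluator returns a dict with a 'score' key (the function name and the injected run_vqa_model callable are illustrative, not the PR's actual code):

```python
import glob
from typing import Callable

def evaluate_with_vqa(snapshot_dir: str,
                      run_vqa_model: Callable[[list[str]], dict]) -> dict:
    """Hypothetical wrapper around the VQA call with defensive defaults."""
    screenshots = sorted(glob.glob(f"{snapshot_dir}/*.png"))
    if not screenshots:
        # Fail soft instead of crashing later with KeyError: 'score'.
        return {"score": None,
                "reason": "no screenshots found; was --take_screenshots true set?"}
    response = run_vqa_model(screenshots)  # assumed call into the GPT-4 vision evaluator
    # Guard against a malformed model response instead of indexing response["score"] blindly.
    return {"score": response.get("score"), "reason": response.get("reason", "")}
```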

@ruhanaazam (Author)

@deepak-akkil I addressed your comments:

  1. Improved error handling when exceptions are thrown by VQA: exceptions are more descriptive now, and VQA evaluation failures are treated as skips, so evaluation continues even if there is an issue with the VQA evaluator.
  2. Fixed the ruff linter complaints.
  3. Added a warning to users when no screenshots are found for VQA evaluation.
  4. Deleted all screenshots from prior runs before starting a new run (with the same test name).

Let me know if there are any other issues.
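
For reference, the skip-on-failure behavior from point 1 could look roughly like this (the function shape and return values are assumptions, not the PR's exact code):

```python
from typing import Callable

def safe_vqa_eval(task_id: str, run_vqa: Callable[[], dict]) -> str:
    """Run the (injected) VQA call and downgrade evaluator errors to a skip."""
    try:
        result = run_vqa()  # the actual VQA evaluation, passed in as a callable
        return "pass" if result.get("score") == 1 else "fail"
    except Exception as exc:
        # A failure inside the VQA evaluator should not abort the whole test run.
        print(f"Skipping VQA evaluation for task {task_id}: {exc}")
        return "skip"
```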

@ruhanaazam requested a review from teaxio November 6, 2024 22:06
@deepak-akkil (Collaborator)

Looks good to me. @teaxio, if you did not catch anything else, let's merge this.
