-
Notifications
You must be signed in to change notification settings - Fork 149
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added VQA as a evaluator. #52
base: dev
Are you sure you want to change the base?
Conversation
… task. VQA evaluator must be used with the flag --take_screenshots true
Please add a section in readme on VQA (and more generally evaluations). I am thinking a new sub section in the test related readme part. Something like: Automatic Evaluation Types:We support different evaluation types which will enable users to easily build use and build their own custom evaluation datasets. The supported evaluation types are: Ensure that there is atleast one vqa example in test.json |
Another issue i found when running this version was since we do not ever remove screenshots from previous run, just repeating the same task again takes all screenshots in the snapshot folder. This adds up the cost + also sometimes lead to incorrect vqa evaluation (imagine if in the earlier run the task was successful and in the next run it wasnt). Perhaps an easy solution is to clear snapshots folder if a new run is taking place? |
Lastly there are bunch of places where the ruff linter complains about unnecessary newlines, unwanted trailing whitespaces, unsorted import statements etc. If you install the dev dependencies (as mentioned in the readme), it includes 'ruff' which should highlight the issues in VSCode. |
Otherwise looks good to me!! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few minor comments added, otherwise looks good.
Tested to work fine as well. The only error was when the snapshot folder contained images from previous run.
Also add some default behavior (may fail the test and continue or return a useful error message and crash) when the snapshots folder is empty. (e.g. if screenshot could not be taken for some reason or screenshot flag was not enabled) |
… treated like a .
…tions here are and error is provided as the reasoning.
@deepak-akkil I addressed your comments:
Let me know if there are any other issues. |
Looks good to me. @teaxio if you did not catch anything else, lets merge this. |
VQA will evaluate whether a task is completed given screenshots taken throughout the execution. The evaluator requires that
--take_screenshot true
is set.