add bananalyzer eval lib #75
19a90ee to a6930a0
To run the bananalyzer eval, first run the init script to download the static assets:
`./evals/bananalyzer-ts/init.sh`
@navidkpr can you please add instructions on how to run the bananalyzer evals?
It now works for me as well. We are not yet passing all of the test cases when I run `pnpm evals`, but I assume that is expected.
why
Mainly because we needed a suite of reliable evals to improve Stagehand on.
We can now add any of the 319 test cases from bananalyzer with a single line. I will be adding the useful test cases in a separate PR.
what changed
test plan
This is tested by default every time we run the eval system.
For the future
Our current eval system uses live URLs. This can cause problems because the pages behind those URLs may change over time. For example, a single "." change will cause an eval to fail on an extract job. To get around that, we can take snapshots of the websites used in each eval with Playwright and export them to MHTML files.
This PR adds the functionality to test Stagehand on those snapshots, which eliminates the site-change issue.
Other eval systems for web agents, such as Mind2Web, also use the MHTML format, so integrating with an external eval source should be much easier in the future.
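For illustration, here is a minimal sketch of how a page could be snapshotted to MHTML. It is not the code from this PR. It assumes a Chromium browser and uses the Chrome DevTools Protocol command `Page.captureSnapshot`. The names `CdpSessionLike`, `snapshotFileName`, and `captureMhtml` are hypothetical. The session type is declared locally so the sketch has no dependencies.

```typescript
// Minimal shape of a Chrome DevTools Protocol session (hypothetical local type).
// A Playwright CDPSession is expected to fit it, possibly with a cast.
interface CdpSessionLike {
  send(method: string, params?: Record<string, unknown>): Promise<unknown>;
}

// Hypothetical helper: derive a stable snapshot file name from a URL.
function snapshotFileName(url: string): string {
  const slug = url
    .replace(/^[a-z]+:\/\//i, "") // drop the scheme, e.g. "https://"
    .replace(/[^a-zA-Z0-9]+/g, "-") // collapse every other run of symbols
    .replace(/^-+|-+$/g, ""); // trim leading and trailing dashes
  return `${slug}.mhtml`;
}

// Ask the browser for an MHTML snapshot of the page attached to the session.
async function captureMhtml(session: CdpSessionLike): Promise<string> {
  const result = (await session.send("Page.captureSnapshot", {
    format: "mhtml",
  })) as { data?: unknown } | undefined;
  if (!result || typeof result.data !== "string") {
    throw new Error("Page.captureSnapshot did not return MHTML data");
  }
  return result.data;
}
```

With Playwright on Chromium, a session can be obtained with `await page.context().newCDPSession(page)` after navigating to the target URL, and the returned string can be written to a file named by `snapshotFileName(url)`. Because the snapshot is a static file, an eval that reads it no longer depends on the live site.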