
add bananalyzer eval lib #75

Merged
merged 31 commits into main from npour/more-evals on Sep 30, 2024
Conversation

navidkpr
Collaborator

@navidkpr navidkpr commented Sep 24, 2024

why

Mainly because we needed a suite of reliable evals against which to improve Stagehand.

We can now add any of the 319 test cases from bananalyzer with a single line. I will be adding the useful test cases in a separate PR.

what changed

  1. Added a server to serve MHTML files
  2. Moved schemas from Bananalyzer into TypeScript
  3. Added an integration between Bananalyzer and Stagehand
  4. Added 4 Bananalyzer test cases to the Stagehand evals
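
For item 2, here is a minimal sketch of what porting one of Bananalyzer's Python (Pydantic) example schemas to TypeScript might look like. The field names and the `isDetailExample` helper are illustrative assumptions, not the actual bananalyzer-ts code:

```typescript
// Hypothetical shape of a single Bananalyzer "detail" eval example,
// ported from the original Python/Pydantic schema. Field names are
// assumptions for illustration.
interface DetailExample {
  id: string;
  url: string; // a live URL or a local MHTML snapshot path
  category: string;
  goal: string; // the extraction goal handed to the agent
  expected: Record<string, unknown>; // gold answer the eval scores against
}

// Minimal structural check standing in for real schema validation.
function isDetailExample(value: unknown): value is DetailExample {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.id === "string" &&
    typeof v.url === "string" &&
    typeof v.category === "string" &&
    typeof v.goal === "string" &&
    typeof v.expected === "object" &&
    v.expected !== null
  );
}

const example = {
  id: "example-001",
  url: "file://snapshots/example-001.mhtml",
  category: "detail",
  goal: "Extract the product name and price",
  expected: { name: "Widget", price: "$9.99" },
};
console.log(isDetailExample(example)); // prints "true"
```

A type-guard like this keeps the eval loader's call sites strongly typed while still accepting untrusted JSON from disk.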

test plan

This is tested by default every time we run the eval system.

For the future

Our current eval system uses live URLs, which is fragile: the pages we test against can change over time, and even a single-character change (e.g. a stray ".") will cause an extract eval to fail. To get around that, we can take snapshots of the websites used in each eval with Playwright and export them as MHTML files.

This PR adds the functionality to test Stagehand against those snapshots, eliminating the site-change issue.

Other eval systems for web agents, such as Mind2Web, also use the MHTML format, so integrating with an external eval source will be much easier next time.

@navidkpr navidkpr changed the title Add a system for adding bananalyzer evals add bananalyzer eval lib Sep 30, 2024
To run the bananalyzer evals, you'll need to run the init script first to download the static assets:

`./evals/bananalyzer-ts/init.sh`

@navidkpr can you please add instructions on how to run the bananalyzer evals?

@filip-michalsky filip-michalsky left a comment

It now works for me as well. We're not passing all of the test cases yet when I run `pnpm evals`, but I assume that's expected.

@navidkpr navidkpr merged commit 0b1b942 into main Sep 30, 2024
1 check passed
@filip-michalsky filip-michalsky deleted the npour/more-evals branch October 6, 2024 23:55