
Support Online-Mind2Web task evaluation. #39


Open · Syclus123 wants to merge 1 commit into main

Conversation

Syclus123

This PR adds the following features:

  1. Support screenshots of the evaluation process
  2. Support Online-Mind2Web task evaluation
  3. Support access to gpt-4.1, o3-mini, o4-mini, and other models (see the token-parameter note below)
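
One practical detail behind item 3: OpenAI's o-series reasoning models (o1, o3-mini, o4-mini) reject the legacy `max_tokens` parameter and require `max_completion_tokens` instead, which is presumably what the token-parameter change in agent/LLM/openai.py addresses. A minimal sketch of model-dependent parameter selection follows; the helper name and the prefix check are assumptions for illustration, not the PR's actual code:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def build_token_kwargs(model: str, limit: int) -> dict:
    """Select the token-limit parameter the target model accepts.

    o-series reasoning models (o1, o3-mini, o4-mini, ...) reject
    max_tokens and require max_completion_tokens instead.
    """
    if model.startswith(("o1", "o3", "o4")):
        return {"max_completion_tokens": limit}
    return {"max_tokens": limit}


response = client.chat.completions.create(
    model="o4-mini",
    messages=[{"role": "user", "content": "Hello"}],
    **build_token_kwargs("o4-mini", 1024),
)
print(response.choices[0].message.content)
```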

Tip: To run in a Linux environment without a display server, install Xvfb and start the evaluation under it:

```bash
sudo yum install -y xorg-x11-server-Xvfb
xvfb-run python batch_eval.py
```
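
(The package name above is for RHEL/CentOS-family distributions; on Debian or Ubuntu the equivalent is `sudo apt-get install -y xvfb`.)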

Copilot AI review requested due to automatic review settings · May 20, 2025 13:02

Copilot AI left a comment


Pull Request Overview

This PR adds support for Online-Mind2Web task evaluation, introduces screenshot capture during task execution, and extends model support to include gpt-4.1, o3-mini, and o4-mini.

  • Added utility functions for JSON file operations (a minimal sketch follows this list).
  • Integrated log parsing and processing for Online-Mind2Web tasks.
  • Updated configuration and evaluation logic to support new task modes and enhanced API parameters.
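
The PR's actual implementation isn't shown in this summary, but JSON load/save utilities of this kind typically look like the following minimal sketch; the function names here are assumptions, not the identifiers used in utils/utils.py:

```python
import json
import os


def load_json(path):
    """Read and deserialize a JSON file."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def save_json(data, path):
    """Serialize data to a JSON file, creating parent directories as needed."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=4)
```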

Reviewed Changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated no comments.

Summary per file:

| File | Description |
| --- | --- |
| utils/utils.py | Added JSON load/save utilities. |
| utils/parser.py | Introduced a parser for log files with adjusted key names. |
| utils/log_processor.py | Improved log processing with task mapping support. |
| logs.py | Updated log folder path and directory creation. |
| evaluate/evaluate_utils.py | Enhanced evaluation utility with screenshot support and token count. |
| eval.py | Extended evaluation flow with new task and screenshot parameters. |
| data/Online-Mind2Web/* | Added Git LFS filters and updated README for Online-Mind2Web. |
| configs/setting.toml | Updated default task mode, max time step, and screenshot settings. |
| configs/log_config.json | Adjusted log and task mapping file paths. |
| batch_eval.py | Created batch evaluation script to run multiple tasks sequentially (sketched after this table). |
| agent/LLM/openai.py | Modified token parameters and added new API parameter support. |
| agent/LLM/llm_instance.py | Updated model detection logic for JSON mode instantiation. |
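
As a rough picture of what batch_eval.py does, a sequential runner might look like the sketch below; the `--task_id` flag and the task ID list are illustrative assumptions, not the PR's actual interface:

```python
import subprocess
import sys

# Illustrative task IDs; the real script presumably derives these
# from the Online-Mind2Web task mapping file in configs/.
TASK_IDS = ["task_001", "task_002", "task_003"]


def main() -> None:
    for task_id in TASK_IDS:
        # Run each evaluation in its own process so a single
        # failing task does not abort the whole batch.
        result = subprocess.run(
            [sys.executable, "eval.py", "--task_id", task_id]
        )
        if result.returncode != 0:
            print(f"Task {task_id} exited with code {result.returncode}")


if __name__ == "__main__":
    main()
```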
