ReportBench is a comprehensive benchmark for evaluating the factual quality and citation behavior of Deep Research agents. Leveraging expert-authored survey papers as ground truth, ReportBench reverse-engineers domain-specific prompts and provides automated tools to assess both cited and non-cited content.
Evaluating Deep Research agents requires reliable, expert-level ground truth. ReportBench addresses this need by:
- Leveraging expert surveys: Uses high-quality, peer-reviewed survey papers from arXiv as gold-standard references.
- Reverse-engineering prompts: Generates task-specific prompts matching each survey’s scope, methods, and temporal constraints.
- Automated validation: Employs a dual-path evaluation to verify citation consistency and factual correctness of non-cited statements.
The dataset construction pipeline consists of four phases:
1. Survey Paper Identification
   - Start from an arXiv metadata snapshot (post-2020).
   - Filter titles/abstracts for "survey" or "review" and confirm publication status via metadata and LLM-based classification (a minimal filtering sketch follows this list).
   - Retain 600 high-quality, peer-reviewed survey papers.
2. Fine-Grained Reference Extraction
   - Download and parse LaTeX sources to extract all in-text citation commands.
   - Build a gold-standard set of references mirroring the true citation pattern of each survey.
3. Prompt Generation
   - Reverse-engineer three levels of prompts (sentence-level, paragraph-level, and detail-rich) via LLMs.
   - Enforce temporal constraints matching each paper's publication cutoff.
   - Add explicit instructions to avoid citing the original survey itself.
4. Application Domain Distribution
   - Classify surveys into ten application domains using LLMs.
   - Downsample to a balanced domain distribution and sample one of the three prompt types per survey to form a 100-task benchmark.
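As an illustration of the first phase, here is a minimal, hedged sketch of the keyword filter over an arXiv metadata snapshot. The file name and the `title`/`abstract`/`update_date` fields follow the public arXiv metadata dump and are assumptions; this is not the exact ReportBench pipeline code.

```python
import json

# Hedged sketch of Phase 1: keyword-filter an arXiv metadata snapshot for
# candidate survey/review papers. Field names and file name are assumptions.
KEYWORDS = ("survey", "review")

def is_candidate_survey(record):
    text = (record.get("title", "") + " " + record.get("abstract", "")).lower()
    recent = record.get("update_date", "") >= "2020-01-01"  # post-2020 snapshot
    return recent and any(kw in text for kw in KEYWORDS)

candidate_ids = []
with open("arxiv-metadata-oai-snapshot.json", encoding="utf-8") as f:
    for line in f:  # one JSON object per line
        record = json.loads(line)
        if is_candidate_survey(record):
            candidate_ids.append(record.get("id"))

print(f"{len(candidate_ids)} candidate survey/review papers")
# Publication status is then confirmed via metadata (e.g., journal-ref) and an LLM classifier.
```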
ReportBench's evaluation workflow combines a reference-overlap measurement with two complementary validation procedures, one for cited statements and one for non-cited statements (a sketch of the resulting metrics follows this list):

Reference Overlap:
- URL Extraction: Extract all URL citations from each generated report (both base-model and Deep Research outputs).
- Normalization and Retrieval: Normalize and deduplicate URLs, then retrieve the content of each web page.
- Document Type Classification: Use an LLM to determine whether each URL corresponds to a scholarly article and, if so, extract its title.
- Title Matching: Compare the extracted titles against the ground-truth references of the expert-authored survey and compute an overlap ratio.

Cited Statement Validation:
- Statement Extraction: Identify all sentences in a generated report containing explicit citations.
- Source Retrieval: Scrape the full text of each cited source.
- Semantic Matching: Use an LLM to locate supporting passages and verify consistency.
- Scoring: Compute a citation alignment score for each report.

Non-Cited Statement Validation:
- Statement Extraction: Extract factual claims without citations, filtering out common-sense content.
- Web-Connected Fact Checking: Query multiple web-connected LLMs (Gemini Pro and Flash) to independently verify each claim.
- Voting Mechanism: Aggregate judgments via majority vote to compute factual accuracy.
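To make the reported metrics concrete, here is a minimal, hedged sketch of how the reference-overlap precision/recall and the majority vote over fact-checking verdicts could be computed. The title normalization, matching, and prompting logic in ReportBench itself may differ.

```python
from typing import List, Tuple

def normalize_title(title: str) -> str:
    # Crude normalization for illustration: lowercase and drop non-alphanumerics.
    return "".join(ch for ch in title.lower() if ch.isalnum())

def reference_overlap(report_titles: List[str], gold_titles: List[str]) -> Tuple[float, float]:
    # Precision: fraction of the report's scholarly citations found in the survey's references.
    # Recall: fraction of the survey's references recovered by the report.
    predicted = {normalize_title(t) for t in report_titles}
    gold = {normalize_title(t) for t in gold_titles}
    matched = predicted & gold
    precision = len(matched) / len(predicted) if predicted else 0.0
    recall = len(matched) / len(gold) if gold else 0.0
    return precision, recall

def majority_vote(verdicts: List[bool]) -> bool:
    # Each verdict is one web-connected LLM's judgment on a non-cited claim.
    return sum(verdicts) > len(verdicts) / 2

if __name__ == "__main__":
    p, r = reference_overlap(
        ["A Survey on Graph Neural Networks"],
        ["A Survey on Graph Neural Networks", "Attention Is All You Need"],
    )
    print(f"precision={p:.2f}, recall={r:.2f}")  # precision=1.00, recall=0.50
    print(majority_vote([True, True, False]))    # True
```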
We evaluated two Deep Research products alongside their corresponding base LLMs on ReportBench. Table 1 summarizes precision, recall, average references per report, citation match rate, average cited statement count, non-cited factual accuracy, and average non-cited statement count.
| Test Model | Precision | Recall | Avg. Refs | Citation Match Rate | Avg. Cited Stmts | Non-Cited Accuracy | Avg. Non-Cited Stmts |
|---|---|---|---|---|---|---|---|
| OpenAI Deep Research | 0.385 | 0.033 | 9.89 | 78.87% | 88.2 | 95.83% | 38.9 |
| Gemini Deep Research | 0.145 | 0.036 | 32.42 | 72.94% | 96.2 | 92.21% | 49.6 |
| gemini-2.5-flash | 0.237 | 0.012 | 5.47 | 44.88% | 12.1 | 98.52% | 11.5 |
| gemini-2.5-pro | 0.269 | 0.010 | 4.27 | 59.24% | 6.58 | 96.08% | 9.35 |
| o3 | 0.299 | 0.031 | 12.26 | 31.43% | 16.16 | 82.22% | 11.51 |
| claude4-sonnet | 0.337 | 0.021 | 6.74 | 73.67% | 14.93 | 92.64% | 17.07 |
Table 1. Performance metrics of Deep Research products and their base models.
- OpenAI Deep Research: Highest precision (0.385) and citation match rate (78.87%), indicating focused and accurate retrieval with fewer references.
- Gemini Deep Research: Generates many more citations (32.42 vs. 9.89) but yields only marginal recall gain, suggesting over-generation without proportional coverage benefits.
- OpenAI Deep Research vs. o3: Comparable retrieval metrics, but the Deep Research product generates far more cited statements (88.2 vs. 16.16) and non-cited statements (38.9 vs. 11.51), and achieves much higher citation alignment (78.87% vs. 31.43%) and non-cited accuracy (95.83% vs. 82.22%).
- Gemini Deep Research vs. gemini-2.5-pro: Trades off precision (0.145 vs. 0.269) for higher recall and citation volume, while maintaining strong alignment (72.94% vs. 59.24%) but slightly lower non-cited statement accuracy.
- claude4-sonnet: Most balanced baseline—moderate precision (0.337), recall (0.021), citation consistency (73.67%), and non-cited statement accuracy (92.64%).
Environment Requirements:
- Python 3.8+
- Required Python packages (install via pip)
Install Dependencies:
```bash
pip install pandas pyyaml langchain-openai tenacity tqdm requests beautifulsoup4 firecrawl-py python-dotenv
```

API Keys Setup:
Create a .env file in the project root with the following configuration:
```bash
# OpenAI Configuration
OPENAI_PROVIDER=openai        # or "azure" for Azure OpenAI
OPENAI_API_KEY=your_openai_api_key_here
OPENAI_MODEL_NAME=gpt-4o-mini
TEMPERATURE=0.0
MAX_TOKENS=8192

# Azure OpenAI (if using Azure)
AZURE_OPENAI_ENDPOINT=your_azure_endpoint
AZURE_OPENAI_API_VERSION=2023-12-01-preview
AZURE_OPENAI_DEPLOYMENT_NAME=your_deployment_name
AZURE_OPENAI_API_KEY=your_azure_api_key

# Web Scraping (Required for citation verification)
FIRECRAWL_API_KEY=your_firecrawl_api_key

# Search API (Optional, for enhanced fact-checking)
SERPAPI_API_KEY=your_serpapi_key
```
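For reference, here is a minimal sketch of how these variables might be loaded at runtime (e.g., by `config.py`). The exact configuration code in the repository may differ, and the `ChatOpenAI` construction is only one plausible use of the `langchain-openai` dependency.

```python
import os

from dotenv import load_dotenv            # python-dotenv
from langchain_openai import ChatOpenAI   # langchain-openai

# Load the .env file from the project root into the process environment.
load_dotenv()

# Read the same keys documented above; defaults mirror the example values.
provider = os.getenv("OPENAI_PROVIDER", "openai")
model_name = os.getenv("OPENAI_MODEL_NAME", "gpt-4o-mini")
temperature = float(os.getenv("TEMPERATURE", "0.0"))
max_tokens = int(os.getenv("MAX_TOKENS", "8192"))

if provider == "openai":
    llm = ChatOpenAI(
        model=model_name,
        temperature=temperature,
        max_tokens=max_tokens,
        api_key=os.getenv("OPENAI_API_KEY"),
    )
# For provider == "azure", langchain_openai.AzureChatOpenAI would be configured
# from the AZURE_OPENAI_* variables instead.
```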
Project Structure:
```
ReportBench/
│
├── 📁 ReportBench Release v1.1
│ ├── ReportBench_v1.1.jsonl # Main dataset file in JSON Lines format
│ └── ReportBench_v1.1_GT # Ground truth reference data
│
├── 📁 Core Processing Scripts
│ ├── openai_processor.py # Process OpenAI Deep Research outputs
│ ├── gemini_processor.py # Process Gemini Deep Research outputs
│ ├── statement_evaluator.py # Extract and evaluate factual statements
│ ├── related_work_evaluator.py # Evaluate citation accuracy and recall
│ └── metrics_calculator.py # Calculate final performance metrics
│
├── 📁 Configuration & Utilities
│ ├── config.py # API keys and model configurations
│ ├── utils.py # Common utilities (LLM clients, CSV ops)
│ ├── cache_utils.py # URL caching and normalization
│ └── .env # Environment variables (create this)
│
├── 📁 Evaluation Modules
│ ├── statement/ # Statement extraction and verification
│ │ ├── extract_citations.py # Extract cited statements
│ │ ├── extract_no_citations.py # Extract non-cited statements
│ │ ├── scrape_content.py # Web scraping for citation sources
│ │ ├── match_text.py # Semantic matching of statements
│ │ ├── verify_alignment.py # Verify citation-statement alignment
│ │ └── verify_no_citations_web.py # Web-based fact-checking
│ └── process/ # Data processing utilities
│ ├── extract_activity_structured.py
│ ├── extract_reference_structured.py
│ └── html2markdown.py
│
├── 📁 Scripts & Templates
│ ├── run_test.sh # Main evaluation pipeline
│ ├── process_prompt.py # Prompt processing utilities
│ └── prompt_template/ # Evaluation prompt templates
│
└── 📄 Configuration Files
    ├── README.md                     # Project documentation
    ├── .gitignore                    # Git ignore rules
    └── .env.example                  # Environment variables template
```
Key Components:
- Processing Pipeline: `openai_processor.py` → `statement_evaluator.py` + `related_work_evaluator.py` → `metrics_calculator.py`
- Input Format: Model outputs as JSON files with a `response`/`content` field containing the generated survey text
- Output: Comprehensive evaluation metrics, including citation alignment scores and factual accuracy rates
Important Note: The data preparation process differs based on your model type:
OpenAI Deep Research and Gemini Deep Research are web-based products that require special data collection:
- Use Chrome Extension: Since these are web interfaces, you need to use a Chrome extension to capture the conversation records
- Process with Dedicated Scripts: Use the corresponding processor to parse the captured data:
```bash
# For OpenAI Deep Research
python openai_processor.py --input=captured_data_dir --output=parsed_output_dir --markdown

# For Gemini Deep Research
python gemini_processor.py --input=captured_data_dir --output=parsed_output_dir
```
Input Format: Your model outputs should be saved as JSON files with the following structure:
```json
{
  "response": "Your model's generated survey text here...",
  "arxiv_id": "2024.12345",          // Optional: will be extracted from filename if not present
  "query": "Original query prompt",  // Optional
  // ... other metadata fields
}
```

Alternative accepted field names for the main content: `response`, `content`, `text`, `message`, `output`, `result`
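As an illustration of how the main content field can be resolved from any of the accepted names, here is a small hedged sketch; the function name `load_survey_text` is hypothetical and the evaluators in this repository may implement this differently.

```python
import json
from typing import Optional

# Accepted field names for the generated survey text, in priority order.
CONTENT_FIELDS = ("response", "content", "text", "message", "output", "result")

def load_survey_text(path: str) -> Optional[str]:
    # Return the generated survey text from a model-output JSON file,
    # or None if none of the accepted fields is present.
    with open(path, encoding="utf-8") as f:
        record = json.load(f)
    for field in CONTENT_FIELDS:
        value = record.get(field)
        if isinstance(value, str) and value.strip():
            return value
    return None

# Example (hypothetical file path):
# text = load_survey_text("your-model-name/2003.00653.json")
```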
File Organization:
```bash
# Create your model's output directory
mkdir -p your-model-name

# Save each evaluation result as: {arxiv_id}.json
# Example filenames:
your-model-name/2003.00653.json
your-model-name/2004.05937.json
# ... (100 files total for the full benchmark)
```
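If your model outputs live in memory, a minimal sketch like the following could write them in the expected layout; the `outputs` mapping, directory name, and texts are placeholders, not real benchmark data.

```python
import json
from pathlib import Path

# Hypothetical in-memory outputs: {arxiv_id: generated survey text}.
outputs = {
    "2003.00653": "Generated survey text for the first task...",
    "2004.05937": "Generated survey text for the second task...",
}

out_dir = Path("your-model-name")
out_dir.mkdir(parents=True, exist_ok=True)

for arxiv_id, text in outputs.items():
    # One {arxiv_id}.json file per task, with the survey text under "response".
    record = {"response": text, "arxiv_id": arxiv_id}
    (out_dir / f"{arxiv_id}.json").write_text(
        json.dumps(record, ensure_ascii=False, indent=2), encoding="utf-8"
    )
```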
Run the evaluation pipeline:
```bash
# 1. Process raw outputs (ONLY for web-based Deep Research products)
# For OpenAI Deep Research:
python openai_processor.py --input=data/test_data/raw_data/your-model-name --output=your-model-name-parsed --markdown

# For Gemini Deep Research:
python gemini_processor.py --input=data/test_data/raw_data/your-model-name --output=your-model-name-parsed --markdown

# For other models: skip this step and use your JSON files directly

# 2. Extract and evaluate statements
python statement_evaluator.py your-model-name-parsed --output-dir your-model-name-stat-results

# 3. Evaluate citation accuracy
python related_work_evaluator.py --survey-dir your-model-name-parsed --ground-truth-dir ReportBench_v1.1_GT --result-dir your-model-name-related-work-results

# 4. Calculate final metrics
python metrics_calculator.py your-model-name-stat-results
```

Results will be saved in structured directories:
- Statement Evaluation: `your-model-name-stat-results/` contains citation alignment and factual accuracy scores
- Related Work Evaluation: `your-model-name-related-work-results/` contains precision/recall for citation discovery
- Final Metrics: summary CSV files with aggregated performance metrics
Key Output Files:
- Individual paper results in `{arxiv_id}/` subdirectories
You can easily access ReportBench with the following code:
```python
from datasets import load_dataset

dataset = load_dataset("ByteDance-BandAI/ReportBench")
```

Citation:
```bibtex
@misc{li2025reportbenchevaluatingdeepresearch,
title={ReportBench: Evaluating Deep Research Agents via Academic Survey Tasks},
author={Minghao Li and Ying Zeng and Zhihao Cheng and Cong Ma and Kai Jia},
year={2025},
eprint={2508.15804},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2508.15804},
}
```