Merged

Changes from all commits (19 commits)
3a1809b
feat: Add Tongyi DeepResearch integration with rLLM AgentWorkflowEngine
yayashuxue Sep 19, 2025
02844bc
Fix DeepResearch token counting and improve HLE evaluation
yayashuxue Sep 30, 2025
33b67ff
Port complete tool implementations from Tongyi DeepResearch
yayashuxue Sep 30, 2025
43a7749
feat(engine): Add adaptive parameter compatibility for OpenAI reasoni…
yayashuxue Oct 4, 2025
cb1de22
fix: Critical bug fixes for DeepResearch agent evaluation
yayashuxue Oct 4, 2025
15b36b9
feat(deepresearch): Add vision model support and alignment documentation
yayashuxue Oct 5, 2025
e81c82a
fix: Handle confidence as string in metrics calculation
yayashuxue Oct 5, 2025
12c272b
deepresearch: HF-only HLE eval; README adds HF auth/cache notes; remo…
yayashuxue Oct 6, 2025
14a51d1
deepresearch: update tools for native function-calling + robust fallb…
yayashuxue Oct 6, 2025
0074ba4
file clean
yayashuxue Oct 6, 2025
9f04d36
Merge remote-tracking branch 'upstream/v0.2' into feature/deepresearc…
yayashuxue Oct 6, 2025
0ec7b65
deepresearch: merge upstream v0.2 - resolve conflicts and align forma…
yayashuxue Oct 6, 2025
f0194f8
feat: DeepResearch integration with model-specific parameter support
yayashuxue Oct 11, 2025
cfaaa9c
merge: upstream v0.2 latest changes
yayashuxue Oct 11, 2025
e54bf08
fix: let DeepResearch handle all eval sampling params
yayashuxue Oct 11, 2025
dcb8eb6
fix: handle undefined text for models without reasoning
yayashuxue Oct 11, 2025
df2725d
feat: complete O3 support with hybrid mode and parameter handling
yayashuxue Oct 11, 2025
ed90f40
refactor: use binary yes/no judge aligned with Tongyi
yayashuxue Oct 11, 2025
11f356e
refactor: simplify OpenAI engine token parameter handling
yayashuxue Oct 11, 2025
7 changes: 7 additions & 0 deletions .gitignore
@@ -202,3 +202,10 @@ CLAUDE.md
examples/strands_outputs/*
strands_outputs/*
examples/strands/strands_outputs/*

# Deepresearch outputs ignore
examples/deepresearch/deepresearch_outputs/*
deepresearch_outputs/*
examples/deepresearch/hle_outputs/*
*/hle_outputs/*
examples/deepresearch/HLE_OUTPUT_EVOLUTION.md
28 changes: 28 additions & 0 deletions examples/deepresearch/.env.example
@@ -0,0 +1,28 @@
# DeepResearch API Configuration
# Copy this file to .env and fill in your API keys

# OpenAI API (recommended for best performance)
OPENAI_API_KEY=your_openai_api_key_here
OPENAI_BASE_URL=https://api.openai.com/v1
MODEL_NAME=gpt-4

# Alternative: Together AI (cost-effective option)
# TOGETHER_AI_API_KEY=your_together_ai_key_here
# TOGETHER_AI_MODEL_NAME=Qwen/Qwen2.5-7B-Instruct-Turbo

# Alternative: Custom OpenAI-compatible endpoint (for vLLM hosting)
# OPENAI_API_KEY=your_custom_api_key
# OPENAI_BASE_URL=http://your-vllm-server:8000/v1
# MODEL_NAME=your-hosted-model-name

# Search API keys for research tools
# Serper API (required for web search functionality)
SERPER_KEY_ID=your_serper_api_key_from_serper.dev

# Alternative: Google Custom Search API (if you prefer Google over Serper)
# GOOGLE_SEARCH_SECRET_KEY=your_google_api_key
# GOOGLE_SEARCH_ENGINE_ID=your_custom_search_engine_id

# Evaluation settings
# DEEPRESEARCH_TASK=Custom research question to test
# GAIA_DATASET_PATH=path/to/gaia.json
260 changes: 260 additions & 0 deletions examples/deepresearch/README.md
@@ -0,0 +1,260 @@
# DeepResearch Integration for rLLM

Contributor: Do we have an official score running the model on HLE?

Contributor Author: Do you mean the Tongyi model? I don't have that model spun up, but if we do, we can run the full HLE and get the score. For the GPT o3 run on 15 samples, we got 26.7% on HLE.


## Overview

This module integrates Tongyi's DeepResearch ReAct agent into the rLLM framework, enabling evaluation on academic benchmarks like HLE (Humanity's Last Exam). The integration demonstrates how to port external agent architectures into rLLM's workflow system while maintaining compatibility with the training and evaluation infrastructure.

## Architecture

```
DeepResearch Agent (ReAct with XML-based tool calling)
        ↓
DeepResearchWorkflow (rLLM Workflow wrapper)
        ↓
AgentWorkflowEngine (Parallel execution)
        ↓
Episode/Trajectory (rLLM data format)
```

### Key Components

- **`deepresearch_agent.py`**: MultiTurnReactAgent implementing Tongyi's ReAct loop with tool calling
- **`deepresearch_workflow.py`**: Wrapper that converts agent outputs to rLLM Episodes for trajectory tracking
- **`deepresearch_tools.py`**: Tool implementations (Search, Scholar, Visit, FileParser, PythonInterpreter)
- **`evaluate_hle.py`**: Evaluation script for HLE (Humanity's Last Exam) benchmark

## Installation

### Prerequisites

```bash
# Activate rLLM environment
conda activate rllm

# Install required dependencies
pip install datasets # For HLE dataset access
pip install tiktoken # Optional: for better token counting with OpenAI models
```

### Environment Setup

Create a `.env` file with your API keys:

```bash
# For model inference (choose one)
OPENAI_API_KEY=your_openai_key
TOGETHER_AI_API_KEY=your_together_key

# Optional: For web search tool
SERPER_API_KEY=your_serper_key # Get free key from serper.dev
```
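
If you are scripting against the agent directly rather than going through `evaluate_hle.py`, one common way to load these variables is python-dotenv. This is a hedged sketch (the example scripts may read the file differently, and the variable names follow the snippet above):

```python
# pip install python-dotenv
import os

from dotenv import load_dotenv

load_dotenv()  # picks up the .env file in the current working directory
api_key = os.environ["OPENAI_API_KEY"]    # required for OpenAI-backed inference
serper_key = os.getenv("SERPER_API_KEY")  # optional: only needed for the Search tool
```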

## Usage

### Running HLE Evaluation

```bash
# Evaluate on HLE dataset with default settings
python evaluate_hle.py --hf-dataset cais/hle --max-samples 10 --parallel-tasks 4

# Use specific model
python evaluate_hle.py --model gpt-4o --max-samples 5

# Use Together AI for evaluation
python evaluate_hle.py --model Qwen/Qwen2.5-7B-Instruct-Turbo \
--base-url https://api.together.xyz/v1 \
--max-samples 20

# Custom output directory
python evaluate_hle.py --output-dir ./my_results --max-samples 20
```

### Using DeepResearch Agent Directly

```python
from rllm.engine.rollout import OpenAIEngine
from deepresearch_agent import MultiTurnReactAgent
from deepresearch_tools import get_all_tools

# Setup rollout engine
engine = OpenAIEngine(
model="gpt-4o",
api_key="your_key",
base_url="https://api.openai.com/v1"
)

# Create agent with tools
agent = MultiTurnReactAgent(
rollout_engine=engine,
tools=get_all_tools()
)

# Run a research task (call this from within an async function, e.g. via asyncio.run)
result = await agent.run(
question="What is the reduced 12th dimensional Spin bordism of BG2?",
answer="Z/2" # Optional ground truth for evaluation
)

print(f"Prediction: {result['prediction']}")
print(f"Rounds: {result['rounds']}")
print(f"Time taken: {result['time_taken']}s")
```

### Integrating with rLLM Workflows

```python
from rllm.engine.agent_workflow_engine import AgentWorkflowEngine
from deepresearch_workflow import DeepResearchWorkflow
from deepresearch_tools import get_all_tools

# Create workflow engine for parallel execution (reuses `engine` from the example above)
workflow_engine = AgentWorkflowEngine(
workflow_cls=DeepResearchWorkflow,
workflow_args={
"tools": get_all_tools(),
"max_prompt_length": 4096,
"max_response_length": 2048
},
rollout_engine=engine,
n_parallel_tasks=4 # Run 4 tasks in parallel
)

# Run evaluation on multiple tasks
tasks = [
{"question": "Question 1", "answer": "Answer 1"},
{"question": "Question 2", "answer": "Answer 2"}
]

episodes = await workflow_engine.execute_tasks(tasks)

# Episodes contain full trajectories for training
for episode in episodes:
print(f"Task: {episode.task}")
print(f"Prediction: {episode.metrics.get('prediction')}")
print(f"Is correct: {episode.is_correct}")
```

## Tools

The agent has access to the following research tools:

| Tool | Description | Implementation Status |
| --------------------- | --------------------------- | ------------------------------------ |
| **Search** | Web search via Serper API | ✅ Fully implemented (needs API key) |
| **PythonInterpreter** | Execute Python code safely | ✅ Fully implemented with security |
| **Scholar** | Academic paper search | ❌ Placeholder only |
| **Visit** | Visit and analyze web pages | ❌ Placeholder only |
| **FileParser** | Parse various file formats | ⚠️ Basic text only (no PDF/DOCX) |

### Tool Implementation Notes

- **Search**: Real web search with Serper API integration. Configure API key in `.env` file
- **PythonInterpreter**: Enhanced security, 50s timeout, supports numpy/pandas when available
- **Scholar**: Returns placeholder results. Needs integration with arXiv/Google Scholar APIs
- **Visit**: Returns placeholder content. Needs a requests/BeautifulSoup implementation (see the sketch after this list)
- **FileParser**: Only reads text files up to 5000 chars. Original supports PDF/DOCX/media files
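
As a starting point for whoever picks up the Visit item, here is a minimal sketch of what a real implementation could look like, assuming the `DeepResearchTool` base class from `deepresearch_tools.py` exposes an async `call` that returns a string (the 5000-character cap mirrors the FileParser note above); this is not the shipped tool:

```python
import requests
from bs4 import BeautifulSoup

from deepresearch_tools import DeepResearchTool  # assumed import path


class VisitTool(DeepResearchTool):
    """Fetch a web page and return its readable text, truncated like FileParser."""

    async def call(self, url: str, **kwargs) -> str:
        try:
            # Blocking call kept for brevity; swap in httpx/aiohttp for real async use.
            resp = requests.get(url, timeout=15, headers={"User-Agent": "rllm-deepresearch"})
            resp.raise_for_status()
        except requests.RequestException as exc:
            return f"[visit error] {exc}"
        soup = BeautifulSoup(resp.text, "html.parser")
        for tag in soup(["script", "style", "noscript"]):
            tag.decompose()
        text = " ".join(soup.get_text(separator=" ").split())
        return text[:5000]
```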

## Key Improvements over the Original

### 1. Token Counting Fix

- **Problem**: Original used mismatched tokenizers (GPT-2 for GPT-4o) causing incorrect context limits
- **Solution**: Now uses the OpenAI API's own token statistics from `response.prompt_tokens` and `response.completion_tokens` (a sketch of this accounting follows below)
- **Impact**: No more false "context exceeded" errors at 13k tokens when limit is 128k
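
For illustration, here is a minimal sketch of this accounting against the raw OpenAI Python client, where the counts live under `response.usage`; the `CONTEXT_LIMIT` value and helper name are illustrative, not the exact code in `deepresearch_agent.py`:

```python
from openai import OpenAI

client = OpenAI()          # reads OPENAI_API_KEY from the environment
CONTEXT_LIMIT = 128_000    # e.g. gpt-4o's context window

def chat_with_accounting(messages: list[dict]) -> tuple[str, int]:
    response = client.chat.completions.create(model="gpt-4o", messages=messages)
    # Trust the API's own numbers instead of re-tokenizing with a mismatched tokenizer.
    # prompt_tokens already covers the whole message history, so this sum is the
    # context actually consumed by this turn.
    used = response.usage.prompt_tokens + response.usage.completion_tokens
    if used > 0.9 * CONTEXT_LIMIT:
        # Getting close to the real limit: summarize or truncate older messages here.
        pass
    return response.choices[0].message.content, used
```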

### 2. Context Management

- **Problem**: System would incorrectly truncate messages based on wrong token counts
- **Solution**: Track actual cumulative API token consumption for accurate context management
- **Impact**: Model can use full context window effectively

### 3. System Prompt Optimization

- **Problem**: Over-constrained prompt requiring specific tags caused unnatural responses
- **Solution**: Simplified prompt matching original Tongyi design, letting model reason naturally
- **Impact**: Better convergence, fewer infinite loops

### 4. Parallel Execution

- Leverages AgentWorkflowEngine for concurrent task processing
- Configurable parallelism (`n_parallel_tasks` parameter)
- Automatic retry on failures

## Evaluation Results

Formal benchmark numbers are still pending; an initial small-scale run (GPT o3, 15 HLE samples) scored 26.7%. The system is designed to evaluate on HLE and other academic benchmarks.

## Known Issues and Limitations

1. **Tool Placeholders**: Scholar and Visit tools need real implementations for research tasks
2. **Model-Specific Behavior**:
- Some models may not consistently use `<answer>` tags
- Tool calling format adherence varies by model
3. **Long Context Tasks**: Very complex research may still hit token limits
4. **Judge Accuracy**: LLM judge may not perfectly evaluate complex answers

## Future Improvements

- [ ] Implement real Scholar tool using arXiv/Semantic Scholar APIs
- [ ] Implement real Visit tool using requests/BeautifulSoup
- [ ] Add PDF/DOCX parsing to FileParser
- [ ] Create unified evaluation framework for multiple benchmarks
- [ ] Add more Tongyi agents (QwenCoder, etc.)
- [ ] Improve judge accuracy with better prompts

## Project Structure

```
examples/deepresearch/
├── deepresearch_agent.py # Core ReAct agent implementation
├── deepresearch_workflow.py # rLLM workflow wrapper
├── deepresearch_tools.py # Tool implementations
├── evaluate_hle.py # HLE evaluation script
├── react_agent_original.py # Original Tongyi reference
├── tool_*_original.py # Original tool references
├── hle_outputs/ # Evaluation results (git ignored)
└── README.md # This file
```

## Contributing

To add new tools or improve existing ones:

1. Implement tool in `deepresearch_tools.py` following the pattern:

```python
class YourTool(DeepResearchTool):
async def call(self, **kwargs) -> str:
# Your implementation
return result_string
```

2. Add it to the `DEEPRESEARCH_TOOLS` registry (a combined sketch of steps 1 and 2 follows this list)

3. Test with evaluation script

4. Submit PR with test results
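
As an illustration of steps 1 and 2 together, a hypothetical tool plus its registration might look like the following; the `WordCountTool` name and the dict-style registry are assumptions, so match whatever shape `DEEPRESEARCH_TOOLS` actually has in `deepresearch_tools.py`:

```python
class WordCountTool(DeepResearchTool):
    """Toy example: count the words in a piece of text the agent passes in."""

    async def call(self, text: str = "", **kwargs) -> str:
        return f"The text contains {len(text.split())} words."


# Assumes DEEPRESEARCH_TOOLS maps a tool name to its class; if it is a list or an
# instance registry instead, register the tool accordingly.
DEEPRESEARCH_TOOLS["word_count"] = WordCountTool
```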

## Related Work

This integration is part of the rLLM evaluation framework initiative. See also:

- `examples/strands/` - Strands agent integration
- `rllm/agents/` - Native rLLM agents
- `rllm/workflows/` - Workflow base classes

## Citation

If you use this integration, please cite:

```bibtex
@misc{deepresearch2024,
  title={DeepResearch: Multi-turn Research Agent},
  author={Alibaba NLP Team},
  year={2024},
  url={https://github.com/Alibaba-NLP/DeepResearch}
}
```

## License

This integration follows rLLM's license. The original DeepResearch implementation is from Alibaba's Tongyi team.