Dynamic Evaluation Set Generation for LLM Benchmarking [NAACL '25]
Yourbench is a powerful framework for dynamically generating evaluation sets from source documents. It addresses the limitations of static benchmarks and benchmark saturation by creating diverse, contextually rich questions tailored to specific educational levels.
- 🔄 Dynamic Generation: Create evaluation sets on-the-fly from any source documents
- 📚 Semantic Chunking: Smart document splitting that maintains context and meaning
- 🤔 Multi-hop Questions: Generate questions that require synthesizing information across document sections
- 📊 Configurable Difficulty: Tailor questions to specific educational levels
- 🔍 Diverse Question Types: Support for 10 different question types
- 🤖 Model Flexibility: Works with OpenAI and Azure OpenAI models via LiteLLM
- 📦 Hugging Face Integration: Direct dataset publishing to Hugging Face Hub
Requirements:
- Python 3.10+
- LiteLLM for model inference
- Sentence Transformers for semantic chunking
- Hugging Face Datasets for dataset management
- An OpenAI-compatible API endpoint or Azure OpenAI (more model types coming soon!)
Installation:

```bash
# Create a virtual environment (recommended)
python -m venv venv
source venv/bin/activate   # Linux/Mac
# or
.\venv\Scripts\activate    # Windows

# Install dependencies
pip install -r requirements.txt
```
- Set up your environment:
```bash
# For OpenAI / OpenAI-compatible APIs
export MODEL_BASE_URL=your_openai_url
export MODEL_API_KEY=your_openai_key

# For Azure OpenAI
export AZURE_BASE_URL=your_azure_url
export AZURE_API_KEY=your_azure_key
```
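These variables configure the LiteLLM-backed inference client. As a rough illustration of how such an OpenAI-compatible call looks (an assumption about the internal wiring, not Yourbench's actual code):

```python
# Minimal sketch: route a request through LiteLLM to an
# OpenAI-compatible endpoint. How Yourbench consumes these
# variables internally is an assumption, not its actual code.
import os
from litellm import completion

response = completion(
    model="gpt-4",                           # any model your endpoint serves
    api_base=os.environ["MODEL_BASE_URL"],   # OpenAI-compatible base URL
    api_key=os.environ["MODEL_API_KEY"],
    messages=[{"role": "user", "content": "Say hello."}],
)
print(response.choices[0].message.content)
```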
- Create a task configuration (`config.yaml`). The Configuration Guide covers all available options, and an example task configuration is included in the repository.
- Run the example task (after setting your 🤗 username / organization in the config!):

```bash
python src/yourbench/run_task.py --task-name yourbench_y1
```
Detailed documentation is available in the `docs` directory:
- Configuration Guide: Comprehensive guide to YAML configuration
- Question Generation: Details about the question generation process
- Chunking System: Information about the semantic chunking system
Dataset generation:
- Processes source documents
- Creates structured datasets
- Supports local files and Hugging Face datasets
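For intuition, turning local files into a structured dataset might look like this sketch built on Hugging Face Datasets (the column names are illustrative, not Yourbench's actual schema):

```python
# Illustrative sketch: load local documents into a structured dataset.
# The columns here are hypothetical, not Yourbench's actual schema.
from pathlib import Path
from datasets import Dataset

files = sorted(Path("examples/data").glob("*.txt"))
dataset = Dataset.from_list(
    [{"filename": f.name, "text": f.read_text()} for f in files]
)
print(dataset)
```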
Summarization:
- Generates document summaries
- Provides context for question generation
- Uses configured language model
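Conceptually, this stage is one model call per document. A minimal sketch via LiteLLM (the actual prompt lives inside Yourbench and will differ):

```python
# Hypothetical summarization call; Yourbench's real prompt differs.
from litellm import completion

def summarize(document: str, model: str = "gpt-4") -> str:
    """Produce a short summary used as context for question generation."""
    response = completion(
        model=model,
        messages=[
            {"role": "system", "content": "Summarize this document concisely."},
            {"role": "user", "content": document},
        ],
    )
    return response.choices[0].message.content
```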
Semantic chunking:
- Splits documents intelligently
- Maintains semantic coherence
- Configurable chunk sizes and overlap
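One common way to implement this (a sketch, not necessarily Yourbench's exact algorithm) is to embed sentences with Sentence Transformers and open a new chunk wherever adjacent sentences drift apart semantically; the model name and threshold below are illustrative:

```python
# Sketch of similarity-based chunking with Sentence Transformers.
# The model choice and threshold are illustrative, not Yourbench's defaults.
from sentence_transformers import SentenceTransformer
import numpy as np

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_chunks(sentences: list[str], threshold: float = 0.5) -> list[str]:
    """Start a new chunk wherever adjacent sentences drift apart semantically."""
    embeddings = encoder.encode(sentences, normalize_embeddings=True)
    chunks, current = [], [sentences[0]]
    for prev, nxt, sentence in zip(embeddings, embeddings[1:], sentences[1:]):
        if float(np.dot(prev, nxt)) < threshold:  # cosine sim (normalized)
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks
```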
Multi-hop chunk pairing:
- Pairs related document chunks
- Enables complex reasoning questions
- Smart chunk selection
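A plausible selection heuristic (again a sketch with illustrative thresholds): keep chunk pairs that are topically related but not near-duplicates, so that a question genuinely needs both chunks:

```python
# Sketch of pairing chunks for multi-hop questions: related enough to
# share a topic, distinct enough that one chunk alone cannot answer.
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def related_pairs(chunks: list[str], lo: float = 0.4, hi: float = 0.9):
    """Yield index pairs whose cosine similarity falls between lo and hi."""
    emb = encoder.encode(chunks, normalize_embeddings=True)
    sims = emb @ emb.T  # cosine similarities (embeddings are normalized)
    for i in range(len(chunks)):
        for j in range(i + 1, len(chunks)):
            if lo < sims[i, j] < hi:
                yield i, j
```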
Question generation:
- Single-shot questions from individual chunks
- Multi-hop questions from chunk pairs
- 10 different question types
- Difficulty calibration
- Educational level targeting
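At its core this is a templated prompt per chunk (or chunk pair). A hypothetical single-shot version follows; Yourbench's real templates also inject the document summary and richer instructions:

```python
# Hypothetical single-shot prompt; the real templates are more elaborate.
from litellm import completion

def generate_question(chunk: str, question_type: str, level: str) -> str:
    prompt = (
        f"Write one {question_type} question answerable from the text below, "
        f"calibrated for a {level} audience.\n\n{chunk}"
    )
    response = completion(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```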
Dataset storage:
- Hugging Face integration
- Local storage options
- Dataset versioning
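Publishing goes through the standard Hugging Face Datasets API; the repo id below assumes the `hf_organization` and `dataset_name` values from the example configuration:

```python
# Sketch of pushing results to the Hub. The repo id mirrors the
# hf_organization / dataset_name fields from the example config.
from datasets import Dataset

dataset = Dataset.from_list([{"question": "...", "answer": "..."}])
dataset.push_to_hub("your-org/my_dataset", private=False)  # public repo
```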
Supported question types:
- Analytical: Break down complex ideas
- Application-based: Apply concepts to scenarios
- Clarification: Deep dive into specifics
- Counterfactual: Explore alternatives
- Conceptual: Examine theories
- True-false: Verify understanding
- Factual: Test recall
- Open-ended: Encourage discussion
- False-premise: Correct misconceptions
- Edge-case: Test boundaries
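To make the taxonomy concrete, here is a hypothetical generated record (the actual output schema may differ):

```python
# Hypothetical record; field names are illustrative only.
example = {
    "question_type": "counterfactual",
    "educational_level": "undergraduate",
    "question": (
        "How might evaluation results differ if the benchmark were static "
        "rather than regenerated from the source documents?"
    ),
    "source_chunks": [3, 7],  # a multi-hop item drawing on two chunks
}
```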
Example configuration:

```yaml
task_name: yourbench_y1
configurations:
  push_to_huggingface: true
  set_hf_repo_visibility: public
  hf_organization: your-org
  model:
    model_name: gpt-4
    model_type: openai
    max_concurrent_requests: 512
selected_choices:
  generate_dataset:
    execute: true
    files_directory: examples/data
    dataset_name: my_dataset
```
See the Configuration Guide for detailed options.
Contributions are welcome! To contribute:

- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Install development dependencies
- Make your changes
- Run tests and ensure code style compliance
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
Built with:
- LiteLLM for model inference
- Sentence Transformers for semantic embeddings
- Hugging Face for dataset infrastructure