sumukshashidhar/yourbench

🤗 Yourbench

Dynamic Evaluation Set Generation for LLM Benchmarking [NAACL '25]

Python 3.10+ | Code style: ruff | License: MIT | 🤗 Hugging Face

🌟 Overview

Yourbench is a framework for dynamically generating evaluation sets from your own source documents. It addresses the limitations of static benchmarks, including benchmark saturation, by creating diverse, contextually rich questions tailored to specific educational levels.

🔄 Process Flow

[Figure: Process Flow diagram]

✨ Features

  • 🔄 Dynamic Generation: Create evaluation sets on the fly from your source documents
  • 📚 Semantic Chunking: Smart document splitting that maintains context and meaning
  • 🤔 Multi-hop Questions: Generate questions that require synthesizing information across document sections
  • 📊 Configurable Difficulty: Tailor questions to specific educational levels
  • 🔍 Diverse Question Types: Support for 10 different question types
  • 🤖 Model Flexibility: Works with OpenAI and Azure OpenAI models via LiteLLM
  • 📦 Hugging Face Integration: Direct dataset publishing to Hugging Face Hub

🛠️ Requirements

  • Python 3.10+

📦 Installation

# Create a virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # Linux/macOS
# On Windows, activate with this command instead:
# .\venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

🚀 Quick Start

  1. Set up your environment:
# For OpenAI / OpenAI compatible APIs
export MODEL_BASE_URL=your_openai_url
export MODEL_API_KEY=your_openai_key

# For Azure OpenAI
export AZURE_BASE_URL=your_azure_url
export AZURE_API_KEY=your_azure_key
  2. Create a task configuration (config.yaml). More information is available in the documentation, and you can also look at an example task configuration.

  3. Run the example task (after setting your 🤗 username / organization in the config!):

python src/yourbench/run_task.py --task-name yourbench_y1
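
Before launching a task, it can help to confirm that the variables from step 1 are actually set. The following is a minimal sketch of such a preflight check. The helper `missing_env_vars` and the `REQUIRED_VARS` grouping are hypothetical and not part of Yourbench; only the variable names come from the steps above.

```python
import os
from typing import Mapping, Optional

# Variable names are taken from the Quick Start; grouping them by provider
# is an assumption made for this sketch.
REQUIRED_VARS = {
    "openai": ("MODEL_BASE_URL", "MODEL_API_KEY"),
    "azure": ("AZURE_BASE_URL", "AZURE_API_KEY"),
}


def missing_env_vars(provider: str, env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return the required variables that are unset or empty for a provider."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS[provider] if not env.get(name)]


if __name__ == "__main__":
    # Report the status of each provider using the real process environment.
    for provider in REQUIRED_VARS:
        missing = missing_env_vars(provider)
        status = "ok" if not missing else f"missing {', '.join(missing)}"
        print(f"{provider}: {status}")
```

Passing an explicit mapping instead of relying on `os.environ` keeps the check easy to test.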

📚 Documentation

Detailed documentation is available in the docs directory.

🏗️ Pipeline Components

1. Dataset Generation

  • Processes source documents
  • Creates structured datasets
  • Supports local files and Hugging Face datasets
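
As a rough illustration of turning local files into structured records, here is a minimal sketch. The function `load_documents`, the record fields (`document_id`, `source_path`, `content`), and the accepted suffixes are all assumptions made for this example; they do not describe Yourbench's actual dataset schema.

```python
from pathlib import Path


def load_documents(files_directory: str, suffixes: tuple[str, ...] = (".txt", ".md")) -> list[dict]:
    """Read text files under a directory into a list of record dicts."""
    records = []
    # Sort paths so the resulting dataset order is deterministic.
    for path in sorted(Path(files_directory).rglob("*")):
        if path.is_file() and path.suffix.lower() in suffixes:
            records.append(
                {
                    "document_id": path.stem,  # hypothetical field name
                    "source_path": str(path),  # hypothetical field name
                    "content": path.read_text(encoding="utf-8"),
                }
            )
    return records
```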

2. Document Summarization

  • Generates document summaries
  • Provides context for question generation
  • Uses configured language model

3. Semantic Chunking

  • Splits documents intelligently
  • Maintains semantic coherence
  • Configurable chunk sizes and overlap
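
To show what configurable chunk size and overlap mean in practice, here is a simplified sketch. It uses a plain sliding window over sentences, not semantic similarity, so it is a stand-in for illustration and not the chunking method Yourbench implements. The function name and parameters are hypothetical.

```python
import re


def chunk_sentences(text: str, chunk_size: int = 3, overlap: int = 1) -> list[str]:
    """Group sentences into chunks of `chunk_size`, sharing `overlap` sentences."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + chunk_size]))
        if start + chunk_size >= len(sentences):
            break  # this window already reached the end of the text
    return chunks
```

With `chunk_size=3` and `overlap=1`, consecutive chunks share one sentence, which preserves some context across chunk boundaries.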

4. Multi-hop Chunk Creation

  • Pairs related document chunks
  • Enables complex reasoning questions
  • Smart chunk selection
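
One way to picture pairing related chunks is to score every pair by similarity and keep the best ones. The sketch below uses Jaccard word overlap as the similarity measure. This choice, along with the names `jaccard`, `pair_related_chunks`, `top_k`, and `min_score`, is an assumption for illustration; the source does not say how Yourbench selects chunk pairs.

```python
import re
from itertools import combinations


def _tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens of a text, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two texts."""
    ta, tb = _tokens(a), _tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def pair_related_chunks(
    chunks: list[str], top_k: int = 2, min_score: float = 0.1
) -> list[tuple[int, int, float]]:
    """Return up to `top_k` (i, j, score) pairs with score >= `min_score`."""
    scored = [
        (i, j, jaccard(chunks[i], chunks[j]))
        for i, j in combinations(range(len(chunks)), 2)
    ]
    related = [pair for pair in scored if pair[2] >= min_score]
    # Highest-scoring pairs first.
    related.sort(key=lambda pair: pair[2], reverse=True)
    return related[:top_k]
```

Each returned pair could then be passed to a multi-hop question prompt, while unrelated chunks are filtered out by `min_score`.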

5. Question Generation

  • Single-shot questions from individual chunks
  • Multi-hop questions from chunk pairs
  • 10 different question types
  • Difficulty calibration
  • Educational level targeting

6. Dataset Management

  • Hugging Face integration
  • Local storage options
  • Dataset versioning

🎯 Question Types

  1. Analytical: Break down complex ideas
  2. Application-based: Apply concepts to scenarios
  3. Clarification: Deep dive into specifics
  4. Counterfactual: Explore alternatives
  5. Conceptual: Examine theories
  6. True-false: Verify understanding
  7. Factual: Test recall
  8. Open-ended: Encourage discussion
  9. False-premise: Correct misconceptions
  10. Edge-case: Test boundaries
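
If you need to refer to these types in your own tooling, an enumeration keeps the labels consistent. This is a sketch only: the class `QuestionType`, the helper `parse_question_type`, and the lowercase string identifiers are assumptions derived from the list above, not identifiers confirmed to exist in Yourbench.

```python
from enum import Enum


class QuestionType(str, Enum):
    """The ten question types listed above; string values are assumed."""

    ANALYTICAL = "analytical"
    APPLICATION_BASED = "application-based"
    CLARIFICATION = "clarification"
    COUNTERFACTUAL = "counterfactual"
    CONCEPTUAL = "conceptual"
    TRUE_FALSE = "true-false"
    FACTUAL = "factual"
    OPEN_ENDED = "open-ended"
    FALSE_PREMISE = "false-premise"
    EDGE_CASE = "edge-case"


def parse_question_type(label: str) -> QuestionType:
    """Map a label such as 'Open-ended' to a QuestionType.

    Raises ValueError for labels that are not in the list.
    """
    return QuestionType(label.strip().lower())
```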

⚙️ Configuration

Example configuration:

task_name: yourbench_y1
configurations:
  push_to_huggingface: true
  set_hf_repo_visibility: public
  hf_organization: your-org
  model:
    model_name: gpt-4
    model_type: openai
    max_concurrent_requests: 512

selected_choices:
  generate_dataset:
    execute: true
    files_directory: examples/data
    dataset_name: my_dataset

See the Configuration Guide for detailed options.
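
A quick sanity check on a parsed configuration can catch missing keys before a run starts. The sketch below works on a plain dict, such as one produced by a YAML parser like PyYAML's `yaml.safe_load`. The function `validate_config` and its rules are hypothetical; they cover only the keys shown in the example above and are not Yourbench's own validation.

```python
# Keys taken from the example configuration; treating them as required
# is an assumption of this sketch.
REQUIRED_MODEL_KEYS = ("model_name", "model_type", "max_concurrent_requests")


def validate_config(config: dict) -> list[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if not config.get("task_name"):
        problems.append("task_name is missing")
    settings = config.get("configurations") or {}
    if settings.get("push_to_huggingface") and not settings.get("hf_organization"):
        problems.append("hf_organization is required when push_to_huggingface is true")
    model = settings.get("model") or {}
    for key in REQUIRED_MODEL_KEYS:
        if key not in model:
            problems.append(f"configurations.model.{key} is missing")
    return problems


# The example configuration from above, written as a Python dict.
EXAMPLE_CONFIG = {
    "task_name": "yourbench_y1",
    "configurations": {
        "push_to_huggingface": True,
        "set_hf_repo_visibility": "public",
        "hf_organization": "your-org",
        "model": {
            "model_name": "gpt-4",
            "model_type": "openai",
            "max_concurrent_requests": 512,
        },
    },
    "selected_choices": {
        "generate_dataset": {
            "execute": True,
            "files_directory": "examples/data",
            "dataset_name": "my_dataset",
        },
    },
}
```

Calling `validate_config(EXAMPLE_CONFIG)` returns an empty list, since every key the sketch checks is present.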

🧰 Development

We use:

  • Ruff for code formatting and linting
  • pytest for testing

🤝 Contributing

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Install development dependencies
  4. Make your changes
  5. Run tests and ensure code style compliance
  6. Commit your changes (git commit -m 'Add amazing feature')
  7. Push to the branch (git push origin feature/amazing-feature)
  8. Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments
