- [2025-08-22] 🎯 FutureQueryEval Dataset Released! - The first temporal IR benchmark with queries from April 2025+
- [2025-08-22] 🔧 Comprehensive evaluation framework released - 22 reranking methods, 40 variants tested
- [2025-08-22] 🏆 Integrated with the RankArena leaderboard - you can view and interact with RankArena through this link
- [2025-08-20] 🎉 Paper accepted at EMNLP Findings 2025
We present the most comprehensive empirical study of reranking methods to date, systematically evaluating 22 state-of-the-art approaches across 40 variants. Our key contribution is FutureQueryEval - the first temporal benchmark designed to test reranker generalization on truly novel queries unseen during LLM pretraining.
- Temporal Performance Gap: 5-15% performance drop on novel queries compared to standard benchmarks
- Listwise Superiority: Best generalization to unseen content (8% avg. degradation vs 12-15% for others)
- Efficiency Trade-offs: Runtime analysis across all 40 variants identifies the best speed-accuracy operating points
- Domain Vulnerabilities: All methods struggle with argumentative and informal content
FutureQueryEval is a novel IR benchmark comprising 148 queries with 2,938 query-document pairs across 7 topical categories, designed to evaluate reranker performance on temporal novelty.
- Zero Contamination: All queries refer to events after April 2025
- Human Annotated: 4 expert annotators with quality control
- Diverse Domains: Technology, Sports, Politics, Science, Health, Business, Entertainment
- Real Events: Based on actual news and developments, not synthetic data
| Metric | Value |
|---|---|
| Total Queries | 148 |
| Total Documents | 2,787 |
| Query-Document Pairs | 2,938 |
| Avg. Relevant Docs per Query | 6.54 |
| Languages | English |
| License | MIT |
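Until the official release, here is a minimal Python sketch for loading the benchmark; the file names and the TREC-style qrels layout are assumptions about the eventual distribution format, not the final spec:

```python
# Hypothetical loading sketch for FutureQueryEval. The dataset is not yet
# released, so queries.tsv / qrels.txt and their layouts are assumed formats.
import csv
from collections import defaultdict

def load_queries(path="queries.tsv"):
    # Assumed layout: query_id <TAB> query_text
    with open(path, newline="", encoding="utf-8") as f:
        return dict(csv.reader(f, delimiter="\t"))

def load_qrels(path="qrels.txt"):
    # Assumed TREC layout: query_id 0 doc_id relevance
    qrels = defaultdict(dict)
    with open(path, encoding="utf-8") as f:
        for line in f:
            qid, _, doc_id, rel = line.split()
            qrels[qid][doc_id] = int(rel)
    return qrels

queries = load_queries()
qrels = load_qrels()
print(f"{len(queries)} queries, {sum(len(d) for d in qrels.values())} judged pairs")
```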
- Technology: 25.0% (37 queries)
- Sports: 20.9% (31 queries)
- Science & Environment: 13.5% (20 queries)
- Business & Finance: 12.8% (19 queries)
- Health & Medicine: 10.8% (16 queries)
- World News & Politics: 9.5% (14 queries)
- Entertainment & Culture: 7.4% (11 queries)
🌍 World News & Politics:
"What specific actions has Egypt taken to support injured Palestinians from Gaza,
as highlighted during the visit of Presidents El-Sisi and Macron to Al-Arish General Hospital?"
⚽ Sports:
"Which teams qualified for the 2025 UEFA European Championship playoffs in June 2025?"
💻 Technology:
"What are the key features of Apple's new Vision Pro 2 announced at WWDC 2025?"
- Source Selection: Major news outlets, official sites, sports organizations
- Temporal Filtering: Events after April 2025 only
- Query Creation: Manual generation by domain experts
- Novelty Validation: Tested against GPT-4 knowledge cutoff
- Quality Control: Multi-annotator review with senior oversight
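The temporal-filtering step in the pipeline above can be made concrete with a small sketch; the event fields and helper below are purely illustrative, since the actual curation was carried out manually by annotators:

```python
# Illustrative sketch of the temporal filter: keep only events dated after
# the April 2025 cutoff. Field names here are hypothetical.
from datetime import date

CUTOFF = date(2025, 4, 1)

def is_temporally_novel(event: dict) -> bool:
    """True if the event postdates the knowledge-cutoff boundary."""
    return event["date"] > CUTOFF

events = [
    {"title": "WWDC 2025 keynote", "date": date(2025, 6, 9)},
    {"title": "Pre-cutoff product launch", "date": date(2024, 9, 12)},
]
novel = [e for e in events if is_temporally_novel(e)]
print([e["title"] for e in novel])  # only the post-cutoff event survives
```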
The code and dataset will be released soon.
| Method Category | Best Model | NDCG@10 | Runtime (s) |
|---|---|---|---|
| Listwise | Zephyr-7B | 62.65 | 1,240 |
| Pointwise | MonoT5-3B | 60.75 | 486 |
| Setwise | Flan-T5-XL | 56.57 | 892 |
| Pairwise | EchoRank-XL | 54.97 | 2,158 |
| Tournament | TourRank-GPT4o | 62.02 | 3,420 |
- 🏆 Best Overall: Zephyr-7B (62.65 NDCG@10)
- ⚡ Best Efficiency: FlashRank-MiniLM (55.43 NDCG@10, 195s)
- 🎯 Best Balance: MonoT5-3B (60.75 NDCG@10, 486s)
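For reference, the NDCG@10 numbers above can be reproduced from a run file with a few lines of plain Python; this is a self-contained sketch for auditing, not necessarily the exact evaluation harness used in the paper:

```python
# Minimal NDCG@k sketch, using the standard log2 discount formulation.
import math

def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    """ranked_doc_ids: doc ids in predicted order; qrels: {doc_id: graded rel}."""
    dcg = sum(
        qrels.get(doc_id, 0) / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_doc_ids[:k])
    )
    ideal = sorted(qrels.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Example: one query with two relevant documents (grades 2 and 1).
print(ndcg_at_k(["d3", "d1", "d7"], {"d1": 2, "d7": 1}))  # ~0.67
```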
We evaluate 22 reranking approaches across multiple paradigms:

**Pointwise:**
- MonoT5, RankT5, InRanker, TWOLAR
- FlashRank, Transformer Rankers
- UPR, MonoBERT, ColBERT

**Listwise:**
- RankGPT, ListT5, Zephyr, Vicuna
- LiT5-Distill, InContext Rerankers

**Pairwise:**
- PRP (Pairwise Ranking Prompting)
- EchoRank

**Setwise:**
- Setwise (Flan-T5 variants)

**Tournament:**
- TourRank (Tournament-based)

**Task-specific fine-tuned:**
- RankLLaMA
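As a concrete illustration of the pointwise paradigm, the sketch below scores query-document pairs in the style of MonoT5, using the public `castorini/monot5-base-msmarco` checkpoint and its "Query: ... Document: ... Relevant:" prompt; treat it as a sketch of the technique, not our exact evaluation setup:

```python
# Pointwise reranking sketch in the style of MonoT5: each pair is scored
# independently by the probability that the model decodes "true".
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tok = T5Tokenizer.from_pretrained("castorini/monot5-base-msmarco")
model = T5ForConditionalGeneration.from_pretrained("castorini/monot5-base-msmarco")
model.eval()

def monot5_score(query: str, doc: str) -> float:
    """Return P("true") for one query-document pair as the relevance score."""
    prompt = f"Query: {query} Document: {doc} Relevant:"
    inputs = tok(prompt, return_tensors="pt", truncation=True, max_length=512)
    # MonoT5 is trained to emit "true"/"false" as the first decoded token.
    start = torch.full((1, 1), model.config.decoder_start_token_id, dtype=torch.long)
    with torch.no_grad():
        logits = model(**inputs, decoder_input_ids=start).logits[0, 0]
    true_id = tok.encode("true")[0]    # first subword token of "true"
    false_id = tok.encode("false")[0]  # first subword token of "false"
    return torch.softmax(logits[[true_id, false_id]], dim=0)[0].item()

# Rerank a candidate list for one query by descending relevance probability.
candidates = ["passage one ...", "passage two ..."]
ranked = sorted(candidates, key=lambda d: monot5_score("example query", d), reverse=True)
```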
FutureQueryEval will be updated every 6 months with new queries about recent events to maintain temporal novelty. Subscribe to releases for notifications!
- Version 1.1 (December 2025): +100 queries from July-September 2025 events
- Version 1.2 (June 2026): +100 queries from October 2025-March 2026 events
Submit your reranking method results to appear on our leaderboard! See SUBMISSION.md for guidelines.
Current standings available at: RankArena
We welcome contributions! See CONTRIBUTING.md for:
- Adding new reranking methods
- Improving evaluation metrics
- Dataset quality improvements
- Bug fixes and optimizations
If you use FutureQueryEval or our evaluation framework, please cite:
Coming Soon
- Authors: Abdelrahman Abdallah, Bhawna Piryani
- Institution: University of Innsbruck
- Issues: Please use GitHub Issues for bug reports and feature requests
⭐ Star this repo if you find it helpful! ⭐
📧 Questions? Open an issue or contact the authors