
How Good are LLM-based Rerankers? An Empirical Analysis of State-of-the-Art Reranking Models 🔍

🎉 News

  • [2025-08-22] 🎯 FutureQueryEval Dataset Released! - The first temporal IR benchmark, with queries about events from April 2025 onward
  • [2025-08-22] 🔧 Comprehensive evaluation framework released - 22 reranking methods, 40 variants tested
  • [2025-08-22] 📊 Integrated with the RankArena leaderboard, where you can view and interact with the results
  • [2025-08-20] 📝 Paper accepted at EMNLP Findings 2025

📖 Introduction

We present the most comprehensive empirical study of reranking methods to date, systematically evaluating 22 state-of-the-art approaches across 40 variants. Our key contribution is FutureQueryEval - the first temporal benchmark designed to test reranker generalization on truly novel queries unseen during LLM pretraining.

Performance Overview

Figure: Performance comparison across pointwise, pairwise, and listwise reranking paradigms.

Key Findings 🔍

  • Temporal Performance Gap: 5-15% drop in performance on novel queries relative to standard benchmarks (see the degradation sketch below)
  • Listwise Superiority: Best generalization to unseen content (8% avg. degradation vs. 12-15% for the other paradigms)
  • Efficiency Trade-offs: Runtime analysis identifies which methods offer the best speed-accuracy balance (see Efficiency Analysis below)
  • Domain Vulnerabilities: All methods struggle with argumentative and informal content
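
The degradation figures above are relative drops in NDCG@10. A minimal sketch of the arithmetic, assuming the drop is measured relative to the standard-benchmark score (the numbers below are illustrative, not taken from the paper):

```python
def relative_degradation(ndcg_standard: float, ndcg_novel: float) -> float:
    """Relative performance drop (%) when moving from a standard
    benchmark to novel, post-cutoff queries."""
    return 100.0 * (ndcg_standard - ndcg_novel) / ndcg_standard

# Illustrative values only: a reranker at 0.700 NDCG@10 on a standard
# benchmark that falls to 0.644 on FutureQueryEval degrades by 8.0%.
print(f"{relative_degradation(0.700, 0.644):.1f}%")  # -> 8.0%
```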

📄 FutureQueryEval Dataset

Overview

FutureQueryEval is a novel IR benchmark comprising 148 queries with 2,938 query-document pairs across 7 topical categories, designed to evaluate reranker performance on temporal novelty.

🎯 Why FutureQueryEval?

  • Zero Contamination: All queries refer to events after April 2025
  • Human Annotated: 4 expert annotators with quality control
  • Diverse Domains: Technology, Sports, Politics, Science, Health, Business, Entertainment
  • Real Events: Based on actual news and developments, not synthetic data

📊 Dataset Statistics

| Metric | Value |
|---|---|
| Total Queries | 148 |
| Total Documents | 2,787 |
| Query-Document Pairs | 2,938 |
| Avg. Relevant Docs per Query | 6.54 |
| Languages | English |
| License | MIT |
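
The release format has not been announced yet; the sketch below assumes a TREC-style layout with hypothetical file names (queries.tsv holding qid<TAB>text lines, qrels.txt holding standard qid 0 docid rel lines):

```python
import csv
from collections import defaultdict

# Hypothetical file names and layout; the actual release may differ.
def load_queries(path="queries.tsv"):
    """One query per line: qid<TAB>query text."""
    with open(path, newline="", encoding="utf-8") as f:
        return {qid: text for qid, text in csv.reader(f, delimiter="\t")}

def load_qrels(path="qrels.txt"):
    """Standard TREC qrels lines: qid 0 docid relevance."""
    qrels = defaultdict(dict)
    with open(path, encoding="utf-8") as f:
        for line in f:
            qid, _, docid, rel = line.split()
            qrels[qid][docid] = int(rel)
    return qrels

queries, qrels = load_queries(), load_qrels()
print(f"{len(queries)} queries, "
      f"{sum(len(d) for d in qrels.values())} judged pairs")  # expect 148 / 2,938
```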

🌍 Category Distribution

  • Technology: 25.0% (37 queries)
  • Sports: 20.9% (31 queries)
  • Science & Environment: 13.5% (20 queries)
  • Business & Finance: 12.8% (19 queries)
  • Health & Medicine: 10.8% (16 queries)
  • World News & Politics: 9.5% (14 queries)
  • Entertainment & Culture: 7.4% (11 queries)

📝 Example Queries

🌍 World News & Politics:
"What specific actions has Egypt taken to support injured Palestinians from Gaza, 
as highlighted during the visit of Presidents El-Sisi and Macron to Al-Arish General Hospital?"

⚽ Sports:
"Which teams qualified for the 2025 UEFA European Championship playoffs in June 2025?"

💻 Technology:
"What are the key features of Apple's new Vision Pro 2 announced at WWDC 2025?"

Data Collection Methodology

  1. Source Selection: Major news outlets, official sites, sports organizations
  2. Temporal Filtering: Events after April 2025 only
  3. Query Creation: Manual generation by domain experts
  4. Novelty Validation: Tested against the GPT-4 knowledge cutoff (see the sketch after this list)
  5. Quality Control: Multi-annotator review with senior oversight
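
A minimal sketch of step 4, assuming novelty validation amounts to checking that a model with a pre-April-2025 knowledge cutoff cannot already answer the query; the prompt, decision rule, and ask_llm backend are illustrative, not the authors' exact procedure:

```python
from typing import Callable

NOVELTY_PROMPT = (
    "Answer the following question if you can; otherwise reply exactly "
    "'UNKNOWN'.\n\nQuestion: {query}"
)

def is_novel(query: str, ask_llm: Callable[[str], str]) -> bool:
    """Treat a query as novel if the model cannot answer it.
    `ask_llm` wraps whatever LLM API is used (e.g. a GPT-4 call)."""
    answer = ask_llm(NOVELTY_PROMPT.format(query=query))
    return answer.strip().upper().startswith("UNKNOWN")

# Dummy backend for illustration; swap in a real GPT-4 client.
print(is_novel("Which teams qualified for the playoffs in June 2025?",
               lambda prompt: "UNKNOWN"))  # -> True
```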

🚀 Quick Start

The code and dataset will be available soon.
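
Until the official harness ships, reported numbers can be approximated with any standard NDCG@10 implementation. A self-contained sketch using the linear-gain formulation (as in trec_eval's ndcg_cut); this is not the authors' evaluation code:

```python
import math

def ndcg_at_10(ranked_docids: list[str], rel: dict[str, int]) -> float:
    """NDCG@10 for one query. `ranked_docids` is the reranker's output
    order; `rel` maps docid -> graded relevance from the qrels."""
    dcg = sum(rel.get(d, 0) / math.log2(i + 2)
              for i, d in enumerate(ranked_docids[:10]))
    ideal = sorted(rel.values(), reverse=True)[:10]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Toy example with three judged documents.
rel = {"d1": 2, "d2": 0, "d3": 1}
print(round(ndcg_at_10(["d3", "d1", "d2"], rel), 4))  # -> 0.8597
```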

📊 Evaluation Results

Top Performers on FutureQueryEval

| Method Category | Best Model | NDCG@10 | Runtime (s) |
|---|---|---|---|
| Listwise | Zephyr-7B | 62.65 | 1,240 |
| Pointwise | MonoT5-3B | 60.75 | 486 |
| Setwise | Flan-T5-XL | 56.57 | 892 |
| Pairwise | EchoRank-XL | 54.97 | 2,158 |
| Tournament | TourRank-GPT4o | 62.02 | 3,420 |

Performance Insights

  • πŸ† Best Overall: Zephyr-7B (62.65 NDCG@10)
  • ⚑ Best Efficiency: FlashRank-MiniLM (55.43 NDCG@10, 195s)
  • 🎯 Best Balance: MonoT5-3B (60.75 NDCG@10, 486s)

Efficiency Analysis

Figure: Runtime vs. performance trade-offs across reranking methods.

🔧 Supported Methods

We evaluate 22 reranking approaches across multiple paradigms:

Pointwise Methods

  • MonoT5, RankT5, InRanker, TWOLAR
  • FlashRank, Transformer Rankers
  • UPR, MonoBERT, ColBERT

Listwise Methods

  • RankGPT, ListT5, Zephyr, Vicuna
  • LiT5-Distill, InContext Rerankers

Pairwise Methods

  • PRP (Pairwise Ranking Prompting)
  • EchoRank

Advanced Methods

  • Setwise (Flan-T5 variants)
  • TourRank (Tournament-based)
  • RankLLaMA (Task-specific fine-tuned)
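
The paradigms differ in how many documents the model considers per call, which drives the runtime gaps in the table above. A schematic sketch of the three core interfaces (hypothetical class names, not the framework's actual API):

```python
from abc import ABC, abstractmethod

class PointwiseReranker(ABC):
    @abstractmethod
    def score(self, query: str, doc: str) -> float:
        """Score one (query, document) pair independently (e.g. MonoT5)."""

class PairwiseReranker(ABC):
    @abstractmethod
    def prefer(self, query: str, doc_a: str, doc_b: str) -> bool:
        """Return True if doc_a should rank above doc_b (e.g. PRP)."""

class ListwiseReranker(ABC):
    @abstractmethod
    def rerank(self, query: str, docs: list[str]) -> list[int]:
        """Return a permutation of indices ordering all candidates at
        once (e.g. RankGPT-style prompting over a window of documents)."""
```

Setwise and tournament methods can be seen as generalizations of the pairwise interface to small subsets of documents per call.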

🔄 Dataset Updates

FutureQueryEval will be updated every 6 months with new queries about recent events to maintain temporal novelty. Subscribe to releases for notifications!

Upcoming Updates

  • Version 1.1 (December 2025): +100 queries from July-September 2025 events
  • Version 1.2 (June 2026): +100 queries from October 2025-March 2026 events

📋 Leaderboard

Submit your reranking method results to appear on our leaderboard! See SUBMISSION.md for guidelines.

Current standings are available at RankArena.

🤝 Contributing

We welcome contributions! See CONTRIBUTING.md for:

  • Adding new reranking methods
  • Improving evaluation metrics
  • Dataset quality improvements
  • Bug fixes and optimizations

🎈 Citation

If you use FutureQueryEval or our evaluation framework, please cite:

Coming Soon  

📞 Contact


⭐ Star this repo if you find it helpful! ⭐

📧 Questions? Open an issue or contact the authors.
