
# PopQA benchmark

The PopQA dataset is a large-scale open-domain question answering (QA) dataset consisting of 14k entity-centric QA pairs. Each question is created by converting a knowledge tuple retrieved from Wikidata using a template. Each question comes with the original `subject_entity`, `object_entity`, and `relationship_type` annotations, as well as Wikipedia monthly page views.
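For concreteness, the snippet below shows one way to inspect these annotations. It is a minimal sketch that assumes the copy of PopQA published on the Hugging Face Hub as `akariasai/PopQA` and its column names (`question`, `subj`, `prop`, `obj`, `s_pop`); other releases may name the fields differently.

```python
# Minimal sketch: inspecting PopQA annotations via the Hugging Face
# `datasets` library. Assumes the hub dataset "akariasai/PopQA" and its
# column names; adjust if your copy of the dataset differs.
from datasets import load_dataset

popqa = load_dataset("akariasai/PopQA", split="test")
example = popqa[0]

print(example["question"])                                # templated question text
print(example["subj"], example["prop"], example["obj"])   # Wikidata knowledge tuple
print(example["s_pop"])                                   # subject's Wikipedia monthly page views
```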

## Performance

### 1. Leaderboard from SOTA

| Paper | Year | Model | Model Details | NDCG@10 | Recall@5 | acc |
| --- | --- | --- | --- | --- | --- | --- |
| INSTRUCTRAG: Instructing Retrieval-Augmented Generation via Self-Synthesized Rationales | 2024 | InstructRAG | R: Contriever, G: Llama3-Ins-8B (FT) | - | 68.7 | 66.2 |
| | | InstructRAG | R: Contriever, G: Llama3-Ins-8B (ICL) | - | 68.7 | 64.2 |
| | | NaiveRAG | R: Contriever, G: ChatGPT | - | 68.7 | 50.8 |
| | | NaiveRAG | R: Contriever, G: Llama3-Ins-8B | - | 68.7 | 62.3 |
| ACTIVERAG: Autonomously Knowledge Assimilation and Accommodation through Retrieval-Augmented Agents (evaluates on a 500-question sample) | 2024 | ActiveRAG | R: DPR, G: GPT-4o-mini | - | - | 70.8 |
| | | ActiveRAG | R: DPR, G: Llama-3-Ins-70B | - | - | 69.8 |
| | | ActiveRAG | R: DPR, G: Llama-3-Ins-8B | - | - | 65.8 |
| | | Baseline1 | R: ❌, G: Llama3-Ins-8B | - | - | 24.2 |
| | | Baseline2 | R: ❌, G: Llama3-Ins-70B | - | - | 34.2 |
| SELF-RAG: Learning To Retrieve, Generate, and Critique Through Self-Reflection | 2023 | Self-RAG | R: Contriever, G: Llama2-13B | - | - | 55.8 |
| | | Self-RAG | R: Contriever, G: Llama2-7B | - | - | 54.9 |
| | | Baseline1 | R: ❌, G: Llama2-7B | - | - | 14.7 |
| | | Baseline2 | R: ❌, G: Llama2-13B | - | - | 14.7 |
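The papers above do not all share one scoring script, but `acc` on PopQA is commonly computed as substring match: a prediction counts as correct if any gold answer string appears in the generation, while Recall@5 checks whether a gold answer appears in the top-5 retrieved passages. The sketch below illustrates both under that assumption; the function names and normalization are illustrative, not taken from any of the cited codebases.

```python
# Hedged sketch of the two metrics in the leaderboard above.
# Assumption: "acc" is a case-insensitive substring match against the gold
# answer aliases, and Recall@5 checks the top-5 retrieved passages;
# individual papers may normalize text differently.
def accuracy(predictions, gold_answer_lists):
    hits = sum(
        any(g.lower() in pred.lower() for g in golds)
        for pred, golds in zip(predictions, gold_answer_lists)
    )
    return hits / len(predictions)

def recall_at_k(retrieved_passages, golds, k=5):
    topk_text = " ".join(retrieved_passages[:k]).lower()
    return any(g.lower() in topk_text for g in golds)

# Example: one question whose gold aliases are ["Paris", "paris, France"].
print(accuracy(["The capital is Paris."], [["Paris", "paris, France"]]))  # 1.0
```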

### 2. LLM-based Methods (Reproducible)