Dataset-related questions #42

Closed
liuchunming2 opened this issue Dec 30, 2024 · 3 comments

liuchunming2 commented Dec 30, 2024

Do you have the original PDF files for the training and test sets you provide? If so, could you share them?

@liuchunming2
Author

Where is your test set? I found openbmb/VisRAG-Ret-Test-ArxivQA on Hugging Face, but why is it in Parquet format? You provided the following code:
from datasets import load_dataset
import csv

def load_beir_qrels(qrels_file):
    qrels = {}
    with open(qrels_file) as f:
        tsvreader = csv.DictReader(f, delimiter="\t")
        for row in tsvreader:
            qid = row["query-id"]
            pid = row["corpus-id"]
            rel = int(row["score"])
            if qid in qrels:
                qrels[qid][pid] = rel
            else:
                qrels[qid] = {pid: rel}
    return qrels

corpus_ds = load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", name="corpus", split="train")
queries_ds = load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", name="queries", split="train")

qrels_path = "xxxx" # path to qrels file which can be found under qrels folder in the repo.
qrels = load_beir_qrels(qrels_path)
Does this code download the dataset? Please explain it in detail, thanks.

Collaborator

Yu-Shi commented Dec 30, 2024

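Each test set has three parts: corpus (the documents), queries, and qrels, i.e. the relevance judgments between queries and documents. The example code gives access to all three parts.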
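
To make the relationship between the three parts concrete, here is a minimal sketch using toy in-memory data rather than the real dataset. The row values, the `image` and `query` field names, and the `relevant_docs` helper are assumptions for illustration; only the `query-id`, `corpus-id`, and `score` column names come from the code in this thread. Unlike the original `load_beir_qrels`, this variant takes an open file-like object instead of a path, so it can run without a qrels file on disk.

```python
import csv
import io


def load_beir_qrels(lines):
    """Parse BEIR-style qrels: a TSV with query-id, corpus-id, score columns."""
    qrels = {}
    for row in csv.DictReader(lines, delimiter="\t"):
        qrels.setdefault(row["query-id"], {})[row["corpus-id"]] = int(row["score"])
    return qrels


def relevant_docs(query_id, qrels, corpus_by_id):
    """Return the corpus rows judged relevant (score > 0) for one query."""
    judged = qrels.get(query_id, {})
    return [corpus_by_id[pid] for pid, score in judged.items() if score > 0]


# Toy stand-ins (hypothetical values) for the corpus and queries splits.
corpus = [
    {"corpus-id": "d1", "image": "<page image 1>"},
    {"corpus-id": "d2", "image": "<page image 2>"},
    {"corpus-id": "d3", "image": "<page image 3>"},
]
queries = [
    {"query-id": "q1", "query": "example question 1"},
    {"query-id": "q2", "query": "example question 2"},
]
# Toy qrels content in the same TSV layout the thread's parser expects.
qrels_tsv = "query-id\tcorpus-id\tscore\nq1\td1\t1\nq1\td2\t0\nq2\td3\t1\n"

# Index the corpus by id so qrels entries can be joined to documents.
corpus_by_id = {row["corpus-id"]: row for row in corpus}
qrels = load_beir_qrels(io.StringIO(qrels_tsv))

for q in queries:
    hits = relevant_docs(q["query-id"], qrels, corpus_by_id)
    print(q["query"], "->", [d["corpus-id"] for d in hits])
```

With the real data, `corpus` and `queries` would be the objects returned by the two `load_dataset` calls above, and the qrels file would be opened from the path in the repo. On the Parquet question: as far as I know, `load_dataset` downloads and reads the Parquet files from the Hub itself, so the storage format should not need separate handling.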

Collaborator

tcy6 commented Jan 11, 2025

> Do you have the original PDF files for the training and test sets you provide? If so, could you share them?

We're very sorry, but we did not keep the original PDF files when generating the training and test data.

tcy6 closed this as completed Jan 11, 2025