Investigate/Fix ClimateFever #1498
Below is the reply from Nandan:
and a follow-up from @jhyuklee:
Seems worth investigating and fixing to me! Probably adding a ClimateFEVERCleaned or .v2 task would be great!
I was looking into the Climate-Fever dataset on MTEB and TFDS and found two main issues:
The MTEB corpus seems to be sourced from Wikipedia 2017 (like the FEVER dataset). I tried to rebuild the qrels to reflect the original labels but ran into two problems, which I explain below.
To make this clearer, let's look at the Climate-FEVER dataset, choose one data point, and check the queries, qrels, and corpus for it.
Example from the Original Dataset
And an example from the evidence column:
Queries and Qrels
For queries, it's quite straightforward: we use the claim_id as _id and the claim column as the query text. For example, the following is one of the entries from the queries:
For query _id 0, the related qrels in MTEB are:
I'm assuming that the corpus-id is obtained from the article key in the related evidence dictionary. https://huggingface.co/datasets/mteb/climate-fever/viewer/corpus/corpus?q=Global_warming The first problem is that searching the corpus for this key can return many results, making it unclear which corpus entry is related to this claim.
The second problem is that, for example, if you look at the evidence list for "Global_warming," you'll see two elements: one is labeled as SUPPORTS and the other as NOT_ENOUGH_INFO. So, with this corpus and this type of corpus-id, even if I can find the related text for "Global_warming," I cannot assign two different scores to the same corpus entry, unless we find a way to chunk it and update the labels accordingly.
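To make the label conflict concrete, here is a minimal sketch that groups the evidence entries of one claim by their article key and reports articles carrying more than one label. The record is a hypothetical placeholder, not real data, and the field names `evidences` and `evidence_label` are assumptions about the dataset schema; only `claim_id`, `claim`, `article`, and `evidence` are named in this thread.

```python
from collections import defaultdict

# Hypothetical record shaped like a Climate-FEVER row; all texts are placeholders.
# The field names "evidences" and "evidence_label" are assumptions.
record = {
    "claim_id": "0",
    "claim": "placeholder claim text",
    "evidences": [
        {"article": "Global_warming", "evidence": "placeholder sentence A",
         "evidence_label": "SUPPORTS"},
        {"article": "Global_warming", "evidence": "placeholder sentence B",
         "evidence_label": "NOT_ENOUGH_INFO"},
        {"article": "Other_article", "evidence": "placeholder sentence C",
         "evidence_label": "SUPPORTS"},
    ],
}


def conflicting_articles(row):
    """Return {article: sorted labels} for articles with more than one label."""
    labels_by_article = defaultdict(set)
    for ev in row["evidences"]:
        labels_by_article[ev["article"]].add(ev["evidence_label"])
    return {
        article: sorted(labels)
        for article, labels in labels_by_article.items()
        if len(labels) > 1
    }


conflicts = conflicting_articles(record)
print(conflicts)
```

Any article that shows up in `conflicts` cannot receive a single article-level relevance score without discarding one of the original labels.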
Based on what I see, it's difficult for me to figure out how to link the Wikipedia corpus with Climate-FEVER. I was thinking that maybe we could use the "evidence" key in the evidence dictionary as the corpus text, but I'm not sure. Could you help me with this and share your thoughts?
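One way the "evidence as corpus" idea could look is sketched below: each evidence sentence becomes its own corpus entry, so every entry carries exactly one label. This is only a sketch under stated assumptions, not how the existing MTEB task is built. The `evidences` and `evidence_label` field names, the `article::claim_id::index` id scheme, and the choice to treat SUPPORTS and REFUTES as relevant are all hypothetical.

```python
# Hypothetical rows shaped like Climate-FEVER records; all texts are placeholders.
rows = [
    {
        "claim_id": "0",
        "claim": "placeholder claim text",
        "evidences": [
            {"article": "Global_warming", "evidence": "placeholder sentence A",
             "evidence_label": "SUPPORTS"},
            {"article": "Global_warming", "evidence": "placeholder sentence B",
             "evidence_label": "NOT_ENOUGH_INFO"},
        ],
    },
]


def build_sentence_level(rows, relevant_labels=("SUPPORTS", "REFUTES")):
    """Build queries, corpus, and qrels with one corpus entry per evidence sentence."""
    queries, corpus, qrels = {}, {}, {}
    for row in rows:
        qid = str(row["claim_id"])
        queries[qid] = row["claim"]
        qrels[qid] = {}
        for i, ev in enumerate(row["evidences"]):
            # Assumed id scheme: article title, claim id, and position in the list.
            doc_id = f"{ev['article']}::{qid}::{i}"
            corpus[doc_id] = {"title": ev["article"], "text": ev["evidence"]}
            # Assumption: SUPPORTS/REFUTES count as relevant, everything else does not.
            qrels[qid][doc_id] = 1 if ev["evidence_label"] in relevant_labels else 0
    return queries, corpus, qrels


queries, corpus, qrels = build_sentence_level(rows)
print(qrels)
```

With this id scheme the same sentence is stored once per claim that cites it, so a real implementation would probably need to deduplicate identical sentences across claims.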
From @jhyuklee:
I think it would be great to investigate this and, if it is indeed an issue, create an updated version of the task to supersede it, similar to Touchev3
mteb/mteb/tasks/Retrieval/eng/Touche2020Retrieval.py
Line 54 in 3ff38ec