
Investigate/Fix ClimateFever #1498

Open
Muennighoff opened this issue Nov 25, 2024 · 2 comments
Labels: help wanted

Comments

@Muennighoff
Contributor

From @jhyuklee:

Jay found out that the TFDS version of Climate-FEVER (https://www.tensorflow.org/datasets/community_catalog/huggingface/climate_fever) does not match the one uploaded for MTEB (https://huggingface.co/datasets/mteb/climate-fever/tree/main).

Specifically, the TFDS version indexes specific portions of the wiki articles (and in some cases two different parts of the same article are linked to the same query id), whereas MTEB/BEIR just takes the wiki article as a whole. More importantly, the corpus text for the articles does not necessarily contain the text of the original target sentences/passages/subsections; instead it is just the first x chars/tokens or so.
Also it is worth noting that all of the qrels are scored as 1 in the MTEB version regardless of original rater annotations.

Since MTEB derived its preprocessing from BEIR, we are guessing that the discrepancy has started from BEIR.
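
A quick way to verify the flat qrels scoring from the MTEB side (a sketch, assuming mteb.get_task() and the AbsTaskRetrieval layout relevant_docs[split][query_id][doc_id] = score; attribute names may differ across mteb versions):

import mteb

task = mteb.get_task("ClimateFEVER")
task.load_data()

# Collect every qrel score across all splits, queries, and documents.
scores = {
    score
    for split_qrels in task.relevant_docs.values()
    for doc_scores in split_qrels.values()
    for score in doc_scores.values()
}
print(scores)  # should print {1} if the report above is correct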

I think it would be great to investigate this, and if it is indeed an issue, create an updated version of the task to supersede it, similar to Touche2020v3:

class Touche2020v3Retrieval(AbsTaskRetrieval):
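
For reference, a minimal sketch of what a superseding task could look like, following the Touche2020v3Retrieval pattern (the class name and dataset path/revision below are placeholders, and the TaskMetadata fields are abbreviated; a real definition would fill in all required metadata fields):

from mteb.abstasks.AbsTaskRetrieval import AbsTaskRetrieval


class ClimateFEVERRetrievalv2(AbsTaskRetrieval):
    # metadata = TaskMetadata(
    #     name="ClimateFEVER.v2",                  # hypothetical task name
    #     dataset={
    #         "path": "mteb/climate-fever-v2",     # placeholder repo, does not exist yet
    #         "revision": "<pin once uploaded>",
    #     },
    #     type="Retrieval",
    #     eval_splits=["test"],
    #     eval_langs=["eng-Latn"],
    #     main_score="ndcg_at_10",
    #     ...  # remaining required TaskMetadata fields omitted in this sketch
    # )
    pass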

@isaac-chung added the help wanted label on Dec 24, 2024
@Muennighoff
Contributor Author

Below is the reply from Nandan:

Corpus for Climate-FEVER and differences from the TFDS dataset

The Climate-FEVER corpus was constructed using the same Wikipedia 2017 checkpoint as FEVER (v1.0, if I remember correctly), which contains the introductory section (equivalent to an abstract) for each Wikipedia title; this was used for constructing the BEIR task, as referenced in the BEIR paper (Appendix, Section D.9 on Fact Checking), and we manually added 25 extra documents to avoid missing out on claims. The retrieval task was chosen to be a similar setting to FEVER: the BEIR version focused on relevance (retrieve Wikipedia passages) using the introductory text of the Wikipedia article (the same setting as in FEVER).

"Also it is worth noting that all of the qrels are scored as 1 in the MTEB version regardless of original rater annotations."
The reason is that, as our initial design choice, we considered Wikipedia articles whose titles were labelled as FULLY_SUPPORTED for the input claim to be relevant for that claim. We did not want to treat documents with NOT_ENOUGH_INFO as partially relevant or not relevant, as that would create ambiguity for the retrieval task.

"Specifically, the TFDS version indexes specific portions of the wiki articles."
The TFDS dataset contains the metadata used for Wikipedia fact verification, intended for a newer version of the Climate-FEVER dataset. We can look into constructing the task to be more realistic by taking the whole Wikipedia article as the corpus, chunking it, and considering overlap with the text that annotators found (see the sketch below).
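
A rough sketch of that chunking-plus-overlap idea, assuming we have the full article text from the 2017 dump and the annotated evidence sentences (exact substring matching is only a first approximation and would miss sentences split across chunk boundaries):

def chunk_article(text: str, chunk_size: int = 100, overlap: int = 20) -> list[str]:
    # Split an article into overlapping word windows.
    words = text.split()
    step = chunk_size - overlap
    return [
        " ".join(words[i : i + chunk_size])
        for i in range(0, max(len(words) - overlap, 1), step)
    ]


def chunk_is_relevant(chunk: str, evidence_sentences: list[str]) -> bool:
    # A chunk counts as relevant if it contains one of the annotated evidence sentences.
    return any(ev.strip().lower() in chunk.lower() for ev in evidence_sentences)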

I can't remember what their original corpus for Climate-FEVER was (http://climatefever.ai/ no longer loads for me), but we would need to use that old version of the Wikipedia corpus.

and a follow-up from @jhyuklee:

Thanks Nandan for the detailed information!

I think that if the Climate-FEVER part of BEIR (and hence MTEB) contained the original labeled snippet of the article, the data quality would become much better.
What we've found is that many positively labeled articles do not have enough information in their introductory section (not sure about the 25 added ones and how they are labeled, though).

Seems worth investigating and fixing to me! Probably adding a ClimateFEVERCleaned or .v2 task would be great!

@mina-parham

I was looking into the Climate-FEVER dataset on MTEB and TFDS and found two main issues:

  1. Corpus Difference: MTEB includes a corpus that does not exist in the original Climate-FEVER dataset.

  2. Qrels Issue: All qrels in MTEB have a score of 1, while the original dataset uses SUPPORTS, REFUTES, and NOT_ENOUGH_INFO labels.

The MTEB corpus seems to be sourced from Wikipedia 2017 (like the FEVER dataset). I tried to rebuild the qrels to reflect the original labels but ran into two problems, which I explain in the following:

  • Corpus-id Ambiguity: The corpus-ids in the corpus and in the qrels do not match, so if you search for a corpus-id like Global_warming, you'll find many results, making it unclear which entry corresponds to which.

To make this clearer, let's look at the Climate-FEVER dataset, choose one data point, and check the queries, Qrels, and corpus for it.

Example from the Original Dataset
claim_id | claim | claim_label | evidences
0 | Global warming is driving polar bears toward extinction | SUPPORTS | [{'evidence_id': 'Extinction risk from global warming:170', ...}]
1 | The sun has gone into 'lockdown' | SUPPORTS | [{'evidence_id': 'Famine:386', ...}]
2 | The polar bear population has been growing. | REFUTES | [{'evidence_id': 'Polar bear:1332', ...}]

Also, here is an example from the evidences column:

[{'evidence_id': 'Extinction risk from global warming:170',
  'evidence_label': 'NOT_ENOUGH_INFO',
  'article': 'Extinction risk from global warming',
  'evidence': '"Recent Research Shows Human Activity Driving Earth Towards Global Extinction Event".',
  'entropy': 0.6931471805599451,
  'votes': ['SUPPORTS', 'NOT_ENOUGH_INFO', None, None, None]},
 {'evidence_id': 'Global warming:14',
  'evidence_label': 'SUPPORTS',
  'article': 'Global warming',
  'evidence': 'Environmental impacts include the extinction or relocation of many species as their ecosystems change, most immediately the environments of coral reefs, mountains, and the Arctic.',
  'entropy': 0.0,
  'votes': ['SUPPORTS', 'SUPPORTS', None, None, None]},
 {'evidence_id': 'Global warming:178',
  'evidence_label': 'NOT_ENOUGH_INFO',
  'article': 'Global warming',
  'evidence': 'Rising temperatures push bees to their physiological limits, and could cause the extinction of bee populations.',
  'entropy': 0.6931471805599451,
  'votes': ['SUPPORTS', 'NOT_ENOUGH_INFO', None, None, None]},
 {'evidence_id': 'Habitat destruction:61',
  'evidence_label': 'SUPPORTS',
  'article': 'Habitat destruction',
  'evidence': 'Rising global temperatures, caused by the greenhouse effect, contribute to habitat destruction, endangering various species, such as the polar bear.',
  'entropy': 0.0,
  'votes': ['SUPPORTS', 'SUPPORTS', None, None, None]},
 {'evidence_id': 'Polar bear:1328',
  'evidence_label': 'NOT_ENOUGH_INFO',
  'article': 'Polar bear',
  'evidence': '"Bear hunting caught in global warming debate".',
  'entropy': 0.6931471805599451,
  'votes': ['SUPPORTS', 'NOT_ENOUGH_INFO', None, None, None]}]
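
(To reproduce the example above: the original dataset is still loadable from the Hub under the climate_fever id, assuming that id has not moved; note that claim_label and evidence_label may come back as ClassLabel indices rather than the strings shown here.)

from datasets import load_dataset

ds = load_dataset("climate_fever", split="test")  # the original release only has a test split
row = ds[0]
print(row["claim_id"], row["claim"], row["claim_label"])
for ev in row["evidences"]:
    print(ev["evidence_id"], ev["evidence_label"], ev["evidence"][:80])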
Queries and Qrels

For queries, it's quite straightforward: we use claim_id as _id and the claim column as the text. For example, the following is one of the entries from the queries.

Example:

{"_id": "0", "text": "Global warming is driving polar bears toward extinction"}

For query _id 0, the related qrels in MTEB are:

{"query-id": "0", "corpus-id": "Habitat_destruction", "score": 1}
{"query-id": "0", "corpus-id": "polar_bear", "score": 1}
{"query-id": "0", "corpus-id": "Extinction_risk_from_global_warming", "score": 1}
{"query-id": "0", "corpus-id": "Global_warming", "score": 1}

I'm assuming that the corpus-id is obtained from the article key in the related evidence dictionary.
The first problem is that if we use a corpus-id, such as "Global_warming," and look up the related corpus, this is what we get:

https://huggingface.co/datasets/mteb/climate-fever/viewer/corpus/corpus?q=Global_warming

This search returns many results, making it unclear which corpus entry corresponds to this corpus-id.
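
An exact lookup on the _id field avoids the full-text-search ambiguity of the viewer; here is a sketch, assuming the config/split names implied by the viewer URL above (config "corpus", split "corpus"):

from datasets import load_dataset

corpus = load_dataset("mteb/climate-fever", "corpus", split="corpus")

exact = [d for d in corpus if d["_id"] == "Global_warming"]
fuzzy = [d for d in corpus if "global warming" in d["title"].lower()]
print(len(exact), "exact _id match(es);", len(fuzzy), "documents whose title mentions 'global warming'")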

The second problem is that, for example, if you look at the evidence list for "Global_warming," you'll see two elements: one labeled SUPPORTS and the other NOT_ENOUGH_INFO. So, with this corpus and this type of corpus-id, even if I can find the related text for "Global_warming," I cannot assign two different scores to the same corpus document, unless we find a way to chunk it and update the labels accordingly.

 {'evidence_id': 'Global warming:14',
  'evidence_label': 'SUPPORTS',
  'article': 'Global warming',
  'evidence': 'Environmental impacts include the extinction or relocation of many species as their ecosystems change, most immediately the environments of coral reefs, mountains, and the Arctic.',
  'entropy': 0.0,
  'votes': ['SUPPORTS', 'SUPPORTS', None, None, None]},
 {'evidence_id': 'Global warming:178',
  'evidence_label': 'NOT_ENOUGH_INFO',
  'article': 'Global warming',
  'evidence': 'Rising temperatures push bees to their physiological limits, and could cause the extinction of bee populations.',
  'entropy': 0.6931471805599451,
  'votes': ['SUPPORTS', 'NOT_ENOUGH_INFO', None, None, None]},

Based on what I see, it's difficult for me to figure out how to link the Wikipedia corpus with Climate-FEVER. I was thinking that maybe we could use the "evidence" key in the evidence dictionary as the corpus, but I'm not sure. Could you help me with this and share your thoughts?
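
In case it helps the discussion, here is a rough sketch of that direction: using the evidence sentences themselves as chunk-level corpus entries, keyed by evidence_id so each one can carry its own score (this assumes the labels come back as the strings shown above, and the SUPPORTS=2 / REFUTES=1 grading is only one possible choice):

from datasets import load_dataset

LABEL_TO_SCORE = {"SUPPORTS": 2, "REFUTES": 1, "NOT_ENOUGH_INFO": 0}  # one possible grading

ds = load_dataset("climate_fever", split="test")

corpus, queries, qrels = {}, {}, {}
for row in ds:
    qid = str(row["claim_id"])
    queries[qid] = row["claim"]
    for ev in row["evidences"]:
        doc_id = ev["evidence_id"].replace(" ", "_")  # e.g. "Global_warming:14" as a chunk-level corpus-id
        corpus[doc_id] = {"title": ev["article"], "text": ev["evidence"]}
        score = LABEL_TO_SCORE.get(ev["evidence_label"], 0)
        if score > 0:
            qrels.setdefault(qid, {})[doc_id] = score

One caveat: this corpus would contain only the annotated sentences, so it would likely need to be merged with chunks of the full articles (as suggested above) to keep retrieval non-trivial.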
