
# Investigating Crowdsourcing Protocols for Evaluating the Factual Consistency of Summaries

This repository contains the code, data, and templates for the crowdsourcing protocols described in the paper *Investigating Crowdsourcing Protocols for Evaluating the Factual Consistency of Summaries*.

## Scripts

- `calculate.ipynb`: computes the score distribution, Krippendorff's alpha reliability, and SHR reliability.
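The notebook itself is not reproduced here; as an illustration of the Krippendorff reliability computation it performs, a minimal sketch of Krippendorff's alpha for nominal ratings (a hypothetical standalone helper, not the repository's code) might look like:

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal-level ratings.

    units: list of lists, where each inner list holds the ratings one
    item received (missing ratings are simply omitted). Units with
    fewer than two ratings are ignored, per the standard definition.
    """
    units = [u for u in units if len(u) >= 2]
    if not units:
        return 1.0  # no pairable ratings, so no observed disagreement

    # Build the coincidence matrix o[(c, k)]: each ordered pair of
    # ratings within a unit contributes 1 / (m - 1).
    o = Counter()
    for u in units:
        m = len(u)
        for i, j in permutations(range(m), 2):
            o[(u[i], u[j])] += 1.0 / (m - 1)

    # Category marginals and total number of pairable ratings.
    n_c = Counter()
    for (c, _k), v in o.items():
        n_c[c] += v
    n = sum(n_c.values())

    # Observed vs. expected disagreement (nominal delta: 0 if c == k, else 1).
    d_o = sum(v for (c, k), v in o.items() if c != k) / n
    d_e = sum(n_c[c] * n_c[k] for c in n_c for k in n_c if c != k) / (n * (n - 1))
    if d_e == 0:
        return 1.0  # only one category observed: trivially perfect agreement
    return 1.0 - d_o / d_e
```

For example, two raters agreeing on every item yields an alpha of 1.0, while agreement on only half the items (with balanced marginals) can drop alpha to 0.0, i.e. chance-level agreement.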

## Data

We release our evaluation templates and annotations to promote future work on factual consistency evaluation. The repository provides annotations for the CNN/DM data, annotations for the XSUM data, and the evaluation templates.

## Model

The code for BART, ProphetNet, PEGASUS, and BERTSUM is based on Fairseq(-py). Our pretrained models are available for both the CNN/DM and XSUM data.

## Citation

If you use our code in your research, please cite our work:

@inproceedings{tang2022investigating,
   title={Investigating Crowdsourcing Protocols for Evaluating the Factual Consistency of Summaries},
   author={Tang, Xiangru and Fabbri, Alexander R and Mao, Ziming and Adams, Griffin and Wang, Borui and Li, Haoran and Mehdad, Yashar and Radev, Dragomir},
   booktitle={Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)},
   year={2022}
}