Code for the arXiv paper: "LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond"


# Factual Consistency in Summarization

Can you tell which edits of summaries are consistent, and which are inconsistent?

## SummEdits Benchmark

Here is the updated benchmark with the latest LLMs (Gemini-pro added on 12/14/2023):

| Model Name | Podcast | BillSum | SamSum | News | Sales Call | Sales Email | Shakespeare | SciTLDR | QMSum | ECTSum | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama2-7b | 50 | 50 | 50 | 50.6 | 50.9 | 50 | 50 | 50 | 50.7 | 51.4 | 50.4 |
| Dav001 | 53.3 | 50.2 | 51 | 54.4 | 55.5 | 52.5 | 50 | 51 | 50.1 | 50.9 | 51.9 |
| DAE | 54.4 | 55.1 | 58.7 | 60.9 | 50.4 | 53.6 | 53.6 | 54.7 | 52 | 58.3 | 55.2 |
| Cohere-cmd-xl | 51.1 | 52.7 | 51.3 | 52.6 | 60.2 | 59.4 | 50 | 60.5 | 54.5 | 60.5 | 55.3 |
| Vicuna-13b | 52.8 | 52.5 | 51.3 | 63.5 | 57.9 | 51.8 | 55.4 | 59.7 | 54 | 62.4 | 56.1 |
| SummaCConv | 58.1 | 55.2 | 53.1 | 61.9 | 59 | 53.7 | 59.3 | 59.7 | 53.5 | 57.9 | 57.1 |
| Mistral-7b | 50 | 55.5 | 56.7 | 59.8 | 63.4 | 59.7 | 53.5 | 59.6 | 55.9 | 63.7 | 57.8 |
| Llama2-13b | 51.3 | 54.6 | 57.2 | 59.3 | 63.1 | 58.1 | 58.6 | 63.4 | 56.5 | 61.4 | 58.4 |
| Claudev13 | 60.4 | 51.9 | 64.5 | 63.4 | 61.3 | 57 | 58.1 | 57.8 | 56.9 | 68.1 | 59.9 |
| Dav002 | 56.4 | 53.9 | 57.1 | 61.9 | 65.1 | 59.1 | 56.6 | 64.6 | 60.6 | 66.2 | 60.1 |
| Bard | 50 | 58.1 | 61.3 | 71.6 | 73.3 | 70.6 | 58.7 | 66 | 53.9 | 72.7 | 63.6 |
| QAFactEval | 63.7 | 54.2 | 66.2 | 74.4 | 68.4 | 63.6 | 61.6 | 67.5 | 62.4 | 72.6 | 65.5 |
| PaLM-bison | 66 | 62 | 69 | 68.4 | 74.4 | 68.1 | 61.6 | 78.1 | 70.4 | 72.4 | 69 |
| Dav003 | 65.7 | 59.9 | 67.6 | 71 | 78.8 | 69.2 | 69.7 | 74.4 | 72.2 | 77.8 | 70.6 |
| CGPT | 68.4 | 63.6 | 69.1 | 74.4 | 79.4 | 65.5 | 68 | 75.6 | 69.2 | 78.6 | 71.2 |
| Claudev2 | 68.7 | 61.7 | 75.4 | 75.5 | 81 | 67.4 | 74 | 78.1 | 74.8 | 79.2 | 73.6 |
| Claudev21 | 72.6 | 66 | 75.7 | 77.2 | 82 | 68.5 | 73.2 | 78.6 | 72.7 | 77.1 | 74.4 |
| Gemini-pro | 73.7 | 60.2 | 75.7 | 77.6 | 86.9 | 74.2 | 71.9 | 77.6 | 74 | 83.1 | 75.5 |
| GPT4 | 82.7 | 71.1 | 83.1 | 83.3 | 87.9 | 79.5 | 84 | 82.4 | 79.6 | 87 | 82.1 |
| Human Perf. | 90.8 | 87.5 | 89.4 | 90 | 91.8 | 87.4 | 96.9 | 89.3 | 90.7 | 95.4 | 90.9 |
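
As a rough illustration of how a per-domain score of this kind (balanced accuracy over consistent/inconsistent predictions, in %) could be computed for a new detector, here is a minimal sketch. The `predict_fn` callable and the sample field names (`"label"`, `"doc"`, `"summary"`) are assumptions for illustration, not the repository's actual API.

```python
from sklearn.metrics import balanced_accuracy_score

def domain_score(samples, predict_fn):
    """Balanced accuracy (in %) of a detector on one SummEdits domain.

    Assumes each sample has "label", "doc", and "summary" fields and that
    predict_fn(doc, summary) returns "consistent" or "inconsistent";
    adapt these names to the released data schema.
    """
    gold = [1 if s["label"] == "consistent" else 0 for s in samples]
    pred = [1 if predict_fn(s["doc"], s["summary"]) == "consistent" else 0
            for s in samples]
    return 100 * balanced_accuracy_score(gold, pred)
```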

## SummEdits Benchmark Release (Sections 6-7)

We release the data for the 10 domains in the SummEdits benchmark in the `data/summedits` folder.

The `SummEdits_Benchmark.ipynb` notebook shows how to access, open, and visualize the dataset.
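
For quick programmatic access outside the notebook, a sketch along these lines can load the released files. The assumption that each domain is a standalone `*.json` file containing a list of samples is illustrative; check `data/summedits` for the actual layout.

```python
import json
from pathlib import Path

# Load every SummEdits domain file released under data/summedits.
# Assumption: one JSON file per domain, each holding a list of samples.
data_dir = Path("data/summedits")
benchmark = {}
for path in sorted(data_dir.glob("*.json")):
    with path.open() as f:
        benchmark[path.stem] = json.load(f)

for domain, samples in benchmark.items():
    print(f"{domain}: {len(samples)} samples")
```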

## FactCC Explanation Analysis (Section 3.5)

As part of the paper, we annotated 3.6k explanations generated by models justifying their choice to identify a summary as inconsistent. The annotations are available in `data/factcc/factcc_explanation_annotation.json`. The `FactCC_Explanation_Annotation.ipynb` notebook shows how to load and view the annotations.
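
A quick way to peek at the annotation file outside the notebook, assuming it parses as standard JSON (no particular entry schema is assumed beyond that):

```python
import json

# Inspect the FactCC explanation annotations released with the paper.
with open("data/factcc/factcc_explanation_annotation.json") as f:
    annotations = json.load(f)

print(type(annotations), len(annotations))
# Print one entry to discover the schema, whether the top level is a list or a dict.
first = annotations[0] if isinstance(annotations, list) else next(iter(annotations.items()))
print(first)
```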

## Prompts

We release all prompts that were used in the experiments of the paper in the `prompts/` folder.
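
For illustration, a prompt template can be loaded and filled in along the lines below. The file name `factual_consistency_prompt.txt` and the `{document}`/`{summary}` placeholders are hypothetical; substitute the actual file and placeholder names found in `prompts/`.

```python
from pathlib import Path

# Hypothetical usage sketch: read one of the released prompt templates and
# substitute a (document, summary) pair. File name and placeholders below
# are assumptions for illustration only.
template = Path("prompts/factual_consistency_prompt.txt").read_text()
prompt = template.format(
    document="The company reported revenue of $2M in Q3.",
    summary="The company reported revenue of $3M in Q3.",
)
print(prompt)
```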
