Text summarization takes a long document and produces a shorter one that fluently and accurately summarizes it.
Input:
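较早进入中国市场的星巴克,是不少小资钟情的品牌。相比在美国的平民形象,星巴克在中国就显得“高端”得多。用料并无差别的一杯中杯美式咖啡,在美国仅约合人民币12元,国内要卖21元,相当于贵了75%。第一财经日报
(English gloss: Starbucks, which entered the Chinese market relatively early, is a favorite brand of many young urban consumers. Compared with its everyday image in the United States, Starbucks comes across as far more "high-end" in China. A medium Americano made from the same ingredients costs only about 12 RMB in the United States but sells for 21 RMB in China, which is 75% more. China Business News)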
Output:
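媒体称星巴克美式咖啡售价中国比美国贵75%。
(English gloss: Media report that a Starbucks Americano costs 75% more in China than in the United States.)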
ROUGE compares an automatically produced summary with human-produced reference summaries. ROUGE-1 measures unigram overlap, ROUGE-2 measures bigram overlap, and ROUGE-L is based on the longest common subsequence. ROUGE can be computed over characters or words.
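The three scores described above can be sketched in a few lines of Python. This is a minimal illustration, not the official ROUGE toolkit: it assumes a single reference summary, reports only the F1 variant (the toolkit also reports precision and recall separately), and works on any token sequence, so passing `list(text)` gives character-level scores. The helper names and the candidate summary in the usage lines are made up for the example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap, cand_total, ref_total):
    """Harmonic mean of precision and recall; 0.0 when nothing overlaps."""
    if overlap == 0 or cand_total == 0 or ref_total == 0:
        return 0.0
    precision = overlap / cand_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate, reference, n):
    """ROUGE-N F1 computed from clipped n-gram overlap."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum((cand & ref).values())  # Counter & keeps the minimum counts
    return f1(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """ROUGE-L F1 computed from the longest common subsequence."""
    return f1(lcs_length(candidate, reference), len(candidate), len(reference))


# Character-level usage: the reference is the example output above,
# the candidate is a hypothetical system summary.
reference = list("媒体称星巴克美式咖啡售价中国比美国贵75%。")
candidate = list("星巴克咖啡中国比美国贵75%")
print(rouge_n(candidate, reference, 1))
print(rouge_n(candidate, reference, 2))
print(rouge_l(candidate, reference))
```

For word-level scores, the same functions can be called on the output of a Chinese word segmenter instead of `list(text)`.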
Implementations
- http://www.berouge.com/Pages/default.aspx
- https://github.com/lancopku/superAE/blob/master/data/script/PythonROUGE.py
- Hu et al. (2015) created LCSTS, a dataset of Weibo summaries posted by media organizations. Short texts are about 100 characters long; summaries are about 20 characters.
- Download instructions: to acquire the corpus, fill in the application form (separate forms are provided for Mainland China and for other regions) and send it to Qingcai Chen or Baotian Hu (http://icrc.hitsz.edu.cn/Article/show/139.html)
Test set | # (text, summary) pairs | # (text, summary) pairs with score >= 3 | Genre |
---|---|---|---|
Part II (validation) | 10,666 | 8,685 | News, politics, economics, military, movies, games, etc. |
Part III (test) | 1,106 | 725 | News, politics, economics, military, movies, games, etc. |
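The second count column keeps only pairs whose score is at least 3. A minimal sketch of that filter follows, assuming the pairs have already been loaded into memory with an integer score attached. The `Pair` type, its field names, and the sample records are hypothetical and do not reflect the corpus's actual file format.

```python
from typing import List, NamedTuple


class Pair(NamedTuple):
    """One (text, summary) pair with its score; field names are hypothetical."""
    text: str
    summary: str
    score: int


def filter_by_score(pairs: List[Pair], min_score: int = 3) -> List[Pair]:
    """Keep only the pairs whose score is at least min_score."""
    return [p for p in pairs if p.score >= min_score]


# Made-up records, used only to show the filter.
sample = [
    Pair("short text A", "summary A", 5),
    Pair("short text B", "summary B", 2),
    Pair("short text C", "summary C", 3),
]
kept = filter_by_score(sample)
print(len(kept), "of", len(sample), "pairs kept")
```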
System | ROUGE-1 | ROUGE-2 | ROUGE-L |
---|---|---|---|
Duan et al. (2019) | 44.35 | 30.65 | 40.58 |
Wang et al. (2018) | 39.9 | 21.5 | 37.9 |
Lin et al. (2018) | 39.4 | 26.9 | 36.5 |
Ma et al. (2018) | 39.2 | 26.0 | 36.2 |
Wei et al. (2018) | 36.2 | 24.3 | 33.8 |
Seq2Seq (baseline) | 32.1 | 19.9 | 29.2 |
Train set | # (text, summary) pairs | Genre |
---|---|---|
Part I | 2,400,591 | News |
- List of Chinese text summarization papers
- MAT-INF dataset (ACL 2020): a multi-task dataset that includes a summarization corpus of 1M documents. The authors report improved summarization quality after training on the other tasks.
- The recent work "Abstractive Text Summarization by Incorporating Reader Comments" (AAAI 2019) does not use the LCSTS dataset; it uses the authors' own Sina Weibo dataset.
- Toolkit for abstractive summarization: https://github.com/lancopku/LancoSum