GitHub - longvtran/nyt-corpus: Code to obtain The New York Times Annotated Corpus (non-anonymized) for summarization.

Code to obtain The New York Times Annotated Corpus (LDC2008T19) for summarization. Note: To obtain the corpus, refer to https://catalog.ldc.upenn.edu.

Citation

This code relies on original scripts written by:

Junyi Jessy Li, Kapil Thadani and Amanda Stent. The Role of Discourse Units in Near-Extractive Summarization. In Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL). 2016.

(Li et al.'s script is available on GitHub at: https://github.com/grimpil/nyt-summ)

and partly borrows from scripts on https://github.com/abisee/cnn-dailymail.

The TextRank implementation comes from https://github.com/davidadamojr/TextRank.

Installation

This script requires NLTK, which can be installed with pip:

$ pip3 install nltk

Overview

The overall flow of the script is as follows:

Read the compressed NYT Corpus from disk, preprocess it (notably with TextRank), and write the results to .story files
Create chunked data files in binary format, and split the corpus into train, val, and test sets
Additionally, a vocab file is created
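
The splitting, chunking, and vocab steps above can be illustrated with a minimal sketch. This is not the repository's actual code: the function names (split_by_lists, chunk, build_vocab), the whitespace tokenization, and the toy data are all assumptions made for illustration, and the real scripts write binary files rather than keeping everything in memory.

```python
from collections import Counter


def split_by_lists(story_ids, split_lists):
    """Assign each story id to the split whose id list contains it.

    `split_lists` maps a split name (e.g. "train") to a list of ids.
    Ids that appear in no list are skipped. (Hypothetical helper.)
    """
    lookup = {sid: name for name, ids in split_lists.items() for sid in ids}
    splits = {name: [] for name in split_lists}
    for sid in story_ids:
        name = lookup.get(sid)
        if name is not None:
            splits[name].append(sid)
    return splits


def chunk(items, chunk_size):
    """Yield consecutive slices holding at most `chunk_size` items."""
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


def build_vocab(token_lists, max_size):
    """Count tokens and return the `max_size` most frequent (token, count) pairs."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)
    return counts.most_common(max_size)


# Toy data standing in for preprocessed stories (assumption: one string per story).
stories = {"a": "the cat sat", "b": "the dog ran", "c": "a cat ran"}
split_lists = {"train": ["a", "b"], "val": ["c"], "test": []}

splits = split_by_lists(stories, split_lists)
train_chunks = list(chunk(splits["train"], 1))
vocab = build_vocab((stories[sid].split() for sid in splits["train"]), 3)

print(splits)
print(train_chunks)
print(vocab)
```

In this sketch the vocabulary is counted over the training split only, which is a common choice for summarization datasets; whether the script does the same is not stated here.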

Usage

To get started, run:

$ python3 main.py --help

