.
├── README.md
└── main
    ├── data
    │   ├── _subjects
    │   │   ├── biology
    │   │   ├── geography
    │   │   └── physics
    │   ├── block_1.csv
    │   ├── block_2.csv
    │   └── block_3.csv
    ├── utils
    │   ├── heatmap_sum.ipynb
    │   ├── heatmap_tc.ipynb
    │   ├── newsroom_dataset.ipynb
    │   ├── nlp.ipynb
    │   ├── nlp_tc.ipynb
    │   ├── np.ipynb
    │   ├── rouge.ipynb
    │   └── text.ipynb
    ├── centroid.ipynb
    ├── gensim.ipynb
    ├── glove.ipynb
    ├── ner_pos.ipynb
    ├── nltk_ner_pos.ipynb
    ├── tc_svm.ipynb
    ├── textrank.ipynb
    └── textsum.ipynb
You have to apply for the Newsroom dataset at Newsroom. Place 'train.jsonl.gz' (approx. 2 GB) in the data directory.
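The file is gzipped JSON Lines, one record per line. A minimal sketch of how the notebooks could read it (the reader name `read_newsroom` and the field names in the demo records are illustrative, not necessarily what newsroom_dataset.ipynb uses; the demo writes two fake records to a temp file so it is self-contained):

```python
import gzip
import json
import os
import tempfile

def read_newsroom(path, limit=None):
    """Yield records from a .jsonl.gz file (one JSON object per line)."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for i, line in enumerate(fh):
            if limit is not None and i >= limit:
                break
            yield json.loads(line)

# Tiny self-contained demo: write two fabricated records, read them back.
sample = [
    {"title": "A", "text": "Some article text.", "summary": "A summary."},
    {"title": "B", "text": "More text.", "summary": "Another summary."},
]
path = os.path.join(tempfile.mkdtemp(), "train.jsonl.gz")
with gzip.open(path, "wt", encoding="utf-8") as fh:
    for rec in sample:
        fh.write(json.dumps(rec) + "\n")

records = list(read_newsroom(path))
print(len(records), records[0]["title"])  # → 2 A
```

With the real 2 GB file, the `limit` parameter keeps exploratory runs cheap by stopping after the first few thousand records.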
In the data directory you will find our small balanced dataset in _subjects. The big unbalanced dataset did not fit in this repo; our Jupyter notebooks expect it in data/subjects.
Create a directory in main called 'glove.6B' (or use the one that comes with the zip, along with its additional files) and download 'glove.6B.100d.txt' from Stanford (glove.6B.zip, approx. 822 MB).
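The downloaded file is plain text, one word followed by its vector components per line. A minimal sketch of the parsing step (the loader name `load_glove` is ours, and the demo uses fabricated 3-dimensional vectors in place of the real 100-dimensional ones):

```python
def load_glove(lines):
    """Parse GloVe text lines ('word v1 v2 ...') into a dict of float vectors."""
    index = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        index[parts[0]] = [float(x) for x in parts[1:]]
    return index

# With the real file this would be:
#   with open("glove.6B/glove.6B.100d.txt", encoding="utf-8") as fh:
#       embeddings_index = load_glove(fh)
# Tiny demo with fabricated 3-d vectors:
demo = load_glove(["the 0.1 0.2 0.3", "cat 0.4 0.5 0.6"])
print(len(demo), demo["cat"])  # → 2 [0.4, 0.5, 0.6]
```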
NOTE: glove.ipynb does not work well with the small Subject corpus included in this repo.
To use StanfordNERTagger (ner_pos.ipynb) you need to download Stanford Named Entity Recognizer version 3.9.2 (direct download). See the Stanford NLP site for more information before downloading. Place the directory in main.
centroid.ipynb / Automatic Text Summarization
An implementation of Centroid-based Text Summarization through Compositionality of Word Embeddings, from the authors' GitHub repo.
gensim.ipynb / Automatic Text Summarization (not included in the report)
Gensim's text summarization did not do well on our Subject corpus texts. It did, however, do very well on one text from our small ROUGE test (Cornell Newsroom), with a 0.64 F1-score.
glove.ipynb / Text Classification (not included in the report)
A text classification test on our big Subject corpus using a 1D convolutional layer with a pre-trained word embedding, GloVe. The basic solution is from keras.io.
We changed the optimizer from rmsprop to adam (imported as an object, to be able to change its parameters) and adjusted only the learning rate, since Keras recommends leaving the other defaults untouched. Validation accuracy leveled out in epoch 7, just above 97% (an improvement of 1% over rmsprop). We tried Dropout and Flatten without positive gains, and in the end it is not an improvement over our lighter SVM TF-IDF model.
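For reference, the keras.io recipe turns a GloVe embeddings index into an initial weight matrix whose rows line up with the tokenizer's word indices. A minimal numpy-only sketch of that step (the helper name `build_embedding_matrix` and the toy inputs are ours, not the notebook's):

```python
import numpy as np

EMBEDDING_DIM = 100  # matches glove.6B.100d.txt

def build_embedding_matrix(word_index, embeddings_index, dim=EMBEDDING_DIM):
    """Row i holds the GloVe vector for the word with tokenizer index i;
    words missing from GloVe stay all-zero."""
    matrix = np.zeros((len(word_index) + 1, dim))
    for word, i in word_index.items():
        vec = embeddings_index.get(word)
        if vec is not None:
            matrix[i] = vec
    return matrix

# Hypothetical toy inputs (real ones come from a Keras Tokenizer and the GloVe file):
word_index = {"physics": 1, "biology": 2}
embeddings_index = {"physics": np.ones(EMBEDDING_DIM)}
m = build_embedding_matrix(word_index, embeddings_index)
print(m.shape, m[1].sum(), m[2].sum())  # → (3, 100) 100.0 0.0
```

In Keras this matrix would initialise the Embedding layer (e.g. via an `embeddings_initializer`), and the adam learning rate is set by constructing the optimizer object, e.g. `keras.optimizers.Adam(learning_rate=...)`, rather than passing the string `"adam"`.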
ner_pos.ipynb / NER and POS tagging
nltk_ner_pos.ipynb / NER and POS tagging, Noun Phrase Chunking
tc_svm.ipynb / Text Classification
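The notebook's exact setup isn't shown here, but the general shape of an SVM TF-IDF text classifier can be sketched with scikit-learn (the toy corpus below is fabricated; the real one is the Subject corpus with biology/geography/physics labels):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Fabricated stand-in for the Subject corpus.
texts = ["cells divide by mitosis",
         "rivers erode the valley floor",
         "force equals mass times acceleration",
         "dna encodes proteins in the cell"]
labels = ["biology", "geography", "physics", "biology"]

# TF-IDF features feeding a linear SVM.
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(texts, labels)
pred = clf.predict(["the cell membrane and dna"])[0]
print(pred)
```

A pipeline like this trains in seconds on CPU, which is what makes it a useful baseline against the heavier GloVe/Conv1D model above.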
textrank.ipynb / Automatic Text Summarization
An implementation of Variations of the Similarity Function of TextRank for Automated Summarization, from the authors' GitHub repo.
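That work experiments with swapping out TextRank's sentence-similarity function (for e.g. BM25 or cosine variants). For orientation, the classic similarity those variations replace can be sketched in a few lines (set-based word splitting here is a simplification; real implementations tokenize properly):

```python
import math

def textrank_similarity(s1, s2):
    """Classic TextRank sentence similarity (Mihalcea & Tarau, 2004):
    word overlap normalised by the log of the sentence lengths."""
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    if len(w1) <= 1 or len(w2) <= 1:
        return 0.0  # avoid log(1) = 0 in the denominator
    return len(w1 & w2) / (math.log(len(w1)) + math.log(len(w2)))

a = "the cat sat on the mat"
b = "the dog sat on the log"
print(round(textrank_similarity(a, b), 3))  # → 0.932
```

These pairwise scores become edge weights in a sentence graph, over which PageRank ranks the sentences for extraction.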
textsum.ipynb / Automatic Text Summarization
Our own extension of an NLTK frequency algorithm, with Automated Query Weighting.
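The notebook's actual algorithm isn't reproduced here, but the underlying idea can be sketched in plain Python: score sentences by normalised word frequency and boost words from a query (the names `summarize` and `query_boost` are illustrative, and plain `split()` stands in for NLTK tokenization):

```python
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are", "on"}

def summarize(sentences, n=1, query=(), query_boost=2.0):
    """Score sentences by normalised word frequency; query words get an
    extra boost (a simplified stand-in for Automated Query Weighting)."""
    words = [w for s in sentences for w in s.lower().split() if w not in STOPWORDS]
    freq = Counter(words)
    top = freq.most_common(1)[0][1]
    weight = {w: c / top for w, c in freq.items()}
    for q in query:
        weight[q.lower()] = weight.get(q.lower(), 0.0) + query_boost
    scored = [(sum(weight.get(w, 0.0) for w in s.lower().split()), i, s)
              for i, s in enumerate(sentences)]
    # Take the n highest-scoring sentences, then restore document order.
    best = sorted(sorted(scored, reverse=True)[:n], key=lambda t: t[1])
    return [s for _, _, s in best]

doc = ["Plate tectonics shape the crust.",
       "Volcanoes form at plate boundaries.",
       "Rivers erode soft rock."]
print(summarize(doc, n=1, query=["volcanoes"])[0])
# → Volcanoes form at plate boundaries.
```

The query boost is what steers the otherwise purely frequency-driven extract toward sentences relevant to a user's question.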