.
├── README.md
└── main
    ├── data
    │   ├── _subjects
    │   │   ├── biology
    │   │   ├── geography
    │   │   └── physics
    │   ├── block_1.csv
    │   ├── block_2.csv
    │   └── block_3.csv
    ├── utils
    │   ├── heatmap_sum.ipynb
    │   ├── heatmap_tc.ipynb
    │   ├── newsroom_dataset.ipynb
    │   ├── nlp.ipynb
    │   ├── nlp_tc.ipynb
    │   ├── np.ipynb
    │   ├── rouge.ipynb
    │   └── text.ipynb
    ├── centroid.ipynb
    ├── gensim.ipynb
    ├── glove.ipynb
    ├── ner_pos.ipynb
    ├── nltk_ner_pos.ipynb
    ├── tc_svm.ipynb
    ├── textrank.ipynb
    └── textsum.ipynb
You have to apply for the Newsroom dataset at Newsroom. Place 'train.jsonl.gz' (approx. 2 GB) in the data directory.
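The file is gzipped JSON Lines, one record per line. A minimal sketch of how the notebooks could read it (the reader name `read_newsroom` and the field names in the demo records are illustrative, not necessarily what newsroom_dataset.ipynb uses; the demo writes two fake records to a temp file so it is self-contained):

```python
import gzip
import json
import os
import tempfile

def read_newsroom(path, limit=None):
    """Yield records from a .jsonl.gz file (one JSON object per line)."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for i, line in enumerate(fh):
            if limit is not None and i >= limit:
                break
            yield json.loads(line)

# Tiny self-contained demo: write two fabricated records, read them back.
sample = [
    {"title": "A", "text": "Some article text.", "summary": "A summary."},
    {"title": "B", "text": "More text.", "summary": "Another summary."},
]
path = os.path.join(tempfile.mkdtemp(), "train.jsonl.gz")
with gzip.open(path, "wt", encoding="utf-8") as fh:
    for rec in sample:
        fh.write(json.dumps(rec) + "\n")

records = list(read_newsroom(path))
print(len(records), records[0]["title"])  # → 2 A
```

With the real 2 GB file, the `limit` parameter keeps exploratory runs cheap by stopping after the first few thousand records.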
In the data directory you will find our small balanced dataset in _subjects. The big unbalanced dataset did not fit in this repo; our Jupyter notebooks expect it in data/subjects.
Create a directory in main called 'glove.6B' (or use the one that comes with the zip, along with its additional files) and download 'glove.6B.100d.txt' from Stanford (glove.6B.zip, approx. 822 MB).
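The downloaded file is plain text, one word followed by its vector components per line. A minimal sketch of the parsing step (the loader name `load_glove` is ours, and the demo uses fabricated 3-dimensional vectors in place of the real 100-dimensional ones):

```python
def load_glove(lines):
    """Parse GloVe text lines ('word v1 v2 ...') into a dict of float vectors."""
    index = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        index[parts[0]] = [float(x) for x in parts[1:]]
    return index

# With the real file this would be:
#   with open("glove.6B/glove.6B.100d.txt", encoding="utf-8") as fh:
#       embeddings_index = load_glove(fh)
# Tiny demo with fabricated 3-d vectors:
demo = load_glove(["the 0.1 0.2 0.3", "cat 0.4 0.5 0.6"])
print(len(demo), demo["cat"])  # → 2 [0.4, 0.5, 0.6]
```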
NOTE: glove.ipynb does not work well with the small Subject corpus included in this repo.
To use StanfordNERTagger (ner_pos.ipynb) you need to download Stanford Named Entity Recognizer version 3.9.2 (direct download). See the Stanford NLP site for more information before downloading. Place the directory in main.
centroid.ipynb / Automatic Text Summarization
An implementation of Centroid-based Text Summarization through Compositionality of Word Embeddings, from the authors' GitHub repo.
gensim.ipynb / Automatic Text Summarization (not included in the report)
Gensim's text summarization did not do well on our Subject corpus texts. It did, however, do very well on one text from our small ROUGE test (Cornell Newsroom), with a 0.64 F1-score.
glove.ipynb / Text Classification (not included in the report)
A text classification test on our big Subject corpus using a 1D convolutional layer with a pre-trained word embedding, GloVe. The basic solution is from keras.io.
We changed the optimizer from rmsprop to adam (imported as an object, to be able to change its parameters) and adjusted only the learning rate, since Keras recommends leaving the other defaults untouched. Validation accuracy leveled out in epoch 7, just above 97% (an improvement of 1% over rmsprop). We tried Dropout and Flatten without positive gains, and in the end it is not an improvement over our lighter SVM TF-IDF model.
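For reference, the keras.io recipe turns a GloVe embeddings index into an initial weight matrix whose rows line up with the tokenizer's word indices. A minimal numpy-only sketch of that step (the helper name `build_embedding_matrix` and the toy inputs are ours, not the notebook's):

```python
import numpy as np

EMBEDDING_DIM = 100  # matches glove.6B.100d.txt

def build_embedding_matrix(word_index, embeddings_index, dim=EMBEDDING_DIM):
    """Row i holds the GloVe vector for the word with tokenizer index i;
    words missing from GloVe stay all-zero."""
    matrix = np.zeros((len(word_index) + 1, dim))
    for word, i in word_index.items():
        vec = embeddings_index.get(word)
        if vec is not None:
            matrix[i] = vec
    return matrix

# Hypothetical toy inputs (real ones come from a Keras Tokenizer and the GloVe file):
word_index = {"physics": 1, "biology": 2}
embeddings_index = {"physics": np.ones(EMBEDDING_DIM)}
m = build_embedding_matrix(word_index, embeddings_index)
print(m.shape, m[1].sum(), m[2].sum())  # → (3, 100) 100.0 0.0
```

In Keras this matrix would initialise the Embedding layer (e.g. via an `embeddings_initializer`), and the adam learning rate is set by constructing the optimizer object, e.g. `keras.optimizers.Adam(learning_rate=...)`, rather than passing the string `"adam"`.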
ner_pos.ipynb / NER and POS tagging
nltk_ner_pos.ipynb / NER and POS tagging, Noun Phrase Chunking
tc_svm.ipynb / Text Classification
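The notebook's exact setup isn't shown here, but the general shape of an SVM TF-IDF text classifier can be sketched with scikit-learn (the toy corpus below is fabricated; the real one is the Subject corpus with biology/geography/physics labels):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Fabricated stand-in for the Subject corpus.
texts = ["cells divide by mitosis",
         "rivers erode the valley floor",
         "force equals mass times acceleration",
         "dna encodes proteins in the cell"]
labels = ["biology", "geography", "physics", "biology"]

# TF-IDF features feeding a linear SVM.
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(texts, labels)
pred = clf.predict(["the cell membrane and dna"])[0]
print(pred)
```

A pipeline like this trains in seconds on CPU, which is what makes it a useful baseline against the heavier GloVe/Conv1D model above.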
textrank.ipynb / Automatic Text Summarization
An implementation of Variations of the Similarity Function of TextRank for Automated Summarization, from the authors' GitHub repo.
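That work experiments with swapping out TextRank's sentence-similarity function (for e.g. BM25 or cosine variants). For orientation, the classic similarity those variations replace can be sketched in a few lines (set-based word splitting here is a simplification; real implementations tokenize properly):

```python
import math

def textrank_similarity(s1, s2):
    """Classic TextRank sentence similarity (Mihalcea & Tarau, 2004):
    word overlap normalised by the log of the sentence lengths."""
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    if len(w1) <= 1 or len(w2) <= 1:
        return 0.0  # avoid log(1) = 0 in the denominator
    return len(w1 & w2) / (math.log(len(w1)) + math.log(len(w2)))

a = "the cat sat on the mat"
b = "the dog sat on the log"
print(round(textrank_similarity(a, b), 3))  # → 0.932
```

These pairwise scores become edge weights in a sentence graph, over which PageRank ranks the sentences for extraction.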
textsum.ipynb / Automatic Text Summarization
Our own extension of an NLTK frequency algorithm, with Automated Query Weighting.
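The notebook's actual algorithm isn't reproduced here, but the underlying idea can be sketched in plain Python: score sentences by normalised word frequency and boost words from a query (the names `summarize` and `query_boost` are illustrative, and plain `split()` stands in for NLTK tokenization):

```python
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are", "on"}

def summarize(sentences, n=1, query=(), query_boost=2.0):
    """Score sentences by normalised word frequency; query words get an
    extra boost (a simplified stand-in for Automated Query Weighting)."""
    words = [w for s in sentences for w in s.lower().split() if w not in STOPWORDS]
    freq = Counter(words)
    top = freq.most_common(1)[0][1]
    weight = {w: c / top for w, c in freq.items()}
    for q in query:
        weight[q.lower()] = weight.get(q.lower(), 0.0) + query_boost
    scored = [(sum(weight.get(w, 0.0) for w in s.lower().split()), i, s)
              for i, s in enumerate(sentences)]
    # Take the n highest-scoring sentences, then restore document order.
    best = sorted(sorted(scored, reverse=True)[:n], key=lambda t: t[1])
    return [s for _, _, s in best]

doc = ["Plate tectonics shape the crust.",
       "Volcanoes form at plate boundaries.",
       "Rivers erode soft rock."]
print(summarize(doc, n=1, query=["volcanoes"])[0])
# → Volcanoes form at plate boundaries.
```

The query boost is what steers the otherwise purely frequency-driven extract toward sentences relevant to a user's question.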