Schefferbird/degree_project_ml

Bachelor/Data Science/Linnaeus University

tags: Text Classification, POS tagging, Noun Phrase Chunking, NER tagging, Text Summarization

Directory tree

```
.
├── README.md
└── main
    ├── data
    │   ├── _subjects
    │   │   ├── biology
    │   │   ├── geography
    │   │   └── physics
    │   │
    │   ├── block_1.csv
    │   ├── block_2.csv
    │   └── block_3.csv
    │
    ├── utils
    │   ├── heatmap_sum.ipynb
    │   ├── heatmap_tc.ipynb
    │   ├── newsroom_dataset.ipynb
    │   ├── nlp.ipynb
    │   ├── nlp_tc.ipynb
    │   ├── np.ipynb
    │   ├── rouge.ipynb
    │   └── text.ipynb
    │
    ├── centroid.ipynb
    ├── gensim.ipynb
    ├── glove.ipynb
    ├── ner_pos.ipynb
    ├── nltk_ner_pos.ipynb
    ├── tc_svm.ipynb
    ├── textrank.ipynb
    └── textsum.ipynb
```
This repo mostly reflects our final work in Jupyter Lab.

Missing datasets and other sources (omitted due to the repository size limit)

Cornell's Newsroom (ROUGE test dataset)

You have to apply for access at Newsroom. Place 'train.jsonl.gz' (approx. 2 GB) in the data directory.

Subject corpus

In the data directory you will find our small, balanced dataset in _subjects. The large, unbalanced dataset did not fit in this repo; in our Jupyter notebooks it is placed in data/subjects.

GloVe

Create a directory in main called 'glove.6B' (or use the directory that comes with the zip, which contains additional files) and download 'glove.6B.100d.txt' from Stanford (glove.6B.zip, size 822 MB).

NOTE: glove.ipynb does not work well with the small Subject corpus included in this repo.
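
A minimal sketch of how a GloVe text file such as glove.6B.100d.txt can be parsed (the helper name `load_glove` is ours, not from the notebook): each line holds a token followed by its embedding values, separated by spaces.

```python
def load_glove(path):
    """Return a dict mapping each token to its embedding (list of floats)."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            # First field is the token; the rest are the vector components.
            embeddings[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings
```

The resulting dict can then be used to build an embedding matrix for the Keras Embedding layer.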

StanfordNERTagger

To use StanfordNERTagger (ner_pos.ipynb) you need to download Stanford Named Entity Recognizer version 3.9.2 (instant download). For more information before downloading, see Stanford. Place the directory in main.

Main notebooks

centroid.ipynb / Automatic Text Summarization

Implementation of Centroid-based Text Summarization through Compositionality of Word Embeddings, from the authors' GitHub repo.
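
The core idea — score each sentence by its cosine similarity to the document centroid in embedding space — can be sketched in plain Python (toy vectors and helper names are illustrative assumptions, not the repo's actual code):

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def sentence_vector(sentence, embeddings, dim):
    """Sum the word vectors of a sentence (unknown words contribute zero)."""
    vec = [0.0] * dim
    for word in sentence.lower().split():
        for i, v in enumerate(embeddings.get(word, [0.0] * dim)):
            vec[i] += v
    return vec

def centroid_summary(sentences, embeddings, dim, n=1):
    """Return the n sentences closest to the document centroid."""
    doc_vec = sentence_vector(" ".join(sentences), embeddings, dim)
    return sorted(sentences,
                  key=lambda s: cosine(sentence_vector(s, embeddings, dim), doc_vec),
                  reverse=True)[:n]
```

The published method additionally selects centroid words by TF-IDF and penalizes redundancy; this sketch keeps only the embedding-centroid scoring step.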

gensim.ipynb / Automatic Text Summarization (not included in the report)

Gensim's text summarization did not do well on our Subject corpus texts. It did, however, perform very well on one text from our small ROUGE test (Cornell Newsroom), with a 0.64 F1-score.

glove.ipynb / Text Classification (not included in the report)

A Text Classification test on our big Subject corpus with a 1D convolutional layer and pre-trained GloVe word embeddings. The basic solution is from keras.io.

We changed the optimizer from rmsprop to adam (imported explicitly, to be able to change its parameters) and adjusted only the learning rate, since Keras recommends leaving the other defaults untouched. Validation accuracy leveled out at epoch 7, just above 97% (an improvement of 1% over rmsprop). We tried Dropout and Flatten layers without positive gains, and in the end it is not an improvement over our lighter SVM TF-IDF model.

Evaluation of 1D CNN with GloVe

ner_pos.ipynb / NER and POS tagging

Wrappers included: StanfordNERTagger, stanfordnlp.Pipeline, spaCy, allennlp.predictors

nltk_ner_pos.ipynb / NER and POS tagging, Noun Phrase Chunking

tc_svm.ipynb / Text Classification
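
A minimal scikit-learn sketch of an SVM classifier over TF-IDF features, in the spirit of this notebook (the toy subject texts and pipeline details below are illustrative assumptions, not the notebook's actual code):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy training data standing in for the Subject corpus (illustrative only).
texts = [
    "cells evolve through natural selection",
    "cell membranes and proteins are studied here",
    "rivers and mountains shape the landscape",
    "maps of continents and climate zones",
    "quantum mechanics describes elementary particles",
    "forces energy and motion in mechanics",
]
labels = ["biology", "biology", "geography", "geography", "physics", "physics"]

# TF-IDF features feeding a linear SVM.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, labels)
```

A held-out split and `classification_report` would be the usual next step for evaluation.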

textrank.ipynb / Automatic Text Summarization

Implementation of Variations of the Similarity Function of TextRank for Automated Summarization, from the authors' GitHub repo.
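
The heart of TextRank — a PageRank power iteration over a sentence-similarity graph — can be sketched as follows (a simplified stdlib version using the original word-overlap similarity, not the notebook's varied similarity functions):

```python
import math

def similarity(s1, s2):
    """Word-overlap similarity from the original TextRank paper."""
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    overlap = len(w1 & w2)
    denom = (math.log(len(w1)) + math.log(len(w2))
             if min(len(w1), len(w2)) > 1 else 0.0)
    return overlap / denom if denom else 0.0

def textrank(sentences, d=0.85, iters=50):
    """Return a PageRank-style score per sentence via power iteration."""
    n = len(sentences)
    sim = [[0.0 if i == j else similarity(sentences[i], sentences[j])
            for j in range(n)] for i in range(n)]
    scores = [1.0] * n
    for _ in range(iters):
        scores = [
            (1 - d) + d * sum(sim[j][i] / sum(sim[j]) * scores[j]
                              for j in range(n) if sim[j][i] > 0)
            for i in range(n)
        ]
    return scores
```

The top-scoring sentences form the summary; the cited work swaps `similarity` for alternatives such as TF-IDF cosine or embedding distances.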

textsum.ipynb / Automatic Text Summarization

Our own extension of an nltk frequency algorithm, with Automated Query Weighting.
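
A frequency-based summarizer of this general shape can be sketched in plain Python (the query-boost scheme below is an illustrative assumption, not necessarily the notebook's weighting):

```python
from collections import Counter

def summarize(text, query_words=(), boost=2.0, n=1):
    """Rank sentences by summed word frequency; boost query words."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    freq = Counter(w for s in sentences for w in s.lower().split())

    def score(sentence):
        return sum(freq[w] * (boost if w in query_words else 1.0)
                   for w in sentence.lower().split())

    return sorted(sentences, key=score, reverse=True)[:n]
```

An nltk version would additionally use proper sentence tokenization, stopword removal, and stemming before counting.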

ROUGE test for our three text summarizers

A small ROUGE test of all three text summarization models on three samples from Cornell's Newsroom.
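
ROUGE-1 F1 — unigram overlap between a candidate summary and a reference — can be computed in a few lines (a simplified sketch; published ROUGE scores are usually produced with the official toolkit and its stemming options):

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """F1 over unigram overlap between candidate and reference summaries."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if not overlap:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

ROUGE-2 and ROUGE-L follow the same precision/recall pattern over bigrams and longest common subsequences.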
