Goal: develop models to classify textual data. The input is a text document and the output is a categorical variable (the class label).
# Multi-class classification problem and text classification
- 20 Newsgroups dataset: use the train subset (subset='train' with remove=('headers', 'footers', 'quotes') in sklearn.datasets.fetch_20newsgroups) to train the models and report the final performance on the test subset. Note that you must start from the raw text and convert it to feature vectors; see https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html for a tutorial on the steps involved.
- IMDB Reviews (http://ai.stanford.edu/~amaas/data/sentiment/): use only the reviews in the train folder for training and report the performance on the test folder. Work directly with the text documents to build your own features and ignore the pre-formatted feature files. A minimal loading sketch for both datasets follows this list.
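As a rough sketch (not the required solution), both datasets can be loaded as follows. The IMDB paths aclImdb/train and aclImdb/test are an assumption about where the downloaded archive was extracted.

```python
from sklearn.datasets import fetch_20newsgroups, load_files

# 20 Newsgroups: train/test splits with headers, footers and quotes removed
news_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
news_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))

# IMDB: load_files reads one sub-folder per class (pos/ and neg/);
# 'aclImdb/...' is an assumed extraction path for the archive from the URL above
imdb_train = load_files('aclImdb/train', categories=['pos', 'neg'], encoding='utf-8')
imdb_test = load_files('aclImdb/test', categories=['pos', 'neg'], encoding='utf-8')

print(len(news_train.data), len(imdb_train.data))  # number of training documents
```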
Apply and compare the performance of the following models using sklearn (a minimal instantiation sketch follows the list):
- Logistic regression: sklearn.linear_model.LogisticRegression
- Decision trees: sklearn.tree.DecisionTreeClassifier
- Support vector machines: sklearn.svm.LinearSVC
- AdaBoost: sklearn.ensemble.AdaBoostClassifier
- Random forest: sklearn.ensemble.RandomForestClassifier
- Naive Bayes: sklearn.naive_bayes.MultinomialNB
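A minimal sketch of instantiating the six classifiers to compare; the constructor arguments shown are illustrative placeholders, not tuned values.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB

# one entry per required model; hyperparameters here are placeholders to be tuned later
models = {
    'Logistic regression': LogisticRegression(max_iter=1000),
    'Decision tree': DecisionTreeClassifier(),
    'Linear SVM': LinearSVC(),
    'AdaBoost': AdaBoostClassifier(),
    'Random forest': RandomForestClassifier(n_estimators=100),
    'Naive Bayes': MultinomialNB(),
}
```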
Use Python libraries such as scikit-learn to extract features, preprocess the data, and tune the hyperparameters.
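For example, feature extraction could follow the count-then-tf-idf steps from the linked sklearn tutorial; the specific options below (English stop words, bigrams, rare-term cutoff) are illustrative design choices to study, not requirements.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))

# bag-of-words / bag-of-bigrams counts, dropping English stop words and
# terms that appear in fewer than two documents
counts = CountVectorizer(stop_words='english', ngram_range=(1, 2), min_df=2)
X_counts = counts.fit_transform(train.data)

# re-weight the raw counts by tf-idf
X_tfidf = TfidfTransformer().fit_transform(X_counts)
print(X_tfidf.shape)  # (n_documents, n_features)
```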
Develop a model validation pipeline (e.g., k-fold cross-validation or a held-out validation set) and study the effect of different hyperparameters and design choices. In a single table, compare and report the performance of the models listed above (with their best hyperparameters), and mark the winner for each dataset and overall.
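One possible validation pipeline, sketched for a single model on 20 Newsgroups: a 5-fold cross-validated grid search over vectorizer and classifier hyperparameters on the train split, followed by a single evaluation on the held-out test split. The grid values are examples, not a required search space.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

remove = ('headers', 'footers', 'quotes')
train = fetch_20newsgroups(subset='train', remove=remove)
test = fetch_20newsgroups(subset='test', remove=remove)

pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression(max_iter=1000)),
])
param_grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],   # unigrams vs unigrams + bigrams
    'clf__C': [0.1, 1.0, 10.0],               # inverse regularization strength
}

# 5-fold cross-validation on the training split selects the best hyperparameters;
# the refit best model is then scored once on the untouched test split
search = GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1)
search.fit(train.data, train.target)
print('best hyperparameters:', search.best_params_)
print('test accuracy: %.3f' % search.score(test.data, test.target))
```

Repeating this loop for each model and each dataset gives one row per model for the required comparison table.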