Reference Guide : https://nlpforhackers.io/start/
Reference Guide : https://radimrehurek.com/gensim/tutorial.html
- Gender-Classification on Names using a Decision Tree
  - Resource on normalizing data: http://simpledatamining.blogspot.com/2015/05/how-to-deal-with-mixed-data-types-when.html
    - numerical data --> normalize
    - categorical data --> one-hot encoding
    - ordinal data --> map the levels to ordered integers, then normalize (no one-hot encoding)
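
The three rules above can be sketched in plain Python. This is a minimal illustration, not the code used in the project: the helper names (`min_max`, `one_hot`, `encode_ordinal`) and the toy columns are assumptions made up for this example, and min-max scaling is only one possible choice of normalization.

```python
def min_max(values):
    """Scale numeric values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(values):
    """One indicator column per category, in sorted category order."""
    categories = sorted(set(values))
    return [[1.0 if v == c else 0.0 for c in categories] for v in values]


def encode_ordinal(values, order):
    """Map ordered levels to ranks, then scale the ranks to [0, 1]."""
    rank = {level: i for i, level in enumerate(order)}
    top = max(len(order) - 1, 1)
    return [rank[v] / top for v in values]


# Hypothetical toy columns, one entry per row.
ages = [20, 30, 40]                    # numerical
colors = ["red", "blue", "red"]        # categorical
sizes = ["small", "large", "medium"]   # ordinal

age_col = min_max(ages)
color_cols = one_hot(colors)
size_col = encode_ordinal(sizes, order=["small", "medium", "large"])

# One feature row per record: [age, color=blue, color=red, size]
features = [[a] + c + [s] for a, c, s in zip(age_col, color_cols, size_col)]
```

Keeping the ordinal column as a single scaled number preserves the ordering between levels, which one-hot encoding would discard.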
- Simple inverted index over input sentences, built with NLTK (tokenization + stopword removal + stemming/lemmatization)
  - Resource on stemming vs lemmatization: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
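
The pipeline above (tokenize, drop stopwords, stem, index) can be sketched without any dependencies. Note the assumptions: the project uses NLTK, whereas this sketch swaps in a regex tokenizer, a tiny hand-written stopword set, and a crude suffix stripper (`crude_stem`, a hypothetical stand-in for a real stemmer such as NLTK's Porter stemmer) so that it runs on its own.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword set; NLTK ships a much larger list.
STOPWORDS = {"a", "an", "the", "is", "are", "on", "in", "of", "and", "to"}


def crude_stem(token):
    """Strip one common suffix. A rough stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def build_inverted_index(sentences):
    """Map each stemmed term to the sorted ids of sentences containing it."""
    index = defaultdict(set)
    for sent_id, sentence in enumerate(sentences):
        for token in re.findall(r"[a-z]+", sentence.lower()):
            if token in STOPWORDS:
                continue
            index[crude_stem(token)].add(sent_id)
    return {term: sorted(ids) for term, ids in index.items()}


sentences = [
    "The cats are sleeping on the mat",
    "A cat chased the dogs",
    "Dogs barked loudly",
]
index = build_inverted_index(sentences)
print(index["cat"])  # sentence ids that mention cat/cats
```

A query is answered by normalizing the query terms the same way and intersecting (or uniting) their posting lists.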
- Text Classification of a News Corpus using different methods to vectorize the given input
  - Resource explaining the different ways to vectorize text: https://monkeylearn.com/blog/beginners-guide-text-vectorization/
  - Possible representations:
    - tf-idf
    - word2vec (Not Implemented)
    - skip-thought vectors (Not Implemented)
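
As a rough illustration of the tf-idf representation, here is a dependency-free sketch. It assumes the textbook weighting `tf = count / document length` and `idf = log(N / df)`; the function name `tfidf_vectors` and the three toy headlines are made up for this example, and whitespace splitting stands in for real tokenization.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocab, vectors) using tf = count/len(doc), idf = log(N/df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents each term appears in.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    vocab = sorted(df)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        row = []
        for term in vocab:
            tf = counts[term] / total if total else 0.0
            row.append(tf * math.log(n_docs / df[term]))
        vectors.append(row)
    return vocab, vectors


docs = [
    "stocks fall as markets react",
    "team wins the final match",
    "markets rally as stocks rise",
]
vocab, vectors = tfidf_vectors(docs)
```

Each row of `vectors` can be fed to any classifier that accepts dense feature vectors. Library implementations differ in the details: scikit-learn's `TfidfVectorizer`, for instance, applies a smoothed idf and L2 normalization by default, so its numbers will not match this sketch.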
- Train a custom word2vec model on the OpinRank dataset (reviews of cars and hotels) using Gensim
  - Reference article: http://kavita-ganesan.com/gensim-word2vec-tutorial-starter-code/#.XBjfIWloTDc
  - Dataset: https://github.com/kavgan/nlp-text-mining-working-examples/tree/master/word2vec
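
Word2vec training in Gensim takes an iterable of token lists, one list per sentence or review. The sketch below covers only that preparation step, under stated assumptions: the input is a gzip-compressed text file with one review per line, the class name `ReviewCorpus` and the regex tokenizer are hypothetical, and the demo file is generated on the fly instead of using the real dataset.

```python
import gzip
import os
import re
import tempfile

TOKEN_RE = re.compile(r"[a-z]+")


def tokenize(line, min_len=2, max_len=15):
    """Lowercase a line and keep alphabetic tokens of reasonable length."""
    return [t for t in TOKEN_RE.findall(line.lower())
            if min_len <= len(t) <= max_len]


class ReviewCorpus:
    """Re-iterable stream of token lists, one per non-empty line."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with gzip.open(self.path, "rt", encoding="utf-8",
                       errors="ignore") as handle:
            for line in handle:
                tokens = tokenize(line)
                if tokens:
                    yield tokens


def write_demo_file():
    """Create a small gzip file with made-up reviews for the demo."""
    fd, path = tempfile.mkstemp(suffix=".txt.gz")
    os.close(fd)
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        handle.write("The room was clean and the staff were friendly.\n")
        handle.write("\n")
        handle.write("Great mileage, but the seats are uncomfortable!\n")
    return path


demo_path = write_demo_file()
sentences = list(ReviewCorpus(demo_path))
os.remove(demo_path)
```

A corpus object like this can be passed to `gensim.models.Word2Vec` as its `sentences` argument. Because the class defines `__iter__` rather than being a one-shot generator, it can be re-read on every training epoch. Note that the embedding-size parameter is named `vector_size` in Gensim 4.x and `size` in 3.x.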
- Train a custom word2vec model on the Amazon Reviews for Sentiment Analysis (Kaggle) dataset using Gensim
  - Dataset: https://www.kaggle.com/bittlingmayer/amazonreviews
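
Before training, each review has to be separated from its sentiment label and tokenized. The sketch below assumes the file uses fastText-style lines of the form `__label__<id> <text>`. That format, the function name `parse_labeled_line`, and the sample lines are all assumptions for illustration, so check them against the downloaded files.

```python
import re

LABEL_PREFIX = "__label__"  # assumed fastText-style label prefix


def parse_labeled_line(line):
    """Split '<prefix><label> <text>' into (label, tokens)."""
    label_part, _, text = line.strip().partition(" ")
    if not label_part.startswith(LABEL_PREFIX):
        raise ValueError(f"missing label prefix: {line!r}")
    label = label_part[len(LABEL_PREFIX):]
    tokens = re.findall(r"[a-z]+", text.lower())
    return label, tokens


# Made-up sample lines in the assumed format (not taken from the dataset).
sample_lines = [
    "__label__2 Great sound: works well and arrived quickly",
    "__label__1 Broke fast: stopped charging after two days",
]

parsed = [parse_labeled_line(line) for line in sample_lines]
labels = [label for label, _ in parsed]
sentences = [tokens for _, tokens in parsed]
```

The `sentences` list is what a word2vec trainer consumes. The `labels` list is not needed for training embeddings, but keeping it aligned with the token lists makes a later sentiment classifier straightforward.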