GitHub - kuta-ndze/TopicModeling_NLP: intuition to bag of words and Topic Modeler in NLP

This repo contains two personal projects, outlined below, along with all the files they use.

`1. Sentiment Analysis`

An example that builds intuition for the bag-of-words model in NLP, using Kirill Eremenko's Restaurant Reviews dataset.
The model could be improved further, but it is kept simple enough to show the intuition behind it.
- SentimentanalysisNLP.py

`2. Topic Modelling`

Objective

NB: If the notebook does not load, check the Python script 👉🏽 Kmeans Topic Modelling
In this project, we want to group customer reviews from a Twitter corpus based on recurring patterns. Each cluster should give a sense of a specific topic, that is, what the customers are complaining about. The Twitter corpus contains a lot of noise, which we try to minimize in order to make sense of the data.

Data

The data is noisy Twitter review data: 21047 tweets with 4 attributes (username, date, tweet and mention), all about Vodafone, a telecom company. See tweets.csv.

Methodology

The ML technique used in this project is k-means clustering, an unsupervised model, which is used here to extract patterns from the tweets.

Data Cleaning with Pattern Removal
- Remove mentions starting with @
- Replace non-alphabetic characters with a space
- Convert uppercase letters to lowercase so that words are compared consistently
- Collapse repeated spaces and remove words shorter than 2 characters

Tokenizing Data and Identifying Special Instances of Tweets
This separates the words and removes punctuation.
- Create a list of words for each row of clean text; this also removes any full stop at the end of the text
- Drop empty rows in the clean data
- Drop duplicate and empty tweets in the data set and reset the index
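
The cleaning and tokenizing steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the function names `clean_tweet` and `clean_corpus` are hypothetical, and the repo's script may well do the same work on a DataFrame instead.

```python
import re


def clean_tweet(text: str) -> str:
    """Apply the pattern-removal steps to a single tweet (hypothetical helper)."""
    text = re.sub(r"@\w+", " ", text)       # remove @mentions
    text = re.sub(r"[^A-Za-z]", " ", text)  # replace non-alphabetic characters with a space
    text = text.lower()                     # lowercase everything
    # split() collapses repeated whitespace; drop words shorter than 2 characters
    words = [w for w in text.split() if len(w) >= 2]
    return " ".join(words)


def clean_corpus(tweets):
    """Clean, tokenize, and drop empty or duplicate tweets (hypothetical helper)."""
    seen = set()
    tokenized = []
    for tweet in tweets:
        cleaned = clean_tweet(tweet)
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            tokenized.append(cleaned.split())  # one list of words per tweet
    return tokenized


# Made-up sample tweets, not taken from tweets.csv
sample = [
    "@Vodafone My 4G is DOWN again!!!",
    "my 4g is down again",
    "@user !!!",
]
print(clean_corpus(sample))
```

The second tweet cleans to the same text as the first and the third cleans to an empty string, so only one tokenized tweet remains.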

Vectorizer
This is similar to tokenization, except that it takes the whole vocabulary across the documents and converts each document into a row of a bag-of-words matrix. For instance:

  [Hi my name is celdrick]
  [Hi my friend is Joyce]

  # The vocabulary: every distinct word across the documents, which gives a fixed input length
  [Hi my name is celdrick friend Joyce]

  # The count vectorizer converts each document to a row of word counts; it was preferred to TF-IDF because the data set is small.
  [1, 1, 1, 1, 1, 0, 0]
  [1, 1, 0, 1, 0, 1, 1]
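
The toy example above can be reproduced with a few lines of plain Python. This is a minimal sketch to show the mechanics, not the code used in the repo; `build_vocabulary` and `count_vector` are hypothetical helper names.

```python
docs = ["Hi my name is celdrick", "Hi my friend is Joyce"]


def build_vocabulary(documents):
    """Collect every distinct word in order of first appearance."""
    vocab = []
    for doc in documents:
        for word in doc.split():
            if word not in vocab:
                vocab.append(word)
    return vocab


def count_vector(doc, vocab):
    """Count how often each vocabulary word occurs in one document."""
    words = doc.split()
    return [words.count(term) for term in vocab]


vocab = build_vocabulary(docs)
matrix = [count_vector(doc, vocab) for doc in docs]

print(vocab)
for row in matrix:
    print(row)
```

A library vectorizer such as scikit-learn's `CountVectorizer` lowercases the text and sorts the vocabulary alphabetically by default, so its columns come out in a different order than in this sketch.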

The count vectorizer is implemented with parameters such as stop_words, analyzer, ngram_range, min_df and max_df, and the resulting matrix is converted to an array for modeling.

Model Building and Evaluation
- Since this is a clustering problem, k-means was used.

Results

The optimal number of clusters for my model is 6 (6 centroids); this can be improved with experience. See clustered_tweets.csv.
Word cloud analysis helped to visualize prominent patterns when deciding the number of clusters.
The best numbers of clusters/centroids range between 2 and 8.

Recommendations

Since customer reviews are subjective, try a bigger data set with more reviews. We could keep monitoring the system's performance and varying the number of clusters, since choosing the clusters comes with experience and depends on the domain problem. NB: If the Jupyter file does not render, check the file with the .py extension.

