kuta-ndze/TopicModeling_NLP


This repo contains the two personal projects outlined below, together with all of their files.

1. Sentiment Analysis

  • An intuition-building example of the bag-of-words model in NLP, using Kirill Eremenko's Restaurant Reviews intuition dataset.
  • The model is kept simple so that the intuition behind it stays clear; it could be improved further.

2. Topic Modelling

Objective

  • NB: If the notebook does not load, check the Python script 👉🏽 Kmeans Topic Modelling
  • In this project, we group customer reviews from a Twitter corpus based on recurring patterns. Each cluster should give a sense of a specific topic, i.e. what the customers are complaining about. The Twitter corpus contains a lot of noise, which we try to minimize in order to make sense of the data.

Data

  • The data consists of noisy Twitter reviews about Vodafone, a telecom company: 21,047 tweets with 4 attributes (username, date, tweet and mention), stored in tweets.csv.

Methodology

  • The ML technique used in this project is K-means clustering, an unsupervised method, applied here to extract patterns from the tweets.
  1. Data Cleaning with Pattern Removal

    • Removing mentions that start with @
    • Replacing non-alphabetic characters with a space
    • Converting uppercase letters to lowercase, so that the same word is always written the same way
    • Collapsing repeated spaces and removing words shorter than 2 characters
  2. Tokenizing Data and Identifying Special Instances of Tweets. This separates the words and removes punctuation.

    • Create a list of words for each row of the clean text, so that each word stands alone; this also removes any full stop at the end of the text.
    • Drop rows that are empty after cleaning
    • Drop duplicate/empty tweets from the data set and reset the index
  3. Vectorizer. This is similar to tokenization, except that it takes the whole vocabulary across all documents and converts the documents into a bag-of-words matrix. For instance:

      [Hi my name is celdrick]
      [Hi my friend is Joyce]
    
      # Collect the entire vocabulary into one list, which gives every document a fixed input length
      [Hi my name is celdrick friend Joyce]
    
      # The count vectorizer converts each document to a row of the matrix; it is preferred to TF-IDF here because the data set is small.
      [1, 1, 1, 1, 1, 0, 0]
      [1, 1, 0, 1, 0, 1, 1]
    • Implement the count vectorizer with parameters such as stop_words, analyzer, ngram_range, min_df and max_df, then convert the matrix to an array for modeling
  4. Model Building and Evaluation

    • Since this is a clustering problem, K-means is used.
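
As a rough illustration of steps 1 to 3 above, here is a minimal sketch that uses only the Python standard library. It is not the project's own script: the function names (`clean_tweet`, `tokenize`, `bag_of_words`) and the sample tweets are made up for this example, and the hand-written bag of words only stands in for the count vectorizer to show what the resulting matrix looks like.

```python
import re


def clean_tweet(text):
    """Apply the four cleaning rules from step 1 to a single tweet."""
    text = re.sub(r"@\w+", " ", text)       # remove @mentions
    text = re.sub(r"[^A-Za-z]", " ", text)  # non-alphabetic characters -> space
    text = text.lower()                     # uppercase -> lowercase
    # split() collapses repeated spaces; keep only words of length >= 2
    return " ".join(word for word in text.split() if len(word) >= 2)


def tokenize(text):
    """Turn one cleaned tweet into a list of standalone words."""
    return text.split()


def bag_of_words(token_lists):
    """Build a vocabulary (in order of first appearance) and a count matrix."""
    vocabulary = []
    for tokens in token_lists:
        for token in tokens:
            if token not in vocabulary:
                vocabulary.append(token)
    matrix = [[tokens.count(word) for word in vocabulary] for tokens in token_lists]
    return vocabulary, matrix


# Hypothetical sample tweets, invented for this illustration only.
raw_tweets = [
    "@someuser Hi, my name is Celdrick!",
    "Hi my friend is Joyce :)",
    "@someuser",              # empty once the mention is removed
    "hi my friend is joyce",  # duplicate of the second tweet after cleaning
]

cleaned = [clean_tweet(tweet) for tweet in raw_tweets]
# Drop empty tweets and duplicates while keeping the original order.
unique = list(dict.fromkeys(text for text in cleaned if text))

token_lists = [tokenize(text) for text in unique]
vocabulary, matrix = bag_of_words(token_lists)

print(vocabulary)
for row in matrix:
    print(row)
```

The columns here follow the order in which words first appear, which mirrors the example in step 3; a library vectorizer may order its columns differently.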

Results

  • The optimal number of clusters for my model is 6 (6 centroids); this can be improved with experience. See clustered_tweets.csv
  • Word cloud analysis helped to visualize prominent patterns when deciding the number of clusters
  • The best cluster/centroid counts fall between 2 and 8
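
To make the clustering step and the choice of cluster count more concrete, the sketch below implements standard K-means (Lloyd's algorithm) in plain Python and runs it for a few candidate cluster counts. Everything in it is an assumption made for illustration: the `kmeans` function, the toy `count_matrix` and the range of counts are not taken from the project, which presumably relies on a library implementation. The within-cluster sum of squared distances (often called inertia) is printed only as one possible number to look at alongside the word clouds.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda c: squared_distance(point, centroids[c]))
        for point in points
    ]


def kmeans(points, k, iterations=50, seed=0):
    """Plain K-means: returns (labels, centroids, inertia)."""
    rng = random.Random(seed)
    # Start from k distinct rows picked at random.
    centroids = [list(point) for point in rng.sample(points, k)]
    for _ in range(iterations):
        labels = assign(points, centroids)
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                # New centroid is the column-wise mean of the cluster members.
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break  # converged: centroids stopped moving
        centroids = new_centroids
    labels = assign(points, centroids)
    inertia = sum(squared_distance(p, centroids[label]) for p, label in zip(points, labels))
    return labels, centroids, inertia


# Hypothetical count matrix: rows are tweets, columns are word counts.
count_matrix = [
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [2, 2, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 2, 1],
    [0, 0, 2, 2],
]

# Try a few candidate cluster counts and compare the results.
results = {k: kmeans(count_matrix, k) for k in range(2, 5)}
for k, (labels, _, inertia) in results.items():
    print(k, labels, round(inertia, 3))
```

On real tweets the same loop would run over the array produced by the count vectorizer, with the candidate counts set to whatever range is being explored.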

Recommendations

  • Since customer reviews are subjective, try a bigger data set with more reviews. Keep monitoring the system's performance and vary the number of clusters; choosing it well comes with experience and depends on the domain problem. NB: If the Jupyter file does not render, check the .py file.
