Predict review rating (how many stars the reviewer gave to given POI) given the textual-content of the review using pretrained GLOVE word embeddings and a 2 layers deep GRU(LSTM) recurrent neural network. Visualize clusters on a 2D space using PCA and t-SNE.
yelp data: https://www.yelp.de/dataset_challenge
notebook used to load and explore basic properties of this dataset can be found here:
https://github.com/i008/nyyelp/blob/master/exploration.ipynb
The notebook for the NLP part of this project can be found here:
https://github.com/i008/nyyelp/blob/master/nlp.ipynb
- Process .text data from yelp review-documents
- filter to leave english-only reviews
- balance the dataset (use similar amount of data for each star rating (1,2,3,4,5*)
- balance the dataset by review length, so the length distribution of reviews for each rating-class is atleast similar (bucketing might be a good idea)
- Process text for deep-learning
- bring GLOVE embeddings to a reasonable form
- Tokenize each review
- transform reviews into tokenized sequences
- Pad sequences to a fixed length
- Prepare the embedding weight-matrix to be used in the Embedding layer (we will not train this layer)
- split into test and train
- Deep learning
- loss function, regression(rmse) and/or classification(logloss)
- prepare architecture (2 LSTM/GRU layers -> Dense)
- train, wish for the best
- Post learning
- extract features from last LSTM layer
- casting to 2D space using t-SNE
- Remarks
- we have to limit the amount of data, bc of memory issues possible solution is to save the sequences into HDF and flowing from disk during training
- For performance and memory reasons the final model was trained on a balanced and downsampled (by a factor of 10) subset of the reviews leaving us around 150,000 reviews (in total)
- Review data was tokenized (20,000 words)
- Max sequence length was set to 1000.
- Embedding dimensions were set to 100 (GLOVE-100d representation was used) wich means every words is a vector with shape (100,)
- The Embedding layer used GLOVE (http://nlp.stanford.edu/projects/glove/) weights. Training of this layer was disabled
- The objective function was logloss wich means that it was a classification task , altough regression would also be a valid option
- After the 3rd-4th epoch we can observe that the model starts to overfit
As we can see on the images above the model managed to learn (something) and achieved decent accuracy(~0.62).
Key takeways:
- Extreme(1,5) reviews are easier to classify and understand, people write them in a specific way.
- As we can see on fig1. The model makes mistakes almost only "by 1-star" this is a very good sign and leads to the conclusion that the "grduation" of the data was learned to some degree
t-SNE and PCA are dimmensionality reduction techniques, used on the features from last GRU layer they can give us an "2D" idea of the problem we are tackling.
- One epoch (~150,000 reviews) with a batch size of 128 reviews took around 15mins to train on a GTX1070.
docker run i008/nyyelp:latest python predict.py --review "this place really sucks, food is terrible"
number of stars: 1
docker run i008/nyyelp:latest python predict.py --review "i have mixed feeling about this place, on one hand its good on the other not really"
number of stars: 3
- Treat this problem as regression and see what happens
- Cluster GRU features using DBSCAN or similar methods, might give interesting results.
- Use more data to train the models.
- Balance the dataset considering the length of the review.
- Interactive plot for t-SNE (click on a data point should show the review plotly?)