This is an application of "Attention-based LSTM for Aspect-level Sentiment Classification" by Yequan Wang, Minlie Huang, Li Zhao, and Xiaoyan Zhu.
The Attention-based LSTM with Aspect Embedding (ATAE-LSTM) is implemented in PyTorch to perform sentiment analysis on tweets posted during the COVID-19 pandemic.
For a detailed explanation, please refer to ATAE-LSTM Explain.
The main purpose of this project is to gain hands-on experience with Aspect-based Sentiment Analysis (ABSA) in PyTorch by evaluating the NLP model from the paper above.
I first tried the paper's original code. However, it was developed for the MindSpore framework and the Huawei Ascend 910 processor. Because of my hardware limitations, I could not get the code working on my Windows 10 or Mac machines with any of the following setups:
- Windows 10 + WSL2 (Ubuntu 20.04) + Cuda 11 + Pip
- Windows 10 + WSL2 (Ubuntu 20.04) + Cuda 11 + Docker
- Mac + Intel CPU
  - Not supported
In the end, I decided to implement the NLP model in PyTorch, given the learning purpose of the project and the looming deadline.
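To make the architecture concrete, here is a minimal PyTorch sketch of an ATAE-LSTM forward pass, following the formulation in the paper as I understand it. It is not the code in this repository: the class name `ATAELSTM`, the parameter names, the default of three sentiment classes, and the absence of padding masks are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


class ATAELSTM(nn.Module):
    """Illustrative ATAE-LSTM; names and defaults are hypothetical."""

    def __init__(self, vocab_size, num_aspects, embed_dim=25,
                 aspect_dim=25, hidden_dim=25, num_classes=3):
        super().__init__()
        self.word_embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # Aspect IDs are assumed to start at 1, so reserve index 0.
        self.aspect_embedding = nn.Embedding(num_aspects + 1, aspect_dim)
        # The LSTM input is the word vector concatenated with the aspect vector.
        self.lstm = nn.LSTM(embed_dim + aspect_dim, hidden_dim, batch_first=True)
        self.W_h = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.W_v = nn.Linear(aspect_dim, aspect_dim, bias=False)
        self.w = nn.Linear(hidden_dim + aspect_dim, 1, bias=False)
        self.W_p = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.W_x = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, tokens, aspect):
        # tokens: (batch, seq_len) word indices; aspect: (batch,) aspect IDs
        seq_len = tokens.size(1)
        words = self.word_embedding(tokens)                   # (B, N, E)
        asp = self.aspect_embedding(aspect)                   # (B, A)
        asp_rep = asp.unsqueeze(1).expand(-1, seq_len, -1)    # (B, N, A)
        H, (h_n, _) = self.lstm(torch.cat([words, asp_rep], dim=-1))  # (B, N, D)
        # Attention scores over hidden states, conditioned on the aspect.
        M = torch.tanh(torch.cat([self.W_h(H), self.W_v(asp_rep)], dim=-1))
        alpha = torch.softmax(self.w(M).squeeze(-1), dim=-1)  # (B, N)
        r = torch.bmm(alpha.unsqueeze(1), H).squeeze(1)       # (B, D)
        # Combine the attended representation with the last hidden state.
        h_star = torch.tanh(self.W_p(r) + self.W_x(h_n[-1]))
        return self.classifier(h_star), alpha


model = ATAELSTM(vocab_size=100, num_aspects=8)
tokens = torch.randint(1, 100, (4, 12))   # 4 made-up sentences of 12 tokens
aspect = torch.tensor([1, 2, 8, 5])       # one aspect ID per sentence
logits, alpha = model(tokens, aspect)
```

The sketch returns unnormalized class scores plus the attention weights, so it can be paired with `nn.CrossEntropyLoss`. A real implementation would also mask padded positions before the softmax.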
| Component | Version |
| --- | --- |
| Operating System | Windows 10 Version 21H2 |
| PyTorch | 1.10 + Pip + CUDA 11.3 |
| CUDA | 11.5 |
The word embeddings come from the pre-trained word vectors of Global Vectors for Word Representation (GloVe). The following pre-trained embeddings were explored:
- ./glove_embedding/glove.twitter.27B.25d.txt
- ./glove_embedding/glove.twitter.27B.200d.txt
- ./glove_embedding/glove.6B.200d.txt
- ./glove_embedding/glove.6B.300d.txt
You need to download these text files from Global Vectors for Word Representation (GloVe) and pass the path of the file you want to use to train_and_test.py.
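As a rough illustration of how such a file can be read, the sketch below parses GloVe's plain-text format, where each line holds a token followed by its space-separated vector components. The function name `load_glove` and its parameters are hypothetical and are not taken from train_and_test.py.

```python
def load_glove(lines, expected_dim):
    """Parse GloVe text lines into a {token: vector} dictionary.

    `lines` is any iterable of strings, for example an open file handle.
    Lines whose vector length differs from `expected_dim` are skipped.
    """
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) != expected_dim + 1:
            continue  # malformed line or wrong dimensionality
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# Tiny made-up sample in the same layout as a GloVe file.
sample = ["the 0.1 0.2 0.3\n", "bad 0.1\n"]
glove = load_glove(sample, expected_dim=3)

# With a real file, something like this should work:
# with open("./glove_embedding/glove.twitter.27B.25d.txt", encoding="utf-8") as f:
#     glove = load_glove(f, expected_dim=25)
```

Checking the dimension up front helps catch a mismatch between the chosen file and the --word_embedding_dim value early.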
The pre-processed dataset is in ./data/covid_dataset.csv. A sentence may have multiple aspects, but the ATAE-LSTM model takes only one aspect per input sentence. So, if a sentence has multiple aspects, I split it so that each row contains exactly one aspect of that sentence.
The aspects are defined in this dictionary: {"politics":1, "economy":2, "foreign":3, "culture":4, "situation":5, "measures":6, "racism":7, "overall":8}
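The splitting step can be sketched as follows. The aspect dictionary is the one given above; the row layout (`text`, `aspects`, `polarity`) and the example tweet are made up for illustration and may not match the actual columns of covid_dataset.csv.

```python
ASPECT_TO_ID = {"politics": 1, "economy": 2, "foreign": 3, "culture": 4,
                "situation": 5, "measures": 6, "racism": 7, "overall": 8}


def split_aspects(rows):
    """Expand each multi-aspect row into one row per (sentence, aspect) pair."""
    flat = []
    for row in rows:
        for aspect, polarity in row["aspects"].items():
            flat.append({
                "text": row["text"],
                "aspect_id": ASPECT_TO_ID[aspect],
                "polarity": polarity,
            })
    return flat


# Hypothetical input: one tweet annotated with two aspects.
rows = [{"text": "Lockdown hurts small shops but slows the spread.",
         "aspects": {"economy": "negative", "measures": "positive"}}]
flat_rows = split_aspects(rows)
```

Each output row repeats the sentence text, so the model sees the same tweet once per aspect with a different aspect ID.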
Run pip install -r requirements.txt to install all required packages. Use pip3 if you are using a Mac.
Run python ./train_and_test.py --data_path=./data/covid_dataset.csv --glove_path=./glove_embedding/glove.twitter.27B.25d.txt --batch_size=10 --epoch=10 --word_embedding_dim=25 --hidden_dim=25 to train and test the model.
If you want to experiment with different hyperparameters, you can modify the PowerShell script run.ps1 and run it. It writes the results to a text file in the result folder.
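For readers without PowerShell, a similar sweep can be sketched in Python. This is not a port of run.ps1: the helper `build_commands` and the grid values are hypothetical, and only the command-line flags are taken from the training command above.

```python
from itertools import product


def build_commands(data_path, glove_path, embedding_dim,
                   hidden_dims, batch_sizes, epochs):
    """Return one train_and_test.py argument list per hyperparameter combination."""
    commands = []
    for hidden_dim, batch_size, epoch in product(hidden_dims, batch_sizes, epochs):
        commands.append([
            "python", "./train_and_test.py",
            f"--data_path={data_path}",
            f"--glove_path={glove_path}",
            f"--batch_size={batch_size}",
            f"--epoch={epoch}",
            f"--word_embedding_dim={embedding_dim}",
            f"--hidden_dim={hidden_dim}",
        ])
    return commands


commands = build_commands(
    data_path="./data/covid_dataset.csv",
    glove_path="./glove_embedding/glove.twitter.27B.25d.txt",
    embedding_dim=25,
    hidden_dims=[25, 50],      # example grid values, chosen arbitrarily
    batch_sizes=[10, 20],
    epochs=[10],
)

# To actually run a command and capture its output in a file:
# import subprocess
# with open("result.txt", "w") as out:   # hypothetical output file name
#     subprocess.run(commands[0], stdout=out, check=True)
```

Building the argument lists separately from running them makes it easy to inspect the sweep before spending GPU time on it.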
The best overall accuracy was obtained with the following configuration:
- Word embedding dimension: 25
- Hidden dimension: 25
- Batch size: 10
- Epoch: 10
The overall accuracy is 71.48%.