This sentiment analysis tool uses the textcat module of the spaCy natural language processing library, in conjunction with a dataset of Tweets categorized as "positive" and "negative", to perform sentiment analysis. It represents one of my first forays into NLP and can be used as an example by anyone trying to learn.
-
Ensure you have Python installed
-
Clone this repository to your local machine using the following command:
git clone https://github.com/samnoyes/SentimentAnalysis.git
- Install the required dependencies. In terminal: navigate to the project directory and run the following command:
pip install -r requirements.txt
(pip3
if you are using python3)
This will install the necessary packages, including spaCy.
- Download spaCy's English language model by running the following command:
python -m spacy download en
(or python3
)
This will download the required language model for performing sentiment analysis.
python simple_sentiment_analysis.py <string to analyze>
(or python3
depending on your version)
Examples:
python simple_sentiment_analysis.py "i love this movie"
Output:
{'positive': 0.8196957111358643, 'negative': 0.18030422925949097}
python simple_sentiment_analysis.py "i hate this movie"
Output:
{'positive': 0.21716347336769104, 'negative': 0.7828364968299866}
- Downloaded Sentiment Analysis dataset from Kaggle: Twitter Sentiment Analysis
- Created the convert_csv_to_json.py script, which converted the dataset from .csv to .jsonl, with the help of ChatGPT (and some significant tweaks). Ran the .csv training dataset through this script.
- Got the convert.py script, which converted the dataset from .jsonl to .spacy, from this textcat demo project. Ran the .jsonl training dataset through this script twice to create dev.spacy and train.spacy.
- In the future, I would like to figure out how to convert the .csv directly to .spacy, without the intermediate step.
- The spaCy docs on training are a great resource.
- This tutorial on text classification by Catherine Breslin was very helpful.
Feel free to reuse this project to train based on your own dataset. It doesn't just have to be sentiment analysis: you can use this for other things such as text classification – for example, categorizing a news article based on the topic. If you have a csv dataset that you would like to use, you will need to follow these steps:
- Import it into the project directory and edit convert_csv_to_json.py lines 16-26 to process the data based on the format of your csv. Or, if you want to just follow along with what I did, feel free to just download the csv that I used from Kaggle: Twitter Sentiment Analysis
- Run
python convert_csv_to_json.py <dataset.csv> <dataset_jsonl.jsonl>
to convert it to jsonl (<dataset_jsonl.jsonl> is the output file, feel free to name it whatever you want). - Run
python convert.py 'en' <dataset_jsonl.jsonl> train.spacy
andpython convert.py 'en' <dataset_jsonl.jsonl> dev.spacy
to create the dev and train splits of your data, formatted for processing by spaCy. - Run
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy --output textcat_model
to train the model. The output will look something like this:
============================= Training pipeline =============================
ℹ Pipeline: ['textcat']
ℹ Initial learn rate: 0.001
E # LOSS TEXTCAT CATS_SCORE SCORE
--- ------ ------------ ---------- ------
0 0 0.25 50.77 0.51
0 200 47.06 68.29 0.68
0 400 43.01 70.78 0.71
0 600 39.88 72.59 0.73
0 800 38.64 74.03 0.74
0 1000 37.12 74.90 0.75
0 1200 35.03 75.58 0.76
0 1400 35.23 75.92 0.76
0 1600 34.22 76.62 0.77
0 1800 34.85 76.84 0.77
0 2000 32.66 77.30 0.77
0 2200 32.77 77.54 0.78
0 2400 32.39 77.72 0.78`
You may have to wait ~24 hours for this to run, although feel free to stop it once it is no longer improving.
- Run simple_sentiment_analysis.py with your input to check the results.
Please note that the datasets are not included in this repo. If you want to follow along with what I did, just download the CSVs from the dataset from Kaggle, and then follow the steps above in "Customization". Feel free to reach out to me with any feedback or suggestions, or just to let me know how you used this project!