In this project, we used MuRIL (Multilingual Representations for Indian Languages), a multilingual BERT model, to perform sentiment analysis on Nepali text.
Sentiment analysis is the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. Sentiment analysis is widely applied to voice-of-the-customer materials such as reviews and survey responses, online and social media, and healthcare materials, for applications that range from marketing to customer service to clinical medicine. With the rise of deep language models such as RoBERTa, more difficult data domains can also be analyzed, e.g., news texts, where authors typically express their opinion or sentiment less explicitly.
Although a substantial body of work exists for other languages, very little has been carried out for Nepali. The major objective of this project is to perform sentence-level sentiment analysis on Nepali text and exploratory data analysis (EDA) on the available dataset.
Source of the dataset: NepCOV19Tweets, with 32,824 tweets in total:
- positive class: 14,823 samples
- neutral class: 4,591 samples
- negative class: 13,410 samples
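The counts above imply a clear class imbalance: the neutral class is roughly a third the size of either of the other two. A minimal sketch that derives each class's share from the listed counts (the variable names are illustrative and not taken from the project code):

```python
# Class counts as listed for the NepCOV19Tweets dataset above.
class_counts = {"positive": 14823, "neutral": 4591, "negative": 13410}

total = sum(class_counts.values())  # 32824, matching the stated dataset size

# Share of each class as a percentage of all tweets.
class_share = {label: 100 * n / total for label, n in class_counts.items()}

for label, share in class_share.items():
    print(f"{label:>8}: {class_counts[label]:>6} tweets ({share:.1f}%)")
```

With a distribution like this, per-class metrics (for example, macro-averaged F1) are usually more informative than accuracy alone.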
For this project, we used a deep-learning approach based on the MuRIL architecture. MuRIL (Multilingual Representations for Indian Languages) is a BERT model pre-trained on 17 Indian languages, including Nepali, and their transliterated counterparts. It uses the BERT base architecture and is pre-trained from scratch on the Wikipedia, Common Crawl, PMINDIA, and Dakshina corpora. We then fine-tuned the model on the Nepali COVID-19 tweets dataset for sentiment analysis.
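As a rough illustration of the starting point for such fine-tuning, the sketch below loads a public MuRIL checkpoint with a three-class classification head. It assumes the Hugging Face `transformers` library with PyTorch and the `google/muril-base-cased` checkpoint; the project itself may use a different framework or checkpoint. The label order and the function names are hypothetical.

```python
# Hypothetical label order; the project's actual id-to-label mapping is not shown here.
LABELS = ["negative", "neutral", "positive"]
# Public MuRIL checkpoint on the Hugging Face Hub (assumed, not confirmed by this project).
MODEL_NAME = "google/muril-base-cased"


def build_label_maps(labels):
    """Return (id2label, label2id) dictionaries for a list of class names."""
    id2label = dict(enumerate(labels))
    label2id = {name: idx for idx, name in id2label.items()}
    return id2label, label2id


def load_classifier(model_name=MODEL_NAME, labels=LABELS):
    """Load the tokenizer and MuRIL with a new, untrained classification head.

    Requires the `transformers` package and PyTorch; weights are downloaded on first use.
    """
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    id2label, label2id = build_label_maps(labels)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=len(labels),
        id2label=id2label,
        label2id=label2id,
    )
    return tokenizer, model
```

The classification head returned by `load_classifier()` is randomly initialised, so the model only produces meaningful sentiment predictions after it has been trained on the labelled tweets.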
git clone https://github.com/ashishlamsal/sentiment-analysis.git
cd sentiment-analysis
cd .\backend
python -m venv venv
.\venv\Scripts\activate
pip install -r requirements.txt
python main.py
Note: You need to put the fine-tuned MuRIL model in \backend\ml\sentiment-model\3\.
cd .\frontend
yarn install
Create a .env file inside the frontend directory and add the following environment variable:
VITE_APP_BASE_URL=http://localhost:8000/run/predict
Alternatively, if you are running the gradio backend application, you can use the following environment variable:
VITE_APP_BASE_URL=http://127.0.0.1:7860/run/predict
Finally, run the frontend application:
yarn run dev
Then open http://127.0.0.1:5173/ in your browser.
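To check the backend independently of the frontend, a small client can post text to the same URL that `VITE_APP_BASE_URL` points to. The sketch below uses only the Python standard library. It assumes the endpoint accepts the Gradio-style JSON body `{"data": [...]}`; the project's actual request and response schema is not shown here, and the function names are illustrative.

```python
import json
import urllib.request

# Same value as VITE_APP_BASE_URL in the frontend .env file.
BASE_URL = "http://localhost:8000/run/predict"


def build_request(text, url=BASE_URL):
    """Build a POST request with an assumed Gradio-style {"data": [...]} JSON body."""
    body = json.dumps({"data": [text]}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(text, url=BASE_URL, timeout=10):
    """Send the text to a running backend and return the decoded JSON response."""
    with urllib.request.urlopen(build_request(text, url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the backend running, `predict("नेपाल सुन्दर देश हो।")` returns the decoded response; pass `url="http://127.0.0.1:7860/run/predict"` to target the gradio application instead.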
Note that the gradio app inside backend/gradio uses a private model from Hugging Face. To use a private model from Hugging Face, you need to create a .env file inside the backend/gradio directory and add the following environment variable:
HUGGINGFACE_TOKEN=<your-huggingface-token>
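A missing token typically surfaces only as an authorization error when the private model is downloaded, so it can help to fail early with a clear message. The sketch below assumes the `.env` file has already been loaded into the process environment (for example by a dotenv loader); how the gradio app actually reads it is not shown here, and `get_hf_token` is a hypothetical helper.

```python
import os


def get_hf_token(env=os.environ):
    """Return the Hugging Face token from the environment, or raise a clear error."""
    token = env.get("HUGGINGFACE_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HUGGINGFACE_TOKEN is not set; add it to backend/gradio/.env "
            "or export it in your shell."
        )
    return token
```

The returned value can then be passed to whichever Hugging Face loading call the app uses.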
Distributed under the MIT License. See LICENSE for more information.
Ashish Lamsal, Janak Sharma
- Deep learning-based methods for sentiment analysis on Nepali COVID-19-related tweets
@article{sitaula2021deep, title={Deep learning-based methods for sentiment analysis on Nepali covid-19-related tweets}, author={Sitaula, Chiranjibi and Basnet, Anish and Mainali, A and Shahi, Tej Bahadur}, journal={Computational Intelligence and Neuroscience}, volume={2021}, year={2021}, publisher={Hindawi} }
- MuRIL: Multilingual Representations for Indian Languages
@misc{khanuja2021muril, title={MuRIL: Multilingual Representations for Indian Languages}, author={Simran Khanuja and Diksha Bansal and Sarvesh Mehtani and Savya Khosla and Atreyee Dey and Balaji Gopalan and Dilip Kumar Margam and Pooja Aggarwal and Rajiv Teja Nagipogu and Shachi Dave and Shruti Gupta and Subhash Chandra Bose Gali and Vish Subramanian and Partha Talukdar}, year={2021}, eprint={2103.10730}, archivePrefix={arXiv}, primaryClass={cs.CL} }