Data

Text data for 11 countries, taken from the 5_domestic_filter_Ngram_stopwords_lemmatize folder. Training countries: Canada, US, Singapore, Philippines, New Zealand, Nigeria, Ireland, and Bangladesh. Test countries: Australia, Kenya, and United Kingdom.

Processing

• Removed named entities: for the “article_text_Ngram” column in each dataset, applied the spaCy model “en_core_web_sm” to remove named entities such as personal names, countries, and places.
• Manually removed the named entities that the previous step missed.
• Removed stop words: applied stopwords.words('english') from the nltk package, producing a new column, “text_no_sw”.
• Added a “target” column to each dataset: 1 for peaceful countries, 0 for non-peaceful countries.
• Removed single letters and “#” with regular expressions.
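The automated cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the notebooks' actual code: the function names are hypothetical, and the spaCy pipeline and stop-word set are passed in as arguments so the block itself needs only the standard library.

```python
import re


def remove_entities(text, nlp):
    """Drop every token that the spaCy pipeline tagged as part of a named entity."""
    doc = nlp(text)
    # ent_type_ is an empty string for tokens outside any entity.
    return "".join(tok.text_with_ws for tok in doc if not tok.ent_type_)


def remove_stopwords(text, stop_words):
    """Keep only the whitespace-separated words that are not stop words."""
    return " ".join(w for w in text.split() if w.lower() not in stop_words)


def strip_single_letters(text):
    """Remove '#' characters and stand-alone single letters, then tidy spacing."""
    text = text.replace("#", " ")
    text = re.sub(r"\b[A-Za-z]\b", " ", text)
    return " ".join(text.split())
```

In practice `nlp` would be `spacy.load("en_core_web_sm")` and `stop_words` would be `set(stopwords.words("english"))` from `nltk.corpus`. The order of the three calls and the exact regular expression are assumptions.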

Training

• Grouped the training countries into four pairs according to the similarity of their demographic regions (Canada/US, Singapore/Philippines, New Zealand/Nigeria, Ireland/Bangladesh).
• For each pair, split the data into training and test sets (80/20 ratio).
• Created TF-IDF features for each text in the training set.
• Trained three models on the training set: Logistic Regression, Random Forest, and XGB.

Logistic Regression: uses the logistic function to model the probability of the binary dependent variable (peace level).
Random Forest: a bagging ensemble of decision trees that aggregates the trees' predictions to reduce variance and overfitting. A feature's importance is the decrease in node impurity, weighted by the probability of reaching that node.
XGB: Extreme Gradient Boosting of decision trees, in which each new tree is fit to the residuals of the previous model to reduce bias and underfitting.
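The per-pair training loop could look roughly like the sketch below, assuming scikit-learn. The function name `train_pair`, the hyperparameters, and the stratified split are illustrative assumptions rather than the settings used in the notebooks. XGB is left out to keep the dependencies to scikit-learn alone; `xgboost.XGBClassifier` follows the same fit/predict interface and could be added to the `models` dictionary.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def train_pair(texts, targets, seed=0):
    """Split 80/20, fit TF-IDF on the training split, then train and score each model."""
    X_train, X_test, y_train, y_test = train_test_split(
        texts, targets, test_size=0.2, stratify=targets, random_state=seed
    )

    # The vocabulary and IDF weights come from the training texts only.
    vectorizer = TfidfVectorizer()
    X_train_tfidf = vectorizer.fit_transform(X_train)
    X_test_tfidf = vectorizer.transform(X_test)

    models = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "random_forest": RandomForestClassifier(n_estimators=100, random_state=seed),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train_tfidf, y_train)
        scores[name] = accuracy_score(y_test, model.predict(X_test_tfidf))
    return vectorizer, models, scores
```

Here `texts` would be the “text_no_sw” column of one country pair and `targets` its “target” column.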

• Combined the eight training countries' datasets and repeated the same training process.
• Extracted the top 200 features and their importances from each model.
• Compared the performance of the three models across the five dataset choices (the four pairs plus the combined set).
• Used the trained models to predict on the Australia, Kenya, and United Kingdom datasets, as well as on other single-country datasets, and compared their performance.
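Ranking features by importance needs only a sort. The helper below is a hypothetical sketch: it assumes the feature names come from the fitted vectorizer (`get_feature_names_out()` in scikit-learn) and the scores from a tree model's `feature_importances_`. How importances were derived for Logistic Regression is not stated here, so that case is left open.

```python
def top_features(feature_names, importances, k=200):
    """Return the k (name, importance) pairs with the largest importance, best first."""
    ranked = sorted(
        zip(feature_names, importances),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:k]
```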

Results

Models trained on the Canada/US pair were more accurate on data from non-peaceful countries, while models trained on the New Zealand/Nigeria and Ireland/Bangladesh pairs were more accurate on data from peaceful countries. The Singapore/Philippines pair, however, tended to predict every text as peaceful, whether or not it came from a peaceful country. The models trained on all eight countries combined achieved higher accuracy on Australia's data (0.85 with Logistic Regression, 0.93 with Random Forest) and the United Kingdom's data (0.82 with Logistic Regression, 0.90 with Random Forest). Accuracy on Kenya's data was lower (0.60 with Logistic Regression, 0.46 with Random Forest). Overall, across the five dataset choices, accuracy was higher on peaceful countries' data than on non-peaceful countries' data.
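Because every text from a given country carries the same target value, accuracy on a single country's dataset is simply the share of its texts predicted as that country's label. A minimal sketch, with a hypothetical function name:

```python
def single_label_accuracy(predictions, true_label):
    """Accuracy on a dataset whose rows all share one true label (1 or 0)."""
    if not predictions:
        raise ValueError("predictions must not be empty")
    correct = sum(1 for p in predictions if p == true_label)
    return correct / len(predictions)
```

Under this measure, a model that labels every text as peaceful scores 1.0 on a peaceful country and 0.0 on a non-peaceful one, which is the pattern described for the Singapore/Philippines pair.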

Exploration

Because Random Forest had higher accuracy overall, we took the top 200 most important words and their feature importances from each trained pair and joined them with their peace level (1 or 0). Word clouds show how prominently these words appear.

The following two word clouds are from the training process on all eight countries together.


mylittlebecca/PEACE_level
