This is the code for predicting the geolocation of tweets, trained on token frequencies using Decision Tree and Naïve Bayes classifiers.
In `util/preprocessing/merge.py`, `feature_filter` shows it drops single-character features like `[a, b, ..., n]`, and `merge` shows it intuitively merges similar features like `[aha, ahah, ..., ahahahaha]` and `[taco, tacos]`. `merge` also shows it uses Recursive Feature Elimination to rank the features and select the best 300, as sketched below.
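A minimal sketch of these preprocessing ideas, assuming token-frequency columns in a pandas DataFrame; the helper names (`drop_single_char_features`, `merge_similar_features`, `select_best_300`) and the `DecisionTreeClassifier` used as the RFE estimator are illustrative assumptions, not the repository's actual API:

```python
# A sketch only: helper names and the RFE estimator are assumptions.
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier


def drop_single_char_features(df: pd.DataFrame) -> pd.DataFrame:
    """Drop token columns whose name is a single character, e.g. 'a', 'b', ..., 'n'."""
    keep = [col for col in df.columns if len(str(col)) > 1]
    return df[keep]


def merge_similar_features(df: pd.DataFrame, groups: dict) -> pd.DataFrame:
    """Sum counts of intuitively similar tokens, e.g. {'taco': ['taco', 'tacos']}."""
    df = df.copy()
    for target, members in groups.items():
        present = [m for m in members if m in df.columns]
        if present:
            df[target] = df[present].sum(axis=1)
            df = df.drop(columns=[m for m in present if m != target])
    return df


def select_best_300(X: pd.DataFrame, y) -> pd.DataFrame:
    """Rank features with Recursive Feature Elimination and keep the top 300."""
    selector = RFE(DecisionTreeClassifier(), n_features_to_select=300)
    selector.fit(X, y)
    return X.loc[:, selector.support_]
```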
In `preprocess/merge.py`, `result_combination` shows it uses the results of Decision Tree, Random Forest, Bernoulli Naïve Bayes, Complement Naïve Bayes, and Multinomial Naïve Bayes to vote for the majority prediction (see the sketch below), while the following shows that the models can be slightly different each time they are trained.
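A hedged sketch of majority voting over those five classifiers; the function name and the per-row `Counter` vote are assumptions about how `result_combination` might work, not its actual implementation:

```python
# A sketch only: the per-row Counter vote is an assumption.
from collections import Counter

from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import BernoulliNB, ComplementNB, MultinomialNB
from sklearn.tree import DecisionTreeClassifier


def majority_vote_predict(X_train, y_train, X_test):
    models = [
        DecisionTreeClassifier(),
        RandomForestClassifier(),
        BernoulliNB(),
        ComplementNB(),
        MultinomialNB(),
    ]
    # Each classifier predicts independently on the test set.
    predictions = [m.fit(X_train, y_train).predict(X_test) for m in models]
    # For each test instance, the label predicted by the most models wins.
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]
```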
In `util/train.py`, `complement_nb` shows it uses bagging to generate multiple training datasets. `complement_nb` also shows it uses 42-Fold Cross Validation to generate multiple training datasets; both ideas are sketched below.
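A rough sketch of both ideas, assuming `X` and `y` are NumPy arrays of token counts and labels; the helper names and the number of bootstrap samples are illustrative assumptions, not the exact `complement_nb` implementation:

```python
# A sketch only: X and y are assumed to be NumPy arrays; n_bags is illustrative.
from sklearn.model_selection import KFold
from sklearn.naive_bayes import ComplementNB
from sklearn.utils import resample


def bagged_training_sets(X, y, n_bags=5):
    """Bagging: draw n_bags bootstrap samples, each the size of the original data."""
    return [resample(X, y) for _ in range(n_bags)]


def cross_validation_models(X, y, n_splits=42):
    """42-fold CV: each fold's training split acts as another training dataset."""
    models = []
    for train_idx, _ in KFold(n_splits=n_splits, shuffle=True).split(X):
        models.append(ComplementNB().fit(X[train_idx], y[train_idx]))
    return models
```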
In `util/train.py`, `complement_nb` also shows it uses GridSearchCV to generate multiple classifiers and select the best one based on accuracy, as in the sketch below.
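A minimal sketch of selecting the best `ComplementNB` with GridSearchCV scored on accuracy; the `alpha` grid and the 5-fold inner CV are assumptions for illustration:

```python
# A sketch only: the alpha grid and cv=5 are assumptions.
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import ComplementNB


def best_complement_nb(X, y):
    search = GridSearchCV(
        ComplementNB(),
        param_grid={"alpha": [0.1, 0.5, 1.0, 2.0]},
        scoring="accuracy",
        cv=5,
    )
    search.fit(X, y)
    # best_estimator_ is the candidate with the highest mean CV accuracy.
    return search.best_estimator_
```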
- See Eisenstein, Jacob, et al.
Although some files in `datasets` are larger than 50.00 MB, they were still added to `datasets` for convenience. (See http://git.io/iEPt8g )
- python3+
pip install -r requirements.txt
Note: The code removes old models and results on every run. MAKE SURE you have saved any models you want to keep.
python run.py -t datasets/train-best200.csv datasets/dev-best200.csv
The output would look like:
INFO:root:[*] Merging datasets/train-best200.csv
42%|████████ | 1006/2396 [00:05<00:20, 92.03 users/s]
...
...
[*] Saved models/0.8126_2019-10-02_20:02
[*] Accuracy: 0.8125955095803455
             precision    recall   f_score
California    0.618944  0.835128  0.710966
NewYork       0.899371  0.854647  0.876439
Georgia       0.788070  0.622080  0.695305
weighted      0.827448  0.812596  0.814974
python run.py -p models/ datasets/dev-best200.csv
The output would look like:
...
INFO:root:[*] Saved results/final_results.csv
INFO:root:[*] Time costs in seconds:
PredictTime_cost 11.98s
python run.py -s results/final_results.csv datasets/dev-best200.csv
The output would look like:
[*] Accuracy: 0.8224697308099213
             precision    recall   f_score
California    0.653035  0.852199  0.739441
NewYork       0.747993  0.647940  0.694381
Georgia       0.909456  0.858296  0.883136
weighted      0.833854  0.822470  0.824577
INFO:root:[*] Time costs in seconds:
ScoreTime_cost 1.48s
python run.py \
-t datasets/train-best200.csv datasets/dev-best200.csv \
-p models/ datasets/dev-best200.csv \
-s results/final_results.csv datasets/dev-best200.csv
python run.py -h
- sklearn for easily using Complement Naïve Bayes, feature selectors, and other learning tools.
- pandas and numpy for easily handling data.
- tqdm for showing loop progress.
- joblib for dumping/loading objects to/from disk.
- nltk for capturing word types for feature filtering.
See LICENSE file.