Natural Language Processing with Disaster Tweets
This project is licensed under the GNU GPL v3.
Please have a look at the version history of each notebook.
Statistical models (see the full results table below):
- KMeans
- Logistic Regression, Stochastic Gradient Descent, Support Vector Machine
- Decision Tree, Random Forest, AdaBoost, Bagging, Gradient Boosting, XGBoost
- Multinomial & Complement Naive Bayes
- Multilayer Perceptrons
Deep learning models:
- RNN
- RNN with Attention
- CNN
- Multi-channel CNN with RNN - unidirectional & bidirectional
- Multi-channel CNN with RNN (concat) - unidirectional & bidirectional
- LLMs
View a training or testing script's help with this command:
python <script>.py --help
Note: use these scripts at your own risk, since I don't normally re-train models on my personal PC.
Different text preprocessing methods are used across my implementations, but most of them follow these steps (a minimal sketch follows the list):
- Removing emojis
- Removing html
- Removing URLs
- Removing punctuations
- Lowercasing and collapsing multiple spaces
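A minimal cleaning helper following these steps might look like the sketch below; the exact regexes, their order, and the helper name are my own assumptions rather than a copy of the notebooks:

```python
import re
import string

URL_RE = re.compile(r"https?://\S+|www\.\S+")
HTML_RE = re.compile(r"<.*?>")
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001FAFF"   # symbols, pictographs, emoticons
    "\U00002700-\U000027BF"   # dingbats
    "\U0001F1E6-\U0001F1FF"   # flags
    "]+",
    flags=re.UNICODE,
)

def clean_tweet(text: str) -> str:
    """Remove emojis, HTML tags, URLs and punctuation, then lowercase and collapse spaces."""
    text = EMOJI_RE.sub("", text)
    text = HTML_RE.sub("", text)
    text = URL_RE.sub("", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text.lower()).strip()

print(clean_tweet("Forest fire near La Ronge Sask. Canada <br> http://t.co/x 🔥"))
# -> forest fire near la ronge sask canada
```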
However, there are some exceptions where the pretrained model's own preprocessing is applied:
- BERTweet uses TweetTokenizer to mask and replace some tokens.
- Twitter RoBERTa Sentiment requires masking usernames and URLs as specific tokens (see the sketch below).
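For reference, the CardiffNLP Twitter RoBERTa models document a simple masking step on their model cards: user handles become `@user` and links become `http`. The helper below is my own sketch of that step (BERTweet's normalization via `TweetTokenizer` is analogous but uses its own placeholder tokens):

```python
def mask_for_twitter_roberta(text: str) -> str:
    """Replace user handles and links with the placeholder tokens
    used by the cardiffnlp/twitter-roberta-* checkpoints."""
    out = []
    for tok in text.split():
        if tok.startswith("@") and len(tok) > 1:
            tok = "@user"
        elif tok.startswith("http"):
            tok = "http"
        out.append(tok)
    return " ".join(out)

print(mask_for_twitter_roberta("@craig http://t.co/abc Huge sandstorm incoming"))
# -> @user http Huge sandstorm incoming
```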
- Training data: the full training set is used.
- Hyperparameters: `sklearn.model_selection.GridSearchCV` is used to automatically pick the best combination (an illustrative sketch follows).
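As an illustration of that tuning setup, the sketch below runs a grid search over a TF-IDF + Logistic Regression pipeline; the grid values, scoring choice and classifier are assumptions for illustration, not the exact grids from the notebooks:

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Kaggle "Natural Language Processing with Disaster Tweets" training file
train = pd.read_csv("train.csv")
texts, labels = train["text"], train["target"]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=5, n_jobs=-1)
search.fit(texts, labels)
print(search.best_params_, search.best_score_)
```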
Click to view
Hyperparameter | Value |
---|---|
Train:test | 8:2 |
Batch size (train/test) | 64/32 |
Learning rate | 1e-4 |
Embedding dim | 64 |
Epochs | 10 |
Vocab size | 10000 |
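Under these settings, a recurrent baseline could be wired up roughly as below. This is a minimal sketch assuming a Keras implementation; the recurrent width (64 units) is an assumption, since the table does not list it:

```python
import tensorflow as tf

VOCAB_SIZE, EMBED_DIM = 10_000, 64

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),  # width assumed
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           batch_size=64, epochs=10)   # 8:2 split, eval batch size 32
```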
Click to view
Hyperparameter | Value |
---|---|
Train:test | 8:2 |
Batch size (Train/test) | 64/32 |
Filter size | 100 |
Window size | [3, 4, 5] |
L2 regularization | 3 |
Dropout rate | 0.5 |
Dense unit | 64 |
Learning rate | 1e-4 |
Epochs | 100 |
Vocab size | 10000 |
Early stopping | 20 epochs |
Classification threshold | 0.5 |
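These values map naturally onto a Kim-style multi-window CNN. The sketch below assumes a Keras implementation; the embedding dimension and sequence length are assumptions, and the table's "L2 regularization = 3" is applied here as a kernel regularizer although the notebooks may instead use it as a max-norm constraint:

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

VOCAB_SIZE, EMBED_DIM, MAX_LEN = 10_000, 100, 50   # EMBED_DIM / MAX_LEN assumed

inputs = layers.Input(shape=(MAX_LEN,))
emb = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(inputs)

# One convolutional branch per window size, 100 filters each, max-pooled over time
pooled = []
for window in (3, 4, 5):
    conv = layers.Conv1D(100, window, activation="relu",
                         kernel_regularizer=regularizers.l2(3.0))(emb)
    pooled.append(layers.GlobalMaxPooling1D()(conv))

x = layers.Dropout(0.5)(layers.Concatenate()(pooled))
x = layers.Dense(64, activation="relu")(x)
outputs = layers.Dense(1, activation="sigmoid")(x)   # predict 1 if output >= 0.5

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="binary_crossentropy", metrics=["accuracy"])

early_stop = tf.keras.callbacks.EarlyStopping(patience=20, restore_best_weights=True)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           batch_size=64, epochs=100, callbacks=[early_stop])
```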
Click to view
CNN & RNN feed model:
Hyperparameter | Value |
---|---|
Train:test | 8:2 |
Batch size (Train/test) | 64/32 |
Recurrent units | 512 |
Filter size | 200 |
Window size | [1, 2, 3] |
Dropout rate | 0.5 |
Dense unit | 64 |
Learning rate | 1e-4 |
Epochs | 100 |
Vocab size | 10000 |
Early stopping | 20 epochs |
Classification threshold | 0.5 |
CNN & BiRNN feed model:
Hyperparameter | Value |
---|---|
Train:test | 8:2 |
Batch size (Train/test) | 64/32 |
Recurrent units | 512 |
Filter size | 200 |
Window size | [1, 2, 3] |
Dropout rate | 0.5 |
Dense unit | 64 |
Learning rate | 1e-4 |
Epochs | 100 |
Vocab size | 10000 |
Early stopping | 10 epochs |
Classification threshold | 0.5 |
CNN & RNN concat model:
Hyperparameter | Value |
---|---|
Train:test | 8:2 |
Batch size (Train/test) | 64/32 |
Recurrent units | 512 |
Filter size | 200 |
Window size | [1, 2, 3] |
Dropout rate | 0.5 |
Dense unit | 64 |
Learning rate | 1e-4 |
Epochs | 100 |
Vocab size | 10000 |
Early stopping | 5 epochs |
Classification threshold | 0.5 |
CNN & BiRNN concat model:
Hyperparameter | Value |
---|---|
Train:test | 8:2 |
Batch size (Train/test) | 64/32 |
Recurrent units | 512 |
Filter size | 200 |
Window size | [1, 2, 3] |
Dropout rate | 0.5 |
Dense unit | 64 |
Learning rate | 1e-4 |
Epochs | 100 |
Vocab size | 10000 |
Early stopping | 10 epochs |
Classification threshold | 0.5 |
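My reading of the "concat" variant is that the convolutional and recurrent channels encode the same embedded sequence and their features are concatenated before the classifier head; that wiring, plus the embedding dimension and sequence length, are assumptions of this Keras-style sketch rather than a definitive reconstruction of the notebooks:

```python
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE, EMBED_DIM, MAX_LEN = 10_000, 200, 50   # MAX_LEN assumed

inputs = layers.Input(shape=(MAX_LEN,))
emb = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(inputs)

# CNN channels: 200 filters for each window size 1, 2, 3
cnn_feats = [layers.GlobalMaxPooling1D()(layers.Conv1D(200, w, activation="relu")(emb))
             for w in (1, 2, 3)]

# Recurrent channel: 512-unit bidirectional LSTM over the same embedded sequence
rnn_feat = layers.Bidirectional(layers.LSTM(512))(emb)

# "concat" variant: join CNN and RNN features before the dense head
x = layers.Dropout(0.5)(layers.Concatenate()(cnn_feats + [rnn_feat]))
x = layers.Dense(64, activation="relu")(x)
outputs = layers.Dense(1, activation="sigmoid")(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="binary_crossentropy", metrics=["accuracy"])
```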
Click to view
Hyperparameter | Value |
---|---|
Train:dev:test ratio | 6:2:2 |
Batch size | 64 |
Learning rate | 2e-5 |
Weight decay | 0.01 |
Epochs | 50 |
Early stopping | 5 epochs |
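A fine-tuning run with these hyperparameters could be set up with the Hugging Face `Trainer` roughly as below; the checkpoint name, tokenization length and split mechanics are assumptions for illustration:

```python
import pandas as pd
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer, EarlyStoppingCallback)

MODEL_NAME = "bert-base-uncased"   # any checkpoint from the tables below
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# 6:2:2 split of the Kaggle training file ("text" / "target" columns)
df = pd.read_csv("train.csv").rename(columns={"target": "label"})
ds = Dataset.from_pandas(df[["text", "label"]])
ds = ds.map(lambda b: tokenizer(b["text"], truncation=True, max_length=64,
                                padding="max_length"), batched=True)
split = ds.train_test_split(test_size=0.4, seed=42)
dev_test = split["test"].train_test_split(test_size=0.5, seed=42)
train_ds, dev_ds = split["train"], dev_test["train"]

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    learning_rate=2e-5,
    weight_decay=0.01,
    num_train_epochs=50,
    eval_strategy="epoch",           # `evaluation_strategy` in older releases
    save_strategy="epoch",
    load_best_model_at_end=True,     # required for early stopping
    metric_for_best_model="eval_loss",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=dev_ds,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)],
)
trainer.train()
```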
Click to view
Some large LLMs cannot be trained with the hyperparameters in the LLMs section. To fit those models into the Kaggle GPU's memory, I reduced the batch size and learning rate to the following values:
Hyperparameter | Value |
---|---|
Train:dev:test ratio | 6:2:2 |
Batch size | 32 |
Learning rate | 1e-5 |
Weight decay | 0.01 |
Epochs | 50 |
Early stopping | 5 epochs |
All remaining hyperparameters stay the same as in the LLMs section.
Experiment setup: All experiments were conducted under the same Kaggle environment:
Configuration | Value |
---|---|
CPU | Intel Xeon 2.20 GHz, 4 vCPU cores |
Memory | 32 GB |
GPU | NVIDIA Tesla T4 (x2) (LLMs) or P100 (RNNs, CNNs) |
Random seed | 42 |
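One common way to pin that seed across libraries is sketched below; the helper itself is mine, and the PyTorch / Transformers runs would additionally call `torch.manual_seed(42)` or `transformers.set_seed(42)`:

```python
import os
import random

import numpy as np
import tensorflow as tf

def set_seed(seed: int = 42) -> None:
    """Fix the random seed for Python, NumPy and TensorFlow."""
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)

set_seed()
```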
Click to view
| Model family | Model | Vectorizer | Training configurations | Public F1 |
|---|---|---|---|---|
| | KMeans | TFIDF | [1] | 0.50658 |
| Linear Models | Logistic Regression | TFIDF | [1] | 0.80171 |
| | Stochastic Gradient Descent | TFIDF | [1] | 0.80386 |
| | Support Vector Machine | TFIDF | [1] | 0.80140 |
| | Random Forest | TFIDF | [1] | 0.78792 |
| AdaBoost | Decision Tree | TFIDF | [1] | 0.72847 |
| Bagging | Decision Tree | TFIDF | [1] | 0.74348 |
| | Decision Tree | TFIDF | [1] | 0.71069 |
| | Gradient Boosting | TFIDF | [1] | 0.73889 |
| | XGBoost | TFIDF | [1] | 0.74992 |
| Naive Bayes | Multinomial Naive Bayes | TFIDF | [1] | 0.80447 |
| | Complement Naive Bayes | TFIDF | [1] | 0.79589 |
| | Multilayer Perceptrons | TFIDF | [1] | 0.75911 |
Click to view
| Model family | Model (with paper link) | Parameters | Training configurations | Public F1 | Notes |
|---|---|---|---|---|---|
| RNN | 1-layer Bidirectional LSTM | 714,369 | [3] | 0.77352 | |
| | 2-layer stacked Bidirectional LSTM | 751,489 | [3] | 0.78026 | |
| | 1-layer Bidirectional GRU | 698,241 | [3] | 0.77536 | |
| | 2-layer stacked Bidirectional GRU | 725,249 | [3] | 0.77566 | |
| RNN + Attention | 1-layer Bidirectional LSTM + Dot Attention | 714,369 | [3] | 0.76892 | |
| | 1-layer Bidirectional GRU + Dot Attention | 698,241 | [3] | 0.78516 | |
| | 1-layer Bidirectional LSTM + General Attention | 730,881 | [3] | 0.77995 | |
| | 1-layer Bidirectional GRU + General Attention | 714,753 | [3] | 0.77719 | |
| | 1-layer Bidirectional LSTM + Concatenate Attention | 730,946 | [3] | 0.78148 | |
| | 1-layer Bidirectional GRU + Concatenate Attention | 714,818 | [3] | 0.77873 | |
| Deep CNN (random + pretrained embedding) | CNN non-static (random embedding) | 299,629 | [3] | 0.71345 | Embedding dimension = 25 (equal to the GloVe vector size) |
| | CNN static (glove-twitter-25) | 299,629 | [3] | 0.77689 | |
| | CNN static (glove-twitter-50) | 579,629 | [3] | 0.78700 | |
| | CNN static (glove-twitter-100) | 1,139,629 | [3] | 0.79374 | |
| | CNN static (glove-twitter-200) | 2,259,629 | [3] | 0.79711 | |
| | CNN static (fasttext-wiki-news-subwords-300) | 3,379,629 | [3] | 0.57033 | |
| | CNN non-static (glove-twitter-25) | 299,629 | [3] | 0.80478 | |
| | CNN non-static (glove-twitter-50) | 579,629 | [3] | 0.79619 | |
| | CNN non-static (glove-twitter-100) | 1,139,629 | [3] | 0.79987 | |
| | CNN non-static (glove-twitter-200) | 2,259,629 | [3] | 0.80140 | |
| | CNN non-static (fasttext-wiki-news-subwords-300) | 3,379,629 | [3] | 0.73980 | |
| Multi-channel CNN and RNN | Random embedding (static) + Unidirectional LSTM | 3,326,169 | [3] | 0.67391 | |
| | Random embedding (static) + Bidirectional LSTM | 4,411,609 | [3] | 0.68709 | |
| | Random embedding (static) + Unidirectional GRU | (todo) | [3] | (todo) | |
| | Random embedding (static) + Bidirectional GRU | (todo) | [3] | (todo) | |
| | GloVe (glove-twitter-25, static) + Unidirectional LSTM | 1,366,169 | [3] | 0.68372 | |
| | GloVe (glove-twitter-25, static) + Bidirectional LSTM | 2,451,609 | [3] | 0.78976 | |
| | GloVe (glove-twitter-50, static) + Unidirectional LSTM | 1,646,169 | [3] | 0.77781 | |
| | GloVe (glove-twitter-50, static) + Bidirectional LSTM | 2,731,609 | [3] | 0.78148 | |
| | GloVe (glove-twitter-100, static) + Unidirectional LSTM | 2,206,169 | [3] | 0.73460 | |
| | GloVe (glove-twitter-100, static) + Bidirectional LSTM | 3,291,609 | [3] | 0.78700 | |
| | GloVe (glove-twitter-200, static) + Unidirectional LSTM | 3,326,169 | [3] | 0.71835 | |
| | GloVe (glove-twitter-200, static) + Bidirectional LSTM | 4,411,609 | [3] | 0.76310 | |
| | Random embedding (nonstatic) + Unidirectional LSTM | 3,326,169 | [3] | 0.71284 | |
| | Random embedding (nonstatic) + Bidirectional LSTM | 4,411,609 | [3] | 0.75390 | |
| | Random embedding (nonstatic) + Unidirectional GRU | (todo) | [3] | (todo) | |
| | Random embedding (nonstatic) + Bidirectional GRU | (todo) | [3] | (todo) | |
| | GloVe (glove-twitter-25, nonstatic) + Unidirectional LSTM | 1,366,169 | [3] | 0.75942 | |
| | GloVe (glove-twitter-25, nonstatic) + Bidirectional LSTM | 2,451,609 | [3] | 0.79436 | |
| | GloVe (glove-twitter-50, nonstatic) + Unidirectional LSTM | 1,646,169 | [3] | 0.78240 | |
| | GloVe (glove-twitter-50, nonstatic) + Bidirectional LSTM | 2,731,609 | [3] | 0.79957 | |
| | GloVe (glove-twitter-100, nonstatic) + Unidirectional LSTM | 2,206,169 | [3] | 0.78700 | |
| | GloVe (glove-twitter-100, nonstatic) + Bidirectional LSTM | 3,291,609 | [3] | 0.76064 | |
| | GloVe (glove-twitter-200, nonstatic) + Unidirectional LSTM | 3,326,169 | [3] | 0.78179 | |
| | GloVe (glove-twitter-200, nonstatic) + Bidirectional LSTM | 4,411,609 | [3] | 0.77474 | |
| Multi-channel CNN and RNN (concat) | Random embedding (static) + Unidirectional LSTM | 3,772,121 | [3] | 0.78394 | Embedding dimension = 200 |
| | Random embedding (static) + Bidirectional LSTM | 5,265,113 | [3] | 0.78700 | |
| | Random embedding (static) + Unidirectional GRU | 3,408,601 | [3] | 0.78302 | |
| | Random embedding (static) + Bidirectional GRU | 4,538,073 | [3] | 0.77627 | |
| | GloVe (glove-twitter-25, static) + Unidirectional LSTM | 1,453,721 | [3] | 0.80110 | |
| | GloVe (glove-twitter-25, static) + Bidirectional LSTM | 2,588,313 | [3] | 0.79436 | |
| | GloVe (glove-twitter-25, static) + Unidirectional GRU | 1,179,801 | [3] | 0.80294 | |
| | GloVe (glove-twitter-25, static) + Bidirectional GRU | 2,040,473 | [3] | 0.79528 | |
| | GloVe (glove-twitter-50, static) + Unidirectional LSTM | 1,784,921 | [3] | 0.81091 | |
| | GloVe (glove-twitter-50, static) + Bidirectional LSTM | 2,970,713 | [3] | 0.81366 | |
| | GloVe (glove-twitter-50, static) + Unidirectional GRU | 1,498,201 | [3] | 0.80907 | |
| | GloVe (glove-twitter-50, static) + Bidirectional GRU | 2,397,273 | [3] | 0.80937 | |
| | GloVe (glove-twitter-100, static) + Unidirectional LSTM | 2,447,321 | [3] | 0.80539 | |
| | GloVe (glove-twitter-100, static) + Bidirectional LSTM | 3,735,513 | [3] | 0.81305 | |
| | GloVe (glove-twitter-100, static) + Unidirectional GRU | (todo) | [3] | (todo) | |
| | GloVe (glove-twitter-100, static) + Bidirectional GRU | 3,110,873 | [3] | 0.80907 | |
| | GloVe (glove-twitter-200, static) + Unidirectional LSTM | 3,772,121 | [3] | 0.80723 | |
| | GloVe (glove-twitter-200, static) + Bidirectional LSTM | 5,265,113 | [3] | 0.81152 | |
| | GloVe (glove-twitter-200, static) + Unidirectional GRU | 3,408,601 | [3] | (todo) | |
| | GloVe (glove-twitter-200, static) + Bidirectional GRU | 4,538,073 | [3] | 0.80815 | |
| | Random embedding (nonstatic) + Unidirectional LSTM | 3,772,121 | [3] | 0.74164 | |
| | Random embedding (nonstatic) + Bidirectional LSTM | 5,265,113 | [3] | 0.77444 | |
| | Random embedding (nonstatic) + Unidirectional GRU | 3,408,601 | [3] | 0.80171 | |
| | Random embedding (nonstatic) + Bidirectional GRU | 4,538,073 | [3] | 0.80049 | |
| | GloVe (glove-twitter-25, nonstatic) + Unidirectional LSTM | 1,453,721 | [3] | 0.80876 | |
| | GloVe (glove-twitter-25, nonstatic) + Bidirectional LSTM | 2,588,313 | [3] | 0.79834 | |
| | GloVe (glove-twitter-25, nonstatic) + Unidirectional GRU | 1,179,801 | [3] | 0.80815 | |
| | GloVe (glove-twitter-25, nonstatic) + Bidirectional GRU | 2,040,473 | [3] | 0.79650 | |
| | GloVe (glove-twitter-50, nonstatic) + Unidirectional LSTM | 1,784,921 | [3] | 0.80539 | |
| | GloVe (glove-twitter-50, nonstatic) + Bidirectional LSTM | 2,970,713 | [3] | 0.81213 | |
| | GloVe (glove-twitter-50, nonstatic) + Unidirectional GRU | 1,498,201 | [3] | 0.80968 | |
| | GloVe (glove-twitter-50, nonstatic) + Bidirectional GRU | 2,397,273 | [3] | 0.80386 | |
| | GloVe (glove-twitter-100, nonstatic) + Unidirectional LSTM | 2,447,321 | [3] | 0.81029 | |
| | GloVe (glove-twitter-100, nonstatic) + Bidirectional LSTM | 3,735,513 | [3] | 0.80968 | |
| | GloVe (glove-twitter-100, nonstatic) + Unidirectional GRU | 2,135,001 | [3] | 0.80570 | |
| | GloVe (glove-twitter-100, nonstatic) + Bidirectional GRU | 3,110,873 | [3] | 0.80815 | |
| | GloVe (glove-twitter-200, nonstatic) + Unidirectional LSTM | 3,772,121 | [3] | 0.80508 | |
| | GloVe (glove-twitter-200, nonstatic) + Bidirectional LSTM | 5,265,113 | [3] | 0.81182 | |
| | GloVe (glove-twitter-200, nonstatic) + Unidirectional GRU | 3,408,601 | [3] | 0.81244 | |
| | GloVe (glove-twitter-200, nonstatic) + Bidirectional GRU | 4,538,073 | [3] | 0.80999 | |
Click to view
| Model family | Model (with paper link) | Pretrain parameters | Training configurations | Public F1 | Notes |
|---|---|---|---|---|---|
| ALBERT | base-v1 | 11M (huggingface) | [2] | 0.80907 | View list of parameters by huggingface here |
| | large-v1 | 17M (huggingface) | [2] | 0.80416 | |
| | xlarge-v1 | 58M (huggingface) | [4] | 0.81182 | |
| | xxlarge-v1 | 223M (huggingface) | [4] | 0.78853 | |
| | base-v2 | 11M (huggingface) | [2] | 0.79528 | |
| | large-v2 | 17M (huggingface) | [2] | 0.81520 | |
| | xlarge-v2 | 58M (huggingface) | [4] | 0.81703 | |
| | xxlarge-v2 | 223M (huggingface) | [4] | 0.80570 | |
| BART | base | 140M (facebook-research) | [2] | 0.82684 | View list of parameters by facebook-research here |
| | large | 400M (facebook-research) | [2] | 0.83726 | |
| | large-mnli | 400M (facebook-research) | [2] | 0.83450 | |
| | large-cnn | 400M (facebook-research) | [2] | 0.82347 | |
| BERT | base uncased | 110M (huggingface) | [2] | 0.82899 | View list of parameters by huggingface here |
| | base cased | 110M (huggingface) | [2] | 0.81060 | |
| | large uncased | 340M (huggingface) | [2] | 0.83052 | |
| | large cased | 340M (huggingface) | [2] | 0.82194 | |
| | large uncased whole word masking | 335M (huggingface) | [2] | 0.82255 | |
| | large cased whole word masking | 336M (huggingface) | [2] | 0.81244 | |
| | multilingual uncased | 168M (huggingface) | [2] | 0.81887 | |
| | multilingual cased | 179M (huggingface) | [2] | 0.81918 | |
| BERTweet | base | 135M (vinai) | [2] | 0.83726 | View list of parameters by vinai here |
| | covid19-base-uncased | 135M (vinai) | [2] | 0.84002 | |
| | covid19-base-cased | 135M (vinai) | [2] | 0.82960 | |
| | large | 335M (vinai) | [2] | 0.82899 | |
| BORT | base | 56.1M (amazon) | [2] | 0.74563 | Parameters from the original paper |
| DeBERTa | base | 100M (microsoft) | [2] | 0.81642 | View list of parameters by microsoft here |
| | base-mnli | 86M (microsoft) | [2] | 0.80661 | |
| | large | 350M (microsoft) | [4] | 0.84308 | |
| | large-mnli | 350M (microsoft) | [4] | 0.83757 | |
| DeBERTa v3 | xsmall | 22M (microsoft) | [2] | 0.80815 | View list of parameters by microsoft here |
| | small | 44M (microsoft) | [2] | 0.82408 | |
| | base | 86M (microsoft) | [2] | 0.83205 | |
| | large | 304M (microsoft) | [4] | 0.82745 | |
| | mdeberta-v3-base | 86M (microsoft) | [2] | 0.82929 | |
| DistilBERT | base uncased | 66M (huggingface) | [2] | 0.82439 | View list of parameters by huggingface here |
| | base cased | 65M (huggingface) | [2] | 0.82163 | |
| | multilingual cased | 134M (huggingface) | [2] | 0.80049 | |
| ELECTRA (discriminator) | small | 14M (google) | [2] | 0.81887 | View list of parameters by google here |
| | base | 110M (google) | [2] | 0.82776 | |
| | large | 335M (google) | [2] | 0.83726 | |
| RoBERTa | base | 125M (huggingface) | [2] | 0.82868 | View list of parameters by huggingface here |
| | large | 355M (huggingface) | [2] | 0.84033 | |
| | distilroberta-large | 82M (huggingface) | [2] | 0.82960 | |
| SqueezeBERT | uncased | 51M (huggingface) | [2] | 0.80324 | View list of parameters by huggingface here |
| | mnli | 51M (huggingface) | [2] | 0.79987 | |
| | mnli-headless | 51M (huggingface) | [2] | 0.80416 | |
| Twitter RoBERTa Sentiment | base | N/A | [2] | 0.83389 | CardiffNLP has a huge list of Twitter pretrained models; these are just three of them. Try fine-tuning others (if you have time). |
| | base latest | N/A | [2] | 0.82776 | |
| | base 2021 | 124M (cardiffnlp) | [2] | 0.83083 | |
| XLM-RoBERTa | base | 270M (huggingface) | [2] | 0.82439 | View list of parameters by huggingface here |
| | large | 550M (huggingface) | [2] | 0.82500 | |
| XLNet | base cased | 110M (huggingface) | [2] | 0.82592 | View list of parameters by huggingface here |
| | large cased | 340M (huggingface) | [4] | 0.81612 | |