This is the final project for the Neural Representation Seminar SS21 @LMU
For this project we consider three transformer models, based on or identical to BERT, for sentiment classification. Two aspects are experimented with: the heavy class imbalance in the dataset used, and the classification algorithm, namely the classification head. All three models are finetuned on the task and a final cross evaluation is done on an external dataset.
The models considered are bert-base-uncased, distilbert-base-uncased and deberta-base-uncased. We use them in their pretrained form from Huggingface. For distilbert-base-uncased and deberta-base-uncased only the base encoder is used, as we attach a custom classification head for the experiments.
All models are fairly large but still trainable on a decent setup.
model | #parameters |
---|---|
distilbert-base-uncased | 66,957,317 |
bert-base-uncased | 109,486,085 |
deberta-base-uncased | 139,394,309 |
For distilbert-base-uncased and deberta-base-uncased the parameters are counted after adding the classification head.
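The counts above can be reproduced with a short snippet like the following (a minimal sketch assuming the standard Huggingface/PyTorch setup; the exact custom head is project-specific and not included here):

```python
import torch
from transformers import AutoModel

def count_parameters(model: torch.nn.Module) -> int:
    """Total number of parameters in a model."""
    return sum(p.numel() for p in model.parameters())

# Base encoder only; for distilbert/deberta the table above additionally
# includes the parameters of the custom classification head.
backbone = AutoModel.from_pretrained("distilbert-base-uncased")
print(count_parameters(backbone))
```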
The models are finetuned on the Movie Review Sentiment Analysis dataset from Kaggle, which is based on the Deep Learning for Sentiment Analysis dataset from Stanford, which in turn builds on the rottentomatoes.com data collected and published by Pang and Lee (2005). The original sentences were split into phrases and then hand-annotated with a sentiment ranging from 0 (negative) to 4 (positive).
train.tsv is of the form

```
PhraseId SentenceId Phrase Sentiment
64 2 This quiet , introspective and entertaining independent is worth seeking . 4
65 2 This quiet , introspective and entertaining independent 3
66 2 This 2
67 2 quiet , introspective and entertaining independent 4
68 2 quiet , introspective and entertaining 3
69 2 quiet 2
70 2 , introspective and entertaining 3
71 2 introspective and entertaining 3
72 2 introspective and 3
73 2 introspective 2
74 2 and 2
75 2 entertaining 4
76 2 independent 2
77 2 is worth seeking . 3
78 2 is worth seeking 4
79 2 is worth 2
80 2 worth 2
81 2 seeking 2
```
Of the 156,060 samples, the distribution of classes is, unsurprisingly, heavily dominated by the "neutral" class 2, which, unmitigated, would mean that simply picking 2 every time already guarantees around 50% accuracy.
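The imbalance and the resulting majority-class baseline can be checked directly on train.tsv, e.g. with pandas (a small sketch assuming the column names shown above):

```python
import pandas as pd

df = pd.read_csv("train.tsv", sep="\t")

# Relative frequency of each sentiment class.
distribution = df["Sentiment"].value_counts(normalize=True).sort_index()
print(distribution)

# Accuracy of always predicting the most frequent class (the "neutral" class 2).
print(f"majority-class baseline accuracy: {distribution.max():.2f}")
```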
To measure the effects of the experiments, an external dataset is used: the Amazon Review Dataset in the category Movies and TV, obtained from https://cseweb.ucsd.edu/~jmcauley/datasets.html. The data has the form
```
Id Sentiment Phrase
11 1.0 "Thin plot. Predictable ending. Beautiful setting. Recent hurricanes certainly didn't look this romantic. Shouldn't make anyone want to ""ride it out."""
12 2.0 Oldie but goodie
```
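For the cross evaluation the review scores have to live on the same 0-4 scale as the Kaggle labels. One plausible preprocessing step is sketched below; the file name and the exact mapping are assumptions for illustration, not necessarily what was used here:

```python
import pandas as pd

# Assumption: the Amazon reviews carry 1.0-5.0 star ratings; shifting them by one
# puts them onto the 0-4 sentiment scale used by the Kaggle data.
amazon = pd.read_csv("amazon_movies_tv.tsv", sep="\t")  # illustrative file name
amazon["Sentiment"] = amazon["Sentiment"].astype(int) - 1
```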
Traditional methods for mitigating the problems arising from unbalanced data include simple strategies like up- and downsampling, or more sophisticated ones like SMOTE. While upsampling will inevitably lead to overfitting, downsampling reduces overall performance because it throws away (possibly most of) the data. SMOTE tries to mitigate this by synthesizing data from nearest neighbours of datapoints in the minority classes. While this might work with low-dimensional representations, it becomes computationally difficult with high-dimensional token representations.
To address the problems arising from up-/downsampling we try an, admittedly naive, compromise strategy: pick the class with the median size, upsample the smaller classes and downsample the larger classes to that size.
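A minimal sketch of this compromise strategy (assuming a pandas DataFrame with a Sentiment column; sklearn's resample does the actual up-/downsampling):

```python
import pandas as pd
from sklearn.utils import resample

def resample_to_median(df: pd.DataFrame, label_col: str = "Sentiment", seed: int = 42) -> pd.DataFrame:
    """Up-/downsample every class to the median class size."""
    counts = df[label_col].value_counts()
    target = int(counts.median())
    parts = []
    for label, group in df.groupby(label_col):
        # replace=True upsamples minority classes, replace=False downsamples majority classes.
        parts.append(resample(group, n_samples=target, replace=len(group) < target, random_state=seed))
    # Shuffle the concatenated classes before training.
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)
```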
When using a transformer output for classification there are two traditional design choices: either use the representation of the [CLS] token denoting the beginning of the sequence, or average all token representations as input for the classification head. Since the original data does not include a [CLS] token, a variant that simply uses the first token is also tried.
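The custom head sits on top of the encoder's token representations and roughly corresponds to the following sketch (layer sizes and dropout are assumptions, not the exact configuration used):

```python
import torch
from torch import nn

class ClassificationHead(nn.Module):
    """Head over the transformer output: first ([CLS]) token or mean over all tokens."""

    def __init__(self, hidden_size: int = 768, num_labels: int = 5,
                 pooling: str = "cls", dropout: float = 0.1):
        super().__init__()
        self.pooling = pooling
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        if self.pooling == "cls":
            pooled = hidden_states[:, 0]                 # representation of the first ([CLS]) token
        else:
            mask = attention_mask.unsqueeze(-1).float()  # ignore padding when averaging
            pooled = (hidden_states * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
        return self.classifier(self.dropout(pooled))
```

With `pooling="cls"` only the first token representation feeds the linear layer; otherwise the mask-aware mean over all token representations does.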
The training set is processed by adding a [CLS] token where needed and then split into a stratified train- and testset with a ratio of 0.1. The maximum sequence length was limited to 150 tokens due to GPU memory constraints. After sampling was performed, training was done for 4 epochs. This should be increased, as test accuracy had not yet peaked, but with one DeBERTa epoch taking around 45 minutes on an Intel i9-9900K, an NVIDIA RTX 2070 (8 GB) and 64 GB RAM, compute posed a limiting factor.
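The split and tokenization step looks roughly like this (a sketch assuming sklearn and the Huggingface tokenizer; `df` stands for the resampled training DataFrame and the variable names are illustrative):

```python
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer

# Stratified 90/10 split on the phrase-level labels.
train_df, test_df = train_test_split(
    df, test_size=0.1, stratify=df["Sentiment"], random_state=42
)

# Tokenize with the sequence length capped at 150 (GPU memory constraint);
# the BERT tokenizer adds the [CLS] token automatically.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encodings = tokenizer(
    train_df["Phrase"].tolist(),
    truncation=True, max_length=150, padding="max_length", return_tensors="pt",
)
```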
set | sampling | size |
---|---|---|
train | down | 31824 |
test | down | 3536 |
evaluation | down | 14240 |
train | middle | 122728 |
test | middle | 13637 |
train | up | 358119 |
test | up | 35810 |
The datasets used for training and evaluation differ quite a bit: the maximum length of the Kaggle data is 80 tokens, while the Amazon data exceeds 400 tokens. We therefore consider changes in the results rather than absolute values when interpreting the evaluation scores.
Only distilbert-base-uncased and deberta-base-uncased have custom classification heads; therefore averaging over the outputs is only reported for those.
model | sampling | classification | testset accuracy | cross evaluation accuracy |
---|---|---|---|---|
BERT | down | cls | 0.68 | 0.44 |
DistilBERT | down | cls | 0.67 | 0.41 |
DeBERTa | down | cls | 0.68 | 0.46 |
BERT | middle | cls | 0.77 | 0.45 |
DistilBERT | middle | cls | 0.76 | 0.43 |
DeBERTa | middle | cls | 0.77 | 0.46 |
BERT | middle | no_cls | 0.77 | 0.44 |
DistilBERT | middle | no_cls | 0.77 | 0.44 |
DeBERTa | middle | no_cls | 0.77 | 0.47 |
BERT | middle | avg | 0.78 | 0.45 |
DistilBERT | middle | avg | 0.77 | 0.44 |
DeBERTa* | middle | avg | 0.73 | 0.45 |
BERT | up | no_cls | 0.87 | 0.45 |
DistilBERT | up | no_cls | 0.86 | 0.42 |
DeBERTa | up | no_cls | 0.86 | 0.47 |
\* trained for only one epoch, due to a memory error
Unfortunately, none of my hypotheses seem to hold. The scores on the testset as well as in the cross evaluation appear to be mostly correlated with LM size. The cross-evaluation results do not indicate a noticeable degree of overfitting, which would have been an expected and almost certain consequence of upsampling, while testset accuracy is, as expected, strongly correlated with dataset size, with no indication that the different classification methods had any impact.