
Understanding Comet code #20

Closed
joannacknight opened this issue Mar 19, 2024 · 14 comments

@joannacknight commented Mar 19, 2024

We need a better understanding of some aspects of the COMET code base.

A few things to consider:

  • Relationship between the hyperparameters encoder_model, pretrained_model and load_pretrained_weights
  • Is the loss argument actually used for anything? It seems that the sentence loss is hard-coded in the init_losses method.
  • Unfreezing the encoder - we want to make sure we are doing this correctly.
  • Batch size - how do we set this correctly?

As part of this issue we will work out which hyperparameters we want to search and which we want to be able to set ahead of training.

@joannacknight

Just logging here the hparams that are downloaded with CometKiwi from Huggingface:

activations: Tanh
batch_size: 4
class_identifier: unified_metric
dropout: 0.1
encoder_learning_rate: 1.0e-06
encoder_model: XLM-RoBERTa
final_activation: null
hidden_sizes:
- 3072
- 1024
input_segments:
- mt
- src
keep_embeddings_frozen: true
layer: mix
layer_norm: false
layer_transformation: sparsemax
layerwise_decay: 0.95
learning_rate: 1.5e-05
loss: mse
loss_lambda: 0.65
nr_frozen_epochs: 0.3
optimizer: AdamW
pool: avg
pretrained_model: microsoft/infoxlm-large
sent_layer: mix
train_data:
- data/1720-da.mlqe-src.csv
validation_data:
- data/wmt-ende-newstest2021.csv
- data/wmt-enru-newstest2021.csv
- data/wmt-zhen-newstest2021.csv
word_layer: 24
word_level_training: false
word_weights:
- 0.15
- 0.85

@joannacknight commented Mar 19, 2024

Suggest we keep a log of whether we think a parameter will be:

  • Default: taken directly from the hparams.yaml file and not changed
  • Override: fixed but overriding the value in the hparams.yaml file
  • Tuned: we will want to search values and fix one for all experiments
  • Experimental: we will want to vary between experiments

My assumptions are below; they may not be 100% correct, so please feel free to edit or discuss.

| Parameter | Default | Override | Tune | Experiment | Notes |
| --- | --- | --- | --- | --- | --- |
| activations | Y | N | N | N | Activation function used inside the regression head (Tanh); not the final_activation. |
| batch_size | N | N | Y | N | From discussion with James: use the largest batch size we can get away with. I assume we will keep this constant between experiments. |
| class_identifier | Y | N | N | N | Will override this when creating the model. |
| dropout | N | N | Y | N | Dropout in the regression head. |
| encoder_learning_rate | ? | N | ? | N | Learning rate used to fine-tune the encoder model. Should we tune this, or stick with the default? |
| encoder_model | Y | N | N | N | XLM-R. |
| final_activation | N | Y | N | N | The activation function used in the last layer of the regression head. We will change it to sigmoid to adapt the regression head for classification. |
| hidden_sizes | Y | N | N | N | Sizes of the hidden layers in the regression head. Assume we won't change this - or would there be any benefit to reviewing it? |
| input_segments | Y | N | N | N | Just the names of the two model inputs: "mt", "src". |
| keep_embeddings_frozen | N | N | ? | Y | If True, keeps the encoder's embedding layer frozen during training. Guess we will want to set it to False, but we might want to compare training time and performance with and without. |
| layer | Y | N | N | N | Encoder layer to be used for regression ('mix' pools info from all layers). Although this appears in the hparams file for the Huggingface model, I don't think it is actually used; in the code it looks like sent_layer is used to set the layer argument. |
| layer_norm | Y | N | N | N | Apply layer normalization to the encoder. Defaults to False. |
| layer_transformation | Y | N | N | N | Transformation applied when pooling info from all layers of the encoder. Default for UnifiedMetric is sparsemax. |
| layerwise_decay | Y | N | N | N | Learning rate % decay from top to bottom encoder layers. Default is 0.95. |
| learning_rate | ? | N | ? | N | Learning rate used to fine-tune the regression head. Should we tune this? See also encoder_learning_rate. |
| load_pretrained_weights | Y | N | N | N | If set to False, avoids loading the weights of the pretrained model (e.g. XLM-R) before loading the COMET checkpoint. |
| loss | N | Y | N | N | Not actually used (for the UnifiedMetric at least) to define the loss function; it is only passed through to wandb as the name of the loss function. We therefore want to override this value to cross-entropy and also override the method that actually sets the sentence-level loss function (see the sketch after the table). |
| loss_lambda | Y | N | N | N | Won't get applied as we are not doing word-level training; it is the weight assigned to the word-level loss relative to the sentence-level loss. |
| nr_frozen_epochs | N | N | Y | N | Number of epochs OR fraction of an epoch that the encoder is frozen. If the value is greater than one, the encoder is frozen for that number of epochs; if it is between 0 and 1, the encoder is frozen for that fraction of the first epoch. I guess this is a parameter we might want to tune? This issue reports worse performance when fine-tuning the encoder is left until later. |
| optimizer | Y | N | N | N | Default is AdamW, and I assume we will keep this fixed (unless there are any good reasons to change?). |
| pool | Y | N | N | N | Not a parameter that can be set when a UnifiedMetric object is created. There is a default value in the base CometModel class (avg), but it is only used in methods that the UnifiedMetric ignores; instead, the sentence embedding is hardcoded to be the CLS token. |
| pretrained_model | Y | N | N | N | The pre-trained model (i.e. checkpoint) to use from Huggingface. Won't actually be applied in our code as we will set load_pretrained_weights to False. |
| sent_layer | Y | N | N | N | Which encoder layer to use as input for the sentence-level task (the default, mix, pools across layers). |
| train_data | N | N | N | Y | List of paths to training data files (in csv format). |
| validation_data | N | N | N | Y | List of paths to validation data files. |
| word_layer | Y | N | N | N | See sent_layer. Won't get used as we are not training for word-level predictions. |
| word_level_training | Y | N | N | N | Determines whether the model does word-level training as well as sentence-level training. Set to False. |
| word_weights | Y | N | N | N | Can't actually find where this is used; it doesn't seem to be an argument that the UnifiedMetric class expects. In any case we are not training for word-level predictions, so we don't need these values. |
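
To make the loss override concrete, here is a rough sketch of the kind of subclass we might write. The method name init_losses comes from the discussion above; the attribute name self.sentloss, the import path and the class name CEDModel are my assumptions, so they need checking against the actual UnifiedMetric code.

```python
import torch.nn as nn

from comet.models import UnifiedMetric


class CEDModel(UnifiedMetric):  # hypothetical name for our classification model
    def init_losses(self) -> None:
        # Replace the hard-coded sentence-level MSE with a binary cross-entropy
        # loss for the classification task. Whether we want BCEWithLogitsLoss or
        # BCELoss depends on whether the sigmoid final_activation is applied
        # inside the regression head or left to the loss.
        self.sentloss = nn.BCEWithLogitsLoss()
```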

@radka-j commented Mar 19, 2024

According to the UnifiedMetric docstring:

load_pretrained_weights (Bool): If set to False it avoids loading the weights
of the pretrained model (e.g. XLM-R) before it loads the COMET checkpoint

Presumably there are cases where one wants to load the pretrained model weights and not the COMET checkpoint (I'm not sure whether this is an option with the current COMET implementation). It does not make sense to load both, so this should always be False when using the COMET checkpoint.

@joannacknight commented Mar 19, 2024

In the COMET code base there is a function load_from_checkpoint (located in comet/models/__init__.py) which we are overriding in order to create a model using our new class (the function load_qe_model_from_checkpoint in src/mtqe/models/comet.py).

This function has an option for whether or not to load the hparams file when creating a model object. So this is where we will want to load our own hparams file instead, or alternatively define the arguments for creating the object from a config file. This is also where we should set the training and validation data file paths, I think (they are currently set later in the code).

EDIT: from the PyTorch Lightning documentation: Any arguments specified through **kwargs will override args stored in "hyper_parameters"

The code is currently run so that the hparams are not reloaded. The defaults for the hyperparameters are therefore the defaults set when the class is created, not the values given in the hparams.yaml file. I think these are mostly (or all?) the same as the hparams file values I listed in the post above; only the class_identifier has to be supplied when creating an object, and that is something we will overwrite anyway for our new class. EDIT: my misunderstanding here - if the hparams file is not set to reload, it is simply loaded from wherever the checkpoint is stored.
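
For reference, a minimal sketch of the kwargs-override behaviour described in that EDIT, using a made-up CEDModel class and checkpoint path (not our actual code):

```python
from pytorch_lightning import LightningModule


class CEDModel(LightningModule):  # stand-in for our new model class
    def __init__(self, final_activation: str = None, learning_rate: float = 1.5e-5):
        super().__init__()
        # save_hyperparameters() is what puts these arguments under
        # "hyper_parameters" in the checkpoint file.
        self.save_hyperparameters()


# Anything passed as a kwarg here overrides the value stored under
# "hyper_parameters" in the checkpoint; anything omitted falls back to
# the stored value.
model = CEDModel.load_from_checkpoint(
    "checkpoints/example.ckpt",  # made-up path
    final_activation="sigmoid",
    learning_rate=1e-5,
)
```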

@radka-j commented Mar 19, 2024

It looks like you were right about encoder_model and pretrained_model.

The encoder gets initialized in the base model here. The encoder_model string is used to look up a COMET encoder class, and the pretrained_model is used to specify the HuggingFace weights to download for that encoder.
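
A small sketch of that split using transformers directly (not the COMET wrapper classes), just to illustrate why both hyperparameters are needed: the class defines the architecture, the checkpoint name defines the weights. InfoXLM shares the XLM-R architecture, which is why its checkpoint loads into the XLM-RoBERTa classes.

```python
from transformers import XLMRobertaModel

# encoder_model ("XLM-RoBERTa") decides which encoder class COMET wraps,
# while pretrained_model ("microsoft/infoxlm-large") decides which HuggingFace
# checkpoint is loaded into that architecture.
encoder = XLMRobertaModel.from_pretrained("microsoft/infoxlm-large")
```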

@radka-j commented Mar 19, 2024

For reference, the load_from_checkpoint function is a PyTorch Lightning method.

@joannacknight

A bit of a side note, but I just wanted to record that the random seed gets set in comet/cli/train.py, which we won't be using, so we will need to set the seed ourselves rather than pass it through to COMET.
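
A one-liner sketch of doing that ourselves with Lightning (I believe this is the same helper the COMET CLI relies on, but worth checking):

```python
from pytorch_lightning import seed_everything

# Seeds python, numpy and torch; workers=True also seeds dataloader workers.
# 42 is just a placeholder value.
seed_everything(42, workers=True)
```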

@joannacknight

I've gone through the parameters in the table and these are the ones that I think we might want to search / tune ahead of running the experiments:

  • batch_size
  • dropout
  • encoder_learning_rate
  • keep_embeddings_frozen
  • learning_rate
  • nr_frozen_epochs

@radka-j - do you agree? There may also be other parameters we want to tune for the Trainer object (e.g., number of epochs), but I'm only looking at the COMET model here.
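
To make that list concrete, something like the following (all values are illustrative placeholders, not proposals):

```python
# Candidate search space for the COMET hyperparameters listed above.
# Values are placeholders for discussion only.
search_space = {
    "batch_size": [16, 32, 64],
    "dropout": [0.1, 0.15, 0.2],
    "encoder_learning_rate": [1.0e-6, 5.0e-6, 1.0e-5],
    "keep_embeddings_frozen": [True, False],
    "learning_rate": [1.0e-5, 1.5e-5, 3.0e-5],
    "nr_frozen_epochs": [0.3, 0.5, 1.0],
}
```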

@radka-j commented Mar 21, 2024

Some notes and questions:

  • I'm surprised that the default sentence-level pool option is avg since in the paper they claim to use the CLS token - should we leave it or try to replicate the paper?
  • I am somewhat inclined not to touch the encoder_learning_rate as I think it might also involve changing the layerwise_decay (see the sketch after this list). Maybe it's worth reading up on what best practice is for fine-tuning BERT-style encoders?
  • For batch_size are [16, 32, 64, 128] good initial choices?
  • Do we have any recommendations on good ranges of values for the other hyperparameters to search over?
    • For dropout, does it matter that the estimator FNN has 3 layers (I think)?
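
On the encoder_learning_rate / layerwise_decay point, a rough sketch of how I understand the two interact (my reading of the idea, not code copied from COMET): the top encoder layer trains at encoder_learning_rate and each layer below it is scaled down by layerwise_decay.

```python
# Illustrative only: per-layer learning rates under layer-wise decay.
encoder_learning_rate = 1.0e-6
layerwise_decay = 0.95
num_layers = 24  # XLM-R large

for depth in range(num_layers):
    layer = num_layers - 1 - depth  # depth 0 = top layer
    lr = encoder_learning_rate * layerwise_decay**depth
    print(f"layer {layer:2d}: lr = {lr:.2e}")
```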

@radka-j commented Mar 21, 2024

Notes on the different encoders:

InfoXLM and XLM-R are both BERT-based models that use different pre-training strategies. Briefly:

  • RoBERTa uses the BERT architecture but trains it for longer and without the next sentence prediction objective
  • XLM-R extends RoBERTa to multiple languages (i.e., it also uses the masked language model objective)
  • Info-XLM starts from XLM-R weights and then continues training the model with a new objective presented in their paper

@joannacknight

  • Re. the pool parameter - we could ask a question on their GitHub repo? It is confusing me too, and it doesn't seem that it can be set when creating a UnifiedMetric object? I might have missed something.
  • Happy not to change the encoder_learning_rate - I hadn't thought of the relationship with layerwise_decay.
  • Yes - those seem like sensible choices for varying the batch_size.
  • I have not yet looked for recommendations on values to search over for the other parameters.

@radka-j commented Mar 22, 2024

Looking through the UnifiedMetric code, they hardcode the sentence embedding to be the CLS token. This confirms that the pool hyperparameter gets ignored.

It is true that there is a default value in the base CometModel class but this only gets used in the compute_sentence_embedding() method, which the UnifiedMetric ignores.
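
For anyone reading along, a generic PyTorch sketch of the two pooling strategies being discussed (not COMET's actual code):

```python
import torch

# Fake encoder output: (batch, seq_len, hidden) plus an attention mask.
token_embeddings = torch.randn(2, 10, 1024)
attention_mask = torch.ones(2, 10)

# "CLS" sentence embedding: the first token's representation, which is what
# UnifiedMetric effectively hardcodes.
cls_embedding = token_embeddings[:, 0, :]

# "avg" pooling: mean over non-padding tokens, the behaviour the unused
# pool hyperparameter would otherwise select.
mask = attention_mask.unsqueeze(-1)
avg_embedding = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
```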

@joannacknight

Looking at the method train_dataloader in the base.py file, I think we will need to make some further changes to load and mix more than one file of training data. This is not an issue right now (as we are just training on one language pair), but we will also want to try a model trained on all four language pairs.
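
A rough pandas sketch of what mixing several training files could look like (placeholder paths; not the actual train_dataloader code):

```python
import pandas as pd

# Placeholder paths, one csv per language pair.
train_paths = ["data/encs_train.csv", "data/ende_train.csv"]

# Read every file and shuffle them together into a single dataframe so all
# language pairs are mixed, rather than training on one csv at a time.
train_df = pd.concat([pd.read_csv(p) for p in train_paths], ignore_index=True)
train_df = train_df.sample(frac=1.0, random_state=42).reset_index(drop=True)
```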

@joannacknight

Closing this issue down as I feel we have completed the task we set out to do here. Issue #28 has been created to record our notes in a markdown file in the project, and issue #29 has been created to address my last comment about training on multiple datasets.
