
Understanding Comet code #20

Closed
joannacknight opened this issue Mar 19, 2024 · 14 comments

@joannacknight commented Mar 19, 2024

We need a better understanding of some aspects of the COMET code base.

A few things to consider:

  • Relationship between the hyperparameters encoder_model, pretrained_model and load_pretrained_weights
  • Is the loss argument actually used for anything? It seems that the sentence loss is hard-coded in the init_losses method.
  • Unfreezing the encoder - we want to make sure we are doing this correctly.
  • Batch size - how do we set this correctly?

As part of this issue we will work out which hyperparameters we want to search and which we want to be able to set ahead of training.

@joannacknight

Just logging here the hparams that are downloaded with CometKiwi from Huggingface:

activations: Tanh
batch_size: 4
class_identifier: unified_metric
dropout: 0.1
encoder_learning_rate: 1.0e-06
encoder_model: XLM-RoBERTa
final_activation: null
hidden_sizes:
- 3072
- 1024
input_segments:
- mt
- src
keep_embeddings_frozen: true
layer: mix
layer_norm: false
layer_transformation: sparsemax
layerwise_decay: 0.95
learning_rate: 1.5e-05
loss: mse
loss_lambda: 0.65
nr_frozen_epochs: 0.3
optimizer: AdamW
pool: avg
pretrained_model: microsoft/infoxlm-large
sent_layer: mix
train_data:
- data/1720-da.mlqe-src.csv
validation_data:
- data/wmt-ende-newstest2021.csv
- data/wmt-enru-newstest2021.csv
- data/wmt-zhen-newstest2021.csv
word_layer: 24
word_level_training: false
word_weights:
- 0.15
- 0.85

@joannacknight commented Mar 19, 2024

Suggest we keep a log of whether we think a parameter will be:

  • Default: taken directly from the hparams.yaml file and not changed
  • Override: fixed but overriding the value in the hparams.yaml file
  • Tuned: we will want to search values and fix one for all experiments
  • Experimental: we will want to vary between experiments

My assumptions are below; they may not be 100% correct, so please feel free to edit or discuss.

| Parameter | Default | Override | Tune | Experiment | Notes |
| --- | --- | --- | --- | --- | --- |
| activations | Y | N | N | N | Activation function used inside the regression head (Tanh); not the final_activation. |
| batch_size | N | N | Y | N | From discussion with James: use the largest batch size we can get away with. I assume we will keep this constant between experiments. |
| class_identifier | Y | N | N | N | Will override this when creating the model. |
| dropout | N | N | Y | N | Dropout in the regression head. |
| encoder_learning_rate | ? | N | ? | N | Learning rate used to fine-tune the encoder model. Should we tune this, or stick with the default? |
| encoder_model | Y | N | N | N | XLM-R. |
| final_activation | N | Y | N | N | The activation function used in the last layer of the regression head. We will change it to sigmoid to adapt the regression head for classification. |
| hidden_sizes | Y | N | N | N | Sizes of the hidden layers in the regression head. Assume we won't change this - or would there be any benefit to reviewing it? |
| input_segments | Y | N | N | N | Just the names of the two model inputs: "mt", "src". |
| keep_embeddings_frozen | N | N | ? | Y | If True, keeps the encoder's embedding layer frozen during training. Guess we will want to set it to False, but we might want to compare training time and performance with and without. |
| layer | Y | N | N | N | Encoder layer to be used for regression ('mix' pools info from all layers). Although this appears in the hparams file for the Huggingface model, I don't think it is actually used; in the code it looks like sent_layer is used to set the layer argument. |
| layer_norm | Y | N | N | N | Apply layer normalization to the encoder. Defaults to False. |
| layer_transformation | Y | N | N | N | Transformation applied when pooling info from all layers of the encoder. Default for UnifiedMetric is sparsemax. |
| layerwise_decay | Y | N | N | N | Learning rate % decay from top to bottom encoder layers. Default is 0.95. |
| learning_rate | ? | N | ? | N | Learning rate used to fine-tune the regression head. Should we tune this? See also encoder_learning_rate. |
| load_pretrained_weights | Y | N | N | N | If set to False, avoids loading the weights of the pretrained model (e.g. XLM-R) before loading the COMET checkpoint. |
| loss | N | Y | N | N | Not actually used (for the UnifiedMetric at least) to define the loss function; it is only passed through to wandb as the name of the loss function. We therefore want to override this value to cross-entropy and also override the method that actually sets the sentence-level loss function (see the sketch after the table). |
| loss_lambda | Y | N | N | N | Won't get applied as we are not doing word-level training; it is the weight assigned to the word-level loss relative to the sentence-level loss. |
| nr_frozen_epochs | N | N | Y | N | Number of epochs OR fraction of an epoch that the encoder is frozen. If the value is greater than one, the encoder is frozen for that number of epochs; if it is between 0 and 1, the encoder is frozen for that fraction of the first epoch. I guess this is a parameter we might want to tune? This issue reports worse performance when fine-tuning the encoder is left until later. |
| optimizer | Y | N | N | N | Default is AdamW, and I assume we will keep this fixed (unless there are any good reasons to change?). |
| pool | Y | N | N | N | Not a parameter that can be set when a UnifiedMetric object is created. There is a default value in the base CometModel class (avg), but it is only used in methods that the UnifiedMetric ignores; instead, the sentence embedding is hardcoded to be the CLS token. |
| pretrained_model | Y | N | N | N | The pre-trained model (i.e. checkpoint) to use from Huggingface. Won't actually be applied in our code as we will set load_pretrained_weights to False. |
| sent_layer | Y | N | N | N | Which encoder layer to use as input for the sentence-level task (the default, mix, pools across layers). |
| train_data | N | N | N | Y | List of paths to training data files (in csv format). |
| validation_data | N | N | N | Y | List of paths to validation data files. |
| word_layer | Y | N | N | N | See sent_layer. Won't get used as we are not training for word-level predictions. |
| word_level_training | Y | N | N | N | Determines whether the model does word-level training as well as sentence-level training. Set to False. |
| word_weights | Y | N | N | N | Can't actually find where this is used; it doesn't seem to be an argument that the UnifiedMetric class expects. In any case we are not training for word-level predictions, so we don't need these values. |
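
To make the loss override concrete, here is a rough sketch of the kind of subclass we might write. The method name init_losses comes from the discussion above; the attribute name self.sentloss, the import path and the class name CEDModel are my assumptions, so they need checking against the actual UnifiedMetric code.

```python
import torch.nn as nn

from comet.models import UnifiedMetric


class CEDModel(UnifiedMetric):  # hypothetical name for our classification model
    def init_losses(self) -> None:
        # Replace the hard-coded sentence-level MSE with a binary cross-entropy
        # loss for the classification task. Whether we want BCEWithLogitsLoss or
        # BCELoss depends on whether the sigmoid final_activation is applied
        # inside the regression head or left to the loss.
        self.sentloss = nn.BCEWithLogitsLoss()
```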

@radka-j commented Mar 19, 2024

According to the UnifiedMetric docstring:

load_pretrained_weights (Bool): If set to False it avoids loading the weights
of the pretrained model (e.g. XLM-R) before it loads the COMET checkpoint

Presumably there are cases where one wants to load the pretrained model weights and not the COMET checkpoint (I'm not sure whether this is an option with the current COMET implementation). It does not make sense to load both, so this should always be False when using the COMET checkpoint.

@joannacknight commented Mar 19, 2024

In the COMET code base there is a function load_from_checkpoint (located in comet/models/__init__.py) which we are overriding in order to create a model using our new class (the function load_qe_model_from_checkpoint in src/mtqe/models/comet.py).

This function has an option for whether or not to load the hparams file when creating a model object. So this is where we will want to load our own hparams file instead, or alternatively define the arguments for creating the object from a config file. This is also where we should set the training and validation data file paths, I think (they are currently set later in the code).

EDIT: from the PyTorch Lightning documentation: Any arguments specified through **kwargs will override args stored in "hyper_parameters"

The code is currently run so that the hparams are not reloaded. The defaults for the hyperparameters are therefore the defaults set when the class is created, not the values given in the hparams.yaml file. I think these are mostly (or all?) the same as the hparams file values I listed in the post above; only the class_identifier has to be supplied when creating an object, and that is something we will overwrite anyway for our new class. EDIT: my misunderstanding here - if the hparams file is not set to reload, it is simply loaded from wherever the checkpoint is stored.
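
For reference, a minimal sketch of the kwargs-override behaviour described in that EDIT, using a made-up CEDModel class and checkpoint path (not our actual code):

```python
from pytorch_lightning import LightningModule


class CEDModel(LightningModule):  # stand-in for our new model class
    def __init__(self, final_activation: str = None, learning_rate: float = 1.5e-5):
        super().__init__()
        # save_hyperparameters() is what puts these arguments under
        # "hyper_parameters" in the checkpoint file.
        self.save_hyperparameters()


# Anything passed as a kwarg here overrides the value stored under
# "hyper_parameters" in the checkpoint; anything omitted falls back to
# the stored value.
model = CEDModel.load_from_checkpoint(
    "checkpoints/example.ckpt",  # made-up path
    final_activation="sigmoid",
    learning_rate=1e-5,
)
```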

@radka-j commented Mar 19, 2024

It looks like you were right about encoder_model and pretrained_model.

The encoder gets initialized in the base model here. The encoder_model string is used to look up a COMET encoder class, and the pretrained_model is used to specify the HuggingFace weights to download for that encoder.
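
A small sketch of that split using transformers directly (not the COMET wrapper classes), just to illustrate why both hyperparameters are needed: the class defines the architecture, the checkpoint name defines the weights. InfoXLM shares the XLM-R architecture, which is why its checkpoint loads into the XLM-RoBERTa classes.

```python
from transformers import XLMRobertaModel

# encoder_model ("XLM-RoBERTa") decides which encoder class COMET wraps,
# while pretrained_model ("microsoft/infoxlm-large") decides which HuggingFace
# checkpoint is loaded into that architecture.
encoder = XLMRobertaModel.from_pretrained("microsoft/infoxlm-large")
```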

@radka-j commented Mar 19, 2024

For reference, the load_from_checkpoint function is a PyTorch Lightning method.

@joannacknight

A bit of a side note, but I just wanted to record that the random seed gets set in comet/cli/train.py, which we won't be using, so we will need to set the seed ourselves rather than pass it through to COMET.
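
A one-liner sketch of doing that ourselves with Lightning (I believe this is the same helper the COMET CLI relies on, but worth checking):

```python
from pytorch_lightning import seed_everything

# Seeds python, numpy and torch; workers=True also seeds dataloader workers.
# 42 is just a placeholder value.
seed_everything(42, workers=True)
```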

@joannacknight

I've gone through the parameters in the table and these are the ones that I think we might want to search / tune ahead of running the experiments:

  • batch_size
  • dropout
  • encoder_learning_rate
  • keep_embeddings_frozen
  • learning_rate
  • nr_frozen_epochs

@radka-j - do you agree? There may also be other parameters we want to tune for the Trainer object (e.g., number of epochs), but I'm only looking at the COMET model here.
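
To make that list concrete, something like the following (all values are illustrative placeholders, not proposals):

```python
# Candidate search space for the COMET hyperparameters listed above.
# Values are placeholders for discussion only.
search_space = {
    "batch_size": [16, 32, 64],
    "dropout": [0.1, 0.15, 0.2],
    "encoder_learning_rate": [1.0e-6, 5.0e-6, 1.0e-5],
    "keep_embeddings_frozen": [True, False],
    "learning_rate": [1.0e-5, 1.5e-5, 3.0e-5],
    "nr_frozen_epochs": [0.3, 0.5, 1.0],
}
```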

@radka-j commented Mar 21, 2024

Some notes and questions:

  • I'm surprised that the default sentence-level pool option is avg since in the paper they claim to use the CLS token - should we leave it or try to replicate the paper?
  • I am somewhat inclined not to touch the encoder_learning_rate as I think it might also involve changing the layerwise_decay (see the sketch after this list). Maybe it's worth reading up on what best practice is for fine-tuning BERT-style encoders?
  • For batch_size are [16, 32, 64, 128] good initial choices?
  • Do we have any recommendations on good ranges of values for the other hyperparameters to search over?
    • For dropout, does it matter that the estimator FNN has 3 layers (I think)?
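
On the encoder_learning_rate / layerwise_decay point, a rough sketch of how I understand the two interact (my reading of the idea, not code copied from COMET): the top encoder layer trains at encoder_learning_rate and each layer below it is scaled down by layerwise_decay.

```python
# Illustrative only: per-layer learning rates under layer-wise decay.
encoder_learning_rate = 1.0e-6
layerwise_decay = 0.95
num_layers = 24  # XLM-R large

for depth in range(num_layers):
    layer = num_layers - 1 - depth  # depth 0 = top layer
    lr = encoder_learning_rate * layerwise_decay**depth
    print(f"layer {layer:2d}: lr = {lr:.2e}")
```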

@radka-j commented Mar 21, 2024

Notes on the different encoders:

InfoXLM and XLM-R are both BERT-based models that use different pre-training strategies. Briefly:

  • RoBERTa uses the BERT architecture but trains it for longer and without the next sentence prediction objective
  • XLM-R extends RoBERTa to multiple languages (i.e., it also uses the masked language model objective)
  • Info-XLM starts from XLM-R weights and then continues training the model with a new objective presented in their paper

@joannacknight

  • Re. the pool parameter - we could ask a question on their GitHub repo? It is confusing me too, and it doesn't seem that it can be set when creating a UnifiedMetric object? I might have missed something.
  • Happy not to change the encoder_learning_rate - I hadn't thought of the relationship with layerwise_decay.
  • Yes - those seem like sensible choices for varying the batch_size.
  • I have not yet looked for recommendations on values to search over for the other parameters.

@radka-j commented Mar 22, 2024

Looking through the UnifiedMetric code, they hardcode the sentence embedding to be the CLS token. This confirms that the pool hyperparameter gets ignored.

It is true that there is a default value in the base CometModel class but this only gets used in the compute_sentence_embedding() method, which the UnifiedMetric ignores.
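
For anyone reading along, a generic PyTorch sketch of the two pooling strategies being discussed (not COMET's actual code):

```python
import torch

# Fake encoder output: (batch, seq_len, hidden) plus an attention mask.
token_embeddings = torch.randn(2, 10, 1024)
attention_mask = torch.ones(2, 10)

# "CLS" sentence embedding: the first token's representation, which is what
# UnifiedMetric effectively hardcodes.
cls_embedding = token_embeddings[:, 0, :]

# "avg" pooling: mean over non-padding tokens, the behaviour the unused
# pool hyperparameter would otherwise select.
mask = attention_mask.unsqueeze(-1)
avg_embedding = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
```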

@joannacknight

Looking at the method train_dataloader in the base.py file, I think we will need to make some further changes to load and mix more than one file of training data. This is not an issue right now (as we are just training on one language pair), but we will also want to try a model trained on all four language pairs.
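
A rough pandas sketch of what mixing several training files could look like (placeholder paths; not the actual train_dataloader code):

```python
import pandas as pd

# Placeholder paths, one csv per language pair.
train_paths = ["data/encs_train.csv", "data/ende_train.csv"]

# Read every file and shuffle them together into a single dataframe so all
# language pairs are mixed, rather than training on one csv at a time.
train_df = pd.concat([pd.read_csv(p) for p in train_paths], ignore_index=True)
train_df = train_df.sample(frac=1.0, random_state=42).reset_index(drop=True)
```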

@joannacknight

Closing this issue down as I feel we have completed the task we set out to do here. Issue #28 has been created to record our notes in a markdown file in the project, and issue #29 has been created to address my last comment about training on multiple datasets.
