
Sparsemax not actually used in COMET-KIWI, XCOMET-XL/XXL #195

Open
emjotde opened this issue Jan 19, 2024 · 4 comments
Labels
bug Something isn't working


emjotde commented Jan 19, 2024

Hi,
I have been playing around with re-implementing some of your models in Marian, and while working through the code I noticed that you are not actually using sparsemax for COMET-KIWI and XCOMET-XL/XXL; instead you are falling back to a softmax.

In both cases you forgot to pass the layer_transformation parameter to its base class:

See here for UnifiedMetric

layer_transformation: str = "sparsemax",

and here for XCOMETMetric

layer_transformation: str = "sparsemax",

In both cases layer_transformation is not among the arguments passed to the base class constructor below, and the base class defaults to softmax.
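For anyone skimming, the pattern looks roughly like the following (class names are made up for illustration, this is not the actual COMET code): the subclass advertises sparsemax in its own signature, but because the value is never forwarded, the base class silently falls back to its softmax default.

class BaseMetric:
    def __init__(self, layer_transformation: str = "softmax"):
        self.layer_transformation = layer_transformation

class BuggyMetric(BaseMetric):
    def __init__(self, layer_transformation: str = "sparsemax"):
        super().__init__()  # bug: layer_transformation is not forwarded

class FixedMetric(BaseMetric):
    def __init__(self, layer_transformation: str = "sparsemax"):
        super().__init__(layer_transformation=layer_transformation)  # fix

print(BuggyMetric().layer_transformation)  # softmax  (unexpected)
print(FixedMetric().layer_transformation)  # sparsemax (intended)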

In my re-implementation I can reproduce your exact numbers for COMET-KIWI with a softmax, not with sparsemax, while sparsemax works fine for the reference-based COMET-22.

It's not clear to me whether the model was trained with softmax or sparsemax, but you either have a train/inference mismatch here, or at the very least your models are doing something different from what you expected/described.

emjotde added the bug (Something isn't working) label on Jan 19, 2024
emjotde commented Jan 19, 2024

Follow-up on that... I am also wondering whether you realized that Roberta-XL and Roberta-XXL are pre-norm, while the base model you used for COMET-KIWI is post-norm, yet you treat them the same during training/inference. The huggingface implementation collects the hidden states without normalization for the XL models, with the exception of the very last hidden state, which is normed.

That seems to mean that the hidden states you use for layer-mixing have wildly different magnitudes across layers -- the first and the last one (the most important one?) have very small norms, while the ones in between are un-normed. I am wondering if that wouldn't give you a really hard time when training the xCOMET-XXL models and skew the weighting during layer mixing?
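In case it is useful, a rough way to see the magnitude mismatch directly from huggingface (the checkpoint name facebook/xlm-roberta-xl is my assumption here; any pre-norm model should show the same effect):

import torch
from transformers import AutoModel, AutoTokenizer

name = "facebook/xlm-roberta-xl"  # assumed checkpoint; adjust as needed
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_hidden_states=True)

batch = tokenizer("A short test sentence.", return_tensors="pt")
with torch.no_grad():
    out = model(**batch)

# For the pre-norm XL/XXL models only the very last hidden state is layer-normed,
# so its norm (and the embedding layer's) is much smaller than the middle layers'.
for i, h in enumerate(out.hidden_states):
    print(i, h.norm(dim=-1).mean().item())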

ricardorei commented Jan 22, 2024

@emjotde nothing like a re-implementation challenge to find bugs 😄... I just confirmed and you are right. It's defaulting to softmax instead of sparsemax.

>>> from comet import download_model, load_from_checkpoint
>>> model = load_from_checkpoint(download_model("Unbabel/wmt23-cometkiwi-da-xxl"))
>>> model.layerwise_attention.transform_fn
<built-in method softmax of type object at 0x7fda5cbd2460>
>>> model.layerwise_attention.layer_norm
False

Same thing for the XCOMET models.

Regarding Roberta-XL and XXL, I did realise the change from post-norm to pre-norm, but I did not realise the impact on the embeddings returned from HF. HF actually took a long, long time to integrate Roberta-XL/XXL because of this issue... but I never inspected the magnitudes across layers.

Btw, the rationale for using sparsemax instead of softmax was not performance related. Our goal when integrating sparsemax was to study whether all layers are relevant or not. The performance of sparsemax and softmax is usually the same. Yet, for wmt22-comet-da, because of sparsemax, we can clearly observe which layers are relevant:

e.g.:

>>> import torch
>>> model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
>>> weights = torch.cat([parameter for parameter in model.layerwise_attention.scalar_parameters])
>>> normed_weights = model.layerwise_attention.transform_fn(weights, dim=0)
>>> normed_weights
tensor([0.0849, 0.0738, 0.0504, 0.0463, 0.0166, 0.0125, 0.0103, 0.0027, 0.0000,
        0.0000, 0.0007, 0.0088, 0.0151, 0.0463, 0.0591, 0.0466, 0.0516, 0.0552,
        0.0581, 0.0621, 0.0666, 0.0609, 0.0621, 0.0645, 0.0448],
       grad_fn=<SparsemaxFunctionBackward>)

Here we can see that some layers are set to 0 and thus ignored. This provides some level of interpretability... Ideally, the model would ignore the top layers and we could prune them after training (unfortunately this usually does not happen).
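As a toy illustration of why sparsemax gives these exact zeros (this is just the plain forward projection from Martins & Astudillo, 2016, not our actual implementation): sparsemax projects the scores onto the probability simplex, so low-scoring layers are truncated to exactly 0, while softmax always keeps every weight strictly positive.

import torch

def sparsemax(z: torch.Tensor) -> torch.Tensor:
    # 1-D sparsemax forward pass: project z onto the probability simplex
    z_sorted, _ = torch.sort(z, descending=True)
    k = torch.arange(1, z.numel() + 1, dtype=z.dtype)
    cumsum = torch.cumsum(z_sorted, dim=0)
    support = 1 + k * z_sorted > cumsum        # entries that stay in the support
    k_max = support.sum()
    tau = (cumsum[k_max - 1] - 1) / k_max      # threshold subtracted from the scores
    return torch.clamp(z - tau, min=0.0)

scores = torch.tensor([2.0, 1.5, 0.4, -1.0, -3.0])
print(torch.softmax(scores, dim=0))  # every weight > 0
print(sparsemax(scores))             # tensor([0.7500, 0.2500, 0.0000, 0.0000, 0.0000])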

With XCOMET, the learned weights are all very similar... but, like you said, probably because of the different norms?

>>> model = load_from_checkpoint(download_model("Unbabel/XCOMET-XL"))
>>> weights = torch.cat([parameter for parameter in model.layerwise_attention.scalar_parameters])
>>> normed_weights = model.layerwise_attention.transform_fn(weights, dim=0)
>>> normed_weights
tensor([0.0285, 0.0267, 0.0267, 0.0267, 0.0267, 0.0267, 0.0267, 0.0267, 0.0267,
        0.0267, 0.0267, 0.0267, 0.0267, 0.0267, 0.0267, 0.0267, 0.0268, 0.0268,
        0.0268, 0.0268, 0.0268, 0.0269, 0.0270, 0.0271, 0.0271, 0.0272, 0.0273,
        0.0273, 0.0273, 0.0273, 0.0273, 0.0273, 0.0273, 0.0273, 0.0273, 0.0272,
        0.0287], grad_fn=<SoftmaxBackward0>)

Also, not sure if you noticed, but we only use the layerwise attention for creating the sentence embeddings that are used for regression. The embeddings used for classifying the individual tokens as error spans are those from the word_layer (model.hparams.word_layer). We have not played a lot with this hyper-parameter, but our goal was to make an individual layer more specialised on that task (usually a top layer, because it is closer to the MLM objective), while for regression we would like to pool information from all layers.
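Schematically it looks something like this (variable names are just for the sketch, not our actual code): the regression head gets a layerwise-attention mix of all layers, while the word-level head gets the token states from the single word_layer.

import torch

def pool_for_heads(hidden_states, layer_weights, word_layer):
    # hidden_states: list of [batch, seq, dim] tensors, one per layer
    stacked = torch.stack(hidden_states, dim=0)                  # [layers, batch, seq, dim]
    mixed = (layer_weights.view(-1, 1, 1, 1) * stacked).sum(0)   # layerwise attention mix
    sentence_emb = mixed[:, 0, :]                                # e.g. first-token pooling for the regressor
    token_states = hidden_states[word_layer]                     # single layer for error spans
    return sentence_emb, token_states

hs = [torch.randn(2, 7, 16) for _ in range(25)]
w = torch.softmax(torch.randn(25), dim=0)
sent, tok = pool_for_heads(hs, w, word_layer=24)
print(sent.shape, tok.shape)  # torch.Size([2, 16]) torch.Size([2, 7, 16])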

> I am wondering if that wouldn't give you a really hard time when training the xCOMET-XXL models and skew the weighting during layer mixing?

It did not... I was actually surprised, but training was very stable from the get-go. I had some issues with distributed training and pytorch-lightning and ended up implementing something without lightning, but after that was done, training was smooth.

emjotde commented Jan 22, 2024

Yeah, I am not looking at the word-level predictions yet; I stopped at the regressor implementation.

Regarding the weights above, the fact that they are near-uniform after the softmax, despite the norms of the hidden states being so different, is what made me wonder whether proper learning happens or rather some form of saturation (always hard to tell with these neural models).

I would have expected the model to strongly push down the weights for the layers with high norms. On the other hand, if this basically becomes an unweighted arithmetic average, then the two very small vectors pull everything down by a lot, considering that averages reward outliers. Who knows...
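Just to put toy numbers on the scale mismatch (made up values, not measured from the real model): under near-uniform weights, each layer's contribution to the mix scales directly with its norm, so the layers differ by roughly the ratio of their norms when they enter the average.

import torch

norms = torch.tensor([0.5] + [40.0] * 23 + [0.5])  # toy values: small first/last, large in between
weights = torch.full((25,), 1 / 25)                 # roughly what the near-uniform softmax gives
contribution = weights * norms                      # per-layer magnitude entering the average
print(contribution[0].item(), contribution[12].item(), contribution[-1].item())
# 0.02 vs 1.6 vs 0.02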

@ricardorei

It's the black magic art of NNs 🙂
