Different scores from different COMET package versions 1.1.2 and 2.2.1 #203

PinzhenChen · 2024-02-27T12:52:40Z

🐛 Bug

When the same source, target, reference files are evaluated using the same wmt22-comet-da checkpoint, unbabel-comet 2.2.1 under python3.9 and unbabel-comet 1.1.2 under python3.7 gave me dramatically different numbers.

To Reproduce

In python3.7, pip install --upgrade unbabel-comet gives 1.1.2 as the latest version, while in python3.9 it gives 2.2.1.

Scoring the same source, target, and reference files under the above two environments gave different scores. unbabel-comet 1.1.2 results in a score of 0.86 while the 2.2.1 version gave 0.79. I used WMT22-COMET-DA downloaded from Hugging Face https://huggingface.co/Unbabel/wmt22-comet-da.

Attaching the files which gave 0.79 and 0.86 below, but I think any file combination can be used to reproduce this behaviour since it's associated with the COMET package version.
target.en.txt
source.mt.txt
hypothesis.en.txt
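A minimal sketch of a scoring helper that could be run unchanged in both environments to compare system scores. It assumes, as my understanding of the two APIs, that `model.predict` returns a `(segment_scores, system_score)` tuple in unbabel-comet 1.x and an object with a `.system_score` attribute in 2.x. `build_samples` and `system_score` are hypothetical helper names, not part of COMET.

```python
def build_samples(sources, hypotheses, references):
    """Zip three parallel lists of lines into the list-of-dicts input COMET expects."""
    if not (len(sources) == len(hypotheses) == len(references)):
        raise ValueError("source, hypothesis and reference must have the same number of lines")
    return [
        {"src": src, "mt": mt, "ref": ref}
        for src, mt, ref in zip(sources, hypotheses, references)
    ]


def system_score(model, samples, **predict_kwargs):
    """Call model.predict and return the system-level score as a float.

    Assumption: 1.x returns a (segment_scores, system_score) tuple, while
    2.x returns an object exposing a .system_score attribute.
    """
    output = model.predict(samples, **predict_kwargs)
    if hasattr(output, "system_score"):
        # 2.x-style return value
        return float(output.system_score)
    # 1.x-style return value
    _segment_scores, score = output
    return float(score)
```

With a real model this would be called roughly as `system_score(load_from_checkpoint(download_model("Unbabel/wmt22-comet-da")), samples, batch_size=8, gpus=0)`, where `download_model` and `load_from_checkpoint` are imported from `comet`.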

Expected behaviour

I would expect different COMET package versions to give the same score if the same checkpoint and files are given.

Environment

Managed python3.7 and python3.9 with conda.
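For anyone comparing environments, a small sketch that prints the interpreter and package versions in each conda environment. The package list is an assumption about which dependencies are likely to matter, and `environment_report` is a hypothetical helper name.

```python
import platform

try:  # Python 3.8+
    from importlib.metadata import PackageNotFoundError, version
except ImportError:  # Python 3.7: fall back to setuptools' pkg_resources
    import pkg_resources

    PackageNotFoundError = pkg_resources.DistributionNotFound

    def version(name):
        return pkg_resources.get_distribution(name).version


def environment_report(packages=("unbabel-comet", "torch", "transformers", "pytorch-lightning")):
    """Return a dict of the Python version plus the installed version of each package."""
    report = {"python": platform.python_version()}
    for name in packages:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            report[name] = "not installed"
    return report


for name, installed in environment_report().items():
    print(f"{name}=={installed}")
```

Running this once per environment and diffing the output shows which versions differ besides unbabel-comet itself.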

Additional context

If there is indeed some package mismatch between unbabel-comet 1.1.2 and 2.2.1, it might be difficult to go back and fix the problem. Users are probably unaware of this and will not update. Moreover, python3.7 only supports 1.1.2 as the latest version, even if users upgrade COMET in python3.7. Maybe this behaviour can be highlighted in the README to encourage users to use specific Python and unbabel-comet versions. On the other hand, this could imply that research papers should report the COMET package version in addition to the COMET model version. Would it be possible to implement some kind of COMET signature, just like the one in sacrebleu?


BramVanroy · 2024-03-02T17:16:36Z

This confirms what we learnt for BLEU, too: one should ALWAYS report version numbers (signatures), also for COMET!

Side note: in my MATEO, I added a custom signature for neural metrics like bertscore, bleurt and comet, too. For COMET it looks like this (inspired by sacrebleu):

comet: nrefs:1|bs:1000|seed:12345|c:Unbabel/wmt22-comet-da|version:2.0.1|mateo:1.1.3

where c stands for the checkpoint used and version is self-explanatory. I wasn't sure how far one has to go with this, because differences in torch, CUDA and transformers versions may or may not also lead to differences in results. Hell, even then, CUDA optimisations might lead to different results on different hardware.
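A minimal sketch of how a signature in this format could be built and parsed. This is not MATEO's or sacrebleu's actual implementation; `build_signature` and `parse_signature` are hypothetical names, and the field values are the ones from the example above.

```python
def build_signature(metric, fields):
    """Join ordered key/value pairs into 'metric: k1:v1|k2:v2|...'."""
    body = "|".join(f"{key}:{value}" for key, value in fields.items())
    return f"{metric}: {body}"


def parse_signature(signature):
    """Split a signature back into (metric, {key: value}); values stay strings."""
    metric, _, body = signature.partition(": ")
    fields = dict(item.split(":", 1) for item in body.split("|"))
    return metric, fields


signature = build_signature(
    "comet",
    {
        "nrefs": 1,
        "bs": 1000,
        "seed": 12345,
        "c": "Unbabel/wmt22-comet-da",
        "version": "2.0.1",
        "mateo": "1.1.3",
    },
)
print(signature)
```

Further fields (for example the torch or transformers version) could be appended the same way if they turn out to affect scores.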

PinzhenChen · 2024-03-03T23:18:47Z

Admittedly the README currently says it requires Python 3.8, so maybe I installed COMET in the stone age and pip install --upgrade unbabel-comet never warned me. Anyway, I think the score mismatch should not be expected.

Your signature is very thoughtful!

PinzhenChen added the bug ("Something isn't working") label Feb 27, 2024

bhaddow mentioned this issue Feb 27, 2024

Different versions of COMET code give different scores with the same model and date. #204

Closed
