Replicating WER on FLEURS #2076

michailmelonas · 2024-03-09T14:58:19Z

I'm having trouble replicating the WER reported in Table 13 of the paper. Specifically, on the Afrikaans test split of the FLEURS dataset, I get a WER of 110.59% with the Tiny model using the following snippet:

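import jiwer
import torch
import whisper
from tqdm import tqdm
from whisper.normalizers import BasicTextNormalizer


def calculate_wer(model: whisper.model.Whisper, data_loader: torch.utils.data.DataLoader) -> float:
    hypotheses, references = [], []
    # Each batch provides Mel spectrograms and reference texts; the two middle fields are unused here.
    for mels, _, _, texts in tqdm(data_loader):
        results = model.decode(
            mels,
            whisper.DecodingOptions(language="af", without_timestamps=True)
        )
        hypotheses.extend([result.text for result in results])
        references.extend(texts)

    # Apply the same text normalization to both sides before scoring.
    normalizer = BasicTextNormalizer()
    hypotheses = [normalizer(s) for s in hypotheses]
    references = [normalizer(s) for s in references]

    # jiwer returns a fraction; scale to a percentage.
    return jiwer.wer(hypothesis=hypotheses, reference=references) * 100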

This is much higher than the value of 91.2% given in the paper.
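For context on how a WER can exceed 100%: WER counts substitutions, deletions and insertions against the number of reference words, so a hypothesis with many extra words can push it past 100%. Below is a minimal sketch of that computation in plain Python. The function names `word_edit_distance` and `corpus_wer` and the example sentences are made up for illustration; this is not jiwer's implementation, only the same idea pooled over a corpus.

```python
def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance over words (substitutions, deletions, insertions)."""
    prev = list(range(len(hyp_words) + 1))  # distance from an empty reference
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]  # distance to an empty hypothesis: i deletions
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference word
                curr[j - 1] + 1,     # insert a hypothesis word
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def corpus_wer(references, hypotheses):
    """Total word errors divided by total reference words, as a percentage."""
    errors = sum(
        word_edit_distance(r.split(), h.split())
        for r, h in zip(references, hypotheses)
    )
    total_ref_words = sum(len(r.split()) for r in references)
    return 100.0 * errors / total_ref_words


# Hypothetical example: a 3-word reference and a 7-word hypothesis.
# The reference is a subsequence of the hypothesis, so the 4 extra words
# are 4 insertions, giving 4 / 3 = 133.33%.
example_wer = corpus_wer(["die kat slaap"], ["die kat kat kat slaap nou baie"])
print(f"{example_wer:.2f}")
```

Repetition loops or hallucinated text in the hypotheses would inflate the score in exactly this way, so inspecting the longest hypotheses may be a useful first check.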

I also tried the whisper.transcribe.transcribe function (which uses various decoding strategies), but this gave a WER of 99.98%, which is still higher than the reported value.

I'd appreciate any thoughts on what explains this difference. I see a similar point was raised in #702, but no answer has yet been provided.

Update: with whisper.transcribe.transcribe, I get different results across multiple runs. Specifying the target language also seems to improve the WER with this function.

roudimit · 2024-03-15T18:29:32Z

Hi! I'm the author of the referenced post. I think the Python API doesn't set beam_size=5 and best_of=5 by default, which are the values used in the paper. It would be good to double-check whether those are set!
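A minimal sketch of setting both values explicitly through the Python API. It assumes that `model.transcribe` in the openai-whisper package forwards extra keyword arguments to the decoder, and that, as far as I can tell, `beam_size` is then used for the temperature-0 pass while `best_of` applies only to the sampled fallback passes. The helper name `transcribe_with_beam_and_best_of` is hypothetical.

```python
def transcribe_with_beam_and_best_of(model, audio_path, language, beam_size=5, best_of=5):
    """Call model.transcribe with beam_size and best_of set explicitly.

    `model` is expected to be the object returned by whisper.load_model(...).
    Assumption: transcribe passes these keyword arguments on as decoding options.
    """
    return model.transcribe(
        audio_path,
        language=language,
        beam_size=beam_size,  # beam search width for the temperature-0 pass
        best_of=best_of,      # number of samples when decoding with temperature > 0
    )


# Usage (requires the openai-whisper package and a local audio file):
#   import whisper
#   model = whisper.load_model("tiny")
#   result = transcribe_with_beam_and_best_of(model, "sample.wav", "af")
#   print(result["text"])
```

Note that a single `whisper.DecodingOptions` passed to `model.decode` will, to my knowledge, reject having both options set at once, so this combination only makes sense through `transcribe`.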

1 reply

michailmelonas Mar 24, 2024
Author

Hey! Thanks for the reply.

It seems beam_size and best_of are mutually exclusive -- only one of them can be used at a time.

The closest I've come to replicating the results in the paper (for Afrikaans on FLEURS) is by running:

import os

from datasets import load_dataset
from whisper.normalizers import BasicTextNormalizer
import jiwer
import whisper


model = whisper.load_model("tiny")
model.eval()

afr_fleurs = load_dataset("google/fleurs", "af_za", split="test")

hypotheses, references = [], []

for sample in afr_fleurs:
    # The extracted audio files sit in a "test" subdirectory next to the stored path.
    directory, filename = os.path.split(sample["path"])
    audio_path = os.path.join(directory, "test", filename)

    result = model.transcribe(audio_path, language="af")
    hypotheses.append(result["text"])
    references.append(sample["transcription"])

normalizer = BasicTextNormalizer()
hypotheses = [normalizer(s) for s in hypotheses]
references = [normalizer(s) for s in references]
wer = jiwer.wer(hypothesis=hypotheses, reference=references) * 100

This amounts to best_of=5. Across three separate runs, I get WERs of 92.26, 91.97, and 92.0. Given that the output is non-deterministic (due to sampling), it seems conceivable that the authors could have gotten 91.2 (as reported in the paper).
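To put the run-to-run spread next to the gap from the paper's figure, here is a small sketch that summarizes the three WERs above using only the standard library. With just three runs the spread estimate is rough, so treat it as a sanity check rather than a significance test.

```python
from statistics import mean, stdev

runs = [92.26, 91.97, 92.0]  # WERs from the three runs reported above
paper_wer = 91.2             # value reported in the paper

run_mean = mean(runs)
run_std = stdev(runs)        # sample standard deviation (n - 1 denominator)
gap = run_mean - paper_wer

print(f"mean={run_mean:.2f}  std={run_std:.2f}  gap to paper={gap:.2f}")
# prints: mean=92.08  std=0.16  gap to paper=0.88
```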
