Update README.md
cifkao authored Aug 5, 2024
1 parent 297f737 commit 858252c
Showing 1 changed file with 17 additions and 3 deletions.
20 changes: 17 additions & 3 deletions README.md
@@ -12,8 +12,6 @@ The package implements metrics designed to work well with lyrics formatted accor

Under the hood, the text is pre-processed using the [`sacremoses`](https://github.com/hplt-project/sacremoses) tokenizer and punctuation normalizer.
Note that apostrophes and single quotes are never treated as quotation marks, but as part of a word, marking an elision or a contraction.
For writing systems that do not use spaces to separate words (Chinese, Japanese, Thai, Lao, Burmese, …), each character is considered as a separate word, as per [Radford et al. (2022)](https://arxiv.org/abs/2212.04356).
See the [test cases](./tests/test_tokenizer.py) for examples of how different languages are tokenized.

## Usage
Install the package with `pip install alt-eval`.
@@ -25,11 +23,27 @@ compute_metrics(references, hypotheses)
```
where `references` and `hypotheses` are lists of strings. To specify the language (English by default), use the `languages` parameter, passing either a single language code, or a list of language codes corresponding to individual examples.

For JamALT, use:
For Jam-ALT, use:
```python
from datasets import load_dataset
dataset = load_dataset("audioshake/jam-alt")["test"]
compute_metrics(dataset["text"], transcriptions, languages=dataset["language"])
```

If you are only interested in WER, formatting- and punctuation-related metrics can be skipped by passing `include_other=False`.

Use `visualize_errors=True` to also get a list of HTML snippets that can be used to visualize the errors in each transcript.

## Language support
The package implements language-specific tokenization via `sacremoses`, enhanced with custom rules. Support is well tested for English, Spanish, German, and French.

For writing systems that do not use spaces to separate words (Chinese, Japanese, Thai, Lao, Burmese, …), each character is treated as a separate word, as per [Radford et al. (2022)](https://arxiv.org/abs/2212.04356), which makes the WER equivalent to the character error rate (CER).

See the [test cases](./tests/test_tokenizer.py) for examples of how different languages are tokenized.
Contributions adding support for additional languages are welcome.
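
To illustrate why per-character tokenization makes WER coincide with CER, here is a minimal, self-contained sketch. It does not use the package's tokenizer: `edit_distance`, `error_rate`, and `char_tokens` are hypothetical helpers written for this example, and the real pre-processing (punctuation handling, normalization) is more involved.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance normalized by the reference length."""
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def char_tokens(text):
    # Assumption for this sketch: every non-whitespace character is one "word".
    return [ch for ch in text if not ch.isspace()]


reference = "我爱你"
hypothesis = "我恨你"

# "WER" over per-character tokens vs. CER over raw characters
wer = error_rate(char_tokens(reference), char_tokens(hypothesis))
cer = error_rate(list(reference), list(hypothesis))
print(wer, cer)
```

Because the token sequence and the character sequence are identical here, both rates are computed from the same edit distance and the same reference length.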

## Optional lyrics normalization
The [Jam-ALT annotation guide](https://huggingface.co/datasets/audioshake/jam-alt/blob/main/GUIDELINES.md) forbids certain end-of-line punctuation and requires the first letter of each line to be uppercase.
For transcription systems that do not respect these rules, the results on Jam-ALT can be improved by normalizing the transcripts using the `normalize_lyrics()` function, which fixes these specific issues.
Note, however, that this relies on the line break predictions being correct. Moreover, other datasets may follow different rules.
For these reasons, this normalization is **not** included as a fixed pre-processing step in `compute_metrics()` and is instead made optional.
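
As a rough illustration of the kind of line-level normalization described above, here is a minimal sketch. It is not the implementation of `normalize_lyrics()`: the function name `normalize_lyrics_sketch` is hypothetical, and the set of stripped characters (commas and periods) is an assumption made for this example, not the list defined by the annotation guide.

```python
# Assumption: commas and periods stand in for the "forbidden" end-of-line
# punctuation; the actual rules are defined by the Jam-ALT annotation guide.
TRAILING_PUNCT = ",."


def normalize_lyrics_sketch(text):
    """Strip assumed end-of-line punctuation and capitalize each line."""
    lines = []
    for line in text.split("\n"):
        line = line.strip().rstrip(TRAILING_PUNCT).rstrip()
        if line:
            # Uppercase only the first character; leave the rest untouched.
            line = line[0].upper() + line[1:]
        lines.append(line)
    return "\n".join(lines)


print(normalize_lyrics_sketch("hello world,\nhow are you?"))
```

The sketch operates strictly line by line, which shows why such a step depends on the predicted line breaks being correct: a misplaced break would cause the wrong character to be capitalized or the wrong punctuation to be stripped.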
