Update README.md

audioshake · Aug 5, 2024 · 858252c
1 parent 297f737
Showing 1 changed file with 17 additions and 3 deletions.
diff --git a/README.md b/README.md
@@ -12,8 +12,6 @@ The package implements metrics designed to work well with lyrics formatted accor
 
 Under the hood, the text is pre-processed using the [`sacremoses`](https://github.com/hplt-project/sacremoses) tokenizer and punctuation normalizer.
 Note that apostrophes and single quotes are never treated as quotation marks, but as part of a word, marking an elision or a contraction.
-For writing systems that do not use spaces to separate words (Chinese, Japanese, Thai, Lao, Burmese, …), each character is considered as a separate word, as per [Radford et al. (2022)](https://arxiv.org/abs/2212.04356).
-See the [test cases](./tests/test_tokenizer.py) for examples of how different languages are tokenized.
 
 ## Usage
 Install the package with `pip install alt-eval`.
@@ -25,11 +23,27 @@ compute_metrics(references, hypotheses)
 ```
 where `references` and `hypotheses` are lists of strings. To specify the language (English by default), use the `languages` parameter, passing either a single language code, or a list of language codes corresponding to individual examples.
 
-For JamALT, use:
+For Jam-ALT, use:
 ```python
 from datasets import load_dataset
 dataset = load_dataset("audioshake/jam-alt")["test"]
 compute_metrics(dataset["text"], transcriptions, languages=dataset["language"])
 ```
 
+If you are only interested in WER, formatting- and punctuation-related metrics can be skipped by passing `include_other=False`.
+
 Use `visualize_errors=True` to also get a list of HTML snippets that can be used to visualize the errors in each transcript.
+
+## Language support
+The package implements language-specific tokenization via `sacremoses`, enhanced with custom rules. Support is well tested for English, Spanish, German, and French.
+
+For writing systems that do not use spaces to separate words (Chinese, Japanese, Thai, Lao, Burmese, …), each character is considered as a separate word, as per [Radford et al. (2022)](https://arxiv.org/abs/2212.04356), making the WER equivalent to CER (character error rate).
+
+See the [test cases](./tests/test_tokenizer.py) for examples of how different languages are tokenized.
+Contributions adding support for additional languages are welcome.
+
+## Optional lyrics normalization
+The [Jam-ALT annotation guide](https://huggingface.co/datasets/audioshake/jam-alt/blob/main/GUIDELINES.md) forbids certain end-of-line punctuation and requires the first letter of each line to be uppercase.
+For transcription systems that do not respect these rules, the results on Jam-ALT can be improved by normalizing the transcripts using the `normalize_lyrics()` function, which fixes these specific issues.
+Note, however, that this relies on the line break predictions being correct. Moreover, other datasets may follow different rules.
+For these reasons, this normalization is **not** included as a fixed pre-processing step in `compute_metrics()`, but is instead left optional.
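
To make the "Optional lyrics normalization" section added in this commit more concrete, here is a minimal sketch of what line-level normalization of this kind might look like. It is not the package's `normalize_lyrics()` implementation: the function name `normalize_lines_sketch` and the punctuation set `TRAILING_PUNCT` are hypothetical, and the exact punctuation forbidden at line ends should be taken from the Jam-ALT annotation guide rather than from this example.

```python
# Hypothetical set of punctuation to strip at line ends. This is an assumption
# made for illustration; it is not taken from the Jam-ALT guide or alt-eval.
TRAILING_PUNCT = ",.;:"


def normalize_lines_sketch(text: str) -> str:
    """Strip assumed end-of-line punctuation and uppercase each line's first letter."""
    normalized = []
    for line in text.split("\n"):
        # Drop trailing whitespace, then any run of the assumed punctuation.
        # Note: this naive version would also strip an ellipsis ("...").
        line = line.rstrip().rstrip(TRAILING_PUNCT).rstrip()
        # Uppercase the first alphabetic character, skipping leading
        # apostrophes or quotes (e.g. "'cause" becomes "'Cause").
        for i, ch in enumerate(line):
            if ch.isalpha():
                line = line[:i] + ch.upper() + line[i + 1:]
                break
        normalized.append(line)
    return "\n".join(normalized)


if __name__ == "__main__":
    print(normalize_lines_sketch("hello world,\nthis is a test."))
```

Because the sketch operates per line, it inherits the caveat stated in the README text: if the predicted line breaks are wrong, the wrong characters get stripped or capitalized.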