diff --git a/README.md b/README.md
index cbea270..dceb902 100644
--- a/README.md
+++ b/README.md
@@ -49,7 +49,7 @@ For attribution in academic contexts, please cite this work as:
 - **Hyphenation**: Use hyphens (-) for compound adjectives (e.g., video-to-pose).
 - **Lists**: Use "-" for list items, followed by a space.
 - **Code**: Use backticks (`) for inline code, and triple backticks (```) for code blocks.
-- **Numbers**: Spell out numbers less than 10, and use numerals for 10 and greater.
+- **Numbers**: Spell out numbers less than 10, and use numerals for 10 and greater. For large numbers, separate digit groups with commas and do not abbreviate with suffixes such as "k" (e.g., 11,000, not 11k).
 - **Contractions**: Avoid contractions (e.g., use "do not" instead of "don't").
 - **Compound Words**: Use a forward slash (/) to separate alternative compound words (e.g., 2D / 3D).
 - **Phrasing**: Prefer active voice over passive voice (e.g., "The authors used..." instead of "The work was used by the authors...").
diff --git a/src/index.md b/src/index.md
index 51a543e..23500e5 100644
--- a/src/index.md
+++ b/src/index.md
@@ -837,6 +837,7 @@ They apply several low-resource machine translation techniques used to improve s
 Their findings validate the use of an intermediate text representation for signed language translation, and pave the way for including sign language translation in natural language processing research.
 
 #### Text-to-Notation
+
 @jiang2022machine also explore the reverse translation direction, i.e., text to SignWriting translation.
 They conduct experiments under a same condition of their multilingual SignWriting to text (4 language pairs) experiment, and again propose a neural factored machine translation approach to decode the graphemes and their position separately.
 They borrow BLEU from spoken language translation to evaluate the predicted graphemes and mean absolute error to evaluate the positional numbers.
@@ -846,14 +847,61 @@
 ---
 
 #### Notation-to-Pose
+
 @shalev2022ham2pose proposed Ham2Pose, a model to animate HamNoSys into a sequence of poses.
 They first encode the HamNoSys into a meaningful "context" representation using a transform encoder, and use it to predict the length of the pose sequence to be generated.
 Then, starting from a still frame they used an iterative non-autoregressive decoder to gradually refine the sign over $T$ steps, In each time step $t$ from $T$ to $1$, the model predicts the required change from step $t$ to step $t-1$. After $T$ steps, the pose generator outputs the final pose sequence.
-Their model outperformed previous methods like @saunders2020progressive, animating HamNoSys into more realistic sign language sequences.
+Their model outperformed previous methods like @saunders2020progressive, animating HamNoSys into more realistic sign language sequences.
+
+#### Evaluation Metrics
+
+Methods for automatic evaluation of sign language processing typically depend only on the system output and are independent of the input.
+
+##### Text Output
+
+For tasks that output spoken language text, standard machine translation metrics such as BLEU, chrF, or COMET are commonly used.
+
+##### Gloss Output
+
+Gloss outputs can be automatically scored as well, though not without issues.
+In particular, @muller-etal-2023-considerations analysed these issues and provided a series of recommendations (see the section on "Glosses", above).
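+
+When such string-based metrics are used, they can be computed with standard machine translation tooling.
+The following is a minimal sketch using the sacreBLEU library [@post-2018-call-sacrebleu]; the hypothesis and reference strings are invented placeholders, not data from the cited works.
+
+```python
+# Minimal sketch: scoring text (or gloss) outputs with sacreBLEU.
+import sacrebleu
+
+hypotheses = ["the weather is nice today"]    # system outputs, one string per segment
+references = [["the weather today is nice"]]  # one inner list per reference set
+
+bleu = sacrebleu.corpus_bleu(hypotheses, references)
+chrf = sacrebleu.corpus_chrf(hypotheses, references)
+print(f"BLEU: {bleu.score:.2f}  chrF: {chrf.score:.2f}")
+```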
+
+##### Pose Output
+
+For translation from spoken languages to signed languages, automatic evaluation metrics are an open line of research, though some metrics involving back-translation have been developed (see Text-to-Pose and Notation-to-Pose, above).
+
+Naively, works in this domain have used metrics such as Mean Squared Error (MSE) or Average Position Error (APE) for pose outputs [@ahuja2019Language2PoseNaturalLanguage; @ghosh2021SynthesisCompositionalAnimations; @petrovich2022TEMOSGeneratingDiverse].
+However, these metrics have significant limitations for sign language production.
+
+For example, MSE and APE do not account for variations in sequence length.
+In practice, the same sign will not always take exactly the same amount of time to produce, even by the same signer.
+To address time variation, @huang2021towards introduced a metric for pose sequence outputs based on measuring the distance between generated and reference pose sequences at the joint level using dynamic time warping, termed DTW-MJE (Dynamic Time Warping - Mean Joint Error).
+However, this metric did not clearly address how to handle missing keypoints.
+@shalev2022ham2pose experimented with multiple evaluation methods and proposed adding a distance function that accounts for these missing keypoints.
+They applied this function with keypoint normalization, naming their metric nDTW-MJE.
+A simplified sketch of this style of pose metric appears at the end of this section.
+
+##### Multi-channel Block Output
+
+As an alternative to gloss sequences, @kim-etal-2024-signbleu-automatic proposed a multi-channel output representation for sign languages and introduced SignBLEU, a BLEU-like scoring method for these outputs.
+Instead of a single linear sequence of glosses, the representation segments sign language output into multiple linear channels, each containing discrete "blocks".
+The channels cover both manual and non-manual signals, for example, one channel for each hand and others for non-manual signals such as eyebrow movements.
+The blocks are then converted to n-grams: temporal grams capture sequences within a channel, and channel grams capture co-occurrences across channels.
+The SignBLEU score is then calculated over these n-grams of varying orders.
+They evaluated SignBLEU on the DGS Corpus v3.0 [@dataset:Konrad_2020_dgscorpus_3; @dataset:prillwitz2008dgs], NIASL2021 [@dataset:huerta-enochian-etal-2022-kosign], and NCSLGR [@dataset:Neidle_2020_NCSLGR_ISLRN; @Vogler2012ASLLRP_data_access_interface] datasets, comparing it with single-channel (gloss) metrics such as BLEU, TER, chrF, and METEOR, as well as human evaluations by native signers.
+The authors found that SignBLEU consistently correlated better with human evaluation than these alternatives.
+However, one limitation of this approach is the lack of suitable datasets.
+The authors reviewed a number of sign language corpora, noting the relative scarcity of multi-channel annotations.
+The [source code for SignBLEU](https://github.com/eq4all-projects/SignBLEU) is available.
+As with SacreBLEU [@post-2018-call-sacrebleu], the code can generate "version signature" strings summarizing key parameters to enhance reproducibility.
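+
+To illustrate the block-to-gram conversion described above, the following is a much-simplified sketch of the idea; it is not the reference SignBLEU implementation, and the channel names and block labels are invented.
+
+```python
+# Simplified sketch of SignBLEU-style "grams" over a multi-channel block representation.
+from itertools import combinations
+
+# Each channel is a list of (start, end, label) blocks on a shared timeline.
+channels = {
+    "right_hand": [(0.0, 0.5, "INDEX"), (0.5, 1.2, "GO")],
+    "left_hand":  [(0.5, 1.2, "GO")],
+    "eyebrows":   [(0.0, 1.2, "raised")],
+}
+
+def temporal_grams(blocks, order=2):
+    """n-grams of consecutive block labels within a single channel."""
+    labels = [label for _, _, label in blocks]
+    return [tuple(labels[i:i + order]) for i in range(len(labels) - order + 1)]
+
+def channel_grams(channels):
+    """Pairs of labels from different channels whose blocks overlap in time."""
+    grams = []
+    for (chan_a, blocks_a), (chan_b, blocks_b) in combinations(channels.items(), 2):
+        for start_a, end_a, label_a in blocks_a:
+            for start_b, end_b, label_b in blocks_b:
+                if max(start_a, start_b) < min(end_a, end_b):  # temporal overlap
+                    grams.append(((chan_a, label_a), (chan_b, label_b)))
+    return grams
+
+print(temporal_grams(channels["right_hand"]))  # [('INDEX', 'GO')]
+print(len(channel_grams(channels)))            # number of co-occurring block pairs
+```
+
+A BLEU-like score can then compare the gram counts of a hypothesis against those of a reference, analogous to the modified n-gram precision in BLEU.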
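+
+Returning to the pose-level metrics discussed above, the following sketch shows the general shape of a DTW-based mean joint error with a simple treatment of missing keypoints.
+It is illustrative only and does not reproduce the exact distance functions or normalization of @huang2021towards or @shalev2022ham2pose; the random arrays stand in for real pose sequences.
+
+```python
+# Illustrative DTW-MJE-style score for pose sequences of shape (frames, joints, dims).
+# Missing keypoints are encoded as NaN and skipped in the per-frame joint error.
+import numpy as np
+
+def frame_distance(a, b):
+    """Mean Euclidean error over the joints present in both frames."""
+    valid = ~(np.isnan(a).any(axis=-1) | np.isnan(b).any(axis=-1))
+    if not valid.any():
+        return 0.0  # no comparable joints; one of several possible conventions
+    return float(np.linalg.norm(a[valid] - b[valid], axis=-1).mean())
+
+def dtw_mje(hyp, ref):
+    """Dynamic time warping over frames, with joint-level error as the local cost."""
+    n, m = len(hyp), len(ref)
+    cost = np.full((n + 1, m + 1), np.inf)
+    cost[0, 0] = 0.0
+    for i in range(1, n + 1):
+        for j in range(1, m + 1):
+            d = frame_distance(hyp[i - 1], ref[j - 1])
+            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
+    return cost[n, m] / max(n, m)  # length normalization, one possible convention
+
+hyp = np.random.rand(20, 137, 2)   # e.g., 137 OpenPose-style keypoints in 2D
+ref = np.random.rand(25, 137, 2)
+ref[:, 40:60] = np.nan             # pretend some keypoints were not detected
+print(dtw_mje(hyp, ref))
+```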
+
 
 ```{=ignore}
 #### Pose-to-Notation
diff --git a/src/references.bib b/src/references.bib
index b5c3c6a..11d8b70 100644
--- a/src/references.bib
+++ b/src/references.bib
@@ -3457,3 +3457,129 @@ @inproceedings{dataset:dal2022lsa
   url = {https://doi.org/10.1007/978-3-031-22419-5_25},
   year = {2023}
 }
+
+@inproceedings{kim-etal-2024-signbleu-automatic,
+  address = {Torino, Italia},
+  author = {Kim, Jung-Ho and Huerta-Enochian, Mathew John and Ko, Changyong and Lee, Du Hui},
+  booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
+  pages = {14796--14811},
+  publisher = {ELRA and ICCL},
+  title = {{S}ign{BLEU}: Automatic Evaluation of Multi-channel Sign Language Translation},
+  url = {https://aclanthology.org/2024.lrec-main.1289},
+  year = {2024}
+}
+
+@inproceedings{dataset:huerta-enochian-etal-2022-kosign,
+  address = {Marseille, France},
+  author = {Huerta-Enochian, Mathew and Lee, Du Hui and Myung, Hye Jin and Byun, Kang Suk and Lee, Jun Woo},
+  booktitle = {Proceedings of the 7th International Workshop on Sign Language Translation and Avatar Technology: The Junction of the Visual and the Textual: Challenges and Perspectives},
+  pages = {59--66},
+  publisher = {European Language Resources Association},
+  title = {{K}o{S}ign Sign Language Translation Project: Introducing The {NIASL}2021 Dataset},
+  url = {https://aclanthology.org/2022.sltat-1.9},
+  year = {2022}
+}
+
+@article{Bevilacqua_Blloshmi_Navigli_2021,
+  author = {Bevilacqua, Michele and Blloshmi, Rexhina and Navigli, Roberto},
+  doi = {10.1609/aaai.v35i14.17489},
+  journal = {Proceedings of the AAAI Conference on Artificial Intelligence},
+  number = {14},
+  pages = {12564--12573},
+  title = {One {SPRING} to Rule Them Both: Symmetric {AMR} Semantic Parsing and Generation without a Complex Pipeline},
+  url = {https://ojs.aaai.org/index.php/AAAI/article/view/17489},
+  volume = {35},
+  year = {2021}
+}
+
+@misc{dataset:Konrad_2020_dgscorpus_3,
+  author = {Konrad, Reiner and Hanke, Thomas and Langer, Gabriele and Blanck, Dolly and Bleicken, Julian and Hofmann, Ilona and Jeziorski, Olga and K{\"o}nig, Lutz and K{\"o}nig, Susanne and Nishio, Rie and Regen, Anja and Salden, Uta and Wagner, Sven and Worseck, Satu and B{\"o}se, Oliver and Jahn, Elena and Schulder, Marc},
+  doi = {10.25592/dgs.corpus-3.0},
+  publisher = {Universit{\"a}t Hamburg},
+  title = {{{MEINE DGS}} -- Annotiert. {{{\"O}ffentliches}} Korpus Der Deutschen Geb{\"a}rdensprache, 3. {{Release}} / {{MY DGS}} -- Annotated. {{Public}} Corpus of German Sign Language, 3rd Release},
+  type = {Languageresource},
+  url = {https://doi.org/10.25592/dgs.corpus-3.0},
+  version = {3.0},
+  year = {2020}
+}
+
+@misc{dataset:Neidle_2020_NCSLGR_ISLRN,
+  author = {Carol Neidle and Stan Sclaroff},
+  publisher = {Boston University},
+  title = {National Center for Sign Language and Gesture Resources (NCSLGR) corpus. {ISLRN} 833-505-711-564-4},
+  type = {Languageresource},
+  url = {https://www.islrn.org/resources/833-505-711-564-4/},
+  year = {2012}
+}
+
+@inproceedings{Vogler2012ASLLRP_data_access_interface,
+  author = {Christian Vogler and Carol Neidle},
+  title = {A new web interface to facilitate access to corpora: development of the {ASLLRP} data access interface},
+  url = {https://api.semanticscholar.org/CorpusID:58305327},
+  year = {2012}
+}
+
+@inproceedings{huangFastHighQualitySign2021,
+  title = {Towards Fast and {High-Quality Sign Language Production}},
+  booktitle = {Proceedings of the 29th {{ACM International Conference}} on {{Multimedia}}},
+  author = {Huang, Wencan and Pan, Wenwen and Zhao, Zhou and Tian, Qi},
+  year = {2021},
+  month = oct,
+  series = {{{MM}} '21},
+  pages = {3172--3181},
+  publisher = {Association for Computing Machinery},
+  address = {New York, NY, USA},
+  doi = {10.1145/3474085.3475463},
+  url = {https://doi.org/10.1145/3474085.3475463},
+  urldate = {2024-06-19},
+  isbn = {978-1-4503-8651-7}
+}
+
+@inproceedings{ahuja2019Language2PoseNaturalLanguage,
+  author = {Ahuja, Chaitanya and Morency, Louis-Philippe},
+  booktitle = {2019 {{International Conference}} on {{3D Vision}} ({{3DV}})},
+  doi = {10.1109/3DV.2019.00084},
+  issn = {2475-7888},
+  pages = {719--728},
+  shorttitle = {{{Language2Pose}}},
+  title = {{Language2Pose}: Natural Language Grounded Pose Forecasting},
+  url = {https://ieeexplore.ieee.org/document/8885540},
+  urldate = {2024-06-19},
+  year = {2019}
+}
+
+@inproceedings{ghosh2021SynthesisCompositionalAnimations,
+  author = {Ghosh, Anindita and Cheema, Noshaba and Oguz, Cennet and Theobalt, Christian and Slusallek, Philipp},
+  booktitle = {2021 {{IEEE}}/{{CVF International Conference}} on {{Computer Vision}} ({{ICCV}})},
+  doi = {10.1109/ICCV48922.2021.00143},
+  issn = {2380-7504},
+  pages = {1376--1386},
+  title = {Synthesis of Compositional Animations from Textual Descriptions},
+  url = {https://ieeexplore.ieee.org/document/9710802},
+  urldate = {2024-06-19},
+  year = {2021}
+}
+
+@inproceedings{petrovich2022TEMOSGeneratingDiverse,
+  address = {Cham},
+  author = {Petrovich, Mathis and Black, Michael J. and Varol, G{\"u}l},
+  booktitle = {Computer {{Vision}} -- {{ECCV}} 2022},
+  doi = {10.1007/978-3-031-20047-2_28},
+  editor = {Avidan, Shai and Brostow, Gabriel and Ciss{\'e}, Moustapha and Farinella, Giovanni Maria and Hassner, Tal},
+  isbn = {978-3-031-20047-2},
+  langid = {english},
+  pages = {480--497},
+  publisher = {Springer Nature Switzerland},
+  title = {{{TEMOS}}: Generating Diverse Human Motions from Textual Descriptions},
+  year = {2022}
+}