MQM-APE: Toward High-Quality Error Annotation Predictors with Automatic Post-Editing in LLM Translation Evaluators
This repository presents MQM-APE, an enhanced framework for leveraging LLMs in translation evaluation. We also report the performance of MQM-APE to support replication of the study.
Large Language Models (LLMs) have shown significant potential as judges for Machine Translation (MT) quality assessment, providing both scores and fine-grained feedback. Although approaches such as GEMBA-MQM have shown SOTA performance on reference-free evaluation, the predicted errors do not align well with those annotated by humans, limiting their interpretability as feedback signals. To enhance the quality of error annotations predicted by LLM evaluators, we introduce a universal and training-free framework, MQM-APE, built on the idea of filtering out non-impactful errors by Automatically Post-Editing (APE) the original translation according to each error, leaving only those errors that contribute to quality improvement. Specifically, we prompt the LLM to act as ① an evaluator that provides error annotations, ② a post-editor that determines whether errors impact quality improvement, and ③ a pairwise quality verifier that serves as the error filter. Experiments show that our approach consistently improves both the reliability and quality of error spans over GEMBA-MQM, across eight LLMs in both high- and low-resource languages. Orthogonal to trained approaches, MQM-APE complements translation-specific evaluators such as Tower, highlighting its broad applicability. Further analyses confirm the effectiveness of each module and offer valuable insights into evaluator design and LLM selection.
We employ MQM-APE by prompting the same LLM to perform multiple roles without fine-tuning for each task. MQM-APE evaluates a given translation $y$ with three sequential modules:

- **Error Analysis Evaluator**: identifies errors in $y$, providing error annotations $\mathcal{E}$ with error span, category, and severity;
- **Automatic Post-Editor**: post-edits $y$ based on each identified error $e_i \in \mathcal{E}$, producing a set of corrected translations $\mathcal{Y}_{pe}$;
- **Pairwise Quality Verifier**: checks whether each post-edited translation improves upon the original translation $y$, acting as the error filter.

Errors whose APE translation fails to improve on the original are discarded, leaving a refined set of errors $\mathcal{E}^* \subseteq \mathcal{E}$ that contribute to quality improvement. The translation is finally scored based on $\mathcal{E}^*$ following the MQM weighting scheme.
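For concreteness, here is a minimal Python sketch of this loop. It is an illustration only, not the code in ./MQM_APE/: the `Error` fields, the three prompt-wrapper callables, and the severity weights (a common MQM convention) are assumptions.

```python
from dataclasses import dataclass
from typing import Callable

# Assumed severity weights following a common MQM convention
# (minor = 1, major = 5, critical = 10); the exact scheme may differ.
SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 10}

@dataclass
class Error:
    span: str       # offending text in the translation
    category: str   # e.g. "accuracy/mistranslation"
    severity: str   # "minor" / "major" / "critical"

def mqm_ape(
    src: str,
    tgt: str,
    evaluate: Callable[[str, str], list[Error]],   # role 1: error analysis evaluator
    post_edit: Callable[[str, str, Error], str],   # role 2: automatic post-editor
    verify: Callable[[str, str, str], bool],       # role 3: pairwise quality verifier
) -> tuple[list[Error], float]:
    """Sketch of the MQM-APE loop; the three callables wrap prompts to the same LLM."""
    errors = evaluate(src, tgt)                    # annotate errors E
    kept = []
    for err in errors:
        ape = post_edit(src, tgt, err)             # correct tgt w.r.t. this error only
        if verify(src, tgt, ape):                  # keep error iff APE beats the original
            kept.append(err)
    # Score from the retained errors E* as a negative penalty sum.
    score = -float(sum(SEVERITY_WEIGHTS[e.severity] for e in kept))
    return kept, score
```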
Please refer to ./MQM_APE/run.sh for an example of using MQM-APE.
```bash
cd ./MQM_APE
python3 main.py \
    --config ./configs/llmconfig.yaml \
    --src ./test/srcs_zh.txt \
    --tgt ./test/tgts_en.txt \
    --srclang Chinese \
    --tgtlang English \
    --out ./test/outs/llm_verifier \
    [--metric_verifier] [--save_llm_response]
```
MQM-APE can be run in two ways, differing in the verifier module, which can use either an LLM or a metric (COMETKiwi in our experiments; a minimal sketch of a metric-based verifier is given after the parameter list). The parameters are as follows:

- `config`: Path to the configuration file, which specifies the LLM used for inference and the inference hyper-parameters for each module. Please refer to ./MQM_APE/configs/ for more details.
- `src`: Path to the source segments.
- `tgt`: Path to the target segments.
- `srclang`: Source language, such as "English" or "Chinese".
- `tgtlang`: Target language, such as "English" or "Chinese".
- `out`: Path for the output annotations, scores, and other information. An example output can be found in ./MQM_APE/test/outs/.
- `metric_verifier`: A boolean flag controlling whether COMETKiwi is used in place of the LLM verifier.
- `save_llm_response`: A boolean flag controlling whether to save the LLM responses from each module.
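To make the metric-based option concrete, below is a minimal sketch of such a verifier using the `unbabel-comet` package. This is an illustration under our own assumptions (checkpoint choice, function signature), not the repository's implementation behind `--metric_verifier`.

```python
# Sketch of a metric-based verifier: keep an error only if the translation
# post-edited for that error scores higher than the original under CometKiwi.
# Requires `pip install unbabel-comet`; the WMT22 CometKiwi checkpoint may
# require accepting its license on Hugging Face before download.
from comet import download_model, load_from_checkpoint

_model = load_from_checkpoint(download_model("Unbabel/wmt22-cometkiwi-da"))

def metric_verifier(src: str, tgt: str, ape: str) -> bool:
    """Return True if the post-edited translation improves on the original."""
    batch = [{"src": src, "mt": tgt}, {"src": src, "mt": ape}]
    # Set gpus=1 if a GPU is available; gpus=0 runs on CPU.
    scores = _model.predict(batch, batch_size=2, gpus=0).scores
    return scores[1] > scores[0]
```

A function like this could be passed as the `verify` callable in the earlier pipeline sketch, mirroring what the `--metric_verifier` option controls.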
MQM-APE is a training-free approach that improves upon GEMBA-MQM and complements training-dependent approaches such as Tower. It offers high-quality error annotations and post-edited translations.
| Error-based MT Evaluation | Fine-grained Feedback | Error Span Enhancement | Post-Edited Translation |
|---|---|---|---|
| **Training-Dependent Approaches** | | | |
| InstructScore (Xu et al., 2023) | ✔️ | ✔️ | ❌ |
| xCOMET (Guerreiro et al., 2023) | ✔️ | ✔️ | ❌ |
| LLMRefine (Xu et al., 2024) | ✔️ | ❌ | ✔️ |
| Tower (Alves et al., 2024) | ✔️ | ❌ | ✔️ |
| **Training-Free Approaches** | | | |
| GEMBA (Kocmi & Federmann, 2023) | ❌ | ❌ | ❌ |
| EAPrompt (Lu et al., 2024) | ✔️ | ❌ | ❌ |
| AutoMQM (Fernandes et al., 2023) | ✔️ | ❌ | ❌ |
| GEMBA-MQM (Kocmi & Federmann, 2023) | ✔️ | ❌ | ❌ |
| MQM-APE (this work) | ✔️ | ✔️ | ✔️ |
Table: Comparison of performance between GEMBA-MQM ("MQM") and MQM-APE on WMT22 with human-labeled MQM, evaluated using pairwise accuracy (%) at the system level, pairwise accuracy with tie calibration (%) at the segment level, and the span precision of all errors (SP) and of major errors (MP), respectively.
Building upon GEMBA-MQM, our proposed MQM-APE has the following advantages:

- **Better Reliability**: MQM-APE consistently enhances GEMBA-MQM at both the system and segment levels.
- **Better Interpretability**: MQM-APE obtains higher error annotation quality than GEMBA-MQM.
- **Evaluator Applicability**: MQM-APE complements MQM-based evaluators specifically trained for translation-related tasks.
- **Language Generalizability**: MQM-APE obtains consistent improvements for almost all tested LLMs in both high- and low-resource languages.
Based on our analysis, we provide a guide on how to select LLMs as translation evaluators. For instance, Mixtral-8x22b-inst is the optimal choice for evaluator reliability when adopting large-scale LLMs for quality assessment. Users can download the model off-the-shelf and perform MQM-APE evaluation directly.
| Aspect | Model Scale | LLM Selection |
|---|---|---|
| Reliability | ○ Small | Tower-13b-inst |
| Reliability | ● Large | Mixtral-8x22b-inst |
| Interpretability | ○ Small | Tower-13b-inst |
| Interpretability | ● Large | Llama3-70b-inst |
| Inference Cost | ○ Small | Qwen1.5-14b-chat |
| Inference Cost | ● Large | Qwen1.5-72b-chat |
Table: Performance of the Automatic Post-Editor measured with $CometKiwi_{22}^{QE}$ and $BLEURT_{20}$. "†" indicates that the metric difference ($\Delta$) has >95% estimated accuracy with humans (Kocmi et al., 2024). For segment-level comparison, we define Win as cases where both $CometKiwi_{22}^{QE}$ and $BLEURT_{20}$ rate the APE output higher than TGT, Lose as cases where both rate it lower, and Tie when the two metrics disagree.
Table: Comparison of the pairwise quality verifier's consistency with $CometKiwi_{22}^{QE}$ and $BLEURT_{20}$, which serve as ground truth.
Figure: Comparison between MQM-APE, a random error filter ("Random"), and GEMBA-MQM ("MQM") on segment-level performance.
Table: Analysis of per-segment inference cost across different LLMs for each module, presenting input and generated tokens separately.
A cost-reducing alternative when implementing MQM-APE is to replace the LLM verifier with metrics, achieving comparable performance.
Figure: Comparison between MQM-APE with an LLM verifier and with $CometKiwi_{22}^{QE}$ as a replacement, on segment-level performance.
Figure: (Upper) Average number of errors retained or discarded for each severity level with MQM-APE. (Lower) Distribution of error categories generated by the GEMBA-MQM ("MQM") evaluator, MQM-APE, discarded errors, and human-annotated MQM, respectively.
Please refer to our arXiv preprint for more details.
If you find this work helpful, please consider citing as follows:
```bibtex
@article{Lu2024MQMAPE,
  title={MQM-APE: Toward High-Quality Error Annotation Predictors with Automatic Post-Editing in LLM Translation Evaluators},
  author={Lu, Qingyu and Ding, Liang and Zhang, Kanjian and Zhang, Jinxia and Tao, Dacheng},
  journal={arXiv preprint arXiv:2409.14335},
  url={https://arxiv.org/pdf/2409.14335},
  year={2024}
}
```