Abstract
Large language models (LLMs) have demonstrated impressive capabilities in general scenarios, exhibiting a level of aptitude that approaches, and in some aspects even surpasses, human-level intelligence. Among their numerous skills, the translation abilities of LLMs have received considerable attention. Compared to typical machine translation that focuses solely on source-to-target mapping, LLM-based translation can potentially mimic the human translation process, which might take preparatory steps to ensure high-quality translation. This work explores this possibility by proposing the MAPS framework, which stands for Multi-Aspect Prompting and Selection. Specifically, we enable LLMs first to analyze the given source sentence and induce three aspects of translation-related knowledge (keywords, topics, and relevant demonstrations) to guide the final translation process. Moreover, we employ a selection mechanism based on quality estimation to filter out noisy and unhelpful knowledge. Both automatic evaluation (3 LLMs × 11 directions × 2 automatic metrics) and human evaluation (preference study and MQM) demonstrate the effectiveness of MAPS. Further analysis shows that by mimicking the human translation process, MAPS reduces various translation errors such as hallucination, ambiguity, mistranslation, awkward style, untranslated text, and omission. Source code is available at https://github.com/zwhe99/MAPS-mt.
1 Introduction
Large language models (LLMs) have recently demonstrated remarkable general capabilities across a wide range of tasks, making substantial strides toward artificial general intelligence. These capabilities have led to LLMs exhibiting a certain degree of human-level intelligence, particularly in the areas of language understanding and generation (Liang et al., 2022; Bubeck et al., 2023; Wu et al., 2023; Moghaddam and Honey, 2023). Among numerous tasks, translation has emerged as a prominent area where LLMs have shown impressive capacity and competence (Jiao et al., 2023b; Agrawal et al., 2023; Zhang et al., 2023a; Vilar et al., 2022; Moslem et al., 2023; Pilault et al., 2023; Garcia et al., 2023; Hendy et al., 2023; Zhu et al., 2023b; Jiao et al., 2023a; Wang et al., 2023b; Karpinska and Iyyer, 2023; Peng et al., 2023; Lyu et al., 2023; Bawden and Yvon, 2023; Lu et al., 2023). This progress harkens back to a long-standing aspiration of machine translation research since the 1960s (Bar-Hillel, 1960; Macklovitch, 1995): Can LLMs employ a translation process similar to that of human translators?
Figure 1 illustrates the difference between the processes of machine and human translation. While conventional machine translation is typically a direct source-to-target mapping process, professional human translators tend to take preparatory steps when working with the given source text, including gathering and meticulously analyzing information such as keywords, topics, and relevant example sentences (Baker, 2018; Koehn, 2009; Bowker, 2002; Hatim and Munday, 2004). These steps are critical for ensuring high-quality translations that accurately capture the nuances of the source material. Although recent advances in LLM research indicate that current LLMs are approaching human-like general intelligence (Bubeck et al., 2023; Park et al., 2023), the extent to which LLMs can emulate such strategies remains underexplored.
The primary focus of this paper is to explore whether LLMs can imitate the translation strategies employed by human translators. Specifically, we aim to investigate whether LLMs can effectively preprocess the source text and leverage the relevant knowledge to improve their translations.
To this end, we propose a method called MAPS, which stands for Multi-Aspect Prompting and Selection. MAPS prompts the LLMs to analyze the source sentence and elicit translation-related knowledge in three aspects: keywords, topics, and relevant demonstrations. This knowledge then guides the LLM toward generating more accurate translations. To further enhance translation quality, we also employ a post-selection process to filter out unhelpful knowledge and select the best translation based on reference-free quality estimation (QE). We validate our approach across 11 translation directions (covering high-, medium- and low-resource language pairs) from WMT22 (Kocmi et al., 2022) and 3 LLMs (text-davinci-003, Alpaca, and Vicuna). Automatic evaluation shows that MAPS achieves significant improvement over other baselines in terms of COMET and BLEURT. Further analysis emphasizes the importance of the extracted knowledge in resolving hallucination and ambiguity in translation. We also conduct human preference studies and Multidimensional Quality Metrics (MQM) evaluation (Burchardt, 2013) which show that MAPS produces more favorable translations by reducing mistranslation, awkward style, untranslated text, and omission errors.
In contrast to other LLM-based translation approaches, such as Dictionary-based Prompting (Ghazvininejad et al., 2023) and In-context Learning (ICL) (Agrawal et al., 2023), MAPS targets translation in general scenarios, without any preconceived assumptions about the translation domain. As a result, MAPS does not require the advance preparation of any external “datastore” for specific language pairs and domains, such as a meticulously constructed glossary (Moslem et al., 2023), dictionary (Ghazvininejad et al., 2023), or sample pool (Agrawal et al., 2023).
In summary, the contributions of this work are detailed as follows:
Inspired by human translation strategy, we propose the MAPS method, which mimics the human process of analyzing the source text to gather useful knowledge, ultimately leading to an accurate translation.
We demonstrate that the three types of translation-related knowledge (keywords, topics, and relevant demonstrations) complement each other. The best translation performance can be achieved by using all three types of knowledge simultaneously.
Our in-depth analyses of MAPS, encompassing both automatic and human evaluations, demonstrate its proficiency in resolving ambiguities and reducing hallucinations and other prevalent translation errors. Furthermore, we examine the inference time of MAPS and investigate potential acceleration techniques.
2 MAPS: Multi-Aspect Prompting and Selection
In this section, we introduce the MAPS framework. As depicted in Figure 2, MAPS consists of three steps—knowledge mining, integration, and selection. When mining knowledge, the LLM operates in a manner akin to a human translator, analyzing the source text and generating background knowledge beneficial to the translation. The acquired knowledge is then integrated as contextual guidance, enabling the LLM to produce translation candidates. However, the generated knowledge may contain noise (see §4.3 for further analysis). As a result, a filtering mechanism becomes necessary to select useful knowledge and filter out unhelpful or noisy content.
2.1 Knowledge Mining
Akin to the initial understanding and interpretation phase that human translators undertake (Gile, 2009), the knowledge mining step requires the LLM first to analyze the source text and elicit three aspects of knowledge generally beneficial to translation (a prompt sketch follows the list below):
Keywords are essential words or phrases that convey the core meaning of a text and act as focal points for understanding the main idea. Accurate translation of keywords is crucial for conveying the intended meaning and ensuring faithfulness (Baker, 2018; Koehn, 2009). Additionally, identifying and maintaining a list of keywords guarantees that specific terms are translated consistently across different parts of the text.
Topic refers to the overall subject or theme being discussed. A keen awareness of the topic helps translators sidestep potential issues arising from ambiguity, such as mistranslations or misinterpretations (Bowker, 2002). It is important to highlight that topics are generally more specific than the broader domains that have been widely discussed within the machine translation community. For example, while the news domain encompasses a wide range of subjects, subcategories like political news and entertainment news should adopt different registers and tones.
Demonstrations, or example sentences, illustrate how comparable sentences can be translated accurately. They assist translators in identifying appropriate equivalents within the target language, enabling them to produce translations that read naturally and fluently to native speakers (Hatim and Munday, 2004).
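To make the knowledge-mining step concrete, the following is a minimal sketch of the three elicitation prompts. The exact wording here is illustrative rather than the released version (see the repository), and `query_llm` is a placeholder for any LLM completion API:

```python
# Minimal sketch of the three knowledge-mining prompts. Wording is
# illustrative; `query_llm` is a placeholder for any completion API.
KNOWLEDGE_PROMPTS = {
    "keyword": (
        "Extract the keywords in the following {src_lang} sentence and "
        "translate each of them into {tgt_lang}.\n"
        "Sentence: {src}\nKeyword pairs:"
    ),
    "topic": (
        "Use a few words to describe the topics of the following "
        "{src_lang} sentence.\nSentence: {src}\nTopics:"
    ),
    "demo": (
        "Write a {src_lang} sentence related to but different from the "
        "following one, then translate it into {tgt_lang}.\n"
        "Sentence: {src}\nRelated sentence pair:"
    ),
}

def mine_knowledge(src, src_lang, tgt_lang, query_llm):
    """Elicit the three aspects of translation-related knowledge."""
    return {
        aspect: query_llm(tmpl.format(src=src, src_lang=src_lang,
                                      tgt_lang=tgt_lang))
        for aspect, tmpl in KNOWLEDGE_PROMPTS.items()
    }
```

In practice, we prepend few-shot exemplars to each prompt to stabilize the response format (see Notes).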
2.2 Knowledge Integration
Just as human translators weave their understanding of the source text into their translations (Pym, 2014), knowledge integration embeds the acquired knowledge into the context (Step 2 in Figure 2) and enables the LLM to utilize this information to generate multiple translation candidates. We obtain four candidates in total: three generated under the guidance of each aspect of knowledge, plus one that the LLM generates without guidance from any external knowledge.
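A minimal sketch of this step, reusing the `mine_knowledge` output from the §2.1 sketch; the prompt wording is again illustrative:

```python
# Sketch of knowledge integration: each aspect of knowledge is embedded in
# the prompt as context, yielding three guided candidates; the unguided
# baseline translation makes the fourth. Wording is illustrative.
BASE_PROMPT = (
    "Translate the following {src_lang} sentence into {tgt_lang}.\n"
    "Sentence: {src}\nTranslation:"
)
GUIDED_PROMPT = (
    "{aspect}: {knowledge}\n"
    "Given the above knowledge, translate the following {src_lang} "
    "sentence into {tgt_lang}.\nSentence: {src}\nTranslation:"
)

def generate_candidates(src, src_lang, tgt_lang, knowledge, query_llm):
    fmt = dict(src=src, src_lang=src_lang, tgt_lang=tgt_lang)
    candidates = [query_llm(BASE_PROMPT.format(**fmt))]  # unguided baseline
    for aspect, content in knowledge.items():
        candidates.append(query_llm(
            GUIDED_PROMPT.format(aspect=aspect, knowledge=content, **fmt)))
    return candidates  # four candidates: 1 baseline + 3 knowledge-guided
```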
2.3 Knowledge Selection
Knowledge selection resembles the final decision-making phase in human translation, where the best translation of the source text is chosen based on the context. Although keywords, topics, and relevant demonstrations generally benefit translation, not all the LLM-generated knowledge is helpful. For example, LLMs may generate trivial or noisy content that distracts the translation process (Shi et al., 2023; Agrawal et al., 2023). Our quantitative experiments in §4.3 support this hypothesis. Therefore, we employ a filtering mechanism to select the most useful knowledge and filter out the unhelpful or noisy items. Specifically, we adopt quality estimation (QE) to select the best candidate as the final output (Step 3 in Figure 2). The selection method is flexible: both an externally trained QE model and the LLM itself acting as the QE scorer prove effective in our experiments.
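As an illustration, below is a minimal sketch of QE-based selection with the unbabel-comet package. The Prediction-style API corresponds to comet>=2.0, and the checkpoint name is an assumption; substitute whichever reference-free QE checkpoint you use:

```python
# Sketch of QE-based candidate selection. The checkpoint name is an
# assumption; any reference-free QE scorer with this interface works.
from comet import download_model, load_from_checkpoint

qe_model = load_from_checkpoint(download_model("Unbabel/wmt20-comet-qe-da"))

def select_best(src, candidates):
    data = [{"src": src, "mt": mt} for mt in candidates]
    scores = qe_model.predict(data, batch_size=8, gpus=0).scores
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best]
```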
3 Experiments
3.1 Experimental Setup
Models.
We adopt three LLMs, encompassing both closed- and open-source models.
• text-davinci-003: A strong yet closed-source LLM developed by OpenAI, which employs advanced Reinforcement Learning with Human Feedback (RLHF) techniques (Ouyang et al., 2022). We query it via the official API.
• Alpaca (Taori et al., 2023): An open-source and instruction-following LLM fine-tuned on LLaMA model (Touvron et al., 2023a) with 52K Self-Instruct (Wang et al., 2022b) data.
• Vicuna (Chiang et al., 2023): An open-source and instruction-following LLM fine-tuned on LLaMA-2 (Touvron et al., 2023b) with user-shared conversations collected from ShareGPT (ShareGPT, 2023).
For both Alpaca and Vicuna, we use the 7B version and perform inference on a single NVIDIA V100 32GB GPU.
Comparative Methods.
For a rigorous comparison, we consider several variants, including single-candidate and multi-candidate methods. Within single-candidate methods, we consider:
Baseline: Standard zero-shot translation with temperature set to 0 (default value in this work).
5-Shot (Hendy et al., 2023): Five high-quality labeled examples from the training data are prepended to the test input. This setting performs best overall in Hendy et al. (2023), where increasing the number of examples did not yield meaningful further improvement. This method requires meticulous construction of training data for each translation direction, including collecting, cleaning, and sorting by quality.
Within multi-candidate methods, we consider:
Rerank: Using the same prompt as the Baseline, but with the temperature set to 0.3 (following Moslem et al., 2023). We randomly sample three candidates and add the Baseline output to form four candidates. The best candidate is selected through QE. This can be considered a pure reranking method without any guidance from extracted knowledge (Fernandes et al., 2022).
MAPS: Our proposed method described in Section 2. Three translation candidates are generated with guidance from three aspects of knowledge. Combined with the Baseline, the best one is selected using QE.
Knowledge Selection Methods.
LLM-SCQ: Composing a single-choice question (SCQ) that asks the LLM to choose the best candidate on its own (a sketch follows this list).
Comet-QE: A trained QE scorer that assigns a numerical score to each candidate. Selection is based on the highest score.
Comet (oracle): A reference-based scorer that assigns a numerical score to each candidate. It can be considered as the oracle QE method, representing the upper bound of selection.
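Below is a minimal sketch of the LLM-SCQ selector; the question wording and answer parsing are illustrative assumptions:

```python
# Sketch of LLM-SCQ: the LLM itself answers a single-choice question to
# pick the best candidate. Prompt wording and parsing are illustrative.
import string

def select_by_scq(src, candidates, query_llm):
    options = "\n".join(f"({c}) {cand}" for c, cand
                        in zip(string.ascii_uppercase, candidates))
    prompt = (f"Source: {src}\n"
              f"Which of the following is the best translation of the "
              f"source sentence?\n{options}\n"
              f"Answer with a single letter:")
    answer = query_llm(prompt).strip().upper()
    for c, cand in zip(string.ascii_uppercase, candidates):
        if answer.startswith(c):
            return cand
    return candidates[0]  # fall back to the first (baseline) candidate
```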
Test Data.
To avoid data leakage issues (Bubeck et al., 2023; Garcia et al., 2023; Zhu et al., 2023b), we use the latest WMT22 test sets, covering 11 translation directions at different resource levels (English ⇔ Chinese, English ⇔ German, English ⇔ Japanese, German ⇔ French, Ukrainian ⇔ Czech, and English ⇒ Croatian). Unlike previous years, which tested only on the news domain, WMT22 shifts its focus to general scenarios covering the news, social, conversational, and e-commerce domains (Kocmi et al., 2022).
Metrics.
We adopt COMET (Rei et al., 2022a) and BLEURT (Sellam et al., 2020) as the main metrics. These neural-based learned metrics show superiority over string-based metrics like BLEU (Kocmi et al., 2021; Bawden and Yvon, 2023) and have been adopted broadly by LLM-based translation literature (Moslem et al., 2023; Hendy et al., 2023; Garcia et al., 2023; Pilault et al., 2023). We use wmt22-comet-da and BLEURT-20 checkpoints for these two metrics.
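For reproducibility, a sketch of the metric computation with these checkpoints is given below; the API calls follow the public unbabel-comet (>=2.0) and bleurt packages, and "BLEURT-20" assumes a locally downloaded checkpoint directory:

```python
# Sketch of COMET / BLEURT scoring with the checkpoints named above.
# "BLEURT-20" must point to the unzipped checkpoint directory.
from comet import download_model, load_from_checkpoint
from bleurt import score as bleurt_score

comet_model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
bleurt_scorer = bleurt_score.BleurtScorer("BLEURT-20")

def evaluate(srcs, hyps, refs):
    comet_out = comet_model.predict(
        [{"src": s, "mt": h, "ref": r} for s, h, r in zip(srcs, hyps, refs)],
        batch_size=8, gpus=0)
    bleurt_scores = bleurt_scorer.score(references=refs, candidates=hyps)
    return comet_out.system_score, sum(bleurt_scores) / len(bleurt_scores)
```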
3.2 Results
For consistency, we compare different methods only under the same LLM. As presented in Table 1, MAPS is broadly effective and exhibits a higher upper bound. In detail, we make the following observations:
• The effectiveness of MAPS has been validated across a wide range of settings. Across 11 language pairs, 3 LLMs, and 2 metrics, MAPS consistently outperforms Rerank and Baseline. After employing MAPSComet-QE, text-davinci-003 surpasses the best submissions in WMT22 in 5 out of the 11 translation directions. This suggests that LLMs can enhance translation quality by emulating the human strategy of analyzing before translating.
• MAPS outperforms Rerank consistently when the knowledge selection method is held constant. This indicates that the improvements brought by MAPS stem from the three types of translation-related knowledge: keywords, topics, and relevant demonstrations. We examine the utilization of the different types of knowledge and present an ablation study in §4.2.
• Different knowledge selection methods can affect the final performance, and MAPS exhibits a higher upper bound for selection. When using LLM-SCQ, the performance of MAPS is on par with 5-Shot (MAPSLLM-SCQ≈5-Shot); when using Comet-QE, MAPS consistently outperforms 5-Shot (MAPSComet-QE >5-Shot). More importantly, MAPS shows higher upper bounds for selection than Rerank (MAPSComet > RerankComet), implying that superior knowledge selection methods like a better QE model (Rei et al., 2022b), AutoMQM (Fernandes et al., 2023) or ranking strategy (Fernandes et al., 2022) can further improve MAPS.
4 Analysis
In this section, we conduct analyses to understand the MAPS framework. Unless otherwise specified, MAPSComet-QE, text-davinci-003, and WMT22 En-Zh are the default method, model, and language pair, respectively.
4.1 Human Evaluation
Preference Study.
We perform human preference studies on the En⇔Zh test sets. For each test sample, our annotators (professional translators) were presented with a source sentence and two translations. They were then tasked with selecting the superior translation or determining that neither was better than the other. Figure 3 shows the results of the human preference studies: MAPS is generally preferred by humans.
MQM Evaluation.
To understand which aspects of translation MAPS improves, we carried out MQM evaluations (Burchardt, 2013). MQM requires annotators to identify the errors in a translation and label the category and severity level of each error. Based on the weights of the different error types, MQM produces a penalty score. We followed the assessment method in Freitag et al. (2021), including the annotator guidelines, error categories, severity levels, and error weighting. We employed professional translators with MQM experience as annotators. For cost reasons, we evaluated the first 1K samples of the Chinese⇔English test sets. Table 2 shows that MAPS significantly outperforms Baseline and Rerank. In terms of error categories, the improvements brought about by MAPS mainly lie in the reduction of mistranslation, awkward style, untranslated text, and omission errors, as presented in Figure 4.
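For illustration, a minimal sketch of the resulting penalty computation, assuming the error weighting reported by Freitag et al. (2021) (Non-translation = 25, Major = 5, Minor = 1, minor Fluency/Punctuation = 0.1):

```python
# Sketch of the MQM penalty, assuming the weights of Freitag et al. (2021).
def mqm_penalty(errors):
    """`errors`: list of (category, severity) labels for one translation."""
    penalty = 0.0
    for category, severity in errors:
        if category == "Non-translation":
            penalty += 25
        elif severity == "Major":
            penalty += 5
        elif category == "Fluency/Punctuation":  # minor punctuation error
            penalty += 0.1
        else:                                    # any other minor error
            penalty += 1
    return penalty

# e.g., one major mistranslation plus one minor omission -> penalty of 6
assert mqm_penalty([("Accuracy/Mistranslation", "Major"),
                    ("Accuracy/Omission", "Minor")]) == 6
```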
4.2 Utilization of Knowledge
Although Table 1 reports the overall performance of MAPS, the utilization of the three aspects of knowledge remains unclear. For instance, it is uncertain whether the majority of samples rely on relevant demonstrations rather than keywords and topics to guide the translation process. To provide further insight, we illustrate the utilization of three types of knowledge in Figure 5. We additionally present the performance differences among these three aspects of knowledge when applied to different subsets, relative to the baseline. Figure 5 reveals a relatively balanced utilization among them. This implies that the three types of knowledge complement each other well within the MAPS framework. The ablation study presented in Table 3 further demonstrates the effectiveness of each type of knowledge. Replacing any knowledge-guided translation with random sampling leads to performance degradation.
| Method | COMET | BLEURT |
| --- | --- | --- |
| Rerank | 87.0 | 71.8 |
| MAPS | 87.6 | 72.6 |
| – w/o Keyword | 87.1 (↓0.5) | 72.1 (↓0.5) |
| – w/o Topic | 87.2 (↓0.4) | 72.4 (↓0.2) |
| – w/o Demo | 86.9 (↓0.7) | 72.0 (↓0.6) |
In Figure 5, we also note that the three types of knowledge cause different degrees of performance degradation when applied to the Base subset. We conjecture that the knowledge elicited from the LLM is not always helpful and may even be noisy. This finding motivates the knowledge selection step and is discussed in detail in §4.3.
4.3 Noise in Elicited Knowledge
The statistical results in Table 4 show that: (1) although most LLM-generated keywords appear in the source sentences, only about half of them appear in the target sentences (55.8% for En-Zh; 41.8% for Zh-En); (2) the LLM strictly follows the given keyword pairs when performing translations (97.1% for En-Zh; 89.5% for Zh-En).
| Direction | Psrc | Ptgt | R |
| --- | --- | --- | --- |
| En-Zh | 98.8 | 55.8 | 97.1 |
| Zh-En | 99.2 | 41.8 | 89.5 |
Combining the above two observations, we conclude that the LLM-generated knowledge contains a certain degree of noise (at least content that is not consistent with the reference), which can easily mislead the translation process. This explains why incorporating that knowledge has negative effects on the “Base” subset of Figure 5. Hence, knowledge selection is a crucial step in the MAPS framework for reducing the impact of noise.
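A sketch of how such statistics can be computed is given below; the column definitions reflect our reading of Table 4 (Psrc / Ptgt: keywords found in the source / reference; R: target-side keywords followed in the translation), with substring matching as in the Notes:

```python
# Sketch of the keyword statistics (in %), under our reading of Table 4.
def keyword_stats(examples):
    """`examples`: dicts with keys src, ref, hyp and the LLM-generated
    keyword `pairs` as (src_keyword, tgt_keyword) tuples."""
    n = p_src = p_tgt = r = 0
    for ex in examples:
        for k_src, k_tgt in ex["pairs"]:
            n += 1
            p_src += k_src in ex["src"]   # keyword appears in source
            p_tgt += k_tgt in ex["ref"]   # keyword appears in reference
            r += k_tgt in ex["hyp"]       # translation follows the keyword
    return {name: 100.0 * v / n
            for name, v in [("Psrc", p_src), ("Ptgt", p_tgt), ("R", r)]}
```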
4.4 MAPS Helps Ambiguity Resolution
Ambiguity resolution has long been one of the most challenging problems in machine translation. To evaluate the ambiguity resolution capability of machine translation systems, He et al. (2020) provide a lexical ambiguity test set for Chinese→English. The hard part of this test set involves Chinese sentences that are difficult to translate correctly unless the translator resolves their ambiguities. Our test results in Table 5 show the superiority of MAPS in ambiguity resolution, where “accuracy” indicates the percentage of successfully disambiguated sentences (evaluated by humans).
4.5 MAPS Reduces LLMs’ Hallucinations
The hallucination issue in natural language generation (NLG) refers to the phenomenon where the content generated by the model is nonsensical or unfaithful to the provided source content (Ji et al., 2023; Filippova, 2020; Maynez et al., 2020; Parikh et al., 2020; Zhou et al., 2021; He et al., 2022). It has been one of the key challenges for LLMs (Zhang et al., 2023c). In this section, we analyze the phenomenon of hallucination through automatic and human evaluation.
In automatic evaluation, we use the hallucination detector provided by Zhou et al. (2021) to identify token-level hallucinations in Alpaca’s translations on the Chinese→English test set. The detector assigns a binary label to each generated token. In Table 6, MAPS outperforms Rerank and demonstrates a higher upper bound.
| Method | Δ% hallucinations |
| --- | --- |
| Baseline | – |
| RerankComet-QE | −3% |
| MAPSComet-QE | −8% |
| RerankComet | −6% |
| MAPSComet | −12% |
In human evaluation, we employed professional human translators to label the hallucination errors of both MAPS and Rerank. We sampled 500 sentences from each of the English⇔Chinese test sets and evaluated text-davinci-003, Alpaca, and Vicuna. The human annotators were required to decide whether a translation contains hallucination errors, following the definition from Guerreiro et al. (2023b). The results in Figure 6 show that MAPS outperforms Rerank by a notable margin in reducing hallucination.
We conjecture that a key difference between MAPS and Rerank is that MAPS can reshape the probability distribution of next-token prediction, whereas Rerank cannot. If a hallucinatory token occupies a high probability mass (Wang et al., 2022a), it is difficult for Rerank to avoid selecting this token through diverse sampling alone. In contrast, by providing additional translation-related knowledge in the prompt, MAPS enables the model to redistribute the probability of the next token, offering more opportunities to avoid the hallucinatory token.
4.6 Three-in-One Prompting
So far, we have discussed the case where the LLM uses the three types of knowledge separately. An immediate question is how the LLM would perform if the three types of knowledge were integrated into one prompt. We call this method three-in-one prompting and present results in Table 7.
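A minimal sketch of three-in-one prompting, with illustrative wording and the knowledge dictionary from the §2.1 sketch:

```python
# Sketch of three-in-one prompting: all three aspects of knowledge are
# concatenated into a single context. Wording is illustrative.
THREE_IN_ONE_PROMPT = (
    "Keyword pairs: {keyword}\n"
    "Topics: {topic}\n"
    "Related sentence pair: {demo}\n"
    "Given the above knowledge, translate the following {src_lang} "
    "sentence into {tgt_lang}.\nSentence: {src}\nTranslation:"
)

def three_in_one(src, src_lang, tgt_lang, knowledge, query_llm):
    return query_llm(THREE_IN_ONE_PROMPT.format(
        src=src, src_lang=src_lang, tgt_lang=tgt_lang, **knowledge))
```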
Within single-candidate methods (Baseline vs. Three-in-One), three-in-one prompting brings positive results overall, which means that the LLM can use the three types of knowledge simultaneously. However, the degree of improvement varies significantly across language pairs, with a notable absence of effect in De-En translation. Regarding multi-candidate methods (MAPSComet-QE vs. MAPS), incorporating three-in-one prompting into MAPS yields only marginal improvements (≤0.2). This result is expected, considering that the candidate set generated by three-in-one prompting overlaps significantly with the candidate sets generated individually by the three types of knowledge.
5 Related Work
5.1 LLMs for Translation
Research evaluating the translation capabilities of LLMs falls into two main lines. The first line involves issues specific to LLMs, including the impact of demonstration selection in ICL (Vilar et al., 2022; Zhang et al., 2023a; Garcia et al., 2023) and prompt templates (Zhang et al., 2023a; Jiao et al., 2023b) on translation performance. The second line focuses on comprehensive evaluations of LLMs under various translation scenarios, covering multilingual (Jiao et al., 2023b; Zhu et al., 2023b; Hendy et al., 2023), document-level (Hendy et al., 2023; Wang et al., 2023b; Karpinska and Iyyer, 2023), and low-resource translation (Jiao et al., 2023b; Garcia et al., 2023; Zhu et al., 2023b; Bawden and Yvon, 2023), as well as robustness (Jiao et al., 2023b), hallucination (Guerreiro et al., 2023a), and domain adaptation (Hendy et al., 2023; Wang et al., 2023a). Our work evaluates the translation capabilities of LLMs across eleven translation directions, ranging over same-family (En⇔De), distant (En⇔Ja, En⇔Zh), non-English-centric (De⇔Fr), and low-resource (Cs⇔Uk, En⇒Hr) language pairs. Zhu et al. (2023b) emphasize the risk of data leakage; therefore, we adopt the latest WMT22 test sets. Our work also quantitatively evaluates ambiguity resolution and token-/sentence-level hallucination in LLM-based translation.
Jiao et al. (2023a) incorporate human evaluation into instruction data for training, resulting in translations that are preferred by humans during interactive chat sessions. In contrast, our work takes a different approach by mimicking the human translation process and achieves higher-quality translations without training.
Agrawal et al. (2023) propose an algorithm based on n-gram recall for demonstration selection. Given the ground-truth context, Pilault et al. (2023) introduce an interactive-chain prompting method for ambiguity resolution. Moslem et al. (2023) suggest prompting the LLMs with terminology hints extracted from the selected demonstrations or a compiled glossary for domain-specific translation such as COVID-19. Concurrently, Ghazvininejad et al. (2023) and Lu et al. (2023) use external dictionaries to augment prompts for low-resource and domain-specific translation. While our work can be viewed as a form of “prompting strategy”, it differs from this line of research in that it does not rely on any external “datastore”, such as sample pools, dictionaries, or ground-truth context, which should be curated carefully for specified language pairs or domains. In contrast, we consider the LLM itself as a “datastore” containing broad knowledge that can assist its translation process.
5.2 Chain-of-Thought Prompting
Wei et al. (2022) explore how chain-of-thought (CoT) prompting improves the ability of LLMs to perform complex reasoning such as arithmetic reasoning, commonsense reasoning, and symbolic reasoning. By guiding LLMs through generating intermediate reasoning chains prior to reaching a final solution, CoT prompting has propelled the multi-step reasoning abilities of LLMs to an extraordinary level, as substantiated by previous research (Wei et al., 2022; Wang et al., 2023c). CoT prompting manifests through two distinct paradigms, namely, zero-shot CoT (Kojima et al., 2023; Yang et al., 2023) and few-shot CoT (Wei et al., 2023; Zhang et al., 2023b). Zero-shot CoT simply appends a trigger prompt such as Let’s think step by step after the test question, with the motivation to harness the step-by-step reasoning capacities of LLMs in a zero-shot manner. Few-shot CoT operates by utilizing a few input-output demonstrations, each of which comprises a question, a reasoning chain, and the corresponding answer. These demonstrations are seamlessly integrated before the test question, resulting in a prompted input that is subsequently processed by an LLM to deduce the answer.
So far, most CoT prompting studies focus on complex reasoning problems. There have been a few preliminary attempts to extend CoT prompting techniques to machine translation, but Peng et al. (2023) find that straightforwardly applying CoT to translation results in word-by-word translations, which are less than satisfactory. Following this line, our work can also be viewed as a form of CoT prompting for translation, as it dissects the translation process into distinct steps; to the best of our knowledge, it is the first successful application of CoT to translation tasks. Notably, our work achieves improved translation performance by inducing three aspects of translation-related knowledge, namely keywords, topics, and relevant demonstrations, to guide the final translation process.
5.3 Self-Prompting
Self-prompting is a line of research that utilizes the LLMs to prompt themselves and extract relevant knowledge to aid downstream tasks (Li et al., 2022; Wang et al., 2023d). Diverging from CoT prompting, which focuses on providing intermediate reasoning steps on the output side, self-prompting techniques dissect the input problem into specific sub-problems on the input side and extract the salient knowledge for the sub-problems one by one. This extracted knowledge is then utilized to deduce the ultimate solution.
Several studies exemplify the diversity of self-prompting applications. Specifically, Kim et al. (2022) and Li et al. (2022) use the LLMs to generate in-context exemplars for text classification and open-domain question answering, respectively. Yu et al. (2023) generate diverse documents from the LLMs to improve knowledge-intensive tasks. Wang et al. (2023d) compel LLMs to first extract the core elements for news texts, such as entity, date, event, and result. Then, the extracted elements are used to generate summaries. Further innovations emerge in multimedia contexts. Zhu et al. (2023a) and Chen et al. (2023) empower LLMs to pose inquiries regarding provided images and videos to enrich the caption. Remarkably, MAPS extends the domain of self-prompting into machine translation for the first time.
6 Conclusion
This work introduces MAPS, a method that enables LLMs to mimic human translation strategy for achieving high-quality translation. MAPS allows LLMs to take preparatory steps before translation. Specifically, LLMs analyze the given source text and generate three aspects of translation-related knowledge: keywords, topics, and relevant demonstrations. Using a filtering mechanism based on quality estimation, the selected knowledge guides the LLMs’ translation process. In experiments with text-davinci-003, Alpaca, and Vicuna, MAPS yields significant and consistent improvements across eleven translation directions from WMT22 and exhibits a higher upper bound of candidate selection. Human evaluations show that MAPS provides more favorable translations by reducing mistranslation, awkward style, untranslated text, and omission errors. Further analyses show that MAPS effectively resolves ambiguities and hallucinations in translation. Future work includes designing more aspects of translation-related knowledge and better filtering mechanisms to improve the translation capabilities of LLMs further. Another interesting direction is to explore the human-like translation strategy in training LLMs (e.g., instruction tuning).
7 Discussion
7.1 Inference Time
Since MAPS consists of three sequential stages, the main limitation of MAPS lies in inference time. As shown in Figure 7, when processing serially, the inference times of Three-in-One, Rerank, and MAPS are 3×, 11×, and 14× the Baseline, respectively.
Given that all three methods involve processing multiple types of knowledge or candidates without any dependencies between them, a practical approach to acceleration is to process these calls in parallel, which drastically reduces the running times of Three-in-One, Rerank, and MAPS to an acceptable level.
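As a sketch, the independent calls can be parallelized with a thread pool (LLM API calls are I/O-bound, so threads suffice); `KNOWLEDGE_PROMPTS` and `query_llm` refer to the §2.1 sketch:

```python
# Sketch of parallel knowledge mining; the three elicitation calls are
# independent and can be issued concurrently.
from concurrent.futures import ThreadPoolExecutor

def mine_knowledge_parallel(src, src_lang, tgt_lang, query_llm):
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = {
            aspect: pool.submit(query_llm, tmpl.format(
                src=src, src_lang=src_lang, tgt_lang=tgt_lang))
            for aspect, tmpl in KNOWLEDGE_PROMPTS.items()
        }
        return {aspect: f.result() for aspect, f in futures.items()}
```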
The additional overhead from MAPS is mainly in the knowledge mining phase, where the LLM generates three types of knowledge separately. One possible acceleration is to have the LLM generate three types of knowledge in a single call. By controlling the format of the output, e.g., JSON, we can extract each type of knowledge. However, the LLM is not guaranteed to output valid JSON content, which may lead to degradation of the final translation performance (see Table 8).
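A sketch of this single-call variant with a fallback for invalid JSON (the failure mode behind the “JSON Err.” column in Table 8); the prompt wording is illustrative:

```python
# Sketch of single-call knowledge mining via a JSON-formatted response.
import json

JSON_PROMPT = (
    'Analyze the following {src_lang} sentence for translation into '
    '{tgt_lang}. Reply with a JSON object with the keys "keyword", '
    '"topic", and "demo".\nSentence: {src}\nJSON:'
)

def mine_knowledge_single_call(src, src_lang, tgt_lang, query_llm):
    raw = query_llm(JSON_PROMPT.format(src=src, src_lang=src_lang,
                                       tgt_lang=tgt_lang))
    try:
        parsed = json.loads(raw)
        return {k: str(parsed.get(k, "")) for k in ("keyword", "topic", "demo")}
    except json.JSONDecodeError:
        return None  # invalid JSON: caller falls back to the baseline
```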
| Method | COMET (En-Zh) | BLEURT (En-Zh) | JSON Err. (En-Zh) | COMET (Zh-En) | BLEURT (Zh-En) | JSON Err. (Zh-En) |
| --- | --- | --- | --- | --- | --- | --- |
| MAPS | 87.6 | 72.6 | — | 82.6 | 70.8 | — |
| MAPSJSON | 87.7 | 72.6 | 0.1% | 82.1 (↓) | 70.3 (↓) | 2.0% |
In addition, the running time of the QE scoring can be reduced by techniques such as model quantization or compression.
7.2 Is MAPS Overfitting Evaluation Metrics?
In this work, we rely on COMET and BLEURT for automatic evaluation because of their strong alignment with human evaluation, as highlighted by Freitag et al. (2022). We also use Comet-QE as one of the knowledge selection methods, whose training data overlaps with that of the evaluation metrics. This leads to a pertinent question: Is MAPS merely overfitting to COMET and BLEURT?
To ensure reliable evaluations, we integrated human assessments into all our experiments, including: MQM evaluation (§4.1), human preference studies (§4.1), ambiguity resolution (§4.4), and analysis of hallucination (§4.5). These evaluations substantiate MAPS’s effectiveness from the viewpoint of human translators.
Furthermore, we demonstrate that MAPS remains effective even in the absence of COMET-QE. As shown in Table 1, by formulating single-choice questions, the LLM itself can select the best translation candidates (RerankLLM-SCQ and MAPSLLM-SCQ).
From a data perspective, all three scoring models (Comet-QE, COMET, and BLEURT) were trained on datasets from WMT. However, they use the data in different ways. Comet-QE is reference-free and does not use reference data during training or inference. In contrast, COMET and BLEURT are reference-based, and both their training and inference rely on reference data. This difference allows COMET and BLEURT to penalize translation errors against a reference, a function that Comet-QE lacks due to its reference-free design (see Table 9).
Overall, MAPS is broadly effective because it employs a human-like translation strategy, rather than because it overfits particular evaluation metrics.
Acknowledgments
Zhiwei and Rui are with MT-Lab, Department of Computer Science and Engineering, School of Electronic Information and Electrical Engineering, and also with the MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai 200204, China. Rui and Zhiwei are supported by the Tencent Open Fund (RBFR2023012), the National Natural Science Foundation of China (62176153), and the Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102).
We are grateful to the action editor and reviewers, whose insightful suggestions and exceptionally prompt feedback significantly enhanced the quality of our manuscript.
Notes
1. To ensure a uniform response format, we manually constructed 5-shot exemplars for each kind of knowledge.
2. For simplicity’s sake, we use subset notation to represent the substring relationship.
Author notes
Rui Wang and Zhaopeng Tu are co-corresponding authors.
Action Editor: David Chiang