We present a survey covering the state of the art in low-resource machine translation (MT) research. There are currently around 7,000 languages spoken in the world and almost all language pairs lack significant resources for training machine translation models. There has been increasing interest in research addressing the challenge of producing useful translation models when very little translated training data is available. We present a summary of this topical research field and provide a description of the techniques evaluated by researchers in several recent shared tasks in low-resource MT.

Current machine translation (MT) systems have reached the stage where researchers are now debating whether or not they can rival human translators in performance (Hassan et al. 2018; Läubli, Sennrich, and Volk 2018; Toral et al. 2018; Popel et al. 2020). However, these MT systems are typically trained on data sets consisting of tens or even hundreds of millions of parallel sentences. Data sets of this magnitude are only available for a small number of highly resourced language pairs (typically English paired with some European languages, Arabic, and Chinese). The reality is that for the vast majority of language pairs in the world the amount of data available is extremely limited, or simply non-existent.

In Table 1 we select three language pairs that display a range of resource levels. We show the estimated number of first-language speakers of the non-English language, together with the number of parallel sentences available in Opus (Tiedemann 2012), the largest collection of publicly available translated data. Although there is typically a correlation between speaker numbers and the size of available resources, there are many exceptions, where either widely spoken languages have little parallel data or languages with very small speaker numbers are richly resourced (mainly European languages). For one of the world’s most spoken languages, French, there are nearly 280 million parallel sentences of English–French in Opus. However, when we search for English–Myanmar, we find only around 700,000 parallel sentences, despite Myanmar having tens of millions of speakers. If we consider Fon, which has around 2 million speakers, then we find far fewer parallel sentences: only 35,000. Developing MT systems for these three language pairs will require very different techniques.

Table 1

Examples of language pairs with different levels of resources. The number of speakers is obtained from Ethnologue, and the parallel sentence counts are from Opus.

| Language Pair | Speakers (approx.) | Parallel Sentences |
| --- | --- | --- |
| English–French | 267M | 280M |
| English–Myanmar | 30M | 0.7M |
| English–Fon | 2M | 0.035M |

The lack of parallel data for most language pairs is only one part of the problem. Existing data is often noisy or from a very specialized domain. Looking at the resources that are available for Fon–English, we see that the only corpus available in Opus is extracted from Jehovah’s Witness publications (Agić and Vulić 2019, JW300). For many language pairs, the only corpora available are those derived from religious sources (e.g., Bible, Quran) or from IT localization data (e.g., from open-source projects such as GNOME and Kubuntu). Not only is such data likely to be in a very different domain from the text that we would like to translate, but such large-scale multilingual automatically extracted corpora are often of poor quality (Kreutzer et al. 2022), and this problem is worse for low-resource language pairs. This means that low-resource language pairs suffer from multiple compounding problems: lack of data, out-of-domain data, and noisy data. And the difficulty is not just with parallel data—low-resource languages often lack good linguistic tools, and even basic tools like language identification do not exist or are not reliable.

Partly in response to all these challenges, there has been increasing interest in the research community in exploring more diverse languages, and language pairs that do not include English. This survey paper presents a high-level summary of approaches to low-resource MT, with a focus on neural machine translation (NMT) techniques, and should be useful for researchers interested in this broad and rapidly evolving field. There are currently a number of other survey papers in related areas, for example, a survey of monolingual data in low-resource NMT (Gibadullin et al. 2019) and a survey of multilingual NMT (Dabre, Chu, and Kunchukuttan 2020). There have also been two very recent surveys of low-resource MT, written concurrently with this survey (Ranathunga et al. 2021; Wang et al. 2021). Our survey aims to provide the broadest coverage of existing research in the field, and we also contribute an extensive overview of the tools and techniques validated across 18 low-resource shared tasks that ran between 2018 and 2021.

One of the challenges of surveying the literature on low-resource MT is how to define a low-resource language pair. This is hard, because “resourced-ness” is a continuum and any criterion must be arbitrary. We also note that the definition of low-resource can change over time: We may crawl more parallel data, or find better ways of using related-language or monolingual data, so that some language pairs are no longer so resource-poor. We maintain that for research to be considered to be on “low-resource MT,” it should either aim to understand the implications of the lack of data or propose methods for overcoming it. We do not take a strict view of what to include in this survey, though; if authors consider that they are studying low-resource MT, then that is sufficient. We do feel, however, that it is important to distinguish between simulated low-resource settings (where a limited amount of data from otherwise high-resource language pairs is used) and genuinely low-resource languages (where additional difficulties apply). We also discuss some papers that do not explicitly consider low-resource MT but that present important techniques, and we mention methods that we think have the potential to improve low-resource MT. Even though there has been much recent interest in low-resource NLP, the field is limited to languages for which some textual data is freely available. This means that so far low-resource MT has only considered 100-odd languages, and there is a long tail of languages that remains unexplored.

In Figure 1 we show how we structure the diverse research methods addressing low-resource MT, and this article follows this structure. We start the survey by looking at work that aims to increase the amount and quality of parallel and monolingual data available for low-resource MT (Section 2). We then look at work that uses other types of data: monolingual data (Section 3), parallel data from other language pairs (Section 4), and other types of linguistic data (Section 5). Another avenue of important research is how to make better use of existing, limited resources through better training or modeling (Section 6). In Section 7 we pause our discussion on methods to improve low-resource MT to consider how to evaluate these improvements. In our final section, we look at efforts in the community to build research capacity through shared tasks and language-specific collectives (Section 8), providing a practical summary of commonly used approaches and other techniques often used by top-performing shared task systems.

Figure 1

Overview of research methods covered in this survey.


This survey aims to provide researchers and practitioners interested in low-resource MT with an overview of the area, and we hope it will be especially useful to those who are new to low-resource MT and looking to quickly assimilate recent research directions. We assume that our readers have prior knowledge of MT techniques and are already familiar with basic concepts, including the main architectures used. We therefore do not redefine them in this survey, and refer the interested reader to other resources, such as Koehn (2020).

The first consideration when applying data-driven MT to a new language pair is to determine what data resources are already available. In this section we discuss commonly used data sets and how to extract more data, especially parallel data, for low-resource languages. A recent case study (Hasan et al. 2020) has shown how carefully targeted data gathering can lead to clear MT improvements in a low-resource language pair (in this case, Bengali–English). Data is arguably the most important factor in our success at modeling translation and we encourage readers to consider data set creation and curation as important areas for future work.

### 2.1 Searching Existing Data Sources

The largest range of freely available parallel data is found on Opus (Tiedemann 2012), which hosts parallel corpora covering more than 500 different languages and variants. Opus collects contributions of parallel corpora from many sources in one convenient Web site, and provides tools for downloading corpora and metadata.

Monolingual corpora are also useful, although not nearly as valuable as parallel data. There have been a few efforts to extract text from the CommonCrawl collection of Web sites, and these generally have broad language coverage. The first extraction effort was by Buck, Heafield, and van Ooyen (2014), although more recent efforts such as Oscar (Ortiz Suárez, Sagot, and Romary 2019), CC100 (Conneau et al. 2020), and mc4 (Raffel et al. 2020) have focused on cleaning the data and making it easier to access. A much smaller (but generally less noisy) corpus of monolingual news is updated annually for the WMT shared tasks (Barrault et al. 2020), and currently covers 59 languages. For languages not covered in any of these data sets, Wikipedia currently has text in over 300 languages, although many language sections are quite small.

### 2.2 Web-crawling for Parallel Data

Once freely available sources of parallel data have been exhausted, one avenue for improving low-resource NMT is to obtain more parallel data by Web-crawling. There is a large amount of translated text available on the Web, ranging from small-scale tourism Web sites to large repositories of government documents. Identifying, extracting, and sentence-aligning such texts is not straightforward, and researchers have considered many techniques for producing parallel corpora from Web data. The links between source texts and their translations are rarely recorded in a consistent way, so techniques ranging from simple heuristics to neural sentence embedding methods are used to extract parallel documents and sentences.

Paracrawl, a recent large-scale open-source crawling effort (Bañón et al. 2020), has mainly targeted European languages, only a few of which can be considered low-resource, but it has created some releases for non-European low-resource language pairs, and the crawling pipeline is freely available. Related to this are other broad parallel data extraction efforts, where recently developed sentence-embedding-based alignment methods (Artetxe and Schwenk 2019a, 2019b) were used to extract large parallel corpora in many language pairs from Wikipedia (Schwenk et al. 2021a) and CommonCrawl (Schwenk et al. 2021b). Similar techniques (Feng et al. 2020) were used to create the largest parallel corpora of Indian languages, Samanantar (Ramesh et al. 2022).

Broad crawling and extraction efforts are good for gathering data from the “long tail” of small Web sites with parallel data, but they tend to be much more effective for high-resource language pairs, because there is more text in these languages, and routine translation of Web sites is more common for some languages. Focused crawling efforts, where the focus is on particular language pairs, or particular sources, can be more effective for boosting the available data for low-resource language pairs. Religious texts are often translated into many different languages, with the texts released under permissive licenses to encourage dissemination, so these have been the focus of some recent crawling efforts, such as corpora from the Bible (Mayer and Cysouw 2014; Christodouloupoulos and Steedman 2015) or the Quran (Tiedemann 2012). However, a permissive license should not be automatically assumed even for religious publications—for instance, a corpus of Jehovah’s Witness publications (Agić and Vulić 2019) was recently withdrawn due to a copyright infringement claim. We therefore recommend always checking the license of any material one intends to crawl and, if it is unclear, asking the original publisher for permission.

In India, government documents are often translated into many of the country’s official languages, most of which would be considered low-resource for MT. Recent efforts (Haddow and Kirefu 2020; Siripragada et al. 2020; Philip et al. 2021) have been made to gather and align this data, including producing parallel corpora between different Indian languages, rather than the typical English-centric parallel corpora. The last of these works uses an iterative procedure, starting with an existing NMT system for alignment (Sennrich and Volk 2011), and proceeding through rounds of crawling, alignment, and retraining, to produce parallel corpora for languages of India. Further language-specific data releases for low-resource languages, including not only crawled data from MT but also annotated data for other NLP tasks, were recently provided by the Lorelei project (Tracey et al. 2019), although this data is distributed under restrictive and sometimes costly Linguistic Data Consortium licenses.

### 2.3 Low-resource Languages and Web-crawling

Large-scale crawling faces several problems when targeting low-resource language pairs. Typical crawling pipelines require several stages and make use of resources such as text preprocessors, bilingual dictionaries, sentence-embedding tools, and translation systems, all of which may be unavailable or of poor quality for low-resource pairs. Also, parallel sentences are so rare for low-resource language pairs (relative to the size of the web) that even a small false-positive rate will result in a crawled corpus that is mostly noise (e.g., sentences that are badly aligned, sentences in the wrong language, or fragments of html/javascript).

One of the first stages of any crawling effort is language identification, perhaps thought to be a solved problem given wide-coverage open-source toolkits such as CLD3. However, it has been noted (Caswell et al. 2020) that language identification can perform quite poorly on Web-crawled corpora, especially on low-resource languages, where it is affected by class imbalance, similar languages, encoding problems, and domain mismatches. Further down the crawling pipeline, common techniques for document alignment and sentence alignment rely on the existence of translation systems (Uszkoreit et al. 2010; Sennrich and Volk 2010, 2011) or sentence embeddings (Artetxe and Schwenk 2019a), which again may not be of sufficient quality for low-resource languages, so we often have to fall back on older, heuristic alignment techniques (Varga et al. 2005), and even these may perform worse if a bilingual lexicon is not available. The consequence is that the resulting corpora are extremely noisy and require extensive filtering before they can be useful for NMT training.
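
To make the filtering step concrete, the sketch below applies the kind of cheap heuristics typically run before any model-based cleaning: length bounds, a source/target length-ratio check, and an optional language-identification check. The function name, thresholds, and `lang_id` callable are illustrative assumptions, not settings from any particular pipeline.

```python
def heuristic_filter(pairs, src_lang, tgt_lang, lang_id=None,
                     min_len=1, max_len=100, max_ratio=2.0):
    """Cheap heuristic filters for crawled parallel data.

    pairs:    iterable of (src, tgt) sentence strings.
    lang_id:  optional callable mapping a sentence to a language code;
              a stand-in for any identifier (whose reliability on
              low-resource languages is itself a known problem).
    Thresholds are illustrative defaults, to be tuned per corpus.
    """
    kept = []
    for src, tgt in pairs:
        n_src, n_tgt = len(src.split()), len(tgt.split())
        # discard empty, truncated, or absurdly long segments
        if not (min_len <= n_src <= max_len and min_len <= n_tgt <= max_len):
            continue
        # discard pairs whose lengths are too dissimilar to be translations
        if max(n_src, n_tgt) > max_ratio * min(n_src, n_tgt):
            continue
        # discard pairs in the wrong language (often HTML/boilerplate noise)
        if lang_id and (lang_id(src) != src_lang or lang_id(tgt) != tgt_lang):
            continue
        kept.append((src, tgt))
    return kept
```

In practice such heuristics are only a first pass; the model-based filtering methods discussed below are applied to what survives.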

Filtering of noisy corpora is itself an active area of research, and has been explored in recent shared tasks, which particularly emphasized low-resource settings (Koehn et al. 2019, 2020). In an earlier version of the task (Koehn et al. 2018), dual conditional cross-entropy (DCCE; Junczys-Dowmunt 2018) was found to be very effective for English–German. However, in the 2019 and 2020 editions of the task, DCCE was much less used, possibly indicating that it is less effective for low-resource and/or distant language pairs. Instead, we see that all participants apply some heuristic filtering (e.g., based on language identification and length) and then strong submissions typically used a combination of embedding-based methods (such as LASER [Artetxe and Schwenk 2019b], GPT-2 [Radford et al. 2019], and YiSi [Lo 2019]) with feature-based systems such as Zipporah (Xu and Koehn 2017) or Bicleaner (Sánchez-Cartagena et al. 2018). Although the feature-based methods are much faster than sentence-embedding based methods, both types of methods require significant effort in transferring to a new language pair, especially if no pre-trained sentence embeddings or other models are available.
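
The DCCE score itself is simple to compute once two directional models have scored a sentence pair. The sketch below follows the formulation of Junczys-Dowmunt (2018), taking per-token cross-entropies as inputs; lower scores indicate cleaner, more mutually predictive pairs.

```python
def dcce_score(ce_fwd: float, ce_bwd: float) -> float:
    """Dual conditional cross-entropy score (Junczys-Dowmunt 2018).

    ce_fwd: per-token cross-entropy of the target given the source,
            -log P(y|x) / |y|, under a forward translation model.
    ce_bwd: per-token cross-entropy of the source given the target,
            -log P(x|y) / |x|, under a backward model.
    The first term penalises disagreement between the two models,
    the second penalises high absolute entropy; lower is better.
    """
    return abs(ce_fwd - ce_bwd) + 0.5 * (ce_fwd + ce_bwd)

# A well-matched pair (both models find it easy, and agree) scores
# lower than a noisy pair on which the models disagree.
assert dcce_score(1.1, 1.2) < dcce_score(1.1, 6.0)
```

Sentence pairs are then ranked by this score and the corpus truncated at a threshold tuned on held-out data.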

The conclusion is that all crawled data sources should be treated with care, especially in low-resource settings, as they will inevitably contain errors. A large-scale quality analysis (Kreutzer et al. 2022) of crawled data has highlighted that many sets contain incorrect language identification, non-parallel sentences, low-quality text, as well as offensive language, and these problems can be more acute in low-resource languages.

### 2.4 Other Data Sources

In addition to mining existing sources of translations, researchers have turned their attention to ways of creating new parallel data. One idea for doing this in a cost-effective manner is crowdsourcing of translations. Post, Callison-Burch, and Osborne (2012) showed that this method is effective in collecting a parallel corpus covering several languages of India. Pavlick et al. (2014) used crowdsourcing to collect bilingual dictionaries covering a large selection of languages. Although not specifically aimed at MT, the Tatoeba collection of crowdsourced translations provides a useful resource with broad language coverage. An MT challenge set covering over 100 language pairs has been derived from Tatoeba (Tiedemann 2020).

### 2.5 Test Sets

Obtaining more training data is important, but we should not forget the role of standardized and reliable test sets in improving performance on low-resource translation. Important contributions have come from shared tasks, such as those organized by the WMT Conference on Machine Translation (Bojar et al. 2018; Barrault et al. 2019, 2020; Fraser 2020), the Workshop on Asian Translation (Nakazawa et al. 2018, 2019, 2020), and the Workshop on Technologies for MT of Low Resource Languages (LoResMT) (Karakanta et al. 2019; Ojha et al. 2020).

The test sets in these shared tasks are very useful, but inevitably only cover a small selection of low-resource languages, and usually English is one of the languages in the pair; non-English language pairs are poorly covered. A recent initiative toward rectifying this situation is the FLORES-101 benchmark (Goyal et al. 2021), which covers a large number of low-resource languages with multi-parallel test sets, vastly expanding on the original FLORES release (Guzmán et al. 2019). Since FLORES-101 consists of the same set of English sentences, translated into 100 other languages, it can also be used for testing translation between non-English pairs. It has the limitation that, for non-English pairs, the two sides are “translationese,” and not mutual translations of each other, but there is currently no other data set with the coverage of FLORES-101.

There is also some important and recent work on data set creation, distribution, and benchmarking for individual language pairs, and these efforts are becoming more and more frequent, particularly for African languages—for example Adelani et al. (2021) for Yoruba–English, Ezeani et al. (2020) for Igbo–English, and Mukiibi, Claire, and Joyce (2021) for Luganda–English, to cite just a few.

We summarize the main multilingual data sets described in this section in Table 2.

Table 2

Summary of useful sources of monolingual, parallel, and benchmarking data discussed in this section.

| Corpus name | URL | Description |
| --- | --- | --- |
| CC100 | http://data.statmt.org/cc-100/ | Monolingual data from CommonCrawl (100 languages). |
| Oscar | https://oscar-corpus.com/ | Monolingual data from CommonCrawl (166 languages). |
| mc4 | https://huggingface.co/datasets/mc4 | Monolingual data from CommonCrawl (108 languages). |
| news-crawl | http://data.statmt.org/news-crawl/ | Monolingual news text in 59 languages. |
| Opus | https://opus.nlpl.eu/ | Collection of parallel corpora in 500+ languages, gathered from many sources. |
| WikiMatrix | https://bit.ly/3DrTjPo | Parallel corpora mined from Wikipedia. |
| CCMatrix | https://bit.ly/3Bin6rQ | Parallel corpora mined from CommonCrawl. |
| Samanantar | https://indicnlp.ai4bharat.org/samanantar/ | A parallel corpus of 11 languages of India paired with English. |
| Bible | https://github.com/christos-c/bible-corpus | A parallel corpus of 100 languages extracted from the Bible. |
| Tanzil | https://opus.nlpl.eu/Tanzil.php | A parallel corpus of 42 languages translated from the Quran by the Tanzil project. |
| Tatoeba Challenge | https://bit.ly/3Drp36U | Test sets in over 100 language pairs. |
| WMT corpora | http://www.statmt.org/wmt21/ | Test (and training) sets for many shared tasks. |
| WAT corpora | https://lotus.kuee.kyoto-u.ac.jp/WAT/ | Test (and training) sets for many shared tasks. |
| FLORES-101 | https://github.com/facebookresearch/flores | Test sets for 100 languages, paired with English. |
| Pavlick dictionaries | https://bit.ly/3DgI0cu | Crowdsourced bilingual dictionaries in many languages. |

Parallel text is, by definition, in short supply for low-resource language pairs, and even applying the data collection strategies of Section 2 may not yield sufficient text to train high-quality MT models. However, monolingual text will almost always be more plentiful than parallel, and leveraging monolingual data has therefore been one of the most important and successful areas of research in low-resource MT.

In this section, we provide an overview of the main approaches used to exploit monolingual data in low-resource scenarios. We start by reviewing work on integrating external language models into NMT (Section 3.1), work largely inspired by the use of language models in statistical MT (SMT). We then discuss research on synthesizing parallel data (Section 3.2), looking at the dominant approach of backtranslation and its variants, unsupervised MT, and the modification of existing parallel data using language models. Finally, we turn to transfer learning (Section 3.3), whereby a model trained on monolingual data is used to initialize part or all of the NMT system, either through pre-trained embeddings or through the initialization of model parameters from pre-trained language models. The use of monolingual data for low-resource MT has previously been surveyed by Gibadullin et al. (2019), who categorize methods according to whether they are architecture-independent or architecture-dependent. This categorization approximately aligns with our split into (i) approaches based on synthetic data and the integration of external LMs (architecture-independent), and (ii) those based on transfer learning (architecture-dependent).

### 3.1 Integration of External Language Models

For SMT, monolingual data was normally incorporated into the system using a language model, in a formulation that can be traced back to the noisy channel model (Brown et al. 1993). In early work on NMT, researchers drew inspiration from SMT, and several studies have focused on integrating external language models into NMT models.

The first approach, proposed by Gülçehre et al. (2015), was to modify the scoring function of the MT model by either interpolating the probabilities from a language model with the translation probabilities (they call this shallow fusion) or integrating the hidden states from the language model within the decoder (they call this deep fusion). Importantly, they see improved scores for a range of scenarios, including a (simulated) low-resource language direction (Turkish→English), with best results achieved using deep fusion. Building on this, Stahlberg, Cross, and Stoyanov (2018) proposed simple fusion as an alternative method for including a pre-trained LM. In this case, the NMT model is trained from scratch with the fixed LM, offering it a chance to learn to be complementary to the LM. The result is improved translation performance as well as training efficiency, with experiments again on low-resource Turkish–English, as well as on larger data sets for Xhosa→English and Estonian→English.
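
Shallow fusion amounts to interpolating the two models' scores at each decoding step. The toy NumPy sketch below, with an assumed interpolation weight `beta` and a four-token vocabulary, shows how an external LM can rerank the translation model's candidates.

```python
import numpy as np

def shallow_fusion_step(tm_logprobs, lm_logprobs, beta=0.3):
    """One decoding step of shallow fusion (Gülçehre et al. 2015).

    tm_logprobs: next-token log-probabilities from the translation model.
    lm_logprobs: next-token log-probabilities from an external LM trained
                 on target-side monolingual data.
    beta is the LM interpolation weight, a tunable hyperparameter.
    Returns the combined scores used to rank candidate tokens.
    """
    return tm_logprobs + beta * lm_logprobs

# Toy vocabulary of four tokens: the LM can shift the best candidate.
tm = np.log(np.array([0.40, 0.35, 0.15, 0.10]))   # TM prefers token 0
lm = np.log(np.array([0.05, 0.60, 0.20, 0.15]))   # LM prefers token 1
combined = shallow_fusion_step(tm, lm, beta=0.5)
assert int(np.argmax(combined)) == 1              # fused score picks token 1
```

In beam search this combination is applied at every step, which is why fusion methods increase decoding time relative to the plain NMT model.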

The addition of a language model to the scoring function as in the works described above has the disadvantage of increasing the time necessary for decoding (as well as training). An alternative approach was proposed by Baziotis, Haddow, and Birch (2020), who aim to overcome this by using the language model as a regularizer during training, pushing the source-conditional NMT probabilities to be closer to the unconditional LM prior. They see considerable gains in very low-resource settings (albeit simulated), using small data sets for Turkish–English and German–English.
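
A minimal NumPy sketch of this style of LM-prior regularization: the usual cross-entropy on parallel data is augmented with a KL term pulling the NMT output distribution towards a fixed LM's. The weight `lam` is an assumed hyperparameter, and the formulation is a simplification of the paper's (which, for instance, also applies a softmax temperature).

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def lm_prior_loss(nmt_logits, lm_logits, tgt_ids, lam=0.5):
    """Cross-entropy plus an LM-prior KL regulariser, in the spirit of
    Baziotis, Haddow, and Birch (2020).

    nmt_logits, lm_logits: arrays of shape (positions, vocab).
    tgt_ids:               reference token ids, one per position.
    lam:                   assumed weight of the regularisation term.
    """
    p_nmt = softmax(nmt_logits)
    p_lm = softmax(lm_logits)
    # standard cross-entropy against the reference tokens
    ce = -np.mean(np.log(p_nmt[np.arange(len(tgt_ids)), tgt_ids]))
    # KL(p_nmt || p_lm): push the conditional model towards the LM prior
    kl = np.mean(np.sum(p_nmt * (np.log(p_nmt) - np.log(p_lm)), axis=-1))
    return ce + lam * kl
```

Because the LM is only consulted during training, decoding speed is unchanged, which is the main advantage over fusion approaches.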

### 3.2 Synthesizing Parallel Data Using Monolingual Data

One direction in which the use of monolingual data has been highly successful is the production of synthetic parallel data. This is particularly important in low-resource settings, when genuine parallel data is scarce. It has been the focus of a large body of research and has become the dominant approach to exploiting monolingual data, due to the improvements it brings (particularly in the case of backtranslation) and the potential for progress in the case of unsupervised MT. Both backtranslation (Section 3.2.1) and unsupervised MT (Section 3.2.2) belong to a category of approaches involving self-learning. We then discuss an alternative method of synthesizing parallel data (Section 3.2.3), which involves modifying existing parallel data using a language model learned on monolingual data.

#### 3.2.1 Self-learning: Backtranslation and its Variants

One of the most successful strategies for leveraging monolingual data has been the creation of synthetic parallel data through translating monolingual texts either using a heuristic strategy or an intermediately trained MT model. This results in parallel data where one side is human-generated, and the other is automatically produced. We focus here on backtranslation and its variants, before exploring in the next section how unsupervised MT can be seen as an extension of this idea.

##### Backtranslation.

Backtranslation corresponds to the scenario where target-side monolingual data is translated using an MT system to give corresponding synthetic source sentences, the idea being that it is particularly beneficial for the MT decoder to see well-formed sentences. Backtranslation was already being used in SMT (Bertoldi and Federico 2009; Bojar and Tamchyna 2011), but because monolingual data could already be incorporated easily into SMT systems using language models, and because inference in SMT was quite slow, backtranslation was not widely used. For NMT, however, it was discovered that backtranslation was a remarkably effective way of exploiting monolingual data (Sennrich, Haddow, and Birch 2016a), and it remains an important technique for both low-resource MT and MT in general.

There has since been considerable interest in understanding and improving backtranslation. For example, Edunov et al. (2018a) showed that backtranslation improved performance even at a very large scale, but also that it provided improvements in (simulated) low-resource settings, where it was important to use beam-search rather than sampling to create backtranslations (the opposite situation to high-resource pairs). Caswell, Chelba, and Grangier (2019) showed that simply adding a tag to the back-translated data during training, to let the model know which was back-translated data and which was natural data, could improve performance.
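
In code, backtranslation is a simple loop over target-side monolingual text; the sketch below also covers the tagging variant. Here `reverse_model` stands in for any target→source system, and the tag string is an illustrative choice rather than a prescribed token.

```python
def backtranslate(mono_tgt, reverse_model, tag=None):
    """Create synthetic parallel data from target-side monolingual text
    (Sennrich, Haddow, and Birch 2016a).

    The target side remains natural, human-written text; only the source
    side is machine-generated. With `tag` set (e.g., "<BT>"), each
    synthetic source is marked so the model can distinguish it from
    genuine data, as in tagged backtranslation (Caswell, Chelba, and
    Grangier 2019).
    """
    synthetic = []
    for tgt in mono_tgt:
        src = reverse_model.translate(tgt)   # target→source MT system
        if tag is not None:
            src = f"{tag} {src}"
        synthetic.append((src, tgt))
    return synthetic
```

The resulting pairs are simply concatenated with (or upsampled against) the genuine parallel data when training the forward model.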

##### Variants of Backtranslation.

Forward translation, where monolingual source data is translated into the target language (Zhang and Zong 2016), is also possible but has received considerably less interest than backtranslation for low-resource MT, presumably because of the noise it introduces for the decoder. However, He et al. (2019) actually find it more effective than backtranslation in their experiments for low-resource English→Nepali when coupled with noising (as a kind of dropout), which they identify as an important factor in self-supervised learning. A related (and even simpler) technique, copying from target to source to create synthetic data, was introduced by Currey, Miceli Barone, and Heafield (2017). They showed that this was particularly useful in low-resource settings (tested on Turkish–English and Romanian–English) and hypothesize that it helped particularly with the translation of named entities.

##### Iterative Backtranslation.

For low-resource NMT, backtranslation can be a particularly effective way of improving quality (Guzmán et al. 2019). However, one possible issue is that the initial model used for translation (trained on available parallel data) is often of poor quality when parallel data is scarce, which inevitably leads to poor quality backtranslations. The logical way to address this is to perform iterative backtranslation, whereby intermediate models of increasing quality in both language directions are successively used to create synthetic parallel data for the next step. This has been successfully applied to low-resource settings by several authors (Hoang et al. 2018; Dandapat and Federmann 2018; Bawden et al. 2019; Sánchez-Martínez et al. 2020), although successive iterations offer diminishing returns, and often two iterations are sufficient, as has been shown experimentally (Chen et al. 2020).
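
The iterative procedure can be sketched as alternating rounds of synthesis and retraining. In the sketch below, `train` is a stand-in for fitting an NMT model on (source, target) pairs and returning an object with a `.translate` method; the interface is an assumption for illustration only.

```python
def iterative_backtranslation(parallel, mono_src, mono_tgt, train, rounds=2):
    """Iterative backtranslation: models of increasing quality in both
    directions successively generate synthetic data for each other.
    Two rounds are often sufficient in practice (Chen et al. 2020).
    """
    flipped = [(t, s) for s, t in parallel]
    fwd = train(parallel)   # initial source→target model
    bwd = train(flipped)    # initial target→source model
    for _ in range(rounds):
        # each direction is retrained on genuine data plus synthetic data
        # produced by the current model for the opposite direction
        synth_fwd = [(bwd.translate(t), t) for t in mono_tgt]
        synth_bwd = [(fwd.translate(s), s) for s in mono_src]
        fwd = train(parallel + synth_fwd)
        bwd = train(flipped + synth_bwd)
    return fwd, bwd
```

The returns diminish with each round because the synthetic data quality plateaus once both directions stop improving.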

Other authors have sought to improve on iterative backtranslation by introducing a round-trip (i.e., autoencoder) objective for monolingual data; in other words, performing backtranslation and forward translation implicitly during training. This was proposed simultaneously by Cheng et al. (2016), He et al. (2016), and Zhang and Zong (2016), the last of whom also added forward translation. However, none of these authors applied their techniques to low-resource settings. In contrast, Niu, Xu, and Carpuat (2019) developed a method using the Gumbel softmax to enable back-propagation through backtranslation: They tested it in low-resource settings but achieved limited success.

Despite the large body of literature on applying backtranslation (and related techniques) and evidence that it works in low-resource NMT, there are few systematic experimental studies of backtranslation specifically for low-resource NMT, apart from Xu et al. (2019), which appears to confirm the findings of Edunov et al. (2018a) that sampling is best when there are reasonable amounts of data and beam search is better when data is very scarce.

#### 3.2.2 Unsupervised MT

The goal of unsupervised MT is to learn a translation model without any parallel data, so it can be considered an extreme form of low-resource MT. The first unsupervised NMT models (Lample et al. 2018a; Artetxe et al. 2018) were typically trained in a two-phase process: A rough translation system is first created by aligning word embeddings across the two languages (e.g., using bilingual seed lexicons), and then several rounds of iterative backtranslation and denoising autoencoding are used to further train the system. Although this approach has been successfully applied to high-resource language pairs (by ignoring available parallel data), it has been shown to perform poorly on genuine low-resource language pairs (Guzmán et al. 2019; Marchisio, Duh, and Koehn 2020; Kim, Graça, and Ney 2020), mainly because the initial quality of the word embeddings and their cross-lingual alignments is poor (Edman, Toral, and van Noord 2020). The situation is somewhat improved by using transfer learning from models trained on large amounts of monolingual data (Section 3.3), and some further gains can be achieved by adding a supervised training step with the limited parallel data (i.e., semi-supervised rather than unsupervised) (Bawden et al. 2019). However, the performance remains limited, especially compared with high-resource language pairs.

These negative results have focused researchers’ attention on making unsupervised MT work better for low-resource languages. Chronopoulou, Stojanovski, and Fraser (2021) improved the cross-lingual alignment of word embeddings in order to get better results on unsupervised Macedonian–English and Albanian–English. A separate line of work is concerned with using corpora from other languages to improve unsupervised NMT (see Section 4.2.2).

#### 3.2.3 Modification of Existing Parallel Data

Another way in which language models have been used to generate synthetic parallel data is to synthesize new parallel examples from existing ones by replacing certain words.15 In MT, it is important that the two sentences in a pair remain translations of each other after such modification. There are, to our knowledge, few studies so far in this area. Fadaee, Bisazza, and Monz (2017) explore data augmentation for MT in a simulated low-resource setting (using English–German). They rely on bi-LSTM language models to predict plausible but rare equivalents of words in sentences. They then substitute in the rare words and replace the aligned word in the corresponding parallel sentence with its translation (obtained through a look-up in an SMT phrase table). They see improved BLEU scores (Papineni et al. 2002) and find that the technique is complementary to backtranslation. More recently, Arthaud, Bawden, and Birch (2021) apply a similar technique to improve the adaptation of a model to new vocabulary for the low-resource translation direction Gujarati$→$English. They use a BERT language model to select training sentences that provide the appropriate context for substituting in new and unseen words, thereby creating new synthetic parallel training sentences. While their work explores the trade-off between specializing in the new vocabulary and maintaining overall translation quality, they show that the approach can improve the translation of new words.
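
The substitution step itself can be illustrated with a toy sketch in the spirit of Fadaee, Bisazza, and Monz (2017). The alignment, rare word, and lexicon here are hand-written illustrations; the real method obtains the replacement word from a language model and its translation from an SMT phrase table.

```python
# Substitution-based augmentation: a rare word replaces a word on the
# source side, and the aligned target word is replaced by the rare
# word's translation, keeping the pair a valid translation.

def augment(src_tokens, tgt_tokens, alignment, position, rare_word, lexicon):
    """alignment maps source token positions to target token positions."""
    new_src = list(src_tokens)
    new_tgt = list(tgt_tokens)
    new_src[position] = rare_word
    new_tgt[alignment[position]] = lexicon[rare_word]
    return new_src, new_tgt

src = ["the", "cat", "sleeps"]
tgt = ["die", "Katze", "schläft"]
# Replace "cat" (position 1) with the rarer "otter" on both sides.
new_src, new_tgt = augment(src, tgt, {0: 0, 1: 1, 2: 2},
                           position=1, rare_word="otter",
                           lexicon={"otter": "Otter"})
# new_src == ["the", "otter", "sleeps"]; new_tgt == ["die", "Otter", "schläft"]
```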

### 3.3 Introducing Monolingual Data Using Transfer Learning

The third category of approaches we explore is transfer learning, by which we refer to techniques where a model trained using monolingual data is used to initialize some or all of the NMT model. A related but distinct idea, multilingual models, in which the low-resource NMT model is trained with the help of other (high-resource) languages, will be considered in Section 4.

##### Pre-trained Embeddings.

When neural network methods were introduced to NLP, transfer learning meant using pre-trained word embeddings, such as word2vec (Mikolov et al. 2013) or GloVe (Pennington, Socher, and Manning 2014), to introduce knowledge from large unlabeled monolingual corpora into the model. The later introduction of the multilingual fastText embeddings (Bojanowski et al. 2017) meant that pre-trained embeddings could be tested with NMT (Di Gangi and Federico 2017; Neishi et al. 2017; Qi et al. 2018). Pre-trained word embeddings were also used in the first phase of unsupervised NMT training (Section 3.2.2). Of most interest for low-resource NMT was the study by Qi et al. (2018), who showed that pre-trained embeddings could be extremely effective in some low-resource settings.
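
The basic mechanics of this form of transfer can be sketched as follows: words present in the pre-trained vectors copy their vector into the NMT embedding table, and the rest are randomly initialized. This is an illustrative sketch with toy vectors, not any particular toolkit's procedure.

```python
# Initializing an NMT embedding table from pre-trained word vectors
# (e.g., word2vec, GloVe, or fastText vectors loaded from disk).

import random

def init_embeddings(vocab, pretrained, dim):
    table = {}
    for word in vocab:
        if word in pretrained:
            # Known word: copy the pre-trained vector.
            table[word] = pretrained[word]
        else:
            # Unknown word: small random initialization.
            table[word] = [random.uniform(-0.1, 0.1) for _ in range(dim)]
    return table

# Toy pre-trained vectors (dim=2 for readability).
pretrained = {"cat": [0.2, 0.1], "dog": [0.4, 0.3]}
emb = init_embeddings(["cat", "dog", "zorilla"], pretrained, dim=2)
```

In practice the resulting table replaces the random embedding matrix of the encoder (and/or decoder) before NMT training begins.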

##### Pre-trained Language Models.

Another early method for transfer learning was to pre-train a language model, and then to use it to initialize either the encoder or the decoder, or both (Ramachandran, Liu, and Le 2017). Although not MT per se, Junczys-Dowmunt et al. (2018b) applied this method to improve grammatical error correction, which they modeled as a low-resource MT task.

The pre-trained language model approach has been extended with new objective functions based on predicting masked words, trained on large amounts of monolingual data. Models such as ELMo (Peters et al. 2018) and BERT (Devlin et al. 2019) have been shown to be very beneficial to natural language understanding tasks, and researchers have sought to apply related ideas to NMT. One of the blocking factors identified by Yang et al. (2020) in using models such as BERT for pre-training is the problem of catastrophic forgetting (Goodfellow et al. 2014). They propose a modification to the learning procedure involving a knowledge distillation strategy designed to retain the model’s capacity to perform language modeling during translation. They achieve increased translation performance according to BLEU, although they do not test on low-resource languages.

Despite the success of ELMo and BERT in NLP, large-scale pre-training in NMT did not become popular until the success of the XLM (Conneau and Lample 2019), MASS (Song et al. 2019b), and mBART (Liu et al. 2020) models. These models allow transfer learning for NMT by initial training on large quantities of monolingual data in several languages, before fine-tuning on the languages of interest. Parallel data can also be incorporated into these pre-training approaches (Wang, Zhai, and Hassan Awadalla 2020; Tang et al. 2021; Chi et al. 2021). Because they use data from several languages, they will be discussed in the context of multilingual models in Section 4.3.

## 4 Use of Multilingual Data

In the previous section we considered approaches that exploit monolingual corpora to compensate for the limited amount of parallel data available for low-resource language pairs. In this section we consider a different but related set of methods, which use additional data from different languages (i.e., in languages other than the language pair that we consider for translation). These multilingual approaches can be roughly divided into two categories: (i) transfer learning and (ii) multilingual models.

Transfer learning (Section 4.1) was introduced in Section 3.3 in the context of pre-trained language models. These methods involve using some or all of the parameters of a “parent” model to initialize the parameters of the “child” model. The idea of multilingual modeling (Section 4.2) is to train a system that is capable of translating between several different language pairs. This is relevant to low-resource MT, because low-resource language pairs included in a multilingual model may benefit from other languages used to train the model. Finally (Section 4.3), we consider more recent approaches to transfer learning, based on learning large pre-trained models from multilingual collections of monolingual and parallel data.

### 4.1 Transfer Learning

In the earliest form of multilingual transfer learning for NMT, a parent model is trained on one language pair, and then the trained parameters are used to initialize a child model, which is then trained on the desired low-resource language pair.

This idea was first explored by Zoph et al. (2016), who considered a French–English parent model and child models translating from four low-resource languages (Hausa, Turkish, Uzbek, and Urdu) into English. They showed that transfer learning could indeed improve over random initialization, and the best performance in this scenario was obtained when the values of the target embeddings were fixed after training the parent, while training continued for all the other parameters. Zoph et al. (2016) suggested that the choice of the parent language could be important, but did not explore this further for their low-resource languages.

Whereas Zoph et al. (2016) treat the parent and child vocabularies as independent, Nguyen and Chiang (2017) showed that when transferring between related languages (in this case, within the Turkic family), it is beneficial to share the vocabularies between the parent and child models. To boost this effect, subword segmentation such as byte-pair encoding (BPE) (Sennrich, Haddow, and Birch 2016b) can help to further increase the vocabulary overlap. In cases where there is little vocabulary overlap (e.g., because the languages are distantly related), mapping the bilingual embeddings between parent and child can help (Kim, Gao, and Ney 2019). In cases where the languages are highly related but are written in different scripts, transliteration may be used to increase the overlap in terms of the surface forms (Dabre et al. 2018; Goyal, Kumar, and Sharma 2020). Interestingly, in the case of transferring from a high-resource language pair into a low-resource one where the target language is a variant of the initial parent language, Kumar et al. (2021) found it useful to pre-train embeddings externally and then fix them during the training of the parent, before initializing the embeddings of the low-resource language using those of the high-resource variant. Tested from English into Russian (transferring to Ukrainian and Belarusian), Norwegian Bokmål (transferring to Nynorsk), and Arabic (transferring to four Arabic dialects), they hypothesize that decoupling the embeddings from training helps to avoid mismatch when transferring from the parent to the child target language.

The question of how to choose the parent language for transfer learning, as posed by Zoph et al. (2016), has been taken up by later authors. One study suggests that language relatedness is important (Dabre, Nakagawa, and Kazawa 2017). However, Kocmi and Bojar (2018) showed that the main consideration in transfer learning is to have a strong parent model, and that it can work well even for unrelated language pairs. Still, if the languages are unrelated and the scripts are different—for example, transferring from an Arabic–Russian parent to Estonian–English—transfer learning is less useful. Lin et al. (2019) perform an extensive study on choosing the parent language for transfer learning, showing that data-related features of the parent models and lexical overlap are often more important than language similarity. Further insight into transfer learning for low-resource settings was provided by Aji et al. (2020), who analyzed the training dynamics and concluded that the parent language is not important. The effectiveness of transfer learning with strong (but linguistically unrelated) parent models has been confirmed in shared task submissions such as Bawden et al. (2020)—see Section 8.2.4.

Multi-stage transfer learning methods have also been explored. Dabre, Fujita, and Chu (2019) propose a two-step transfer with English on the source side for both parent and child models. First, a one-to-one parent model is used to initialize weights in a multilingual one-to-many model, using a multi-way parallel corpus that includes the child target language. Second, the intermediate multilingual model is fine-tuned on parallel data between English and the child target language. Kim et al. (2019) use a two-parent model and a pivot language. One parent model translates between the child source language and the pivot language (e.g., German–English), and the other translates between the pivot and the child target language (e.g., English–Czech). Then, the encoder parameters from the first model and the decoder parameters of the second model are used to initialize the parameters of the child model (e.g., German–Czech).
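
The two-parent initialization of Kim et al. (2019) amounts to stitching together parameter sets. Below is a minimal sketch of that stitching; the parameter dicts and name prefixes are toy stand-ins for real framework state dicts, not the authors' code.

```python
# Two-parent pivot initialization: encoder parameters come from a
# source->pivot parent, decoder parameters from a pivot->target parent.

def init_child(parent_src_pivot, parent_pivot_tgt):
    child = {}
    for name, value in parent_src_pivot.items():
        if name.startswith("encoder."):
            child[name] = value  # source-side knowledge
    for name, value in parent_pivot_tgt.items():
        if name.startswith("decoder."):
            child[name] = value  # target-side knowledge
    return child

# Toy "state dicts" for German–English and English–Czech parents.
de_en = {"encoder.layer0": "de-params", "decoder.layer0": "en-params"}
en_cs = {"encoder.layer0": "en-params", "decoder.layer0": "cs-params"}

# Child German–Czech model initialized from both parents.
de_cs = init_child(de_en, en_cs)
# de_cs == {"encoder.layer0": "de-params", "decoder.layer0": "cs-params"}
```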

### 4.2 Multilingual Models

The goal of multilingual MT is to have a universal model capable of translation between any two languages. Including low-resource language pairs in multilingual models can be seen as a means of exploiting additional data from other, possibly related, languages. Having more languages in the training data helps to develop a universal representation space, which in turn allows for some level of parameter sharing among the language-specific model components.

The degree to which parameters are shared across multiple language directions varies considerably in the literature, with early models showing little sharing across languages (Dong et al. 2015) and later models exploring the sharing of most or all parameters (Johnson et al. 2017). The amount of parameter sharing can be seen as a trade-off between ensuring that each language is sufficiently represented (has enough parameters allocated) and that low-resource languages can benefit from the joint training of parameters with other (higher-resource) language pairs (which also importantly reduces the complexity of the model by reducing the number of parameters required).

Dong et al. (2015) present one of the earliest studies in multilingual NMT, focused on translation from a single language into multiple languages simultaneously. The central idea of this approach is to have a shared encoder and many language-specific decoders, including language-specific weights in the attention modules. By training on multiple target languages (presented as a multi-task set-up), the motivation is that the representation of the source language will not only be trained on more data (thanks to the multiple language pairs), but may also be more universal, since it is being used to decode several languages. They find that the multi-decoder set-up provides systematic gains over the bilingual counterparts, although the model was only tested in simulated low-resource settings.

As an extension of this method, Firat, Cho, and Bengio (2016) experiment with multilingual models in the many-to-many scenario. They too use separate encoders and decoders for each language, but the attention mechanism is shared across all directions, which means that adding languages increases the number of model parameters linearly (as opposed to a quadratic increase when attention is language-direction-specific). In all cases, the multi-task model performed better than the bilingual models according to BLEU scores, although it was again only tested in simulated low-resource scenarios.

More recent work has looked into the benefits of sharing only certain parts of multilingual models, keeping some components language-specific. For example, Platanios et al. (2018) present a contextual parameter generator component, which allows finer control of the parameter sharing across different languages. Fan et al. (2021) also include language-specific components by sharing certain parameters across pre-defined language groups in order to efficiently and effectively upscale the number of languages included (see Section 4.2.1).

In a bid to both simplify the model (also reducing the number of parameters) and to maximally encourage sharing between languages, Ha, Niehues, and Waibel (2016) and Johnson et al. (2017) proposed using a single encoder and decoder to train all language directions (known as the universal encoder-decoder). Whereas Ha, Niehues, and Waibel (2016) propose language-specific embeddings, Johnson et al. (2017) use a joint vocabulary over all languages included, which has the advantage of allowing shared lexical representations (and ultimately this second strategy is the one that has been retained by the community). The control over the target language was ensured in both cases by including pseudo-tokens indicating the target language in the source sentence. Although not trained or evaluated on low-resource language pairs, the model by Johnson et al. (2017) showed promise in terms of the ability to model multilingual translation with a universal model, and zero-shot translation (between language directions for which no parallel training data was provided) was also shown to be possible. The model was later shown to bring improvements when dealing with translation into several low-resource language varieties (Lakew, Erofeeva, and Federico 2018), a particular type of multilingual MT where the several target languages are very similar. We shall see in the next section (Section 4.2.1) how scaling up the number of languages used for training can be beneficial in the low-resource setting.
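The control mechanism described above is simple enough to show directly: a pseudo-token indicating the target language is prepended to the source sentence. The tag format below (`<2xx>`) is illustrative; the exact token string varies between implementations.

```python
# Target-language control in a universal encoder-decoder (Johnson et
# al. 2017): prepend a pseudo-token naming the desired output language.

def tag_source(tokens, target_lang):
    return ["<2{}>".format(target_lang)] + tokens

tagged = tag_source(["hello", "world"], "de")
# tagged == ["<2de>", "hello", "world"]
```

Because the same encoder and decoder serve all directions, changing only this tag at inference time is what makes zero-shot translation between unseen direction pairs possible.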

Combining multilingual models with the transfer learning approaches of the previous section, Neubig and Hu (2018) present a number of approaches for adaptation of multilingual models to new languages. The authors consider cold- and warm-start scenarios, depending on whether the training data for the new language was available for training the original multilingual model. They find that multilingual models fine-tuned with the low-resource language training data mixed in with data from a similar high-resource language (i.e., similar-language regularization) give the best translation performance.

#### 4.2.1 Massively Multilingual Models

In the last couple of years, efforts have been put into scaling up the number of languages included in multilingual training, particularly for the universal multilingual model (Johnson et al. 2017). The motivation is that increasing the number of languages should improve the performance for all language directions, thanks to the addition of extra data and to increased transfer between languages, particularly for low-resource language pairs. For example, Neubig and Hu (2018) trained a many-to-English model with 57 possible source languages, and more recent models have sought to include even more languages; Aharoni, Johnson, and Firat (2019) train an MT model for 102 languages to and from English as well as a many-to-many MT model between 59 languages, and Fan et al. (2021), Zhang et al. (2020a), and Arivazhagan et al. (2019) train many-to-many models for over 100 languages.

While an impressive feat, the results show that it is non-trivial to maintain high translation performance across all languages as the number of language pairs is increased (Mueller et al. 2020; Aharoni, Johnson, and Firat 2019; Arivazhagan et al. 2019). There is a trade-off between transfer (how much benefit is gained from the addition of other languages) and interference (how much performance is degraded due to having to also learn to translate other languages) (Arivazhagan et al. 2019). It is generally bad news for high-resource language pairs, for which the performance of multilingual models is usually below that of language-direction-specific bilingual models. However, low-resource languages often do benefit from multilinguality, and the benefits are more noticeable for the many-to-English than for the English-to-many (Johnson et al. 2017; Arivazhagan et al. 2019; Adelani et al. 2021). It has also been shown that for zero-shot translation, the more languages included in the training, the better the results are (Aharoni, Johnson, and Firat 2019; Arivazhagan et al. 2019), and that having multiple bridge pairs in the parallel data (i.e., not a single language such as English being the only common language between the different language pairs) greatly benefits zero-shot translation, even if the amount of non-English parallel data remains small (Rios, Müller, and Sennrich 2020; Freitag and Firat 2020).

There is often a huge imbalance in the amount of training data available across language pairs, and, for low-resource language pairs, it is beneficial to upsample the amount of data. However, upsampling low-resource pairs has the unfortunate effect of harming performance on high-resource pairs (Arivazhagan et al. 2019), and there is the additional issue of the model overfitting on the low-resource data even before it has time to converge on the high-resource language data. A solution to this problem is the commonly used strategy of temperature-based sampling, which involves adjusting how much we sample from the true data distribution (Devlin et al. 2019; Fan et al. 2021), providing a compromise between ensuring that low-resource languages are sufficiently represented and limiting the deterioration in performance on high-resource language pairs. Temperature-based sampling can also be used when training the subword segmentation to create a joint vocabulary across all languages, so that the low-resource languages are sufficiently represented in the joint vocabulary despite there being little data.
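
Concretely, temperature-based sampling draws language pair $i$ with probability proportional to $(n_i/N)^{1/T}$, where $n_i$ is its number of examples, $N$ the total, and $T$ the temperature: $T = 1$ reproduces the data distribution, and larger $T$ flattens it towards uniform, upsampling low-resource pairs. A minimal sketch (corpus sizes are toy values):

```python
# Temperature-based sampling probabilities over language pairs.

def sampling_probs(sizes, temperature):
    total = sum(sizes.values())
    # Raise each data proportion to the power 1/T, then renormalize.
    weights = {k: (n / total) ** (1.0 / temperature)
               for k, n in sizes.items()}
    z = sum(weights.values())
    return {k: w / z for k, w in weights.items()}

sizes = {"en-fr": 1_000_000, "en-my": 10_000}
p1 = sampling_probs(sizes, temperature=1.0)  # true data distribution
p5 = sampling_probs(sizes, temperature=5.0)  # flatter: en-my upsampled
```

With $T = 5$ the low-resource pair is sampled far more often than its raw share of the data, without being forced all the way to a uniform split.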

Several works have suggested that the limiting factor is the capacity of the model (i.e., the number of parameters). Although multilingual training with shared parameters can increase transfer, increasing the number of languages decreases the per-task capacity of the model. Arivazhagan et al. (2019) suggest that model capacity may be the most important factor in the transfer-interference trade-off; they show that larger models (deeper or wider) show better translation performance across the board, deeper models being particularly successful for low-resource languages, whereas wider models appeared more prone to overfitting. Zhang et al. (2020a) show that online backtranslation combined with a deeper Transformer architecture and a special language-aware layer normalization and linear transformations between the encoder and the decoder improve translation in a many-to-many set-up.

#### 4.2.2 Multilingual Unsupervised Models

As noted in Section 3.2.2, unsupervised MT performs quite poorly in low-resource language pairs, and one of the ways in which researchers have tried to improve its performance is by exploiting data from other languages. Sen et al. (2019b) demonstrate that a multilingual unsupervised NMT model can perform better than bilingual models in each language pair, but they only experiment with high-resource language pairs. Later works (Garcia et al. 2021; Ko et al. 2021) directly address the problem of unsupervised NMT for a low-resource language pair in the case where there is parallel data in a related language. More specifically, they use data from a third language (Z) to improve unsupervised MT between a low-resource language (X) and a high-resource language (Y). In both works, they assume that X is closely related to Z, and that there is parallel data between Y and Z. As in the original unsupervised NMT models (Lample et al. 2018a; Artetxe et al. 2018), the training process uses denoising autoencoders and iterative backtranslation.

### 4.3 Large-scale Multilingual Pre-training

The success of large-scale pre-trained language models such as ELMo (Peters et al. 2018) and BERT (Devlin et al. 2019) has inspired researchers to apply related techniques to MT. Cross-lingual language models (XLM; Conneau and Lample 2019) are a direct application of the BERT masked language model (MLM) objective to learn from parallel data. The training data consists of concatenated sentence pairs, so that the model learns to predict the identity of the masked words from the context in both languages simultaneously. XLM was not applied to low-resource MT in the original paper, but was shown to improve unsupervised MT, as well as language modeling and natural language inference in low-resource languages.

The first really successful large-scale pre-trained models for MT were mBART (Liu et al. 2020) and MASS (Song et al. 2019b), which demonstrated improvements to NMT in supervised, unsupervised, and semi-supervised (i.e., with backtranslation) conditions, including low-resource language pairs. The idea of these models is to train a denoising autoencoder using large collections of monolingual (i.e., not parallel) data in two or more languages. The autoencoder is a Transformer-based encoder-decoder, and the noise is introduced by randomly masking portions of the input sentence. Once the autoencoder has been trained to convergence, its parameters can be used to initialize the MT model, which is trained as normal. Using mBART, Liu et al. (2020) were able to demonstrate unsupervised NMT working on the distant low-resource language pairs Nepali–English and Sinhala–English, as well as showing improvements in supervised NMT in low-resource language pairs such as Gujarati–English.
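
The noising operation at the heart of this objective can be sketched as follows. This is a deliberately simplified illustration of span masking; the actual mBART corruption scheme (Poisson-distributed span lengths, multiple spans, sentence permutation) is richer.

```python
# Denoising-autoencoder training signal: corrupt the input by masking
# a contiguous span, and train the model to reconstruct the original.

import random

def mask_span(tokens, span_len, mask_token="<mask>"):
    start = random.randrange(len(tokens) - span_len + 1)
    # The whole span is replaced by a single mask token.
    return tokens[:start] + [mask_token] + tokens[start + span_len:]

random.seed(0)
source = ["the", "cat", "sat", "on", "the", "mat"]
noised = mask_span(source, span_len=2)
# The autoencoder is then trained on the pair (noised, source).
```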

The original mBART was trained on 25 different languages and its inclusion in HuggingFace (Wolf et al. 2020) makes it straightforward to use for pre-training. It has since been extended to mBART50 (Tang et al. 2021), which is trained on a mixture of parallel and monolingual data, and includes 50 different languages (as the name suggests); mBART50 is also available on HuggingFace. A recent case study (Birch et al. 2021) has demonstrated that mBART50 can be combined with focused data-gathering techniques to quickly develop a domain-specific, state-of-the-art MT system for a low-resource language pair (in this case, Pashto–English).

A recent multilingual pre-trained method called mRASP (Lin et al. 2020) has shown strong performance across a range of MT tasks: medium, low, and very low-resource. mRASP uses unsupervised word alignments generated by MUSE (Conneau et al. 2018) to perform random substitutions of words with their translations in another language, with the aim of bringing words with similar meanings across multiple languages closer in the representation space. They show gains of up to 30 BLEU points for some very low-resource language pairs such as Belarusian–English. mRASP2 (Pan et al. 2021) extends this work by incorporating monolingual data into the training.
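
The random substitution step used by mRASP can be illustrated with a toy sketch: source words are swapped for dictionary translations, producing code-switched input that encourages words with similar meanings to share representation space. The lexicon and substitution rate below are toy values, and real systems use alignments induced by MUSE rather than a hand-written dictionary.

```python
# mRASP-style code-switching: randomly substitute source words with
# their translations from a bilingual lexicon.

import random

def code_switch(tokens, lexicon, rate=0.5, rng=random.Random(0)):
    return [lexicon[t] if t in lexicon and rng.random() < rate else t
            for t in tokens]

lexicon = {"cat": "chat", "house": "maison"}
switched = code_switch(["the", "cat", "in", "the", "house"],
                       lexicon, rate=1.0)
# switched == ["the", "chat", "in", "the", "maison"]
```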

Of course, pre-trained models are only useful if the languages you are interested in are included in the pre-trained model, and if you have the resources to train and deploy these very large models. On the former point, Muller et al. (2021) have considered the problem of extending multilingual BERT (mBERT) to new languages for natural language understanding tasks. They find greater difficulties for languages that are more distant from those in mBERT and/or have different scripts—but the latter problem can be mitigated with careful transliteration.

## 5 Use of External Resources and Linguistic Information

For some languages, alternative sources of linguistic information, for example (i) linguistic tools (Section 5.1) and (ii) bilingual lexicons (Section 5.2), can be exploited. They can provide richer information about the source or target languages (in the case of tagging and syntactic analysis) and additional vocabulary that may not be present in parallel data (in the case of bilingual lexicons and terminologies). Although there has been a large body of work in this area in MT in general, only some of it has been applied to true low-resource settings. We assume that this is because of the lack of tools and resources for many of these languages, or at least the lack of resources of sufficiently good quality. We therefore review work exploiting these two sources of additional information for languages where such resources are available, with a particular focus on work that has been applied to low-resource languages.

### 5.1 Linguistic Tools and Resources

Additional linguistic analysis such as part-of-speech (PoS) tagging, lemmatization, and parsing can help to reduce sparsity by providing abstraction from surface forms, as long as the linguistic tools and resources are available. A number of different approaches have been developed for the integration of linguistic information in NMT. These include morphological segmentation (Section 5.1.1), factored representations (Section 5.1.2), multi-task learning (Section 5.1.3), interleaving of annotations (Section 5.1.4), and syntactic reordering (Section 5.1.5). At the extreme, these resources can be used to build full rule-based translation models (Section 6.5).

#### 5.1.1 Morphological Segmentation

A crucial part of training an NMT system is the choice of subword segmentation, a pre-processing technique that makes it possible to represent an infinite vocabulary with a fixed number of units and to generalize better over shorter units. For low-resource languages, it is even more important because there is a greater chance of coming across words that were not seen at training time. The most commonly used strategies are statistics-based, such as BPE (Sennrich, Haddow, and Birch 2016b) and SentencePiece (Kudo and Richardson 2018). Not only might these strategies be suboptimal from the point of view of linguistic generalization, but for low-resource languages especially they have also been shown to give highly variable results depending on the degree of segmentation selected; this degree is a parameter that must therefore be chosen wisely (Ding, Renduchintala, and Duh 2019; Sennrich and Zhang 2019).
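
The core of BPE learning is compact enough to sketch: repeatedly merge the most frequent adjacent symbol pair, recording each merge. This toy version omits the frequency weighting of word types and the end-of-word handling that real toolkits (subword-nmt, SentencePiece) use; the number of merges is the "degree of segmentation" parameter discussed above.

```python
# Toy BPE: learn merge operations from a list of words.

from collections import Counter

def merge_word(w, pair):
    out, i = [], 0
    while i < len(w):
        if i + 1 < len(w) and (w[i], w[i + 1]) == pair:
            out.append(w[i] + w[i + 1])  # fuse the pair into one symbol
            i += 2
        else:
            out.append(w[i])
            i += 1
    return out

def learn_bpe(words, num_merges):
    vocab = [list(w) for w in words]  # start from characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in vocab:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        vocab = [merge_word(w, best) for w in vocab]
    return merges

merges = learn_bpe(["lower", "lowest", "low"], num_merges=2)
# merges == [("l", "o"), ("lo", "w")]
```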

Works exploring linguistic subword segmentation go back to statistical MT (Goldwater and McClosky 2005; Oflazer and Durgar El-Kahlout 2007). Much of the focus has been on morphologically rich languages, with high degrees of inflection and/or compounding, for example, German, where minor gains can be seen over standard BPE (Huck, Riess, and Fraser 2017). Specifically for low-resource languages, several works have tested the use of morphological analyzers to assist the segmentation of texts into more meaningful units. In their submission to the WMT19 shared task for English$→$Kazakh, Sánchez-Cartagena, Pérez-Ortiz, and Sánchez-Martínez (2019) use the morphological analyzer from Apertium (Forcada and Tyers 2016) to segment Kazakh words into the stem (often corresponding to the lemma in Kazakh) and the remainder of the word. They then learn BPE over the morphologically segmented data. Ortega, Castro Mamani, and Cho (2020) also use a BPE approach, guided by a list of suffixes, which are provided to the algorithm and are not segmented; they see better performance than with Morfessor or standard BPE. Saleva and Lignos (2021) also test morphologically aware subword segmentation for three low-resource language pairs: Nepali, Sinhala, and Kazakh to and from English. They test segmentations using the LMVR (Ataman et al. 2017) and MORSEL (Lignos 2010) analyzers, but find no gain over BPE and no consistent pattern in the results. These results go against previous findings by Grönroos et al. (2014) showing that an LMVR segmentation can outperform BPE for low-resource Turkish, but they are in accordance with more recent results for Kazakh–English (Toral et al. 2019) and Tamil–English (Dhar, Bisazza, and van Noord 2020), where it does not seem to improve over BPE.

#### 5.1.2 Factored Models

Factored source and target representations (Garcia-Martinez, Barrault, and Bougares 2016; Sennrich and Haddow 2016; Burlot et al. 2017) were designed as a way of decomposing word units into component parts, which can help to provide some level of composite abstraction from the original wordform. For example, a wordform may be represented by its lemma and its PoS, which together can be used to recover the original surface form. This type of modeling can be particularly useful for morphologically rich languages (many of which are already low-resource), for which the large number of surface forms can result in greater data sparsity and normally necessitates greater quantities of data.

Factored models originated in SMT (Koehn and Hoang 2007), but were notably not easy to scale. The advantage of factored representations in NMT is that the factors are represented in continuous space and therefore may be combined more easily, without resulting in an explosion in the number of calculations necessary. Garcia-Martinez, Barrault, and Bougares (2016), Sennrich and Haddow (2016), and Burlot et al. (2017) evaluate on language pairs involving at least one morphologically rich language and show that improvements in translation quality can be seen, but this is dependent on the language pair and the type of linguistic information included in the factors. Nădejde et al. (2017) use factors to integrate source-side syntactic information in the form of combinatory categorial grammar (CCG) tags (Steedman 2000; Clark and Curran 2007), which they combine with an interleaving approach on the target side (see Section 5.1.4) to significantly improve MT performance for high-resource (German$→$English) and mid-low-resource (Romanian$→$English) language directions.
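The core idea can be sketched in a few lines, independently of any particular paper's implementation: each factor gets its own embedding table, and a token's representation is the concatenation of its factor embeddings. Vocabularies, dimensions, and values below are toy stand-ins.

```python
import random

# Sketch of a factored input representation: one embedding per factor,
# concatenated, so surface forms sharing a lemma also share most of their
# representation. All vocabularies and dimensions are illustrative.

rng = random.Random(0)
LEMMA_DIM, POS_DIM = 6, 2
lemma_emb = {"go": [rng.gauss(0, 1) for _ in range(LEMMA_DIM)],
             "be": [rng.gauss(0, 1) for _ in range(LEMMA_DIM)]}
pos_emb = {"VBD": [rng.gauss(0, 1) for _ in range(POS_DIM)],
           "VBZ": [rng.gauss(0, 1) for _ in range(POS_DIM)]}

def embed(lemma, pos):
    """Token representation = [lemma embedding ; PoS embedding]."""
    return lemma_emb[lemma] + pos_emb[pos]

# 'went' (go+VBD) and 'goes' (go+VBZ) differ only in the small PoS part:
v_went, v_goes = embed("go", "VBD"), embed("go", "VBZ")
assert len(v_went) == LEMMA_DIM + POS_DIM
assert v_went[:LEMMA_DIM] == v_goes[:LEMMA_DIM]
```

Because the factors are combined in continuous space, adding a factor only grows the embedding dimension rather than multiplying the vocabulary, which is why this scales better in NMT than in SMT.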

#### 5.1.3 Multi-task Learning

Multi-task learning can be seen as a way of forcing the model to learn better internal representations of wordforms by training the model to produce a secondary output (in this case linguistic analyses) as well as a translation.

Initial work in multi-task learning for MT did not concentrate on the low-resource scenario. Luong et al. (2016) explore different multi-task set-ups for translation (testing on English–German), among them a set-up in which parsing is used as an auxiliary task to translation; this appears to help translation performance as long as the model is not overly trained on the parsing task. The question of how to optimally train such multi-task models has inevitably since been explored, inspired in part by concurrent work in multi-encoder and multi-decoder multilingual NMT (see Section 4), since it appears that sharing all components across all tasks is not the optimal setting. Niehues and Cho (2017) experiment with PoS tagging and named entity recognition as auxiliary tasks to MT and test different degrees of sharing. They find that sharing the encoder only (i.e., separate attention mechanisms and decoders) works best and that using both auxiliary tasks enhances translation performance in a simulated low-resource German$→$English scenario.
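On the loss side, the shared-encoder set-up described above reduces to optimizing the translation loss plus down-weighted auxiliary losses (e.g., PoS tagging and NER), while only the encoder parameters are updated by every task. A minimal sketch, with illustrative values and weight:

```python
# Sketch of the multi-task training objective: the main MT loss is
# interpolated with auxiliary task losses. The weight and loss values
# here are illustrative, not taken from any cited paper.

def multitask_loss(mt_loss, aux_losses, aux_weight=0.1):
    """Interpolate the main translation loss with auxiliary task losses."""
    return mt_loss + aux_weight * sum(aux_losses)

# e.g., translation loss 2.5, PoS-tagging loss 1.0, NER loss 0.8:
assert abs(multitask_loss(2.5, [1.0, 0.8]) - 2.68) < 1e-9
```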

Since then, there have been some applications of multi-task learning to lower-resource scenarios, with slight gains in translation performance. Nădejde et al. (2017) also share encoders in their multi-task setting for the integration of target-side syntactic information in the form of CCG supertags (for German$→$English and mid-low-resource Romanian$→$English). Similarly, Zaremoodi, Buntine, and Haffari (2018) develop a strategy to avoid task interference in a multi-task MT set-up (with named entity recognition, semantic parsing, and syntactic parsing as auxiliary tasks). They do so by extending the recurrent components of the model with multiple blocks and soft routing between them to act like experts. They test in real low-resource scenarios (Farsi–English and Vietnamese–English) and obtain gains of approximately 1 BLEU point by using the additional linguistic information in the dynamic sharing set-up they propose.

#### 5.1.4 Interleaving of Linguistic Information in the Input

As well as comparing factored representations and multi-task decoding, Nădejde et al. (2017) also introduce a new way of integrating target-side syntactic information, which they call interleaving. The idea is to annotate the target side of the training data with token-level information (CCG supertags in their case) by adding before each token a separate token containing the information pertaining to it, so that the model learns to produce the annotations along with the translation. They found that this worked better than multi-task learning for the integration of target-side annotations and that it was also complementary to the use of source factors. Inspired by these results, Tamchyna, Weller-Di Marco, and Fraser (2017) also followed the interleaving approach (for English$→$Czech and English$→$German, so not low-resource scenarios), but with the prediction of interleaved morphological tags and lemmas, followed by a deterministic wordform generator. Whereas Nădejde et al. (2017) seek to create representations that are better syntactically informed, the aim of Tamchyna, Weller-Di Marco, and Fraser (2017) is different: They aim to create a better generalization capacity for translation into a morphologically rich language by decomposing wordforms into their corresponding tags and lemmas. They see significantly improved results with the two-step approach, but find that simply interleaving morphological tags (similar to Nădejde et al. [2017]) does not lead to improvements. They hypothesize that the morphological tags are less informative than CCG supertags and therefore the potential gain in information is counterbalanced by the added difficulty of having to translate longer target sequences.
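Interleaving itself is a simple pre- and post-processing step; a minimal sketch (the `TAG:` marker is a hypothetical convention for distinguishing annotation tokens from word tokens):

```python
# Sketch of target-side interleaving: each target token is preceded by a
# token carrying its annotation (CCG supertags in Nadejde et al. 2017,
# morphological tags in Tamchyna et al. 2017), and the annotations are
# stripped from the system output afterwards. Tag values are illustrative.

TAG_PREFIX = "TAG:"  # hypothetical marker for annotation tokens

def interleave(tokens, tags):
    out = []
    for tok, tag in zip(tokens, tags):
        out.extend([TAG_PREFIX + tag, tok])
    return out

def strip_tags(tokens):
    return [t for t in tokens if not t.startswith(TAG_PREFIX)]

seq = interleave(["the", "cat", "sleeps"], ["DT", "NN", "VBZ"])
assert seq == ["TAG:DT", "the", "TAG:NN", "cat", "TAG:VBZ", "sleeps"]
assert strip_tags(seq) == ["the", "cat", "sleeps"]
```

Note that the interleaved target sequence is roughly twice as long, which is exactly the cost that Tamchyna, Weller-Di Marco, and Fraser (2017) suggest can outweigh the benefit of less informative tags.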

In a systematic comparison with both recurrent neural networks and transformer architectures and for 8 language directions (and in particular for low-resource languages), Sánchez-Cartagena, Pérez-Ortiz, and Sánchez-Martínez (2020) find that interleaving (with PoS information and morphological tags) is beneficial, in line with the conclusions from Nădejde et al. (2017). Interestingly, they find that (i) interleaving linguistic information in the source sentence can help and morphological information is better than PoS tags, and (ii) interleaving in the target sentence can also help, but PoS tagging is more effective than morphological information, despite the translations being more grammatical with added morphological information.

#### 5.1.5 Syntactic Reordering

Other than being used as an additional form of input, syntactic information can also be used a priori to facilitate the translation task by reordering words within sentences to better match a desired syntactic order. Murthy, Kunchukuttan, and Bhattacharyya (2019) found this to be particularly effective for very-low-resource languages in a transfer learning set-up, when transferring from a high-resource language pair to a low-resource pair (see Section 4.1). Testing on translation into Hindi from Bengali, Gujarati, Marathi, Malayalam, and Tamil, having transferred from the parent language direction English$→$Hindi, they apply syntactic reordering rules on the source-side to match the syntactic order of the child source language, resulting in significant gains in translation quality.
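A toy version of such source-side reordering, flattened to PoS-tagged tokens rather than the parse trees real systems operate on; the rule below only illustrates the idea of rewriting the parent source (SVO English) into the child source's SOV order:

```python
# Sketch of a syntactic reordering rule: move verbs after their objects
# so that an SVO source matches an SOV child language (as for the Indic
# target set-up described above). Real systems apply rules over parses;
# this flat, PoS-tagged version is only illustrative.

def reorder_svo_to_sov(tagged_tokens):
    """Move all verbs to the end of the clause: S V O -> S O V."""
    verbs = [t for t in tagged_tokens if t[1] == "V"]
    rest = [t for t in tagged_tokens if t[1] != "V"]
    return [word for word, _ in rest + verbs]

assert reorder_svo_to_sov(
    [("she", "N"), ("reads", "V"), ("books", "N")]
) == ["she", "books", "reads"]
```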

### 5.2 Bilingual Lexicons

Bilingual lexicons are lists of terms (words or phrases) in one language associated with their translations in a second language. The advantage of bilingual lexicons is that they may well provide specialist or infrequent terms that do not appear in available parallel data, with the downside that they do not give information about the translation of terms in context, notably when there are several possible translations of the same term. However, they may be important resources to exploit, since they provide complementary information to parallel data and may be more readily available and cheaper to produce.16

The approaches developed so far to exploit bilingual lexicons to directly improve NMT can be summarized as follows: (i) as seed lexicons to initialize unsupervised MT (Lample et al. 2018b; Duan et al. 2020) (as described in Section 3.2.2); (ii) as an additional scoring component, particularly to provide coverage for rare or otherwise unseen vocabulary (Arthur, Neubig, and Nakamura 2016; Feng et al. 2017); and (iii) as annotations in the source sentence by adding translations from lexicons just after their corresponding source words (Dinu et al. 2019)17 or by replacing them in a code-switching-style set-up (Song et al. 2019a).
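The two variants of approach (iii) can be sketched as small preprocessing functions; the French to English lexicon entries below are purely illustrative:

```python
# Sketch of source-side lexicon integration: inline annotation in the
# spirit of Dinu et al. (2019), where the lexicon translation is inserted
# right after the source word, and code-switching-style replacement in
# the spirit of Song et al. (2019a). The lexicon is a toy stand-in.

lexicon = {"fievre": "fever", "toux": "cough"}  # hypothetical FR->EN entries

def annotate(tokens):
    """Append each word's lexicon translation right after the word."""
    out = []
    for tok in tokens:
        out.append(tok)
        if tok in lexicon:
            out.append(lexicon[tok])
    return out

def code_switch(tokens):
    """Replace words found in the lexicon by their translations."""
    return [lexicon.get(tok, tok) for tok in tokens]

src = ["la", "fievre", "et", "la", "toux"]
assert annotate(src) == ["la", "fievre", "fever", "et", "la", "toux", "cough"]
assert code_switch(src) == ["la", "fever", "et", "la", "cough"]
```

In practice the annotated tokens are usually marked (e.g., with source factors) so the model can tell original words from injected translations.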

The most recent work on using lexicons in pre-trained multilingual models (Lin et al. 2020, mRASP) shows the most promise. Here, translations of words are substituted into the source sentence in pre-training, with the goal of bringing words with similar meanings across multiple languages closer in the representation space. See Section 4.3 for more details.

In the previous sections we have looked at using monolingual data, data from other language pairs, and other linguistic data to improve translation. In this section we explore work that aims to make better use of the data we already have by investigating better modeling, training, and inference techniques.

In recent years, MT systems have converged toward a fairly standardized architecture—a sequence-to-sequence neural network model with an encoder and an auto-regressive decoder, typically implemented as a Transformer (Vaswani et al. 2017)—although recurrent models (Bahdanau, Cho, and Bengio 2015) are still used. Training is performed on a parallel corpus by minimizing the cross-entropy of the target translations conditional on the source sentences. Monolingual examples, if available, are typically converted to parallel sentences, as discussed in Section 3. Once the model is trained, translations are usually generated by beam search with heuristic length control, which returns high-probability sentences according to the learned distribution of target sentences conditional on the source sentences.

This approach has been very successful for MT on high-resource language pairs where there is enough high-quality parallel and monolingual text covering a wide variety of domains to wash out most of the misaligned inductive bias18 that the model might have. However, for low-resource language pairs the inductive bias of the model becomes more prominent, especially when the model operates out of the training distribution, as it frequently does when the training data has sparse coverage of the language. Therefore it can be beneficial to design the neural network architecture and training and inference procedures to be more robust to low-resource conditions, for instance by explicitly modeling the aleatoric uncertainty19 that is intrinsic to the translation task due to its nature of being a many-to-many mapping (one source sentence can have multiple correct translations and one translation can result from multiple source sentences [Ott et al. 2018]).

In this section, we will review recent machine learning techniques that can improve low-resource MT, such as meta-learning for data-efficient domain adaptation and multilingual learning (Section 6.1), Bayesian and latent variable models for explicit quantification of uncertainty (Section 6.2), and alternatives to cross-entropy training (Section 6.3) and to beam search inference (Section 6.4). We will also briefly discuss rule-based approaches for translation between related low-resource languages (Section 6.5).

### 6.1 Meta Learning

In Section 4 we discussed using multilingual training to improve low-resource MT by combining training sets for different language pairs in joint-learning or transfer learning schemes. A more extreme form of this approach involves the application of meta learning: Rather than training a system to directly perform well on a single task or fixed set of tasks (language pairs in our case), a system can be trained to quickly adapt to a novel task using only a small number of training examples, as long as this task is sufficiently similar to tasks seen during (meta-)training.

One of the most successful meta-learning approaches is Model-Agnostic Meta-Learning (MAML) (Finn, Abbeel, and Levine 2017), which was applied to multilingual MT by Gu et al. (2018). In MAML, we train task-agnostic model parameters $\theta$ so that they can serve as a good initial value that can be further optimized toward a task-specific parameter vector $\theta^*_m$ based on a task-specific training set $D_m$. This is accomplished by repeatedly simulating the fine-tuning procedure, evaluating each fine-tuned model on its task-specific evaluation set, and then updating the task-agnostic parameters in the direction that improves this score.

Once training is completed, the fine-tuning procedure can be directly applied to any novel task. Gu et al. (2018) apply MAML by meta-training on synthetic low-resource tasks obtained by randomly subsampling parallel corpora of high-resource language pairs and then fine-tune on true low-resource language pairs, obtaining substantial improvements.
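The inner/outer loop structure of MAML can be made concrete on a toy one-dimensional regression family, where each "task" stands in for a (subsampled) language pair. This is a first-order sketch under illustrative settings, not Gu et al.'s implementation:

```python
# First-order MAML sketch: each task m is "fit y = a_m * x" with squared
# loss. The inner step simulates fine-tuning on the task; the outer step
# updates the task-agnostic parameter using the gradient at the adapted
# parameters. Learning rates and tasks are toy values.

def loss_grad(theta, a, x):
    # gradient of the squared error (theta*x - a*x)**2 w.r.t. theta
    return 2 * (theta * x - a * x) * x

def maml_step(theta, tasks, inner_lr=0.05, outer_lr=0.05, x=1.0):
    outer_grad = 0.0
    for a in tasks:
        # inner step: simulate task-specific fine-tuning
        theta_task = theta - inner_lr * loss_grad(theta, a, x)
        # first-order approximation: gradient at the adapted parameters
        outer_grad += loss_grad(theta_task, a, x)
    return theta - outer_lr * outer_grad / len(tasks)

theta = 0.0
tasks = [1.0, 3.0]  # two "tasks" with optima a=1 and a=3
for _ in range(200):
    theta = maml_step(theta, tasks)
# theta converges between the task optima: a good initialization for both
assert abs(theta - 2.0) < 1e-6
```

The meta-trained `theta` is deliberately not optimal for any single task; its value lies in how few inner steps are needed to specialize it, mirroring fine-tuning on a true low-resource pair.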

An alternative approach to meta-learning involves training memory-augmented networks that receive the task-specific training examples at execution time and maintain a representation of them which they use to adapt themselves on the fly (Vinyals et al. 2016; Santoro et al. 2016), an approach related to the concept of “fast weights” computed at execution time as opposed to “slow weights” (the model parameters) computed at training time (Schmidhuber 1992). Lake (2019) applied memory-augmented networks to synthetic sequence-to-sequence tasks in order to evaluate out-of-distribution generalization under a variety of conditions. Curiously, very large language models such as GPT-2 (Radford et al. 2019) and in particular GPT-3 (Brown et al. 2020) also exhibit this meta-learning capability even without any modification of the network architecture or training procedure, suggesting that meta-learning itself can be learned from a sufficiently large amount of data. In fact, GPT-3 achieves near state-of-the-art quality when translating into English, even with a single translation example, for multiple source languages including Romanian, a medium-low resource language.

### 6.2 Latent Variable Models

Auto-regressive NMT models can in principle represent arbitrary probability distributions given enough model capacity and training data. However, in low-resource conditions, the inductive biases of the models might be insufficient for a good generalization, and different factorizations of probability distributions that result in different inductive biases may be beneficial.

Various approaches have attempted to tackle these issues by introducing latent variables: random variables that are observed neither as source nor as target sentences, but are instead inferred internally by the model. This can be done with a source-conditional parametrization, which applies latent-variable modeling only to the target sentence, or with a joint parametrization, which applies it to both the source and the target sentences.

Latent variable models enable a higher model expressivity and more freedom in the engineering of the inductive biases, at the cost of more complicated and computationally expensive training and inference. For this reason, approximation techniques such as Monte Carlo sampling or MAP inference over the latent variable are used, typically based on the Variational Autoencoder framework (VAE) (Kingma and Welling 2014; Rezende, Mohamed, and Wierstra 2014).

In the earliest variational NMT approach by Zhang et al. (2016), a source-conditional parametrization is used and the latent variable is a fixed-dimension continuous variable that is intended to capture global information about the target sentence. Training is performed by maximizing a lower bound on the conditional log-likelihood of the training examples, known as the Evidence Lower Bound (ELBO), which is computed using an auxiliary model component known as an inference network that approximates the posterior of the latent variable as a diagonal Gaussian conditional on both the source and the target sentence. During inference the latent variable is either sampled from the prior or, more commonly, approximated as its mode (which is also its mean). This basic approach, similar to image VAEs and the Variational Language Model of Bowman et al. (2016), adds limited expressivity to autoregressive NMT because a fixed-dimensional unimodal distribution is not especially well-suited to represent the variability of a sentence, but it can be extended in various ways: Su et al. (2018) and Schulz, Aziz, and Cohn (2018) use a sequence of latent variables, one for each target token, parametrized with temporal dependencies between each other.
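For a diagonal Gaussian posterior and a standard normal prior, the KL term of the ELBO has a closed form; the sketch below writes it out for a single example, with the decoder's reconstruction term left as a placeholder number:

```python
import math

# ELBO sketch for a source-conditional VAE-style NMT model:
#   ELBO = E_q[log p(y|x,z)] - KL( q(z|x,y) || p(z) )
# with q a diagonal Gaussian and p(z) = N(0, I). The reconstruction
# term would come from the decoder; here it is just a given number.

def kl_diag_gaussian(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), in closed form."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))

def elbo(reconstruction_log_likelihood, mu, log_var):
    return reconstruction_log_likelihood - kl_diag_gaussian(mu, log_var)

# A posterior identical to the prior incurs zero KL cost:
assert kl_diag_gaussian([0.0, 0.0], [0.0, 0.0]) == 0.0
assert elbo(-10.0, [0.0, 0.0], [0.0, 0.0]) == -10.0
```

Maximizing this bound trades reconstruction quality against keeping the posterior close to the prior, which is what makes sampling `z` from the prior at inference time reasonable.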

Setiawan et al. (2020) parametrize the latent posterior using normalizing flows (Rezende and Mohamed 2015), which can represent arbitrary and possibly multimodal distributions as a sequence of transformation layers applied to a simple base distribution.

Eikema and Aziz (2019) use a joint parametrization as they claim that explicitly modeling the source sentence together with the target sentence provides additional information to the model. Inference is complicated by the need to post-condition the joint probability distribution on the source sentence, hence a series of approximations is used in order to ensure efficiency.

The latent variable models described so far have been evaluated on high-resource language pairs, although most of them have been evaluated on the IWSLT data set, which represents a low-resource domain. However, latent-variable MT has also been applied to fully low-resource language pairs, using models where the latent variables have been designed to have linguistically motivated inductive bias. Ataman, Aziz, and Birch (2019) introduce an NMT model with latent word morphology in a hierarchical model, allowing for both word level representations and character level generation to be modeled. This is beneficial for morphologically rich languages, which include many Turkic and African low-resource languages. These languages use their complex morphologies to express syntactic and semantic nuance, which might not be captured by the purely unsupervised and greedy BPE preprocessing, especially when the BPE vocabulary is trained on a small corpus. The proposed model uses for each word one multivariate Gaussian latent variable representing a lemma embedding and a sequence of quasi-discrete latent variables representing morphology. Training is performed in a variational setting using a relaxation based on the Kumaraswamy distribution (Kumaraswamy 1980; Louizos, Welling, and Kingma 2018; Bastings, Aziz, and Titov 2019), and inference is performed by taking the modes of the latent distributions, as the model is source-conditional. This approach has been evaluated on morphologically rich languages including Turkish, yielding improvements both in in-domain and out-of-domain settings.

### 6.3 Alternative Training Objectives

When an autoregressive model is trained to optimize the cross-entropy loss, it is only exposed to ground-truth examples during training. When this model is then used to generate a sequence with ancestral sampling, beam search, or another inference method, it has to incrementally extend a prefix that it has generated itself. Since the model in general cannot learn exactly the “true” probability distribution of the target text, the target prefix that it receives as input will be out-of-distribution, which can cause the estimate of the next token probability to become even less accurate. This issue, named exposure bias by Ranzato et al. (2016), can compound with each additional token and might cause the generated text to eventually become completely nonsensical. Exposure bias theoretically occurs regardless of the task, but while its impact has been argued to be small in high-resource settings (Wu et al. 2018), in low-resource MT it has been shown to be connected to the phenomenon of hallucination, where the system generates translations that are partially fluent but contain spurious information not present in the source sentence (Wang and Sennrich 2020).

A number of alternatives to cross-entropy training have been proposed in order to avoid exposure bias, which all involve exposing the model during training to complete or partial target sequences generated by itself. Ranzato et al. (2016) explore multiple training strategies and propose a method called MIXER, which is a variation of the REINFORCE algorithm (Williams 1992; Zaremba and Sutskever 2015). In practice REINFORCE suffers from high variance; therefore they apply it only after the model has already been pre-trained with cross-entropy, a technique also used by all the other training methods described in this section. They further extend the algorithm by combining cross-entropy training and REINFORCE within each sentence according to a training schedule that interpolates from full cross-entropy to full REINFORCE. They do not evaluate on a true low-resource language pair, but they do report improvements on German$→$English translation on the relatively small IWSLT 2014 data set (Cettolo et al. 2014).
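The REINFORCE-style sequence-level objective can be sketched as follows; the reward here is a toy token-overlap score standing in for sentence-level BLEU, and the baseline subtraction is the standard variance-reduction trick:

```python
# REINFORCE sketch for sequence-level training: sample a translation from
# the model, score it against the reference with a sentence-level reward,
# and weight the sample's log-probability by (reward - baseline). The
# reward function is a toy stand-in for sentence BLEU.

def reward(sample, reference):
    """Toy reward: fraction of reference token types covered by the sample."""
    return len(set(sample) & set(reference)) / len(set(reference))

def reinforce_loss(sample_log_prob, sample, reference, baseline=0.0):
    # The gradient of this "loss" w.r.t. the model parameters is the
    # REINFORCE estimate: -(reward - baseline) * d(log_prob)/d(params).
    return -(reward(sample, reference) - baseline) * sample_log_prob

ref = ["the", "cat", "sleeps"]
loss = reinforce_loss(-4.0, ["the", "cat", "eats"], ref, baseline=0.2)
assert abs(loss - (2 / 3 - 0.2) * 4.0) < 1e-9
```

As noted above, this estimator has high variance in practice, which is why it is applied only after cross-entropy pre-training and often mixed with the cross-entropy loss, as in MIXER.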

Contrastive Minimum Risk Training (CMRT or just MRT) (Och 2003; Shen et al. 2016; Edunov et al. 2018b) is a similar training technique that can be considered a biased variant of REINFORCE that focuses on high-probability translations generated by decoding from the model itself. Wang and Sennrich (2020) apply CMRT to low-resource translation (German$→$Romansh) as well as German$→$English IWSLT 2014, reporting improvements in the out-of-domain test case, as well as a reduction of hallucinations.

Both REINFORCE and CMRT use a reward function that measures the similarity between generated and reference translations, often based on an automatic evaluation metric such as BLEU. However, the exact mechanism that makes such approaches work is not completely clear; Choshen et al. (2020) show that REINFORCE and CMRT also work when the reward function is a trivial constant function rather than a sentence similarity metric, suggesting that their primary effect is to regularize the model pre-trained with cross-entropy by exposing it to its own translations, hence reducing exposure bias.

An alternative training technique involving beam search decoding in the training loop has been proposed by Wiseman and Rush (2016), based on the LaSO (Daumé and Marcu 2005) structured learning algorithm. This approach also exposes the model to its own generations during training, and it has the benefit that training closely matches the inference process, reducing any mismatch. The authors report improvements on German$→$English IWSLT 2014. However, they do not evaluate on a true low-resource language pair.

An even simpler technique that exposes the model’s own generations during training is scheduled sampling (Bengio et al. 2015), which also starts with cross-entropy training and progressively replaces part of the ground truth target prefixes observed by the model with its own samples. Plain scheduled sampling is theoretically unsound (Huszar 2015), but it can be made more consistent by backpropagating gradients through a continuous relaxation of the sampling operation, as shown by Xu, Niu, and Carpuat (2019), who report improvements on the low-resource Vietnamese$→$English language pair.
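The prefix-mixing step at the core of scheduled sampling can be sketched directly; the "model" below is a toy stand-in, and the schedule simply increases the replacement probability `p` over training:

```python
import random

# Scheduled sampling sketch: when building the decoder input prefix
# during training, each ground-truth token is replaced by a token from
# the model itself with probability p; p grows over the course of
# training. The model here is a toy placeholder.

def mix_prefix(gold_prefix, model_sample, p, rng=random.Random(1)):
    """Replace each gold token with a model-sampled token with prob. p."""
    return [model_sample(i) if rng.random() < p else tok
            for i, tok in enumerate(gold_prefix)]

gold = ["the", "cat", "sleeps"]
fake_model = lambda i: "<model>"
assert mix_prefix(gold, fake_model, 0.0) == gold              # teacher forcing
assert mix_prefix(gold, fake_model, 1.0) == ["<model>"] * 3   # model samples
```

The theoretical unsoundness noted above stems from sampling being non-differentiable; the relaxation of Xu, Niu, and Carpuat (2019) makes this mixing step differentiable so gradients can flow through it.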

Regularization techniques have also been applied to low-resource MT. Sennrich and Zhang (2019) evaluated different hyperparameter settings, in particular, batch size and dropout regularization, for German$→$English with varying amounts of training data and low-resource Korean$→$English. Müller, Rios, and Sennrich (2020) experimented with various training and inference techniques for out-of-distribution MT both for a high-resource (German$→$English) and low-resource (German$→$Romansh) pair. For the low-resource pair they report improvements by using sub-word regularization (Kudo 2018), defensive distillation, and source reconstruction. An alternate form of subword regularization, known as BPE dropout, has been proposed by Provilkov, Emelianenko, and Voita (2020), reporting improvements on various high-resource and low-resource language pairs. He, Haffari, and Norouzi (2020) apply a dynamic programming approach to BPE subword tokenization, evaluating during training all possible ways of tokenizing each target word into subwords, and computing an optimal tokenization at inference time. Because their method is quite slow, however, they only use it to tokenize the training set and then train a regular Transformer model on it, combining it with BPE dropout on source words, reporting improvements on high-resource and medium-resource language pairs.
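BPE dropout can be sketched as ordinary BPE application in which individual merges are randomly skipped; this is a simplification of the published algorithm (which drops merges per segmentation pass), with a toy merge table:

```python
import random

# BPE-dropout sketch (after Provilkov, Emelianenko, and Voita 2020):
# apply the learned merges in priority order, but skip each individual
# merge with probability `dropout`, yielding a different, more fragmented
# segmentation of the same word on different training epochs.

def bpe_segment(word, merges, dropout=0.0, rng=random.Random(0)):
    symbols = list(word)
    for a, b in merges:  # merges in priority order
        i = 0
        while i < len(symbols) - 1:
            if (symbols[i], symbols[i + 1]) == (a, b) and rng.random() >= dropout:
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols

merges = [("l", "o"), ("lo", "w")]
assert bpe_segment("low", merges, dropout=0.0) == ["low"]          # plain BPE
assert bpe_segment("low", merges, dropout=1.0) == ["l", "o", "w"]  # characters
```

The extra segmentations act as data augmentation: the model sees many subword decompositions of each word, which is particularly helpful when the BPE vocabulary was learned on a small corpus.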

### 6.4 Alternative Inference Algorithms

In NMT, inference is typically performed using a type of beam search algorithm with heuristic length normalization20 (Jean et al. 2015; Koehn and Knowles 2017). Ostensibly, beam search seeks to approximate maximum a posteriori (MAP) inference, although it has been noticed that increasing the beam size, which improves the accuracy of the approximation, often degrades translation quality after a certain point (Koehn and Knowles 2017). It is actually feasible to exactly solve the MAP inference problem, and the resulting mode is often an abnormal sentence; in fact, it is often the empty sentence (Stahlberg and Byrne 2019). It is arguably dismaying that NMT relies on unprincipled inference errors in order to generate accurate translations. Various authors have attributed this “beam search paradox” to modeling errors caused by exposure bias or other training issues and they have proposed alternative training schemes such as those discussed in Section 6.3. Even a perfect probabilistic model, however, could well exhibit this behavior due to a counterintuitive property of many high-dimensional random variables that causes the mode of the distribution to be very different from typical samples, which have a log-probability close to the entropy of the distribution. (See Cover and Thomas [2006] for a detailed discussion of typicality from an information theory perspective). Eikema and Aziz (2020) recognize this issue in the context of NMT and tackle it by applying Minimum Bayes Risk (MBR) inference (Goel and Byrne 2000).

Minimum Bayes Risk seeks to generate a translation that is maximally similar, according to a metric such as BLEU or METEOR (Denkowski and Lavie 2011), to other translations sampled from the model itself, each weighted according to its probability. The intuition is that the generated translation will belong to a high-probability cluster of similar candidate translations; highly abnormal translations such as the empty sentence will be excluded. Eikema and Aziz (2020) report improvements over beam search on the low-resource language pairs of the FLORES data set (Nepali–English and Sinhala–English) (Guzmán et al. 2019), while they lose some accuracy on English–German. They also evaluate inference through ancestral sampling, the simplest and theoretically least biased inference technique, but find that it performs worse than both beam search and MBR.
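Sampling-based MBR reduces to picking the candidate with the highest average utility against the other candidates. The sketch below uses a toy overlap utility in place of BLEU/METEOR and a fixed candidate list in place of model samples:

```python
# MBR decoding sketch: among N candidate translations (in practice,
# ancestral samples from the model), return the one most similar on
# average to all the others. The utility is a toy token-overlap score
# standing in for a real metric such as BLEU or METEOR.

def utility(hyp, other):
    """Toy symmetric similarity: token-set Jaccard overlap."""
    hyp, other = set(hyp.split()), set(other.split())
    return len(hyp & other) / max(len(hyp | other), 1)

def mbr_decode(candidates):
    def expected_utility(hyp):
        return sum(utility(hyp, c) for c in candidates if c != hyp)
    return max(candidates, key=expected_utility)

samples = ["the cat sleeps", "the cat is sleeping", "a cat sleeps", ""]
# The empty hypothesis (the mode pathology discussed above) scores zero
# against everything and is never selected:
assert mbr_decode(samples) == "the cat sleeps"
```

With unbiased samples, the uniform average over candidates approximates the probability-weighted expectation, which is why no explicit weighting appears in the sketch.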

Energy-based models (EBMs) (LeCun et al. 2006) are alternative representations of a probability distribution that can be used for inference. An EBM of a random variable (a whole sentence, in our case) is a scalar-valued energy function, implemented as a neural network, which represents an unnormalized log-probability. This lack of normalization means that only probability ratios between two sentences can be computed efficiently; for this reason training and sampling from EBMs requires a proposal distribution to generate reasonably good initial samples to be re-ranked by the model, and in the context of MT this proposal distribution is a conventional autoregressive NMT model. Bhattacharyya et al. (2021) define source-conditional or joint EBMs trained on ancestral samples from an autoregressive NMT model using a reference-based metric (e.g., BLEU). During inference they apply the EBM to re-rank a list of N ancestral samples from the autoregressive NMT model. This approximates MAP inference on a probability distribution that tracks the reference-based metric, which would not give high weight to abnormal translations such as the empty sentence. They report improvements on multiple language pairs, in particular for medium-resource and low-resource language pairs such as Romanian, Nepali, and Sinhala to English.

Reranking has also been applied under the generalized noisy channel model initially developed for SMT (Koehn, Och, and Marcu 2003), where translations are scored not just under the probability of the target conditional on the source (direct model) but also under the probability of the source conditional on the target (channel model) and the unconditional probability of the target (language model prior), combined by a weighted sum of their logarithms. This reranking can be applied at sentence level on a set of candidate translations generated by the direct model by conventional beam search (Chen et al. 2020) or at token level interleaved with beam search (Bhosale et al. 2020), resulting in improvements in multiple language pairs including low-resource ones.
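Sentence-level noisy-channel reranking is a weighted log-linear combination of the three scores; a sketch with illustrative weights and scores (the three log-probabilities would come from the direct, channel, and language models):

```python
# Noisy-channel reranking sketch: each candidate y for source x is
# rescored as  log p(y|x) + w1 * log p(x|y) + w2 * log p(y).
# Scores and weights below are illustrative placeholders.

def noisy_channel_score(log_p_y_given_x, log_p_x_given_y, log_p_y,
                        w_channel=1.0, w_lm=0.3):
    return log_p_y_given_x + w_channel * log_p_x_given_y + w_lm * log_p_y

def rerank(candidates):
    """candidates: list of (translation, direct, channel, lm) tuples."""
    return max(candidates,
               key=lambda c: noisy_channel_score(c[1], c[2], c[3]))[0]

nbest = [
    ("fluent but unfaithful", -1.0, -9.0, -2.0),
    ("faithful translation", -2.0, -3.0, -3.0),
]
# The channel model penalizes a hypothesis that cannot "explain" the
# source, so the faithful candidate wins despite a lower direct score:
assert rerank(nbest) == "faithful translation"
```

The channel term is what discourages hallucinations: a fluent but unfaithful hypothesis assigns low probability to the actual source sentence.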

### 6.5 Rule-based Approaches

Rule-based machine translation (RBMT) consists of analyzing and transforming a source text into a translation by applying a set of hand-coded linguistically motivated rules. This was the oldest and the most common paradigm for machine translation before being largely supplanted by corpus-based approaches such as phrase-based statistical machine translation and neural machine translation, which usually outperform it on both accuracy and fluency, especially when translating between language pairs where at least one language is high-resource, such as English. However, rule-based techniques can still be successfully applied to the task of translation between closely related languages.

Modern implementations, such as the Apertium system (Forcada et al. 2011; Forcada and Tyers 2016; Khanna et al. 2021), use lexical translation and shallow transfer rules that avoid full parsing and instead exploit the similarities between the source and target language to restructure a sentence into its translation. This approach has been applied to various language pairs, especially in the Western Romance and the South Slavic sub-families. NMT approaches tend to have better fluency than RBMT but they can produce hallucinations in low-resource settings, where RBMT can instead benefit from lexical translation with explicit bilingual dictionaries; thus a line of research has developed that attempts to combine both approaches. For instance, Sánchez-Cartagena, Forcada, and Sánchez-Martínez (2020) used multi-source Transformer and deep-GRU (Miceli Barone et al. 2017) models to post-edit translations produced by an RBMT system for the Breton–French language pair.

One of the main drawbacks of RBMT is that it requires substantial language-specific resources and expertise that might not be available for all low-resource languages. See Section 5 for a discussion of other methods to use various linguistic resources that might be more readily available.

As researchers build different MT systems in order to try out new ideas, how do they know whether one is better than another? If a system developer wants to deploy an MT system, how do they know which is the best? Answering these questions is the goal of MT evaluation—to provide a quantitative estimate of the quality of an MT system’s output. MT evaluation is a difficult problem, since there can be many possible correct translations of a given source sentence. The intended use is an important consideration in evaluation; if translation is mainly for assimilation (gisting), then adequacy is of primary importance and errors in fluency can be tolerated; but if the translation is for dissemination (with post-editing), then errors in meaning can be corrected, but the translation should be as close to a publishable form as possible. Evaluation is not specific to low-resource MT; it is a problem for all types of MT research, but low-resource language pairs can present specific difficulties for evaluation.

Evaluation can either be manual (using human judgments) or automatic (using software). Human judgments are generally considered the “gold standard” for MT evaluation because, ultimately, the translation is intended to be consumed by humans. The annual WMT shared tasks have used human evaluation every year since they started in 2006, and the organizers argue that (Callison-Burch et al. 2007):

While automatic measures are an invaluable tool for the day-to-day development of machine translation systems, they are an imperfect substitute for human assessment of translation quality.

Human evaluation is of course much more time-consuming (and therefore expensive) than automatic evaluation, and so can only be used to compare a small number of variants, with most of the system selection performed by automatic evaluation. For low-resource MT, the potential difficulty with human evaluation is connecting the researchers with the evaluators. Some low-resource languages have very small language communities, so the pool of potential evaluators is small, whereas in other cases the researchers may not be well connected with the language community—in that case the answer should be for the researchers to engage more with the community (Nekoto et al. 2020).

Most of the evaluation in MT is performed using automatic metrics, but when these metrics are developed they need to be validated against human judgments as the gold standard. An important source of gold standard data for validation of metrics is the series of WMT metrics shared tasks (Freitag et al. 2021b). However, this data covers the language pairs used in the WMT news tasks, whose coverage of low-resource languages is limited to those listed in Table 3. It is unclear how well conclusions about the utility of metrics will transfer from high-resource languages to low-resource languages.

Table 3

Shared tasks that have included low-resource language pairs.

| Year | Shared task | Language pair(s) |
|------|-------------|------------------|
| 2018 | IWSLT (Niehues et al. 2018) | Basque–English |
| 2018 | WAT Mixed domain (Nakazawa et al. 2018) | Myanmar–English |
| 2019 | WAT Mixed domain (Nakazawa et al. 2019) | Myanmar–English and Khmer–English |
| 2019 | WAT Indic (Nakazawa et al. 2019) | Tamil–English |
| 2019 | WMT news (Barrault et al. 2019) | Kazakh–English and Gujarati–English |
| 2019 | LowResMT (Ojha et al. 2020) | {Bhojpuri, Latvian, Magahi, and Sindhi}–English |
| 2020 | WMT news (Barrault et al. 2020) | {Tamil, Inuktitut, Pashto, and Khmer}–English |
| 2020 | WMT Unsupervised and very low resource (Fraser 2020) | Upper Sorbian–German |
| 2020 | WMT Similar language (Barrault et al. 2020) | Hindi–Marathi |
| 2020 | WAT Mixed domain (Nakazawa et al. 2020) | Myanmar–English and Khmer–English |
| 2020 | WAT Indic (Nakazawa et al. 2020) | Odia–English |
| 2021 | AmericasNLP (Mager et al. 2021) | Ten indigenous languages of Latin America, to/from Spanish |
| 2021 | WAT News Comm (Nakazawa et al. 2021) | Japanese–Russian |
| 2021 | WAT Indic (Nakazawa et al. 2021) | Ten Indian languages, to/from English |
| 2021 | LowResMT (Ortega et al. 2021) | Taiwanese Sign Language–Trad. Chinese, Irish–English, and Marathi–English |
| 2021 | WMT News (Akhbardeh et al. 2021) | Hausa–English and Bengali–Hindi |
| 2021 | WMT Unsupervised and very low resource (Libovický and Fraser 2021) | Chuvash–Russian and Upper Sorbian–German |
| 2021 | WMT Large-Scale Multilingual MT (Wenzek et al. 2021) | FLORES-101: 2 small and 1 large task (10k pairs) |

Automatic metrics are nearly always reference-based—in other words, they work by comparing the MT hypothesis with a human-produced reference. This means that references need to be available for the chosen language pair, they should be of good quality, and ideally should be established benchmarks used by the research community. Such references are in short supply for low-resource language pairs (Section 2), and when available may be in the wrong domain, small, or of poor quality.

Looking at the automatic metrics in common use in current MT research, we see two main types: string-based metrics (BLEU [Papineni et al. 2002; Post 2018], ChrF [Popović 2015], etc.) and embedding-based metrics (BERTScore [Zhang et al. 2019], COMET [Rei et al. 2020], BLEURT [Sellam, Das, and Parikh 2020], etc.). The string-based metrics are “low-resource metrics,” in that they do not require any resources beyond tokenization/segmentation, whereas the embedding-based metrics require more significant resources such as pre-trained sentence embeddings, and often need human judgments for fine-tuning. The embedding-based metrics can be considered successors to metrics like METEOR (Denkowski and Lavie 2014), which uses synonym lists to improve matching between hypothesis and reference.

Recent comparisons (Freitag et al. 2021b; Kocmi et al. 2021) have suggested that embedding-based metrics have superior performance—in other words, that they correlate better with human judgments than string-based metrics. However, since embedding-based metrics generally rely on sentence embeddings, they only work when such embeddings are available, and if they are fine-tuned on human evaluation data, they may not perform as well when such data is not available. For instance, COMET is based on the XLM-R embeddings (Conneau et al. 2020), so it can support the 100 languages covered by XLM-R, but will not give reliable results for unsupported languages.

String-based metrics (such as BLEU and ChrF) will generally support any language for reference-based automatic evaluation. BLEU has been in use in MT evaluation for many years, and its benefits and limitations are well studied; ChrF has a much shorter history, but recent comparisons suggest that it performs better than BLEU (see references above), and its use of character n-grams probably makes it more suited to the morphological complexity found in many low-resource languages. A recent debate in the MT evaluation literature has been about the ability of automatic metrics to discern differences between high-quality MT systems (Ma et al. 2019). However, in low-resource MT we may be faced with the opposite problem, namely, that metrics may be less reliable when faced with several low-quality systems (Fomicheva and Specia 2019).
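To illustrate just how lightweight string-based metrics are, below is a minimal ChrF-style scorer. It follows the character n-gram F-score idea (whitespace removed, n-gram orders macro-averaged, β = 2 to weight recall more heavily); this is a simplified sketch, not a substitute for a reference implementation such as the one in sacreBLEU.

```python
from collections import Counter

def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams, ignoring whitespace."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))

def chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Macro-average the n-gram F-scores for n = 1..max_n."""
    scores = []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precision = overlap / sum(hyp.values())
        recall = overlap / sum(ref.values())
        if precision + recall == 0:
            scores.append(0.0)
        else:
            scores.append((1 + beta**2) * precision * recall
                          / (beta**2 * precision + recall))
    return sum(scores) / len(scores) if scores else 0.0
```

A perfect match scores 1.0, and partial character overlap degrades the score gracefully, which is what makes the metric usable for morphologically rich languages where exact word matches are rare.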

In conclusion, for evaluation of low-resource MT, we recommend human evaluation as the gold standard, but where automatic evaluation is used, to be especially wary of the lack of calibration of metrics and the potential unreliability of test sets for low-resource language pairs.

MT is a big field and many interesting papers are published all the time. Because of the variety of language pairs, toolkits and settings, it can be difficult to determine what research will have an impact beyond the published experiments. Shared tasks provide an opportunity to reproduce and combine research, while keeping the training and testing data constant.

System description papers can offer valuable insights into how research ideas can transfer to real gains when aiming to produce the best possible system with the data available. Whereas standard research papers often focus on showing that the technique or model proposed in the paper is effective, the incentives for system descriptions are different; authors are concerned with selecting the techniques (or most commonly the combination of techniques) that work best for the task. System descriptions therefore contain a good reflection of the techniques that researchers believe will work (together with their comparison) in standardized conditions. The difficulty with system descriptions is that the submitted systems are often potpourris of many different techniques, organized in pipelines with multiple steps, a situation that rarely occurs in research papers presenting individual approaches. In light of this, it is not always easy to pinpoint exactly which techniques lead to strongly performing systems, and it is often the case that similar techniques are used by both leading systems and those that perform less well. Moreover, the papers do not tend to provide an exhaustive and systematic comparison of different techniques due to differences in implementation, data processing, and hyperparameters. We also note that the evaluation of shared tasks normally focuses on quality alone, although a multi-dimensional analysis may be more appropriate (Ethayarajh and Jurafsky 2020), and that even if the task has manual evaluation, there is still debate about the best way to do this (Freitag et al. 2021a).

Apart from the system descriptions, an important output of shared tasks is the publication of standard training sets and test sets (Section 2). These can be used in later research, and help to raise the profile of the language pair for MT research.

In this section we survey the shared tasks that have included low-resource language pairs, and we draw common themes from the corresponding sets of system description papers, putting into perspective the methods previously described in this survey. Rather than attempting to quantify the use of different techniques à la Libovický (2021),21 we aim to describe how the most commonly used techniques are exploited, particularly in high-performing systems, providing some practical advice for training systems for low-resource language pairs. We begin with a brief description of shared tasks featuring low-resource pairs (Section 8.1), before surveying the techniques commonly used (Section 8.2).

### 8.1 Low-resource MT in Shared Tasks

There are many shared tasks that focus on MT, going all the way back to the earliest WMT shared task (Koehn and Monz 2006). However, they have tended to focus on well-resourced European languages and Chinese. Tasks specifically for low-resource MT are fairly new, coinciding with the recent interest in expanding MT to a larger range of language pairs.

We choose to focus particularly on shared tasks run by WMT (Conference on Machine Translation), IWSLT (International Conference on Spoken Language Translation), WAT (Workshop on Asian Translation), and LowResMT (Low Resource Machine Translation). In Table 3, we list the shared MT tasks that have focused on low-resource pairs. In addition to the translation tasks, we should mention that the corpus filtering task at WMT has specifically addressed low-resource MT (Koehn et al. 2019, 2020).

### 8.2 Commonly Used Techniques

In this section, we review the choices made by participants to shared tasks for low-resource MT, focusing on those techniques that are particularly widespread, those that work particularly well and the choices that are specific to particular languages or language families. We describe these choices in an approximately step-by-step fashion: starting with data preparation (Section 8.2.1) and data processing (Section 8.2.2); then proceeding to model architecture choices (Section 8.2.3); exploiting additional data, including backtranslation, pre-training, and multilinguality (Section 8.2.4); and finally looking at model transformation and finalization, including ensembling, knowledge distillation, and fine-tuning (Section 8.2.5).

#### 8.2.1 Data Preparation

An important initial step in training an NMT model is to identify available data (see Section 2) and to potentially filter it, depending on how noisy or out-of-domain the data set is, or to use an alternative strategy to indicate domain or data quality (i.e., tagging). So, what choices do participants tend to make in terms of using (or excluding) data sources, filtering and cleaning data, and using meta-information such as domain tags?

##### Choice of Data.

We focus on constrained submissions only (i.e., where participants can only use the data provided by the organizers), so most participants use all available data. Participants are sometimes hesitant about Web-crawled data (other than WMT newscrawl, which is generally more homogeneous and therefore of better quality): some choose to omit it (Singh 2020) and others filter it for quality (Chen et al. 2020; Li et al. 2020). It is very unusual to see teams do their own crawling (Hernandez and Nguyen [2020] is a counter-example); teams doing so run the risk of crawling data that overlaps with the development set or one side of the test set. Tran et al. (2021) successfully mined an extra million sentence pairs of Hausa–English data from the allowed monolingual data, helping them win the 2021 task.

##### Data Cleaning and Filtering.

Although not exhaustively reported, many of the submissions apply some degree of data cleaning and filtering to the parallel and monolingual data. In its simplest form, this means excluding sentences based on their length (if too long) and the ratio between the lengths of parallel sentences (if too different). Some teams also remove duplicates (e.g., Li et al. 2019). More rigorous cleaning includes eliminating sentences containing fewer than a specified percentage of alphanumeric characters (depending on the language’s script), those identified as belonging to another language (e.g., using language identification), or those less likely to belong to the same distribution as the training data (e.g., using filtering techniques such as Moore-Lewis [Moore and Lewis 2010]). Data filtering is also commonly applied to backtranslation data (see the paragraph on data augmentation below), often using similar techniques, such as dual conditional cross-entropy filtering (Junczys-Dowmunt 2018), to retain only the cleanest and most relevant synthetic parallel sentences. Unfortunately, the effect of data filtering is rarely evaluated, probably because it would involve expensive re-training.
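A minimal sketch of the simplest filters described above (length, length ratio, and alphanumeric content); the thresholds are illustrative defaults, not values taken from any particular submission:

```python
def keep_pair(src: str, tgt: str,
              max_len: int = 100, max_ratio: float = 2.0,
              min_alpha: float = 0.5) -> bool:
    """Return True if a sentence pair passes basic cleaning filters."""
    src_toks, tgt_toks = src.split(), tgt.split()
    # Drop empty or over-long sentences
    if not src_toks or not tgt_toks:
        return False
    if len(src_toks) > max_len or len(tgt_toks) > max_len:
        return False
    # Drop pairs whose lengths differ too much
    ratio = len(src_toks) / len(tgt_toks)
    if ratio > max_ratio or ratio < 1 / max_ratio:
        return False
    # Require a minimum proportion of alphanumeric characters
    for side in (src, tgt):
        stripped = side.replace(" ", "")
        if sum(c.isalnum() for c in stripped) / len(stripped) < min_alpha:
            return False
    return True
```

More aggressive steps (language identification, Moore-Lewis, dual cross-entropy filtering) would be applied after such cheap filters, since they require trained models.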

##### Data Tagging.

Some teams choose to include meta-information in their models through the addition of pseudo-tokens. For example, Dutta et al. (2020) choose to tag sentences according to their quality for the Upper Sorbian–German task, this information being provided by the organizers. Domain tagging (i.e., indicating the type of data), which can be useful to indicate whether data is in-domain or out-of-domain, was used by Chen et al. (2020), one of the top-scoring systems for Tamil–English. For the Basque–English task, Scherrer (2018) finds that using domain tags gives systematic improvements over not using them, and Knowles et al. (2020a) come to the same conclusion when translating into Inuktitut.

#### 8.2.2 Data Pre-processing

There is some variation in which data pre-processing steps are used. For example, it has been shown that for high-resource language pairs such as Czech–English, it is not always necessary to apply tokenization and truecasing steps (Bawden et al. 2019) before applying subword segmentation. We do not observe a clear pattern, with many systems applying all steps, and some excluding tokenization (Wu et al. 2020 for Tamil) and truecasing. Among the different possible pre-processing steps, we review participants’ choices concerning tokenization, subword segmentation, and transliteration/alphabet mapping (relevant when translating between languages that use different scripts).

##### Tokenization.

If a tokenizer is used before subword segmentation, it is common for it to be language-specific, particularly for the low-resource language in question. For example, IndicNLP22 (Kunchukuttan 2020) is widely used for Indian languages (e.g., for the shared tasks involving Gujarati and Tamil), and many of the Khmer–English submissions also used Khmer-specific tokenizers. For European languages, the Moses tokenizer (Koehn et al. 2007) remains the most commonly used option.

##### Subword Segmentation.

All participants perform some sort of subword segmentation, with most using either the SentencePiece (Kudo and Richardson 2018)23 or subword_nmt (Sennrich, Haddow, and Birch 2016b)24 toolkits. Even though standard BPE is not well suited to the Abugida scripts used for Gujarati, Tamil, and Khmer (in these scripts, two Unicode codepoints can be used to represent one glyph), we only found one group who modified BPE to take this into account (Shi et al. 2020). BPE-dropout (Provilkov, Emelianenko, and Voita 2020), a regularization method, was found to be useful by a number of teams (Knowles et al. 2020b; Libovický et al. 2020; Chronopoulou et al. 2020).
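The core of BPE itself is simple: repeatedly merge the most frequent adjacent symbol pair. The toy sketch below learns merges from a word-frequency table; real toolkits such as subword_nmt add end-of-word markers, vocabulary thresholds, and efficient data structures, so this is purely illustrative.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Toy BPE: `words` maps a word to its corpus frequency.
    Returns the list of learned merges and the segmented vocabulary."""
    vocab = {tuple(w): c for w, c in words.items()}  # word as symbol tuple
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge everywhere it occurs
        new_vocab = {}
        for symbols, count in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = count
        vocab = new_vocab
    return merges, vocab
```

The number of merges is exactly the "merge operations" hyperparameter discussed below; frequent stems like "low" quickly become single symbols while rare suffixes stay split.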

The size of the subword vocabulary is often a tuned parameter, although the range of different values tested is not always reported. Surprisingly, there is significant variation in the subword vocabulary sizes used, and there is not always a clear pattern. Despite the low-resource settings, many of the systems use quite large subword vocabularies (30k–60k merge operations). There are exceptions: A large number of the systems for Tamil–English use small vocabularies (6k–30k merge operations), which may be attributed to the morphologically rich nature of Tamil coupled with the scarcity of data.

Joint subword segmentation is fairly common. Its use is particularly well motivated when the source and target languages are similar and we may expect a high degree of lexical overlap (e.g., for the similar language shared tasks such as Upper Sorbian–German), and when “helper languages” are used to compensate for the low-resource scenario (e.g., the addition of Czech and English data). However, it is also used in some cases where there is little lexical overlap, for example, for Tamil–English, where the languages do not share the same script, including by some of the top-scoring systems (Shi et al. 2020; Wu et al. 2018). Although few systematic studies are reported, one hypothesis could be that even if different scripts are used there is no disadvantage to sharing segmentation; it could help with named entities and thereby reduce the overall vocabulary size of the model (Ding, Renduchintala, and Duh 2019).
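One way to probe this trade-off before training is to measure how much the two languages' subword vocabularies overlap and how much a joint vocabulary would save; a minimal sketch (the toy vocabularies in the usage below are illustrative):

```python
def joint_vocab_stats(vocab_a: set, vocab_b: set) -> dict:
    """Overlap (Jaccard) between two subword vocabularies, and the
    relative size saving of a joint vocabulary over two separate ones."""
    joint = vocab_a | vocab_b
    shared = vocab_a & vocab_b
    return {
        "overlap": len(shared) / len(joint),
        "saving": 1 - len(joint) / (len(vocab_a) + len(vocab_b)),
    }
```

High overlap (similar languages, shared script) makes joint segmentation an easy choice; near-zero overlap means the joint vocabulary costs capacity but, as hypothesized above, may still help with named entities.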

A few works explore alternative morphology-driven segmentation schemes, but without seeing any clear advantage: Scherrer, Grönroos, and Virpioja (2020) find that, for Upper-Sorbian–German, Morfessor can equal the performance of BPE when tuned correctly (but without surpassing it), whereas Sánchez-Cartagena (2018) find gains for Morfessor over BPE. Dhar, Bisazza, and van Noord (2020) have mixed results for Tamil–English when comparing linguistically motivated subword units compared with the use of statistics-based SentencePiece (Kudo and Richardson 2018).

##### Transliteration and Alphabet Mapping.

Transliteration and alphabet mapping have principally been used in the context of exploiting data from related languages that are written in different scripts. This was particularly prominent for translation involving Indian languages, which often have their own script. For the Gujarati–English task, many of the top systems used Hindi–English data (see the paragraph below on using other language data) and performed alphabet mapping into the Gujarati script (Li et al. 2019; Bawden et al. 2019; Dabre et al. 2019). For Tamil–English, Goyal et al. (2020) found that when using Hindi in a multilingual set-up, it helped for Hindi to be mapped into the Tamil script for the Tamil$→$English direction, but did not bring improvements for English$→$Tamil. Transliteration was also used in the Kazakh–English task, particularly with the addition of Turkish as a higher-resourced language. Toral et al. (2019), a top-scoring system, chose to cyrillize Turkish to increase overlap with Kazakh, whereas Briakou and Carpuat (2019) chose to romanize Kazakh to increase the overlap with Turkish, but only for the Kazakh$→$English direction.
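For the major Indic scripts, a first approximation of this alphabet mapping falls out of Unicode's design: the Brahmic scripts occupy parallel 128-codepoint blocks (Devanagari starts at U+0900, Gujarati at U+0A80), so Hindi text can be mapped into Gujarati script by a fixed codepoint offset. The sketch below illustrates the idea only; production systems use dedicated tools (e.g., the script conversion in IndicNLP) that handle codepoints without direct counterparts.

```python
# Parallel Unicode blocks: Devanagari U+0900-U+097F, Gujarati U+0A80-U+0AFF
DEVANAGARI_START, GUJARATI_START, BLOCK_SIZE = 0x0900, 0x0A80, 0x80

def devanagari_to_gujarati(text: str) -> str:
    """Map Devanagari codepoints to their Gujarati counterparts by offset;
    characters outside the Devanagari block pass through unchanged."""
    out = []
    for ch in text:
        cp = ord(ch)
        if DEVANAGARI_START <= cp < DEVANAGARI_START + BLOCK_SIZE:
            out.append(chr(cp - DEVANAGARI_START + GUJARATI_START))
        else:
            out.append(ch)
    return "".join(out)
```

For example, Devanagari KA (U+0915) maps to Gujarati KA (U+0A95), preserving the phonetic correspondence that makes the transliterated Hindi data useful for Gujarati.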

#### 8.2.3 Model Architectures and Training

The community has largely converged on a common architecture, the Transformer (Vaswani et al. 2017), although differences can be observed in terms of the number of parameters in the model and certain training parameters. It is particularly tricky to make generalizations about model and training parameters given the dependency on other techniques used (which can affect how much data is available). However, a few generalizations can be drawn, which we review here.

##### SMT versus NMT.

There is little doubt that NMT has overtaken SMT, even in low-resource tasks. The majority of submissions use neural MT models and more specifically transformers (rather than recurrent models). Some teams compare SMT and NMT (Dutta et al. 2020; Sen et al. 2019a), with the conclusion that NMT is better when sufficient data is available, including synthetic data. Some teams use SMT only for backtranslation, on the basis that SMT can work better (or at least be less susceptible to hallucinating) on the initial training using a very small amount of parallel data. For SMT systems, the most commonly used toolkit is Moses (Koehn et al. 2007), whereas there is a little more variation for NMT toolkits—the most commonly used being Fairseq (Ott et al. 2019), Marian (Junczys-Dowmunt et al. 2018a), OpenNMT (Klein et al. 2017), and Sockeye (Hieber et al. 2020).

##### Model Size.

Although systematic comparisons are not always given, some participants did indicate that architecture size was a tuned parameter (Chen et al. 2020), although this can be computationally expensive and is therefore not a possibility for all teams. The model sizes chosen for submissions vary, and there is no clear and direct link between size and model performance. However, there are some general patterns worth commenting on. While many of the baseline models are small (corresponding to transformer-base or models with fewer layers), a number of high-scoring teams found that it was possible to train larger models (e.g., deeper or wider) as long as additional techniques were used, such as monolingual pre-training (Wu et al. 2020), additional data from other languages in a multilingual set-up, or synthetic data created through pivoting through a higher-resource language (Li et al. 2019) or through backtranslation (Chen et al. 2020; Li et al. 2019). For example, the Facebook AI team (Chen et al. 2020), who tuned their model architecture, started with a smaller transformer (3 layers and 8 attention heads) for their supervised English$→$Tamil baseline, but were able to increase this once backtranslated data was introduced (to 10 layers and 16 attention heads). Although some systems perform well with a transformer-base model (Bawden et al. 2019 for Tamil–English), many of the best systems use larger models, such as transformer-big (Hernandez and Nguyen 2020; Kocmi 2020; Bei et al. 2019; Wei et al. 2020; Chen et al. 2020).

##### Alternative Neural Architectures.

Other than variations on the basic transformer model, there are few alternative architectures tested. Wu et al. (2020) tested the addition of dynamic convolutions to the transformer model following Wu et al. (2019), which they used along with other wider and deep transformers in model ensembles. However, they did not compare the different models. Another alternative form of modeling tested by several teams was factored representations (see Section 5.1.2). Dutta et al. (2020) explored the addition of lemmas and part-of-speech tags for Upper-Sorbian–German but without seeing gains, since the morphological tool used was not adapted to Upper-Sorbian. For Basque–English, Williams et al. (2018) find that source factors indicating the language of the subword can help to improve the baseline system.

##### Training Parameters.

Exact training parameters are often not provided, making comparison difficult. Many of the participants do not seem to choose training parameters that are markedly different from the higher-resource settings (Zhang et al. 2020c; Wu et al. 2020).

#### 8.2.4 Exploiting Additional Data

Much of this survey has been dedicated to approaches for exploiting additional resources to compensate for the lack of data for low-resource language pairs: monolingual data (Section 3), multilingual data (Section 4), or other linguistic resources (Section 5). In shared tasks, the following approaches have been shown to be highly effective in boosting performance in low-resource scenarios.

##### Backtranslation.

The majority of high-performing systems carry out some sort of data augmentation, the most common being backtranslation, often used iteratively, although forward translation is also used (Shi et al. 2020; Chen et al. 2020; Zhang et al. 2020c; Wei et al. 2020). For particularly challenging language pairs (e.g., very low-resource pairs of languages that are not closely related), it is important for the initial model used to produce the backtranslations to be of sufficiently high quality. For example, some of the top Gujarati–English systems used pre-training before backtranslation to boost the quality of the initial model (Bawden et al. 2019; Bei et al. 2019). Participants do not always report the number of iterations of backtranslation performed, although those that do often note that few improvements are seen beyond two iterations (Chen et al. 2020). Tagged backtranslation, whereby a pseudo-token is added to backtranslated sentences to distinguish them from genuine parallel data, has previously been shown to provide improvements (Caswell, Chelba, and Grangier 2019). Several participants report gains thanks to the addition of backtranslation tags (Wu et al. 2020; Chen et al. 2020; Knowles et al. 2020a), although Goyal et al. (2020) find that tagged backtranslation under-performs normal backtranslation in a multilingual set-up for Tamil–English.
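The data-preparation side of tagged backtranslation is a one-line transformation: a pseudo-token (the actual tag string is arbitrary; `<BT>` is used here) is prepended to the source side of each synthetic pair before mixing with genuine data. A minimal sketch:

```python
BT_TAG = "<BT>"  # illustrative pseudo-token; any reserved string works

def build_training_corpus(parallel, backtranslated):
    """Combine genuine parallel pairs with backtranslated pairs,
    tagging the source side of synthetic pairs so the model can
    learn to treat them differently (cf. Caswell et al. 2019)."""
    corpus = list(parallel)
    corpus += [(f"{BT_TAG} {src}", tgt) for src, tgt in backtranslated]
    return corpus
```

The tag must also be registered as a single token in the subword vocabulary, so it is not split into characters at training time.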

##### Synthetic Data from Other Languages.

A number of top-performing systems successfully exploit parallel corpora from related languages. The two top-performing systems for Gujarati–English use a Hindi–English parallel corpus to create synthetic Gujarati–English data (Li et al. 2019; Bawden et al. 2019). Both exploit the fact that there is a high degree of lexical overlap between Hindi and Gujarati once Hindi has been transliterated into Gujarati script. Li et al. (2019) choose to transliterate the Hindi side and then to select the best sentences using cross-entropy filtering, whereas Bawden et al. (2019) choose to train a Hindi$→$Gujarati model, which they use to translate the Hindi side of the corpus. Pivoting through a higher-resource related language was also found to be useful for other language pairs: for Kazakh–English, Russian was the language of choice (Li et al. 2019; Toral et al. 2019; Dabre et al. 2019; Budiwati et al. 2019); for Basque–English, Spanish was used as a pivot (Scherrer 2018; Sánchez-Cartagena 2018), which was found to be more effective than backtranslation by Scherrer (2018), and was found to benefit from additional filtering by Sánchez-Cartagena (2018).

##### Transfer-learning Using Language Modeling Objectives.

The top choices of language modeling objectives are mBART (Liu et al. 2020) (used by Chen et al. [2020] and Bawden et al. [2020] for Tamil–English), XLM (Conneau and Lample 2019) (used by Bawden et al. [2019] for Gujarati–English, by Laskar et al. [2020] for Hindi–Marathi, and by Kvapilíková, Kocmi, and Bojar [2020] and Dutta et al. [2020] for Upper Sorbian–German), and MASS (Song et al. 2019a) (used by Li et al. [2020] and Singh, Singh, and Bandyopadhyay [2020] for Upper Sorbian–German). Some of the top systems used these language modeling objectives, but their use was not across the board, and pre-training using translation objectives was arguably more common. Given the success of pre-trained models in NLP, this may seem surprising. A possible explanation is that these techniques can be computationally expensive to train from scratch, and the constrained nature of the shared tasks means that participants are discouraged from using off-the-shelf pre-trained language models.

##### Transfer Learning from Other MT Systems.

Another technique commonly used by participants was transfer learning involving other language pairs. Many of the teams exploited a high-resource related language pair. For example, for Kazakh–English, pre-training was done using Turkish–English (Briakou and Carpuat 2019) and Russian–English (Kocmi and Bojar 2019), Dabre et al. (2019) pre-trained for Gujarati–English using Hindi–English, and Czech–German was used to pre-train for Upper Sorbian–German (Knowles et al. 2020b).

An alternative but successful approach was to use a high-resource but not necessarily related language pair. For example, the CUNI systems use Czech–English to pre-train Inuktitut (Kocmi 2020) and Gujarati (Kocmi and Bojar 2019), and Bawden et al. (2020) found pre-training on English–German to be as effective as mBART training for Tamil–English. Finally, a number of teams opted for multilingual pre-training, involving the language pair in question and one or several higher-resource languages. Wu et al. (2020) use the mRASP approach: a universal multilingual model involving language data for English to and from Pashto, Khmer, Tamil, Inuktitut, German, and Polish, which is then fine-tuned to the individual low-resource language pairs.

##### Multilingual Models.

Other than the pre-training strategies mentioned just above, multilingual models feature heavily in shared task submissions. The overwhelmingly most common framework is the universal encoder-decoder model proposed by Johnson et al. (2017). Some participants chose to include selected (related) languages. Williams et al. (2018) use Spanish to boost Basque–English translation and find that the addition of French data degrades results. Goyal and Sharma (2019) add Hindi as an additional encoder language for Gujarati–English; and for Tamil–English, they test adding Hindi to either the source or target side depending on whether Tamil is the source or target language (Goyal et al. 2020). Other participants choose to use a larger number of languages. Zhang et al. (2020c) train a multilingual system on six Indian languages for Tamil–English; and Hokamp, Glover, and Gholipour Ghalandari (2019) train a multilingual model on all WMT languages for Gujarati–English (coming middle in the results table). Upsampling the lower-resourced languages in the multilingual systems is an important factor, whether the multilingual system is used as the main model or for pre-training (Zhang et al. 2020c; Wu et al. 2020). Recent approaches have seen success with more diverse and even larger sets of languages. Tran et al. (2021) train a model for 14 diverse language directions, winning 10 of them (although their system is unconstrained). Their two models, many-to-English and English-to-many, use a Sparsely Gated Mixture-of-Experts (MoE) model (Lepikhin et al. 2020). MoE models strike a balance between allowing high-resource directions to benefit from increased model capacity and allowing transfer to low-resource directions via shared capacity. Microsoft’s winning submission to the 2021 large-scale multilingual task (Yang et al. 2021) covered 10k language pairs across the FLORES-101 data set. They use the publicly available DeltaLM-Large model (Ma et al. 2021), a multilingual pre-trained encoder-decoder model, and apply progressive learning (Zhang et al. 2020b) (starting training with 24 encoder layers and adding 12 more) and iterative backtranslation.
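Upsampling is usually implemented as temperature-based sampling over language pairs, with p_i proportional to (n_i/N)^(1/T): T = 1 reproduces the raw data distribution, while larger T flattens it toward uniform, upsampling the low-resource pairs. A sketch (the pair names and corpus sizes are illustrative):

```python
def sampling_probs(sizes: dict, temperature: float = 5.0) -> dict:
    """Temperature-based sampling probabilities over language pairs:
    p_i proportional to (n_i / N) ** (1 / T)."""
    total = sum(sizes.values())
    weights = {pair: (n / total) ** (1.0 / temperature)
               for pair, n in sizes.items()}
    z = sum(weights.values())
    return {pair: w / z for pair, w in weights.items()}
```

With a 100:1 data imbalance and T = 5, the low-resource pair's sampling probability rises far above its natural share while the high-resource pair still dominates, which is the balance these systems aim for.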

#### 8.2.5 Model Transformation and Finalization

Additional techniques, not specific to low-resource MT, are often applied in the final stages of model construction and can provide significant gains on top of an already trained model. We group together here knowledge distillation (which we consider a form of model transformation) with model combination and fine-tuning (which can be considered model finalization techniques).

##### Knowledge Distillation.

Knowledge distillation (Kim and Rush 2016) leverages a large teacher model to train a student model: the teacher translates the training data, producing synthetic data on the target side. It is used by many submissions and seems to give minor gains, although it is not as widely used as backtranslation or ensembling. A number of teams apply it iteratively, in combination with backtranslation (Xia et al. 2019) or fine-tuning (Li et al. 2019). Bei et al. (2019) mix knowledge-distilled data with genuine and synthetic parallel data to train a new model, achieving gains in BLEU.
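Sequence-level knowledge distillation can be sketched in a few lines. Here `teacher` is a stand-in for a real teacher model's decoding function, and the uppercased "translations" are purely illustrative:

```python
def distill(teacher, parallel_data):
    """Sequence-level knowledge distillation (Kim and Rush 2016): the
    teacher re-translates the source side, and the student is then trained
    on (source, teacher translation) pairs rather than on the original
    references."""
    return [(src, teacher(src)) for src, _ in parallel_data]

# Toy stand-in teacher: a real one would be a trained NMT model.
teacher = lambda src: src.upper()
data = [("guten tag", "good day"), ("danke", "thanks")]
student_data = distill(teacher, data)
```

Teams such as Bei et al. (2019) then concatenate the distilled pairs with genuine and backtranslated data before training the student.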

##### Model Combination.

Ensembling, the combination of several independently trained models, is used by a large number of participants to obtain gains over single systems. Several teams seek to create ensembles of diverse models, including deep and wide ones. For example, Wu et al. (2020) experiment with ensembling larger models (first with a larger feed-forward dimension, then deeper models), including using different sampling strategies to increase the diversity of the models. Ensembling generally, but not always, leads to better results: Wu et al. (2020) found that a 9-model ensemble was best for Khmer and Pashto into English, but that a single model was best for English into Khmer and Pashto. A second way of combining several models is to use an additional model to rerank the n-best hypotheses of an initial model. Libovický et al. (2020) attempted right-to-left rescoring (against the normally produced left-to-right hypotheses), but without seeing any gains for Upper Sorbian–German. Chen et al. (2020) test noisy channel reranking for Tamil–English, also without seeing gains, although some gains are seen for Inuktitut→English, presumably because of the high-quality monolingual news data available to train a good English language model.
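As a toy illustration of how an ensemble combines models at decoding time (toolkits differ in the details, e.g., whether they average probabilities or log-probabilities; this sketch averages probabilities at a single decoding step):

```python
import math

def ensemble_step(model_logprobs):
    """One decoding step of a probability-averaging ensemble: each model
    supplies log-probabilities over the vocabulary, and the ensemble
    averages the probabilities before taking the log again."""
    n = len(model_logprobs)
    vocab = model_logprobs[0].keys()
    return {tok: math.log(sum(math.exp(m[tok]) for m in model_logprobs) / n)
            for tok in vocab}

# Two toy models over a three-token vocabulary.
m1 = {"a": math.log(0.7), "b": math.log(0.2), "c": math.log(0.1)}
m2 = {"a": math.log(0.5), "b": math.log(0.4), "c": math.log(0.1)}
avg = ensemble_step([m1, m2])
```

In a real decoder this combination happens inside beam search at every step, which is why ensembling multiplies inference cost by the number of models.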

##### Fine-tuning.

Already mentioned in the context of pre-training, fine-tuning was used in several ways by a large number of teams. It is inevitably used after pre-training on language modeling objectives or on other language pairs (see above) to adapt the model to the language direction in question. It is also frequently applied to models trained on backtranslated data, by fine-tuning on genuine parallel data (Sánchez-Cartagena, Pérez-Ortiz, and Sánchez-Martínez 2019). A final boost used by a number of top systems is achieved by fine-tuning on the development set (Shi et al. 2020; Chen et al. 2020; Zhang et al. 2020c; Wei et al. 2020). Not all teams made this choice; some kept the development set held out, notably to avoid the risk of overfitting.
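The fine-tuning cascades described here amount to a staged training schedule, which can be summarized schematically. The stage names and list-of-pairs data representation below are illustrative, not taken from any particular submission:

```python
def training_schedule(backtranslated, genuine, dev, finetune_on_dev=False):
    """Staged training as commonly reported in shared-task submissions:
    a main phase on backtranslated + genuine data, fine-tuning on genuine
    parallel data only, and optionally a final pass on the development set
    (which then can no longer serve as a held-out set)."""
    stages = [("main", backtranslated + genuine), ("finetune", genuine)]
    if finetune_on_dev:
        stages.append(("dev-finetune", dev))
    return stages

stages = training_schedule(["bt"], ["gen"], ["dev"], finetune_on_dev=True)
# stages: "main" on bt+gen, then "finetune" on gen, then "dev-finetune" on dev
```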

In the previous section (Section 8), we saw that even if shared tasks rarely offer definitive answers, they do give a good indication of which combinations of techniques can deliver state-of-the-art systems for particular language pairs. This look at the methods commonly used in practice hopefully provides some perspective on the research we have covered in this survey.

In this survey we have looked at the entire spectrum of scientific effort in the field: from data sourcing and creation (Section 2), through leveraging all types of available data, namely monolingual data (Section 3), multilingual data (Section 4), and other linguistic resources (Section 5), to improving model robustness, training, and inference (Section 6), and finally evaluating the results (Section 7).

Thanks to large-scale and also more focused efforts to identify, scrape, and filter parallel data from the Web, some language pairs have moved quickly from being considered low-resource to being considered medium-resourced (e.g., Romanian–English, Turkish–English, and Hindi–English). The ability of deep learning models to learn from monolingual data (Sennrich, Haddow, and Birch 2016a; Cheng et al. 2016; He et al. 2016) and from related languages (Liu et al. 2020; Conneau and Lample 2019; Pan et al. 2021) has had a big impact on this field. However, state-of-the-art systems for high-resourced language pairs like German–English and Chinese–English still seem to require huge amounts of data (Akhbardeh et al. 2021), far more than a human learner.

So how far have we come, and can we quantify the difference in performance between high- and low-resource language pairs? Examining recent comparisons reveals a mixed picture, and shows the difficulty of comparing results across language pairs. In the WMT21 news task, we can compare the German–English (nearly 100M sentences of parallel training data) and Hausa–English (just over 0.5M parallel sentences) tasks. The direct assessment scores of the best systems are similar: German→English (71.9), Hausa→English (74.4), English→German (83.3), and English→Hausa (82.7), on a 0–100 scale. In fact, for both out-of-English pairs, the evaluation did not find the best system to be statistically significantly worse than the human translation. In the WMT21 Unsupervised and Low-resource task (Libovický and Fraser 2021), BLEU scores of around 60 were observed for both German–Upper Sorbian directions. However, in the shared tasks run by AmericasNLP (Mager et al. 2021), nearly all BLEU scores were under 10, and in the “full” track of the 2021 Large-scale Multilingual shared task (Wenzek et al. 2021), the mean BLEU score of the best system was around 16, indicating low scores for many of the 10,000 pairs.

So, in terms of quantifiable results, the picture is mixed: translation can still be very poor in many cases, but where large-scale multilingual systems with ample synthetic data and a moderate amount of parallel data are possible, acceptable results can be obtained. However, even in the latter circumstances (for Hausa–English) it is unclear how good the translations really are; a detailed analysis such as that done by Freitag et al. (2021a) for German–English could be revealing. It is clear, however, that for language pairs outside the 100–200 that have been considered so far, MT is likely to be very poor or even impossible due to the severe lack of data. This is especially true for non-English pairs, which are little studied or evaluated in practice.

Looking forward, we now discuss a number of key areas for future work.

##### Collaboration with Language Communities.

Recent efforts by language communities to highlight their work, and their lack of access to opportunities to develop expertise and contribute to language technology, have brought valuable talent and linguistic knowledge into our field. A few major community groups are Masakhane (Nekoto et al. 2020; Onome Orife et al. 2020) for African languages, GhanaNLP,25 and AI4Bharat.26 Some of these have also led to workshops encouraging work on these languages, such as AfricaNLP,27 WANLP (Arabic Natural Language Processing Workshop) (Habash et al. 2021), DravidianLangTech (Chakravarthi et al. 2021), BSNLP (Balto-Slavic Natural Language Processing) (Babych et al. 2021), AmericasNLP (Mager et al. 2021), and WILDRE (Jha et al. 2020), to name a few. It is very clear that we need to work together with speakers of low-resource and endangered languages to understand their challenges and needs, and to create knowledge and resources that can help them benefit from progress in our field, as well as help their communities overcome language barriers.

##### Massive Multilingual Models.

The striking success of multilingual pre-trained models such as mBART (Liu et al. 2020) and mRASP (Pan et al. 2021) still needs further investigation, and massively multilingual models clearly confer advantages on both high- and low-resource pairs (Tran et al. 2021; Yang et al. 2021). We should be able to answer questions such as whether the gains come more from the size of the models, from the number of languages the models are trained on, or from the sheer amount of data used. There are also questions about how to handle languages that are not included in the pre-trained model.

##### Incorporating External Knowledge.

We will never have enough parallel data, and for many language pairs the situation is made harder by a lack of high-resourced related languages and a lack of monolingual data. We know that parallel data is not an efficient way to learn to translate. We have not fully explored questions such as what would be a more efficient way of encoding translation knowledge—bilingual lexicons, grammars, or ontologies—or indeed what type of knowledge is most helpful in creating MT systems and how to gather it. Further work is also needed on how best to incorporate these resources: should they be incorporated directly into the model, or should we use them to create synthetic parallel data?

##### Robustness.

Modern MT systems, being large neural networks, tend to incur substantial quality degradation as the distribution of data encountered by the production system becomes increasingly different from the distribution of the training data (Lapuschkin et al. 2019; Hupkes et al. 2019; Geirhos et al. 2020). This commonly occurs in translation applications, where the language domains, topics, and registers can be extremely varied and change quickly over time. Especially in low-resource settings, we are often limited to old training corpora from a limited set of domains. It is therefore of great importance to find ways to make systems robust to distribution shifts. This is a major research direction in general machine learning, but it has a specific angle in MT due to the possibility of producing hallucinations (Martindale et al. 2019; Raunak, Menezes, and Junczys-Dowmunt 2021) that might mislead the user. We need to find ways for systems to detect out-of-distribution conditions and ideally avoid producing hallucinations, or at least warn the user that the output might be misleading.

In conclusion, we hope that this survey will provide researchers with a broad understanding of low-resource machine translation, enabling them to be more effective at developing new tools and resources for this challenging, yet essential, research field.

This work was partly funded by Rachel Bawden’s chair in the PRAIRIE institute funded by the French national agency ANR as part of the “Investissements d’avenir” programme under the reference ANR-19-P3IA-0001. This work has received funding from the European Union’s Horizon 2020 Research and Innovation Programme under Grant Agreement No. 825299 (GoURMET). It was also supported by the UK Engineering and Physical Sciences Research Council fellowship grant EP/S001271/1 (MTStretch).

1. From Ethnologue: https://www.ethnologue.com/.

2. The Opus counts were accessed in September 2021.

3. There are of course parallel corpora that are not in Opus, and so these numbers remain an approximation, but we believe that due to its size and coverage, Opus counts can be taken as indicative of the total available parallel data.

4. To find further resources for Fon, we could consider Fon–French (Emezue and Dossou 2020), where there is more parallel data than for Fon–English, although the cleaned corpus still has fewer than 55k sentence pairs.

5. Unfortunately the JW300 corpus is no longer available to researchers, since the publishers have asked for it to be withdrawn for breach of copyright.

14. With the difference that unsupervised MT is architecture-dependent and we choose to discuss it in Section 3.2 on synthesizing parallel data.

15. These techniques are inspired by data augmentation in computer vision, where it is much simpler to manipulate examples to create new ones (for example by flipping and cropping images) while preserving the example label. The difficulty in constructing synthetic examples in NLP in general stems from the fact that modifying any of the discrete units of a sentence is likely to change its meaning or grammaticality.

16. Note that many of the works designed to incorporate bilingual lexicons actually work on automatically produced correspondences in the form of phrase tables (produced using SMT methods). Although these may be extracted from the same parallel data as used to train the NMT model, it may be possible to give more weight to infrequent words than when the pairs are learned using NMT.

17. The origin of the source units (i.e., original or inserted translation) is distinguished by using factored representations. A similar technique was used by Zhong and Chiang (2020) to insert dictionary definitions rather than translations.

18. Inductive bias is the preference towards certain probability distributions that the system has before training, resulting from its architecture.

19. Uncertainty that is caused by the intrinsic randomness of the task, as opposed to epistemic uncertainty, which results from ignorance about the nature of the task.

20. The implementations of beam search differ between MT toolkits in details that can have a significant impact on translation quality and are unfortunately often not well documented in the accompanying papers.

,
David Ifeoluwa
,
Dana
Ruiter
,
Jesujoba O.
Alabi
,
Damilola
,
Ayeni
,
Mofe
,
Ayodele
Awokoya
, and
Cristina
España-Bonet
.
2021
.
MENYO-20k: A multi-domain English-Yorùbá corpus for machine translation and domain adaptation
.
CoRR
,
abs/2103.08647
.
Agić
,
Željko
and
Ivan
Vulić
.
2019
.
JW300: A wide-coverage parallel corpus for low-resource languages
. In
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
, pages
3204
3210
.
Aharoni
,
Roee
,
Melvin
Johnson
, and
Orhan
Firat
.
2019
.
Massively multilingual neural machine translation
. In
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)
, pages
3874
3884
.
Aji
,
Alham Fikri
,
Nikolay
Bogoychev
,
Kenneth
Heafield
, and
Rico
Sennrich
.
2020
.
In neural machine translation, what does transfer learning transfer?
In
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
, pages
7701
7710
.
Akhbardeh
,
,
Arkhangorodsky
,
Magdalena
Biesialska
,
Ondřej
Bojar
,
Rajen
Chatterjee
,
Vishrav
Chaudhary
,
Marta R.
Costa-jussa
,
Cristina
España-Bonet
,
Angela
Fan
,
Christian
Federmann
,
Markus
Freitag
,
Yvette
Graham
,
Roman
Grundkiewicz
,
Barry
,
Leonie
Harter
,
Kenneth
Heafield
,
Christopher
Homan
,
Matthias
Huck
,
Kwabena
Amponsah-Kaakyire
,
Jungo
Kasai
,
Daniel
Khashabi
,
Kevin
Knight
,
Tom
Kocmi
,
Philipp
Koehn
,
Nicholas
Lourie
,
Christof
Monz
,
Makoto
Morishita
,
Masaaki
Nagata
,
Ajay
Nagesh
,
Toshiaki
Nakazawa
,
Matteo
Negri
,
Santanu
Pal
,
Allahsera Auguste
Tapo
,
Marco
Turchi
,
Valentin
Vydrin
, and
Marcos
Zampieri
.
2021
.
Findings of the 2021 conference on machine translation (WMT21)
. In
Proceedings of the Sixth Conference on Machine Translation
, pages
1
88
.
Arivazhagan
,
Naveen
,
Ankur
Bapna
,
Orhan
Firat
,
Dmitry
Lepikhin
,
Melvin
Johnson
,
Maxim
Krikun
,
Mia Xu
Chen
,
Yuan
Cao
,
George
Foster
,
Colin
Cherry
,
Wolfgang
Macherey
,
Zhifeng
Chen
, and
Yonghui
Wu
.
2019
.
Massively multilingual neural machine translation in the wild: Findings and challenges
. In
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)
, pages
3874
3884
.
Artetxe
,
Mikel
,
Gorka
Labaka
,
Eneko
Agirre
, and
Kyunghyun
Cho
.
2018
.
Unsupervised neural machine translation
. In
Proceedings of the 6th International Conference on Learning Representations
.
Artetxe
,
Mikel
and
Holger
Schwenk
.
2019a
.
Margin-based parallel corpus mining with multilingual sentence embeddings
. In
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
, pages
3197
3203
.
Artetxe
,
Mikel
and
Holger
Schwenk
.
2019b
.
Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond
.
Transactions of the Association for Computational Linguistics
,
7
:
597
610
.
Arthaud
,
Farid
,
Rachel
Bawden
, and
Alexandra
Birch
.
2021
.
Few-shot learning through contextual data augmentation
. In
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume
, pages
1049
1062
.
Arthur
,
Philip
,
Graham
Neubig
, and
Satoshi
Nakamura
.
2016
.
Incorporating discrete translation lexicons into neural machine translation
. In
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing
, pages
1557
1567
.
Ataman
,
Duygu
,
Wilker
Aziz
, and
Alexandra
Birch
.
2019
.
A latent morphology model for open-vocabulary neural machine translation
. In
Proceedings of the 7th International Conference on Learning Representations
.
Ataman
,
Duygu
,
Matteo
Negri
,
Marco
Turchi
, and
Marcello
Federico
.
2017
.
Linguistically motivated vocabulary reduction for neural machine translation from Turkish to English
.
The Prague Bulletin of Mathematical Linguistics
,
108
:
331
342
.
Babych
,
Bogdan
,
Olga
Kanishcheva
,
Preslav
Nakov
,
Jakub
Piskorski
,
Lidia
Pivovarova
,
Vasyl
Starko
,
Josef
Steinberger
,
Roman
Yangarber
,
Michał
Marcińczuk
,
Senja
Pollak
,
Pavel
Přibáň
, and
Marko
Robnik-Šikonja
, editors.
2021
.
Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing
.
Bahdanau
,
Dzmitry
,
Kyunghyun
Cho
, and
Yoshua
Bengio
.
2015
.
Neural machine translation by jointly learning to align and translate
. In
Proceedings of the 3rd International Conference on Learning Representations
.
Bañón
,
Marta
,
Pinzhen
Chen
,
Barry
,
Kenneth
Heafield
,
Hieu
Hoang
,
Miquel
Esplà-Gomis
,
Mikel L.
,
Amir
Kamran
,
Faheem
Kirefu
,
Philipp
Koehn
,
Sergio Ortiz
Rojas
,
Leopoldo Pla
Sempere
,
Gema
Ramírez-Sánchez
,
Elsa
Sarrías
,
Marek
Strelec
,
Brian
Thompson
,
William
Waites
,
Dion
Wiggins
, and
Jaume
Zaragoza
.
2020
.
ParaCrawl: Web-scale acquisition of parallel corpora
. In
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
, pages
4555
4567
.
Barrault
,
Loic
,
Ondřej
Bojar
,
Marta R.
Costa-jussa
,
Christian
Federmann
,
Mark
Fishel
,
Yvette
Graham
,
Barry
,
Matthias
Huck
,
Philipp
Koehn
,
Shervin
Malmasi
,
Christof
Monz
,
Mathias
Maller
,
Santanu
Pal
,
Matt
Post
, and
Marcos
Zampieri
.
2019
.
Findings of the 2019 Conference on Machine Translation (WMT19)
. In
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)
, pages
1
61
.
Barrault
,
Loïc
,
Magdalena
Biesialska
,
Ondrej
Bojar
,
Marta R.
Costa-jussà
,
Christian
Federmann
,
Yvette
Graham
,
Roman
Grundkiewicz
,
Barry
,
Matthias
Huck
,
Eric
Joanis
,
Tom
Kocmi
,
Philipp
Koehn
,
Chi-kiu
Lo
,
Nikola
Ljubešic
,
Christof
Monz
,
Makoto
Morishita
,
Masaaki
Nagata
,
Toshiaki
Nakazawa
,
Santanu
Pal
,
Matt
Post
, and
Marcos
Zampieri
.
2020
.
Findings of the 2020 Conference on Machine Translation (WMT20)
. In
Proceedings of the Fifth Conference on Machine Translation
, pages
1
55
.
Bastings
,
Jasmijn
,
Wilker
Aziz
, and
Ivan
Titov
.
2019
.
Interpretable neural predictions with differentiable binary variables
. In
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
, pages
2963
2977
.
Bawden
,
Rachel
,
Alexandra
Birch
,
Dobreva
,
Arturo
Oncevay
,
Antonio Valerio Miceli
Barone
, and
Philip
Williams
.
2020
.
The University of Edinburgh’s English-Tamil and English-Inuktitut submissions to the WMT20 news translation task
. In
Proceedings of the Fifth Conference on Machine Translation
, pages
92
99
.
Bawden
,
Rachel
,
Nikolay
Bogoychev
,
Ulrich
Germann
,
Roman
Grundkiewicz
,
Faheem
Kirefu
,
Antonio Valerio Miceli
Barone
, and
Alexandra
Birch
.
2019
.
The University of Edinburgh’s submissions to the WMT19 news translation task
. In
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)
, pages
103
115
.
Baziotis
,
Christos
,
Barry
, and
Alexandra
Birch
.
2020
.
Language model prior for low-resource neural machine translation
. In
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
, pages
7622
7634
.
Bei
,
Chao
,
Hao
Zong
,
Conghu
Yuan
,
Qingming
Liu
, and
Baoyong
Fan
.
2019
.
GTCOM neural machine translation systems for WMT19
. In
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)
, pages
116
121
.
Bengio
,
Samy
,
Oriol
Vinyals
,
Navdeep
Jaitly
, and
Noam
Shazeer
.
2015
.
Scheduled sampling for sequence prediction with recurrent neural networks
. In
Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1
,
NIPS’15
, pages
1171
1179
.
Bertoldi
,
Nicola
and
Marcello
Federico
.
2009
.
Domain adaptation for statistical machine translation with monolingual resources
. In
Proceedings of the Fourth Workshop on Statistical Machine Translation
, pages
182
189
.
Bhattacharyya
,
Sumanta
,
Rooshenas
,
Subhajit
,
Simeng
Sun
,
Mohit
Iyyer
, and
Andrew
McCallum
.
2021
.
Energy-based reranking: Improving neural machine translation using energy-based models
. In
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
, pages
4528
4537
.
Bhosale
,
Shruti
,
Kyra
Yee
,
Sergey
Edunov
, and
Michael
Auli
.
2020
.
Language models not just for pre-training: Fast online neural noisy channel modeling
. In
Proceedings of the Fifth Conference on Machine Translation
, pages
584
593
.
Birch
,
Alexandra
,
Barry
,
Antonio Valerio Miceli
Barone
,
Jindrich
Helcl
,
Jonas
Waldendorf
,
Felipe Sánchez
Martínez
,
Mikel
,
Víctor Sánchez
Cartagena
,
Juan Antonio
Pérez-Ortiz
,
Miquel
Esplà-Gomis
,
Wilker
Aziz
,
Lina
,
Sevi
Sariisik
,
Peggy van der
Kreeft
, and
Kay
Macquarrie
.
2021
.
Surprise language challenge: Developing a neural machine translation system between Pashto and English in two months
. In
Proceedings of the 18th Biennial Machine Translation Summit (Volume 1: Research Track)
, pages
92
102
.
Bojanowski
,
Piotr
,
Edouard
Grave
,
Armand
Joulin
, and
Tomas
Mikolov
.
2017
.
Enriching word vectors with subword information
.
Transactions of the Association for Computational Linguistics
,
5
:
135
146
.
Bojar
,
Ondřej
and
Aleš
Tamchyna
.
2011
.
Improving translation model by monolingual data
. In
Proceedings of the Sixth Workshop on Statistical Machine Translation
, pages
330
336
.
Bojar
,
Ondrej
,
Christian
Federmann
,
Mark
Fishel
,
Yvette
Graham
,
Barry
,
Matthias
Huck
,
Philipp
Koehn
, and
Christof
Monz
.
2018
.
Findings of the 2018 Conference on Machine Translation (WMT18)
. In
Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Task Papers
, pages
272
307
.
Bowman
,
Samuel R.
,
Luke
Vilnis
,
Oriol
Vinyals
,
Andrew
Dai
,
Rafal
Jozefowicz
, and
Samy
Bengio
.
2016
.
Generating sentences from a continuous space
. In
Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning
, pages
10
21
.
Briakou
,
Eleftheria
and
Marine
Carpuat
.
2019
.
The University of Maryland’s Kazakh-English neural machine translation system at WMT19
. In
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)
, pages
134
140
.
Brown
,
Peter F.
,
Stephen A. Della
Pietra
,
Vincent J. Della
Pietra
, and
Robert L.
Mercer
.
1993
.
The mathematics of statistical machine translation: Parameter estimation
.
Computational Linguistics
,
19
(
2
):
263
311
.
Brown
,
Tom
,
Benjamin
Mann
,
Nick
Ryder
,
Melanie
Subbiah
,
Jared D.
Kaplan
,
Prafulla
Dhariwal
,
Arvind
Neelakantan
,
Pranav
Shyam
,
Girish
Sastry
,
Amanda
,
Sandhini
Agarwal
,
Ariel
Herbert-Voss
,
Gretchen
Krueger
,
Tom
Henighan
,
Rewon
Child
,
Ramesh
,
Daniel
Ziegler
,
Jeffrey
Wu
,
Clemens
Winter
,
Chris
Hesse
,
Mark
Chen
,
Eric
Sigler
,
Mateusz
Litwin
,
Scott
Gray
,
Benjamin
Chess
,
Jack
Clark
,
Christopher
Berner
,
Sam
McCandlish
,
Alec
,
Ilya
Sutskever
, and
Dario
Amodei
.
2020
.
Language models are few-shot learners
. In
Advances in Neural Information Processing Systems
, volume
33
, pages
1877
1901
,
Curran Associates, Inc.
Buck
,
Christian
,
Kenneth
Heafield
, and
Bas
van Ooyen
.
2014
.
N-gram counts and language models from the Common Crawl
. In
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14)
, pages
3579
3584
.
Budiwati
,
Sari Dewi
,
Al
Hafiz Akbar Maulana Siagian
,
Tirana Noor
Fatyanosa
, and
Masayoshi
Aritsugi
.
2019
.
DBMS-KU interpolation for WMT19 news translation task
. In
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)
, pages
141
146
.
Burlot
,
Franck
,
Mercedes
García-Martínez
,
Loïc
Barrault
,
Fethi
Bougares
, and
François
Yvon
.
2017
.
Word representations in factored neural machine translation
. In
Proceedings of the Second Conference on Machine Translation
, pages
20
31
.
Callison-Burch
,
Chris
,
Cameron
Fordyce
,
Philipp
Koehn
,
Christof
Monz
, and
Josh
Schroeder
.
2007
.
(Meta-) evaluation of machine translation
. In
Proceedings of the Second Workshop on Statistical Machine Translation
, pages
136
158
.
Caswell
,
Isaac
,
Theresa
Breiner
,
Daan
van Esch
, and
Ankur
Bapna
.
2020
.
Language ID in the wild: Unexpected challenges on the path to a thousand-language web text corpus
. In
Proceedings of the 28th International Conference on Computational Linguistics
, pages
6588
6608
.
Caswell
,
Isaac
,
Ciprian
Chelba
, and
David
Grangier
.
2019
.
Tagged backtranslation
. In
Proceedings of the Fourth Conference on Machine Translation (Volume 1: Research Papers)
, pages
53
63
.
Cettolo
,
Mauro
,
Jan
Niehues
,
Sebastian
Stüker
,
Luisa
Bentivogli
, and
Marcello
Federico
.
2014
.
Report on the 11th IWSLT evaluation campaign, IWSLT 2014
. In
Proceedings of the 11th International Workshop on Spoken Language Translation
, pages
2
17
.
Chakravarthi
,
Bharathi Raja
,
Ruba
,
Anand
Kumar M.
,
Parameswari
Krishnamurthy
, and
Elizabeth
Sherly
, editors.
2021
.
Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages
.
Chen
,
Peng Jen
,
Ann
Lee
,
Changhan
Wang
,
Naman
Goyal
,
Angela
Fan
,
Mary
Williamson
, and
Jiatao
Gu
.
2020
.
. In
Proceedings of the Fifth Conference on Machine Translation
, pages
113
125
.
Cheng
,
Yong
,
Wei
Xu
,
Zhongjun
He
,
Wei
He
,
Hua
Wu
,
Maosong
Sun
, and
Yang
Liu
.
2016
.
Semi-supervised learning for neural machine translation
. In
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
, pages
1965
1974
.
Chi
,
Zewen
,
Li
Dong
,
Shuming
Ma
,
Shaohan
Huang
,
Xian-Ling
Mao
,
Heyan
Huang
, and
Furu
Wei
.
2021
.
mT6: Multilingual pre-trained text-to-text transformer with translation pairs
. In
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
, pages
1671
1683
.
Choshen
,
Leshem
,
Lior
Fox
,
Zohar
Aizenbud
, and
Omri
Abend
.
2020
.
On the weaknesses of reinforcement learning for neural machine translation
. In
Proceedings of the 8th International Conference on Learning Representations
.
Christodouloupoulos
,
Christos
and
Mark
Steedman
.
2015
.
A massively parallel corpus: The Bible in 100 languages
.
Language Resources and Evaluation
,
49
(
2
):
375
395
.
Chronopoulou
,
Alexandra
,
Dario
Stojanovski
, and
Alexander
Fraser
.
2021
.
Improving the lexical ability of pre-trained language models for unsupervised neural machine translation
. In
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
, pages
173
180
.
Chronopoulou
,
Alexandra
,
Dario
Stojanovski
,
Viktor
Hangya
, and
Alexander
Fraser
.
2020
.
The LMU Munich system for the WMT 2020 unsupervised machine translation shared task
. In
Proceedings of the Fifth Conference on Machine Translation
, pages
1084
1091
.
Clark
,
Stephen
and
James R.
Curran
.
2007
.
Wide-coverage efficient statistical parsing with CCG and log-linear models
.
Computational Linguistics
,
33
(
4
):
493
552
.
Conneau
,
Alexis
,
Kartikay
Khandelwal
,
Naman
Goyal
,
Vishrav
Chaudhary
,
Guillaume
Wenzek
,
Francisco
Guzmán
,
Edouard
Grave
,
Myle
Ott
,
Luke
Zettlemoyer
, and
Veselin
Stoyanov
.
2020
.
Unsupervised cross-lingual representation learning at scale
. In
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
, pages
8440
8451
.
Conneau
,
Alexis
and
Guillaume
Lample
.
2019
.
Cross-lingual language model pre-training
. In
Proceedings of the 33rd International Conference on Neural Information Processing Systems
.
Conneau
,
Alexis
,
Guillaume
Lample
,
Marc’aurelio
Ranzato
,
Ludovic
Denoyer
, and
Hervé
Jégou
.
2018
.
Word translation without parallel data
. In
Proceedings of the 6th International Conference on Learning Representations
.
Cover
,
Thomas M.
and
Joy A.
Thomas
.
2006
.
Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing)
.
Wiley-Interscience
,
USA
.
Currey
,
Anna
,
Antonio Valerio Miceli
Barone
, and
Kenneth
Heafield
.
2017
.
Copied monolingual data improves low-resource neural machine translation
. In
Proceedings of the Second Conference on Machine Translation
, pages
148
156
.
Dabre
,
Raj
,
Kehai
Chen
,
Benjamin
Marie
,
Rui
Wang
,
Atsushi
Fujita
,
Masao
Utiyama
, and
Eiichiro
Sumita
.
2019
.
NICT’s supervised neural machine translation systems for the WMT19 news translation task
. In
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)
, pages
168
174
.
Dabre, Raj, Chenhui Chu, and Anoop Kunchukuttan. 2020. A comprehensive survey of multilingual neural machine translation. CoRR, abs/2001.01115.
Dabre, Raj, Atsushi Fujita, and Chenhui Chu. 2019. Exploiting multilingualism through multistage fine-tuning for low-resource neural machine translation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1410–1416.
Dabre, Raj, Anoop Kunchukuttan, Atsushi Fujita, and Eiichiro Sumita. 2018. NICT's participation in WAT 2018: Approaches using multilingualism and recurrently stacked layers. In Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation: 5th Workshop on Asian Translation.
Dabre, Raj, Tetsuji Nakagawa, and Hideto Kazawa. 2017. An empirical study of language relatedness for transfer learning in neural machine translation. In Proceedings of the 31st Pacific Asia Conference on Language, Information and Computation, pages 282–286.
Dandapat, Sandipan and Christian Federmann. 2018. Iterative data augmentation for neural machine translation: A low resource case study for English-Telugu. In Proceedings of the 21st Annual Conference of the European Association for Machine Translation, pages 287–292.
Daumé, Hal and Daniel Marcu. 2005. Learning as search optimization: Approximate large margin methods for structured prediction. In Proceedings of the 22nd International Conference on Machine Learning, pages 169–176.
Denkowski, Michael and Alon Lavie. 2011. Meteor 1.3: Automatic metric for reliable optimization and evaluation of machine translation systems. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 85–91.
Denkowski, Michael and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 376–380.
Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.
Dhar, Prajit, Arianna Bisazza, and Gertjan van Noord. 2020. Linguistically motivated subwords for English-Tamil translation: University of Groningen's submission to WMT-2020. In Proceedings of the Fifth Conference on Machine Translation, pages 126–133.
Di Gangi, Mattia Antonino and Marcello Federico. 2017. Monolingual embeddings for low resourced neural machine translation. In Proceedings of the 14th International Workshop on Spoken Language Translation, pages 97–104.
Ding, Shuoyang, Adithya Renduchintala, and Kevin Duh. 2019. A call for prudent choice of subword merge operations in neural machine translation. In Proceedings of Machine Translation Summit XVII Volume 1: Research Track, pages 204–213.
Dinu, Georgiana, Prashant Mathur, Marcello Federico, and Yaser Al-Onaizan. 2019. Training neural machine translation to apply terminology constraints. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3063–3068.
Dong, Daxiang, Hua Wu, Wei He, Dianhai Yu, and Haifeng Wang. 2015. Multi-task learning for multiple language translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1723–1732.
Duan, Xiangyu, Baijun Ji, Hao Jia, Min Tan, Min Zhang, Boxing Chen, Weihua Luo, and Yue Zhang. 2020. Bilingual dictionary based neural machine translation without using parallel sentences. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1570–1579.
Dutta, Sourav, Jesujoba Alabi, Saptarashmi Bandyopadhyay, Dana Ruiter, and Josef van Genabith. 2020. UdS-DFKI@WMT20: Unsupervised MT and very low resource supervised MT for German-Upper Sorbian. In Proceedings of the Fifth Conference on Machine Translation, pages 1092–1098.
Edman, Lukas, Antonio Toral, and Gertjan van Noord. 2020. Low-resource unsupervised NMT: Diagnosing the problem and providing a linguistically motivated solution. In Proceedings of the 22nd Annual Conference of the European Association for Machine Translation, pages 81–90.
Edunov, Sergey, Myle Ott, Michael Auli, and David Grangier. 2018a. Understanding back-translation at scale. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 489–500.
Edunov, Sergey, Myle Ott, Michael Auli, David Grangier, and Marc'Aurelio Ranzato. 2018b. Classical structured prediction losses for sequence to sequence learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 355–364.
Eikema, Bryan and Wilker Aziz. 2019. Auto-encoding variational neural machine translation. In Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), pages 124–141.
Eikema, Bryan and Wilker Aziz. 2020. Is MAP decoding all you need? The inadequacy of the mode in neural machine translation. In Proceedings of the 28th International Conference on Computational Linguistics, pages 4506–4520.
Emezue, Chris Chinenye and Femi Pancrace Bonaventure Dossou. 2020. FFR v1.1: Fon-French neural machine translation. In Proceedings of the Fourth Widening Natural Language Processing Workshop, pages 83–87.
Ethayarajh, Kawin and Dan Jurafsky. 2020. Utility is in the eye of the user: A critique of NLP leaderboards. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4846–4853.
Ezeani, Ignatius, Paul Rayson, Ikechukwu E. Onyenwe, Chinedu Uchechukwu, and Mark Hepple. 2020. Igbo-English machine translation: An evaluation benchmark. CoRR, abs/2004.00648.
Fadaee, Marzieh, Arianna Bisazza, and Christof Monz. 2017. Data augmentation for low-resource neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 567–573.
Fan, Angela, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, and Armand Joulin. 2021. Beyond English-centric multilingual machine translation. Journal of Machine Learning Research, 22:1–48.
Feng, Fangxiaoyu, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, and Wei Wang. 2020. Language-agnostic BERT sentence embedding. CoRR, abs/2007.01852.
Feng, Yang, Shiyue Zhang, Andi Zhang, Dong Wang, and Andrew Abel. 2017. Memory-augmented neural machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1390–1399.
Finn, Chelsea, Pieter Abbeel, and Sergey Levine. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1126–1135.
Firat, Orhan, Kyunghyun Cho, and Yoshua Bengio. 2016. Multi-way, multilingual neural machine translation with a shared attention mechanism. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 866–875.
Fomicheva, Marina and Lucia Specia. 2019. Taking MT evaluation metrics to extremes: Beyond correlation with human judgments. Computational Linguistics, 45(3):515–558.
Forcada, Mikel L., Mireia Ginestí-Rosell, Jacob Nordfalk, Jim O'Regan, Sergio Ortiz-Rojas, Juan Antonio Pérez-Ortiz, Felipe Sánchez-Martínez, Gema Ramírez-Sánchez, and Francis M. Tyers. 2011. Apertium: A free/open-source platform for rule-based machine translation. Machine Translation, 25(2):127–144.
Forcada, Mikel L. and Francis M. Tyers. 2016. Apertium: A free/open source platform for machine translation and basic language technology. In Proceedings of the 19th Annual Conference of the European Association for Machine Translation: Projects/Products, pages 127–144.
Fraser, Alexander. 2020. Findings of the WMT 2020 shared tasks in unsupervised MT and very low resource supervised MT. In Proceedings of the Fifth Conference on Machine Translation, pages 765–771.
Freitag, Markus and Orhan Firat. 2020. Complete multilingual neural machine translation. In Proceedings of the Fifth Conference on Machine Translation, pages 550–560.
Freitag, Markus, George Foster, David Grangier, Viresh Ratnakar, Qijun Tan, and Wolfgang Macherey. 2021a. Experts, errors, and context: A large-scale study of human evaluation for machine translation. Transactions of the Association for Computational Linguistics, 9:1460–1474.
Freitag, Markus, Ricardo Rei, Nitika Mathur, Chi-kiu Lo, Craig Stewart, George Foster, Alon Lavie, and Ondřej Bojar. 2021b. Results of the WMT21 metrics shared task: Evaluating metrics with expert-based human evaluations on TED and news domain. In Proceedings of the Sixth Conference on Machine Translation, pages 733–774.
Garcia, Xavier, Aditya Siddhant, Orhan Firat, and Ankur Parikh. 2021. Harnessing multilinguality in unsupervised machine translation for rare languages. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1126–1137.
Garcia-Martinez, Mercedes, Loïc Barrault, and Fethi Bougares. 2016. Factored neural machine translation architectures. In Proceedings of the 13th International Workshop on Spoken Language Translation, 8 pages.
Geirhos, Robert, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A. Wichmann. 2020. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2:665–673.
Gibadullin, Ilshat, Aidar Valeev, Albina Khusainova, and Adil Khan. 2019. A survey of methods to leverage monolingual data in low-resource neural machine translation. CoRR, abs/1910.00373.
Goel, Vaibhava and William J. Byrne. 2000. Minimum Bayes-risk automatic speech recognition. Computer Speech and Language, 14(2):115–135.
Goldwater, Sharon and David McClosky. 2005. Improving statistical MT through morphological analysis. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 676–683.
Goodfellow, Ian J., Mehdi Mirza, Da Xiao, Aaron Courville, and Yoshua Bengio. 2014. An empirical investigation of catastrophic forgetting in gradient-based neural networks. In Proceedings of the 2014 International Conference on Learning Representations.
Goyal, Naman, Cynthia Gao, Vishrav Chaudhary, Peng-Jen Chen, Guillaume Wenzek, Da Ju, Sanjana Krishnan, Marc'Aurelio Ranzato, Francisco Guzmán, and Angela Fan. 2021. The FLORES-101 evaluation benchmark for low-resource and multilingual machine translation. CoRR, abs/2106.03193.
Goyal, Vikrant, Sourav Kumar, and Dipti Misra Sharma. 2020. Efficient neural machine translation for low-resource languages via exploiting related languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pages 162–168.
Goyal, Vikrant, Anoop Kunchukuttan, Rahul Kejriwal, Siddharth Jain, and Amit Bhagwat. 2020. Contact relatedness can help improve multilingual NMT: Microsoft STCI-MT @ WMT20. In Proceedings of the Fifth Conference on Machine Translation, pages 202–206.
Goyal, Vikrant and Dipti Misra Sharma. 2019. The IIIT-H Gujarati-English machine translation system for WMT19. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 191–195.
Grönroos, Stig-Arne, Sami Virpioja, Peter Smit, and Mikko Kurimo. 2014. Morfessor FlatCat: An HMM-based method for unsupervised and semi-supervised learning of morphology. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 1177–1185.
Gu, Jiatao, Yong Wang, Yun Chen, Victor O. K. Li, and Kyunghyun Cho. 2018. Meta-learning for low-resource neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3622–3631.
Gülçehre, Çaglar, Orhan Firat, Kelvin Xu, Kyunghyun Cho, Loïc Barrault, Huei-Chi Lin, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2015. On using monolingual corpora in neural machine translation. CoRR, abs/1503.03535.
Guzmán, Francisco, Peng-Jen Chen, Myle Ott, Juan Pino, Guillaume Lample, Philipp Koehn, Vishrav Chaudhary, and Marc'Aurelio Ranzato. 2019. The FLORES evaluation datasets for low-resource machine translation: Nepali–English and Sinhala–English. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6098–6111.
Ha, Thanh-Le, Jan Niehues, and Alexander Waibel. 2016. Toward multilingual neural machine translation with universal encoder and decoder. In Proceedings of the 13th International Workshop on Spoken Language Translation.
Habash, Nizar, Houda Bouamor, Hazem Hajj, Walid Magdy, Wajdi Zaghouani, Fethi Bougares, Nadi Tomeh, Ibrahim Abu Farha, and Samia Touileb, editors. 2021. Proceedings of the Sixth Arabic Natural Language Processing Workshop.
Haddow, Barry and Faheem Kirefu. 2020. PMIndia - A collection of parallel corpora of languages of India. CoRR, abs/2001.09907.
Hasan, Tahmid, Abhik Bhattacharjee, Kazi Samin, Masum Hasan, Madhusudan Basak, M. Sohel Rahman, and Rifat Shahriyar. 2020. Not low-resource anymore: Aligner ensembling, batch filtering, and new datasets for Bengali-English machine translation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2612–2623.
Hassan, Hany, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark, Christian Federmann, Xuedong Huang, Marcin Junczys-Dowmunt, William Lewis, Mu Li, Shujie Liu, Tie-Yan Liu, Renqian Luo, Arul Menezes, Tao Qin, Frank Seide, Xu Tan, Fei Tian, Lijun Wu, Shuangzhi Wu, Yingce Xia, Dongdong Zhang, Zhirui Zhang, and Ming Zhou. 2018. Achieving human parity on automatic Chinese to English news translation. CoRR, abs/1803.05567.
He, Di, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tie-Yan Liu, and Wei-Ying Ma. 2016. Dual learning for machine translation. In Advances in Neural Information Processing Systems, volume 29, Curran Associates, Inc.
He, Junxian, Jiatao Gu, Jiajun Shen, and Marc'Aurelio Ranzato. 2019. Revisiting self-training for neural sequence generation. In Proceedings of the 7th International Conference on Learning Representations.
He, Xuanli, Gholamreza Haffari, and Mohammad Norouzi. 2020. Dynamic programming encoding for subword segmentation in neural machine translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3042–3051.
Hernandez, François and Vincent Nguyen. 2020. The Ubiqus English-Inuktitut system for WMT20. In Proceedings of the Fifth Conference on Machine Translation, pages 213–217.
Hieber, Felix, Tobias Domhan, Michael Denkowski, and David Vilar. 2020. Sockeye 2: A toolkit for neural machine translation. In Proceedings of the 22nd Annual Conference of the European Association for Machine Translation, pages 457–458.
Hoang, Vu Cong Duy, Philipp Koehn, Gholamreza Haffari, and Trevor Cohn. 2018. Iterative back-translation for neural machine translation. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, pages 18–24.
Hokamp, Chris, John Glover, and Demian Gholipour Ghalandari. 2019. Evaluating the supervised and zero-shot performance of multi-lingual translation models. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 209–217.
Huck, Matthias, Simon Riess, and Alexander Fraser. 2017. Target-side word segmentation strategies for neural machine translation. In Proceedings of the Second Conference on Machine Translation, pages 56–67.
Hupkes, Dieuwke, Verna Dankers, Mathijs Mul, and Elia Bruni. 2019. The compositionality of neural networks: Integrating symbolism and connectionism. CoRR, abs/1908.08351.
Huszar, Ferenc. 2015. How (not) to train your generative model: Scheduled sampling, likelihood, adversary? CoRR, abs/1511.05101.
Jean, Sébastien, Orhan Firat, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2015. Montreal neural machine translation systems for WMT'15. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 134–140.
Jha, Girish Nath, Kalika Bali, Sobha L., S. S. Agrawal, and Atul Kr. Ojha, editors. 2020. Proceedings of the WILDRE5 – 5th Workshop on Indian Language Data: Resources and Evaluation.
Johnson, Melvin, Mike Schuster, Quoc Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, et al. 2017. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351.
Junczys-Dowmunt, Marcin. 2018. Dual conditional cross-entropy filtering of noisy parallel corpora. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 888–895.
Junczys-Dowmunt, Marcin, Roman Grundkiewicz, Tomasz Dwojak, Hieu Hoang, Kenneth Heafield, Tom Neckermann, Frank Seide, Ulrich Germann, Alham Fikri Aji, Nikolay Bogoychev, André F. T. Martins, and Alexandra Birch. 2018a. Marian: Fast neural machine translation in C++. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics-System Demonstrations, pages 116–121.
Junczys-Dowmunt, Marcin, Roman Grundkiewicz, Shubha Guha, and Kenneth Heafield. 2018b. Approaching neural grammatical error correction as a low-resource machine translation task. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 595–606.
Karakanta, Alina, Atul Kr. Ojha, Chao-Hong Liu, Jonathan Washington, Nathaniel Oco, Surafel Melaku Lakew, Valentin Malykh, and Xiaobing Zhao, editors. 2019. Proceedings of the 2nd Workshop on Technologies for MT of Low Resource Languages.
Khanna, Tanmai, Jonathan N. Washington, Francis M. Tyers, Sevilay Bayatlı, Daniel G. Swanson, Tommi A. Pirinen, Irene Tang, and Hèctor Alòs i Font. 2021. Recent advances in Apertium, a free/open-source rule-based machine translation platform for low-resource languages. Machine Translation, pages 1–28.
Kim, Yoon and Alexander M. Rush. 2016. Sequence-level knowledge distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1317–1327.
Kim, Yunsu, Yingbo Gao, and Hermann Ney. 2019. Effective cross-lingual transfer of neural machine translation models without shared vocabularies. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1246–1257.
Kim, Yunsu, Miguel Graça, and Hermann Ney. 2020. When and why is unsupervised neural machine translation useless? In Proceedings of the 22nd Annual Conference of the European Association for Machine Translation, pages 35–44.
Kim, Yunsu, Petre Petrov, Pavel Petrushkov, Shahram Khadivi, and Hermann Ney. 2019. Pivot-based transfer learning for neural machine translation between Non-English languages. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 866–876.
Kingma, Diederik P. and Max Welling. 2014. Auto-encoding variational Bayes. In Proceedings of the 2nd International Conference on Learning Representations.
Klein, Guillaume, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics-System Demonstrations, pages 67–72.
Knowles, Rebecca, Samuel Larkin, Darlene Stewart, and Patrick Littell. 2020a. NRC systems for low resource German-Upper Sorbian machine translation 2020: Transfer learning with lexical modifications. In Proceedings of the Fifth Conference on Machine Translation, pages 1112–1122.
Knowles, Rebecca, Darlene Stewart, Samuel Larkin, and Patrick Littell. 2020b. NRC systems for the 2020 Inuktitut-English news translation task. In Proceedings of the Fifth Conference on Machine Translation, pages 156–170.
Ko, Wei-Jen, Ahmed El-Kishky, Adithya Renduchintala, Vishrav Chaudhary, Naman Goyal, Francisco Guzmán, Pascale Fung, Philipp Koehn, and Mona Diab. 2021. Adapting high-resource NMT models to translate low-resource related languages without parallel data. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 802–812.
Kocmi, Tom. 2020. CUNI submission for the Inuktitut language in WMT news 2020. In Proceedings of the Fifth Conference on Machine Translation, pages 171–174.
Kocmi, Tom and Ondřej Bojar. 2018. Trivial transfer learning for low-resource neural machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 244–252.
Kocmi, Tom and Ondřej Bojar. 2019. CUNI submission for low-resource languages in WMT news 2019. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 234–240.
Kocmi, Tom, Christian Federmann, Roman Grundkiewicz, Marcin Junczys-Dowmunt, Hitokazu Matsushita, and Arul Menezes. 2021. To ship or not to ship: An extensive evaluation of automatic metrics for machine translation. In Proceedings of the Sixth Conference on Machine Translation.
Koehn, Philipp. 2020. Neural Machine Translation. Cambridge University Press.
Koehn, Philipp, Vishrav Chaudhary, Ahmed El-Kishky, Naman Goyal, Peng-Jen Chen, and Francisco Guzmán. 2020. Findings of the WMT 2020 shared task on parallel corpus filtering and alignment. In Proceedings of the Fifth Conference on Machine Translation, pages 724–740.
Koehn, Philipp, Francisco Guzmán, Vishrav Chaudhary, and Juan Pino. 2019. Findings of the WMT 2019 shared task on parallel corpus filtering for low-resource conditions. In Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2), pages 54–72.
Koehn, Philipp and Hieu Hoang. 2007. Factored translation models. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 868–876.
Koehn, Philipp, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, pages 177–180.
Koehn, Philipp, Huda Khayrallah, Kenneth Heafield, and Mikel L. Forcada. 2018. Findings of the WMT 2018 shared task on parallel corpus filtering. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 726–739.
Koehn, Philipp and Rebecca Knowles. 2017. Six challenges for neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation, pages 28–39.
Koehn, Philipp and Christof Monz. 2006. Manual and automatic evaluation of machine translation between European languages. In Proceedings on the Workshop on Statistical Machine Translation, pages 102–121.
Koehn, Philipp, Franz J. Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pages 127–133.
Kreutzer, Julia, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov, Claytone Sikasote, Monang Setyawan, Supheakmungkol Sarin, Sokhar Samb, Benoît Sagot, Clara Rivera, Annette Rios, Isabel Papadimitriou, Salomey Osei, Pedro Ortiz Suarez, Iroro Orife, Kelechi Ogueji, Andre Niyongabo Rubungo, Toan Q. Nguyen, Mathias Müller, André Müller, Shamsuddeen Hassan Muhammad, Nanda Muhammad, Ayanda Mnyakeni, Jamshidbek Mirzakhalov, Tapiwanashe Matangira, Colin Leong, Nze Lawson, Sneha Kudugunta, Yacine Jernite, Mathias Jenny, Orhan Firat, Bonaventure F. P. Dossou, Sakhile Dlamini, Nisansa de Silva, Sakine Çabuk Ballı, Stella Biderman, Alessia Battisti, Ahmed Baruwa, Ankur Bapna, Pallavi Baljekar, Israel Abebe Azime, Ayodele Awokoya, Duygu Ataman, Orevaoghene Ahia, Oghenefego Ahia, Sweta Agrawal, and Mofetoluwa Adeyemi. 2022. Quality at a glance: An audit of Web-crawled multilingual datasets. Transactions of the Association for Computational Linguistics, 10:50–72.
Kudo, Taku. 2018. Subword regularization: Improving neural network translation models with multiple subword candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 66–75.
Kudo, Taku and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71.
Kumar, Sachin, Antonios Anastasopoulos, Shuly Wintner, and Yulia Tsvetkov. 2021. Machine translation into low-resource language varieties. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 110–121.
Kumaraswamy, P. 1980. A generalized probability density function for double-bounded random processes. Journal of Hydrology, 46(1):79–88.
Kvapilíková, Ivana, Tom Kocmi, and Ondrej Bojar. 2020. CUNI systems for the unsupervised and very low resource translation task in WMT20. In Proceedings of the Fifth Conference on Machine Translation, pages 1123–1128.
Lake, Brenden M. 2019. Compositional generalization through meta sequence-to-sequence learning. In Advances in Neural Information Processing Systems, volume 32, pages 9791–9801, Curran Associates, Inc.
Lakew, Surafel Melaku, Aliia Erofeeva, and Marcello Federico. 2018. Neural machine translation into language varieties. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 156–164.
Lample, Guillaume, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2018a. Unsupervised machine translation using monolingual corpora only. In Proceedings of the 6th International Conference on Learning Representations.
Lample, Guillaume, Myle Ott, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2018b. Phrase-based & neural unsupervised machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5039–5049.
Lapuschkin, Sebastian, Stephan Wäldchen, Alexander Binder, Grégoire Montavon, Wojciech Samek, and Klaus-Robert Müller. 2019. Unmasking clever Hans predictors and assessing what machines really learn. Nature Communications, 10(1):1–8.
Laskar, Sahinur Rahman, Abdullah Faiz Ur Rahman Khilji, Partha Pakray, and Sivaji Bandyopadhyay. 2020. Hindi-Marathi cross lingual model. In Proceedings of the Fifth Conference on Machine Translation, pages 396–401.
Läubli, Samuel, Rico Sennrich, and Martin Volk. 2018. Has machine translation achieved human parity? A case for document-level evaluation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4791–4796.
LeCun, Yann, Sumit Chopra, Raia Hadsell, M. Ranzato, and F. Huang. 2006. A tutorial on energy-based learning. Predicting Structured Data, 1(0):0–59.
Lepikhin, Dmitry, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. 2020. GShard: Scaling giant models with conditional computation and automatic sharding. In International Conference on Learning Representations, Online.
Li, Bei, Yinqiao Li, Chen Xu, Ye Lin, Jiqiang Liu, Hui Liu, Ziyang Wang, Yuhao Zhang, Nuo Xu, Zeyang Wang, Kai Feng, Hexuan Chen, Tengbo Liu, Yanyang Li, Qiang Wang, Tong Xiao, and Jingbo Zhu. 2019. The NiuTrans machine translation systems for WMT19. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 257–266.
Li, Zuchao, Hai Zhao, Rui Wang, Kehai Chen, Masao Utiyama, and Eiichiro Sumita. 2020. SJTU-NICT's supervised and unsupervised neural machine translation systems for the WMT20 news translation task. In Proceedings of the Fifth Conference on Machine Translation, pages 218–229.
Libovický, Jindrich. 2021. Jindrich's blog – Machine translation weekly 86: The wisdom of the WMT crowd. https://jlibovicky.github.io/2021/07/24/MT-Weekly-The-Wisdom-of-the-WMT-Crowd.
Libovický, Jindrich and Alexander Fraser. 2021. Findings of the WMT 2021 shared tasks in unsupervised MT and very low resource supervised MT. In Proceedings of the Sixth Conference on Machine Translation, pages 726–732.
Libovický, Jindrich, Viktor Hangya, Helmut Schmid, and Alexander Fraser. 2020. The LMU Munich system for the WMT20 very low resource supervised MT task. In Proceedings of the Fifth Conference on Machine Translation, pages 1104–1111.
Lignos, Constantine. 2010. Learning from unseen data. In Proceedings of the Morpho Challenge 2010 Workshop, pages 35–38.
Lin, Yu-Hsiang, Chian-Yu Chen, Jean Lee, Zirui Li, Yuyan Zhang, Mengzhou Xia, Shruti Rijhwani, Junxian He, Zhisong Zhang, Xuezhe Ma, Antonios Anastasopoulos, Patrick Littell, and Graham Neubig. 2019. Choosing transfer languages for cross-lingual learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3125–3135.
Lin, Zehui, Xiao Pan, Mingxuan Wang, Xipeng Qiu, Jiangtao Feng, Hao Zhou, and Lei Li. 2020. Pre-training multilingual neural machine translation by leveraging alignment information. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2649–2663.
Liu, Yinhan, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics, 8:726–742.
Lo, Chi-kiu. 2019. YiSi - a unified semantic MT quality evaluation and estimation metric for languages with different levels of available resources. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 507–513.
Louizos, Christos, Max Welling, and Diederik P. Kingma. 2018. Learning sparse neural networks through l0 regularization. In Proceedings of the 6th International Conference on Learning Representations.
Luong, Minh-Thang, Quoc V. Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. 2016. Multi-task sequence to sequence learning. In Proceedings of the 4th International Conference on Learning Representations.
Ma, Qingsong, Johnny Wei, Ondřej Bojar, and Yvette Graham. 2019. Results of the WMT19 metrics shared task: Segment-level and strong MT systems pose big challenges. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 62–90.
Ma, Shuming, Li Dong, Shaohan Huang, Dongdong Zhang, Alexandre Muzio, Saksham Singhal, Hany Hassan, Xia Song, and Furu Wei. 2021. DeltaLM: Encoder-decoder pre-training for language generation and translation by augmenting pretrained multilingual encoders. arXiv preprint arXiv:2106.13736.
Mager, Manuel, Arturo Oncevay, Abteen Ebrahimi, John Ortega, Annette Rios, Angela Fan, Ximena Gutierrez-Vasques, Luis Chiruzzo, Gustavo Giménez-Lugo, Ricardo Ramos, Ivan Vladimir Meza Ruiz, Rolando Coto-Solano, Alexis Palmer, Elisabeth Mager-Hois, Vishrav Chaudhary, Graham Neubig, Ngoc Thang Vu, and Katharina Kann. 2021. Findings of the AmericasNLP 2021 shared task on open machine translation for indigenous languages of the Americas. In Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas, pages 202–217.
Marchisio, Kelly, Kevin Duh, and Philipp Koehn. 2020. When does unsupervised machine translation work? In Proceedings of the Fifth Conference on Machine Translation, pages 569–581.
Martindale, Marianna, Marine Carpuat, Kevin Duh, and Paul McNamee. 2019. Identifying fluently inadequate output in neural and statistical machine translation. In Proceedings of Machine Translation Summit XVII: Research Track, pages 233–243.
Mayer, Thomas and Michael Cysouw. 2014. Creating a massively parallel Bible corpus. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 3158–3163.
Miceli Barone, Antonio Valerio, Jindřich Helcl, Rico Sennrich, Barry Haddow, and Alexandra Birch. 2017. Deep architectures for neural machine translation. In Proceedings of the Second Conference on Machine Translation, pages 99–107.
Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In Proceedings of International Conference on Learning Representations.
Moore, Robert C. and William Lewis. 2010. Intelligent selection of language model training data. In Proceedings of the ACL 2010 Conference Short Papers, pages 220–224.

Mueller, Aaron, Garrett Nicolai, Arya D. McCarthy, Dylan Lewis, Winston Wu, and David Yarowsky. 2020. An analysis of massively multilingual neural machine translation for low-resource languages. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 3710–3718.
Mukiibi, Jonathan, Claire Babirye, and Joyce Nakatumba-Nabende. 2021. An English-Luganda parallel corpus. https://doi.org/10.5281/zenodo.4764039
Muller, Benjamin, Antonios Anastasopoulos, Benoît Sagot, and Djamé Seddah. 2021. When being unseen from mBERT is just the beginning: Handling new languages with multilingual language models. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 448–462.

Müller, Mathias, Annette Rios, and Rico Sennrich. 2020. Domain robustness in neural machine translation. In Proceedings of the 14th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track), pages 151–164.

Murthy, Rudra, Anoop Kunchukuttan, and Pushpak Bhattacharyya. 2019. Addressing word-order divergence in multilingual neural machine translation for extremely low resource languages. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3868–3873.
Nădejde, Maria, Siva Reddy, Rico Sennrich, Tomasz Dwojak, Marcin Junczys-Dowmunt, Philipp Koehn, and Alexandra Birch. 2017. Predicting target language CCG supertags improves neural machine translation. In Proceedings of the Second Conference on Machine Translation, pages 68–79.
Nakazawa, Toshiaki, Nobushige Doi, Shohei Higashiyama, Chenchen Ding, Raj Dabre, Hideya Mino, Isao Goto, Win Pa Pa, Anoop Kunchukuttan, Yusuke Oda, Shantipriya Parida, Ondřej Bojar, and Sadao Kurohashi. 2019. Overview of the 6th workshop on Asian translation. In Proceedings of the 6th Workshop on Asian Translation, pages 1–35.
Nakazawa, Toshiaki, Hideki Nakayama, Chenchen Ding, Raj Dabre, Shohei Higashiyama, Hideya Mino, Isao Goto, Win Pa Pa, Anoop Kunchukuttan, Shantipriya Parida, Ondřej Bojar, Chenhui Chu, Akiko Eriguchi, Kaori Abe, Yusuke Oda, and Sadao Kurohashi. 2021. Overview of the 8th workshop on Asian translation. In Proceedings of the 8th Workshop on Asian Translation (WAT2021), pages 1–45.
Nakazawa, Toshiaki, Hideki Nakayama, Chenchen Ding, Raj Dabre, Shohei Higashiyama, Hideya Mino, Isao Goto, Win Pa Pa, Anoop Kunchukuttan, Shantipriya Parida, Ondřej Bojar, and Sadao Kurohashi. 2020. Overview of the 7th workshop on Asian translation. In Proceedings of the 7th Workshop on Asian Translation, pages 1–44.
Nakazawa, Toshiaki, Katsuhito Sudoh, Shohei Higashiyama, Chenchen Ding, Raj Dabre, Hideya Mino, Isao Goto, Win Pa Pa, Anoop Kunchukuttan, and Sadao Kurohashi. 2018. Overview of the 5th workshop on Asian translation. In Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation: 5th Workshop on Asian Translation, pages 1–41.
Neishi, Masato, Jin Sakuma, Satoshi Tohda, Shonosuke Ishiwatari, Naoki Yoshinaga, and Masashi Toyoda. 2017. A bag of useful tricks for practical neural machine translation: Embedding layer initialization and large batch size. In Proceedings of the 4th Workshop on Asian Translation (WAT2017), pages 99–109.
Nekoto, Wilhelmina, Vukosi Marivate, Tshinondiwa Matsila, Timi Fasubaa, Taiwo Fagbohungbe, Solomon Oluwole Akinola, Shamsuddeen Hassan Muhammad, Salomon Kabongo Kabenamualu, Salomey Osei, Freshia Sackey, Rubungo Andre Niyongabo, Ricky Macharm, Perez Ogayo, Orevaoghene Ahia, Musie Meressa Berhe, Mofetoluwa Adeyemi, Masabata Mokgesi-Selinga, Lawrence Okegbemi, Laura Martinus, Kolawole Tajudeen, Kevin Degila, Kelechi Ogueji, Kathleen Siminyu, Julia Kreutzer, Jason Webster, Jamiil Toure Ali, Jade Abbott, Iroro Orife, Ignatius Ezeani, Dangana, Herman Kamper, Hady Elsahar, Goodness Duru, Ghollah Kioko, Murhabazi Espoir, Elan van Biljon, Daniel Whitenack, Christopher Onyefuluchi, Chris Chinenye Emezue, Bonaventure F. P. Dossou, Blessing Sibanda, Blessing Bassey, Ayodele Olabiyi, Arshath Ramkilowan, Alp Öktem, Adewale Akinfaderin, and Abdallah Bashir. 2020. Participatory research for low-resourced machine translation: A case study in African languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2144–2160.
Neubig, Graham and Junjie Hu. 2018. Rapid adaptation of neural machine translation to new languages. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 875–880.

Nguyen, Toan Q. and David Chiang. 2017. Transfer learning across low-resource, related languages for neural machine translation. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 296–301.
Niehues, Jan, Ronaldo Cattoni, Sebastian Stüker, Mauro Cettolo, Marco Turchi, and Marcello Federico. 2018. The IWSLT 2018 evaluation campaign. In Proceedings of the 15th International Workshop on Spoken Language Translation, pages 2–5.
Niehues, Jan and Eunah Cho. 2017. Exploiting linguistic resources for neural machine translation using multi-task learning. In Proceedings of the Second Conference on Machine Translation, pages 80–89.

Niu, Xing, Weijia Xu, and Marine Carpuat. 2019. Bi-directional differentiable input reconstruction for low-resource neural machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 442–448.

Och, Franz Josef. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 160–167.
Oflazer, Kemal and İlknur Durgar El-Kahlout. 2007. Exploring different representational units in English-to-Turkish statistical machine translation. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 25–32.
Ojha, Atul Kr., Valentin Malykh, Alina Karakanta, and Chao-Hong Liu. 2020. Findings of the LoResMT 2020 shared task on zero-shot for low-resource languages. In Proceedings of the 3rd Workshop on Technologies for MT of Low Resource Languages, pages 33–37.
Orife, Iroro Fred Onome, Julia Kreutzer, Blessing Sibanda, Daniel Whitenack, Kathleen Siminyu, Laura Martinus, Jamiil Toure Ali, Jade Abbott, Vukosi Marivate, Salomon Kabongo, Musie Meressa, Espoir Murhabazi, Orevaoghene Ahia, Elan van Biljon, Arshath Ramkilowan, Adewale Akinfaderin, Alp Öktem, Wole Akin, Ghollah Kioko, Kevin Degila, Herman Kamper, Bonaventure Dossou, Chris Emezue, Kelechi Ogueji, and Abdallah Bashir. 2020. Masakhane – machine translation for Africa. In AfricaNLP Workshop, International Conference on Learning Representations (ICLR).
Ortega, John, Atul Kr. Ojha, Katharina Kann, and Chao-Hong Liu, editors. 2021. Proceedings of the 4th Workshop on Technologies for MT of Low Resource Languages (LoResMT2021).

Ortega, John E., Richard Castro Mamani, and Kyunghyun Cho. 2020. Neural machine translation with a polysynthetic low resource language. Machine Translation, 34(4):325–346.

Ortiz Suárez, Pedro Javier, Benoît Sagot, and Laurent Romary. 2019. Asynchronous pipeline for processing huge corpora on medium to low resource infrastructures. In Proceedings of the 7th Workshop on the Challenges in the Management of Large Corpora (CMLC-7), pages 9–16.
Ott, Myle, Michael Auli, David Grangier, and Marc’Aurelio Ranzato. 2018. Analyzing uncertainty in neural machine translation. In International Conference on Machine Learning.

Ott, Myle, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. FAIRSEQ: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53.

Pan, Xiao, Mingxuan Wang, Liwei Wu, and Lei Li. 2021. Contrastive learning for many-to-many multilingual neural machine translation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 244–258.
Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.

Pavlick, Ellie, Matt Post, Ann Irvine, Dmitry Kachaev, and Chris Callison-Burch. 2014. The language demographics of Amazon Mechanical Turk. Transactions of the Association for Computational Linguistics, 2:79–92.
Pennington, Jeffrey, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Peters, Matthew, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237.
Philip, Jerin, Shashank Siripragada, Vinay P. Namboodiri, and C. V. Jawahar. 2021. Revisiting low resource status of Indian languages in machine translation. In Proceedings of the 3rd ACM India Joint International Conference on Data Science & Management of Data, pages 178–187.
Platanios, Emmanouil Antonios, Mrinmaya Sachan, Graham Neubig, and Tom Mitchell. 2018. Contextual parameter generation for universal neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 425–435.
Popel, Martin, Marketa Tomkova, Jakub Tomek, Łukasz Kaiser, Jakob Uszkoreit, Ondřej Bojar, and Zdeněk Žabokrtský. 2020. Transforming machine translation: A deep learning system reaches news translation quality comparable to human professionals. Nature Communications, 11:1–15.
Popović, Maja. 2015. chrF: Character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 392–395.

Post, Matt. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191.

Post, Matt, Chris Callison-Burch, and Miles Osborne. 2012. Constructing parallel corpora for six Indian languages via crowdsourcing. In Proceedings of the Seventh Workshop on Statistical Machine Translation, pages 401–409.

Provilkov, Ivan, Dmitrii Emelianenko, and Elena Voita. 2020. BPE-dropout: Simple and effective subword regularization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1882–1892.
Qi, Ye, Devendra Sachan, Matthieu Felix, Sarguna Padmanabhan, and Graham Neubig. 2018. When and why are pre-trained word embeddings useful for neural machine translation? In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 529–535.
Radford, Alec, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. Technical report, OpenAI.
Raffel, Colin, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.
Ramachandran, Prajit, Peter Liu, and Quoc Le. 2017. Unsupervised pretraining for sequence to sequence learning. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 383–391.
Ramesh, Gowtham, Sumanth Doddapaneni, Aravinth Bheemaraj, Mayank Jobanputra, Raghavan AK, Ajitesh Sharma, Sujit Sahoo, Harshita Diddee, Mahalakshmi J, Divyanshu Kakwani, Navneet Kumar, Aswin Pradeep, Srihari Nagaraj, Kumar Deepak, Vivek Raghavan, Anoop Kunchukuttan, Pratyush Kumar, and Mitesh Khapra. 2022. Samanantar: The largest publicly available parallel corpora collection for 11 Indic languages. Transactions of the Association for Computational Linguistics, 10:145–162.
Ranathunga, Surangika, En-Shiun Annie Lee, Marjana Prifti Skenduli, Ravi Shekhar, Mehreen Alam, and Rishemjit Kaur. 2021. Neural machine translation for low-resource languages: A survey. CoRR, abs/2106.15115.

Ranzato, Marc’Aurelio, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2016. Sequence level training with recurrent neural networks. In Proceedings of the 4th International Conference on Learning Representations, San Juan, Puerto Rico.

Raunak, Vikas, Arul Menezes, and Marcin Junczys-Dowmunt. 2021. The curious case of hallucinations in neural machine translation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1172–1183.
Rei, Ricardo, Craig Stewart, Ana C. Farinha, and Alon Lavie. 2020. COMET: A neural framework for MT evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2685–2702.

Rezende, Danilo and Shakir Mohamed. 2015. Variational inference with normalizing flows. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1530–1538.

Rezende, Danilo Jimenez, Shakir Mohamed, and Daan Wierstra. 2014. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 1278–1286.
Rios, Annette, Mathias Müller, and Rico Sennrich. 2020. Subword segmentation and a single bridge language affect zero-shot neural machine translation. In Proceedings of the Fifth Conference on Machine Translation, pages 528–537.

Saleva, Jonne and Constantine Lignos. 2021. The effectiveness of morphology-aware segmentation in low-resource neural machine translation. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop, pages 164–174.

Sánchez-Cartagena, Víctor M., Marta Bañón, Sergio Ortiz-Rojas, and Gema Ramírez. 2018. Prompsit’s submission to WMT 2018 parallel corpus filtering shared task. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 955–962.
Sánchez-Cartagena, Víctor M., Mikel L. Forcada, and Felipe Sánchez-Martínez. 2020. A multi-source approach for Breton–French hybrid machine translation. In Proceedings of the 22nd Annual Conference of the European Association for Machine Translation, pages 61–70.
Sánchez-Cartagena, Víctor M., Juan Antonio Pérez-Ortiz, and Felipe Sánchez-Martínez. 2019. The Universitat d’Alacant submissions to the English-to-Kazakh news translation task at WMT 2019. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 356–363.

Sánchez-Cartagena, Víctor M., Juan Antonio Pérez-Ortiz, and Felipe Sánchez-Martínez. 2020. Understanding the effects of word-level linguistic annotations in under-resourced neural machine translation. In Proceedings of the 28th International Conference on Computational Linguistics, pages 3938–3950.
Sánchez-Cartagena, Víctor M. 2018. Prompsit’s submission to the IWSLT 2018 low resource machine translation task. In Proceedings of the 15th International Workshop on Spoken Language Translation, pages 95–103.
Sánchez-Martínez, Felipe, Víctor M. Sánchez-Cartagena, Juan Antonio Pérez-Ortiz, Mikel L. Forcada, Miquel Esplà-Gomis, Andrew Secker, Susie Coleman, and Julie Wall. 2020. An English-Swahili parallel corpus and its use for neural machine translation in the news domain. In Proceedings of the 22nd Annual Conference of the European Association for Machine Translation, pages 299–308.
Santoro, Adam, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. 2016. Meta-learning with memory-augmented neural networks. In Proceedings of the 33rd International Conference on Machine Learning, pages 1842–1850.
Scherrer, Yves. 2018. The University of Helsinki submissions to the IWSLT 2018 low-resource translation task. In Proceedings of the 15th International Workshop on Spoken Language Translation, pages 83–88.

Scherrer, Yves, Stig-Arne Grönroos, and Sami Virpioja. 2020. The University of Helsinki and Aalto University submissions to the WMT 2020 news and low-resource translation tasks. In Proceedings of the Fifth Conference on Machine Translation, pages 1129–1138.

Schmidhuber, J. 1992. Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 4(1):131–139.
Schulz, Philip, Wilker Aziz, and Trevor Cohn. 2018. A stochastic decoder for neural machine translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1243–1252.

Schwenk, Holger, Vishrav Chaudhary, Shuo Sun, Hongyu Gong, and Francisco Guzmán. 2021a. WikiMatrix: Mining 135M parallel sentences in 1620 language pairs from Wikipedia. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 1351–1361.

Schwenk, Holger, Guillaume Wenzek, Sergey Edunov, Edouard Grave, Armand Joulin, and Angela Fan. 2021b. CCMatrix: Mining billions of high-quality parallel sentences on the web. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6490–6500.

Sellam, Thibault, Dipanjan Das, and Ankur Parikh. 2020. BLEURT: Learning robust metrics for text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7881–7892.
Sen, Sukanta, Kamal Kumar Gupta, Asif Ekbal, and Pushpak Bhattacharyya. 2019a. IITP-MT system for Gujarati-English news translation task at WMT 2019. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 407–411.

Sen, Sukanta, Kamal Kumar Gupta, Asif Ekbal, and Pushpak Bhattacharyya. 2019b. Multilingual unsupervised NMT using shared encoder and language-specific decoders. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3083–3089.
Sennrich, Rico and Barry Haddow. 2016. Linguistic input features improve neural machine translation. In Proceedings of the First Conference on Machine Translation: Volume 1, Research Papers, pages 83–91.
Sennrich, Rico, Barry Haddow, and Alexandra Birch. 2016a. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86–96.

Sennrich, Rico, Barry Haddow, and Alexandra Birch. 2016b. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725.
Sennrich, Rico and Martin Volk. 2010. MT-based sentence alignment for OCR-generated parallel texts. In Proceedings of the 9th Conference of the Association for Machine Translation in the Americas: Research Papers.

Sennrich, Rico and Martin Volk. 2011. Iterative, MT-based sentence alignment of parallel texts. In Proceedings of the 18th Nordic Conference of Computational Linguistics (NODALIDA 2011), pages 175–182.

Sennrich, Rico and Biao Zhang. 2019. Revisiting low-resource neural machine translation: A case study. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 211–221.
Setiawan, Hendra, Matthias Sperber, Udhyakumar Nallasamy, and Matthias Paulik. 2020. Variational neural machine translation with normalizing flows. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7771–7777.

Shen, Shiqi, Yong Cheng, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. 2016. Minimum risk training for neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1683–1692.
Shi, Tingxun, Shiyu Zhao, Xiaopu Li, Xiaoxue Wang, Qian Zhang, Di Ai, Dawei Dang, Xue Zhengshan, and Jie Hao. 2020. OPPO’s machine translation systems for WMT20. In Proceedings of the Fifth Conference on Machine Translation, pages 282–292.
Singh, Keshaw. 2020. Adobe AMPS’s submission for very low resource supervised translation task at WMT20. In Proceedings of the Fifth Conference on Machine Translation, pages 1144–1149.
Singh, Salam Michael, Thoudam Doren Singh, and Sivaji Bandyopadhyay. 2020. The NITS-CNLP system for the unsupervised MT task at WMT 2020. In Proceedings of the Fifth Conference on Machine Translation, pages 1139–1143.