Abstract
Most of the world's languages are resource-poor for statistical machine translation; still, many of them are related to some resource-rich language. Thus, we propose three novel, language-independent approaches to source language adaptation for resource-poor statistical machine translation. Specifically, we build improved statistical machine translation models from a resource-poor language POOR into a target language TGT by adapting and using a large bitext for a related resource-rich language RICH and the same target language TGT. We assume a small POOR–TGT bitext, from which we learn word-level and phrase-level paraphrases and cross-lingual morphological variants between the resource-rich and the resource-poor language. Our work matters for resource-poor machine translation because it offers a practical guideline for building machine translation systems for resource-poor languages.
Our experiments for Indonesian/Malay–English translation show that using the large adapted resource-rich bitext yields 7.26 BLEU points of improvement over the unadapted one and 3.09 BLEU points over the original small bitext. Moreover, combining the small POOR–TGT bitext with the adapted bitext outperforms the corresponding combinations with the unadapted bitext by 1.93–3.25 BLEU points. We also demonstrate the applicability of our approaches to other languages and domains.
1. Introduction
Contemporary statistical machine translation (SMT) systems learn how to translate from large sentence-aligned bilingual corpora of human-generated translations, called bitexts. Unfortunately, collecting sufficiently large, high-quality bitexts is difficult, and thus most of the 6,500+ world languages are resource-poor for SMT. Fortunately, many of these resource-poor languages are related to some resource-rich language, with which they overlap in vocabulary and share cognates; this offers opportunities for bitext reuse.
Example pairs of such resource rich–poor languages include Spanish–Catalan, Finnish–Estonian, Swedish–Norwegian, Russian–Ukrainian, Irish–Scottish Gaelic, Standard German–Swiss German, Modern Standard Arabic–Dialectal Arabic (e.g., Gulf, Egyptian), Turkish–Azerbaijani, and so on.
Previous work has already demonstrated the benefits of using a bitext for a related resource-rich language to X (e.g., X = English) to improve machine translation from a resource-poor language to X (Nakov and Ng 2009, 2012). Here we take a different, orthogonal approach: We adapt the resource-rich language to get closer to the resource-poor one.
We assume two bitexts: (1) a small bitext for a resource-poor source language S1 and some target language T, and (2) a large bitext for a related resource-rich source language S2 and the same target language T. We use these bitexts to learn word-level and phrase-level paraphrases and cross-lingual morphological variants between the resource-poor and resource-rich languages, S1 and S2. We propose three approaches to adapt (the source side of) the large bitext for S2–T: word-level paraphrasing, phrase-level paraphrasing, and text rewriting using a specialized decoder. The first two approaches were proposed in our previous work (Wang, Nakov, and Ng 2012), and the third approach is novel and outperforms the other two in our experiments.
Training on the adapted large bitext yields very significant improvements in translation quality compared with both training on the unadapted large bitext S2–T, and training on the small bitext for the resource-poor language S1–T. We further achieve very sizable improvements when combining the small bitext S1–T with the large adapted bitext, compared with combining the former with the unadapted bitext S2–T.
Although here we focus on adapting Malay to look like Indonesian, we also demonstrate the applicability of our approach to another language pair, Bulgarian–Macedonian, as well as to a different domain.
The remainder of this article is organized as follows. Section 2 presents an overview of related work. Section 3 introduces our target resource rich–poor language pair: Malay–Indonesian. Then, Section 4 presents our three approaches for source language adaptation. Section 5 describes the experimental set-up, after which we present the experimental results and discussions in Section 6. Section 7 contains deeper analysis of the obtained results. Finally, Section 8 concludes and points to possible directions for future work.
2. Related Work
One relevant line of research is on machine translation between closely related languages, which is arguably simpler than general SMT, and thus can be handled using word-for-word translation, manual language-specific rules that take care of the necessary morphological and syntactic transformations, or character-level translation/transliteration. This has been tried for a number of language pairs including Czech–Slovak (Hajič, Hric, and Kuboň 2000), Turkish–Crimean Tatar (Altintas and Cicekli 2002), Irish–Scottish Gaelic (Scannell 2006), and Macedonian–Bulgarian (Nakov and Tiedemann 2012). In contrast, we have a different objective: We do not carry out full translation but rather adaptation (since our ultimate goal is to translate into a third language X). A special case of this same line of research is the translation between dialects of the same language, for example, between Cantonese and Mandarin (Zhang 1998), or between a dialect of a language and a standard version of that language, for example, between some Arabic dialect (e.g., Egyptian) and Modern Standard Arabic (Bakr, Shaalan, and Ziedan 2008; Sawaf 2010; Salloum and Habash 2011; Sajjad, Darwish, and Belinkov 2013). Here again, manual rules and/or language-specific tools and resources are typically used. In the case of Arabic dialects, a further complication arises due to the informal status of the dialects, which are not standardized and not used in formal contexts but rather only in informal online media such as social networks, chats, forums, Twitter, and SMS messages, though the Egyptian Wikipedia is one notable exception. This causes further mismatch in domain and genre. Thus, translating from Arabic dialects to Modern Standard Arabic requires, among other things, normalizing informal text to a formal form. Sajjad, Darwish, and Belinkov (2013) first normalized a dialectal Egyptian Arabic to look like Modern Standard Arabic, and then translated the transformed text to English.
In fact, this is a more general problem, which arises with informal sources such as SMS messages and Tweets for just any language (Aw et al. 2006; Han and Baldwin 2011; Wang and Ng 2013; Bojja, Nedunchezhian, and Wang 2015). Here the main focus is on coping with spelling errors, abbreviations, and slang, which are typically addressed using string edit distance, while also taking pronunciation into account. This is different from our task, where we try to adapt good, formal text from one language to another.
A second relevant line of research is on language adaptation and normalization, when done specifically for improving SMT into another language. For example, Marujo et al. (2011) described a rule-based system for adapting Brazilian Portuguese (BP) to European Portuguese (EP), which they used to adapt BP–English bitexts to EP–English. They report small improvements in BLEU for EP–English translation when training on the adapted “EP”–English bitext compared with using the unadapted BP–English (38.55 vs. 38.29 BLEU points), or when an EP–English bitext is used in addition to the adapted/unadapted one (41.07 vs. 40.91 BLEU points). Unlike that work, which heavily relied on language-specific rules, our approach is statistical, and largely language-independent; moreover, our improvements are much more sizable.
A third relevant line of research is on reusing bitexts between related languages without or with very little adaptation, which works well for very closely related languages. For example, our previous work (Nakov and Ng 2009, 2012) experimented with various techniques for combining a small bitext for a resource-poor language (Indonesian or Spanish) with a much larger bitext for a related resource-rich language (Malay or Portuguese), pretending that Spanish is resource-poor; the target language of all bitexts was English. However, that work did not attempt language adaptation, except for very simple transliteration for Portuguese–Spanish that ignored context entirely; because it does not substitute a word with a completely different word, transliteration did not help much for Malay–Indonesian, which use unified spelling. Still, once we have language-adapted the large bitext, it makes sense to try to combine it further with the small bitext; thus, in the following we will directly compare and combine these two approaches.
One alternative, which we do not explore in this work, is to use cascaded translation using a pivot language (Cohn and Lapata 2007; Utiyama and Isahara 2007; Wu and Wang 2009). Unfortunately, using the resource-rich language as a pivot (poor→rich→X) would require an additional parallel poor–rich bitext, which we do not have. Pivoting over the target X (rich→X→poor) for the purpose of language adaptation, on the other hand, would miss the opportunity to exploit the relationship between the resource-poor and the resource-rich language; this would also be circular since the first step would ask an SMT system to translate its own training data (we only have one rich–X bitext). Yet another alternative approach for improving resource-poor MT is to mine translation bitexts from comparable corpora (Munteanu, Fraser, and Marcu 2004; Snover, Dorr, and Schwartz 2008). This is orthogonal to our efforts here, as our focus is on adapting resources for a related resource-rich language, rather than directly mining source–target translation pairs from comparable corpora.
3. Malay and Indonesian
Malay and Indonesian are closely related, mutually intelligible Austronesian languages with 180 million speakers combined. They have a unified spelling, with occasional differences, for example, kerana vs. karena (‘because’), Inggeris vs. Inggris (‘English’), and wang vs. uang (‘money’).
They differ more substantially in vocabulary, mostly because of loan words, where Malay typically follows the English pronunciation, whereas Indonesian tends to follow Dutch, for example, televisyen vs. televisi, Julai vs. Juli, and Jordan vs. Yordania.
Although there are many cognates between the two languages, there are also many false friends, for example, polisi means policy in Malay but police in Indonesian. There are also many partial cognates, for example, nanti means both will (future tense marker) and later in Malay but only later in Indonesian.
Thus, fluent Malay and fluent Indonesian can differ substantially. Consider, for example, Article 1 of the Universal Declaration of Human Rights:1
- •
Semua manusia dilahirkan bebas dan samarata dari segi kemuliaan dan hak-hak. Mereka mempunyai pemikiran dan perasaan hati dan hendaklah bertindak di antara satu sama lain dengan semangat persaudaraan. (Malay)
- •
Semua orang dilahirkan merdeka dan mempunyai martabat dan hak-hak yang sama. Mereka dikaruniai akal dan hati nurani dan hendaknya bergaul satu sama lain dalam semangat persaudaraan. (Indonesian)
- •
All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood. (English)
There is only 50% overlap at the word level, but the actual vocabulary overlap is much higher—for example, there is only one word in the Malay text that does not exist in Indonesian: samarata (‘equal’). Other differences are due to the use of different morphological forms, for example, hendaklah vs. hendaknya (‘should’), derivational variants of hendak (‘want’).
To quantify the similarity between some pairs of languages, we calculated the cosine similarity between them based on the Universal Declaration of Human Rights.2 The results are shown in Table 1. We can see that the average similarity between English and {Indonesian, Malay, French, Spanish} is 0.001–0.033, whereas for closely related language pairs it ranges from 0.302 to 0.802. Of course, this cosine calculation compares surface word overlap only and does not take minor morphological variants into consideration. Yet, this gives an idea of the relative proximity between the languages.
Language Pairs | Cosine Similarity |
---|---|
Malay–Indonesian | 0.802 |
Portuguese–Spanish | 0.475 |
Bulgarian–Macedonian | 0.302 |
French–English | 0.033 |
Spanish–English | 0.031 |
Indonesian–English | 0.002 |
Malay–English | 0.001 |
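As a rough illustration of how such similarity numbers can be computed, here is a minimal sketch of bag-of-words cosine similarity over two tokenized texts. The article does not specify its exact preprocessing (tokenization, casing), so whitespace tokenization and lowercasing are assumptions here.

```python
from collections import Counter
from math import sqrt

def cosine_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words frequency vectors."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy example on fragments of the declaration text: two of four words overlap.
malay = "semua manusia dilahirkan bebas"
indonesian = "semua orang dilahirkan merdeka"
print(cosine_similarity(malay, indonesian))  # -> 0.5
```

As the text notes, this captures only surface word overlap, so morphological variants such as hendaklah/hendaknya count as completely different words.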
Of course, word choice in translation is often a matter of taste. Thus, we asked a native speaker of Indonesian to adapt the Malay version to Indonesian while preserving as many words as possible, and we obtained the following result:
- •
Semua manusia dilahirkan bebas dan mempunyai martabat dan hak-hak yang sama. Mereka mempunyai pemikiran dan perasaan dan hendaklah bergaul satu sama lain dalam semangat persaudaraan. (Indonesian)
Obtaining this latter version from the original Malay text requires three kinds of word-level operations:
- •
deletion of dari, segi, and hati
- •
insertion of yang and sama
- •
substitution of samarata with mempunyai, kemuliaan with martabat, and dengan with dalam
Unfortunately, we do not have parallel Malay–Indonesian text, which complicates the process of learning when to apply these operations. Thus, in the following we focus our attention on the simplest and most common operation of word/phrase substitution only, leaving the other two operations for future work. There are other potentially useful operations—for example, a correct translation for the Malay samarata can be obtained by splitting it into the Indonesian sequence sama rata.
Note that simple word substitution is enough in many cases—for example, it is all that is needed for the following Malay–Indonesian sentence pair:
- •
KDNK Malaysia dijangka cecah 8 peratus pada tahun 2010. (Malay)
- •
PDB Malaysia akan mencapai 8 persen pada tahun 2010. (Indonesian)
- •
Malaysia's GDP is expected to reach 8 percent in 2010. (English)
4. Methods
Assuming a resource-rich bitext (Malay–English) and a resource-poor bitext (Indonesian–English), we improve statistical machine translation from the resource-poor language (Indonesian) to English by adapting the bitext for the related resource-rich language (Malay) and English to the resource-poor language (Indonesian) and English. We propose three bitext adaptation approaches: word-level paraphrasing, phrase-level paraphrasing, and text rewriting with a specialized decoder.
Given a Malay sentence in the resource-rich Malay–English bitext, we use one of these three adaptation approaches to generate a ranked list of n corresponding adapted “Indonesian” sentences. Then, we pair each such adapted “Indonesian” sentence with the English counterpart in the Malay–English bitext for the Malay sentence it was derived from, thus obtaining a synthetic “Indonesian”–English bitext. Finally, we combine this synthetic bitext with the resource-poor Indonesian–English bitext to train the final Indonesian–English SMT system, using various bitext combination methods.
In the remainder of this section, we first present the word-level paraphrasing approach, followed by the phrase-level paraphrasing approach; then, we describe the text rewriting decoder. Finally, we describe the bitext combination methods we experiment with.
4.1 Word-Level Paraphrasing
Given a Malay sentence, we generate a confusion network containing multiple Indonesian word-level paraphrase options for each Malay word. Each such Indonesian option is associated with a corresponding weight in the network, which is defined as the probability of this option being a translation of the original Malay word, calculated using Equation (1). We decode this confusion network using a large Indonesian language model, thus generating a ranked list of n corresponding adapted “Indonesian” sentences.
In the following we first describe how we generate the word-level Indonesian options and the corresponding weights for the Malay words. Then, we explain how we build, decode, and improve the confusion network.
4.1.1 Inducing Word-Level Paraphrases
We use pivoting over English to induce potential Indonesian word translations for a given Malay word. Given a Malay word m, we estimate the probability of an Indonesian word i being its translation as

Pr(i|m) = Σ_e Pr(i|e) Pr(e|m)     (1)

where the summation ranges over the English words e. Pr(i|e) and Pr(e|m) are estimated using maximum likelihood from the word alignments. Following Callison-Burch, Koehn, and Osborne (2006), we further assume that i is conditionally independent of m given e.
For example, Figure 1 shows how the Indonesian word apakah is induced as a translation option for the Malay word adakah: the two words are aligned to the same English word whether in the word alignments for the Indonesian–English bitext and for the Malay–English bitext, respectively.
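The pivoting computation of Equation (1) can be sketched as follows; the lexical probability tables here are hypothetical toy values, not estimates from real alignments.

```python
from collections import defaultdict

def pivot_word_translations(p_i_given_e, p_e_given_m):
    """Induce Pr(i|m) by pivoting over English:
    Pr(i|m) = sum over e of Pr(i|e) * Pr(e|m)."""
    p_i_given_m = defaultdict(lambda: defaultdict(float))
    for m, e_probs in p_e_given_m.items():
        for e, p_em in e_probs.items():
            for i, p_ie in p_i_given_e.get(e, {}).items():
                p_i_given_m[m][i] += p_ie * p_em
    return p_i_given_m

# Hypothetical lexical probabilities, as if estimated from word alignments.
p_e_given_m = {"adakah": {"whether": 0.7, "if": 0.3}}
p_i_given_e = {"whether": {"apakah": 0.8}, "if": {"apakah": 0.5, "jika": 0.5}}
table = pivot_word_translations(p_i_given_e, p_e_given_m)
# Pr(apakah | adakah) = 0.7*0.8 + 0.3*0.5 = 0.71
```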
4.1.2 Confusion Network Construction
Given a Malay sentence, we construct an Indonesian confusion network, where each Malay word is augmented with a set of alternatives, represented as network transitions: possible Indonesian word translations. The weight of such a transition is the conditional Indonesian–Malay translation probability as calculated by Equation (1); the original Malay word is assigned a weight of 1.
Note that we paraphrase each word in the input Malay sentence as opposed to only those Malay words that we believe not to exist in Indonesian (e.g., because they do not appear in our Indonesian monolingual text). This is necessary because of the large number of false friends and partial cognates between Malay and Indonesian (see Section 3).
Finally, we decode the confusion network for a Malay sentence using a large Indonesian language model, and we extract an n-best list. For balance, in case of fewer than n adaptations for a Malay sentence, we randomly repeat some of the available ones. Table 2 shows the 10-best adapted “Indonesian” sentences we generated for the confusion network in Figure 2. According to a native Indonesian speaker, options 1 and 3 in the table are perfect adaptations, options 2 and 5 have a wrong word order, and the rest are grammatical though not perfect.
Rank | “Indonesian” Sentence
---|---
1 | pdb malaysia akan mencapai 8 persen pada tahun 2010 .
2 | pdb malaysia untuk mencapai 8 persen pada tahun 2010 .
3 | pdb malaysia diperkirakan mencapai 8 persen pada tahun 2010 .
4 | maka malaysia akan mencapai 8 persen pada tahun 2010 .
5 | maka malaysia untuk mencapai 8 persen pada tahun 2010 .
6 | pdb malaysia dapat mencapai 8 persen pada tahun 2010 .
7 | maka malaysia diperkirakan mencapai 8 persen pada tahun 2010 .
8 | sebesar malaysia akan mencapai 8 persen pada tahun 2010 .
9 | pdb malaysia diharapkan mencapai 8 persen pada tahun 2010 .
10 | pdb malaysia ini mencapai 8 persen pada tahun 2010 .
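To make the decoding step concrete, here is a minimal sketch of confusion network decoding with a toy bigram language model and per-position beam expansion; the real system uses a large Indonesian LM and a proper lattice decoder, so this is only illustrative, and all probabilities are invented.

```python
import heapq
from math import log

def decode_confusion_network(network, bigram_lm, n=3):
    """Beam decoding of a confusion network with a toy bigram LM.
    network: one list of (word, weight) options per position."""
    beam = [(0.0, ["<s>"])]  # (log score, words so far)
    for options in network:
        expanded = [
            (score + log(weight) + log(bigram_lm.get((words[-1], word), 1e-6)),
             words + [word])
            for score, words in beam
            for word, weight in options
        ]
        beam = heapq.nlargest(n, expanded)  # keep the n best hypotheses
    return [(" ".join(words[1:]), score) for score, words in beam]

# Toy network for "kdnk malaysia": the original Malay word keeps weight 1.0,
# while the pivoted Indonesian option carries its translation probability.
network = [[("kdnk", 1.0), ("pdb", 0.6)], [("malaysia", 1.0)]]
bigram_lm = {("<s>", "pdb"): 0.5, ("pdb", "malaysia"): 0.4}
nbest = decode_confusion_network(network, bigram_lm, n=2)
# "pdb malaysia" outranks "kdnk malaysia" thanks to the Indonesian LM
```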
4.1.3 Further Refinements
Many of our Malay–Indonesian paraphrases are bad: Some have very low probabilities, and others involve rare words for which the probability estimates are unreliable. Moreover, the options we propose for a Malay word are inherently restricted to the small Indonesian vocabulary of the Indonesian–English bitext. We now describe how we address these issues.
Score-based filtering. We filter out translation pairs whose probabilities (Equation (1)) are lower than some threshold (tuned on the development data set), for example, 0.01.
Improved estimations for Pr(i|e). We concatenate k copies of the small Indonesian–English bitext and one copy of the large Malay–English bitext, where the value of k is selected so that we have roughly the same number of Indonesian and Malay sentences. Then, we generate word-level alignments for the resulting bitext. Finally, we truncate these alignments keeping them for one copy of the original Indonesian–English bitext only. Thus, we end up with improved word alignments for the Indonesian–English bitext, and ultimately with better estimations for Equation (1). Because Malay and Indonesian share many cognates, this improves word alignments for Indonesian words that occur rarely in the small Indonesian–English bitext, but are relatively frequent in the larger Malay–English one; it also helps for some frequent words.
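The balancing trick above can be sketched as follows; the choice of rounding to compute k is an assumption, as the text only says the copies should roughly balance the sentence counts.

```python
def build_concatenated_bitext(small_bitext, large_bitext):
    """Balance the bitexts: concatenate k copies of the small bitext with the
    large one, word-align the result (not shown here), then keep the
    alignments for the first copy of the small bitext only."""
    k = max(1, round(len(large_bitext) / len(small_bitext)))
    return small_bitext * k + large_bitext, k

small = [("saya minum", "i drink")] * 2
large = [("saya makan", "i eat")] * 7
combined, k = build_concatenated_bitext(small, large)
# k = 4 copies of the 2-pair small bitext roughly match the 7 large pairs;
# the improved alignments are later truncated to combined[:len(small)]
```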
Cross-lingual morphological variants. We increase the Indonesian options for a Malay word using morphology. The set of Indonesian options that pivoting can propose for a Malay word is restricted to the Indonesian vocabulary of the small Indonesian–English bitext, which is a severe limitation. Thus, assuming a large monolingual Indonesian text, we first build a lexicon of the words in the text. Then, we lemmatize these words using two different lemmatizers: the Malay lemmatizer of Baldwin and Awab (2006), and a similar Indonesian lemmatizer. These two analyzers have different strengths and weaknesses, and therefore we combine their outputs to increase recall. Next, we group all Indonesian words that share the same lemma; for example, for minum we obtain {diminum, diminumkan, diminumnya, makan-minum, makananminuman, meminum, meminumkan, meminumnya, meminum-minuman, minum, minum-minum, minum-minuman, minuman, minumanku, minumannya, peminum, peminumnya, perminum, terminum}. Because Malay and Indonesian are subject to the same morphological processes and share many lemmata, we use such groups to propose Indonesian translation options for a Malay word. We first lemmatize the target Malay word, and then we find all groups of Indonesian words whose lemma matches the Malay lemma. The union of these groups is the set of morphological variants that we add to the confusion network as additional options for the Malay word. Although different morphological forms typically have different meanings, for example, minum (‘drink’) vs. peminum (‘drinker’), in some cases the forms can share an English translation, for example, minum (‘drink’, verb) vs. minuman (‘drink’, noun). This is our motivation for trying morphological variants, even though they are almost exclusively derivational, and thus generally quite risky as translational variants.
For example, given seperminuman (‘drinking’) in the Malay input, we first find its lemma minum, and then we get the above example set of Indonesian words, which contains some reasonable substitutes such as minuman (‘drink’).
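The grouping of words by shared lemma can be sketched as below; the toy affix-stripping lemmatizer stands in for the real Malay and Indonesian lemmatizers and is purely hypothetical.

```python
from collections import defaultdict

def group_by_lemma(vocabulary, lemmatizers):
    """Group words sharing a lemma; taking the union of several lemmatizers'
    outputs raises recall, as with the two analyzers combined in the text."""
    groups = defaultdict(set)
    for word in vocabulary:
        for lemmatize in lemmatizers:
            groups[lemmatize(word)].add(word)
    return groups

def toy_lemmatize(word):
    """Hypothetical toy lemmatizer stripping a few common affixes."""
    for prefix in ("me", "pe", "di", "ter"):
        if word.startswith(prefix):
            word = word[len(prefix):]
    for suffix in ("kan", "nya", "an"):
        if word.endswith(suffix):
            word = word[: -len(suffix)]
    return word

groups = group_by_lemma(["minum", "minuman", "peminum", "diminum"],
                        [toy_lemmatize])
# groups["minum"] collects all four forms as candidate variants
```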
4.2 Phrase-Level Paraphrasing
Word-level paraphrasing ignores context when generating Indonesian variants, relying only on the Indonesian language model to make the right contextual choice. This might not be strong enough. Thus, we also try to model context more directly by generating adaptation options at the phrase level.
4.2.1 Inducing Phrase-Level Paraphrases
We use standard phrase-based SMT techniques (Koehn et al. 2007) to build separate phrase tables for the Indonesian–English and the Malay–English bitexts. We then pivot over the English phrases to generate Indonesian–Malay phrase pairs. As in the case of word-level pivoting, we derive the paraphrase probabilities from the corresponding probabilities in the two phrase tables, again using Equation (1).
We then use the Moses phrase-based SMT decoder (Koehn et al. 2007) to “translate” the Malay side of the Malay–English bitext to get closer to Indonesian. We use monotone translation, that is, we allow no phrase reordering. We tune the parameters of the log-linear model on a development set using minimum error rate training (MERT) (Och 2003).
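Phrase-level pivoting mirrors the word-level case. The sketch below pivots a single score; in the actual phrase table, the same computation would be applied to each of the four scores (forward/reverse phrase and lexical probabilities). The phrase pairs and probabilities here are invented for illustration.

```python
from collections import defaultdict

def pivot_phrase_table(ms_en, en_id):
    """Pivot over shared English phrases:
    Pr(id|ms) = sum over en of Pr(id|en) * Pr(en|ms).
    ms_en maps a Malay phrase to {english phrase: Pr(en|ms)};
    en_id maps an English phrase to {indonesian phrase: Pr(id|en)}."""
    pivoted = defaultdict(lambda: defaultdict(float))
    for ms, en_probs in ms_en.items():
        for en, p_en_ms in en_probs.items():
            for id_phrase, p_id_en in en_id.get(en, {}).items():
                pivoted[ms][id_phrase] += p_id_en * p_en_ms
    return pivoted

ms_en = {"dijangka cecah": {"expected to reach": 0.6}}
en_id = {"expected to reach": {"akan mencapai": 0.5,
                               "diperkirakan mencapai": 0.4}}
table = pivot_phrase_table(ms_en, en_id)
# Pr(akan mencapai | dijangka cecah) = 0.6 * 0.5 = 0.3
```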
4.2.2 Cross-Lingual Morphological Variants
Although phrase-level paraphrasing models context better, it remains limited in the size of its Indonesian vocabulary by the small Indonesian–English bitext, just like word-level paraphrasing. We address this by transforming the Indonesian sentences in the development and the test Indonesian–English bitexts into confusion networks (Dyer 2007; Du, Jiang, and Way 2010), where we add Malay morphological variants for the Indonesian words, weighting them based on Equation (2). Note that we do not alter the training bitext; we just transform the source side of the development and the test data sets into confusion networks.
4.3 Text Rewriting with a Specialized Decoder
In this section, we introduce a third approach to source language adaptation, which uses a text rewriting decoder to iteratively find the best adaptation for an input sentence.
We first discuss the differences between traditional left-to-right decoders and the text rewriting decoder we propose. We then introduce the decoding algorithm, the different hypothesis producers, and the feature functions we use for source language adaptation.
4.3.1 Differences from Typical Beam-Search Decoders
Beam-search decoders are widely used in natural language processing applications such as SMT, for example, in the phrase-based Moses decoder (Koehn et al. 2007), and automatic speech recognition (ASR), for example, in the HTK hidden Markov model toolkit (Young et al. 2002). Given an input sentence in the source language, various hypotheses about the output sentence in the target language are generated in a left-to-right fashion.
Figure 3 shows an example search tree for the input sentence s1s2s3, given the following translation options: {(s1, t2), (s1s2, t2t5), (s2s3, t6), (s3, t4)}, where si and tj are source and target words, respectively. Starting from the initial hypothesis, each hypothesis is expanded by adding one more target phrase to the output sentence. This requires keeping a map of which words have been translated so far, as the figure shows. Hypotheses with the same maps and the same target output are recombined, and those with the same number of translated words are kept in the same beam. For efficiency reasons, beams are limited in size, and thus only the highest-scoring hypotheses make it into each beam. Note that all hypotheses before the last level in the search tree are incomplete, which means that sentence-level feature functions cannot be computed exactly for them, for example, the type/token ratio feature function of Hardmeier et al. (2013), which models readability.
Because the decoders used in SMT and ASR typically work at the phrase- or the word-level, they cannot make use of sentence-level features. In contrast, our text rewriting decoder works at the sentence-level, that is, all hypotheses are complete sentences. This means that we can use truly sentence-level features. We will show an example in Section 7.5.
Figure 4 shows the search tree of our decoder for the same input sentence and the same translation options as in the beam decoder example from Figure 3. The search starts from the initial hypothesis, which is then expanded by replacing a source phrase with a target phrase using one phrase pair from the translation options; then, the process continues recursively with each of the new hypotheses.
4.3.2 Beam-Search Algorithm for Text Rewriting
Given an input sentence, our decoder searches for the best rewriting. It repeats two steps for a number of iterations:
- •
producing new sentence-level hypotheses from the hypotheses in the current beams, which is carried out by hypothesis producers
- •
scoring these new hypotheses to retain in the beams only the best ones, which is done using feature functions
Algorithm 1 describes the search process, which uses lazy pruning, only retaining in the beams the n-best hypotheses (as also implemented in the Moses decoder). Hypotheses with the same number of modifications are grouped in the same beam. The maximum number of iterations is equal to the number of tokens in the input sentence, that is, we suppose each token needs at most one modification on average. Upon completion, we select the best hypothesis across all beams.
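The rewriting loop can be sketched compactly under simplifying assumptions: hypotheses are plain strings, there is a single toy word-mapping producer, and a toy scoring function counting known Indonesian words stands in for the log-linear model.

```python
def rewrite(sentence, producers, score, beam_size=10):
    """Sentence-level beam search: every hypothesis is a complete sentence;
    beam i holds hypotheses obtained after i modifications."""
    beams = [[sentence]]
    for _ in range(len(sentence.split())):  # at most one change per token
        candidates = set()
        for hyp in beams[-1]:
            for produce in producers:
                candidates.update(produce(hyp))
        if not candidates:
            break
        beams.append(sorted(candidates, key=score, reverse=True)[:beam_size])
    return max((h for beam in beams for h in beam), key=score)

# Toy word-level mapping producer (hypothetical dictionary entries).
mapping = {"peratus": "persen", "cecah": "mencapai"}

def word_mapper(hyp):
    words = hyp.split()
    for idx, word in enumerate(words):
        if word in mapping:
            yield " ".join(words[:idx] + [mapping[word]] + words[idx + 1:])

# Toy score: count words known to the "Indonesian LM" vocabulary.
indonesian = {"mencapai", "persen", "pada", "tahun", "8"}

def score(hyp):
    return sum(word in indonesian for word in hyp.split())

best = rewrite("cecah 8 peratus pada tahun 2010", [word_mapper], score)
# -> "mencapai 8 persen pada tahun 2010"
```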
4.3.3 Hypothesis Producers
Hypothesis producers generate new hypotheses by modifying existing ones. We use three types of hypothesis producers:
- •
Word-level mapping: This hypothesis producer uses the word-level pivoted Malay–Indonesian dictionary described in Section 4.1.1. For example, given the hypothesis KDNK Malaysia dijangka cecah 8.1 peratus pada tahun 2010., if the dictionary has the translation pair (peratus, persen), the following hypothesis will be produced: KDNK Malaysia dijangka cecah 8.1 persen pada tahun 2010.
- •
Phrase-level mapping: This hypothesis producer uses the pivoted phrase table described in Section 4.2.1. For example, if the pivoted phrase table contains the phrase pair (dijangka cecah, akan mencapai), given the hypothesis KDNK Malaysia dijangka cecah 8.1 peratus pada tahun 2010., the new hypothesis KDNK Malaysia akan mencapai 8.1 peratus pada tahun 2010. will be generated.
- •
Cross-lingual morphological mapping: This hypothesis producer uses the cross-lingual morphological dictionary described in Section 4.1.3, which maps a Malay word to its Indonesian morphological variants. For example, given the hypothesis dan untuk meringkaskan pengalamannya?, if the dictionary contains the morphological variant pair (meringkaskan, meringkas), the following hypothesis will be produced: dan untuk meringkas pengalamannya?
The hypothesis producers presented here are all based on statistical methods. In principle, we can also use some rule-based hypothesis producers to adapt Malay to Indonesian. For example, the number format of Malay is different from that of Indonesian: Malay numbers are written in accordance with the British convention, that is, “.” is the decimal point and “,” denotes digit grouping, whereas in Indonesian, the roles of “.” and “,” are switched. Thus, we can build a rule-based hypothesis producer to convert Malay numbers to Indonesian ones, for example, which would convert the hypothesis KDNK Malaysia dijangka cecah 8.1 peratus pada tahun 2010. to KDNK Malaysia dijangka cecah 8,1 peratus pada tahun 2010. However, such a rule-based hypothesis producer would be language-specific. In the present work, we have chosen to stick to statistical hypothesis producers only in order to keep our decoder as language-independent as possible. This makes it potentially applicable to many closely related language pairs, which we will demonstrate in Section 7.4.
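For illustration only, the number-format rule described above might be implemented as the following sketch; it is not part of the system, which deliberately uses statistical producers only.

```python
import re

def convert_malay_numbers(sentence):
    """Rule-based producer sketch: swap '.' and ',' inside numbers, since
    Malay follows the British convention and Indonesian the reverse."""
    swap = str.maketrans(",.", ".,")
    # Match digit runs that may contain internal separators, e.g. 8.1 or 1,234
    return re.sub(r"\d[\d.,]*\d",
                  lambda m: m.group(0).translate(swap),
                  sentence)

print(convert_malay_numbers("KDNK Malaysia dijangka cecah 8.1 peratus pada tahun 2010."))
# -> "KDNK Malaysia dijangka cecah 8,1 peratus pada tahun 2010."
```

Note that the regex requires a digit on both ends of the match, so a sentence-final period after a year is not treated as a decimal point.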
4.3.4 Feature Functions
The text rewriting decoder assesses the quality of a hypothesis based on a log-linear model and a number of feature functions, which can be grouped into two general types.
The first type includes the count feature functions, which count the total number of modifications that a given hypothesis producer has made. They allow the decoder to distinguish good hypothesis producers from bad ones. More precisely, if the decoder finds a specific hypothesis producer more useful than others, it can give it a higher weight in order to let it perform more modifications.
The second type includes general feature functions such as:
- Indonesian language model score of the adapted “Indonesian” sentence
- Word penalty, that is, the number of tokens in the hypothesis
- Malay word penalty, that is, the number of Malay words in the hypothesis, which are identified using bigram counts from the Indonesian language model: a word w in a hypothesis … w−1 w w1 … is considered a Malay word if neither bigram w−1w nor ww1 occurs in the Indonesian language model. Note that it would be difficult to implement this feature function in a phrase-based SMT decoder such as Moses (Koehn et al. 2007), since hypotheses in Moses contain only incomplete sentences before the last stack, and this feature function needs to see a future word w1 that has not yet been generated for the last word w. Of course, it could also be implemented for words up to w−1, that is, ignoring the last word in the hypothesis, but this would make the implementation different from what is done for the language model, and it would require treating full hypotheses differently from partial ones; the implementation would become even trickier with higher-order n-grams instead of bigrams.
- Word-level mappings: the sum of the logarithms of all conditional probabilities (see Equation (1)) used so far
- Phrase-level mappings: four feature functions, each of which is the sum of the logarithms of one of the four probabilities in the pivoted phrase table, that is, the forward/reverse phrase translation probability and the forward/reverse lexical weighting probability
- Cross-lingual morphological mapping: the sum of the logarithms of all morphological variant mapping scores (see Equation (2)) used so far
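The Malay word penalty can be sketched as follows; `indonesian_bigrams` is an assumed stand-in for the bigram table of the Indonesian language model, and padding with sentence-boundary markers is a simplifying assumption:

```python
def malay_word_penalty(tokens, indonesian_bigrams):
    """Count words judged 'Malay': a word w is considered Malay if neither
    the bigram (w_prev, w) nor (w, w_next) occurs in the Indonesian LM.
    indonesian_bigrams is a set of (word, word) pairs."""
    padded = ["<s>"] + tokens + ["</s>"]  # simplifying boundary assumption
    count = 0
    for i in range(1, len(padded) - 1):
        left = (padded[i - 1], padded[i])
        right = (padded[i], padded[i + 1])
        if left not in indonesian_bigrams and right not in indonesian_bigrams:
            count += 1
    return count
```

Because the check for position i looks one token ahead, the feature needs the completed sentence, which is exactly why it fits our sentence-level decoder but not a stack-based phrase decoder.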
4.3.5 Model
The text rewriting decoder prunes bad hypotheses based on their log-linear model score, score(h); it also selects as the best hypothesis the one with the highest score(h) across all beams.
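A minimal sketch of this log-linear scoring and beam pruning, with each hypothesis represented as a dictionary from feature name to feature value (an illustrative representation, not our actual implementation):

```python
import heapq

def score(hypothesis_features, weights):
    """Log-linear model: score(h) = sum_i weight_i * feature_i(h)."""
    return sum(weights[name] * value
               for name, value in hypothesis_features.items())

def prune_beam(hypotheses, weights, beam_size):
    """Keep only the beam_size highest-scoring hypotheses."""
    return heapq.nlargest(beam_size, hypotheses,
                          key=lambda h: score(h, weights))
```

The best overall hypothesis is then simply the highest-scoring survivor across all beams.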
4.4 Combining Bitexts
We have presented our source language adaptation approaches in Sections 4.1, 4.2, and 4.3. Now we explain how we combine the Indonesian–English bitext with the synthetic “Indonesian”–English bitext we have generated. We consider the following three bitext combination approaches:
Simple concatenation. Assuming the two bitexts are of comparable quality, we simply train an SMT system on their concatenation.
Balanced concatenation with repetitions. The two bitexts are not directly comparable. For one thing, “Indonesian”–English is obtained from n-best lists, that is, it has exactly n very similar variants for each Malay sentence. Moreover, the original Malay–English bitext is much larger than the Indonesian–English one, and it has now further expanded n times to become “Indonesian”–English, which means it would heavily dominate the concatenation. To counterbalance this, we repeat the smaller Indonesian–English bitext enough times to make its number of sentences roughly the same as that of “Indonesian”–English; then we concatenate them and train an SMT system on the resulting bitext.
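The balancing step can be sketched as follows, assuming each bitext is simply a list of sentence pairs:

```python
def balanced_concatenation(small_bitext, large_bitext):
    """Repeat the smaller bitext so that its sentence count roughly matches
    the larger one, then concatenate. Each bitext is a list of
    (source_sentence, target_sentence) pairs."""
    repetitions = max(1, round(len(large_bitext) / len(small_bitext)))
    return small_bitext * repetitions + large_bitext
```

The rounding to a whole number of copies is one simple convention; any repetition factor that makes the two sides roughly equal in size would serve the same purpose.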
Sophisticated phrase table combination. Finally, we experiment with a method for combining phrase tables proposed in Nakov and Ng (2009, 2012). The first phrase table is extracted from word alignments for the balanced concatenation with repetitions, which are then truncated so that they are kept for only one copy of the Indonesian–English bitext. The second table is built from the simple concatenation. The two tables are then merged as follows: All phrase pairs from the first one are retained, and to them are added those phrase pairs from the second one that are not present in the first one. Each phrase pair retains its original scores, which are further augmented with 1–3 extra feature scores indicating its origin: The first/second/third feature is 1 if the pair came from the first/second/both table(s), and 0 otherwise. We experiment using all three, the first two, or the first feature only; we also try setting the features to 0.5 instead of 0. This makes six combinations (0, 00, 000, .5, .5.5, .5.5.5); on testing, we use the one that achieves the highest BLEU score on the development set.
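The merging scheme can be sketched as follows. Phrase tables are modeled as dictionaries from a phrase pair to its list of scores, and `absent_value` plays the role of the 0 vs. 0.5 choice above; this is a simplified sketch of the method of Nakov and Ng (2009, 2012), not their actual code:

```python
def combine_phrase_tables(table1, table2, absent_value=0.0):
    """Keep all entries of table1; add entries of table2 absent from table1.
    Each retained entry keeps its original scores, augmented with three
    origin indicators: came-from-first, came-from-second, came-from-both."""
    merged = {}
    for pair, scores in table1.items():
        in_second = 1.0 if pair in table2 else absent_value
        # A pair in table1 came from the first table; if it is also in
        # table2, it came from the second table and from both.
        merged[pair] = scores + [1.0, in_second, in_second]
    for pair, scores in table2.items():
        if pair not in merged:
            merged[pair] = scores + [absent_value, 1.0, absent_value]
    return merged
```

Using only the first one or two indicators, as described above, amounts to truncating the appended feature list before tuning.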
Other possibilities for combining the phrase tables include using alternative decoding paths (Birch, Osborne, and Koehn 2007), simple linear interpolation, and direct merging with extra features (Callison-Burch, Koehn, and Osborne 2006); they were previously found inferior to the last two approaches above (Nakov and Ng 2009, 2012).
5. Experiments
With a small Indonesian–English bitext and a larger Malay–English bitext, we use three approaches for source language adaptation to adapt the Malay side of the Malay–English bitext to look like Indonesian, thus obtaining a synthetic “Indonesian”–English bitext. With the synthetic bitext, we run two kinds of experiments:
- isolated, where we train an SMT system on the synthetic “Indonesian”–English bitext only
- combined, where we combine the synthetic bitext with the original Indonesian–English bitext
In all experiments, we use the same Indonesian–English development set for tuning, and the same Indonesian–English test set for evaluation; see below.
5.1 Data Sets
In our experiments, we use the following data sets, which are required for Indonesian–English SMT:
- Indonesian–English training bitext (IN2EN): 28,383 sentence pairs; 915,192 English tokens; 796,787 Indonesian tokens
- Indonesian–English dev bitext (IN2EN-dev): 2,000 sentence pairs; 37,101 English tokens; 35,509 Indonesian tokens
- Indonesian–English test bitext (IN2EN-test): 2,018 sentence pairs; 36,584 English tokens; 35,708 Indonesian tokens
- Monolingual English text (EN-LM): 174,443 sentences; 5,071,988 English tokens
We also use a Malay–English set, to be adapted to “Indonesian”–English, and monolingual Indonesian text for building an Indonesian language model:
- Malay–English training bitext (ML2EN): 290,000 sentence pairs; 8,638,780 English tokens; 8,061,729 Malay tokens
- Monolingual Indonesian text (IN-LM): 1,132,082 sentences; 20,452,064 Indonesian tokens
We use two bitexts (IN2EN and ML2EN) to induce word-level and phrase-level paraphrases as described in Sections 4.1.1 and 4.2.1, respectively. Moreover, in Section 4.1.3, we use a large monolingual Indonesian corpus, IN-LM, in order to induce Indonesian morphological variants for a Malay word. We built all these monolingual and bilingual data sets from texts we crawled from the Internet.
We further needed a Malay–Indonesian development bitext in order to tune the phrase-based SMT decoder in the phrase-level paraphrasing approach of Section 4.2.1, and our source language adaptation decoder of Section 4.3. We created this bitext synthetically: We translated the English side of the IN2EN-dev into Malay using Google Translate,3 and we paired this translated Malay with the Indonesian side of IN2EN-dev:
- Synthetic Malay–Indonesian dev bitext (ML2IN-dev): 2,000 sentence pairs; 34,261 Malay tokens; 35,509 Indonesian tokens
5.2 Baseline Systems
We built five baseline systems – two using a single bitext, ML2EN or IN2EN, and three combining ML2EN and IN2EN, using simple concatenation, balanced concatenation, and sophisticated phrase table combination. The last combination is a very strong baseline and the most relevant one that we need to improve upon.
We built each SMT system as follows. Given a training bitext, we built directed word alignments using IBM model 4 (Brown et al. 1993) for both directions, and we combined them using the intersect+grow heuristic (Och and Ney 2003). Based on these alignments, we extracted phrase translation pairs of length up to seven, and we scored them to build a phrase table, where each phrase pair has five features (Koehn 2013): forward and reverse translation probabilities, forward and reverse lexicalized phrase translation probabilities, and a phrase penalty. We further used a 5-gram language model trained using the SRILM toolkit (Stolcke 2002) with modified Kneser-Ney smoothing (Kneser and Ney 1995). We combined all features in a log-linear model, namely: (1) the five features in the phrase table, (2) a language model score, (3) a word penalty, that is, the number of words in the output translation, and (4) distance-based reordering cost.
We tuned the weights of these features by optimizing BLEU (Papineni et al. 2002) on the development set IN2EN-dev using MERT (Och 2003), and we used them for translation with the phrase-based SMT decoder of Moses.
We evaluated all systems on the same test set, IN2EN-test.
5.3 Isolated Experiments
In the isolated experiments, we train the SMT system on the adapted “Indonesian”–English bitext only, which allows for a direct comparison to using ML2EN or IN2EN only.
5.3.1 Using Word-Level Paraphrases
In our word-level paraphrasing experiments, we adapted Malay to Indonesian using three kinds of confusion networks (CN) (see Section 4.1.3 for details):
- CN:word – using word-level pivoting only
- CN:word′ – using word-level pivoting, with probabilities from word alignments for IN2EN that were improved using ML2EN
- CN:word′+morph – CN:word′ further augmented with cross-lingual morphological variants
There are two parameters to tune on IN2EN-dev for the above confusion networks: (1) the minimum pivoting probability threshold for the Malay–Indonesian word-level paraphrases, and (2) the number of n-best Indonesian-adapted sentences that are to be generated for each input Malay sentence. We try {0.001, 0.005, 0.01, 0.05} for the threshold and {1, 5, 10} for n.
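This tuning amounts to an exhaustive grid search over the two parameters. In the sketch below, `build_and_score` is a hypothetical callback that adapts the bitext with the given parameters, trains an SMT system, and returns its BLEU score on the development set:

```python
from itertools import product

def tune_cn_parameters(build_and_score,
                       thresholds=(0.001, 0.005, 0.01, 0.05),
                       n_best_sizes=(1, 5, 10)):
    """Grid search over (pivoting probability threshold, n-best size).
    build_and_score(threshold, n) is an assumed callback returning dev BLEU;
    the best-scoring parameter pair is returned."""
    return max(product(thresholds, n_best_sizes),
               key=lambda params: build_and_score(*params))
```

With 4 thresholds and 3 n-best sizes, this trains and evaluates 12 systems per configuration, which is feasible because each adapted bitext is built once per parameter pair.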
5.3.2 Using Phrase-Level Paraphrases
In our phrase-level paraphrasing experiments, we used pivoted phrase tables (PPT) with the following features for each phrase table entry (in addition to the phrase penalty; see Section 4.2 for more details):
- PPT:phrase1 – using only the forward conditional translation probability
- PPT:phrase4 – using all four conditional probabilities
- PPT:phrase4::CN:morph – PPT:phrase4 with a cross-lingual morphological confusion network for the dev/test Indonesian sentences
Here we tune one parameter only: the number of n-best Indonesian-adapted sentences to be generated for each input Malay sentence; we try {1, 5, 10}. We tune the phrase-level paraphrasing systems on ML2IN-dev.
5.3.3 Using a Text Rewriting Decoder
For our text rewriting decoder (DD), we conducted four experiments with different hypothesis producers (see Section 4.3.3 for more details):
- DD:word′ – using only one hypothesis producer, word-level mapping, whose dictionary contains word-level pivoting paraphrases with probabilities from word alignments for IN2EN that were improved using ML2EN
- DD:word′+morph – adding one more hypothesis producer, a cross-lingual morphological mapping hypothesis producer, which uses a dictionary of cross-lingual morphological variants
- DD:phrase4 – using only one phrase-level mapping hypothesis producer, which uses the same pivoted phrase table as PPT:phrase4
- DD:phrase4+morph – DD:phrase4 with a cross-lingual morphological mapping hypothesis producer, as in DD:word′+morph
For the first two (word-based) experiments, we tuned the two parameters from Section 5.3.1 on IN2EN-dev. For the last two (phrase-based) experiments, we only needed to tune the second of the two parameters. We tried the same parameter values as in Section 5.3.1. We tuned the log-linear model of the text rewriting decoder on ML2IN-dev.
We also tried using the word-level and the phrase-level mapping hypothesis producers together, but this performed about the same as the phrase-level mapping hypothesis producer alone. This may be because the two mappings are extracted by pivoting from the word alignments of the same Malay–English and Indonesian–English bitexts. Thus, we can expect that the phrase-level mapping already contains most, if not all, of the word-level mapping.
5.4 Combined Experiments
These experiments assess the impact of our source-language-adapted bitext when combined with the original Indonesian–English bitext IN2EN, as opposed to combining ML2EN with IN2EN as was done in the last three baselines above. We experimented with the same three combinations: (1) simple concatenation, (2) balanced concatenation, and (3) sophisticated phrase table combination. We tuned the parameters as before; for the last combination, we further had to tune the extra phrase table features (see Section 4.4 for details).
6. Results and Discussion
In this section, we present the results of our experiments. In all tables, statistically significant improvements (p < 0.01), according to Collins, Koehn, and Kučerová's (2005) sign test, over the baseline are in bold; in case of two baselines, we use underline for the second baseline.
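The sign test used here reduces to a two-sided binomial test over the test sentences on which the two systems' translations differ in quality; the sketch below follows one common convention of discarding ties, a simplifying assumption:

```python
from math import comb

def sign_test(wins, losses):
    """Two-sided sign test p-value. Under the null hypothesis, each
    non-tied sentence is a win for either system with probability 1/2;
    tied sentences are discarded before calling this function."""
    n = wins + losses
    k = max(wins, losses)
    # Probability of an outcome at least as extreme as the observed one.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

For example, a 9:1 win:loss split over ten differing sentences gives p ≈ 0.021, which would count as significant at p < 0.05 but not at the p < 0.01 level used in our tables.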
6.1 Baseline Experiments
The results for the baseline systems are shown in Table 3. We can see that training on ML2EN instead of IN2EN yields over 4 points absolute drop in BLEU (Papineni et al. 2002) score, even though ML2EN is about 10 times larger than IN2EN and both bitexts are from the same domain. This confirms the existence of important differences between Malay and Indonesian. Simple concatenation does not help, but balanced concatenation with repetitions improves by 1.12 BLEU points over IN2EN, which shows the importance of giving IN2EN a proper weight in the combined bitext. This is further reconfirmed by the sophisticated phrase table combination, which yields an additional absolute gain of 0.31 BLEU points.
| System | BLEU |
|---|---|
| ML2EN | 14.50 |
| IN2EN | 18.67 |
| Simple concatenation | 18.49 |
| Balanced concatenation | 19.79 |
| Sophisticated phrase table combination | 20.10 (.5.5) |
6.2 Isolated Experiments
Table 4 shows the results for the isolated experiments. We can see that word-level paraphrasing (CN:*) improves by up to 5.56 and 1.39 BLEU points over the two baselines (both results are statistically significant). Compared with ML2EN, CN:word yields an absolute improvement of 4.41 BLEU points, CN:word′ adds another 0.59, and CN:word′+morph adds 0.56 more. The scores for TER (v. 0.7.25) (Snover et al. 2006) and METEOR (v. 1.3) (Banerjee and Lavie 2005) are on par with those for BLEU (NIST v. 13).
| | System | 1-gr. | 2-gr. | 3-gr. | 4-gr. | BLEU | TER | METEOR |
|---|---|---|---|---|---|---|---|---|
| | ML2EN (baseline) | 48.34 | 19.22 | 9.54 | 4.98 | 14.50 | 67.14 | 43.28 |
| | IN2EN (baseline) | 55.04 | 23.90 | 12.87 | 7.18 | 18.67 | 61.99 | 54.34 |
| (i) | CN:word | 54.50 | 24.41 | 13.09 | 7.35 | 18.91 | 61.94 | 51.07 |
| | CN:word′ | 55.05 | 25.09 | 13.60 | 7.69 | 19.50 | 61.25 | 51.97 |
| | CN:word′+morph | 55.97 | 25.73 | 14.06 | 7.99 | 20.06 | 60.31 | 55.65 |
| (ii) | PPT:phrase1 | 55.11 | 25.04 | 13.66 | 7.80 | | 60.92 | 51.93 |
| | PPT:phrase4 | 56.64 | 26.20 | 14.53 | 8.40 | | 59.33 | 54.23 |
| | PPT:phrase4::CN:morph | 56.91 | 26.53 | 14.76 | 8.55 | | 59.30 | 57.19 |
| (iii) | DD:word′ | 56.57 | 26.15 | 14.39 | 8.18 | | 59.33 | 56.66 |
| | DD:word′+morph | 56.74 | 26.22 | 14.41 | 8.18 | | 59.50 | 56.89 |
| | DD:phrase4 | 57.14 | 26.49 | 14.72 | 8.49 | 20.85 | 58.79 | 57.33 |
| | DD:phrase4+morph | 57.35 | 26.71 | 14.92 | 8.63 | 21.07 | 58.55 | 57.53 |
| | System combination: (i)+(ii)+(iii) | 58.46 | 27.64 | 15.46 | 9.07 | 21.76 | 57.26 | 58.04 |
Table 4 further shows that the optimal parameters for the word-level systems involve a very low probability cut-off, and a high number of n-best sentences. This indicates that they are robust to noise, probably because bad source-side phrases are unlikely to match the test-time input. Note also the effect of repetitions: Good word choices are shared by many n-best sentences, and thus have higher probability.
The gap between ML2EN and IN2EN for unigram precision could be explained by vocabulary differences between Malay and Indonesian. Compared with IN2EN, all CN:* models have higher 2/3/4-gram precision. However, CN:word has lower unigram precision, which could be due to bad word alignments, as the results for CN:word′ show.
When morphological variants are further added, the unigram precision improves by almost 1 point over CN:word′. This shows the importance of morphology for overcoming the limitations of the small Indonesian vocabulary of the IN2EN bitext.
The second part of Table 4 shows that the phrase-level paraphrasing approach (PPT:*) performs a bit better than word-level paraphrasing. This confirms the importance of modeling context for closely related languages like Malay and Indonesian, which are rich in false friends and partial cognates.
We further see that using more scores in the pivoted phrase table is better. Extending the Indonesian vocabulary with cross-lingual morphological variants is still helpful, though not as much as at the word-level.
The third part of Table 4 shows that the text rewriting decoder (DD:*) performs even better: It increases the improvements further, up to 6.57 and 2.40 absolute BLEU points over the two baselines (statistically significant).
Finally, the combination of the output of the best PPT, CN, and DD systems using MEMT (Heafield and Lavie 2010) yields even further gains, which shows that the three approaches are somewhat complementary. The best BLEU score for our isolated experiments is 21.76, which is already better than all five baselines in Table 3, including the three bitext combination baselines, which only achieve up to 20.10.
6.3 Combined Experiments
Table 5 shows the performance of the three bitext combination strategies (see Section 4.4 for details) when applied to combine IN2EN with the original ML2EN (i), and with various adapted versions of ML2EN (ii–iv).
| | Combination with | Simple Concatenation | Balanced Concatenation | Sophisticated Combination |
|---|---|---|---|---|
| (i) | + ML2EN (unadapted; baseline) | 18.49 | 19.79 | 20.10 (.5.5) |
| (ii) | + CN:word | | | |
| | + CN:word′ | | | |
| | + CN:word′+morph | | | |
| (iii) | + PPT:phrase1 | | | |
| | + PPT:phrase4 | | | |
| | + PPT:phrase4::CN:morph | | | |
| (iv) | + DD:word′ | | | |
| | + DD:word′+morph | | | |
| | + DD:phrase4 | | | |
| | + DD:phrase4+morph | | | |
| | System combination: (i)+(ii)+(iii)+(iv) | 21.74 (+3.25) | 21.81 (+2.02) | 22.03 (+1.93) |
We can see that for the word-level paraphrasing experiments (CN:*), all combinations except CN:word perform significantly better than their corresponding baselines, but the improvements are most sizeable for simple concatenation. Note that whereas there is a difference of 0.31 BLEU points between the balanced concatenation and the sophisticated combination for the original ML2EN, they differ little for the adapted versions. This is probably due to the sophisticated combination assuming that the second bitext is worse than the first one, which is not really the case for the adapted versions: As Table 4 shows, they all outperform IN2EN.
Overall, phrase-level paraphrasing (PPT:*) performs a bit better than word-level paraphrasing, and they are both outperformed by the text rewriting decoder (DD:*). Finally, system combination with MEMT yields even further gains. These results are consistent with those for the isolated experiments.
7. Further Analysis
In this section, we perform a more in-depth analysis of the obtained results.
7.1 Paraphrasing Non-Indonesian Words Only
In the CN:* experiments, we paraphrased each word in the Malay input. This was motivated by the existence of false friends such as polisi and of partial cognates such as nanti. However, doing so also risks proposing worse alternatives, for example, changing beliau (‘he’, respectful) to ia (‘he’, casual), which the weights on the confusion network edges and the language model would not always handle properly. Thus, we tried paraphrasing non-Indonesian words only, that is, those not in IN-LM. Because IN-LM occasionally contains some Malay-specific words, we also tried paraphrasing words that occur at most t times in IN-LM. Table 6 shows that this can yield a loss of up to 1 BLEU point for t = 0 and t = 10, and a bit less for t = 20 and t = 40.
| System | BLEU |
|---|---|
| CN:word, t = 0 | 17.88 (0.01, 5-best) |
| CN:word, t = 10 | 17.88 (0.05, 10-best) |
| CN:word, t = 20 | 18.14 (0.01, 5-best) |
| CN:word, t = 40 | 18.34 (0.01, 5-best) |
| CN:word (i.e., paraphrase all) | 18.91 (0.005, 10-best) |
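The selection rule is a simple frequency cut-off. In the sketch below, `in_lm_counts` is an assumed word-count table derived from the monolingual IN-LM corpus:

```python
def words_to_paraphrase(tokens, in_lm_counts, t):
    """Select which input words to paraphrase: only those occurring at most
    t times in the monolingual Indonesian corpus. t = 0 means paraphrasing
    only words never seen in IN-LM; all other words are left unchanged."""
    return [w for w in tokens if in_lm_counts.get(w, 0) <= t]
```

Raising t widens the set of paraphrased words, approaching the paraphrase-all setting of CN:word as t grows.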
7.2 Manual Evaluation
We asked a native Indonesian speaker who does not speak Malay to judge, for 100 random sentences, whether our “Indonesian” adaptations are more understandable to him than the original Malay input. We used two extremes: the conservative CN:word, t = 0 vs. CN:word′+morph. Because the latter is noisy, its top three choices were judged. Table 7 shows that CN:word, t = 0 is better than/equal to the original 53%/31% of the time. Thus, it is a very good step in the direction of turning Malay into Indonesian. In contrast, CN:word′+morph is typically worse than the original; moreover, the adaptations at rank 2 are judged a bit better than those at rank 1, and even for the best in the top 3, the better:worse ratio is only 45%:43%. Still, this latter model works better for SMT, which means that phrase-based SMT systems are robust to noise and prefer more variety over better translations in the training bitext. That is, humans usually want high precision, whereas what the downstream SMT system really needs is high recall. Note also that the judgments were at the sentence level, whereas phrases are sub-sentential, that is, many good phrases can be extracted from a “bad” sentence. For example, CN:word′+morph adapted perisian navigasi kereta 3D di pasaran Malaysia menjelang akhir tahun (‘3D car navigation software hits Malaysia by year-end’) to the following three versions (changes are italicized):
- pertama kali *mobil* 3D di *pasar* Malaysia *pada* akhir tahun
- *lunak* navigasi *mobil* 3D di *pasar* Malaysia *pada* akhir tahun
- *perangkat* navigasi *mobil* 3D di *pasar* Malaysia *pada* akhir tahun
| System | Better | Equal | Worse |
|---|---|---|---|
| CN:word, t = 0 (Rank 1) | 53% | 31% | 16% |
| CN:word′+morph (Rank 1) | 38% | 8% | 54% |
| CN:word′+morph (Rank 2) | 41% | 9% | 50% |
| CN:word′+morph (Rank 3) | 32% | 11% | 57% |
| CN:word′+morph (Ranks 1–3) | 45% | 12% | 43% |
7.3 Reversed Adaptation
In all these experiments, we were adapting the Malay sentences to look like Indonesian. Here we try to reverse the direction of adaptation, that is, to adapt Indonesian to Malay: We thus built an Indonesian-to-Malay confusion network for each dev/test Indonesian sentence using word-level paraphrases extracted with the method of Section 4.1.1. We then use the confusion network as an input to a Malay–English SMT system trained on the ML2EN data set. We tried two variations of this idea:
- lattice: Use the Indonesian-to-Malay confusion networks directly as input to the ML2EN SMT system; that is, tune a log-linear model using confusion networks for the source side of the IN2EN-dev data set, and then evaluate the tuned system using confusion networks for the source side of the IN2EN-test data set.
- 1-best: Decode the Indonesian-to-Malay confusion networks for the source side of IN2EN-dev and IN2EN-test with a Malay language model (trained on 41,842,640 Malay tokens in the same domain as the ML2EN data set) to get the 1-best outputs. Then pair each 1-best output with the corresponding English sentence, which yields an adapted “Malay”–English development set and an adapted “Malay”–English test set, used to tune and evaluate the ML2EN SMT system.
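The 1-best decoding of a confusion network can be sketched as a Viterbi search over its columns; `bigram_logprob` is an assumed stand-in for the Malay language model interface, and a bigram context is a simplifying assumption:

```python
def decode_confusion_network(columns, bigram_logprob):
    """Viterbi 1-best path through a confusion network. Each column is a
    list of (word, edge_logprob) alternatives; bigram_logprob(prev, word)
    scores the language-model transition between adjacent words."""
    # best[word] = (score of best path ending in word, that path)
    best = {"<s>": (0.0, [])}
    for column in columns:
        new_best = {}
        for word, edge_lp in column:
            for prev, (prev_score, path) in best.items():
                cand = prev_score + edge_lp + bigram_logprob(prev, word)
                if word not in new_best or cand > new_best[word][0]:
                    new_best[word] = (cand, path + [word])
        best = new_best
    return max(best.values())[1]  # path with the highest final score
```

This commits to a single output per sentence, which is exactly the limitation of the 1-best variation noted below.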
Table 8 shows that both variations perform worse than CN:word. We believe this is because lattice encodes many options, but does not use a Malay language model, and 1-best uses a Malay language model, but has to commit to 1-best. In contrast, CN:word uses both n-best outputs and an Indonesian language model. Designing a similar set-up for reversed adaptation is a research direction that we would like to pursue in future work, since the two reversed adaptation approaches have some advantages over the three adaptation approaches proposed in Section 4; for example, the reversed approaches could be more efficient.
| System | BLEU |
|---|---|
| CN:word (Malay→Indonesian) | 18.91 (0.005, 10-best) |
| CN:word (Indonesian→Malay) – lattice | 17.22 (0.05) |
| CN:word (Indonesian→Malay) – 1-best | 17.77 (0.001) |
7.4 Adapting Bulgarian to Macedonian to Help Macedonian–English Translation
In order to show the applicability of our framework to other closely related languages and other domains, we experimented with Macedonian (MK) and Bulgarian (BG), using data from a different, non-newswire domain: the OPUS corpus of movie subtitles (Tiedemann 2009). We used data sets of sizes that are comparable to those in the previous Malay–Indonesian experiments: 160K MK2EN and 1.5M BG2EN sentence pairs (1.2M and 11.5M English words). Because the sentences of movie subtitles were short, we used 10K MK2EN sentence pairs for tuning and testing (77K and 72K English words), respectively. For language modeling, we used 9.2M Macedonian and 433M English words.
Table 9 shows that all three approaches (CN:*, PPT:*, and DD:*) outperform the balanced concatenation with unadapted BG2EN. Moreover, system combination with MEMT improves even further. This indicates that our approach can work for other pairs of closely related languages and even for other domains.
| System | BLEU | TER | METEOR |
|---|---|---|---|
| BG2EN (baseline) | 24.57 | 57.64 | 41.60 |
| MK2EN (baseline) | 26.46 | 54.55 | 46.15 |
| *Balanced concatenation of MK2EN with an adapted BG2EN:* | | | |
| + BG2EN (unadapted) | 27.33 | 54.61 | 48.16 |
| + CN:word′+morph | 27.97 | 54.08 | 49.65 |
| + PPT:phrase4::CN:morph | 28.38 | 53.35 | 48.21 |
| + DD:phrase4+morph | 28.44 | 53.51 | 50.95 |
| Combining last four | 29.35 | 51.83 | 51.63 |
We should note that the improvements here are less sizeable than those for Malay–Indonesian adaptation. This may be because our monolingual Macedonian data set is much smaller than the monolingual Indonesian data set (10M Macedonian vs. 20M Indonesian words). Also, our monolingual Macedonian data set is quite noisy: it contains many optical character recognition errors, typos, concatenated words, and even some Bulgarian text. Moreover, Macedonian and Bulgarian are arguably somewhat more dissimilar than Malay and Indonesian, as can be seen in Table 1.
7.5 Improving the Readability of the Adapted Bitext
We added the two readability features, the type/token ratio (ttr) and the OVIX word variation index (ovix), to the best isolated systems (DD:phrase4 and DD:phrase4+morph) in Table 4. The results are shown in Table 10, where we can see that the two features yield slightly lower BLEU scores. This is similar to what Hardmeier et al. (2013) observed: improving readability may result in a lower BLEU score, as simple texts are unlikely to match complicated reference translations, especially if the references were not produced with high readability in mind in the first place.
| System | 1-gr. | 2-gr. | 3-gr. | 4-gr. | BLEU |
|---|---|---|---|---|---|
| ML2EN (baseline) | 48.34 | 19.22 | 9.54 | 4.98 | 14.50 |
| IN2EN (baseline) | 55.04 | 23.90 | 12.87 | 7.18 | 18.67 |
| DD:phrase4 | 57.14 | 26.49 | 14.72 | 8.49 | 20.85 (10-best) |
| DD:phrase4+morph | 57.35 | 26.71 | 14.92 | 8.63 | 21.07 (10-best) |
| DD:phrase4+ttr | 56.75 | 26.18 | 14.53 | 8.38 | 20.63 (10-best) |
| DD:phrase4+morph+ttr | 57.20 | 26.52 | 14.75 | 8.47 | 20.86 (10-best) |
| DD:phrase4+ovix | 57.05 | 26.30 | 14.59 | 8.39 | 20.70 (10-best) |
| DD:phrase4+morph+ovix | 57.12 | 26.44 | 14.64 | 8.39 | 20.75 (5-best) |
7.6 Our Text Rewriting Decoder vs. Phrase-Level Paraphrasing
The results of our experiments show that phrase-level paraphrasing outperformed word-level paraphrasing, and they were both outperformed by the text rewriting decoder. Here, we discuss the differences between our text rewriting decoder and using phrase-level paraphrasing with a standard SMT phrase-based decoder like Moses:
- The standard phrase-based SMT decoder works at the phrase level, whereas our text rewriting decoder works at the sentence level, which allows it to make use of sentence-level features (e.g., the readability features in Section 7.5).
- Because of the general framework of our text rewriting decoder presented in Section 4.3, it can use a broader range of feature functions (e.g., a Malay word penalty, which would be hard to integrate in an SMT decoder, as discussed in Section 4.3.4).
- Adding the cross-lingual morphological variants to the text rewriting decoder is more straightforward: they are simply another hypothesis producer. In contrast, in the phrase-level paraphrasing approach, we had to transform the sentences in the development and the test sets into confusion networks containing the additional morphological variants; alternatively, we could have modified the phrase tables to include the morphological variants.
- The text rewriting decoder can easily use rule-based hypothesis producers; for example, the number adaptation discussed in Section 4.3.3 can be added to the decoder as a hypothesis producer. (It could also be implemented using XML markup in the Moses SMT decoder (Koehn 2013).)
Ultimately, the greatest strength of our decoder is its flexibility. It provides access to a wide space of feature functions and hypothesis producers, and allows us to easily test many different ideas. Furthermore, because the original input sentence can itself be a valid hypothesis, the setting of evaluating rewrites is a natural fit for our problem.
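To make the flexibility argument concrete, the core loop of such a sentence-level rewriting decoder can be sketched as follows. This is a minimal illustration, not our actual implementation: hypothesis producers propose whole-sentence rewrites, sentence-level feature functions score them, and a beam keeps the best candidates; the original sentence always remains a valid hypothesis. The Malay/Indonesian paraphrase dictionary and lexicon below are toy examples.

```python
import heapq

def rewrite(sentence, producers, features, weights, beam_size=10, iterations=3):
    """Beam search over whole-sentence rewrites.

    producers: callables sentence -> list of candidate rewrites
    features:  callables sentence -> float (sentence-level scores)
    The unmodified input sentence always stays in the candidate pool.
    """
    def score(sent):
        return sum(w * f(list(sent)) for w, f in zip(weights, features))

    beam = {tuple(sentence)}
    for _ in range(iterations):
        candidates = set(beam)
        for hyp in beam:
            for produce in producers:
                for candidate in produce(list(hyp)):
                    candidates.add(tuple(candidate))
        beam = set(heapq.nlargest(beam_size, candidates, key=score))
    return list(max(beam, key=score))

def substitution_producer(paraphrases):
    """Hypothesis producer: substitute one word at a time (toy example)."""
    def produce(sent):
        return [sent[:i] + [p] + sent[i + 1:]
                for i, w in enumerate(sent)
                for p in paraphrases.get(w, [])]
    return produce

# Toy Malay -> Indonesian word paraphrases and an "Indonesian" lexicon
paraphrases = {"bila": ["kapan"], "awak": ["kamu"]}
lexicon = {"kapan", "kamu", "pergi"}
in_lexicon = lambda sent: sum(w in lexicon for w in sent)  # a sentence-level feature

print(rewrite(["bila", "awak", "pergi"],
              [substitution_producer(paraphrases)],
              [in_lexicon], weights=[1.0]))  # ['kapan', 'kamu', 'pergi']
```

Any callable with the producer or feature signature can be dropped in, which is exactly what makes rule-based producers (such as number adaptation) and penalties (such as a Malay word penalty) easy to add.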
8. Conclusion and Future Work
We have presented work on improving machine translation for a resource-poor language by making use of resources for a related resource-rich language. This is an important line of research because most world languages remain resource-poor for machine translation, while many, if not most, of them are actually related to some resource-rich language(s). We have proposed three approaches, which all adapt a bitext for a related resource-rich language to get closer to the resource-poor one: (1) word-level paraphrasing using confusion networks, (2) phrase-level paraphrasing using pivoted phrase tables, and (3) adaptation using a specialized text rewriting decoder.
More precisely, assuming a large RICH–TGT bitext for a resource-rich language and a small POOR–TGT bitext for a related resource-poor language, we use one of the three proposed approaches to adapt the RICH side of the RICH–TGT bitext to get closer to POOR, thus obtaining a synthetic “POOR”–TGT bitext, which we then combine with the original POOR–TGT bitext to improve the translation from POOR to TGT.
Using a large bitext for the resource-rich Malay–English language pair and a small bitext for the resource-poor Indonesian–English language pair, and adapting the former to look like the latter, we have achieved substantial improvements over several baselines: (1) +7.26 BLEU points over an unadapted version of the Malay–English bitext, (2) +3.09 BLEU points over the Indonesian–English bitext, and (3) +1.93–3.25 BLEU points over three combinations of the Malay–English and Indonesian–English bitexts. We have thus shown that source-language adaptation of a resource-rich bitext can improve machine translation for a related resource-poor language. Moreover, we have demonstrated the applicability of the general approach to other languages and domains.
The work presented here is of importance for resource-poor machine translation because it provides a useful guideline for people building statistical machine translation systems for resource-poor languages: they can adapt bitexts for related resource-rich languages to the resource-poor language, and then use the adapted bitexts to improve translation from the resource-poor language.
This work leaves several interesting directions for future research:
- One direction is to add more word editing operations, for example, word deletion, insertion, splitting, and concatenation (we mainly focused on word substitution in this study).
- Another promising direction is to add more sentence-level feature functions to the text rewriting decoder to further improve language adaptation.
- Future work could also experiment with other phrase table combination methods; for example, Foster and Kuhn (2007) proposed a mixture model whose weights are learned with an EM algorithm (Foster, Chen, and Kuhn 2013).
- Another direction is to add word reordering. In the current work, we assume no word reordering is necessary (apart from what can be achieved within a phrase), but word-order differences can exist even between closely related languages.
- A further direction is to exploit the relationship between the source and the target sides of the input resource-rich bitext when performing language adaptation, since only the source side was used in our current work. For example, Malay–Indonesian adaptation may benefit from taking into account the English words that a Malay word is aligned to in the Malay–English bitext when adapting that word.
- Another direction is to experiment with other closely related language pairs, for example, the language pairs mentioned in Section 1.
- Finally, further work may apply the language adaptation idea to other linguistic problems, for example, adapting the Malay training data for part-of-speech tagging to “Indonesian” in order to help Indonesian part-of-speech tagging.
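Regarding the phrase table combination direction above, the essence of an EM-learned mixture in the spirit of Foster and Kuhn (2007) can be shown with a tiny sketch. This is an illustrative one-weight version, not the cited implementation: given the probabilities that two component phrase tables assign to held-out phrase pairs, EM learns the interpolation weight that maximizes the held-out likelihood.

```python
def em_mixture_weight(p_a, p_b, iterations=100):
    """Learn the weight lam of the mixture lam*p_a + (1-lam)*p_b by EM,
    maximizing the likelihood of held-out phrase pairs."""
    lam = 0.5
    for _ in range(iterations):
        # E-step: posterior probability that table A generated each pair
        posteriors = [lam * a / (lam * a + (1 - lam) * b)
                      for a, b in zip(p_a, p_b)]
        # M-step: the new weight is the average posterior
        lam = sum(posteriors) / len(posteriors)
    return lam

# Table A explains the held-out pairs much better, so it gets most of the weight
p_a = [0.8, 0.7, 0.9, 0.6]   # P(e|f) from table A for four held-out pairs
p_b = [0.1, 0.2, 0.1, 0.3]   # P(e|f) from table B for the same pairs
print(em_mixture_weight(p_a, p_b))  # close to 1.0: table A dominates
```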
Acknowledgments
We would like to give special thanks to Christian Hadiwinoto, Harta Wijaya, and Aldrian Obaja Muis, native speakers of Indonesian, for their help in the linguistic analysis of the input and output of our system. We would also like to thank the reviewers for their constructive comments and suggestions, which have helped us improve the quality of this article.
This research is supported by the Singapore National Research Foundation under its International Research Centre @ Singapore Funding Initiative and administered by the IDM Programme Office. Some of the results presented in this article were published in Wang, Nakov, and Ng (2012) and in the Ph.D. thesis of the first author (Wang 2013).
Author notes
2225 East Bayshore Road, Suite 200, Palo Alto, CA 94303. E-mail: [email protected]. The work reported in this article was part of the first author's Ph.D. thesis research in the Department of Computer Science, National University of Singapore.
Tornado Tower, floor 10, P.O. 5825, Doha, Qatar. E-mail: [email protected].
13 Computing Drive, Singapore 117417. E-mail: [email protected].