Abstract
In this article we investigate statistical machine translation (SMT) into Germanic languages, with a focus on compound processing. Our main goal is to enable the generation of novel compounds that have not been seen in the training data. We adopt a split-merge strategy, where compounds are split before training the SMT system, and merged after the translation step. This approach reduces sparsity in the training data, but runs the risk of placing translations of compound parts in non-consecutive positions. It also requires a postprocessing step of compound merging, where compounds are reconstructed in the translation output. We present a method for increasing the chances that components that should be merged are translated into contiguous positions and in the right order and show that it can lead to improvements both by direct inspection and in terms of standard translation evaluation metrics. We also propose several new methods for compound merging, based on heuristics and machine learning, which outperform previously suggested algorithms. These methods can produce novel compounds and a translation with at least the same overall quality as the baseline. For all subtasks we show that it is useful to include part-of-speech based information in the translation process, in order to handle compounds.
1. Introduction
In many languages including most of the Germanic (German, Swedish, etc.) and Uralic (Finnish, Hungarian, etc.) language families, so-called closed compounds are used productively. Closed compounds are written as single words without spaces or other word boundaries, as in German Goldring. We will refer to these languages as compounding languages. In English, on the other hand, compounds are generally open, that is, written as two words as in gold ring.
This difference in compound orthography leads to problems for statistical machine translation (SMT). For translation into a compounding language, often fewer compounds than in normal texts are produced. This can be due to the fact that the desired compounds are missing in the training data or that they have not been aligned correctly. When a compound is the idiomatic word choice in the translation, systems often produce separate words, genitive or other alternative constructions, or translate only one part of the compound. For an SMT system to cope with the productivity of the phenomenon, any effective strategy should be able to correctly process compounds that have never been seen in the training data as such, although possibly their components have, either in isolation or within a different compound.
Previous work (e.g., Koehn and Knight 2003) has shown that compound splitting improves translation from compounding languages into English. In this article we explore several aspects of the less-researched area of compound treatment for translation into such languages, using three Germanic languages (German, Swedish, and Danish) as examples.1 The assumption is that splitting compounds will also improve translation for this translation direction and lead to more natural translations. The strategy we adopt is to split compounds in the training data, and to merge them in the translation output. Our overall goal is to improve translation quality by productively generating compounds in SMT systems.
The main contributions of the article are as follows:
Demonstrating improved coalescence (adjacency and order) of compound parts in translation through the use of sequence models based on customized part-of-speech sets and count features
Designing and evaluating several heuristic methods for compound merging that outperforms previous heuristic merging methods
Designing and evaluating a novel method for compound merging based on sequence labeling
Demonstrating the ability of these merging methods to generate novel unseen compounds
The rest of the article is structured as follows. Section 2 gives an overview of compound formation in the three target languages used in this work. Section 3 reports related work on compound processing for machine translation. Section 4 describes the compound processing strategy we use and Section 5 describes compound splitting. Section 6 addresses compound coalescence, followed by compound merging in Section 7. In Section 8 we present experimental results and in Section 9 we state our conclusions.
2. Closed Compounds in German, Swedish, and Danish
Compounds in German, Swedish, and Danish are generally closed, written without word boundaries, as exemplified for German in Example (1). Compounds can be made up of two (1a) or more (1b) parts where parts may also be coordinated (1c). In a few cases compounds are written with a hyphen (1d), often when one of the parts is a proper name or an abbreviation. Most compounds are nouns (1a–1e), but they can also be adjectives (1f), verbs (1g), and adverbs (1h). English translations of compounds can be written as open compounds with separate words (1a) or with hyphens (1f), as other constructions, possibly with inserted function words (1g) and reordering (1b), or as single words (1e). Generally the last part of the compound is the compound head, that is, it conveys the main meaning of the compound, and determines its part of speech. The other parts, compound modifiers, modify the meaning of the compound head in some way and need not have the same part of speech as the full compound.
- (1)
a. Regierungskonferenz intergovernmental conference
Regierung+Konferenz government conference
- b.
Friedensnobelpreisträger Nobel Peace Prize laureate
Frieden+Nobel+Preis+Träger peace Nobel prize bearer
- c.
See- und Binnenhäfen sea and inland ports
See- und Binnen+Häfen sea and interior ports
- d.
EU-Mitgliedstaaten EU member states
EU-Mitglied+Staaten EU member states
- e.
Jahrtausend millennium
Jahr+tausend year thousand
- f.
dunkelblau dark-blue
dunkel+blau dark blue
- g.
kennenlernen get to know
kennen+lernen know learn
- h.
grösstenteils in most instances
grössten+teils largest partly
Compound modifiers often have a special form, such as the addition of an “s” to the base form of Regierung in Example (1a). We will refer to these form variants as compounding forms. Table 1 exemplifies the type of operations used in Swedish, German, and Danish to form compounding forms. There are many more alternative compounding forms in Swedish and German than in Danish. For an overview of the possible compounding forms in German see Langer (1998) or Kürschner (2003), in Danish see Kürschner (2003), and in Swedish see Thorell (1981) or Stymne and Holmqvist (2008). Some compounding forms coincide with paradigmatic forms, such as German Jahres that can also be genitive, and Stadien that can also be plural, from Table 1. There are different views on whether to treat these forms as paradigmatic forms or as compounding forms. We follow Langer (1998) in viewing them as compounding forms, because they often do not correspond to plural or possessive semantics. Many individual compound modifiers have more than one possible compounding form. In Example (2) we provide examples of several possible forms of the modifier Kind (child) in German compounds.
- (2)
0 Kind+phase (child-caring period)
+s Kinds+lage (fetal position)
+es Kindes+unterhalt (child support)
+er Kinder+film (children's film)
+- Ein-Kind-Politik (one-child policy)
Type . | Examples . | . |
---|---|---|
Null operation | de0 | umweltfreundlich (environmentally-friendly) |
Umwelt+freundlich (environment friendly) | ||
sv0 | naturkatastrof (natural disaster) | |
natur+katastrof (nature disaster) | ||
da0 | håndbagage (hand luggage) | |
hånd+bagage (hand luggage) | ||
Addition | de +es | Jahreswechsel (turn of the year) |
Jahr+Wechsel (year change) | ||
sv +s | kvalitetstecken (quality mark) | |
kvalitet+tecken(quality sign) | ||
da +e | spillekonsol (game console) | |
spill+konsol(game console) | ||
Deletion | de −e | Lymphreaktion (lymphatic response) |
Lymphe+Reaktion (lymph response) | ||
sv −a | flickskola (girls' school) | |
flicka+skola (girl school) | ||
Combination | de −on/+en | Stadienexperte(stadium expert) |
Stadion+Experte (stadium expert) | ||
sv −e/+s | arbetsolycka (industrial accident) | |
arbete+olycka (work accident) | ||
da −e/+s | embedsmand (civil servant) | |
embede+mand(job man) | ||
Umlaut | de ″+er | Völkerrecht(international law) |
Volk+Recht(people right) | ||
sv ″−er/+ra | brödrakärlek (brotherly love) | |
broder+kärlek (brother love) | ||
da ″+e | børnesko (children's shoe) | |
barn+sko (child shoe) |
Type . | Examples . | . |
---|---|---|
Null operation | de0 | umweltfreundlich (environmentally-friendly) |
Umwelt+freundlich (environment friendly) | ||
sv0 | naturkatastrof (natural disaster) | |
natur+katastrof (nature disaster) | ||
da0 | håndbagage (hand luggage) | |
hånd+bagage (hand luggage) | ||
Addition | de +es | Jahreswechsel (turn of the year) |
Jahr+Wechsel (year change) | ||
sv +s | kvalitetstecken (quality mark) | |
kvalitet+tecken(quality sign) | ||
da +e | spillekonsol (game console) | |
spill+konsol(game console) | ||
Deletion | de −e | Lymphreaktion (lymphatic response) |
Lymphe+Reaktion (lymph response) | ||
sv −a | flickskola (girls' school) | |
flicka+skola (girl school) | ||
Combination | de −on/+en | Stadienexperte(stadium expert) |
Stadion+Experte (stadium expert) | ||
sv −e/+s | arbetsolycka (industrial accident) | |
arbete+olycka (work accident) | ||
da −e/+s | embedsmand (civil servant) | |
embede+mand(job man) | ||
Umlaut | de ″+er | Völkerrecht(international law) |
Volk+Recht(people right) | ||
sv ″−er/+ra | brödrakärlek (brotherly love) | |
broder+kärlek (brother love) | ||
da ″+e | børnesko (children's shoe) | |
barn+sko (child shoe) |
In some cases concatenating two words would lead to three identical consecutive consonants. In the Scandinavian languages, there is a spelling rule that does not allow this, and three identical consonants are reduced to two, as in Example (3SV). This spelling rule was also used for some German compounds before 1996, when it was changed by a spelling reform, so that nowadays three identical consecutive consonants are never reduced to two at compound boundaries in German (Institut für Deutsche Sprache 1998), as shown in Example (3DE).
- (3)
sv tullagstiftning customs legislation
tull+lagstiftning custom legislation
DE Zelllinie cell line
Zell+Linie cell line
Compounding is common and productive; new compounds can be readily formed and understood. This is confirmed in a number of corpus studies. In German, compounds have been shown to make up 5–7% of tokens and 43–47% of types in news text (Baroni, Matiasek, and Trost 2002; Schiller 2005). If function words are removed, an even higher number of the tokens are compounds; in both Swedish and German 10% of the content words in a news text have been found to be compounds (Hedlund 2002). That compounding is productive means that it is likely that a high number of compounds have a very low frequency in texts. Baroni, Matiasek, and Trost (2002) found that 83% of the compounds in a large German news corpus occur less than five times. In Swedish, compounds are the most common type of hapax words, that is, words that occur only once in a text (Carlberger et al. 2005). The most common type of compound is the noun+noun compound, which makes up 62% of the compounds in the German news corpus of Baroni, Matiasek, and Trost.
3. Related Work
The problems arising from differences in compounding strategies in translation from German into English have been addressed by several authors. The most common architecture for translation from German is to split compounds in a preprocessing step prior to training and translation using some automatic method, which has been suggested both for SMT (Nießen and Ney 2000; Koehn and Knight 2003; Popović, Stein, and Ney 2006; Holmqvist, Stymne, and Ahrenberg 2007) and example-based MT (Brown 2002). German compounds are split into their component parts in a preprocessing step and the translation model is then trained between modified German and English. At translation time, the German source text is also run through a compound splitter. In the studies cited here, only one splitting option is given as input to the decoder, which can be problematic in case the splitting is wrong, or if any of the parts are unknown. In Dyer (2009) several splitting options were given to the decoder in the form of a lattice. It is, however, not straightforward to use lattices during training, and in order to solve this, the training corpus was doubled, one part being without splits and the other part having the best splitting option for each word.
For translation into German, Popović, Stein, and Ney (2006) investigated three different strategies for compound processing. The first was to split compounds during training and after translation merge compound parts back into full compounds, the second merged English compounds prior to training instead of splitting German compounds, and the third used compound splitting only to improve word alignment. The split–merge strategy gave the best results, but using splitting only for alignment gave similar results. The merging of English compounds led to an improvement over a baseline without compound processing, but was not as good as the other two strategies. Popović, Stein, and Ney also presented one of few previous suggestions for compound merging. Each word in the translation output was looked up in a list of compound parts, and merged with the next word if it resulted in a known compound. This method led to improved overall translation results from English to German. The drawback of this method is that novel compounds cannot be merged. It might also merge words that should not be merged, but that happen to coincide with known compounds.
Fraser (2009) merged split German compounds after translation from English, by applying a second phrase-based SMT (PBSMT) system trained on German with split compounds and normal German. No separate results were presented for this extension alone, but in combination with other morphological processing the strategy led to worse results than the baseline. This method also suffers from the same drawbacks as that of Popović, Stein, and Ney (2006), that novel compounds cannot be merged and words that should not be merged can still be.
Koehn, Arun, and Hoang (2008) discussed the treatment of hyphenated compounds for translation into German. They used a separate mark-up token for hyphened compounds, where the hyphen was split into a separate token, and marked by a symbol. The impact on the translation result was small.
Botha, Dyer, and Blunsom (2012) discussed the approach of using customized language models to target German compounds. They presented hierarchical Pitman-Yor language models, where the compound head is conditioned on the words preceding the full compound, and the compound modifiers are modeled by a reverse compound language model. They used this model as a replacement of a standard language model for SMT and found that although the perplexity of the language model was reduced, there were only minor improvements on Bleu for the SMT task. In this case the SMT pipeline was left unchanged, meaning that novel compounds were not considered.
Compound merging has also been performed for speech recognition. An example of this is Berton, Fetter, and Regel-Brietzmann (1996), who extended the word graphs output by a German speech recognizer with possible compounds by combining edges of words during a lexical search. The final hypotheses were then identified from the graph using dynamic programming techniques. Compound merging for speech recognition is a somewhat different problem than for machine translation, however, because coalescence is not an issue, as compared with SMT, where there is no guarantee that the order of the parts in the translation output is correct.
Another somewhat related problem to compound merging is that of detection of erroneously split compounds in human text, which is faced by grammar checkers. Writing compounds as separate words, with spaces between parts, is a common writing error in compounding languages. Carlberger et al. (2005) described a system for Swedish that used handwritten rules to identify, among other errors, erroneously split compounds. The rules used parts of speech and morphological features. On a classified gold standard of writing errors they had a recall of 46% and a precision of 39% for identifying split compounds, indicating that it is a difficult problem to find split compounds in free, unmarked text.
There is also work on morphological merging, which is needed when the target words have been split into morphs in the training corpus. Virpioja et al. (2007) marked morphs with a symbol and merged all marked words with the next word for translation between Finnish, Swedish, and Danish, without showing any improvements over an unmarked baseline. This strategy does have the advantage of being able to merge novel word forms, but has a drawback in that it can merge parts into non-words if the parts are misplaced in the translation output.
El-Kahlout and Oflazer (2006) used a similar symbol-based merging strategy for translation from English into Turkish, but with the addition of morphographemic rules. They had positive results when performing limited splitting and grouping the split morphs, but not when splitting all morphs. They reported that there were problems with the order of morphs in the output. Badr, Zbib, and Glass (2008) reported results for translation from English to Arabic, where they used a combination merging method where forms were picked from the corpus for known combinations of morphs and words, and generated based on handwritten recombination rules otherwise, which led to improvements over the baseline. They also used the morphs+POS as factors in a factored translation model (Koehn and Hoang 2007) where surface forms were generated from this information; this gave a small improvement on a large corpus, but at the cost of high runtime. El Kholy and Habash (2010) extended the merging scheme of Badr, Zbib, and Glass (2008) by using the conditional probability and a language model score to pick the best known merging option. They showed a small effect of this on an MT task, even though this strategy was the best option in an intrinsic evaluation based on human reference translations.
Compound splitting has been addressed in many articles, as a separate task (Schiller 2005) or targeted for applications such as information retrieval (Holz and Biemann 2008), speech recognition (Larson et al. 2000), grammar checking (Sjöbergh and Kann 2004), lexicon acquisition (Kokkinakis 2001), word prediction (Baroni, Matiasek, and Trost 2002), and machine translation (Koehn and Knight 2003; Dyer 2009; Fritzinger and Fraser 2010; Macherey et al. 2011).
The most successful strategies that address compound processing for MT apply compound splitting as a preprocessing step before training the translation models. Koehn and Knight (2003) presented an empirical splitting algorithm targeted at SMT from German to English. They split words in all possible places, and considered a splitting option valid if all its parts had been seen as words in a monolingual corpus. They allowed the addition of −s or −es at all splitting points. If there were several valid splitting options they chose one based on the number of splits, the geometric mean of part frequencies, or based on alignment data. They evaluated the splitting algorithms intrinsically on a gold standard of manually split noun phrases and on machine translation of noun phrases. The best results for PBSMT were achieved by using either the geometric mean, or the highest number of splits. There were no correlations between translation results and the intrinsic evaluation.
Several other researchers have also explored compound splitting for translation from German to English. Nießen and Ney (2000) used a morpho-syntactic analyzer for splitting German compounds prior to translation. Popović, Stein, and Ney (2006) used the geometric mean version from Koehn and Knight (2003) as well as the morphosyntactic algorithm from Nießen and Ney (2000) for splitting, with similar positive results for both options. Fritzinger and Fraser (2010) combined linguistic analysis with corpus-driven scoring and showed an improvement compared to using only a corpus-driven approach. Macherey et al. (2011) described a corpus-driven method that learns compounding form transformations of a language in addition to just compound splitting, and showed an improvement for translation from several languages into English compared to a baseline without compound treatment. Dyer (2010) suggested a compound splitting method based on sequence labeling, which gave good results for lattice-based translation from German.
4. Compound Translation
For translation into a compounding language, we adopt the compound processing strategy suggested by Popović, Stein, and Ney (2006). The process is:
- 1.
Split compounds on the target (compounding language) side of the training corpus.
- 2.
Learn a translation model from source (e.g., English) into decomposed-target (e.g., decomposed-German).
- 3.
At translation time, translate using the learned model from source into decomposed-target.
- 4.
Apply a postprocessing merge step to reconstruct compounds.
The merging step must solve two problems: Identify which words should be merged into compounds, and choose the correct compounding form for the compound modifiers. The first problem can become hopelessly difficult if the translation did not put components nicely side by side and in the correct order. Preliminary to merging, then, the problem of coalescence needs to be addressed, that is, translations where compound elements are correctly positioned should be promoted.
Figure 1 gives an overview of the translation process with compound processing. Compounds are split before training the translation system, and merged after translation. We use factored decoding (Koehn and Hoang 2007), where features other than just surface words can be used by the system. In our case we tag the data with parts of speech and use a factored model with POS-tags on the target side, which allows us to have a POS-sequence model, beside the standard language model. This configuration has a very small overhead compared with decoding without factors. As we show, using part-of-speech tags on the target side helps to improve the coalescence of compound parts, and can be used to guide the merging process.
For the tuning step there are two options: either we can split the development set and tune with a translation with split compounds compared with a reference with split compounds, or we can merge compounds in the translation output, before performing the optimization. We found empirically that we got the best results using the second approach, of performing compound merging during the tuning process. When we tuned on split texts, the results were more unstable, and especially the number of words in the translation output varied substantially. We thus use merging also during the tuning process in all experiments, as shown in Figure 1.
5. Compound Splitting
Our method for compound splitting is based on Koehn and Knight (2003) and Stymne (2008). For each word all possible segmentations are explored, with the restrictions that all parts must have at least three characters, and the last part, the compound head, must have the same part-of-speech tag as the word itself. Hyphens are treated as additions to compound modifiers, just as +s or +e. Segmentations are scored with the arithmetic mean of frequencies for each part in the training corpus and the segmentation with the highest score is chosen.
We have investigated several variants of the basic method and their effects on compound translation. The variants investigated are:
Using geometric or arithmetic mean for choosing the highest frequency candidate split
Restricting the length of each split part, either to three or four characters
Restricting the highest number of parts per compound to two, or allowing any number of compound parts
Allowing all known compounding forms (c.f. Table 1), or restricting them to the most common ones
Restricting the last part of the compound to be known from the corpus with the same part-of-speech tag as the full compound, or allowing all possible parts-of-speech tags
6. Promoting Coalescence of Compounds
In this section we describe our approach to improve the coalescence of compounds, which is based on POS-sequence models. We first present the representation schemes we use for compounds, which form the basis of the sequence model approach, and then describe the sequence modeling approach in more detail. Finally, we discuss how POS-tags can be used for count features.
6.1 Representation of Compound Parts
As a result of compound splitting the segmentation of words is changed, and we know from Table 1 that compounding forms often do not coincide with any forms that can be used as standalone words. This raises several design decisions:
- 1.
Should alternative forms of the same compound modifier be normalized to a canonical form?
- 2.
Should compound modifiers be marked with a special symbol? Or should the separation between compound parts be marked?
- 3.
How should compound parts be tagged in a factored system?
Although apparently innocuous, these decisions do have some influence on the whole process. In this work we have used three combinations of marking and normalization, and three different tagsets.
In Example (4a), called the unmarked scheme, compound modifiers are normalized (kamps->kamp) and the words carry no special marking. In Example (4b), called the marked scheme, compound modifiers are not normalized, but they are marked with the symbol “#”. For these two marking schemes, an extended tagset, EPOS, is used. It contains tags for compound modifiers that also indicate the part of speech of the head word, such as “N-modif” if the head is a noun (N) or “ADJ-modif” if the head is an adjective (ADJ).
In Example (4c), called the sepmarked scheme, the split itself is marked, by using the special token “@#@” to mark split points, and compound modifiers are normalized. In this case we use a standard POS-tagset with the addition of a COMP-tag for the inserted split-token, and use the POS-tags that were found for the compound modifiers when they were looked up in the monolingual corpus during splitting. We call this tagset the SPOS-tagset.
- (4)
a. fem+kamps+seger:N (pentathlon (five battle) victory)
fem:N-modif kamp:N-modif seger:N
- b.
fem+kamps+seger:N (pentathlon (five battle) victory)
fem#:N-modif kamps#:N-modif seger:N
- c.
fem+kamps+seger:N (pentathlon (five battle) victory)
fem:NUM @#@:COMP kamp:N @#@:COMP seger:N
As a third alternative tagset we use a modified variant of the EPOS-tagset, where distinctions among parts of speech that are not relevant to the formation of compounds are blurred. This reduces the tagset to only a few tags, as in the case where only nouns are split:
N-modif – all parts of a split compound except the last
N – the last part of the compound (its head) and all other nouns
X – all other tokens
6.2 Part-of-Speech–Based Sequence Models
If compounds are split in the training data, then there is no guarantee that translations of components will end up in contiguous positions and in the correct order. This is primarily a language model problem, and we will model it as such by applying sequence models on the customized part-of-speech sets. These sequence models can be either standard language models trained on texts where split compound modifiers are marked with symbols, or they can be POS-sequence models trained on the specially designed tagsets, such as EPOS. Both these types of models can encourage compound parts to occur in the correct order; the former, however, is a more lightweight approach, because it relies on surface words.
A language model solution to compound coalescence could be viewed as a soft constraint in the decoder, which encourages good sequences of compound parts over bad sequences. An alternative to this would have been a hard constraint in the decoder that prohibits compound parts to be placed in an incorrect order. We opted for a soft constraint approach because we found that it gave sufficiently good results, and because previous work on soft versus hard constraints for other areas has shown that soft constraints give more stable and generally better results, for instance, for phrase cohesion (Cherry 2008).
As described in the previous section, we can add special POS tags and symbols to identify compound modifiers. Table 2 shows examples of the different representation schemes. We can then train a *POS n-gram sequence model using any of the *POS-tagsets, which naturally steers the decoder towards translations with good relative placement of these components.
A lightweight model that gives some additional information of the order of compound parts compared to a language model on unmarked data is to train language models on texts where compounds are represented either using the marked or sepmarked schemes but without parts-of-speech tags. These models are, however, not as strong as the *POS-based models, since they cannot generalize from surface words, and mainly can aid in keeping known compounds together.
6.3 Sequence Models as Count Features
We expect a *POS-based n-gram sequence model to learn to discourage sequences unseen in the training data, such as the sequence of compound parts not followed by a suitable head. Such a generative LM, however, might also have a tendency to bias lexical selection towards translations with fewer compounds, since the corresponding tag sequences might be more common in text. To compensate for this bias, we experiment with injecting a little dose of a priori knowledge, and add a count feature, which explicitly counts the number of occurrences of RPOS-sequences that we deem good and bad in the translation output. Table 3 gives an overview of the possible bigram combinations, using the three-symbol tagset, plus sentence beginning and end markers, and their judgment as good, bad, or neutral.
We define two new feature functions: the boost model counting the number of occurrences of Good sequences, and the punish model, counting the occurrences of Bad sequences, as indicated in Table 3. In the punish model we want to punish the two bad combinations, where compound modifiers are placed in isolation without a suitable head. In the boost model, we want to support the formation of compounds by rewarding the two good combinations. We definitely want to boost the combination of a compound part with its head. In addition, we can boost the formation of long compounds by boosting the combination of two compound modifiers as well. The boost and punish models can be used either in isolation or combined, with or without a further *POS n-gram sequence model.
To exemplify the two models, consider the translation hypothesis given in Example (5). The word pairs skogs bruks and bruks plan, marked in bold in the example, constitute good sequences of a modifier followed by another modifier or a head, and would give the count 2 with a boost model. The word sequence in italics, skogs av, constitutes a bad sequence, since the preposition av (of) cannot be a compound head, and would give the count 1 to a punish model. The other word sequences do not influence either of these models because they do not involve any compound modifiers.
- (5)
7. Compound Merging
Once a translation is generated using a system trained on split compounds, a postprocessing step is required to merge components back into compounds. All methods we are aware of only consider consecutive tokens for merging: We stick to this assumption, having delegated to the methods described earlier the task to promote a good relative positioning of component translations. For all pairs of consecutive tokens we have to decide whether to combine them or not. Depending on the language and on preprocessing choices, we might also have to decide whether to apply any boundary transformation such as, for example, inserting −s between components.
In this section we first describe reverse normalization, then we describe our new heuristic merging methods and modifications to existing heuristics. Finally we describe a novel sequence-labeling formulation for compound merging.
7.1 Reverse Normalization
For compound modifiers that were normalized in the training data the reverse process (reverse normalization) is needed at merging time to recreate the correct form for the specific compound. We designed a method for reverse normalization that is based on corpus frequencies of compound modifiers and compounds that can be collected during compound splitting.
The strategy is illustrated and exemplified in Figure 2.2 First we look up all known compounding forms, and their frequencies for all the compound modifiers. In Figure 2, for instance, we found three options for the word recht [right], with rechts being the most common option. Next, we try all combinations of forms to form a full compound, and if any matches are found we pick the most frequent match. If this fails we have two back-up strategies. First, we try to find known compounding forms for pairs of parts, starting from left to right, looking up frequencies of the resulting forms from adding two parts. If that fails we use our second back-up strategy, which is to use the most common form of each modifier, and concatenate those. For binary compounds, the second strategy is superfluous, because there is only one pair of compound parts, and the second strategy is always used.
7.2 Heuristic Approaches to Compound Merging
We have investigated three major approaches to heuristic compound merging. Our major contribution is a novel merging method based on part-of-speech matching. We contrast this method with adaptations of previous merging suggestions based on symbols and word lists. We also suggest improvements to these methods, and combination methods that combine the strengths of word lists and parts of speech.
7.2.1 POS-Based Merging
The POS-match algorithm uses the fact that it is possible to have several output factors besides surface form in a factored translation system. It merges words that are marked as compound modifiers in either the EPOS or RPOS tagsets if the next POS-tag matches. As described in Section 6.1, the part of speech of a compound modifier is based on the part of speech of its head word, so a word is considered matching if the next word is a compound modifier of the same type, or a head with a matching part of speech. In addition, if the next word does not match, the modifier could be part of a coordinated compound, which is checked by seeing if the next word is a conjunction, in which case a hyphen is added to the modifier. We investigated versions of the algorithm both with and without treatment of coordinated compounds.
If a compound modifier is followed by anything other than a matching part of speech or a conjunction it has most likely been misplaced in the translation process. These items are left as they are in the translation output, which is often fine, because only compound parts that occur as separate words in a corpus are split. When two matching compound modifiers are merged, the process is iterated to see if the next word is a matching compound modifier, head, or conjunction. This allows compounds with an arbitrary number of parts to be merged.
In summary, the POS-based merging algorithm has the following steps:
Step through each word+POS pair from left to right3
If a compound-POS, x-modif, is found:
- –
Remove mark-up of the part if present
- –
Store the compound part
- –
When the next POS is a matching part, x-modif:
Remove mark-up of the part if present
Store the compound part
- –
If the next POS is a matching head, x:
Store the compound head
- –
If at least two parts have been found (either several modifiers or a head):
Perform reverse normalization on the stored parts if parts are normalized
Merge all parts
For Swedish and Danish: remove a consonant if any of the merges resulted in three identical consecutive consonants
- –
If the next POS is a conjunction and no head was found:
Add a hyphen at the end of the compound part
- –
This heuristic has the advantage over previous merging algorithms that it can form novel compounds while reducing the risk of erroneous merging through the matching constraint. It is also the only merging method we are aware of that addresses coordinated compounds. However, it requires a factored decoder that can carry part-of-speech tags through the translation process. It also requires tagsets where compound modifiers are marked based on the compound head, such as EPOS or RPOS. In the current form the POS-match strategy cannot be used for the SPOS-tagset that does not use head-based tags for compound modifiers, because we cannot enforce the matching constraint in this way. It would be possible to design a POS-match strategy based on standard tags, which for instance allowed merging of noun+noun, but not of noun+preposition. Such a strategy requires linguistic knowledge and customization for each language, however.
7.2.2 Other Merging Heuristics
The symbol-based method is inspired by work on morphology merging (El-Kahlout and Oflazer 2006; Virpioja et al. 2007). It merges words that are marked with a symbol with the next word in the marked scheme. In the sepmarked scheme, when a standard symbol is found, the words on both sides of it are merged. These algorithms have the disadvantage, compared with the POS-match algorithm, that it is more likely that words are merged into non-compounds, since no matching check is carried out.
We have also investigated methods based on word lists, proposed by Popović, Stein, and Ney (2006). These methods use frequency word lists compiled at split time. Three types of lists were used: lists of compound modifiers, of compounds, and of words. If a compound modifier is encountered, it is checked whether merging it with the next word results in either another compound modifier, or a compound or word. Unlike Popović, Stein, and Ney (2006), we perform this process recursively, to allow compounds with several parts. Again, reverse normalization is performed when needed. Still, no novel compounds can be formed, and coordinated compounds are not handled. The method does not merge words into non-words, but there is another risk, that of merging words that should be separate in a specific context, but that happen to form a valid compound or other word when combined, such as the examples in Example (6). It has the advantage over the other proposed methods that it can be used on output from any MT decoder, without the use of customized POS-tags or symbols in the output. We investigate the usage of two types of frequency lists for this strategy, either a list of compounds, which can be created as a byproduct of the splitting algorithm, like Popović, Stein, and Ney (2006), or by using a list of all words in a corpus.
- (6)
de bei der (at the)
beider (both)
sv för små (too small)
försmå (spurn)
We empirically verified that the list-based heuristics tend to misfire quite often, leading to too many compounds, such as merging the words in Example (6). We thus modified them in two ways: (1) by additionally requiring the head word to be a content word and (2) by requiring the generated compound to be more frequent in a corpus than the corresponding bigram of isolated words. The head word restriction can block some erroneous merges such as those in Example (6DE), whereas the frequency restriction can potentially block both examples in Example (6). Compound and bigram frequencies can be computed on any available monolingual corpus in the domain of interest. We also investigated combinations of the heuristics by using either the union or intersection of merges from two different strategies.
7.3 Compound Merging as Sequence Labeling
Besides extending and combining existing heuristics, we propose a novel formulation of compound merging as a sequence labeling problem. The opposite problem, compound splitting, has successfully been cast as a sequence labeling problem before (Dyer 2010), but here we apply this formulation in the opposite direction.
Depending on choices made at compound splitting time, this task can be either a binary or multi-class classification task. If compound parts were kept as-is, the merging task is a simple concatenation of two words, and each separation point must receive a binary label encoding whether the two tokens should be merged. If compounds were normalized at splitting time, the compound form has to be restored before concatenating the parts. This can be modeled either as a multi-class classifier that has the possible boundary transformations as its classes or in a two-step process where the form of words are modeled in a separate step after the binary merging decision. In the latter case it is possible to use the reverse normalization process described in Section 7.1. In this work we limited our attention to binary classification.
Consider for instance translating into German the English in Example (7).
- (7)
Europe should promote the knowledge of foreign languages
Assuming that the training corpus did not contain occurrences of the pair (knowledge of foreign languages,‘fremdsprachenkenntnisse’) but contained occurrences of (knowledge,‘kenntnisse’), (foreign,‘fremd’), and (languages,‘sprachen’), then the translation model from English into decomposed German could be able to produce Example (8).
- (8)
Europa sollte fremd sprachen kenntnisse fördern
We cast the problem of merging compounds as one of making a series of correlated binary decisions, one for each pair of consecutive words, each deciding whether the whitespace between the two words should be suppressed (label 1) or not (label 0). In the example case, the correct labeling for the sentence would be {0,0,1,1,0}, reconstructing the correct German as shown in Example (9).
- (9)
Europa sollte fremdsprachenkenntnisse fördern
Although in principle one could address each atomic merging decision independently, it seems intuitive that a decision taken at one point should influence merging decisions in neighboring separation points, especially because compounds can be made up of more than two parts. For this reason, instead of a simple (binary or n-ary) classification problem, we prefer a sequence labeling formulation.
Depending on the choice of the features, this approach has the potential to be truly productive, that is, to form new compounds in an unrestricted way. As discussed, in fact, the list-based heuristics can only form compounds that were observed in the training data or in some suitable monolingual corpus, and are thus not productive. The POS-match heuristic is more flexible, but is still limited in that it can only form a compound if a modifying element (non-head) has been observed and tagged as such in the training data.
The array of sequence labeling algorithms potentially suitable to our problem is fairly broad, including hidden Markov models (Rabiner 1989), conditional random fields (CRFs) (Lafferty, McCallum, and Pereira 2001), Semi-CRFs (Sarawagi and Cohen 2004), structured perceptrons (Collins 2002), structured support vector machines (Tsochantaridis et al. 2005), Max-Margin Markov networks (Taskar, Guestrin, and Koller 2003), and more. Because the focus of this work is on the application rather than on a comparison among alternative structured learning approaches, we limited ourselves to a single implementation. Considering its good scaling capabilities, capability to handle strongly redundant and overlapping features, and widespread recognition in the NLP community, we chose to use CRFs.
7.3.1 Features
Each sequence item (i.e., each separation point between words) is represented by means of a vector of features. Our aim was to include features representing the knowledge available to the heuristics, such as part-of-speech tags, frequencies for compounds and bigrams, as well as comparisons between them. Features were also inspired by previous work on compound splitting, with the intuition that features that are useful for splitting compounds could also be useful for merging. Character n-grams have successfully been used for splitting Swedish compounds, as the only knowledge source by Brodda (1979), and as one of several knowledge sources by Sjöbergh and Kann (2004). Friberg (2007) tried to normalize letters, besides using the original letters. Although she was not successful, we still believe in the potential of this feature. Larson et al. (2000) used frequencies of prefixes and suffixes from a corpus as a basis of their method for splitting German compounds. We used the following features where -N refers to the nth position before the merge point, and +N to the nth position after the merge point:
Previous tag
Surface words: word-2, word-1, word+1, bigram word-1–word+1
Parts of speech: POS-2, POS-1, POS+1, bigram POS-1–POS+1
Character n-grams around the merge point
- –
three-character suffix of word-1
- –
three-character prefix of word+1
- –
Combinations crossing the merge points: 1+3, 3+1, 3+3 characters
- –
Normalized character n-grams around the merge point, where characters are replaced by phonetic approximations and grouped according to phonetic distribution, see Figure 3 (only for Swedish)
Frequency comparisons of two different frequencies in the training corpus, classified into four categories: freq1 = freq2 = 0, freq1 < freq2, freq1 = freq2, freq1 > freq2
- –
word-1,word+1 as bigram vs. compound
- –
word-1 as true prefix vs. single word
- –
word+1 as true suffix vs. single word
- –
7.3.2 Training Data for the Sequence Labeler
Because features are strongly lexicalized, a suitably large training data set is required to prevent overfitting, ruling out the possibility of manual labeling.
We created our training data automatically, using a subset of the compound merging heuristics described in Section 7.2.2, plus an additional heuristic enabled by the availability, when estimating parameters for the CRF, of a reference translation: Merge if two tokens are observed combined in the reference translation (possibly as a sub-sequence of a longer word). We compared multiple alternative combinations of heuristics on a validation data set. The validation and test data were created by applying all heuristics, and then manually having the positive instances checked.
A first possibility to automatically generate a training data set consists in applying the compound splitting preprocessing of choice to the target side of the parallel training corpus for the SMT system: Separation points where merges should occur are thus trivially identified. In practice, however, merging decisions will need to be taken on the noisy output of the SMT system, and not on the clean training data. To acquire training data that is similar to the test data, we could have held out from SMT training a large fraction of the training data, used the trained SMT to translate the source side of it, and then label decision points according to the heuristics. This would, however, imply making a large fraction of the data unavailable to the training of the SMT system.4 We thus settled for a compromise: We trained the SMT system on the whole training data, translated the whole source side, and then labeled decision points according to the heuristics. The translations we obtain are thus biased, of higher quality than those we should expect to obtain on unseen data. Nevertheless they are substantially more similar to real SMT output than the reference translations with automatic splits.
8. Experiments
In this section we report experimental results for the strategies proposed in this article. We first describe, in Section 8.1, the overall experimental set-up. We then report on the experiments, as follows:
In Section 8.2 we report on general effects of compound processing.
Section 8.3 reports effects on coalescence of different representation schemes. Here we do not vary merging or splitting.
Section 8.4 reports on compound splitting and the way different parameter settings affect translation quality. Here we do not vary representation scheme or merging.
Section 8.5 reports results from varying the merging process. In particular we compare the novel sequence labeling method with the heuristic methods. Here we do not vary representation scheme or splitting.
In Section 8.6 we apply the overall best strategies to corpora of different sizes and to out-of-domain data.
8.1 Experimental Set-up
We performed experiments on translation from English into German, Swedish, and Danish, all of which have closed compounds. We tested all experimental conditions on Europarl (Koehn 2005) for translation from English to German. We also give contrasting results on English–Swedish Europarl and for an automotive corpus for translation from English to Swedish and Danish. The automotive corpus was gathered from translation memory data. The two corpora are quite different. The automotive corpus is from a limited domain, and of a homogeneous nature, whereas Europarl is more diverse, and tends to have a more complex language than the automotive corpus. Table 4 summarizes the sizes for the corpora that we used in Sections 8.28c8d–8.5. The German test set is the test2007 set from the WMT 2008 workshop.5 We have chosen to use a smaller part of Europarl, to reduce runtimes and allow more experiments. In Section 8.6 we report results from scaling experiments and on out-of-domain data.
Name . | Europarl German . | Europarl Swedish . | Auto Swedish . | Auto Danish . |
---|---|---|---|---|
Corpus | Europarl | Europarl | Automotive | Automotive |
Languages | En→De | En→Sv | En→Sv | En→Da |
Training sentences | 701,157 | 701,157 | 329,090 | 168,047 |
Avg. target sentence length | 20.5 | 19.4 | 9.3 | 9.2 |
Dev sentences | 500 | 500 | 2,000 | 1,000 |
Test sentences | 2,000 | 2,000 | 1,000 | 1,000 |
Name . | Europarl German . | Europarl Swedish . | Auto Swedish . | Auto Danish . |
---|---|---|---|---|
Corpus | Europarl | Europarl | Automotive | Automotive |
Languages | En→De | En→Sv | En→Sv | En→Da |
Training sentences | 701,157 | 701,157 | 329,090 | 168,047 |
Avg. target sentence length | 20.5 | 19.4 | 9.3 | 9.2 |
Dev sentences | 500 | 500 | 2,000 | 1,000 |
Test sentences | 2,000 | 2,000 | 1,000 | 1,000 |
We used factored translation (Koehn and Hoang 2007) in our experiments, with both surface words and part-of-speech tags on the target side, with a sequence model on parts-of-speech. For part-of-speech tagging we used TreeTagger (Schmid 1994) for German, an in-house hidden Markov model tagger based on Cutting et al. (1992) for Danish and Swedish, and for Swedish also the Granska tagger (Carlberger and Kann 1999). For the majority of experiments we used the Moses decoder (Koehn et al. 2007), which is a standard phrase-based statistical decoder, which allows factored decoding. For word alignment, Giza++ (Och and Ney 2003) was used and for language modeling we used SRILM (Stolcke 2002). For parameter optimization we used minimum error rate training (Och 2003). For each experiment we ran minimum error rate training three times in order to reduce the effect of optimizer instability, and report the average result and standard deviation. In the merging experiments based on sequence labeling we used the CRF++ toolkit.6 For the merging experiment with sequence labeling, we used the Matrax decoder (Simard et al. 2005) on the automotive corpus. Matrax is a phrase-based decoder that allows discontiguous phrases, and parameter optimization based on gradient descent for smoothed NIST. We extended the original Matrax decoder with factored decoding on the target side.
Compounds were split before training using the corpus-based method described in Section 5. Except for the experiments comparing different compound merging methods, we used the POS-match merging algorithm developed by us.
We report results on three metrics: Bleu (Papineni et al. 2002), NIST (Doddington 2002), and Meteor. For Meteor we use the version tuned on adequacy and fluency (Lavie and Agarwal 2007) for German, and the original version with default weights (Banerjee and Lavie 2005) for Swedish and Danish, since there is no tuned version for those languages. For Bleu and Meteor we use the %-notation. Significance testing was performed using approximate randomization (Riezler and Maxwell 2005), with 10,000 iterations, on output based on three optimizer runs, as recommended by Clark et al. (2011).
8.2 General Effects of the Compound Processing Strategy
One effect of compound splitting is that the number of word types is reduced. Tables 5 and 6show the number of types (number of unique words) and tokens (total number of words) and the type/token ratio for Europarl in the different representation schemes. The type count and the type/token ratio in the baseline system are much higher in German and Swedish than in English, and is drastically reduced after compound splitting in all markup schemes, even though it is still higher than in English. One reason for the type count still being higher in German and Swedish than in English is morphological complexity, which is low in English and higher in German and Swedish. The type count is higher in the marked representation scheme than in the other schemes due to the fact that compound modifiers are marked, and do not coincide with other words as they do in the other schemes.
System . | Tokens . | Types . | Ratio . | |
---|---|---|---|---|
German | baseline | 14,356,051 | 184,215 | 1.28% |
marked | 15,674,728 | 93,746 | 0.60% | |
unmarked | 15,674,728 | 81,806 | 0.52% | |
sepmarked | 17,007,929 | 81,808 | 0.48% | |
English | 15,158,429 | 63,692 | 0.42% |
System . | Tokens . | Types . | Ratio . | |
---|---|---|---|---|
German | baseline | 14,356,051 | 184,215 | 1.28% |
marked | 15,674,728 | 93,746 | 0.60% | |
unmarked | 15,674,728 | 81,806 | 0.52% | |
sepmarked | 17,007,929 | 81,808 | 0.48% | |
English | 15,158,429 | 63,692 | 0.42% |
System . | Tokens . | Types . | Ratio . | |
---|---|---|---|---|
Swedish | baseline | 13,603,062 | 182,000 | 1.34% |
marked | 14,401,784 | 107,047 | 0.74% | |
unmarked | 14,401,784 | 100,492 | 0.70% | |
English | 15,043,321 | 67,044 | 0.45% |
System . | Tokens . | Types . | Ratio . | |
---|---|---|---|---|
Swedish | baseline | 13,603,062 | 182,000 | 1.34% |
marked | 14,401,784 | 107,047 | 0.74% | |
unmarked | 14,401,784 | 100,492 | 0.70% | |
English | 15,043,321 | 67,044 | 0.45% |
We believe that the fact that compound splitting leads to both type and token counts that are more similar to English is a positive thing, because it likely contributes to making the two languages structurally more similar, which makes the SMT process easier. This is supported by Birch, Osborne, and Koehn (2008), who showed that morphological complexity of a target language correlates negatively with Bleu scores. A lower type/token ratio translates into fewer out of vocabulary (OOV) words, and can give more accurate probability estimates for the SMT translation and language models.
These changes in the number of types and tokens could potentially influence the translation process. In Tables 7 and 8 we show the effect of splitting on average phrase length and average phrase length ratio in the phrase table and during translation in the baseline and in the unmarked system. For the unmarked system we also compensated for the split compounds by counting phrase length as if compounds were merged using the POS-match heuristic. For both languages the lengths and ratios are similar between the baseline and the unmarked system when we compensate for split compounds. This indicates that the type of phrases used are not much affected by the higher number of tokens in the split system. We can also see that the phrases used for translations are on average longer on the source side and shorter on the target side, with a very different ratio than the phrases in the phrase table.
8.3 Promoting Compound Coalescence
In this section we describe three sets of experiments that investigate the use of the *POS-tagsets in sequence models and as count features. In the first experiment we explore the effect of different representation schemes for translation into German, and in the second experiment we do the same, on a smaller scale, for Swedish. In the third experiment we investigate the effect of the RPOS-tagset and the boost and punish models. In all experiments merging was performed using the POS-match heuristic (see Section 7.2.1), except for the sepmarked scheme that used the SPOS-tagset that does not allow this type of POS-match, and for which the symbol merging algorithm (see Section 7.2.2) was used.
8.3.1 Experiment 1
In the first experiment we investigated different compound representation schemes for German Europarl. Table 9 shows the results using different representation schemes, comparing them with two baselines, with and without a POS-sequence model. There is generally a small significant improvement when a POS-based sequence model is used compared with the same model without a POS-based sequence model. Especially for the systems with compound splitting, it was clearly worse not to use *POS-models, and all such systems perform significantly worse than both baselines on most metrics. The representation scheme with sepmarked words and SPOS-tags did not perform well, which can both be due to less power of the mark-up system, and to the fact that it cannot use the POS-match merging heuristic. The two systems with EPOS-models do perform well though, and are on par with the factored baseline, and mostly better than the unfactored baseline. The marked and unmarked systems with EPOS perform similarly, with no significant differences between them. It is thus hard to say which of these classification systems are preferable, but at least it is clear that the use of EPOS-tags are useful.
. | Bleu . | NIST . | Meteor . |
---|---|---|---|
Baseline | 20.0 (0.3) | 5.94 (0.04) | 27.5 (0.2) |
Baseline+POS | 20.2 (0.1)** | 5.93 (0.02) | 27.6 (0.2) |
Unmarked | 19.7 (0.2)**## | 5.90 (0.04)**# | 27.4 (0.2)## |
Marked | 19.7 (0.2)**## | 5.91 (0.02)** | 27.5 (0.1) |
Sepmarked | 19.0 (0.1)**## | 5.81 (0.02)**## | 27.0 (0.0)**## |
EPOS-unmarked | 20.1 (0.0)** | 5.97 (0.03)* | 27.8 (0.1)** |
EPOS-marked | 19.9 (0.1) | 5.96 (0.02)* | 27.7 (0.1)* |
SPOS-sepmarked | 19.4 (0.3)**## | 5.83 (0.01)**## | 27.3 (0.2)*## |
. | Bleu . | NIST . | Meteor . |
---|---|---|---|
Baseline | 20.0 (0.3) | 5.94 (0.04) | 27.5 (0.2) |
Baseline+POS | 20.2 (0.1)** | 5.93 (0.02) | 27.6 (0.2) |
Unmarked | 19.7 (0.2)**## | 5.90 (0.04)**# | 27.4 (0.2)## |
Marked | 19.7 (0.2)**## | 5.91 (0.02)** | 27.5 (0.1) |
Sepmarked | 19.0 (0.1)**## | 5.81 (0.02)**## | 27.0 (0.0)**## |
EPOS-unmarked | 20.1 (0.0)** | 5.97 (0.03)* | 27.8 (0.1)** |
EPOS-marked | 19.9 (0.1) | 5.96 (0.02)* | 27.7 (0.1)* |
SPOS-sepmarked | 19.4 (0.3)**## | 5.83 (0.01)**## | 27.3 (0.2)*## |
A more detailed analysis was performed of the compound parts in the output. The outcomes of the merging process were classified into four groups: known compounds; novel compounds; parts of coordinated compounds; and unmerged, single parts. They were further classified into good or bad outcomes. Compounds were judged as bad if they formed non-words or had the wrong form, and compound parts were judged as bad if they should have been merged with the next word, or did not work as a standalone word.
Table 10 shows the results of this analysis. The majority of the merged compounds are known from the training corpus for all representation schemes. There is, again, a marked difference between the systems that use POS-match merging and the sepmarked systems that do not have that information. The sepmarked system found the highest number of novel compounds, but also had the highest error rate, which shows that it is useful to match POS-tags. The EPOS systems have a higher number of novel compounds than the marked and unmarked options without an EPOS model. All these systems had a low error rate of the novel compounds, however. Very few errors were due to reverse normalization; In the EPOS-unmarked system there were only three such errors.
Generally, the percentage of bad parts or compounds is lower for the systems with a *POS-sequence model, which shows that the sequence model is useful for the ordering of compound parts. The number of single compound parts is also much higher for the systems without a POS sequence model.
The number of compounds found by the splitting algorithm in the German reference text was 4,472. All systems produce fewer compounds than this number. The numbers in Table 10 cannot be directly compared to the baseline system, however, because we do not know which words in its output are compounds.
An indication of how many compounds there are in a text is the number of long words. In the reference text there are 231 word types with at least 20 characters and 1,178 word types with at least 15 characters, which we used as the limits for long words. Other length limits give similar results. Table 11 gives data on absolute numbers of these long word types and recall, precision, and F-score compared with long words found in the reference for the different systems. The distribution of long words in the systems with compound processing generally follows the distribution of compounds from Table 10, which indicates that this is a good indicator of the number of compounds. Both baseline systems have fewer long words than all compound processing systems, indicating that the split-merge strategy does indeed help in increasing the number of compounds in the translation output. The SPOS systems had the highest number of long words; for up to 15 characters there are even more long words than in the reference. The EPOS systems have the same number for up to 15 characters, but a bit lower for up to 20 characters. The EPOS systems also have the highest recall of all systems for both character lengths. The F-scores are similar for the baseline systems and the EPOS systems, with the baseline higher on precision, and the EPOS systems higher on recall. Although the baseline has a higher precision, the absolute number of overlapping words is still higher in many of the systems with compound processing—for instance, for words of up to 20 characters there are 431 in EPOS-unmarked compared with 388 in baseline+POS. It is important to notice, however, that the reference is not a trustworthy gold standard, because there are many possible alternative good translations, and the real quality of long words are likely underestimated. Thus, as can be seen in Table 10, the clear majority of produced compounds in the EPOS systems are judged acceptable, even though the precision of overlap for long words with the reference is at most 47% for the EPOS models.
. | ≥20 chars . | ≥15 chars . | ||||||
---|---|---|---|---|---|---|---|---|
Number . | R . | P . | F . | Number . | R . | P . | F . | |
Reference | 231 | – | – | – | 1,178 | – | – | – |
Baseline | 157 | .27 | .49 | .35 | 743 | .33 | .53 | .41 |
Baseline+POS | 151 | .26 | .49 | .34 | 733 | .33 | .53 | .41 |
Unmarked | 192 | .27 | .40 | .32 | 826 | .34 | .49 | .40 |
Marked | 217 | .29 | .38 | .33 | 862 | .35 | .48 | .40 |
Sepmarked | 279 | .28 | .28 | .28 | 1,031 | .33 | .38 | .35 |
EPOS-unmarked | 231 | .30 | .37 | .33 | 936 | .36 | .46 | .40 |
EPOS-marked | 233 | .30 | .37 | .34 | 895 | .36 | .47 | .41 |
SPOS-sepmarked | 290 | .30 | .30 | .30 | 1,043 | .34 | .39 | .36 |
. | ≥20 chars . | ≥15 chars . | ||||||
---|---|---|---|---|---|---|---|---|
Number . | R . | P . | F . | Number . | R . | P . | F . | |
Reference | 231 | – | – | – | 1,178 | – | – | – |
Baseline | 157 | .27 | .49 | .35 | 743 | .33 | .53 | .41 |
Baseline+POS | 151 | .26 | .49 | .34 | 733 | .33 | .53 | .41 |
Unmarked | 192 | .27 | .40 | .32 | 826 | .34 | .49 | .40 |
Marked | 217 | .29 | .38 | .33 | 862 | .35 | .48 | .40 |
Sepmarked | 279 | .28 | .28 | .28 | 1,031 | .33 | .38 | .35 |
EPOS-unmarked | 231 | .30 | .37 | .33 | 936 | .36 | .46 | .40 |
EPOS-marked | 233 | .30 | .37 | .34 | 895 | .36 | .47 | .41 |
SPOS-sepmarked | 290 | .30 | .30 | .30 | 1,043 | .34 | .39 | .36 |
8.3.2 Experiment 2
To further test whether a compound processing strategy using a customized tagset is useful, we performed experiments on an additional language pair, English–Swedish. In this case we always used the successful EPOS-tagset, in combination with either the marked or unmarked representation of compounds.
Table 12 shows the results for translation into Swedish. Both systems with compound processing have higher scores on all metrics. The difference is significant on Meteor, and on either Bleu or NIST for the two systems with compound processing. The only significant differences between using marked and unmarked representations is that the marked system is better on NIST.
System . | Bleu . | NIST . | Meteor . |
---|---|---|---|
Baseline+POS | 21.6 (0.0) | 6.11 (0.00) | 57.8 (0.1) |
EPOS-unmarked | 22.0 (0.1)** | 6.14 (0.01) | 58.3 (0.1)** |
EPOS-marked | 21.9 (0.0) | 6.21 (0.00)** | 58.3 (0.0)** |
System . | Bleu . | NIST . | Meteor . |
---|---|---|---|
Baseline+POS | 21.6 (0.0) | 6.11 (0.00) | 57.8 (0.1) |
EPOS-unmarked | 22.0 (0.1)** | 6.14 (0.01) | 58.3 (0.1)** |
EPOS-marked | 21.9 (0.0) | 6.21 (0.00)** | 58.3 (0.0)** |
To investigate compound translation specifically, we manually classified the system translations corresponding to the first 100 compounds in the reference text that had a clear counterpart in the English source text. As good translations we considered identical translations to the reference, and alternative translations with the same meaning, which we also distinguished between other compounds, single words, or other constructions. We also had a category for word groups that were translated as separate words, but should have been compounded, split compounds.
The result of this evaluation can be seen in Table 13. There are more translations that are identical to the reference in the two systems with splitting, but the total number of identical and alternative translations is approximately the same in the three systems. The number of split compounds is higher in the baseline system. The unmarked system produces more split compounds and partial translations than the marked system. This can be seen as an indication of marking having an effect, which, however, is not clear in the automatic evaluation.
. | Baseline+POS . | EPOS-Unmarked . | EPOS-Marked . |
---|---|---|---|
Identical comp. | 48 | 53 | 57 |
Alt. compound | 14 | 9 | 10 |
Alt. word | 16 | 16 | 12 |
Alt. other | 9 | 8 | 9 |
Split compound | 7 | 5 | 3 |
Partly transl. | 4 | 7 | 4 |
No equivalent | 0 | 0 | 2 |
OOV | 2 | 2 | 3 |
. | Baseline+POS . | EPOS-Unmarked . | EPOS-Marked . |
---|---|---|---|
Identical comp. | 48 | 53 | 57 |
Alt. compound | 14 | 9 | 10 |
Alt. word | 16 | 16 | 12 |
Alt. other | 9 | 8 | 9 |
Split compound | 7 | 5 | 3 |
Partly transl. | 4 | 7 | 4 |
No equivalent | 0 | 0 | 2 |
OOV | 2 | 2 | 3 |
We also investigated the quality of the merged compounds, especially with regard to the reverse normalization that is needed in the unmarked systems, where compounding forms were normalized. There were no merging errors in the marked system. In the unmarked system there were only two minor errors due to failed reverse normalization. The first error is a missing insertion of an +s for medlem/+s/länder (member countries) and in the second case, *samhäll/-e+s/politiska (socio-political), a combination change −e/+s is wrong.
8.3.3 Experiment 3
In the third set of experiments with factored translation models we investigated the RPOS tagsets and the use of count features for the Danish and Swedish automotive corpus and German Europarl. Compound parts were merged using the POS-match heuristic.
Results on the two automotive corpora are shown in Table 14. The scores are very high, which is due to the fact that it is an easy domain with many repetitive sentence types. In this case there are no significant differences between the two baselines when using a POS-model. For Swedish, the results are overall higher for the systems with compound splitting, a difference that is significant for some systems and metrics. Overall, the RPOS system performs best for Swedish, with results significantly better than at least one baseline on all metrics. For Danish, on the other hand, the only significant difference between any baseline and system with splitting is for the EPOS system, which is slightly worse than the baseline with POS. For the other systems there are no significant differences to the baseline.
Table 15 shows results using RPOS for German Europarl. For this corpus neither the RPOS model nor the boost and punish models works well. They all give significantly worse results than the baseline. As we have seen before, however, the EPOS model gives competitive results to the baseline for this corpus.
System . | Bleu . | NIST . | Meteor . |
---|---|---|---|
Baseline+POS | 20.2 (0.1) | 5.93 (0.02) | 27.6 (0.2) |
EPOS | 20.1 (0.1) | 5.97 (0.03) | 27.7 (0.1) |
RPOS | 16.5 (0.1)** | 5.32 (0.03)** | 25.0 (0.2)** |
boost | 16.3 (0.2)** | 5.29 (0.03)** | 24.8 (0.3)** |
punish | 16.6 (0.1)** | 5.33 (0.03)** | 25.1 (0.2)** |
System . | Bleu . | NIST . | Meteor . |
---|---|---|---|
Baseline+POS | 20.2 (0.1) | 5.93 (0.02) | 27.6 (0.2) |
EPOS | 20.1 (0.1) | 5.97 (0.03) | 27.7 (0.1) |
RPOS | 16.5 (0.1)** | 5.32 (0.03)** | 25.0 (0.2)** |
boost | 16.3 (0.2)** | 5.29 (0.03)** | 24.8 (0.3)** |
punish | 16.6 (0.1)** | 5.33 (0.03)** | 25.1 (0.2)** |
Overall, we found that it was successful to use *POS-sequence models in a factored translation model in order to handle compound coalescence. Our best models with compound splitting perform at least on par with the baseline, and sometimes better. We also showed that there are other advantages to using compound processing, especially that the number of long words are similar to the baseline, which is not the case for our baseline systems, which have too few long words. We also showed that *POS-sequence models are essential for the compound processing approach to be successful. On Europarl the EPOS-tagset was the most successful, whereas the RPOS-tagset and count features could help for the automotive domain.
8.4 Influence of Splitting Strategies
In the following experiments we investigated how compound splitting strategies of different quality influenced the suggested compound processing method. In order to do this we performed an intrinsic evaluation of several splitting methods, and compared it to translation results using the same splitting strategies. In this experiment we used German Europarl.
We used a modified version of the splitting strategy of Koehn and Knight (2003) as a basis for the comparison, and applied it to all nouns, adjectives, adverbs, and verbs of minimum length six characters into parts of minimum three characters, by allowing all splits where the parts were found in a corpus, and were tagged as content words. Parts were allowed to be modified by all compound suffixes from Langer (1998). The best splitting option, which can be no split, was chosen based on the arithmetic mean of the corpus frequencies of the parts. We also imposed the restriction on the compound head that its part-of-speech tag needs to be the same as for the full word. We call this system arith. In other variants of splitting strategies, one feature at the time was changed, based on the arith system:
geom – using the geometric mean instead of the arithmetic mean
eager – choosing the split with the highest number of parts, instead of using the arithmetic mean
part4 – limiting the length of compound parts to four characters
max2 – limiting the maximum number of parts per compound to two
common – only allowing the common compound suffixes −s, −es, −n, −nen
anypos – not using part-of-speech tags
To measure the success of the different compound splitting algorithms we performed an intrinsic evaluation. We used the gold standard from Stymne (2008), created by manually annotating the first 5,000 words of the test text for one-to-one correspondence with the English reference text, similar to Koehn and Knight (2003). A one-to-one correspondence occurs when the words in a German compound are translated as separate words in English. In addition there can be inserted function words. As an example, Medienfreiheit is in one-to-one correspondence with freedom of the media, since the two German parts Medien and Freiheit corresponds to two separate words, media and freedom. The two function words of, the are ignored. Out of the 5,000 words 174 were compounds in one-to-one correspondence with English. The result of the one-to-one evaluation is shown in Table 16. The same metrics and categories as in Koehn and Knight (2003) were used:
correct split: words that were correctly split
correct not: words that should not be split and were not
wrong not: words that should be split but were not
wrong faulty: words that were split but in an incorrect way
wrong split: words that should not be split but were
precision: (correct split) / (correct split + wrong faulty + wrong split)
recall: (correct split) / (correct split + wrong faulty + wrong not)
accuracy: (correct) / (correct + wrong)
The splitting options have their strengths on different metrics, with three different methods having the best results for the three metrics used. Compared to the arith method it can be seen that both imposing length restrictions and using the geometric mean increases the results on all three metrics. Limiting the number of parts gives the highest recall. Using only common compound suffixes gives higher precision, and not using part-of-speech gives lower precision. The baseline system without splitting actually has the highest accuracy. To a large extent this is due to the fact that the test set is taken from running text, with a high number of non-compounds.
Overall, the results for all splitting variations are quite low. However, most of the strategies tend to split too much rather than too little, which is preferable, because the phrase-based translation system has the possibility of recovering from over-splitting by handling compounds that are split into too many parts in consistent phrase pairs. Evaluation towards a one-to-one gold standard is rather harsh because the splitting algorithm has no knowledge of the corresponding English structure. If we instead use a gold standard of linguistically motivated compounds, which are many more (545 for the 5,000 word set), precision is much higher for all systems, with mostly a lower recall. For the arith system, for instance, the precision is more than doubled at .597 with a slightly lower recall of .522.
The results for the translation task from English into German are shown in Table 17. In this experiment we used a different selection of 701K training sentences from Europarl than in Section 8.3, which resulted in overall lower scores both for the baseline and for the systems with compound processing. Overall, there are very small differences between both the baseline system and the systems with different types of splitting, and most of the small differences between the systems are not significant. The arith, max2, and common systems are on par with the baseline on all metrics, with no significant differences, whereas the other systems are worse than the baseline on at least one metric.
. | Bleu . | NIST . | Meteor . |
---|---|---|---|
baseline+POS | 19.1 (0.1) | 5.79 (0.02) | 26.8 (0.0) |
arith | 19.0 (0.0) | 5.79 (0.01) | 26.6 (0.1) |
geom | 18.9 (0.1)* | 5.78 (0.02) | 26.5 (0.2)** |
eager | 18.8 (0.1)** | 5.80 (0.01) | 26.6 (0.1) |
part4 | 18.8 (0.1)** | 5.79 (0.01) | 26.5 (0.1)* |
max2 | 18.8 (0.1) | 5.80 (0.01) | 26.6 (0.1) |
common | 18.9 (0.1) | 5.78 (0.01) | 26.6 (0.1) |
anypos | 18.9 (0.1) | 5.75 (0.01)** | 26.6 (0.1) |
. | Bleu . | NIST . | Meteor . |
---|---|---|---|
baseline+POS | 19.1 (0.1) | 5.79 (0.02) | 26.8 (0.0) |
arith | 19.0 (0.0) | 5.79 (0.01) | 26.6 (0.1) |
geom | 18.9 (0.1)* | 5.78 (0.02) | 26.5 (0.2)** |
eager | 18.8 (0.1)** | 5.80 (0.01) | 26.6 (0.1) |
part4 | 18.8 (0.1)** | 5.79 (0.01) | 26.5 (0.1)* |
max2 | 18.8 (0.1) | 5.80 (0.01) | 26.6 (0.1) |
common | 18.9 (0.1) | 5.78 (0.01) | 26.6 (0.1) |
anypos | 18.9 (0.1) | 5.75 (0.01)** | 26.6 (0.1) |
On the intrinsic evaluation there were quite large differences between the systems, which are not found on the translation task. This is similar to previous work by Koehn and Knight (2003) for translation in the other direction, where systems similar to the eager and geom system performed similarly on a translation task, while the eager system was much worse on their intrinsic evaluation. In our evaluation only the eager system is significantly worse than any other system on Bleu, where it is worse than the baseline and the arith system. The arith system overall is competitive with both the baseline and the other split systems, despite the relatively low results on the intrinsic evaluation.
Overall, it seems that the translation task is not very sensitive to the quality of the splitting strategy. As in previous research (Koehn and Knight 2003; Stymne 2008) there are no clear relations between the intrinsic evaluation and the MT evaluation. Both systems with low intrinsic accuracy (such as anypos) and with high accuracy (such as geom) tend to be worse than other systems on at least some MT metrics, even though the differences are small. We thus think that for the task of translation into compounding languages, the MT performance cannot be predicted based on intrinsic evaluations.
In a previous similar study using a smaller corpus (Stymne 2008), we found somewhat bigger differences between the systems, and in that study most of the systems with splitting were better than the baseline. In that study, too, the arith system was among the best performing on the MT task, even though many of the differences to the other systems with splitting were not significant. Because the arith strategy consistently has given good results for the MT task, we chose to use it in our other experiment. There would, however, have been other reasonable choices of splitting, such as using the max2 system.
8.5 Compound Merging
In the following studies we investigated the impact of the compound merging strategy used. We compared previously suggested strategies to the new strategies we have developed. We first present results for heuristic merging strategies, and then go on to compare the best heuristic methods to a sequence labeling merging strategy. For the heuristic merging we used German Europarl, and for the comparison with sequence labeling we used German Europarl and the automotive corpus for Swedish and Danish. In these experiments we used the arith compound splitting method from Section 8.4.
8.5.1 Heuristic Merging
Three main types of merging algorithms were investigated in this study. The first group, inspired by Popović, Stein, and Ney (2006), is based on frequency lists of words or compounds, and of parts, compiled at split-time. The second group uses symbols to guide merging, inspired by work on morphology merging (Virpioja et al. 2007). The third group takes *POS-tags for compounds into account, so that merging takes place if and only if the part-of-speech tags match. In addition, we examined combined merging methods, using either the union or intersection of merges proposed by different methods. We also extended the list- and symbol-based methods by a restriction that the head of the compound should have a compounding part of speech, that is, a noun, adjective, or verb. By using these additions and also combinations of the main algorithms, a total of eleven algorithms were explored, as summarized in Table 18. For all algorithms, compounds can have an arbitrary number of parts.
In these experiments we used the unmarked representation scheme, which means that compound modifiers were normalized, but not marked with any symbol, and we used the EPOS-tagset. To handle normalization, we used the reverse normalization process in combination with all merging strategies. If there were any compound modifiers that could not be combined with the next word, in any of the algorithms, that part was left as a single word. For frequency calculations for the list-based strategies we used the German part of the SMT training corpus.
Table 19 shows the translation results using the different merging algorithms. The merging methods based on lists all give significantly worse results than the baseline on all metrics, except when combined with the symbol method. The compound-list+ method is the best method of the systems that only uses list information; it is significantly better than word-list and compound-list on all metrics, but still significantly worse than the baseline, symbol, and POS-match systems. When using the head-pos restriction, the results get even better, but here POS-tag information is used in addition to only frequency list information. There are no significant differences between the baseline and the systems with symbol or POS-match merging, but the trend on most metrics is that the POS-match systems have a slightly higher score than the other methods. However, as shown in Section 8.3 (Table 10), POS-match does block some erroneous merging. Especially with EPOS the order is so good that the matching constraint is rarely needed. Adding treatment of coordinated compounds to the POS-match algorithm changes scores marginally, but again, as shown in Table 10, it does improve merging for coordinated compounds. The combination of POS-match and compound-list+ was not successful, neither as a union nor as an intersection of the strategies. Table 19 only shows results with an EPOS-model. The result pattern for systems without an EPOS-model are similar to using an EPOS-model, but lower.
. | Bleu . | NIST . | Meteor . |
---|---|---|---|
baseline+POS | 20.2 (0.1) | 5.93 (0.02) | 27.6 (0.2) |
word-list | 17.9 (0.0)** | 5.70 (0.02)** | 25.8 (0.1)** |
word-list + head-pos | 19.3 (0.0)** | 5.83 (0.03)** | 27.1 (0.1)** |
compound-list | 18.1 (0.0)** | 5.61 (0.02)** | 26.0 (0.1)** |
compound-list+ | 18.7 (0.0)** | 5.66 (0.03)** | 26.6 (0.1)** |
symbol | 20.0 (0.0) | 5.96 (0.03) | 27.7 (0.1) |
symbol + head-pos | 20.0 (0.0) | 5.95 (0.03) | 27.8 (0.1) |
symbol ∧ word-list | 20.0 (0.0)** | 5.95 (0.03) | 27.8 (0.1) |
POS-match | 20.1 (0.1) | 5.97 (0.03) | 27.7 (0.1) |
POS-match + coord | 20.1 (0.0) | 5.97 (0.03) | 27.8 (0.1) |
POS-match ∧ compound-list+ | 19.0 (0.1)** | 5.70 (0.03)** | 26.7 (0.1)** |
POS-match ∨ compound-list+ | 18.8 (0.0)** | 5.74 (0.03)** | 26.3 (0.1)** |
. | Bleu . | NIST . | Meteor . |
---|---|---|---|
baseline+POS | 20.2 (0.1) | 5.93 (0.02) | 27.6 (0.2) |
word-list | 17.9 (0.0)** | 5.70 (0.02)** | 25.8 (0.1)** |
word-list + head-pos | 19.3 (0.0)** | 5.83 (0.03)** | 27.1 (0.1)** |
compound-list | 18.1 (0.0)** | 5.61 (0.02)** | 26.0 (0.1)** |
compound-list+ | 18.7 (0.0)** | 5.66 (0.03)** | 26.6 (0.1)** |
symbol | 20.0 (0.0) | 5.96 (0.03) | 27.7 (0.1) |
symbol + head-pos | 20.0 (0.0) | 5.95 (0.03) | 27.8 (0.1) |
symbol ∧ word-list | 20.0 (0.0)** | 5.95 (0.03) | 27.8 (0.1) |
POS-match | 20.1 (0.1) | 5.97 (0.03) | 27.7 (0.1) |
POS-match + coord | 20.1 (0.0) | 5.97 (0.03) | 27.8 (0.1) |
POS-match ∧ compound-list+ | 19.0 (0.1)** | 5.70 (0.03)** | 26.7 (0.1)** |
POS-match ∨ compound-list+ | 18.8 (0.0)** | 5.74 (0.03)** | 26.3 (0.1)** |
Table 20 shows the number of merges performed by applying the different algorithms. The word-list–based method produces the highest number of merges, including many cases that should not be merged. The number of merges is greatly reduced by the head-pos restriction, or by combining the word-list method with some other merging strategy. An investigation of the output of the word-list–based method shows that it often merges common words that incidentally form a new word, such as bei[at] and der [the] to beider [both]. Another type of error is due to errors in the corpus, such as merging umwelt [environment] and und [and], which exists as a mistyped word in the corpus, but is not a correct German word. These two error types are often prohibited by the head-pos restrictions and by the symbol and POS-match algorithms. The extensions of the simple word-list method avoid these types of errors, resulting in a lower number of merges. The compound-list+ method has the overall lowest number of splits, but it still is the best of the methods based only on frequency lists, without the use of POS or symbols. With the EPOS-model there are very small differences in number of merges between the methods based on symbol and POS-match, although this difference is much larger without the EPOS-model. For instance, the difference is 68 merges between symbol and POS-match with the EPOS-model, whereas without the EPOS-model the difference is 277. This difference shows that the EPOS-model clearly helps in improving coalescence.
. | with EPOS . | without EPOS . |
---|---|---|
word-list | 5,275 | 5,897 |
word-list + head-pos | 4,161 | 4,752 |
compound-list | 4,460 | 5,116 |
compound-list+ | 3,843 | 4,427 |
symbol | 4,431 | 5,144 |
symbol + head-pos | 4,323 | 4,832 |
symbol ∧ word-list | 4,178 | 4,753 |
POS-match | 4,363 | 4,867 |
POS-match + coord | 4,361 | 4,865 |
POS-match ∧ compound-list+ | 3,502 | 4,018 |
POS-match ∨ compound-list+ | 4,706 | 5,281 |
. | with EPOS . | without EPOS . |
---|---|---|
word-list | 5,275 | 5,897 |
word-list + head-pos | 4,161 | 4,752 |
compound-list | 4,460 | 5,116 |
compound-list+ | 3,843 | 4,427 |
symbol | 4,431 | 5,144 |
symbol + head-pos | 4,323 | 4,832 |
symbol ∧ word-list | 4,178 | 4,753 |
POS-match | 4,363 | 4,867 |
POS-match + coord | 4,361 | 4,865 |
POS-match ∧ compound-list+ | 3,502 | 4,018 |
POS-match ∨ compound-list+ | 4,706 | 5,281 |
The choice of merging method thus has a large impact on the final translation results. For merging to be successful we found that some knowledge source that is passed through the translation process, such as EPOS-tags, is needed. The pure word-list–based method performed the worst of all systems in most cases. On the automatic metrics, the POS-match and symbol methods gave about the same results as the baseline; we believe there are other strengths to this method as compared to the baseline system, however, such as the production of a higher number of compounds. A further advantage of the POS-match strategy is that it can form novel compounds, which is not the case for either the baseline or the word-list–based methods.
8.5.2 Merging as Sequence Labeling
Finally, we performed experiments where we compared our suggestion of a sequence labeling merging method to heuristic merging. We chose two heuristic strategies, the POS-match strategy, which gave the overall best results, and compound-list+, which gave the best results of the word-list based merging strategies without any POS or symbol information. We also found that the intersection of these two strategies worked well on some of our corpora, and decided to include that as well. In addition, we used a reference-based heuristic for creating the CRF training data, which identified merges resulting in compounds that are present in the reference data. Such a heuristic cannot be used at translation-time, however, because there is no reference data then. In this section we will refer to these heuristics as list for the compound-list+ heuristic, POS for the POS-match strategy, and ref for the reference-based heuristic.
For these experiments we used three data sets, as summarized in Table 21. For the training data for the sequence labeler we translated all of the training data, except one-word sentences, which were filtered out, for the smaller automotive corpora. For the Europarl experiment, we limited the training data to the 250,000 first sentences of the training data, and again filtered out one-word sentences. For the Danish corpora we also had access to additional data, which we used to control for the effect of reusing SMT training data. For the machine learning we wanted to use separate development and test sets. We chose to use the test sets from Table 4 for development testing, where we used the first 1,000 sentences for German Europarl, and picked new unseen 1,000 sentence test sets for the final evaluation. For frequency calculations of compounds and compound parts that were needed for compound splitting and some of the compound merging strategies, we used the target side of the SMT training data.
Corpus . | Europarl German . | Auto Swedish . | Auto Danish . |
---|---|---|---|
Compounds split | N, V, Adj | N, V, Adj | N |
POS tagsets | POS | POS | RPOS |
Training sentences CRF | 249,388 | 317,398 | 164,702 |
Extra training sentences CRF | – | – | 163,201 |
Corpus . | Europarl German . | Auto Swedish . | Auto Danish . |
---|---|---|---|
Compounds split | N, V, Adj | N, V, Adj | N |
POS tagsets | POS | POS | RPOS |
Training sentences CRF | 249,388 | 317,398 | 164,702 |
Extra training sentences CRF | – | – | 163,201 |
In these experiments we used the unmarked representation scheme. For German Europarl we used the unmarked scheme as it is with a binary sequence labeler. Reverse normalization can be performed afterwards as for the other strategies. For the automotive corpus we used the unmarked scheme without normalization.
We compared alternative combinations of the heuristics on our three validation data sets (see Figure 4). In order to estimate the amount of false negatives for all three heuristics, we inspected the first 100 sentences of each validation set, looking for words that should be merged, but were not marked by any of the heuristics. In no case could we find any such words, so we thus assumed that between them, the heuristics can find the overwhelming majority of all compounds to be merged. The differences across domains, Europarl and automotive, is quite striking, with a much higher number of compounds found by using the reference-based heuristic for the automotive corpus.
We conducted a round of preliminary experiments to identify the best combination of the heuristics available at training time (compound-list+, POS-match, and reference-based) to use to create automatically the training data for the CRF. The best results on the validation data are obtained by different combinations of heuristics for the three data sets, as could be expected by the different distribution of errors in Figure 4. In the following experiments we trained the CRF using for each data set the combination of heuristics corresponding to leaving out the gray portions of the Venn diagrams. This sort of preliminary optimization requires hand-labeling a certain amount of data. Based on our experiments, skipping this optimization and just using ref∨POS (the optimal configuration for the German–English Europarl corpus) seems to be a reasonable alternative.
The validation data was also used to set a frequency cut-off for feature occurrences (set at 3 in the following experiments) and to tune the regularization parameter in the CRF objective function. Results are largely insensitive to variations in these hyper-parameters, especially to the CRF regularization parameter.
For the Danish automotive corpus we had access to training data that had not been used to train the SMT system, so we could test the performance of the CRF when trained on that data as compared with training on the possibly biased data that was used to train the SMT system. Table 22 shows the results with the two types of training data. For the smallest training size, 20,000 sentences, the unseen training data is significantly better on recall and F-score. For the other sizes, there are no significant differences when the same amount of training data is used. When we added the unseen data to the SMT data, we only saw small non-significant improvements. This indicates that it is feasible to use translated SMT training data for the sequence labeler. We still decided to use all available training data for our subsequent experiments on Danish Europarl.
The overall merging results of the heuristics and the best sequence labeler are shown in Table 23. Notice how the list and POS heuristics have complementary sets of false negatives: when merging on the union of the two heuristics, the number of false negatives decreases drastically, in general compensating for the inevitable increase in false positives.
Among the heuristics, the combination of the improved list heuristic and the POS-based heuristic has a significantly higher recall and F-score than either heuristic alone in all cases on test data, and in several cases on validation data. The list heuristic alone performs reasonably well on the Swedish data set, but has a very low recall on the German and Danish data sets. In all four cases the SMT training data has been used for the list used by the heuristic, so this is unexpected, especially considering the fact that the Danish data set is in the same domain as one of the Swedish data sets. The Danish training data is smaller than the Swedish data though, which might be an influencing factor. It is possible that this heuristic could perform better on the other data sets given more data for frequency calculations.
The sequence labeler is competitive with the heuristics; on F-score it is only significantly worse than any of the heuristics once, for Danish auto test data, and in several cases it has a significantly higher F-score than some of the heuristics. The sequence labeler has a higher precision, significantly so in four cases, than the best heuristic, the combination heuristic, which is positive (because erroneously merged compounds are usually more disturbing for a reader or post-editor than non-merged compounds).
The differences between the systems on MT metrics are generally small and mostly not significant, as could be expected, since the differences were small on the intrinsic evaluation. For German Europarl only 10–739 out of 29,786 words are actually different between any pair of systems. The only noticeable differences between the systems are that the list system is worse than the other systems for Danish and German. The list strategy is competitive for Swedish automotive, which is much bigger than Danish, and in one of our previous experiments, with a large Swedish Europarl corpus (Stymne and Cancedda 2011), confirming our intuition that the list strategy tends to work better with large corpora. Even though we do not see much effect on MT metrics, we still think the small differences on the intrinsic evaluation are important. Compound merging is a postprocessing process, which operates directly on the actual translation results. This is quite different from compound splitting, as described in Section 8.4, which is performed as part of the preprocessing and where we cannot be sure on the impact of the preprocessing on the translation process.
The sequence labeler has the advantage over the heuristics that it is able to merge completely novel compounds, whereas the list strategy can only merge compounds that it has seen, and the POS-based strategy can create novel compounds, but only with known modifiers. An inspection of the test data showed that there were some novel compounds merged by the sequence labeler that were not identified with either of the heuristics. In the test data we found knap+start (button start) and vand+nedsænkning (water submersion) for Danish, and kvarts sekel (quarter century) and bostad(s)+ersättning (housing grant) for Swedish. This confirms that the sequence labeler, trained from automatically labeled data based on heuristics, can learn to merge new compounds that the heuristics themselves cannot find.
8.6 Effects of Corpus Size and Domain
Most of the previous experiments were performed on relatively small data sets, and it has been shown that effects of preprocessing strategies sometimes are larger on small corpora (e.g., Nießen and Ney 2004; Popović and Ney 2006). To investigate this issue, we first present a scaling experiment for small data sets from German Europarl, and then report results on two large data sets for translation from English to German on Europarl and news data.
Table 24 shows the result for baseline and EPOS systems trained on 100K and 440K sentences. In the smaller data condition the results are similar for the baseline and the EPOS system. With 440K data the EPOS system is significantly better than the baseline on all three metrics. We thus see that our compound processing method works well on medium-sized data sets as well, and is competitive to the baseline on very small data sets.
System . | Bleu . | NIST . | Meteor . |
---|---|---|---|
100K baseline+POS | 16.8 (0.0) | 5.37 (0.02) | 24.6 (0.1) |
100K EPOS | 16.9 (0.1) | 5.39 (0.02) | 24.8 (0.1) |
440K baseline+POS | 19.3 (0.1) | 5.76 (0.05) | 26.7 (0.2) |
440K EPOS | 19.7 (0.2)** | 5.83 (0.03)** | 27.0 (0.2)** |
System . | Bleu . | NIST . | Meteor . |
---|---|---|---|
100K baseline+POS | 16.8 (0.0) | 5.37 (0.02) | 24.6 (0.1) |
100K EPOS | 16.9 (0.1) | 5.39 (0.02) | 24.8 (0.1) |
440K baseline+POS | 19.3 (0.1) | 5.76 (0.05) | 26.7 (0.2) |
440K EPOS | 19.7 (0.2)** | 5.83 (0.03)** | 27.0 (0.2)** |
In our final experiments we used larger data sets from the workshop of statistical machine transition, year 2008 and 2010, which we will refer to as WMT08 and WMT10.7 The corpora used are presented in Table 25. In these experiments we also investigated the effects of in- and out-of-domain data. For WMT08 we trained and tuned on only Europarl data, and tested both in-domain, on Europarl and out-of-domain, on News. For WMT10 we trained on a mix of data, and tested on News. For these experiments we used morphologically enriched POS-tagsets. For WMT08 we used a commercial dependency parser for the morphology (Tapanainen and Järvinen 1997), and for WMT10 we used RFTagger (Schmid and Laws 2008). For the systems with split compounds we used the EPOS-tagset based on morphological tags. We used the POS-match merging algorithm.
. | WMT08 . | WMT10 . | ||
---|---|---|---|---|
Name/type . | Size . | Name/type . | Size . | |
Translation models | Europarl | 1,054,688 | Europarl+NewsComm | 1,462,401 |
Language model 1 | Europarl | 1,412,546 | Europarl+NewsComm | 1,885,872 |
Language model 2 | – | News | 17,449,589 | |
Tuning data | dev2006 | 600 | news-test2008 | 1,025 |
Test Set Europarl | test2007 | 2,000 | – | |
Test Set News | nc-test2007 | 2,007 | news-test2009 | 2,525 |
. | WMT08 . | WMT10 . | ||
---|---|---|---|---|
Name/type . | Size . | Name/type . | Size . | |
Translation models | Europarl | 1,054,688 | Europarl+NewsComm | 1,462,401 |
Language model 1 | Europarl | 1,412,546 | Europarl+NewsComm | 1,885,872 |
Language model 2 | – | News | 17,449,589 | |
Tuning data | dev2006 | 600 | news-test2008 | 1,025 |
Test Set Europarl | test2007 | 2,000 | – | |
Test Set News | nc-test2007 | 2,007 | news-test2009 | 2,525 |
Table 26 shows the results on the WMT08 evaluation. On Europarl the system with compound processing and an EPOS-model is significantly better than both baselines on NIST, and on the News domain it is significantly better than both baselines on Bleu. On the other metrics the EPOS system is on par with the baseline for both domains. The differences between the two baselines are mostly small. Overall, the scores are much lower on the out-of-domain data. Table 27 shows the results for the WMT10 evaluation, where more data was used to train the systems. The EPOS system is significantly better than the baseline without POS on all metrics, and also significantly better than the baseline with POS on Bleu. This shows that the suggested compound processing is effective also on larger data sets and out-of-domain data.
System . | Bleu . | NIST . | Meteor . |
---|---|---|---|
Baseline | 13.5 (0.1) | 5.29 (0.04) | 21.0 (0.2) |
Baseline+POS | 13.8 (0.1)* | 5.35 (0.01) | 21.3 (0.2) |
EPOS-unmarked | 14.1 (0.3)**## | 5.35 (0.07)** | 21.4 (0.1 |
System . | Bleu . | NIST . | Meteor . |
---|---|---|---|
Baseline | 13.5 (0.1) | 5.29 (0.04) | 21.0 (0.2) |
Baseline+POS | 13.8 (0.1)* | 5.35 (0.01) | 21.3 (0.2) |
EPOS-unmarked | 14.1 (0.3)**## | 5.35 (0.07)** | 21.4 (0.1 |
9. Conclusion
In this article, we have investigated several methods for splitting and merging compounds when translating into languages with closed compounds, with the goal of generating novel compounds. We have described methods for promoting coalescence and for deciding if and how to merge words that are either competitive with, or superior to, any previously published method. A common trait of all these methods is that they promote coalescence and allow formation of novel compounds by using customized *POS-tagsets. This shows that the integration of light-weight linguistic information through POS-tags is useful for compound processing into compounding languages. We believe that some of the techniques, such as customized POS-sequence models, could be used to target morphological phenomena other than compounding as well. We also believe that the suggested methods are likely to be useful for other compounding languages such as Finnish or Japanese.
For promoting compound coalescence we introduced additional LMs based on customized POS-tagsets, and with dedicated SMT model features counting the number of sequences known a priori to be desirable and undesirable. Experiments showed consistently good results when using the extended tagset EPOS in sequence models. The other tagsets were less successful, but on an automotive corpora a restricted tagset, RPOS worked well both in sequence models and using count features.
For compound splitting, previously proposed strategies for the opposite translation direction could be modified to improve performance when translating into German. We also confirmed previous results that showed no correlations between intrinsic evaluations of compound splitting and translation results in the other translation direction.
For merging we designed a heuristic algorithm based on part-of-speech matching, which gave consistently good results. We also improved existing heuristics based on frequency lists, which worked well on some data sets. We furthermore cast the compound merging problem as a sequence labeling problem, opening it to solutions based on a broad array of models and algorithms. We experimented with one model, conditional random fields, designed a set of easily computed features reaching beyond the information accessed by the heuristics, and showed that it gives very competitive results, even trained on automatically labeled data.
Our goal was to identify methods that can generate novel compounds. Depending on the choice of the features, the sequence labeling approach has the potential to be truly productive, that is, to form new compounds in an unrestricted way. The list-based heuristic is not productive: It can only form a compound if this was already observed as such. The POS-match heuristic yields some productivity. Because it uses special POS-tags for compound modifiers, it can form a compound provided its head has been seen either alone or as a head, and its modifier(s) have been seen elsewhere, possibly separately, as modifier(s) of compounds. The sequence labeling approach can decide to merge two consecutive words even if neither was ever seen before in a compound.
In the merging step we only considered the order of words as they appear in the 1-best output. Other options would include also attempting reordering of words in the output in addition to merging, or to use lattice or n-best output and perform compound merging and reranking in combination. However, we believe that the use of sequence models for coalescence gives sufficiently good results and minimizes the potential impact of more costly options.
In this work we mainly viewed the translation system as a black box. Our modifications to the PBSMT pipeline were based on pre- and postprocessing and on the use of *POS-sequence models. It is possible that further improvements can be had by integrating compound processing tighter into the decoder.
The rate of novel compounds is relatively low in the data sets used in our experiments, mainly due to the fact that most experiments are carried out with in-domain test data. However, the best methods tend to vary between the domain-restricted automotive corpora and Europarl. The importance of effective compound splitting and merging can be expected to grow in domain adaptation settings, outside the scope of this work.
Overall we have shown that systems with compound processing give consistently good results when they use POS-based language models and use either the POS-match heuristic or sequence labeling for compound merging. The difference between a baseline using a POS-model but no compound splitting and the split-merge EPOS-models is small in all experiments and does not seem to increase with the size of the training corpus. However, only the latter can produce novel compounds.
Acknowledgments
We would like to thank Maria Holmqvist and Tamás Gaál for valuable discussions about different aspects of the work in this article. We also thank the anonymous reviewers for their valuable comments on an earlier draft. The work presented in the article was mostly conducted while S. Stymne was at Linköping University and Xerox Research Centre Europe.
Notes
Nouns in German are capitalized. This is normally dealt with as further recasing postprocessing, and is an orthogonal problem from the one we deal with here.
The words that are processed in the inner if-clause are skipped in the outer loop.
This option can also be extended so that the training data is split into chunks where each chunk is translated by a system that is trained on the remaining chunks, a strategy that has been successfully used for parse reranking (Collins and Koo 2005). Because we found that using fresh data did not give any significant improvements on the merging task except with little training data (see Section 8.5.2, Table 22), we choose not to use this approach since it would be very time consuming and thus impractical.
References
Author notes
Department of Linguistics and Philology, Uppsala University, Box 635, 751 26 Uppsala, Sweden. E-mail: [email protected].
Xerox Research Centre Europe, 6 chemin de Maupertuis, 38240 Meylan, France. E-mail: [email protected].
Department of Computer and Information Science, Linköping University, 58183 Linköping, Sweden. E-mail: [email protected].