
We have applied our composite 5-gram/2-SLM + 2-gram/4-SLM + 5-gram/PLSA1 language model, trained on a 1.3-billion-word corpus, to the task of re-ranking the N-best list in statistical MT. We used the same two 1,000-best lists that were used by Zhang and colleagues (Zhang, Hildebrand, and Vogel 2006; Zhang 2008; Zhang et al. 2011). The first list was generated on 919 sentences of 100 documents from the MT03 Chinese–English evaluation set, and the second on 191 sentences of 20 documents from the MT04 Chinese–English evaluation set, both by Hiero (Chiang 2007), a state-of-the-art parsing-based translation model. Its decoder uses a trigram language model trained with modified Kneser-Ney smoothing (Jurafsky and Martin 2008) on a 200-million-token corpus. Each translation has 11 features, one of which is the language model score. We substitute our language model for this feature and use MERT (Och 2003) to optimize the Bleu score (Papineni et al. 2002).

We conduct two experiments on these two data sets. In the first experiment, we partition the first data set of 100 documents into ten pieces of 10 documents each; nine pieces are used as training data to optimize the Bleu score by MERT, and the remaining piece is used to re-rank the 1,000-best list and obtain the Bleu score. This cross-validation process is repeated 10 times (the folds), with each of the 10 pieces used exactly once as validation data. The 10 fold results are then averaged (or otherwise combined) to produce a single estimate of the Bleu score. For each language model we calculate the mean and variance of the Bleu score, assume the score follows Student's t-distribution, and compute the 95% confidence interval from the mean and variance. Table 10 shows the Bleu scores obtained through 10-fold cross-validation.
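The confidence-interval computation described above can be sketched as follows. The per-fold Bleu scores are illustrative placeholders (the paper reports only the aggregated mean and CI, not per-fold numbers); the critical value 2.262 is the standard two-sided 95% value of Student's t with 9 degrees of freedom, matching 10 folds.

```python
import math
from statistics import mean, stdev

# Hypothetical per-fold Bleu scores (percent) from 10-fold cross-validation;
# these values are illustrative, not the paper's actual fold-level results.
fold_scores = [33.1, 33.5, 33.2, 33.4, 33.0, 33.6, 33.3, 33.2, 33.5, 33.4]

def t_confidence_interval(scores, t_crit=2.262):
    """Return (mean, 95% CI half-width), assuming the fold scores follow
    Student's t-distribution. t_crit is the two-sided 95% critical value
    for n - 1 = 9 degrees of freedom when n = 10 folds."""
    n = len(scores)
    m = mean(scores)
    se = stdev(scores) / math.sqrt(n)  # standard error of the mean
    return m, t_crit * se

m, half_width = t_confidence_interval(fold_scores)
```

A reported entry such as "33.32 ± 0.25" in Table 10 corresponds to `m` and `half_width` computed this way.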
The composite 5-gram/2-SLM + 2-gram/4-SLM + 5-gram/PLSA1 language model gives a 1.57-percentage-point Bleu score improvement over the baseline and a 0.79-percentage-point improvement over the 5-gram. We are not able to improve the Bleu score further with either the 5-gram/2-SLM + 2-gram/4-SLM + 5-gram/PLSA2 or the 5-gram/2-SLM + 2-gram/4-SLM + 5-gram/PLSA3 model, because there is little diversity in the 1,000-best list: essentially only 20–30 distinct sentences appear among the 1,000 entries.
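The re-ranking step itself can be sketched as below. The feature layout, weights, and toy language model scorer are all hypothetical; in the experiments the weights come from MERT and each Hiero translation carries 11 features, with the language model feature replaced by the composite model's score.

```python
# Minimal sketch of N-best re-ranking with a substituted language model
# feature. Weights and the lm_score function are illustrative assumptions.

def rerank(nbest, lm_score, weights):
    """nbest: list of (translation, feature_vector) pairs, where
    feature_vector[0] is the language model feature to be replaced.
    Returns translations sorted by weighted feature sum, best first."""
    def score(item):
        translation, feats = item
        feats = [lm_score(translation)] + list(feats[1:])  # swap in new LM
        return sum(w * f for w, f in zip(weights, feats))
    return [t for t, _ in sorted(nbest, key=score, reverse=True)]

# Toy example with 3 features (real Hiero hypotheses have 11 each)
nbest = [("hyp a", [-10.0, -2.0, 1.0]),
         ("hyp b", [-8.0, -3.0, 1.5])]
weights = [1.0, 0.5, 0.2]
best = rerank(nbest, lambda t: -len(t.split()), weights)
```

With only 20–30 distinct sentences among the 1,000 entries, many `nbest` items are duplicates, which limits how much any re-ranker can change the top hypothesis.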

Table 10 

10-fold cross-validation Bleu score results for the task of re-ranking the 1,000-best list generated on 919 sentences of 100 documents from the MT03 Chinese–English evaluation set.

system model                                       mean (%)   95% CI (%)
Baseline                                           31.75      0.22
5-gram                                             32.53      0.24
5-gram/2-SLM + 2-gram/4-SLM                        32.87      0.24
5-gram/PLSA1                                       33.01      0.24
5-gram/2-SLM + 2-gram/4-SLM + 5-gram/PLSA1         33.32      0.25
