In the second experiment, we used the first data set as training data to optimize the Bleu score by MERT; the second data set was then used to re-rank the 1,000-best list and to compute the Bleu score. To obtain a confidence interval for the Bleu score, we used the bootstrap resampling method described by Koehn (2004): we randomly selected 10 of the 20 re-ranked documents in the second data set with replacement, drew the translation results of those 10 documents, and computed the Bleu score; we repeated this procedure 1,000 times. To compute the 95% confidence interval, we dropped the top 25 and bottom 25 Bleu scores and considered only the range from the 26th to the 975th Bleu score. Table 11 shows the Bleu scores. These statistics are computed with different language models, but on the same test sets. The 5-gram gives a 0.51 percentage point Bleu score improvement over the baseline. The composite 5-gram/2-SLM + 2-gram/4-SLM + 5-gram/PLSA1 language model gives a 1.19 percentage point Bleu score improvement over the baseline and a 0.68 percentage point improvement over the 5-gram.

Table 11 

Bleu score results for the task of re-ranking the 1,000-best list generated on 191 sentences of 20 documents from the MT04 Chinese–English evaluation set.

system model                                      mean (%)   95% CI (%)
Baseline                                          27.59      0.31
5-gram                                            28.10      0.32
5-gram/2-SLM + 2-gram/4-SLM                       28.34      0.32
5-gram/PLSA1                                      28.53      0.31
5-gram/2-SLM + 2-gram/4-SLM + 5-gram/PLSA1        28.78      0.31
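The resampling procedure above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes hypothetical per-document scores and averages them as a stand-in for corpus Bleu, whereas the actual procedure of Koehn (2004) recomputes corpus-level Bleu from the pooled n-gram statistics of each resampled document set. The percentile step matches the text: with 1,000 bootstrap scores, the top 25 and bottom 25 are dropped, keeping the 26th through 975th.

```python
import random

def percentile_interval(scores, alpha=0.05):
    """95% CI by dropping the top and bottom alpha/2 of sorted scores."""
    s = sorted(scores)
    n = len(s)
    k = int(n * alpha / 2)   # 25 when n = 1,000
    return s[k], s[n - k - 1]  # 26th and 975th values (1-indexed)

def bootstrap_scores(doc_scores, n_resamples=1000, sample_size=10, seed=0):
    """Resample documents with replacement and score each resample.

    doc_scores: hypothetical per-document Bleu proxies (assumption);
    the real procedure pools n-gram counts and recomputes corpus Bleu.
    """
    rng = random.Random(seed)
    boot = []
    for _ in range(n_resamples):
        sample = [rng.choice(doc_scores) for _ in range(sample_size)]
        boot.append(sum(sample) / sample_size)  # stand-in for corpus Bleu
    return boot

# Toy data: 20 documents with made-up scores around 28%.
docs = [27.0 + 0.1 * i for i in range(20)]
boot = bootstrap_scores(docs)
lo, hi = percentile_interval(boot)
```

Reporting the half-width of (lo, hi) around the mean gives a symmetric "±" figure of the kind shown in the 95% CI column of Table 11.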
