In the second experiment, we used the first data set as training data to optimize the Bleu score by MERT; the second data set was then used to re-rank the 1,000-best lists and obtain the Bleu score. To obtain a confidence interval for the Bleu score, we resorted to the bootstrap resampling described by Koehn (2004): we randomly selected 10 re-ranked documents, with replacement, from the 20 re-ranked documents in the second data set, pooled the translation results of these 10 documents, and computed the Bleu score. We repeated this procedure 1,000 times. To compute the 95% confidence interval, we dropped the top 25 and bottom 25 Bleu scores and considered only the range from the 26th to the 975th Bleu score. Table 11 shows the Bleu scores. These statistics are computed with different language models, but on the same chosen test sets. The 5-gram gives a 0.51-percentage-point Bleu score improvement over the baseline. The composite 5-gram/2-SLM + 2-gram/4-SLM + 5-gram/PLSA^{1} language model gives a 1.19-percentage-point Bleu score improvement over the baseline and a 0.68-percentage-point Bleu score improvement over the 5-gram.
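The bootstrap procedure above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes each document is summarized by its n-gram match/total counts and hypothesis/reference lengths, so that corpus-level Bleu can be computed from pooled sufficient statistics; the function and field names (`corpus_bleu`, `bootstrap_ci`, `matches`, `totals`) are hypothetical.

```python
import math
import random

def corpus_bleu(docs, max_n=4):
    """Corpus-level Bleu from pooled per-document sufficient statistics.

    Each doc is a dict with 'matches' and 'totals' lists (clipped n-gram
    hits and candidate n-gram counts for n = 1..max_n) plus hypothesis
    and reference lengths. Field names are illustrative assumptions.
    """
    matches = [sum(d["matches"][n] for d in docs) for n in range(max_n)]
    totals = [sum(d["totals"][n] for d in docs) for n in range(max_n)]
    hyp_len = sum(d["hyp_len"] for d in docs)
    ref_len = sum(d["ref_len"] for d in docs)
    if any(m == 0 for m in matches):
        return 0.0
    # Geometric mean of modified n-gram precisions.
    log_prec = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: 1 if the hypothesis is at least as long as the reference.
    bp = min(1.0, math.exp(1.0 - ref_len / hyp_len))
    return bp * math.exp(log_prec)

def bootstrap_ci(docs, n_samples=1000, sample_size=10, seed=0):
    """Koehn (2004)-style bootstrap: draw sample_size documents with
    replacement, score each replicate, repeat n_samples times, then drop
    the top and bottom 2.5% of scores (the 26th-975th of 1,000 remain)."""
    rng = random.Random(seed)
    scores = sorted(
        corpus_bleu([rng.choice(docs) for _ in range(sample_size)])
        for _ in range(n_samples)
    )
    lo = scores[int(0.025 * n_samples)]        # 26th score of 1,000
    hi = scores[int(0.975 * n_samples) - 1]    # 975th score of 1,000
    return sum(scores) / n_samples, (lo, hi)
```

In practice one would score the pooled translations directly against their references; pooling precomputed per-document counts, as above, gives the same corpus-level Bleu and makes each bootstrap replicate cheap.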

Table 11

| system model | mean (%) | 95% CI (%) |
|---|---|---|
| Baseline | 27.59 | 0.31 |
| 5-gram | 28.10 | 0.32 |
| 5-gram/2-SLM + 2-gram/4-SLM | 28.34 | 0.32 |
| 5-gram/PLSA^{1} | 28.53 | 0.31 |
| 5-gram/2-SLM + 2-gram/4-SLM + 5-gram/PLSA^{1} | 28.78 | 0.31 |

