Rank (lower is better) of the candidate selected by each decoding method among the 1,000 candidates, as scored by Bleurt v0.2 (BL.2). The percentiles are computed over the 1,002 test queries of Newstest2021 En→De. A smaller value indicates that the chosen candidate is also preferred by the actual Ref-C BL.2 metric. The table suggests that MBR provides more stable quality estimates than single references.

Rank wrt Bleurt v0.2 Ref-C
p5 p25 p50 p75 p95
MAP  13 78 181 355 717
Oracle Ref-D 18 78 327
MBR BL.2 26 105
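
The rank percentiles reported above can be reproduced in spirit with a short sketch. It is only an illustration under stated assumptions: the function names are hypothetical, ties are resolved in favour of the selected candidate, and the nearest-rank percentile definition is an assumption, since the caption does not say which percentile method was used. The scores stand in for per-candidate BL.2 scores against Ref-C.

```python
import math


def rank_of_selected(scores, selected_idx):
    """Return the 1-based rank of the selected candidate (1 = highest score).

    Ties are resolved in favour of the selected candidate (an assumption).
    """
    selected_score = scores[selected_idx]
    return 1 + sum(1 for s in scores if s > selected_score)


def percentile(values, p):
    """Nearest-rank percentile (assumed; the source does not state its method)."""
    ordered = sorted(values)
    k = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[k - 1]


def rank_percentiles(per_query_scores, per_query_selected, ps=(5, 25, 50, 75, 95)):
    """Rank the selected candidate for each query, then summarise over queries."""
    ranks = [
        rank_of_selected(scores, idx)
        for scores, idx in zip(per_query_scores, per_query_selected)
    ]
    return {p: percentile(ranks, p) for p in ps}
```

In the setting of the table, `per_query_scores` would hold 1,002 lists of 1,000 metric scores each, and `per_query_selected` the index of the candidate chosen by a given decoding method (MAP, Oracle Ref-D, or MBR BL.2) for each query.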
