Table 3:
Automatic evaluation metric performance on random negatives. PBC refers to point-biserial correlation. The column subheading 'Single' denotes experiments using a single reference response; 'Avg' and 'Max' denote the average and maximum aggregation strategies when using multiple reference responses; 'Standard' applies when a metric aggregates multiple references in its own way. * indicates statistically significant improvement over all other metrics (p-values < 1e-9) under William's test for comparing correlations and the Chi-squared test for accuracies. p-values for individual correlations are given in parentheses.
| Metric | PBC Single | PBC Avg | PBC Max | PBC Standard | Acc. Single (%) | Acc. Avg (%) | Acc. Max (%) | Acc. Standard (%) |
|---|---|---|---|---|---|---|---|---|
| BLEU-1 | 0.26 (<1e-9) | 0.42 (<1e-9) | 0.41 (<1e-9) | 0.41 (<1e-9) | 61.26 | 68.60 | 68.75 | 70.36 |
| BLEU-2 | 0.22 (<1e-9) | 0.39 (<1e-9) | 0.36 (<1e-9) | 0.40 (<1e-9) | 58.09 | 68.26 | 68.37 | 68.66 |
| BLEU-3 | 0.14 (<1e-9) | 0.26 (<1e-9) | 0.24 (<1e-9) | 0.28 (<1e-9) | 53.11 | 58.85 | 58.90 | 58.89 |
| BLEU-4 | 0.08 (<1e-9) | 0.17 (<1e-9) | 0.15 (<1e-9) | 0.18 (<1e-9) | 51.16 | 53.56 | 53.56 | 53.50 |
| METEOR | 0.23 (<1e-9) | 0.40 (<1e-9) | 0.41 (<1e-9) | − | 59.77 | 68.51 | 68.01 | − |
| ROUGE-L | 0.23 (<1e-9) | 0.41 (<1e-9) | 0.40 (<1e-9) | 0.37 (<1e-9) | 59.47 | 67.89 | 68.25 | 68.43 |
| deltaBLEU (Galley et al., 2015) | − | − | − | 0.29 (<1e-9) | − | − | − | 64.89 |
| Embed Avg | 0.23 (<1e-9) | 0.25 (<1e-9) | 0.23 (<1e-9) | − | 61.27 | 61.56 | 62.67 | − |
| Vec Extr (Forgues et al., 2014) | 0.24 (<1e-9) | 0.35 (<1e-9) | 0.33 (<1e-9) | − | 59.22 | 63.70 | 63.90 | − |
| GreedyMatch (Rus and Lintean, 2012) | 0.24 (<1e-9) | 0.36 (<1e-9) | 0.32 (<1e-9) | − | 60.02 | 63.99 | 65.56 | − |
| BERTScore (Zhang et al., 2020a) | 0.29 (<1e-9) | 0.39 (<1e-9) | 0.39 (<1e-9) | − | 63.71 | 69.05 | 68.59 | − |
| ADEM (Lowe et al., 2017) | 0.40 (<1e-9) | − | − | − | 64.74 | − | − | − |
| BERT regressor (Shimanaka et al., 2019) | 0.52 (<1e-9) | − | − | − | 73.40 | − | − | − |
| BERT+DNN (Ghazarian et al., 2019) | 0.57 (<1e-9) | − | − | − | 74.67 | − | − | − |
| RUBER (Tao et al., 2018) | 0.64 (<1e-9) | − | − | − | 78.18 | − | − | − |
| RUBER-Large (Tao et al., 2018) | 0.69 (<1e-9) | − | − | − | 82.36 | − | − | − |
| DEB (ours) | 0.79* (<1e-9) | − | − | − | 88.27* | − | − | − |

The model-based metrics in the last block (ADEM through DEB) produce a single score per response irrespective of the number of references, so only one correlation and one accuracy value is reported for each.
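To make the evaluation protocol behind the 'Avg' and 'Max' columns concrete, below is a minimal Python sketch (not the authors' code) of how a reference-based metric could be scored against multiple references and then correlated with the positive/random-negative labels via point-biserial correlation. `metric_fn`, `responses`, `reference_sets`, and `labels` are hypothetical placeholders, and the accuracy threshold (midpoint of the two class means) is an assumption; the excerpt does not specify how scores are binarized.

```python
# A sketch of the Avg/Max aggregation and PBC computation in Table 3.
import numpy as np
from scipy.stats import pointbiserialr


def aggregate_scores(metric_fn, response, references, strategy="avg"):
    """Score one response against each reference, then aggregate."""
    scores = [metric_fn(response, ref) for ref in references]
    return np.mean(scores) if strategy == "avg" else np.max(scores)


def evaluate(metric_fn, responses, reference_sets, labels, strategy="avg"):
    """labels: 1 for a valid response, 0 for a random negative."""
    scores = [
        aggregate_scores(metric_fn, resp, refs, strategy)
        for resp, refs in zip(responses, reference_sets)
    ]
    pbc, p_value = pointbiserialr(labels, scores)
    # Assumed binarization: threshold at the midpoint of the class means.
    pos_mean = np.mean([s for s, l in zip(scores, labels) if l == 1])
    neg_mean = np.mean([s for s, l in zip(scores, labels) if l == 0])
    threshold = (pos_mean + neg_mean) / 2
    acc = np.mean([(s >= threshold) == bool(l) for s, l in zip(scores, labels)])
    return pbc, p_value, 100 * acc
```

With `strategy="max"`, a response is credited for matching its closest reference, which is why the multi-reference columns generally improve over 'Single' for the word-overlap and embedding-based metrics in the table.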