Table 9: Human correlations on DailyDialog++ data with different models. (Individual p-values in parentheses.) * indicates statistical significance in performance over other models, with p-values <1e-6 on the Williams test.
| Model | Pearson | Spearman | Kendall tau |
| --- | --- | --- | --- |
| Response level | | | |
| BERT+DNN | 0.016 (0.73) | 0.009 (0.89) | 0.007 (0.88) |
| RUBER | 0.111 (2.5e-2) | 0.126 (1.1e-2) | 0.090 (8.9e-2) |
| RUBER-Large | 0.265 (<1e-7) | 0.256 (<1e-6) | 0.173 (<1e-6) |
| DEB w/o Reddit | 0.356 (<1e-9) | 0.295 (<1e-9) | 0.202 (<1e-9) |
| DEB w/o DD++ | 0.274 (<1e-9) | 0.337 (<1e-9) | 0.232 (<1e-9) |
| DEB | 0.440* (<1e-9) | 0.523* (<1e-9) | 0.374* (<1e-9) |
| System level | | | |
| BERT+DNN | 0.050 (0.89) | -0.100 (0.87) | 0.000 (1.1) |
| RUBER | 0.221 (0.72) | 0.300 (0.62) | 0.200 (0.81) |
| RUBER-Large | 0.679 (0.20) | 0.499 (0.39) | 0.399 (0.483) |
| DEB w/o Reddit | 0.784 (0.12) | 0.600 (0.28) | 0.400 (0.48) |
| DEB w/o DD++ | 0.855 (0.06) | 0.600 (0.28) | 0.400 (0.48) |
| DEB | 0.973 (5.2e-3) | 0.700 (0.18) | 0.600 (0.23) |
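
As an illustration of the three correlation coefficients reported in the table, the following is a minimal pure-Python sketch of how a metric's scores might be correlated with human ratings. It is not the authors' evaluation code: the function names and the toy score lists are hypothetical, Kendall's tau is assumed to be the tau-b variant, and p-values and the Williams test are not covered.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

def kendall_tau_b(x, y):
    """Kendall tau-b computed by direct pair counting (O(n^2))."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both lists: excluded from all counts
            if dx == 0:
                tied_x_only += 1
            elif dy == 0:
                tied_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = math.sqrt((concordant + discordant + tied_x_only)
                      * (concordant + discordant + tied_y_only))
    return (concordant - discordant) / denom

# Hypothetical toy data: human ratings and one metric's scores
# for the same five responses (not taken from the paper).
human_scores = [1, 2, 3, 4, 5]
metric_scores = [0.10, 0.40, 0.35, 0.80, 0.90]

print("Pearson: ", round(pearson(metric_scores, human_scores), 3))
print("Spearman:", round(spearman(metric_scores, human_scores), 3))      # 0.9
print("Kendall: ", round(kendall_tau_b(metric_scores, human_scores), 3))  # 0.8
```

At the response level each list would hold one entry per scored response; at the system level each entry would instead be a per-system aggregate, which leaves far fewer data points and is one plausible reason the system-level p-values in the table are much larger.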