Table 4: The Pearson r, Spearman ρ, and Kendall τ correlation coefficients calculated between the metrics’ scores and expert responsiveness judgments on the TAC’08 (left) and TAC’09 (right) datasets. QAEval has the highest system-level correlations, even better than the fully manual Pyramid Score, whereas the summary-level correlations are lower (EM) or competitive (F1) with other metrics. We believe this supports our hypothesis that the QA model and answer verification are noisy (causing lower summary-level correlations) but average out to a high-quality metric given enough QA pairs (causing high system-level correlations). On TAC’09, the QA r values are much lower because of an outlier, and r is sensitive to outliers. If the outlier is removed, the r values become 0.92 and 0.93 for EM and F1.

               TAC 2008                            TAC 2009
               System-Level     Summary-Level      System-Level     Summary-Level
Metric         r    ρ    τ      r    ρ    τ        r    ρ    τ      r    ρ    τ
Pyramid Score  .90  .88  .70    .59  .59  .50      .90  .87  .70    .59  .57  .48

ROUGE-1        .79  .80  .60    .49  .48  .39      .83  .78  .60    .54  .47  .38
ROUGE-2        .83  .87  .67    .48  .48  .39      .76  .84  .67    .50  .50  .40
ROUGE-L        .74  .77  .57    .46  .45  .36      .82  .72  .54    .54  .47  .37
ROUGE-SU4      .80  .83  .63    .49  .48  .39      .77  .81  .63    .52  .50  .39
PyrEval        .81  .79  .59    .31  .31  .25      .86  .82  .64    .39  .35  .28
MoverScore     .83  .82  .63    .50  .49  .40      .82  .80  .63    .51  .52  .42
APES           .74  .82  .60    .25  .25  .21      .87  .80  .63    .41  .35  .28

QAEval-EM      .93  .91  .76    .33  .33  .27      .70  .87  .69    .42  .38  .30
QAEval-F1      .90  .88  .71    .46  .45  .36      .81  .89  .72    .50  .45  .36
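
The two correlation levels and the outlier effect noted in the caption can be illustrated with a small sketch. It assumes the commonly used definitions: the system-level correlation compares per-system average scores, and the summary-level correlation is computed per input across systems and then averaged over inputs. The function names and the toy numbers are hypothetical and are not taken from the TAC data; Kendall τ is omitted for brevity.

```python
from statistics import mean


def pearson(x, y):
    """Pearson r between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rho: Pearson r computed on the ranks."""
    return pearson(ranks(x), ranks(y))


def system_level(metric, human, corr=pearson):
    """Correlate per-system averages; metric[s][d] is system s on input d."""
    return corr([mean(row) for row in metric], [mean(row) for row in human])


def summary_level(metric, human, corr=pearson):
    """Correlate across systems for each input, then average over inputs."""
    n_inputs = len(metric[0])
    per_input = [
        corr([row[d] for row in metric], [row[d] for row in human])
        for d in range(n_inputs)
    ]
    return mean(per_input)


# Toy scores (made up): 3 systems x 3 inputs.
toy_metric = [[0.2, 0.5, 0.1], [0.4, 0.3, 0.6], [0.7, 0.8, 0.5]]
toy_human = [[1, 3, 2], [3, 2, 3], [4, 5, 5]]
sys_r = system_level(toy_metric, toy_human)
sum_r = summary_level(toy_metric, toy_human)

# Outlier illustration (made up): one extreme point lowers r,
# while rho is unchanged because the ordering is still monotone.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 2, 3, 4, 5, 30]
r_with_outlier = pearson(xs, ys)
rho_with_outlier = spearman(xs, ys)
r_without_outlier = pearson(xs[:-1], ys[:-1])
```

In the toy outlier example, Spearman ρ stays at 1 while Pearson r drops well below 1, and removing the single extreme point restores r. This mirrors the pattern the caption describes for the TAC’09 system-level r values.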