Pearson r, Spearman ρ, and Kendall τ correlation coefficients between the metrics’ scores and expert responsiveness judgments on the TAC’08 (left) and TAC’09 (right) datasets. QAEval has the highest system-level correlations, even higher than the fully manual Pyramid Score, whereas its summary-level correlations are lower than (EM) or competitive with (F1) those of other metrics. We believe this supports our hypothesis that the QA model and answer verification are noisy (causing lower summary-level correlations) but average out to a high-quality metric given enough QA pairs (causing high system-level correlations). On TAC’09, the system-level r values for QAEval are much lower because of an outlier, to which r is sensitive. If the outlier is removed, the r values become 0.92 and 0.93 for EM and F1, respectively.
Metric (TAC 2008) | System-Level r | System-Level ρ | System-Level τ | Summary-Level r | Summary-Level ρ | Summary-Level τ | Metric (TAC 2009) | System-Level r | System-Level ρ | System-Level τ | Summary-Level r | Summary-Level ρ | Summary-Level τ |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Pyramid Score | .90 | .88 | .70 | .59 | .59 | .50 | Pyramid Score | .90 | .87 | .70 | .59 | .57 | .48 |
ROUGE-1 | .79 | .80 | .60 | .49 | .48 | .39 | ROUGE-1 | .83 | .78 | .60 | .54 | .47 | .38 |
ROUGE-2 | .83 | .87 | .67 | .48 | .48 | .39 | ROUGE-2 | .76 | .84 | .67 | .50 | .50 | .40 |
ROUGE-L | .74 | .77 | .57 | .46 | .45 | .36 | ROUGE-L | .82 | .72 | .54 | .54 | .47 | .37 |
ROUGE-SU4 | .80 | .83 | .63 | .49 | .48 | .39 | ROUGE-SU4 | .77 | .81 | .63 | .52 | .50 | .39 |
PyrEval | .81 | .79 | .59 | .31 | .31 | .25 | PyrEval | .86 | .82 | .64 | .39 | .35 | .28 |
MoverScore | .83 | .82 | .63 | .50 | .49 | .40 | MoverScore | .82 | .80 | .63 | .51 | .52 | .42 |
APES | .74 | .82 | .60 | .25 | .25 | .21 | APES | .87 | .80 | .63 | .41 | .35 | .28 |
QAEval-EM | .93 | .91 | .76 | .33 | .33 | .27 | QAEval-EM | .70 | .87 | .69 | .42 | .38 | .30 |
QAEval-F1 | .90 | .88 | .71 | .46 | .45 | .36 | QAEval-F1 | .81 | .89 | .72 | .50 | .45 | .36 |
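
The distinction between system-level and summary-level correlations can be made concrete with a small sketch. It assumes the commonly used definitions, which this table does not spell out: the system-level correlation first averages each system's scores over all inputs and then correlates the averages, while the summary-level correlation is computed across systems separately for each input and then averaged. The function names (`pearson`, `system_level`, `summary_level`) and the toy score matrices are hypothetical and are not taken from the TAC data.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson r between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def system_level(metric, human, corr=pearson):
    """Average each system's scores over all inputs, then correlate the averages.

    metric[i][j] and human[i][j] hold the scores of system j's summary for input i.
    """
    n_systems = len(metric[0])
    metric_avg = [mean(row[j] for row in metric) for j in range(n_systems)]
    human_avg = [mean(row[j] for row in human) for j in range(n_systems)]
    return corr(metric_avg, human_avg)


def summary_level(metric, human, corr=pearson):
    """Correlate across systems separately for each input, then average the results."""
    return mean(corr(m_row, h_row) for m_row, h_row in zip(metric, human))


# Hypothetical toy data: 3 inputs (rows) x 3 systems (columns).
human = [[1, 2, 3], [1, 2, 3], [1, 2, 3]]
metric = [[2, 1, 3], [1, 3, 2], [1, 2, 3]]  # noisy on individual inputs

sys_r = system_level(metric, human)
sum_r = summary_level(metric, human)
print(round(sys_r, 2), round(sum_r, 2))  # 1.0 0.67
```

In this toy example the metric misranks systems on two of the three inputs, so the summary-level correlation is only about 0.67, yet the per-system averages line up perfectly with the human averages. This is the same averaging effect the caption hypothesizes for QAEval. Any rank correlation could be passed as `corr` in place of `pearson`.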