Table 4 uses kappa scores to measure the agreement between FactBank and our annotations on this 500-sentence subset of the data. We treat FactBank as one annotator and our Turkers collectively as a second annotator, taking the Turkers' majority label as that annotator's label. Agreement is modest to very high for all the categories except Uu (and PS−, which is nearly absent from this subset). The agreement level is also relatively low for CT+. The corresponding confusion matrix in Table 5 helps explain these numbers. The Uu category is used much more often in FactBank than by Turkers, and the dominant alternative choice for the Turkers was CT+. Thus, the low score for Uu also effectively drops the score for CT+. The question is why this contrast exists. In other words, why do Turkers choose CT+ where FactBank says Uu?
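To make the two-annotator setup concrete, here is a minimal sketch of an overall and a per-category kappa computation. The text does not say which kappa variant or which per-category scheme was used, so the sketch assumes Cohen's kappa with a one-vs-rest binarization for each category. The function names and the toy label lists are illustrative only and are not drawn from the study's data.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of labels."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(a)
    # Observed agreement: share of items on which the annotators match.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return float("nan")  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def per_category_kappa(a, b, label):
    """One-vs-rest kappa: collapse every label to 'is it `label` or not'."""
    return cohen_kappa([x == label for x in a], [y == label for y in b])


# Toy data (assumed, not from the paper); ASCII '-' stands in for the minus sign.
factbank = ["CT+", "Uu", "Uu", "CT-", "Uu", "CT+"]
mturk = ["CT+", "CT+", "CT+", "CT-", "Uu", "CT+"]

print("overall", round(cohen_kappa(factbank, mturk), 3))
for label in ["CT+", "CT-", "Uu"]:
    print(label, round(per_category_kappa(factbank, mturk, label), 3))
```

In the toy data, the second annotator often answers CT+ where the first says Uu. This lowers the one-vs-rest score for both labels at once, which is the same coupling the paragraph above describes.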

Table 4.

Inter-annotator agreement comparing FactBank annotations with MTurk annotations. The data are limited to the 500 examples in which at least 6 of the 10 Turkers agreed on the label, which is then taken to be the true MTurk label. The very poor value for PS− derives from the fact that, in this subset, the label was chosen only once in FactBank and not at all by our annotators.

| Category | κ | p value |
|---|---|---|
| CT+ | 0.37 | < 0.001 |
| PR+ | 0.79 | < 0.001 |
| PS+ | 0.86 | < 0.001 |
| CT− | 0.91 | < 0.001 |
| PR− | 0.77 | < 0.001 |
| PS− | −0.001 | = 0.982 |
| Uu | 0.06 | = 0.203 |
| Overall | 0.60 | < 0.001 |
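
The selection rule stated in the Table 4 caption (keep an example only when at least 6 of its 10 Turkers agree, and take that label as the MTurk label) can be sketched as follows. The names `majority_label` and `min_votes`, the example identifiers, and the toy judgments are all hypothetical and are not taken from the study's materials.

```python
from collections import Counter


def majority_label(judgments, min_votes=6):
    """Return the most frequent label if at least `min_votes` annotators
    chose it; otherwise return None (the example is excluded)."""
    if not judgments:
        return None
    label, votes = Counter(judgments).most_common(1)[0]
    return label if votes >= min_votes else None


# Toy judgments (assumed): 10 Turker labels per example.
examples = {
    "ex1": ["CT+"] * 7 + ["Uu"] * 3,                # 7 of 10 agree: kept as CT+
    "ex2": ["CT+"] * 5 + ["Uu"] * 5,                # no label reaches 6: dropped
    "ex3": ["PR+"] * 6 + ["PS+"] * 2 + ["Uu"] * 2,  # 6 of 10 agree: kept as PR+
}

# Keep only the examples that pass the agreement threshold.
mturk_labels = {}
for example_id, judgments in examples.items():
    label = majority_label(judgments)
    if label is not None:
        mturk_labels[example_id] = label

print(mturk_labels)
```

With 10 judgments and a threshold of 6, no two labels can both qualify, so the sketch needs no tie-breaking.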

Table 5.

Confusion matrix comparing the FactBank annotations (rows) with our annotations (columns).

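| FactBank / MTurk | CT+ | PR+ | PS+ | CT− | PR− | PS− | Uu | Total |
|---|---|---|---|---|---|---|---|---|
| CT+ | 54 | | | | | | | 56 |
| PR+ | | 63 | | | | | | 69 |
| PS+ | | | 55 | | | | | 59 |
| CT− | | | | 146 | | | | 153 |
| PR− | | | | | | | | |
| PS− | | | | | | | | |
| Uu | 94 | 18 | | 12 | | | 21 | 156 |
| Total | 158 | 84 | 66 | 158 | | | 27 | 500 |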
