
We collected 10 annotations for each of the 642 events. A total of 177 Turkers participated in the annotation. Most Turkers did just one batch of 23 non-test examples; the mean number of annotations per Turker was 44, and individual Turkers annotated between 23 and 552 sentences. Table 2 reports Fleiss kappa scores (Fleiss 1971) for the full seven-category scheme. These scores are conservative because they do not take into account the fact that the scale is partially ordered, with CT+, PR+, and PS+ forming a “positive” group, CT−, PR−, and PS− forming a “negative” group, and Uu remaining alone. When the seven tags are collapsed into these three groups, the overall Fleiss kappa is much higher (0.66), reflecting the fact that many of the disagreements concern degree of confidence (e.g., CT+ vs. PR+) rather than the basic veridicality judgment of “positive”, “negative”, or “unknown”. At least 6 out of 10 Turkers agreed on the same tag for 500 of the 642 sentences (78%). For 53% of the examples, at least 8 Turkers agreed with each other, and total agreement was obtained for 26% of the data (165 sentences).

Table 2. Fleiss kappa scores with associated p-values.

             κ       p value
CT+          0.63    < 0.001
CT−          0.80    < 0.001
PR+          0.41    < 0.001
PR−          0.34    < 0.001
PS+          0.40    < 0.001
PS−          0.12    < 0.001
Uu           0.25    < 0.001
Overall      0.53    < 0.001
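
The kappa computation and the seven-to-three collapse can be made concrete with a short sketch. This is a minimal illustration, not the paper's code: the helper names fleiss_kappa and collapse, the column ordering, and the toy counts are our assumptions, and it presumes every sentence received the same number of ratings (10 here).

import numpy as np

# Seven-way scheme and its collapse into the three coarse groups.
SEVEN = ["CT+", "PR+", "PS+", "CT-", "PR-", "PS-", "Uu"]
GROUPS = {"positive": [0, 1, 2],   # CT+, PR+, PS+
          "negative": [3, 4, 5],   # CT-, PR-, PS-
          "unknown":  [6]}         # Uu

def fleiss_kappa(counts):
    """Fleiss' kappa (Fleiss 1971) for a matrix of shape (items, categories),
    where counts[i, j] is the number of raters who put item i in category j.
    Assumes each item received the same number of ratings."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum(axis=1)[0]                       # raters per item
    # Observed agreement: mean proportion of agreeing rater pairs per item.
    p_i = ((counts ** 2).sum(axis=1) - n) / (n * (n - 1))
    p_bar = p_i.mean()
    # Chance agreement from the marginal category proportions.
    p_j = counts.sum(axis=0) / counts.sum()
    p_e = (p_j ** 2).sum()
    return (p_bar - p_e) / (1 - p_e)

def collapse(counts):
    """Sum the seven-way columns into the three coarse groups."""
    counts = np.asarray(counts)
    return np.stack([counts[:, idx].sum(axis=1) for idx in GROUPS.values()],
                    axis=1)

# Toy counts (not the paper's data): two sentences, 10 ratings each,
# columns ordered as in SEVEN.
toy = [[6, 3, 1, 0, 0, 0, 0],   # raters disagree only about confidence
       [0, 0, 0, 7, 2, 1, 0]]
print(fleiss_kappa(toy))             # seven-way kappa: about 0.26
print(fleiss_kappa(collapse(toy)))   # three-way kappa: 1.0

On these toy counts, the seven-way kappa is roughly 0.26 while the collapsed kappa is 1.0, mirroring the pattern reported above: disagreements about confidence (e.g., CT+ vs. PR+) disappear once the tags are grouped into “positive”, “negative”, and “unknown”.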
