We collected 10 annotations for each of the 642 events. A total of 177 Turkers participated in the annotations. Most Turkers completed just one batch of 23 non-test examples; the mean number of annotations per Turker was 44, with individual totals ranging from 23 to 552 sentences. Table 2 reports Fleiss kappa scores (Fleiss 1971) for the full seven-category scheme. These scores are conservative because they do not take into account the fact that the scale is partially ordered: CT+, PR+, and PS+ form a “positive” category, CT−, PR−, and PS− form a “negative” category, and Uu stands alone. The overall Fleiss kappa for this three-category version is much higher (0.66), reflecting the fact that many of the disagreements concerned degree of confidence (e.g., CT+ vs. PR+) rather than the basic veridicality judgment of “positive”, “negative”, or “unknown”. At least 6 out of 10 Turkers agreed on the same tag for 500 of the 642 sentences (78%); at least 8 Turkers agreed for 53% of the examples; and all 10 agreed for 26% of the data (165 sentences).

Table 2. Fleiss kappa scores for the seven-category annotation scheme.

| Tag | κ | p value |
|---|---|---|
| CT+ | 0.63 | < 0.001 |
| CT− | 0.80 | < 0.001 |
| PR+ | 0.41 | < 0.001 |
| PR− | 0.34 | < 0.001 |
| PS+ | 0.40 | < 0.001 |
| PS− | 0.12 | < 0.001 |
| Uu | 0.25 | < 0.001 |
| Overall | 0.53 | < 0.001 |
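The three-category collapse discussed above can be sketched in code. The function below is a standard implementation of Fleiss' kappa over per-item category counts, together with the mapping from the seven veridicality tags to "positive", "negative", and "unknown"; the annotation counts in the usage example are made-up illustrative data, not the paper's.

```python
from collections import Counter

# Collapse the seven veridicality tags into the three coarse categories
# described in the text (CT+/PR+/PS+ -> positive, CT-/PR-/PS- -> negative,
# Uu -> unknown).
COLLAPSE = {
    "CT+": "pos", "PR+": "pos", "PS+": "pos",
    "CT-": "neg", "PR-": "neg", "PS-": "neg",
    "Uu": "unk",
}

def fleiss_kappa(tables):
    """Fleiss' kappa for a list of per-item {category: count} dicts,
    each summing to the same number of raters n."""
    n = sum(next(iter(tables)).values())      # raters per item (assumed constant)
    N = len(tables)
    categories = sorted({c for t in tables for c in t})
    # Mean observed per-item agreement P-bar
    p_bar = sum(
        (sum(t.get(c, 0) ** 2 for c in categories) - n) / (n * (n - 1))
        for t in tables
    ) / N
    # Chance agreement P_e from marginal category proportions
    props = {c: sum(t.get(c, 0) for t in tables) / (N * n) for c in categories}
    p_e = sum(v ** 2 for v in props.values())
    return (p_bar - p_e) / (1 - p_e)

def collapse(counts):
    """Map a seven-tag count dict onto the three coarse categories."""
    out = Counter()
    for tag, k in counts.items():
        out[COLLAPSE[tag]] += k
    return dict(out)

# Hypothetical annotations: 3 items, 10 raters each.
items = [
    {"CT+": 6, "PR+": 3, "Uu": 1},
    {"CT-": 8, "PR-": 2},
    {"PS+": 5, "PR+": 4, "CT+": 1},
]
kappa7 = fleiss_kappa(items)                          # fine-grained agreement
kappa3 = fleiss_kappa([collapse(t) for t in items])   # coarse agreement
```

On such data the coarse kappa exceeds the fine-grained one, mirroring the paper's observation that disagreements are largely about confidence level rather than polarity.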

