We can also compute precision and recall on the subsets of the data where there is a majority vote, that is, where at least six of the ten annotators agreed on the same label. This allows us to give results per veridicality tag. We take as the true veridicality value the one on which the annotators agreed, and as the classifier's value the one to which it assigns the highest probability. Table 7 reports precision, recall, and F1 scores on the training and test sets, along with the number of instances in each category. None of the items in our test data were tagged with PR− or PS−, and these categories were very infrequent in the training data, so we leave them out. The table also gives baseline results from a weighted random guesser, the same baseline used for the lower bound in Table 6. Our results significantly exceed the baseline (McNemar’s test, p < 0.001).7
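As an illustration of the significance test mentioned above, the following is a minimal sketch of an exact (binomial) McNemar test on paired predictions. The text does not say which variant of the test was used, so the exact form is an assumption here, and the function names `discordant_counts` and `mcnemar_exact` are hypothetical. The test looks only at items on which the two systems disagree in correctness.

```python
from math import comb


def discordant_counts(gold, system_a, system_b):
    """Count items where exactly one of the two systems matches the gold label.

    Returns (b, c): b = A correct and B wrong, c = A wrong and B correct.
    """
    b = sum(1 for g, x, y in zip(gold, system_a, system_b) if x == g and y != g)
    c = sum(1 for g, x, y in zip(gold, system_a, system_b) if x != g and y == g)
    return b, c


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant counts.

    Under the null hypothesis, each discordant item is equally likely to
    favor either system, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # the systems never disagree in correctness
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Hypothetical counts, for illustration only (not taken from the paper):
# the classifier alone is right on 40 items, the baseline alone on 5.
p_value = mcnemar_exact(40, 5)
print(p_value < 0.001)
```

With paired label lists in hand, `discordant_counts(gold, classifier, baseline)` would supply the two counts passed to `mcnemar_exact`.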
Table 7. Precision, recall, and F1 on the subsets of the training data (10-fold cross-validation) and test data where there is a majority vote, as well as F1 for the baseline.
Tag | Train # | Train P | Train R | Train F1 | Train baseline F1 | Test # | Test P | Test R | Test F1 | Test baseline F1 |
---|---|---|---|---|---|---|---|---|---|---|
CT+ | 158 | 74.3 | 84.2 | 78.9 | 32.6 | 61 | 86.9 | 86.9 | 86.9 | 31.8 |
CT− | 158 | 89.4 | 91.1 | 90.2 | 34.1 | 31 | 96.6 | 90.3 | 93.3 | 29.4 |
PR+ | 84 | 74.4 | 69.1 | 71.6 | 19.8 | 7 | 50.0 | 57.1 | 53.3 | 6.9 |
PS+ | 66 | 75.4 | 69.7 | 72.4 | 16.7 | 7 | 62.5 | 71.4 | 66.7 | 0.0 |
Uu | 27 | 57.1 | 44.4 | 50.0 | 10.7 | 6 | 50.0 | 50.0 | 50.0 | 0.0 |
Macro-avg | | 74.1 | 71.7 | 72.6 | 22.8 | | 69.2 | 71.1 | 70.0 | 13.6 |
Micro-avg | | 78.6 | 78.6 | 78.6 | 27.0 | | 83.0 | 83.0 | 83.0 | 22.3 |
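To make the relation between the per-tag scores and the two averages concrete, here is a minimal sketch of macro- and micro-averaging. The counts in `COUNTS` are not reported in the text: they are back-computed here from the rounded test-set percentages and instance counts in Table 7, so they should be read as an assumption of this sketch. The helper names are hypothetical.

```python
# Test-set counts back-computed from the rounded values in Table 7
# (an assumption of this sketch, not figures reported by the source).
# tag -> (true positives, items predicted with tag, gold items with tag)
COUNTS = {
    "CT+": (53, 61, 61),
    "CT−": (28, 29, 31),
    "PR+": (4, 8, 7),
    "PS+": (5, 8, 7),
    "Uu": (3, 6, 6),
}


def prf(tp, n_pred, n_gold):
    """Precision, recall, and F1 from raw counts (0.0 when undefined)."""
    p = tp / n_pred if n_pred else 0.0
    r = tp / n_gold if n_gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


def macro_average(counts):
    """Unweighted mean of the per-tag precision, recall, and F1."""
    scores = [prf(*c) for c in counts.values()]
    return tuple(sum(s[i] for s in scores) / len(scores) for i in range(3))


def micro_average(counts):
    """Precision, recall, and F1 computed from counts pooled over all tags."""
    tp = sum(c[0] for c in counts.values())
    n_pred = sum(c[1] for c in counts.values())
    n_gold = sum(c[2] for c in counts.values())
    return prf(tp, n_pred, n_gold)


macro_f1 = macro_average(COUNTS)[2]
micro_p, micro_r, micro_f1 = micro_average(COUNTS)
print(round(100 * macro_f1, 1), round(100 * micro_f1, 1))
```

Because each item receives exactly one predicted tag, the pooled number of predictions equals the pooled number of gold items, which is why micro-averaged precision, recall, and F1 coincide in the table. The macro average weights every tag equally, so the small PR+, PS+, and Uu categories pull it below the micro average. Last-digit differences from the table can arise depending on whether rounded or unrounded per-tag values are averaged.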