We can also compute precision and recall for the subsets of the data where there is a majority vote, that is, where six out of ten annotators agreed on the same label. This allows us to give results per veridicality tag. We take as the true veridicality value the one on which the annotators agreed. The value assigned by the classifier is the one with the highest probability. Table 7 reports precision, recall, and F1 scores on the training and test sets, along with the number of instances in each category. None of the items in our test data were tagged with PR− or PS−, and these categories were very infrequent in the training data, so we leave them out. The table also gives baseline results: we used a weighted random guesser, as for the lower bound given in Table 6. Our results significantly exceed the baseline (McNemar’s test, p < 0.001).7
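To make the evaluation procedure concrete, the sketch below (in Python; it is not the paper's code, and the tag inventory and toy predictions are illustrative) computes per-tag precision, recall, and F1, builds a weighted random baseline of the kind used for the lower bound in Table 6, and applies the chi-square approximation of McNemar's test to the paired correct/incorrect decisions of the two systems.

```python
import math
import random
from collections import Counter

def per_class_prf(gold, pred, label):
    """Precision, recall, and F1 for a single veridicality tag."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

def weighted_random_baseline(gold, seed=0):
    """Guess each label with probability proportional to its gold frequency."""
    rng = random.Random(seed)
    labels, weights = zip(*Counter(gold).items())
    return [rng.choices(labels, weights=weights)[0] for _ in gold]

def mcnemar_p(gold, pred_a, pred_b):
    """McNemar's test (chi-square form with continuity correction) over the
    discordant pairs: items one system labels correctly and the other does not."""
    b = sum(1 for g, a, x in zip(gold, pred_a, pred_b) if a == g and x != g)
    c = sum(1 for g, a, x in zip(gold, pred_a, pred_b) if a != g and x == g)
    if b + c == 0:
        return 1.0
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    return math.erfc(math.sqrt(chi2 / 2))  # p-value for chi-square, 1 d.f.

# Toy illustration with hypothetical gold tags and classifier predictions.
gold = ["CT+", "CT-", "PR+", "CT+", "Uu", "PS+", "CT-", "CT+"]
pred = ["CT+", "CT-", "CT+", "CT+", "Uu", "PS+", "CT-", "PR+"]
base = weighted_random_baseline(gold)
for tag in sorted(set(gold)):
    print(tag, per_class_prf(gold, pred, tag))
print("McNemar p =", mcnemar_p(gold, pred, base))
```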

Table 7. Precision, recall, and F1 on the subsets of the training data (10-fold cross-validation) and test data where there is a majority vote, as well as F1 for the baseline.

                 ---------------- Train ----------------      ----------------- Test -----------------
              #      P      R     F1   Baseline F1          #      P      R     F1   Baseline F1
CT+          158   74.3   84.2   78.9      32.6            61   86.9   86.9   86.9      31.8
CT−          158   89.4   91.1   90.2      34.1            31   96.6   90.3   93.3      29.4
PR+           84   74.4   69.1   71.6      19.8                 50.0   57.1   53.3       6.9
PS+           66   75.4   69.7   72.4      16.7                 62.5   71.4   66.7       0.0
Uu            27   57.1   44.4   50.0      10.7                 50.0   50.0   50.0       0.0
Macro-avg          74.1   71.7   72.6      22.8                 69.2   71.1   70.0      13.6
Micro-avg          78.6   78.6   78.6      27.0                 83.0   83.0   83.0      22.3
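The two averages in the last rows are the standard ones (the formulas below are stated for clarity and are not from the paper): the macro-average is the unweighted mean of the per-tag scores, while the micro-average pools true positives, false positives, and false negatives over all tags. Since each item receives exactly one gold and one predicted tag, micro precision, recall, and F1 all reduce to overall accuracy, which is why the three micro figures coincide in each half of the table.

\[
F_1^{\text{macro}} = \frac{1}{|T|}\sum_{t \in T} F_1(t), \qquad
P^{\text{micro}} = \frac{\sum_{t} \mathit{TP}_t}{\sum_{t} (\mathit{TP}_t + \mathit{FP}_t)}, \qquad
R^{\text{micro}} = \frac{\sum_{t} \mathit{TP}_t}{\sum_{t} (\mathit{TP}_t + \mathit{FN}_t)}
\]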
