Table 3.

Interrater reliability metrics per question

| Question | Mean total agreement | Mean percentage correct | Krippendorff’s alpha |
| --- | --- | --- | --- |
| Original classification task | 66.0% | 84.8% | 0.670 |
| Classifier area/domain | 34.5% | 65.4% | 0.520 |
| Labels from human judgment | 37.5% | 68.2% | 0.517 |
| Human labeling for training data | 46.5% | 77.3% | 0.517 |
| Used original human labeling | 43.5% | 71.0% | 0.498 |
| Original human labeling source | 43.5% | 71.1% | 0.330 |
| Prescreening for crowdwork | 58.5% | 84.2% | 0.097 |
| Labeler compensation | 46.0% | 68.0% | 0.343 |
| Training for human labelers | 48.0% | 70.0% | 0.364 |
| Formal instructions | 47.5% | 66.8% | 0.337 |
| Multiple labeler overlap | 48.5% | 69.3% | 0.370 |
| Synthesis of labeler overlap | 53.0% | 83.4% | 0.146 |
| Reported interrater reliability | 55.5% | 85.8% | 0.121 |
| Total number of human labelers | 50.5% | 69.3% | 0.281 |
| Median number of labelers per item | 48.5% | 69.3% | 0.261 |
| Link to data set available | 41.0% | 66.1% | 0.322 |
| Average across all questions | 48.0% | 73.1% | 0.356 |
| Median across all questions | 48.0% | 70.0% | 0.343 |
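The alpha column above is Krippendorff’s alpha, a chance-corrected agreement coefficient. As a minimal sketch of how it is typically computed for nominal (categorical) answers via the coincidence-matrix formulation — the function name and data layout here are illustrative assumptions, not taken from the paper:

```python
from collections import Counter

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data (illustrative sketch).

    `units` is a list of lists: the labels all coders assigned to one item.
    Items with fewer than two labels are not pairable and are skipped.
    """
    units = [u for u in units if len(u) >= 2]
    # Coincidence matrix o[(c, k)]: weighted counts of value pairs within items.
    o = Counter()
    for u in units:
        m = len(u)
        counts = Counter(u)
        for c in counts:
            for k in counts:
                pairs = counts[c] * (counts[k] - 1) if c == k else counts[c] * counts[k]
                o[(c, k)] += pairs / (m - 1)
    # Marginal totals per value, and the total number of pairable values.
    n_c = Counter()
    for (c, _k), v in o.items():
        n_c[c] += v
    n = sum(n_c.values())
    # Observed vs. expected disagreement (nominal metric: mismatches count as 1).
    d_o = sum(v for (c, k), v in o.items() if c != k)
    d_e = sum(n_c[c] * n_c[k] for c in n_c for k in n_c if c != k) / (n - 1)
    if d_e == 0:
        return 1.0  # all coders used a single value; alpha is conventionally undefined
    return 1.0 - d_o / d_e
```

For example, two coders who always agree yield an alpha of 1.0, while two coders who systematically disagree on a binary question yield a negative alpha, which is why low values in the table (e.g. 0.097 for prescreening) signal agreement barely above chance.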