Table 1: 

Mean and standard deviation of precision, recall, and F1 score for model predictions on the GHC dataset, evaluated over 5 iterations of 5-fold stratified cross-validation. The Majority Vote columns report each model’s performance on predicting the majority vote, while the Individual Labels columns report performance on predicting each raw annotation.

              Majority Vote                               Individual Labels
Model         Precision     Recall        F1              Precision     Recall        F1
Baseline      49.53 ± 3.8   68.78 ± 4.4   57.32 ± 1.2     –             –             –
Ensemble      63.98 ± 1.1   46.09 ± 1.9   53.54 ± 1.0     60.92 ± 0.7   60.97 ± 0.8   60.94 ± 0.3
Multi-label   66.02 ± 2.2   50.16 ± 2.0   56.94 ± 1.0     67.22 ± 1.4   55.33 ± 2.0   60.65 ± 0.7
Multi-task    59.03 ± 0.9   59.98 ± 0.6   59.49 ± 0.2     63.71 ± 1.3   62.76 ± 1.5   63.20 ± 0.3
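
The evaluation protocol named in the Table 1 caption (5 iterations of 5-fold stratified cross-validation, summarized as mean ± standard deviation) can be sketched as follows. This is a minimal standard-library illustration, not the authors' code. The names `stratified_folds`, `repeated_stratified_cv`, and `always_positive`, the toy labels, and the use of the sample standard deviation are all assumptions made for the example. Each metric is computed per held-out fold and then averaged, which is one possible reason a mean F1 need not equal the F1 computed from the mean precision and mean recall.

```python
import random
import statistics
from collections import defaultdict


def stratified_folds(labels, k, rng):
    """Split indices into k folds that roughly preserve label proportions."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_label.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets a similar share.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def precision_recall_f1(y_true, y_pred):
    """Binary precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def repeated_stratified_cv(labels, fit_predict, k=5, repeats=5, seed=0):
    """Score fit_predict on repeats * k held-out folds; return mean and std."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        for held_out in stratified_folds(labels, k, rng):
            held = set(held_out)
            train = [i for i in range(len(labels)) if i not in held]
            y_pred = fit_predict(train, held_out)
            y_true = [labels[i] for i in held_out]
            scores.append(precision_recall_f1(y_true, y_pred))
    summary = {}
    for name, values in zip(("precision", "recall", "f1"), zip(*scores)):
        # Sample standard deviation across folds (an assumption).
        summary[name] = (statistics.mean(values), statistics.stdev(values))
    return summary, len(scores)


# Toy data (hypothetical): 20 positive and 80 negative examples.
labels = [1] * 20 + [0] * 80


def always_positive(train_idx, test_idx):
    """Hypothetical stand-in for a model: label every test item positive."""
    return [1] * len(test_idx)


summary, n_scores = repeated_stratified_cv(labels, always_positive)
for name, (mean, std) in summary.items():
    print(f"{name}: {100 * mean:.2f} ± {100 * std:.1f}")
```

In practice `fit_predict` would train a classifier on the training indices and return its predictions for the held-out indices; the always-positive stand-in only keeps the sketch self-contained.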