Table 6 

Performance of the baseline and best-performing TAG models, both separately and in combination. TAG justifications of different short lengths combined best in a single classifier (denoted with a +), whereas models that include the CR baseline or long (3G) TAG justifications combined best using voting ensembles (denoted with a ∪). Bold font indicates the best score in a given column for each model group. Asterisks indicate that a score is significantly better than the highest-performing baseline model (* signifies p < 0.05, ** signifies p < 0.01). The dagger indicates that a score is significantly higher than the score in the line number indicated in superscript (p < 0.01). All significance tests used one-tailed non-parametric bootstrap resampling with 10,000 iterations.

| # | Model | P@1 | P@1 Impr. | MRR | MRR Impr. |
|---|---|---|---|---|---|
| | Baselines | | | | |
| 1 | Random | 25.00 | – | 52.08 | – |
| 2 | CR | 40.20 | – | 62.49 | – |
| 3 | Jansen et al. (2014) | 37.30 | – | 60.95 | – |
| | Combined models with justifications of variable lengths (Single classifier) | | | | |
| 4 | 1G + 2G | 38.69 | – | 61.43 | – |
| 5 | 1GCT + 2GCT | 42.88†⁴ | +6.7% | 63.94 | +2.3% |
| | Combined models that include the CR baseline (Voting) | | | | |
| 6 | CR ∪ 1GCT ∪ 2GCT ∪ 3GCT | 43.15* | +7.3% | 64.51* | +3.2% |
| 7 | CR ∪ (1GCT + 2GCT) ∪ 3GCT | 44.46** | +10.6% | 65.53** | +4.9% |
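
The caption's significance test can be illustrated with a minimal sketch of a one-tailed, non-parametric bootstrap over per-question scores. This is not the authors' implementation: the paired design (resampling questions and comparing both systems on the same sample), the function name `paired_bootstrap_p_value`, the seed, and the toy score lists are all assumptions made for illustration. Per-question scores would be 1 or 0 for P@1, or the reciprocal rank of the correct answer for MRR.

```python
import random


def paired_bootstrap_p_value(baseline_scores, model_scores,
                             iterations=10_000, seed=0):
    """One-tailed paired bootstrap (illustrative sketch, not the paper's code).

    Returns the fraction of resampled question sets on which the model
    fails to outscore the baseline; a small value suggests the model's
    advantage is unlikely to be a sampling accident.
    """
    if len(baseline_scores) != len(model_scores):
        raise ValueError("both systems must be scored on the same questions")
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(baseline_scores)
    # Per-question score differences (model minus baseline).
    diffs = [m - b for m, b in zip(model_scores, baseline_scores)]
    not_better = 0
    for _ in range(iterations):
        # Resample n questions with replacement and total the differences.
        total = sum(diffs[rng.randrange(n)] for _ in range(n))
        if total <= 0:
            not_better += 1
    return not_better / iterations


# Hypothetical per-question P@1 outcomes (1 = correct answer ranked first).
baseline = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
model = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]
p_value = paired_bootstrap_p_value(baseline, model)
print(f"estimated one-tailed p-value: {p_value:.4f}")
```

Under this reading, a model would earn one asterisk in the table when the estimated p-value falls below 0.05 and two when it falls below 0.01.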