Performance of the baseline and best-performing TAG models, both separately and in combination. TAG justifications of different short lengths were found to combine best in single classifiers (denoted with a +), whereas models that include the CR baseline or long (3G) TAG justifications combined best using voting ensembles (denoted with a ∪). Bold font indicates the best score in a given column for each model group. Asterisks indicate that a score is significantly better than the highest-performing baseline model (* signifies p < 0.05, ** signifies p < 0.01). The dagger (†) indicates that a score is significantly higher than the score in the line number indicated in superscript (p < 0.01). All significance tests used one-tailed non-parametric bootstrap resampling with 10,000 iterations.
# | Model | P@1 | P@1 Impr. | MRR | MRR Impr. |
---|---|---|---|---|---|
Baselines | |||||
1 | Random | 25.00 | – | 52.08 | – |
2 | CR | 40.20 | – | 62.49 | – |
3 | Jansen et al. (2014) | 37.30 | – | 60.95 | – |
Combined models with justifications of variable lengths (Single classifier) | |||||
4 | 1G + 2G | 38.69 | – | 61.43 | – |
5 | 1GCT + 2GCT | 42.88 †4 | +6.7% | 63.94 | +2.3% |
Combined models that include the CR baseline (Voting) | |||||
6 | CR ∪ 1GCT ∪ 2GCT ∪ 3GCT | 43.15* | +7.3% | 64.51* | +3.2% |
7 | CR ∪ (1GCT + 2GCT) ∪ 3GCT | 44.46** | +10.6% | 65.53** | +4.9% |
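
The improvement columns are consistent with relative gains over the CR baseline (row 2), and the caption describes a one-tailed bootstrap test with 10,000 iterations. The following is a minimal sketch of both computations, not the authors' implementation. It assumes the test is paired over per-question scores (for example 1/0 correctness for P@1, or reciprocal rank for MRR); the function names and the pairing scheme are illustrative assumptions.

```python
import random


def relative_improvement(score, baseline):
    """Relative gain of `score` over `baseline`, in percent."""
    return 100.0 * (score - baseline) / baseline


def bootstrap_p_value(model_scores, baseline_scores, iterations=10_000, seed=0):
    """One-tailed paired bootstrap (assumed scheme, not taken from the source).

    Resamples questions with replacement and returns the fraction of
    resamples in which the model does NOT beat the baseline.
    """
    if len(model_scores) != len(baseline_scores):
        raise ValueError("score lists must cover the same questions")
    rng = random.Random(seed)
    n = len(model_scores)
    # Per-question score differences; resampling these keeps the pairing.
    diffs = [m - b for m, b in zip(model_scores, baseline_scores)]
    not_better = 0
    for _ in range(iterations):
        total = sum(diffs[rng.randrange(n)] for _ in range(n))
        if total <= 0:
            not_better += 1
    return not_better / iterations


# Row 7 against the CR baseline (row 2), using the values in the table.
print(round(relative_improvement(44.46, 40.20), 1))  # 10.6 (P@1)
print(round(relative_improvement(65.53, 62.49), 1))  # 4.9 (MRR)
```

Under this scheme, a returned value below 0.05 or 0.01 would correspond to the single and double asterisks in the table.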