Precision (P), recall (R), and F1 scores per argument for the end-to-end semantic role labeling task. We compare two models: the original SwiRL model and a variant in which the classification component is replaced with the meta-classifier introduced at the beginning of the section. All results were produced with the official CoNLL-2005 shared-task scorer. We tested for statistical significance on the overall F1 scores (the All row). Boldface values indicate the highest F1 score in the corresponding row and block. F1 values marked with † are significantly lower than the corresponding highest F1 score.
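As a reminder of how the three reported columns relate, the following minimal sketch computes P, R, and F1 from argument counts using the standard definitions. It is an illustration only and not the official scorer: the function name `prf1` and its parameters are hypothetical, and it returns fractions rather than percentages.

```python
def prf1(correct: int, predicted: int, gold: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from argument counts.

    correct   -- predicted arguments that match a gold argument
    predicted -- all arguments output by the system
    gold      -- all arguments in the gold annotation
    (All names are illustrative assumptions, not taken from the scorer.)
    """
    # Precision: fraction of predicted arguments that are correct.
    precision = correct / predicted if predicted else 0.0
    # Recall: fraction of gold arguments that were recovered.
    recall = correct / gold if gold else 0.0
    # F1: harmonic mean of precision and recall (0 when both are 0).
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    p, r, f1 = prf1(correct=8, predicted=10, gold=16)
    print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

Because F1 is a harmonic mean, it stays close to the lower of the two values, which is why a row with high precision but low recall can still show a modest F1.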