Evaluation results for the two baselines (generic neural NER model and Microsoft Presidio) on the development and test sections of the TAB corpus. We report both the standard token-level recall $R_{di+qi}$ and precision $P_{di+qi}$ on all identifiers (micro-averaged over all annotators), as well as the three proposed evaluation metrics $ER_{di}$, $ER_{qi}$, and $WP_{di+qi}$ from Section 6.
| System | Set | $R_{di+qi}$ | $ER_{di}$ | $ER_{qi}$ | $P_{di+qi}$ | $WP_{di+qi}$ |
|---|---|---|---|---|---|---|
| Neural NER (RoBERTa fine-tuned on OntoNotes v5) | Dev | 0.910 | 0.970 | 0.874 | 0.447 | 0.531 |
| | Test | 0.906 | 0.940 | 0.874 | 0.441 | 0.515 |
| Presidio (default) | Dev | 0.696 | 0.452 | 0.739 | 0.771 | 0.795 |
| | Test | 0.707 | 0.460 | 0.758 | 0.761 | 0.790 |
| Presidio (+ORG) | Dev | 0.767 | 0.465 | 0.779 | 0.549 | 0.622 |
| | Test | 0.782 | 0.463 | 0.802 | 0.542 | 0.609 |
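The token-level scores $R_{di+qi}$ and $P_{di+qi}$ are micro-averaged over annotators, i.e., true-positive, false-negative, and false-positive token counts are pooled across all annotators before dividing. The sketch below illustrates that pooling convention; the function name and input format (sets of token indices) are assumptions made for illustration and do not reflect the actual TAB evaluation code.

```python
def micro_recall_precision(
    system_masked: set[int],
    annotator_gold: list[set[int]],
) -> tuple[float, float]:
    """Micro-averaged token-level recall and precision (illustrative).

    system_masked: token indices masked by the de-identification system.
    annotator_gold: one set per annotator, holding the token indices that
        annotator marked as part of a direct or quasi-identifier.
    """
    tp = fn = fp = 0
    for gold in annotator_gold:
        tp += len(system_masked & gold)   # gold tokens the system masked
        fn += len(gold - system_masked)   # gold tokens the system missed
        fp += len(system_masked - gold)   # masked tokens not in this gold set
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision
```

For example, `micro_recall_precision({1, 2, 5}, [{1, 2}, {2, 5, 7}])` pools counts over the two annotators and returns a recall of 0.8 and a precision of about 0.667. Because counts are pooled before dividing, annotators who mark more identifier tokens contribute proportionally more to the final score, which is the defining property of micro-averaging.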