Table 6

Evaluation results for the two baselines (generic neural NER model and Microsoft Presidio) on the development and test sections of the TAB corpus. We report both the standard token-level recall R_di+qi and precision P_di+qi on all identifiers (micro-averaged over all annotators), as well as the three proposed evaluation metrics ER_di, ER_qi, and WP_di+qi from Section 6.

System                         Set    R_di+qi   ER_di   ER_qi   P_di+qi   WP_di+qi
Neural NER (RoBERTa            Dev    0.910     0.970   0.874   0.447     0.531
fine-tuned on Ontonotes v5)    Test   0.906     0.940   0.874   0.441     0.515
Presidio (default)             Dev    0.696     0.452   0.739   0.771     0.795
                               Test   0.707     0.460   0.758   0.761     0.790
Presidio (+ORG)                Dev    0.767     0.465   0.779   0.549     0.622
                               Test   0.782     0.463   0.802   0.542     0.609
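The micro-averaged token-level recall and precision reported above can be sketched as follows. This is a hypothetical illustration, not the TAB evaluation code: it assumes gold and predicted annotations are represented as sets of masked token indices, pooled over all annotators and documents before dividing (micro-averaging); the function name and data are invented for the example.

```python
def micro_recall_precision(gold_sets, pred_sets):
    """Micro-averaged token-level recall and precision.

    gold_sets: list of sets of gold-masked token indices
               (one set per annotator/document pair; illustrative format)
    pred_sets: list of sets of system-masked token indices, aligned with gold_sets
    """
    # Pool true positives and totals over all pairs, then divide once
    # (micro-averaging), rather than averaging per-pair scores (macro).
    tp = sum(len(gold & pred) for gold, pred in zip(gold_sets, pred_sets))
    gold_total = sum(len(gold) for gold in gold_sets)
    pred_total = sum(len(pred) for pred in pred_sets)
    recall = tp / gold_total if gold_total else 0.0
    precision = tp / pred_total if pred_total else 0.0
    return recall, precision


# Toy usage: two annotator/document pairs with invented token indices.
r, p = micro_recall_precision([{1, 2, 3}, {4}], [{2, 3}, {4, 5}])
# 3 pooled true positives out of 4 gold and 4 predicted tokens.
```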