Table 3: Performance on the OntoNotes coreference resolution benchmark (test set). The main evaluation is the average F1 of three metrics: MUC, B³, and CEAFφ4.
                                     MUC               B³              CEAFφ4
                                P     R     F1     P     R     F1     P     R     F1    Avg. F1
Prev. SotA (Lee et al., 2018)  81.4  79.5  80.4   72.2  69.5  70.8   68.2  67.1  67.6   73.0
Google BERT                    84.9  82.5  83.7   76.7  74.2  75.4   74.6  70.1  72.3   77.1
Our BERT                       85.1  83.5  84.3   77.3  75.5  76.4   75.0  71.9  73.9   78.3
Our BERT-1seq                  85.5  84.1  84.8   77.8  76.7  77.2   75.3  73.5  74.4   78.8
SpanBERT                       85.8  84.8  85.3   78.3  77.9  78.1   76.4  74.2  75.3   79.6
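As a sanity check on the table, the Avg. F1 column is the unweighted mean of the three per-metric F1 scores. A minimal sketch using the SpanBERT row's values:

```python
# Avg. F1 = unweighted mean of the MUC, B^3, and CEAF_phi4 F1 scores.
# Values taken from the SpanBERT row of Table 3.
muc_f1, b3_f1, ceaf_f1 = 85.3, 78.1, 75.3

avg_f1 = round((muc_f1 + b3_f1 + ceaf_f1) / 3, 1)
print(avg_f1)  # 79.6, matching the reported Avg. F1
```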