Table 7: F1 scores for the simplified TyDiQA-GoldP task v1.1. Left: Fine-tuned and evaluated on the TyDiQA-GoldP set. Middle: Fine-tuned on SQuAD v1.1 and evaluated on the TyDiQA-GoldP dev set, following the XQuAD zero-shot setting. Right: Estimate of human performance on TyDiQA-GoldP. Model scores are averaged over five fine-tuning runs.
Language      TyDiQA-GoldP   SQuAD Zero Shot   Human
(English)     (76.8)         (73.4)            (84.2)
Arabic        81.7           60.3              85.8
Bengali       75.4           57.3              94.8
Finnish       79.4           56.2              87.0
Indonesian    84.8           60.8              92.0
Kiswahili     81.9           52.9              92.0
Korean        69.2           50.0              82.0
Russian       76.2           64.4              96.3
Telugu        83.3           49.3              97.1
Overall       79.0           56.4              90.9
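
The Overall row appears to be the unweighted mean over the eight non-English languages, with the parenthesized English row left out. The caption does not state this; it is an inference from the numbers. The sketch below checks the arithmetic. The names `scores` and `macro_average` are illustrative and do not come from the source.

```python
# Per-language F1 scores copied from Table 7, in column order:
# (TyDiQA-GoldP, SQuAD Zero Shot, Human).
# Assumption: the parenthesized English row is excluded from "Overall".
scores = {
    "Arabic":     (81.7, 60.3, 85.8),
    "Bengali":    (75.4, 57.3, 94.8),
    "Finnish":    (79.4, 56.2, 87.0),
    "Indonesian": (84.8, 60.8, 92.0),
    "Kiswahili":  (81.9, 52.9, 92.0),
    "Korean":     (69.2, 50.0, 82.0),
    "Russian":    (76.2, 64.4, 96.3),
    "Telugu":     (83.3, 49.3, 97.1),
}


def macro_average(rows):
    """Return the unweighted mean of each column, rounded to one decimal."""
    columns = zip(*rows.values())
    return tuple(round(sum(col) / len(rows), 1) for col in columns)


overall = macro_average(scores)
print(overall)  # (79.0, 56.4, 90.9)
```

Under this assumption the computed means match the Overall row in all three columns.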