Table 4: ByT5 and mT5 performance on a subset of XTREME tasks. Our evaluation setup follows Xue et al. (2021). For QA tasks we report F1 / EM scores.

In-language multitask (models fine-tuned on gold data in all target languages)

| Task | Small mT5 | Small ByT5 | Base mT5 | Base ByT5 | Large mT5 | Large ByT5 | XL mT5 | XL ByT5 | XXL mT5 | XXL ByT5 |
|---|---|---|---|---|---|---|---|---|---|---|
| WikiAnn NER | 86.4 | 90.6 | 88.2 | 91.6 | 89.7 | 91.8 | 91.3 | 92.6 | 92.2 | 93.7 |
| TyDiQA-GoldP | 75.9 / 64.8 | 82.6 / 73.6 | 81.7 / 71.2 | 86.4 / 78.0 | 85.3 / 75.3 | 87.7 / 79.2 | 87.6 / 78.4 | 88.0 / 79.3 | 88.7 / 79.5 | 89.4 / 81.4 |

Translate-train (models fine-tuned on English data plus translations in all target languages)

| Task | Small mT5 | Small ByT5 | Base mT5 | Base ByT5 | Large mT5 | Large ByT5 | XL mT5 | XL ByT5 | XXL mT5 | XXL ByT5 |
|---|---|---|---|---|---|---|---|---|---|---|
| XNLI | 75.3 | 76.6 | 80.5 | 79.9 | 84.4 | 82.8 | 85.3 | 85.0 | 87.1 | 85.7 |
| PAWS-X | 87.7 | 88.6 | 90.5 | 89.8 | 91.3 | 90.6 | 91.0 | 90.5 | 91.5 | 91.7 |
| XQuAD | 71.3 / 55.7 | 74.0 / 59.9 | 77.6 / 62.2 | 78.5 / 64.6 | 81.3 / 66.5 | 81.4 / 67.4 | 82.7 / 68.1 | 83.7 / 69.5 | 85.2 / 71.3 | 84.1 / 70.2 |
| MLQA | 56.6 / 38.8 | 67.5 / 49.9 | 69.7 / 51.0 | 71.9 / 54.1 | 74.0 / 55.0 | 74.4 / 56.1 | 75.1 / 56.6 | 75.9 / 57.7 | 76.9 / 58.3 | 76.9 / 58.8 |
| TyDiQA-GoldP | 49.8 / 35.6 | 64.2 / 50.6 | 66.4 / 51.0 | 75.6 / 61.7 | 75.8 / 60.2 | 80.1 / 66.4 | 80.1 / 65.0 | 81.5 / 67.6 | 83.3 / 69.4 | 83.2 / 69.6 |

Cross-lingual zero-shot transfer (models fine-tuned on English data only)

| Task | Small mT5 | Small ByT5 | Base mT5 | Base ByT5 | Large mT5 | Large ByT5 | XL mT5 | XL ByT5 | XXL mT5 | XXL ByT5 |
|---|---|---|---|---|---|---|---|---|---|---|
| XNLI | 67.5 | 69.1 | 75.4 | 75.4 | 81.1 | 79.7 | 82.9 | 82.2 | 85.0 | 83.7 |
| PAWS-X | 82.4 | 84.0 | 86.4 | 86.3 | 88.9 | 87.4 | 89.6 | 88.6 | 90.0 | 90.1 |
| WikiAnn NER | 50.5 | 57.6 | 55.7 | 62.0 | 58.5 | 62.9 | 65.5 | 61.6 | 69.2 | 67.7 |
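The caption notes that QA tasks report F1 / EM. As a rough illustration of what these two scores measure, here is a minimal sketch of SQuAD-style answer scoring: EM checks exact string equality after normalization, while F1 measures token overlap between predicted and gold answers. The normalization shown (lowercasing, stripping punctuation and English articles) is a simplified assumption; the exact rules vary by benchmark and language.

```python
import re
import string
from collections import Counter


def normalize(s: str) -> str:
    """Simplified SQuAD-style normalization (assumption: details vary per benchmark)."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)  # drop English articles
    return " ".join(s.split())


def exact_match(pred: str, gold: str) -> float:
    """EM: 1.0 if normalized strings are identical, else 0.0."""
    return float(normalize(pred) == normalize(gold))


def f1(pred: str, gold: str) -> float:
    """Token-overlap F1 between predicted and gold answer spans."""
    p, g = normalize(pred).split(), normalize(gold).split()
    common = Counter(p) & Counter(g)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```

For example, a prediction of "the cat sat" against a gold answer "cat sat down" gets EM 0.0 but a partial-credit F1 of 0.8, which is why F1 is always at least as high as EM in the table above.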