Results for each baseline, broken down by retrieval metrics (Recall@K passages), answerable-question metrics (F1 at the best confidence threshold), and end-to-end metrics (F1 at the best confidence threshold). A naive approach that predicts exclusively No Answer achieves a lower-bound score of 32.42% F1. Translate-Train using NQ's gold passages and an XLM-R reader outperforms all alternate settings. A ∈ D denotes metrics for examples where the answer A appears in the top retrieved document D (exact match); A ∉ D denotes metrics for examples where it does not. * The Elasticsearch benchmark does not include Hebrew, Khmer, Korean, Malay, and Vietnamese.
Retriever | Reader | Translation: Query | Translation: Answer | Retrieval: R@1 | Answerable: Mean F1 (A ∈ D) | Answerable: Mean F1 (A ∉ D) | End-to-End: En F1 | End-to-End: Mean F1 |
---|---|---|---|---|---|---|---|---|
No Answer | – | – | – | – | – | – | 32.4 | 32.4 |
MULTILINGUAL RETRIEVER | | | | | | | | |
Elasticsearch* | XLM-R | – | – | 42.57 ± 1.2 | 25.18 ± 3.8 | 7.24 ± 2.5 | 34.99 | 34.13 ± 0.4 |
TRANSLATE-TEST ENGLISH RETRIEVER | | | | | | | | |
DPR | RoBERTa | Test | Test | 53.62 ± 2.2 | 20.33 ± 4.1 | 10.24 ± 1.8 | 45.19 | 36.81 ± 1.2 |
GOLD NQ PASSAGES | | | | | | | | |
Gold NQ | M-BERT | – | Test | 80.22 | 20.13 ± 5.5 | 7.56 ± 1.7 | 51.97 | 37.8 ± 2.0 |
Gold NQ | M-BERT | Test | Test | 80.22 | 28.10 ± 6.5 | 12.1 ± 2.1 | 51.97 | 41.4 ± 2.2 |
Gold NQ | M-BERT | Train | Test | 80.22 | 32.21 ± 6.0 | 14.8 ± 1.9 | 51.97 | 44.1 ± 1.8 |
Gold NQ | XLM-R | – | Test | 80.22 | 38.81 ± 3.2 | 20.05 ± 2.6 | 52.27 | 45.5 ± 1.4 |
Gold NQ | XLM-R | Test | Test | 80.22 | 34.23 ± 5.0 | 16.38 ± 2.6 | 52.27 | 42.9 ± 2.1 |
Gold NQ | XLM-R | Train | Test | 80.22 | 40.28 ± 3.1 | 20.93 ± 2.7 | 52.27 | 46.0 ± 1.4 |
GENERATIVE MODELS | | | | | | | | |
Query-only | mT5 | – | – | – | – | – | 43.8 | 35.0 ± 1.2 |
Gold NQ | mT5 | – | – | 80.22 | 36.8 ± 6.2 | 17.07 ± 2.6 | 47.6 | 38.5 ± 2.2 |
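
To make "F1 at the best confidence threshold" and the No Answer lower bound concrete, the following is a minimal sketch, not the benchmark's official scorer. It assumes whitespace tokenization, a single gold answer per question, and an empty string standing for No Answer; the function names and the toy `examples` list are hypothetical and chosen only for illustration.

```python
from collections import Counter


def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1; an empty string stands for 'No Answer' (assumption)."""
    pred_tokens, gold_tokens = prediction.split(), gold.split()
    if not pred_tokens or not gold_tokens:
        # Correct abstention (both empty) scores 1, any mismatch scores 0.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def mean_f1_at_threshold(examples, threshold):
    """examples: list of (predicted_text, confidence, gold_text) tuples."""
    scores = []
    for predicted, confidence, gold in examples:
        # Below the threshold the system abstains, i.e. predicts No Answer.
        answer = predicted if confidence >= threshold else ""
        scores.append(token_f1(answer, gold))
    return sum(scores) / len(scores)


def best_threshold_f1(examples):
    """Return (best mean F1, threshold) over all observed confidences."""
    # +inf means "always abstain", which reproduces the No Answer baseline.
    candidates = sorted({c for _, c, _ in examples}) + [float("inf")]
    return max((mean_f1_at_threshold(examples, t), t) for t in candidates)


# Hypothetical toy data: two answerable and two unanswerable questions.
examples = [
    ("paris", 0.9, "paris"),   # confident and correct
    ("london", 0.2, ""),       # unanswerable, low confidence
    ("rome", 0.4, "madrid"),   # answerable, wrong prediction
    ("berlin", 0.6, ""),       # unanswerable, fairly confident
]

no_answer_baseline = mean_f1_at_threshold(examples, float("inf"))  # 0.5
best_f1, best_threshold = best_threshold_f1(examples)              # (0.75, 0.9)
```

Because "always abstain" is one of the candidate thresholds, the best-threshold score in this sketch can never fall below the No Answer baseline, which is why that baseline acts as a lower bound on the end-to-end numbers.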