Overview of tasks and data used to pre-train models. LV refers to the 4 multimodal models; Vok. to Vokenization. *Initialized with BERT.
| Model | Pre-training task(s) | Pre-training data |
|-------|----------------------|-------------------|
| GloVe | Unsupervised vector learning | Wikipedia 2014 + Gigaword 5 |
| BERT | Masked Language Model (MLM) + Next Sentence Prediction (NSP) | English Wikipedia + BooksCorpus |
| LV* | Masked Language Model (MLM) + Masked Object Classification KL + Image-Text Matching (ITM) | Conceptual Captions |
| Vok. | Token-Image Matching (TIM)* | COCO + Visual Genome |
| | Masked Language Model (MLM) | English Wikipedia + Wiki103 |
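Since MLM appears in three of the four rows above, a minimal sketch of the input corruption that objective relies on may be useful. The 15% masking rate and the 80/10/10 mask/random/keep split follow the standard BERT recipe; the `mask_tokens` helper and the toy vocabulary are illustrative assumptions, not code from any of the models in the table.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Corrupt a token sequence for MLM pre-training.

    Returns (corrupted, labels): labels[i] holds the original token
    wherever the model must make a prediction, and None elsewhere.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # model must recover this token
            r = rng.random()
            if r < 0.8:
                corrupted.append(MASK_TOKEN)          # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted.append(rng.choice(vocab))   # 10%: replace with a random token
            else:
                corrupted.append(tok)                 # 10%: keep the original token
        else:
            corrupted.append(tok)
            labels.append(None)  # no loss computed at this position
    return corrupted, labels

# Toy usage: the sentence doubles as the vocabulary for the random-swap case.
tokens = "the quick brown fox jumps over the lazy dog".split()
print(mask_tokens(tokens, vocab=tokens, seed=0))
```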