Table 3: Overview of tasks and data used to pre-train models. LV refers to the 4 multimodal models; Vok. to Vokenization. *Initialized with BERT.

| Model | Pre-training task(s)                | Pre-training data             |
|-------|-------------------------------------|-------------------------------|
| GloVe | Unsupervised vector learning        | Wikipedia 2014 + Gigaword 5   |
| BERT  | Masked Language Model (MLM) + Next Sentence Prediction (NSP) | English Wikipedia + BooksCorpus |
| LV    | Masked Language Model (MLM) + Masked Object Classification KL + Image-Text Matching (ITM) | Conceptual Captions |
| Vok.  | Token-Image Matching (TIM)*         | COCO + Visual Genome          |
|       | Masked Language Model (MLM)         | English Wikipedia + Wiki103   |