Table 2: Architectural differences of the specific pre-trained representations used in this paper.

| | Training objective | Corpus (#words) | Output dimension | Basic unit |
|---|---|---|---|---|
| Word embeddings | | | | |
| word2vec | Predicting surrounding words | Google News (100B) | 300 | word |
| GloVe | Predicting co-occurrence probability | Wikipedia + Gigaword 5 (6B) | 300 | word |
| fastText | Predicting surrounding words | Wikipedia + UMBC + statmt.org (16B) | 300 | subword |
| Contextualized word embeddings | | | | |
| ELMo | Language model | 1B Word Benchmark (1B) | 1024 | character |
| OpenAI GPT | Language model | BooksCorpus (800M) | 768 | subword |
| BERT | Masked language model (Cloze) | BooksCorpus + Wikipedia (3.3B) | 768 | subword |
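The output dimensions and basic units in Table 2 correspond directly to what these models return in practice. The sketch below is illustrative rather than part of the paper: it contrasts a static 300-dimensional GloVe vector (one vector per word type) with BERT's 768-dimensional contextual vectors (one per token occurrence). It assumes the gensim and Hugging Face transformers libraries are installed, and the checkpoint names `glove-wiki-gigaword-300` and `bert-base-uncased` are stand-ins for the specific models used in the paper.

```python
# Illustrative only: compares a static word embedding with a contextualized one.
# Assumes `pip install gensim transformers torch`; checkpoint names are examples.
import gensim.downloader as api
import torch
from transformers import AutoModel, AutoTokenizer

# Static GloVe vectors: one 300-d vector per word type, independent of context.
glove = api.load("glove-wiki-gigaword-300")   # trained on Wikipedia + Gigaword 5 (6B)
print(glove["bank"].shape)                    # (300,)

# Contextualized BERT vectors: one 768-d vector per token in the input sentence.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
inputs = tokenizer("She sat on the river bank", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)        # (1, num_subword_tokens, 768)
```

The same pattern extends to the other rows: fastText composes its static vectors from subwords, while ELMo builds context-dependent 1024-dimensional vectors from characters.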