The pretraining datasets: the number of images, the caption type, and the number of captions in each.
Dataset | # images | Caption type | # captions |
---|---|---|---|
MSCOCO | 83K | Annot. | 592K |
Visual Genome (VG) | 110K | Annot. | 5.4M |
MSCOCO-narratives | 83K | Narration | 230K |
OI-narratives | 500K | Narration | 1.3M |
SBU | 1M | Web | 1M |
Conceptual Captions | 2.7M | Alt-text | 2.7M |