We introduce Conceptual 12M (CC12M), a dataset of ~12 million image-text pairs intended for vision-and-language pre-training. It is larger and covers a much more diverse set of visual concepts than Conceptual Captions (CC3M), a dataset widely used for pre-training and end-to-end training of image captioning models. Please see our paper for further details.
Image URL and caption pairs: Click here to download (2.5GB)
Format (.tsv):
[image_url_1]\t[caption_1]
[image_url_2]\t[caption_2]
[image_url_3]\t[caption_3]
…
[image_url_N]\t[caption_N]
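The pairs can be streamed line by line. A minimal sketch in Python, assuming the downloaded file is saved locally under the hypothetical name `cc12m.tsv`:

```python
# Minimal sketch: stream image-URL/caption pairs from the CC12M TSV.
# "cc12m.tsv" is a hypothetical local filename; substitute your own path.
with open("cc12m.tsv", encoding="utf-8") as f:
    for line in f:
        # Split on the first tab only, so a caption is kept intact.
        image_url, caption = line.rstrip("\n").split("\t", 1)
        print(image_url, caption)
```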
Image checksums (SHA256 and MD5): Click here to download (2.12GB). Credit: Nicholas Carlini
Format (.tsv):
[image_url_1]\t[SHA256_1]\t[MD5_1]
[image_url_2]\t[SHA256_2]\t[MD5_2]
[image_url_3]\t[SHA256_3]\t[MD5_3]
…
[image_url_N]\t[SHA256_N]\t[MD5_N]
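Because the images themselves are fetched from the original URLs, the checksums let you confirm that a downloaded image matches the one used in the paper. A sketch in Python, assuming the checksum file is saved locally under the hypothetical name `cc12m_hashes.tsv`:

```python
import hashlib
import urllib.request

def matches_checksums(image_url: str, sha256_hex: str, md5_hex: str) -> bool:
    """Download one image and compare its hashes to the published ones."""
    # A real crawl needs retries, timeouts, and polite rate limiting;
    # this sketch does a single blocking fetch.
    with urllib.request.urlopen(image_url, timeout=10) as resp:
        data = resp.read()
    return (hashlib.sha256(data).hexdigest() == sha256_hex.lower()
            and hashlib.md5(data).hexdigest() == md5_hex.lower())

# "cc12m_hashes.tsv" is a hypothetical local filename; use your own path.
with open("cc12m_hashes.tsv", encoding="utf-8") as f:
    for line in f:
        image_url, sha256_hex, md5_hex = line.rstrip("\n").split("\t")
        if not matches_checksums(image_url, sha256_hex, md5_hex):
            print("hash mismatch or changed image:", image_url)
```

A mismatch usually means the image at that URL has changed or disappeared since the dataset was collected, so flagged URLs are best dropped rather than retried.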
If you use this dataset in your research, please cite:
Soravit Changpinyo, Piyush Sharma, Nan Ding, Radu Soricut. Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts. CVPR 2021.
@inproceedings{changpinyo2021cc12m,
  title = {{Conceptual 12M}: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts},
  author = {Changpinyo, Soravit and Sharma, Piyush and Ding, Nan and Soricut, Radu},
  booktitle = {CVPR},
  year = {2021},
}