We take a step further in pushing the limits of vision-and-language pre-training data by relaxing the data collection pipeline used in Conceptual Captions 3M (CC3M) [70] and introducing Conceptual 12M (CC12M), a dataset of 12 million image-text pairs specifically meant to be used for...
Improving captions can be its own machine learning task, as shown by DALL-E 3, which was trained on 5% ground-truth (human-annotated) captions and 95% long, highly descriptive synthetic captions generated by an image captioner [55]. Since generative models may underperform when ...