Conceptual Captions is a new dataset consisting of ~3.3M images annotated with captions. In contrast with the curated style of other image caption annotations, Conceptual Captions images and their raw descriptions are harvested from the web, and therefore represent a wider variety of styles. More precisely, the raw descriptions are harvested from the Alt-text HTML attribute associated with web images.
We present a new dataset of image caption annotations, Conceptual Captions, which contains an order of magnitude more images than the MS-COCO dataset (Lin et al., 2014) and represents a wider variety of both images and image caption styles. We achieve this by extracting and filtering image caption annotations from billions of webpages.
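The extraction-and-filtering pipeline is only summarized above; the following is a minimal sketch of the kind of text-side heuristics such a pipeline might apply to candidate (image URL, Alt-text) pairs. The thresholds, boilerplate patterns, and the `keep_candidate` helper are illustrative assumptions rather than the paper's actual rules, which also include image-based filters and a caption transformation step.

```python
import re

# Hypothetical thresholds and patterns; the real pipeline combines image-based,
# text-based, and joint image&text filters plus a transformation step not shown here.
MIN_TOKENS, MAX_TOKENS = 3, 20
BOILERPLATE = re.compile(r"\b(click here|stock photo|thumbnail|jpg|png)\b", re.IGNORECASE)

def keep_candidate(alt_text: str) -> bool:
    """Return True if an Alt-text caption passes simple text-side heuristics."""
    tokens = alt_text.split()
    if not (MIN_TOKENS <= len(tokens) <= MAX_TOKENS):
        return False  # too short or too long to read like a caption
    if BOILERPLATE.search(alt_text):
        return False  # obvious web boilerplate
    alpha_ratio = sum(t.isalpha() for t in tokens) / len(tokens)
    return alpha_ratio >= 0.8  # mostly words, not IDs or file names

candidates = [
    ("http://example.com/1.jpg", "a dog running along the beach at sunset"),
    ("http://example.com/2.jpg", "IMG_0042 click here to enlarge"),
]
kept = [(url, cap) for url, cap in candidates if keep_candidate(cap)]
print(kept)  # only the first pair survives
```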
We take a step further in pushing the limits of vision-and-language pre-training data by relaxing the data collection pipeline used in Conceptual Captions 3M (CC3M) [70] and introduce the Conceptual 12M (CC12M), a dataset with 12 million image-text pairs specifically meant to be used for vision-and-language pre-training.
Since generative models may underperform when sampled outside their training distribution, training a model on long captions could be a problem for users who write shorter prompts; this is addressed by using GPT-4 to expand (or ‘upsample’) the user’s prompt and disambiguate its terms. As pointed out by...
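A minimal sketch of this kind of prompt upsampling, assuming the OpenAI Python client; the system instruction and model name here are illustrative stand-ins, not the prompt actually used for caption upsampling in the paper.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative instruction; the actual upsampling prompt is not reproduced here.
UPSAMPLE_INSTRUCTION = (
    "You expand short image-generation prompts into detailed, unambiguous "
    "captions. Preserve the user's intent, add concrete visual details, and "
    "resolve ambiguous terms."
)

def upsample_prompt(user_prompt: str) -> str:
    """Expand a short user prompt into a long, descriptive caption via GPT-4."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": UPSAMPLE_INSTRUCTION},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content

print(upsample_prompt("a cat on a skateboard"))
```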
Underlying this direction of research was the invention of joint text-image embeddings, learned in a self-supervised manner by neural network models trained to map images and their captions to similar vectors in a shared latent space [71]. Turning this image-wide mapping into a pixel-wise segmentation ...
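Such embeddings are typically trained with a contrastive objective that pulls matched image-caption pairs together and pushes mismatched pairs apart. Below is a minimal sketch of that CLIP-style symmetric loss, assuming image and text encoders that already output fixed-size vectors; the temperature value and the random toy tensors are purely illustrative.

```python
import torch
import torch.nn.functional as F

def contrastive_image_text_loss(img_emb: torch.Tensor,
                                txt_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    img_emb, txt_emb: (batch, dim) tensors; row i of each is a matched pair.
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature     # scaled cosine similarities
    targets = torch.arange(img_emb.size(0))          # matched pairs lie on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)      # image -> caption direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # caption -> image direction
    return (loss_i2t + loss_t2i) / 2

# Toy usage: random vectors stand in for real image/text encoder outputs.
image_vecs = torch.randn(8, 512)
text_vecs = torch.randn(8, 512)
print(contrastive_image_text_loss(image_vecs, text_vecs).item())
```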