2024/06/13: 🚀 We introduce OmniCorpus, a 10 billion-level image-text interleaved dataset. It contains 8.6 billion images, 1,696 billion text tokens, and 2.2 billion documents!

Introduction

The OmniCorpus dataset is the largest multimodal dataset to date, which pushes the boundaries of ...