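import torch
import torch.nn as nn
import torchvision.transforms as transforms
# The transformers library exposes its Vision Transformer as ViTModel;
# it has no class named VisionTransformer.
from transformers import ViTModel, BertModel, BertTokenizer

# Data loading and preprocessing
# Assume we have a Dataset class for loading interleaved image-text data
from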
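Image-text pair datasets:
LAION: laion.ai/laion-400-open
Conceptual Captions: github.com/google-resea
ALIGN: not open-sourced
COYO: huggingface.co/datasets
DataComp: datacomp.ai/

2 Creating a multimodal web document dataset
2.1 Collecting HTML files
The data collection process starts from the 25 most recent Common Crawl (commoncrawl.org/) data dumps available at the time the dataset was created.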
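The dump-selection step can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes dump identifiers follow the `CC-MAIN-<year>-<week>` naming pattern, and both the `latest_dumps` helper and the example IDs are hypothetical.

```python
import re

# Assumed naming pattern for Common Crawl dumps: "CC-MAIN-<year>-<week>".
_CRAWL_ID = re.compile(r"^CC-MAIN-(\d{4})-(\d{2})$")


def latest_dumps(crawl_ids, n=25):
    """Return the n most recent dump IDs, newest first (hypothetical helper)."""
    dated = []
    for crawl_id in crawl_ids:
        match = _CRAWL_ID.match(crawl_id)
        if match is None:
            continue  # skip IDs that do not follow the year-week pattern
        year, week = int(match.group(1)), int(match.group(2))
        dated.append(((year, week), crawl_id))
    dated.sort(reverse=True)  # newest (year, week) first
    return [crawl_id for _, crawl_id in dated[:n]]


# Illustrative IDs only; a real run would start from the full list of dumps.
example_ids = ["CC-MAIN-2022-05", "CC-MAIN-2023-06", "CC-MAIN-2021-49", "CC-MAIN-2022-40"]
print(latest_dumps(example_ids, n=2))
```

Sorting on the parsed `(year, week)` pair rather than on the raw string makes the ordering explicit, and IDs that do not match the pattern are dropped instead of being silently misordered.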
{
    // list of input text sentences
    "sentences": [
        "a kitchen is shown with a variety of items on the counters."
    ],
    // list of input image paths
    "images": [
        "./assets/dataset/coco/val2014/COCO_val2014_000000384213.jpg"
    ],
    // list of corresponding sentence indices for "images"
    "se...
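Because the record format above mixes JSON with `//` comments, a standard JSON parser will reject it as written. A minimal sketch of one way to read such a record follows, assuming each comment occupies its own line. The `parse_commented_json` helper is hypothetical, and the sample includes only the two fields that are fully visible above.

```python
import json


def parse_commented_json(text):
    """Parse JSON that carries full-line // comments (hypothetical helper)."""
    kept = [line for line in text.splitlines() if not line.lstrip().startswith("//")]
    return json.loads("\n".join(kept))


# Illustrative record limited to the fields shown in full in the source.
raw = """
{
  // list of input text sentences
  "sentences": ["a kitchen is shown with a variety of items on the counters."],
  // list of input image paths
  "images": ["./assets/dataset/coco/val2014/COCO_val2014_000000384213.jpg"]
}
"""

record = parse_commented_json(raw)
print(len(record["sentences"]), len(record["images"]))
```

Only whole-line comments are stripped, so a `//` inside a string value (a URL, for example) is left alone.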
Anole excels at the complex task of generating coherent sequences of alternating text and images. Through an innovative fine-tuning process using a carefully curated dataset of approximately 6,000 images, Anole achieves remarkable image generation and understanding capabilities with minimal additional trai...
This task challenges models to discern and disregard superfluous elements in both images and text to accurately answer questions and to follow intricate instructions to pinpoint the relevant image. In support of this task, we further craft a new VEGA dataset, tailored for the IITC task on ...
To train CoDi-2, we build a large-scale generation dataset encompassing in-context multimodal instructions across text, vision, and audio. CoDi-2 demonstrates a wide range of zero-shot and few-shot capabilities for tasks like editing, exemplar learning, composition, reasoning, e...
In this work, we introduce M3DBench, a comprehensive multi-modal instruction dataset for complex 3D environments with over 320k instruction-response pairs that: 1) supports general interleaved multi-modal instructions with text, user clicks, images, and other visual prompts, 2) unifies diverse ...
2024/06/13: We introduce 🐳 OmniCorpus, a 10 billion-level image-text interleaved dataset. This dataset contains 8.6 billion images, 1,696 billion text tokens, and 2.2 billion documents!
Introduction
The OmniCorpus dataset is the largest multimodal dataset to date, which pushes the boundaries of ...
OBELICS is an open, massive and curated collection of interleaved image-text web documents, containing 141M documents, 115B text tokens and 353M images.
Dataset page: https://huggingface.co/datasets/HuggingFaceM4/OBELICS
Visualization of OBELICS web documents: https://huggingface.co/spaces/Hugging...
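To make "interleaved" concrete, here is a minimal sketch of walking one such document in reading order. It assumes a layout in which each document stores two parallel lists, `images` and `texts`, with exactly one non-null entry per position. That layout, the `interleave` helper, and the sample document are assumptions for illustration and should be checked against the dataset page.

```python
def interleave(doc):
    """Yield ("image", url) or ("text", string) items in document order."""
    images, texts = doc["images"], doc["texts"]
    if len(images) != len(texts):
        raise ValueError("images and texts must have the same length")
    for image, text in zip(images, texts):
        # Exactly one of the two slots should be filled at each position.
        if (image is None) == (text is None):
            raise ValueError("each position needs exactly one of image or text")
        yield ("image", image) if image is not None else ("text", text)


# Hypothetical document in the assumed parallel-list layout.
sample_doc = {
    "images": [None, "https://example.com/cat.jpg", None],
    "texts": ["A short intro.", None, "A caption-like paragraph."],
}
items = list(interleave(sample_doc))
print([kind for kind, _ in items])
```

Keeping the two lists aligned preserves the original ordering of text and images, which is what distinguishes an interleaved document from a bag of image-caption pairs.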
stdlib-js / math-iter-sequences-odd-integers: Create an iterator which generates an interleaved sequence of odd integers.