Without using any image-text pairs, ViECap achieves SOTA transferability on multiple I2T tasks (cross-dataset evaluation, NoCaps) and can generate captions in a user-desired style (humorous, romantic). Transferable Decoding with Visual Entities for Zero-Shot Image Captioning Paper: https://arxiv.org/abs/2307.16525 Code: https://github.com/FeiElysia/ViECap Results Tasks...
We propose a simple framework, named DeCap, for zero-shot captioning. We introduce a lightweight ...
Zero-shot learning, a technique that has gained widespread attention in recent research, performs tasks without relying on domain-specific training data. However, current zero-shot image captioning methods mainly depend on non-autoregressive language models, which often suffer from operational ...
This paper targets the transferability of zero-shot captioning to out-of-domain images. As shown in this figure, we demonstrate that pre-trained vision-language models and large language models are susceptible to modality bias induced by the language model when adapted to image-to-text...
In video-caption interaction, the co-attention approach performs well. Introducing auxiliary query-caption pairs yields a clear improvement in matching scores. For both offline and online video, the method outperforms global-matching baselines. Overall, zero-shot captioning plays a key role in text-video retrieval: effective data augmentation, interaction, and auxiliary matching strategies improve cross-modal matching performance.
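The auxiliary matching idea above can be sketched as a simple score fusion: the direct query-video score is combined with the best auxiliary caption-video score. This is a minimal illustration, assuming cosine-similarity scores and a hypothetical weighting `alpha`; neither name comes from the papers discussed here.

```python
def fuse_scores(query_video_score, caption_video_scores, alpha=0.7):
    """Fuse the global query-video match with the best auxiliary
    caption-video match. Scores are assumed to be similarities in [0, 1];
    alpha is an illustrative interpolation weight, not a published value."""
    if not caption_video_scores:
        # No auxiliary captions available: fall back to the global score.
        return query_video_score
    best_aux = max(caption_video_scores)  # strongest auxiliary caption match
    return alpha * query_video_score + (1 - alpha) * best_aux
```

With `alpha=0.7` the direct query-video match dominates, while a strong auxiliary caption can still lift the final score of a relevant video.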
Zero-shot image captioning (IC) without well-paired image-text data can be divided into two categories: training-free and text-only-training. The main difference between them is whether a textual corpus is used to train the LM. Though achieving attractive performance w.r.t. some metrics, existing...
ZeroCap: Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic Paper: https://arxiv.org/abs/2111.14447 Code: https://github.com/YoadTew/zero-shot-image-to-text 2. Motivation Deep learning has driven at least three revolutions in computer vision: (1) machines achieving, earlier than expected, ...
Inspired by the recent success of training-free approaches for image captioning, we propose ZS-A2T, a zero-shot framework that translates the transformer attention of a given model into natural language without requiring any training. We consider this in the context of Visual Question Answering (...
Zero-shot video captioning. Use an LLM to generate captions, then use those captions for data augmentation. The video captioner can generate multiple captions (e.g., 20); besides the query-video pairs given in the dataset as positive samples, caption-video pairs can also serve as positives. To prevent noisy generated captions that are completely unrelated to the video content, the authors use a pretrained text encoder to compute the caption-query...
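The filtering step above can be sketched as follows: embed the query and each generated caption with a text encoder, then keep only captions whose similarity to the query clears a threshold. The encoder, the threshold value, and the function names here are illustrative assumptions, not details from the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors (plain lists)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def filter_captions(query_emb, captions, caption_embs, threshold=0.3):
    """Keep generated captions whose similarity to the query embedding
    exceeds the threshold, discarding noisy, unrelated generations.
    The threshold is a hypothetical value for illustration."""
    return [c for c, e in zip(captions, caption_embs)
            if cosine(query_emb, e) > threshold]
```

In practice the embeddings would come from a pretrained text encoder (e.g., a CLIP-style text tower); the toy vectors here only demonstrate the selection logic.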