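1. Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval (2023); t2v: 47.8. Paper: [2308.07648] Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval (arxiv.org). Motivation: the authors use prompts to learn semantically enhanced video features. Existing approaches extract image features frame by frame with CLIP; here the prompt is used to obtain global video semantics.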
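Without any post-processing, Cap4Video achieves state-of-the-art performance on four standard text-video retrieval benchmarks: MSR-VTT (51.4%), VATEX (66.6%), MSVD (51.8%), and DiDeMo (52.0%). I. Introduction. Text-video retrieval is a fundamental task in video-language learning. With the rapid development of image-language pre-training [15, 30, 46, 47], researchers have gradually shifted their focus to extending image-...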
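ALPRO is pre-trained with four objectives: MLM (masked language modeling), VTM (video-text matching), VTC (video-text contrastive loss), and PEM (prompting entity modeling loss). The latter two strengthen cross-modal alignment between video and text: VTC focuses on capturing instance-level alignment, while PEM focuses on aligning local video regions with textual entity descriptions. VTC:...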
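Instance-level alignment of the kind VTC targets is commonly implemented as a symmetric InfoNCE-style loss over a batch of paired video and text embeddings. The sketch below is a minimal NumPy illustration under that assumption; it is not ALPRO's implementation, and the function name `vtc_loss`, the temperature of 0.07, and the embedding shapes are hypothetical choices made for the example.

```python
import numpy as np


def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def vtc_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss for a batch of paired embeddings.

    Row i of video_emb and row i of text_emb are assumed to be a matching
    pair; every other row in the batch serves as a negative.
    The temperature value is an assumption, not taken from the source.
    """
    # L2-normalise so the dot product is cosine similarity.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    sim = (v @ t.T) / temperature  # shape: (batch, batch)
    idx = np.arange(sim.shape[0])

    # Video-to-text: each row should put its mass on the diagonal entry.
    v2t = -log_softmax(sim, axis=1)[idx, idx].mean()
    # Text-to-video: the same, taken column-wise.
    t2v = -log_softmax(sim, axis=0)[idx, idx].mean()
    return 0.5 * (v2t + t2v)


rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
aligned = vtc_loss(emb, emb)          # every pair matches perfectly
shuffled = vtc_loss(emb, emb[::-1])   # pairs are deliberately mismatched
```

With correctly paired embeddings the loss is close to zero, and it grows once the pairing is broken, which is the behaviour an instance-level alignment objective is meant to have.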
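This paper addresses the video-text retrieval task, i.e. retrieving videos given text or retrieving text given a video. To handle complex language and video content, it proposes Hierarchical Graph Reasoning (HGR), which models video and language at three levels (events, actions, and entities), each forming nodes in a graph. Video-language alignment is likewise scored separately at each of the three levels, and finally...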
Video-text retrieval requires joint understanding of video and language, so it differs from the video retrieval task. Related task: video retrieval. Tasks: 3. Models: 31. UniAdapter, on MSR-VTT, 2023, SOTA: R@1 49.9...
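The R@1 figure quoted for retrieval benchmarks is Recall@K with K = 1: the fraction of text queries whose ground-truth video is ranked within the top K results. A minimal sketch of that computation, assuming a square similarity matrix in which query i matches video i (the function name `recall_at_k` and the toy scores are hypothetical):

```python
import numpy as np


def recall_at_k(sim, k):
    """Recall@K for text-to-video retrieval.

    sim[i, j] is the similarity between text query i and video j.
    The ground-truth video for query i is assumed to be video i.
    """
    gt_scores = np.diag(sim)[:, None]
    # Rank = number of videos scored strictly higher than the ground truth.
    ranks = (sim > gt_scores).sum(axis=1)
    return float((ranks < k).mean())


# Toy scores for three queries and three videos.
sim = np.array([
    [0.9, 0.1, 0.2],   # query 0: correct video ranked first
    [0.8, 0.3, 0.5],   # query 1: correct video ranked third
    [0.1, 0.2, 0.7],   # query 2: correct video ranked first
])
r1 = recall_at_k(sim, 1)
r3 = recall_at_k(sim, 3)
```

Here two of the three queries retrieve their video at rank one, so `r1` is 2/3, while `r3` is 1.0 because every ground-truth video falls within the top three.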
Keywords: Video-text retrieval; Transformer; Multi-modal attention; Attribute learning; Graph Convolutional Network. Despite significant advancements in deep learning-based video-text retrieval methods, three challenges persist: the alignment of fine-grained semantic information from text and video, ensuring that the obtained ...
Text-video retrieval is a challenging task that aims to identify relevant videos given textual queries. Compared to conventional textual retrieval, the main obstacle for text-video retrieval is the semantic gap between the textual nature of queries and the visual richness of video content. Previous ...
UATVR: Uncertainty-Adaptive Text-Video Retrieval. Bo Fang 1∗, Wenhao Wu 2,3∗, Chang Liu 4∗, Yu Zhou 1†, Yuxin Song 3, Weiping Wang 1, Xiangbo Shu 5, Xiangyang Ji 4, Jingdong Wang 3. 1 Institute of Information Engineering, Chinese Academy of Sciences; 2 The University of Sydney; 3 Baidu Inc. ...
We train VATT end-to-end from scratch using multimodal contrastive losses and evaluate its performance by the downstream tasks of video action recognition, audio event classification, image classification, and text-to-video retrieval. How...