To address the limitations of semantic-based methods, researchers increasingly work with the original sentences, which carry richer contextual information than semantic concepts. At present, the main methods for text–video cross-modal retrieval map video and text into a common latent space, where t...
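The common-space idea can be sketched in a few lines. The example below is a minimal illustration, assuming plain linear projections and cosine similarity as the matching score; the names `project`, `cosine`, `rank_videos`, `W_text` and `W_video` are hypothetical, and real systems learn the projections with deep encoders rather than fixing them by hand.

```python
import math

def project(W, x):
    """Linearly project feature vector x with matrix W (one row per output dim)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(p * q for p, q in zip(a, b))
    norm = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    return dot / norm if norm > 0 else 0.0

def rank_videos(text_feat, video_feats, W_text, W_video):
    """Return video indices sorted by similarity to the text in the common space."""
    query = project(W_text, text_feat)
    scored = [(cosine(query, project(W_video, v)), idx)
              for idx, v in enumerate(video_feats)]
    return [idx for _, idx in sorted(scored, key=lambda s: -s[0])]

# Toy data (assumed): 2-d text features, 3-d video features, 2-d common space.
W_text = [[1.0, 0.0], [0.0, 1.0]]
W_video = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
videos = [[0.0, 1.0, 5.0], [1.0, 0.0, 2.0], [1.0, 1.0, 0.0]]
ranking = rank_videos([1.0, 0.0], videos, W_text, W_video)
```

Because text and video features start in spaces of different dimensionality, they can only be compared after both are projected; the ranking is then a simple sort on the similarity score.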
Dual encoding is conceptually simple, practically effective and end-to-end. As experiments on three benchmarks, i.e. MSR-VTT, TRECVID 2016 and 2017 Ad-hoc Video Search show, the proposed solution establishes a new state-of-the-art for zero-example video retrieval. Jianfeng Dong...
Dual Alignment Unsupervised Domain Adaptation for Video-Text Retrieval Xiaoshuai Hao1,2, Wanqian Zhang1*, Dayan Wu1, Fei Zhu1,2, Bo Li1,2 1Institute of Information Engineering, Chinese Academy of Sciences 2School of Cyber Security, University of Chinese Academy of ...
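Here, \text{sim}(t_i, v_j) denotes the similarity between text t_i and video v_j, τ is the temperature coefficient, and B is the batch size. Modified Negative-aware InfoNCE (NegNCE) Loss: unlike the conventional InfoNCE loss, the paper proposes a modified version, the Negative-aware InfoNCE (NegNCE) loss, which identifies all hard negatives in the batch and penalizes them more heavily in the training loss.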
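The symbols defined above map onto a short piece of code. The sketch below is a minimal plain-Python illustration of the standard text-to-video InfoNCE term, assuming a precomputed B x B similarity matrix. The function names `info_nce` and `hard_negatives`, the default temperature of 0.07, and the rule used to flag hard negatives (a negative that scores at least as high as the matched pair) are assumptions made for illustration, not the paper's definitions. The extra penalty that NegNCE applies to those negatives is not specified in this excerpt, so it is left out.

```python
import math

def info_nce(sim, tau=0.07):
    """Text-to-video InfoNCE averaged over the batch.

    sim[i][j] is sim(t_i, v_j); the matched pair for text i is video i.
    """
    B = len(sim)
    total = 0.0
    for i in range(B):
        logits = [s / tau for s in sim[i]]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]  # equals -log softmax of the positive
    return total / B

def hard_negatives(sim):
    """Assumed rule: (i, j) is hard if the negative scores >= the positive."""
    B = len(sim)
    return [(i, j) for i in range(B) for j in range(B)
            if j != i and sim[i][j] >= sim[i][i]]

# Toy batch with B = 2: text 1 is more similar to video 0 than to its own video.
sim = [[0.9, 0.1],
       [0.8, 0.3]]
loss = info_nce(sim)
hard = hard_negatives(sim)
```

A loss in the spirit of NegNCE would add a term over the pairs returned by `hard_negatives`; how that term is weighted is a design choice the excerpt does not give.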
Video-to-Text 16.1 32.1 41.5 17 0.112
Dual Encoding on Ad-hoc Video Search (AVS) Data
The following three datasets are used for training, validation and testing: tgif-msrvtt10k, tv2016train and iacc.3. For more information about these datasets, please refer to https://github.com/li-xirong/avs....
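Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss, 2022. 0 Abstract: Using the large-scale pre-trained model CLIP for the video-text retrieval (VTR) task has become a new trend, surpassing previous VTR methods. However, because of the heterogeneity in structure and content between video and text, previous CLIP-based models tend to overfit during training, which leads to relatively poor retrieval performance...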
text cross-modal retrieval, and video features are extracted in much the same way as image features. For example, He et al. [14] use the same convolutional neural network to extract video and image features, and then project the extracted features into a common space. Bai...
uses inter-video contrastive learning to roughly align the global features of paragraphs and videos, reducing modality differences and constructing a coarse-grained feature space to break free from the need for correspondence between paragraphs and videos. Additionally, this coarse-grained feature space further ...
Text-to-Video 8.0 25.0 37.5 77.1 147.6
Reference
@inproceedings{dong2023DLDKD, title = {Dual Learning with Dynamic Knowledge Distillation for Partially Relevant Video Retrieval}, author = {Jianfeng Dong and Minsong Zhang and Zheng Zhang and Xianke Chen and Daizong Liu and Xiaoye Qu and Xun Wang and ...
Dual Encoding for Zero-Example Video Retrieval Jianfeng Dong, Xirong Li, Chaoxi Xu, Shouling Ji, Yuan He, Gang Yang, Xun Wang (Submitted on 17 Sep 2018 (v1), last revised 19 Mar 2019 (this version, v3))...