Cross-modal retrieval aims to learn discriminative and modality-invariant features for data from different modalities. Unlike existing methods, which usually learn from features extracted by offline networks, in this paper we propose an approach that jointly trains the components of cross-modal retr...
Feature Projector. The Modality Classifier addresses only the modality-invariance problem; cross-modal similarity and semantic discriminativeness must be handled by the Feature Projector (FP). The FP consists of two parts, each targeting one of these problems: label prediction (semantic discriminativeness) and structure preservation (cross-modal similarity). The purpose of label prediction is to make...
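The two FP objectives above can be sketched numerically. This is a minimal, hedged illustration only: the toy 3-d features, the plain linear classification head, and the cosine-based structure term are all assumptions for demonstration, not the paper's actual architecture or losses.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def label_prediction_loss(projected, weights, label):
    """Semantic discriminativeness: cross-entropy of a linear classifier
    applied to the projected feature (illustrative stand-in)."""
    logits = [sum(w * x for w, x in zip(row, projected)) for row in weights]
    return -math.log(softmax(logits)[label])

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def structure_preservation_loss(img_feats, txt_feats):
    """Cross-modal similarity: pull paired image/text projections together
    (one simple choice of structure term; a hypothetical example)."""
    return sum(1.0 - cosine(i, t) for i, t in zip(img_feats, txt_feats)) / len(img_feats)

# Toy usage: two aligned image/text pairs in a shared 3-d space.
imgs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
txts = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # 2-class linear head

print(label_prediction_loss(imgs[0], W, label=0) >= 0.0)   # True
print(structure_preservation_loss(imgs, txts) < 0.5)       # True
```

In this sketch the two losses would be summed (possibly with weights) and minimized jointly, so the projector is pushed to be both class-discriminative and consistent across modalities.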
proposed MVI for modal- and view-invariant feature learning by contrasting, where the learned features can be used for cross-modal retrieval [17]. Cross-modal retrieval: Several methods have been proposed for the cross-modal retrieval task, mainly targeting image-text retrieval...
Matching face images across different modalities is a challenging open problem for various reasons, notably feature heterogeneity, and particularly in the case of sketch recognition: abstraction, exaggeration and distortion. Existing studies have attempted to address this task by engineering invariant ...
The image and text features used are 512-dimensional Scale-Invariant Feature Transform (SIFT) features and 1386-dimensional Bag of Words (BoW) features, respectively. To facilitate online cross-modal hashing, the training set is divided into 9 data chunks, with the first 8 chunks containing 2,...
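The chunked split described above can be sketched as follows. The helper below is a hypothetical illustration: the function name and the concrete sizes passed in the usage example are assumptions (the snippet's own per-chunk count is truncated), only the "fixed-size chunks plus a remainder" structure comes from the text.

```python
def make_chunks(n_samples, chunk_size, n_full_chunks):
    """Split sample indices into n_full_chunks of chunk_size each,
    with any remaining samples collected into one final chunk, so that
    an online hashing model can be updated chunk by chunk."""
    chunks = []
    start = 0
    for _ in range(n_full_chunks):
        chunks.append(list(range(start, start + chunk_size)))
        start += chunk_size
    if start < n_samples:
        chunks.append(list(range(start, n_samples)))  # remainder chunk
    return chunks

# Illustrative sizes only (not the dataset's actual counts):
chunks = make_chunks(n_samples=18000, chunk_size=2000, n_full_chunks=8)
print(len(chunks))      # 9
print(len(chunks[0]))   # 2000
```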
(2) Our notation separates the feature extractor from the final class weights wy, since the former is typically pre-trained on a massive source dataset and the latter is trained on the few-shot target dataset. However, sometimes the representation can ...
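That separation can be sketched as a frozen extractor followed by a per-class inner product with the weights w_y. Everything concrete here is an illustrative assumption: a fixed linear map stands in for the pre-trained extractor, and classification is argmax over <w_y, f(x)>.

```python
def extract(x, proj):
    """Stand-in for a frozen, pre-trained feature extractor:
    a fixed linear map (rows of proj are not trained here)."""
    return [sum(p * xi for p, xi in zip(row, x)) for row in proj]

def classify(feat, class_weights):
    """Score each class y by the inner product <w_y, f(x)>;
    only class_weights would be fit on the few-shot target data."""
    scores = [sum(w * f for w, f in zip(wy, feat)) for wy in class_weights]
    return scores.index(max(scores))

# Toy usage: identity "extractor", two classes in 2-d.
proj = [[1.0, 0.0], [0.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
print(classify(extract([2.0, 1.0], proj), W))   # 0
```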
(scale-invariant feature transform, SIFT), hand-designed for feature detection and matching; today, however, most image feature descriptors are learned from data by neural networks. For feature extraction within a single modality, convolutional neural network (CNN) architectures have dominated computer vision ever since AlexNet [16] achieved its breakthrough on the ImageNet image classification task...
, 2020) for unsupervised feature learning, two memories are built to track (unmasked) video and sentence keys across mini-batches, which serve as negative keys. During pre-training, CoCo strengthens the holistic vision-language association by maximizing the inter-modal relevance between masked video...
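A cross-batch memory of negative keys like the one described above can be sketched as a bounded FIFO queue: new video/sentence keys are enqueued each mini-batch, the oldest are evicted, and the whole queue serves as negatives. The class name, fixed capacity, and FIFO eviction are assumptions of this sketch, not details from the source.

```python
from collections import deque

class KeyMemory:
    """Tracks keys across mini-batches to serve as negatives
    for contrastive learning (hedged sketch)."""

    def __init__(self, capacity):
        # deque with maxlen evicts the oldest keys automatically.
        self.queue = deque(maxlen=capacity)

    def enqueue(self, keys):
        """Add the current mini-batch's keys to the memory."""
        self.queue.extend(keys)

    def negatives(self):
        """Return all stored keys for use as negative samples."""
        return list(self.queue)

# Toy usage: capacity 3, so the oldest key is dropped.
mem = KeyMemory(capacity=3)
mem.enqueue(["v1", "v2"])
mem.enqueue(["v3", "v4"])
print(mem.negatives())   # ['v2', 'v3', 'v4']
```

In practice one such memory would be kept per modality (video and sentence), matching the two memories mentioned in the text.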
Translation-invariant (TI) [7] attack performs horizontal and vertical shifts with a short distance to the input. The second way modifies gradients used for updating adversarial perturbations. For example, Momentum Iterative (MI) [6] attack integrates the momentu...
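The MI-style gradient modification can be sketched as follows. One step accumulates an L1-normalized gradient into a momentum term, then moves the input in the sign direction of that momentum. The decay factor mu, step size, and 1-d toy input are illustrative assumptions of this sketch.

```python
def mi_step(x, grad, momentum, mu=1.0, eps_step=0.01):
    """One Momentum Iterative update (hedged sketch):
    accumulate normalized gradients, then take a signed step."""
    l1 = sum(abs(g) for g in grad) or 1.0
    # Decay the running momentum and add the L1-normalized gradient.
    momentum = [mu * m + g / l1 for m, g in zip(momentum, grad)]
    # Step in the sign direction of the accumulated momentum.
    x = [xi + eps_step * (1 if m > 0 else -1 if m < 0 else 0)
         for xi, m in zip(x, momentum)]
    return x, momentum

# Toy usage: one step on a 1-d "input" with gradient 2.0.
x, m = mi_step([0.0], [2.0], [0.0])
print(x)   # [0.01]
print(m)   # [1.0]
```

Compared with plain iterative gradient steps, the momentum term smooths the update direction across iterations, which is what makes the resulting perturbations transfer better.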