Introduction For image-text embedding learning, the authors propose a cross-modal projection matching (CMPM) loss and a cross-modal projection classification (CMPC) loss. The former minimizes the KL divergence between the projection distributions of the two modalities' features; the latter, built on a norm-softmax loss, classifies the projections of modality A's features onto modality B's features, further tightening the alignment between the modalities. The Proposed Algor...
Compared with the original softmax loss, the norm-softmax loss normalizes all weight vectors to the same length, reducing the influence of weight magnitude when discriminating between samples. As shown in the figure above, the classification result of the softmax loss depends on \[\left\| {{W_k}} \right\|\left\| x \right\|\cos \left( {{\theta _k}} \right),\ \left( {k = 1,2} \right).\] For norm-softmax, ...
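A minimal sketch of the normalization step (the function name and the `scale` temperature are assumptions for illustration, not the paper's exact formulation):

```python
import numpy as np

def norm_softmax_logits(x, W, scale=1.0):
    # Norm-softmax: every class weight vector W_k is rescaled to unit length,
    # so the logit for class k becomes scale * ||x|| * cos(theta_k) instead of
    # ||W_k|| * ||x|| * cos(theta_k).  (`scale` is a hypothetical temperature.)
    W_unit = W / np.linalg.norm(W, axis=0, keepdims=True)  # normalize each column
    return scale * (x @ W_unit)

# Toy check: inflating a weight vector's magnitude no longer changes its logit.
x = np.array([1.0, 2.0])
W = np.array([[1.0, 10.0],
              [0.0, 0.0]])  # column 2 points the same way but is 10x longer
logits = norm_softmax_logits(x, W)  # both logits equal 1.0
```

After normalization, classification depends only on the angle between the feature and each weight vector, which is exactly the effect described above.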
On this basis, the cross-modal projection matching (CMPM) constraint is introduced, which minimizes the Kullback-Leibler divergence between the feature projection matching distributions and the label projection matching distributions; label information is thereby used to align similarities between low-dimensional features of ...
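The image-to-text half of this KL term can be sketched roughly as follows (the symmetric text-to-image term is omitted, and `eps` and the variable names are assumptions rather than the paper's exact notation):

```python
import numpy as np

def cmpm_loss(image_feats, text_feats, labels, eps=1e-8):
    # Project each image feature onto the normalized text features; a softmax
    # over these scalar projections gives the predicted matching distribution p,
    # which is pulled toward the normalized label-matching distribution q via KL.
    t_norm = text_feats / (np.linalg.norm(text_feats, axis=1, keepdims=True) + eps)
    proj = image_feats @ t_norm.T                       # scalar projections
    p = np.exp(proj) / np.exp(proj).sum(axis=1, keepdims=True)
    match = (labels[:, None] == labels[None, :]).astype(float)
    q = match / match.sum(axis=1, keepdims=True)        # normalized true matches
    return float(np.sum(p * np.log(p / (q + eps) + eps)) / len(image_feats))

# Toy call: two matched image-text pairs with distinct labels.
loss = cmpm_loss(np.eye(2) * 5, np.eye(2) * 5, np.array([0, 1]))
```

The loss drops as probability mass on unmatched pairs (where q is zero) shrinks, i.e. as the feature projection distribution approaches the label distribution.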
Various methods have been proposed to reduce the cross-domain discrepancy: using an adversarial loss, sharing a projection network, using a triplet loss with pairs/triplets of different modalities, or maximizing cross-modal pairwise item correlation [29, 42, 34, 20, 10]. Even though the existing...
Considering the similarity between the modalities, an autoencoder is utilized to associate the feature projection with the semantic code vector. In addition, regularization and sparsity constraints are applied to the low-dimensional matrices to balance the reconstruction errors. The high-dimensional data is ...
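One way such an objective could look in code (the symbols `W`, `S` and the weighting coefficients `lam`, `beta` are assumptions for illustration, not the cited paper's exact formulation):

```python
import numpy as np

def regularized_ae_objective(X, W, S, lam=0.1, beta=0.01):
    # Encode the features X into low-dimensional semantic codes via the
    # projection W, penalize the mismatch against the semantic code matrix S,
    # and add an L2 regularizer on W plus an L1 sparsity term on the codes.
    codes = W @ X                              # encoder: project the features
    recon = np.linalg.norm(codes - S) ** 2     # reconstruction error
    ridge = lam * np.linalg.norm(W) ** 2       # regularization on W
    sparsity = beta * np.abs(codes).sum()      # sparsity on the low-dim codes
    return recon + ridge + sparsity

# With a perfect encoder and the penalties switched off, the objective vanishes.
obj = regularized_ae_objective(np.eye(2), np.eye(2), np.eye(2), lam=0.0, beta=0.0)
```

The `lam` and `beta` weights are what "balance" the reconstruction error against the regularization and sparsity terms.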
Zhang, Y., Lu, H.: Deep cross-modal projection learning for image-text matching. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pp. 707–723. Munich, Germany (2018)
Li, S., Cao, M....
Chen, Y., Zhang, G., Lu, Y., Wang, Z., Zheng, Y. (2022) TIPCB: a simple but effective part-based convolutional baseline for text...
In recent years, most existing ZSL methods (Jiang et al., 2019; Li, Hou, Lai, & Yang, 2022; Romera-Paredes & Torr, 2015) first learn a projection function between the image feature space and the attribute space, and then recognize images of new categories by evaluating the compatibility between ...
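The projection-then-compatibility recipe above can be sketched as follows (the projection matrix is assumed already learned, and the dot-product compatibility score is one common choice, not the only one used in the cited works):

```python
import numpy as np

def zsl_predict(image_feat, W, class_attributes):
    # Project the image feature into attribute space with the learned matrix W,
    # then score each unseen class by the compatibility of its attribute
    # vector with the projection -- here a plain dot product.
    projected = W @ image_feat                 # image feature -> attribute space
    scores = class_attributes @ projected      # compatibility with each class
    return int(np.argmax(scores))

# Two unseen classes described by 2-d attribute vectors; identity projection.
attrs = np.array([[1.0, 0.0],   # class 0
                  [0.0, 1.0]])  # class 1
pred = zsl_predict(np.array([0.2, 0.9]), np.eye(2), attrs)  # -> class 1
```

At test time only the attribute vectors of the unseen classes are needed, which is what lets the model recognize categories it has never seen images of.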
the true modality label in the discriminator and a fake modality label in the generator. This is because the main purpose of the discriminator loss function is to narrow the modality difference between the generated representation and the original data during projection, thereby improving the discriminant cons...
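The true-versus-fake modality-label trick can be sketched with a binary cross-entropy loss (the toy outputs and variable names are assumptions for illustration):

```python
import numpy as np

def bce(p, y, eps=1e-8):
    # Binary cross-entropy between predicted probabilities p and targets y.
    return float(-(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)).mean())

# d_out: discriminator's probability that each representation came from an image.
d_out = np.array([0.9, 0.2])        # toy outputs: one image, one text sample
true_labels = np.array([1.0, 0.0])  # true modality labels (image = 1, text = 0)
fake_labels = 1.0 - true_labels     # flipped labels used to train the generator
d_loss = bce(d_out, true_labels)    # discriminator: identify the modality
g_loss = bce(d_out, fake_labels)    # generator: make modalities indistinguishable
```

Training the generator against the flipped labels pushes the projected representations toward the point where the discriminator can no longer tell the modalities apart, which is the narrowing of the modality gap described above.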
To facilitate feature projection into a descriptive latent space, a frozen linear transformation Htp, as mentioned in Section 3.2, is used to project the features from the last layer into a latent space capable of describing [CLS] of vision tokens. Similarly, we use a trainable linear transforma...
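The frozen-versus-trainable projection pattern can be sketched numerically (the dimensions and the name `W_desc` are assumptions; `H_tp` follows the text's Htp):

```python
import numpy as np

rng = np.random.default_rng(0)
feat = rng.normal(size=(32,))        # a last-layer token feature (toy size)

H_tp = rng.normal(size=(16, 32))     # frozen linear transformation
H_tp.flags.writeable = False         # emulate "frozen": weights never updated
W_desc = rng.normal(size=(16, 32))   # trainable linear transformation

latent_frozen = H_tp @ feat          # projection into the descriptive latent space
latent_train = W_desc @ feat         # projection whose weights receive gradients
```

Keeping `H_tp` fixed preserves the descriptive latent space it defines, while the trainable map is free to adapt during training.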