Sharma, “Beyond mono to binaural: Generating binaural audio from mono audio with depth and cross modal attention,” in WACV, 2022.
[15] F. Qingyun, H. Dapeng, and W. Zhaokui, “Cross-modality fusion transformer for multispectral object detection,” arXiv, 2021.
[16] A. Baevski, ...
PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers [paper]
ResViT: Residual vision transformers for multi-modal medical image synthesis [paper]
[CrossEfficientViT] Combining EfficientNet and Vision Transformers for Video Deepfake Detection [paper] [code]
[Discrete ViT] Discrete Repre...
Cross-Modal Retrieval and Synthesis (X-MRS): Closing the Modality Gap in Shared Representation Learning
Computational food analysis (CFA) naturally requires multi-modal evidence of a particular food, e.g., images, recipe text, etc. A key to making CFA possibl...
R Guerrero, HX Pham, V Pavlov...
2. CMSA: Cross-modal Self-Attention
In fact, CVer recently covered an application of Transformers to semantic segmentation: Fudan University proposes SETR: Transformer-based semantic segmentation
IV. Transformers for image generation
1. Image GPT
2. Image Transformer
3. High-resolution Image Synthesis
4. SceneFormer
V. Transformers for low-level vision
1. Transformers for ...
X. Huang, J. J. Zhang, C. Q. Zong. Entity-level cross-modal learning improves multi-modal machine translation. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP, ACL, Punta Cana, Dominican Republic, pp. 1067–1080, 2021. DOI: https://doi.org/10.18653/v1...
For the second problem, we build a cross-attention-based fusion module using the Swin Transformer [26] as a novel Transformer-based backbone. Considering that the Transformer is better suited for multi-modal fusion due to its global attention mechanism, using the cross-attention mechanism for ...
Text-to-image synthesis is a typical application of multimodal and cross-modal comparative learning. In the field of image generation, most models fall into two categories: GAN-based generation models [17,18,19,20,21,22] and diffusion-based models [23,24,25,26,27]....
Transformer Decoders with MultiModal Regularization for Cross-Modal Food Retrieval
Paper: http://arxiv.org/pdf/2204.09730
Code: https://github.com/mshukor/TFood
Multimodal - 1 paper ...
CrossFormer: "CrossFormer: A Versatile Vision Transformer Based on Cross-scale Attention", ICLR, 2022
CSWin: "CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows", CVPR, 2022
MPViT: "MPViT: Multi-Path Vision Transformer for Dense Prediction", CVPR, 2022
...
Cross-Modal Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers [CVPR 2021] [paper]
Revamping Cross-Modal Recipe Retrieval with Hierarchical Transformers and Self-supervised Learning [CVPR 2021] [paper] [code]
Topological Planning With Transformers for Vision-and-Language Navigati...