UVTR[22] generates a unified representation in the 3D voxel space by deformable attention[60]. Among query-based methods, FUTR3D[8] defines the 3D reference points as queries and directly samples the features from the coordinates of projected planes. Tran...
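The sampling step described for query-based methods can be illustrated with a minimal sketch: a 3D reference point is projected onto an image plane and a feature is read at the resulting coordinates. The pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), the single-channel feature map, and all function names are assumptions made for illustration; this is not FUTR3D's actual implementation, which is not detailed in the excerpt.

```python
def project_point(point_xyz, fx, fy, cx, cy):
    """Project a 3D point (camera coordinates) onto the image plane (pinhole model)."""
    x, y, z = point_xyz
    if z <= 0:
        return None  # point is behind the camera: no valid projection
    return (fx * x / z + cx, fy * y / z + cy)


def bilinear_sample(feature_map, u, v):
    """Bilinearly sample a 2D feature map (list of rows) at continuous pixel (u, v)."""
    h, w = len(feature_map), len(feature_map[0])
    if not (0 <= u <= w - 1 and 0 <= v <= h - 1):
        return 0.0  # outside the image: contribute nothing
    u0, v0 = int(u), int(v)
    u1, v1 = min(u0 + 1, w - 1), min(v0 + 1, h - 1)
    du, dv = u - u0, v - v0
    top = (1 - du) * feature_map[v0][u0] + du * feature_map[v0][u1]
    bottom = (1 - du) * feature_map[v1][u0] + du * feature_map[v1][u1]
    return (1 - dv) * top + dv * bottom


def sample_reference_point(point_xyz, feature_map, fx, fy, cx, cy):
    """Hypothetical helper: project a 3D reference point, then sample its feature."""
    uv = project_point(point_xyz, fx, fy, cx, cy)
    if uv is None:
        return 0.0
    return bilinear_sample(feature_map, uv[0], uv[1])


# Toy 2x2 single-channel feature map (illustrative values only).
toy_map = [[0.0, 1.0], [2.0, 3.0]]
center_feature = sample_reference_point((0.0, 0.0, 1.0), toy_map, 1.0, 1.0, 0.5, 0.5)
```

In this toy setup the reference point lands at the centre of the four pixels, so the sampled value is their average.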
Up-Down | Bottom-up and top-down attention for image captioning and visual question answering | CVPR 2018
GCN-LSTM | Exploring visual relationship for image captioning | ECCV 2018
Transformer | Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning | ACL 2018
...
1. Introduction
With the stream of multimedia data flourishing on the Internet in the form of videos, images, text, etc., the cross-modal retrieval task has attracted more and more attention from the multimedia community. Cross-modal retrieval is the task of retrieving data from ...
Next, K_V stacked transformer blocks are leveraged to perform self-attention over the masked frame sequence. Lastly, the video query encoder outputs the enhanced representations of the masked frame sequence ℋ_V^m, which reflect the intra-modal interactions across frames. Similarly, in sentence query ...
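As a rough illustration of self-attention over a masked frame sequence, the sketch below replaces selected frame features with a mask token and then applies one single-head scaled dot-product self-attention pass. It assumes no learned query/key/value projections, a single block rather than a stack, and a zero-vector mask token; the function names are hypothetical and the excerpt does not specify the authors' actual encoder.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mask_frames(frames, masked_indices, mask_token):
    """Replace the frame features at masked_indices with a copy of mask_token."""
    masked = set(masked_indices)
    return [list(mask_token) if i in masked else list(f) for i, f in enumerate(frames)]


def self_attention(seq):
    """Single-head scaled dot-product self-attention without learned projections."""
    d = len(seq[0])
    out = []
    for q in seq:
        # Similarity of this frame to every frame, scaled by sqrt(feature dim).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in seq]
        weights = softmax(scores)
        # Each output frame is a weighted mix of all frames in the sequence.
        out.append([sum(weights[j] * seq[j][c] for j in range(len(seq))) for c in range(d)])
    return out


# Toy sequence of three 2-dimensional frame features (illustrative values only).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
masked_frames = mask_frames(frames, [1], [0.0, 0.0])
enhanced = self_attention(masked_frames)
```

Because the masked frame carries no information of its own here, its output is built entirely from the other frames, which is the intra-modal interaction the passage refers to.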
Zhao P, Ye X, Du Z. Object Detection in Multispectral Remote Sensing Images Based on Cross-Modal Cross-Attention. Sensors 2024, 24, 4098. https://doi.org/10.3390/s24134098
The stair attention divides the attentive weights into three levels, allowing for better focus on different regions in the search scope. Additionally, CIDEr-based reward reinforcement learning [36] is used to enhance the quality of the generated sentences. Du et al. [37] proposed a Deformable ...
In this paper, we propose a cross-modal segmentation network for winter wheat mapping in complex terrain using multi-temporal remote-sensing images and DEM data. First, we propose a diverse receptive fusion (DRF) module, which applies a deformable receptive field to optical images during the ...