Paper: Cross-Modality Fusion Transformer for Multispectral Object Detection. Code: https://github.com/DocF/multispectral-object-detection. Motivation: earlier CNN-based work did not model long-range and global information. This paper proposes a Cross-Modality Fusion Transformer (CFT) module that uses the Transformer to fully exploit global context. Attentio...
Cross-Modality Fusion Transformer for Multispectral Object Detection
To fully exploit the different modalities, we present a simple yet effective cross-modality feature fusion approach, named Cross-Modality Fusion Transformer (CFT) in this paper. Unlike prior CNNs-based works, guided by the transformer scheme, our network learns long-range dependencies and integrates...
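CNN-based methods have only local receptive fields, so they can aggregate information only within local regions and do not model long-range dependencies or global context. This paper proposes a Cross-Modality Fusion Transformer (CFT) module that uses the Transformer to fully exploit global contextual information. The attention mechanism performs intra-modality and inter-modality feature fusion at the same time, and captures the latent interactions between the visible (RGB) and infrared modalities.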
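To make the "intra-modality and inter-modality at the same time" point concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes the two feature maps are flattened into token sequences, concatenated, and passed through a single self-attention head. The function name `joint_self_attention`, the random projection weights, and the toy sizes are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def joint_self_attention(rgb_tokens, ir_tokens, w_q, w_k, w_v):
    """Single-head self-attention over the concatenated RGB + infrared tokens.

    rgb_tokens, ir_tokens: (N, C) flattened feature maps (hypothetical shapes).
    Returns the fused RGB tokens, the fused infrared tokens, and the (2N, 2N)
    attention matrix.
    """
    tokens = np.concatenate([rgb_tokens, ir_tokens], axis=0)   # (2N, C)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))             # (2N, 2N)
    fused = attn @ v
    n = rgb_tokens.shape[0]
    return fused[:n], fused[n:], attn

rng = np.random.default_rng(0)
n_tokens, channels = 16, 8          # toy sizes, e.g. a 4x4 map with 8 channels
rgb = rng.standard_normal((n_tokens, channels))
ir = rng.standard_normal((n_tokens, channels))
w_q, w_k, w_v = (rng.standard_normal((channels, channels)) * 0.1 for _ in range(3))

rgb_out, ir_out, attn = joint_self_attention(rgb, ir, w_q, w_k, w_v)

# The attention matrix splits into four blocks:
intra_rgb = attn[:n_tokens, :n_tokens]   # RGB queries attending to RGB keys
rgb_to_ir = attn[:n_tokens, n_tokens:]   # RGB queries attending to infrared keys
ir_to_rgb = attn[n_tokens:, :n_tokens]   # infrared queries attending to RGB keys
intra_ir = attn[n_tokens:, n_tokens:]    # infrared queries attending to infrared keys
```

Because every query attends over all 2N tokens, each row of the attention matrix spreads its weight across both the intra-modality block and the inter-modality block, so both kinds of fusion come from one attention operation.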
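The authors take the RoiTransformer architecture as their baseline and modify the network so that it accepts cross-modal inputs and produces cross-modal outputs. CMDet consists of three branches: RGB, infrared, and fusion. The fusion branch merges the two modalities. Each modality uses its own ResNet 50 backbone. The feature maps of the two modalities are then concatenated, and a 1×1 convolution reduces the channel dimension. Because the baseline...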
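The concatenate-then-reduce step mentioned above can be sketched in NumPy as follows. This is an illustration under assumptions, not CMDet's code: the helper `conv1x1`, the random weights, and the toy channel count are hypothetical, and real ResNet 50 feature maps have far more channels.

```python
import numpy as np

def conv1x1(x, weight, bias):
    """1x1 convolution: the same linear map over channels at every pixel.

    x: (C_in, H, W), weight: (C_out, C_in), bias: (C_out,)
    """
    return np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]

rng = np.random.default_rng(0)
channels, height, width = 4, 5, 5                 # toy sizes for illustration
rgb_feat = rng.standard_normal((channels, height, width))
ir_feat = rng.standard_normal((channels, height, width))

# Concatenate along the channel axis: (2C, H, W).
stacked = np.concatenate([rgb_feat, ir_feat], axis=0)

# Reduce 2C channels back to C with a 1x1 convolution.
weight = rng.standard_normal((channels, 2 * channels)) * 0.1
bias = np.zeros(channels)
fused = conv1x1(stacked, weight, bias)            # (C, H, W)
```

A 1×1 convolution mixes channels without looking at neighbouring pixels, so the spatial resolution is unchanged and only the channel count drops from 2C to C.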