In general, visual features are extracted by locally connected convolutions, which leaves them isolated and relation-agnostic. The Transformer encoder contributes substantially to image captioning performance because it models the relations among inputs, enriching the visual features through self-attention. To better model the intra-layer relations of the two kinds of features, the authors design a Dual-Way Self Attention (DWSA), which consists of two independent self-attention modules.
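A minimal sketch of the dual-way idea, assuming the two feature streams are region features and grid features and that each stream attends only within itself; the class and argument names below are illustrative, not the authors' released code:

```python
import torch
import torch.nn as nn

class DualWaySelfAttention(nn.Module):
    """Two independent self-attention modules, one per feature stream,
    modelling intra-layer relations separately for each stream."""

    def __init__(self, d_model=512, num_heads=8, dropout=0.1):
        super().__init__()
        self.region_attn = nn.MultiheadAttention(d_model, num_heads,
                                                 dropout=dropout, batch_first=True)
        self.grid_attn = nn.MultiheadAttention(d_model, num_heads,
                                               dropout=dropout, batch_first=True)
        self.norm_region = nn.LayerNorm(d_model)
        self.norm_grid = nn.LayerNorm(d_model)

    def forward(self, region_feats, grid_feats):
        # Each stream attends only to itself (intra-layer relations).
        r, _ = self.region_attn(region_feats, region_feats, region_feats)
        g, _ = self.grid_attn(grid_feats, grid_feats, grid_feats)
        # Residual connection + layer norm, as in a standard Transformer block.
        return self.norm_region(region_feats + r), self.norm_grid(grid_feats + g)


# Example usage with dummy inputs (batch of 2, 36 regions / 49 grid cells, d_model 512):
if __name__ == "__main__":
    dwsa = DualWaySelfAttention()
    regions = torch.randn(2, 36, 512)
    grids = torch.randn(2, 49, 512)
    out_r, out_g = dwsa(regions, grids)
    print(out_r.shape, out_g.shape)  # torch.Size([2, 36, 512]) torch.Size([2, 49, 512])
```

Keeping the two attention modules separate means each feature type is refined by relations within its own stream before any cross-stream interaction happens in later layers.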
This repository contains the reference code for the papers Dual-Level Collaborative Transformer for Image Captioning and Improving Image Captioning by Leveraging Intra- and Inter-layer Global Representation in Transformer Network. For the experiment setup, please refer to m2 transformer ...