region-focustransformerDense captioning is a very critical but under-explored task, which aims to densely detect localized regions-of-interest (RoIs) and describe them with natural language in a given image. Although recent studies tried to fuse multi-scale features from different visual instances ...
73 DHSNet: Deep Hierarchical Saliency Network for Salient Object Detection. Nian Liu, Junwei Han 一种关键性检测的深度网络。DHSNet首先预测物体边缘,接下来采用一种改进的递归网络对边缘进行微调。In this work, we propose a novel end-to-end deep hierarchical saliency network (DHSNet) based on convoluti...
Although the "pseudo" region-text pairs are noisy, they still provide useful information for learning region representations, and thus help to bridge the gap in object detection, as validated by our experiments. We pretrain our RegionCLIP model on image captioning datasets (e.g., Conce...
However, those tasks are focused on obtaining superior discriminative visual features. A global–local captioning model (GLCM) [38] introduced both global and local features into the RSIC model, and their attention-based decoding network was a Transformer block including self-attention and co-...
However, those tasks are focused on obtaining superior discriminative visual features. A global–local captioning model (GLCM) [38] introduced both global and local features into the RSIC model, and their attention-based decoding network was a Transformer block including self-attention and co-...