Visual-Linguistic Verification Module The image is first fed into a CNN and then into the transformer encoder layers, yielding F_{v}, a feature representation of the objects in the image. However, F_{v} contains no linguistic prior knowledge. To address this, we propose the Visual-Linguistic Verification Module, which computes fine-grained correlations between the visual and textual features and focuses the features on the regions semantically relevant to the text description...
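The snippet above describes the mechanism only at a high level. A minimal sketch of the idea (not the paper's exact formulation) is: project both modalities into a joint embedding space, score each visual token by its best semantic match against the text tokens, and re-weight the visual features F_v accordingly. All projection matrices (`W_v`, `W_l`) and shapes below are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def verification_scores(F_v, F_l, W_v, W_l):
    """Project visual tokens F_v (N_v, d_v) and text tokens F_l (N_l, d_l)
    into a joint space, then score each visual token by its best
    cosine-similarity match against the text description."""
    V = F_v @ W_v                                   # (N_v, d)
    L = F_l @ W_l                                   # (N_l, d)
    V = V / (np.linalg.norm(V, axis=-1, keepdims=True) + 1e-8)
    L = L / (np.linalg.norm(L, axis=-1, keepdims=True) + 1e-8)
    sim = V @ L.T                                   # (N_v, N_l) cosine similarities
    return sim.max(axis=-1)                         # (N_v,) relevance per visual token

# Toy example: re-weight visual features so text-relevant regions are emphasized.
rng = np.random.default_rng(0)
F_v = rng.normal(size=(16, 32))   # 16 visual tokens from the encoder
F_l = rng.normal(size=(5, 24))    # 5 text tokens
W_v = rng.normal(size=(32, 8))    # illustrative learned projections
W_l = rng.normal(size=(24, 8))
s = verification_scores(F_v, F_l, W_v, W_l)
F_v_mod = F_v * softmax(s)[:, None]   # modulated visual features
```

In a trained model the projections would be learned jointly with the rest of the network; the point of the sketch is only the score-then-modulate structure that injects linguistic priors into the otherwise text-agnostic F_v.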
We propose a method for automatic information retrieval from receipt images that combines visual and linguistic features in deep network architectures and outperforms existing approaches. Our Skip-Rect Embedding (SRE) descriptor is demonstrated in two canonical applications for receipt information ...
Unsupervised Visual-Linguistic Reference Resolution in Instructional Videos We propose an unsupervised method for reference resolution in instructional videos, where the goal is to temporally link an entity (e.g., “dressing”) to the action (e.g., “mix yogurt”) that produced it. The ...
Background This is work from Microsoft Research Asia that extends the text-only BERT to the visual-linguistic setting, covering both pre-training and fine-tuning, so that it can serve multiple downstream tasks. Abstract The authors propose VL-BERT, a pre-trainable model for learning generic representations for visual-linguistic tasks. VL-BERT uses transformers as its backbone and can take both visual and linguistic features as input. Its pre-training tasks use data including the visual-language dataset Conceptual ...
VL-BERT is a simple yet powerful pre-trainable generic representation for visual-linguistic tasks. It is pre-trained on a massive-scale caption dataset and a text-only corpus, and can be fine-tuned for various downstream visual-linguistic tasks, such as Visual Commonsense Reasoning, Visual...
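The two VL-BERT snippets above hinge on one architectural point: linguistic tokens and visual (RoI) tokens are assembled into a single input sequence for the transformer backbone. A rough sketch of that input assembly, with illustrative shapes and projections (not the paper's exact embedding scheme), could look like:

```python
import numpy as np

def build_vl_input(word_emb, region_feat, W_img, d):
    """Assemble a VL-BERT-style input: linguistic tokens followed by visual
    (RoI) tokens, each tagged with segment and position information.
    W_img projects RoI features into the shared embedding dimension d."""
    n_w, n_r = word_emb.shape[0], region_feat.shape[0]
    vis = region_feat @ W_img                            # (n_r, d) projected RoI features
    tokens = np.concatenate([word_emb, vis], axis=0)     # (n_w + n_r, d) one sequence
    # segment embeddings: 0 marks text tokens, 1 marks image tokens
    seg = np.zeros((n_w + n_r, d))
    seg[n_w:] += 1.0
    # a stand-in for learned position embeddings (sequence order)
    pos = np.arange(n_w + n_r)[:, None] / (n_w + n_r)
    return tokens + seg + pos                            # transformer-ready input

rng = np.random.default_rng(1)
x = build_vl_input(rng.normal(size=(7, 16)),    # 7 word embeddings
                   rng.normal(size=(4, 20)),    # 4 region (RoI) features
                   rng.normal(size=(20, 16)),   # illustrative projection
                   16)
# x has one row per linguistic or visual token, all in a shared d-dim space
```

The design choice worth noting is that once both modalities live in one token sequence, the standard transformer self-attention handles cross-modal interaction for free, which is what makes the model "simple yet powerful" across downstream tasks.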
This paper takes one step toward this goal. In this paper, we study the transfer performance of pre-trained models on face analysis tasks and introduce a framework, called FaRL, for general Facial Representation Learning in a visual-linguistic manner. On one hand, the frame...
This repository contains the code for Visual-Linguistic Pre-training (VLP) and for fine-tuning via Visual-Linguistic Causal Intervention (VLCI) on the IU-Xray/MIMIC-CXR datasets. Requirements All requirements are listed in the requirements.yaml file. Please use this command to create a new environment and ...
We present CausalVLR (Causal Visual-Linguistic Reasoning), an open-source toolbox containing a rich set of state-of-the-art causal relation discovery and causal inference methods for various visual-linguistic reasoning tasks, such as VQA, image/video captioning, medical report generation, model ...
Inspired by human reasoning mechanisms, external visual commonsense knowledge is believed to help in reasoning about the visual relationships between objects in images, yet it is rarely considered by existing methods. In this paper, we propose a novel approach named Relational Visual-Linguistic ...