With the visually-augmented context, VaLM uses a visual knowledge fusion layer to enable multimodal grounded language modeling by attending to both text context and visual knowledge in images. We evaluate VaLM on various visual knowledge-intensive commonsense reasoning tasks, which require visual ...
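To make the idea concrete, below is a minimal sketch of what such a fusion layer could look like: the language model's token hidden states attend jointly over the preceding text context and a set of retrieved image embeddings. This is an illustrative assumption, not the implementation in this repository; class and argument names such as `VisualKnowledgeFusion` and `num_images` are hypothetical, and causal masking of the text positions is omitted for brevity.

```python
# Hypothetical sketch of a visual knowledge fusion layer (not the official VaLM code).
import torch
import torch.nn as nn


class VisualKnowledgeFusion(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        # Project retrieved image embeddings into the LM's hidden space.
        self.img_proj = nn.Linear(d_model, d_model)
        # Joint attention over [text context; visual knowledge].
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text_hidden: torch.Tensor, image_emb: torch.Tensor) -> torch.Tensor:
        # text_hidden: (batch, seq_len, d_model) hidden states from the LM
        # image_emb:   (batch, num_images, d_model) retrieved image features
        visual = self.img_proj(image_emb)
        # Keys/values cover both the text context and the visual knowledge,
        # so each token can ground its prediction in the retrieved images.
        memory = torch.cat([text_hidden, visual], dim=1)
        fused, _ = self.attn(query=text_hidden, key=memory, value=memory)
        return self.norm(text_hidden + fused)


if __name__ == "__main__":
    layer = VisualKnowledgeFusion()
    text = torch.randn(2, 16, 512)   # a batch of token hidden states
    images = torch.randn(2, 4, 512)  # 4 retrieved image embeddings per sample
    print(layer(text, images).shape)  # torch.Size([2, 16, 512])
```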
Official implementation of our paper "Visually-Augmented Language Modeling". Please cite our paper if you find this repository helpful in your research:

@article{valm,
  title={Visually-augmented language modeling},
  author={Wang, Weizhi and Dong, Li and Cheng, Hao and Song, Haoyu and Liu, Xiaodo...