Memory-Augmented Image Captioning. Zhengcong Fei. National Conference on Artificial Intelligence.
Generating multi-sentence descriptions for videos is one of the most challenging captioning tasks due to its high requirements for not only visual relevance but also discourse-based coherence across the sentences in the paragraph. Towards this goal, we propose a new approach called Memory-Augmented Recurrent Transformer (MART).
We introduce a highly effective retrieval-augmented image captioning method that prompts LLMs with object names retrieved from an external visual-name memory (EVCap). We build an ever-changing object knowledge memory using objects' visuals and names, enabling us to (i) update the memory at a minimal...
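To make the retrieval step above concrete, here is a minimal sketch of prompting a captioner with object names pulled from a visual-name memory. It is not EVCap's code: the CLIP-style image embeddings, the faiss index, and the helper names build_visual_name_memory, retrieve_object_names, and make_caption_prompt are all assumptions for illustration.

```python
import numpy as np
import faiss  # assumed nearest-neighbour library; any ANN index would work


def build_visual_name_memory(object_embeddings: np.ndarray, object_names: list):
    """Index L2-normalised object embeddings so inner product equals cosine similarity."""
    index = faiss.IndexFlatIP(object_embeddings.shape[1])
    index.add(object_embeddings.astype("float32"))
    return index, object_names


def retrieve_object_names(image_embedding: np.ndarray, index, names, top_k: int = 5):
    """Return the names attached to the top-k visually closest memory entries."""
    _, ids = index.search(image_embedding.astype("float32").reshape(1, -1), top_k)
    return [names[i] for i in ids[0]]


def make_caption_prompt(retrieved_names) -> str:
    """Surface the retrieved names as soft hints in an LLM prompt."""
    return ("Objects that may appear in the image: "
            + ", ".join(retrieved_names)
            + ". Describe the image in one sentence.")
```

Because the memory here is just an index plus a parallel name list, adding a new object only requires appending one embedding and one string, which is consistent with the low-cost memory update the abstract describes.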
We propose a new approach called Memory-Augmented Recurrent Transformer (MART), which uses a memory module to augment the transformer architecture. The memory module generates a highly summarized memory state from the video segments and the sentence history so as to help better prediction of the next sentence.
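As a rough, hedged illustration of the kind of recurrent memory update such a module performs (the real MART design differs in detail; RecurrentMemoryUpdater, the slot count, and the gating scheme are invented for this sketch):

```python
import torch
import torch.nn as nn


class RecurrentMemoryUpdater(nn.Module):
    """Toy gated update: summarise the current segment/sentence hidden states
    into a set of memory slots, then mix old and new memory with a learned gate."""

    def __init__(self, d_model: int, num_slots: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, memory: torch.Tensor, hidden: torch.Tensor) -> torch.Tensor:
        # memory: (batch, num_slots, d_model); hidden: (batch, seq_len, d_model)
        summary, _ = self.attn(query=memory, key=hidden, value=hidden)
        z = torch.sigmoid(self.gate(torch.cat([memory, summary], dim=-1)))
        return z * memory + (1 - z) * torch.tanh(summary)


# usage: carry `memory` across video segments so earlier sentences inform the next one
updater = RecurrentMemoryUpdater(d_model=512)
memory = torch.zeros(2, 8, 512)
segment_hidden = torch.randn(2, 20, 512)
memory = updater(memory, segment_hidden)
```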
Zero-shot Evaluation: Our model can also leverage pre-trained weights from InstructBlip without any finetuning to conduct zero-shot evaluation on video datasets: bash run_scripts/${dataset}/test.sh. Hyper-parameters: One important hyper-parameter is memory_bank_length; please change it in the training script...
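The snippet does not say what memory_bank_length controls. Assuming it caps the number of stored frame features, one common way to keep a memory bank at a fixed length is to merge the most similar neighbouring entries once the cap is exceeded; the function below is a hypothetical sketch of that idea, not the repository's implementation.

```python
import torch


def append_to_memory_bank(bank: torch.Tensor, new_feat: torch.Tensor,
                          memory_bank_length: int) -> torch.Tensor:
    """Append one frame feature (shape (1, d)) to the bank (shape (t, d)).
    If the cap would be exceeded, merge the two most similar adjacent entries."""
    bank = torch.cat([bank, new_feat], dim=0)
    if bank.size(0) <= memory_bank_length:
        return bank
    # cosine similarity between consecutive entries; merge the closest pair
    sims = torch.cosine_similarity(bank[:-1], bank[1:], dim=-1)
    i = int(torch.argmax(sims))
    merged = (bank[i] + bank[i + 1]) / 2
    return torch.cat([bank[:i], merged.unsqueeze(0), bank[i + 2:]], dim=0)
```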
In this paper, we propose a retrieval-enhanced adversarial training with dynamic memory-augmented attention for image paragraph captioning (RAMP), which makes full use of the R-best retrieved candidate captions to enhance image paragraph captioning via adversarial training. Concretely, RAMP treats ...
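A hedged sketch of what memory-augmented attention over retrieved captions can look like: the decoder state attends separately over image region features and the encoded retrieved captions, then fuses the two contexts. RetrievedCaptionAttention and its layer sizes are illustrative, not RAMP's implementation.

```python
import torch
import torch.nn as nn


class RetrievedCaptionAttention(nn.Module):
    """Attend over image regions and retrieved-caption tokens, then fuse both contexts."""

    def __init__(self, d_model: int):
        super().__init__()
        self.visual_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.memory_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.fuse = nn.Linear(2 * d_model, d_model)

    def forward(self, decoder_state, region_feats, retrieved_feats):
        # decoder_state: (B, 1, d); region_feats: (B, Nr, d); retrieved_feats: (B, Nt, d)
        vis_ctx, _ = self.visual_attn(decoder_state, region_feats, region_feats)
        mem_ctx, _ = self.memory_attn(decoder_state, retrieved_feats, retrieved_feats)
        return self.fuse(torch.cat([vis_ctx, mem_ctx], dim=-1))
```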
Jie Lei, Liwei Wang, Yelong Shen, Dong Yu, Tamara L. Berg, Mohit Bansal. MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning. Association for Computational Linguistics. doi:10.18653/v1/2020.acl-main.233