In this paper, we propose retrieval-enhanced adversarial training with dynamic memory-augmented attention for image paragraph captioning (RAMP), which makes full use of the R-best retrieved candidate captions to enhance image paragraph captioning via adversarial training. Concretely, RAMP treats ...
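As a rough illustration of the memory-augmented attention this snippet describes, the sketch below lets a caption decoder attend over the R-best retrieved candidate captions as an external memory. The module name, the gated fusion, and the tensor shapes are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): cross-attention over the R-best
# retrieved caption embeddings, used as an external memory by the decoder.
import torch
import torch.nn as nn

class RetrievedCaptionMemory(nn.Module):
    """Attend over R retrieved-caption embeddings and gate them into the decoder state."""
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, dec_states, retrieved_embs):
        # dec_states:     (B, T, d) decoder hidden states
        # retrieved_embs: (B, R, d) one embedding per retrieved candidate caption
        mem, _ = self.attn(query=dec_states, key=retrieved_embs, value=retrieved_embs)
        g = torch.sigmoid(self.gate(torch.cat([dec_states, mem], dim=-1)))
        return g * mem + (1.0 - g) * dec_states  # gated fusion, (B, T, d)

# toy usage with assumed sizes
B, T, R, d = 2, 16, 5, 512
fused = RetrievedCaptionMemory(d)(torch.randn(B, T, d), torch.randn(B, R, d))
print(fused.shape)  # torch.Size([2, 16, 512])
```

In a setup like this, the fused states would feed the paragraph decoder, while the adversarial training mentioned in the abstract would act on the generated paragraphs.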
Zero-shot image captioning (IC) without well-paired image-text data can be divided into two categories, training-free and text-only-training. The main difference between them is whether a textual corpus is used to train the language model (LM). Though achieving attractive performance w.r.t. some metrics, existin...
3.2.4 Image Caption
#RNN;LSTM;GRU
Recurrent Relational Memory Network for Unsupervised Image Captioning, IJCAI 2020
#transformer
Memory-Augmented Image Captioning, AAAI 2021
Retrieval-Augmented Transformer for Image Captioning, CBMI 2022
Smallcap: Lightweight Image Captioning Prompted with Retrieval Augmentation, CVPR ...
EVCap: Retrieval-Augmented Image Captioning with External Visual–Name Memory for Open-World Comprehension. Jiaxuan Li, Duc Minh Vo, Akihiro Sugimoto, Hideki Nakayama. The University of Tokyo, Japan; National Institute of Informatics, Japan. {li,vmduc}@nlab.ci.i.u-tokyo.ac...
Memory-Augmented Image Captioning. Zhengcong Fei. AAAI Conference on Artificial Intelligence.
Introduce a Dual Relation Transformer (DRTran) model for image captioning. Design a dual relation enhancement encoder to complement the advantages of grid and pseudo-region features. Devise a dynamic memory module to learn prior knowledge about input images. Balance the contributions of the two features by cross-...
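A dynamic memory module of the kind mentioned above is commonly realized as learnable memory slots mixed into the attention keys and values, so the encoder can draw on dataset-level priors beyond the current image. The sketch below shows that generic pattern; the slot count and dimensions are arbitrary choices, not values taken from DRTran.

```python
# Minimal sketch, not the DRTran implementation: self-attention whose keys/values
# are augmented with learnable memory slots that store dataset-level priors.
import torch
import torch.nn as nn

class MemoryAugmentedSelfAttention(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8, n_slots: int = 40):
        super().__init__()
        self.mem_k = nn.Parameter(torch.randn(1, n_slots, d_model) * 0.02)  # prior keys
        self.mem_v = nn.Parameter(torch.randn(1, n_slots, d_model) * 0.02)  # prior values
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, feats):
        # feats: (B, N, d) grid or pseudo-region features from the encoder
        B = feats.size(0)
        k = torch.cat([feats, self.mem_k.expand(B, -1, -1)], dim=1)
        v = torch.cat([feats, self.mem_v.expand(B, -1, -1)], dim=1)
        out, _ = self.attn(query=feats, key=k, value=v)
        return out  # (B, N, d)

feats = torch.randn(2, 49, 512)              # e.g. 7x7 grid features
out = MemoryAugmentedSelfAttention()(feats)  # (2, 49, 512)
```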
We introduce a highly effective retrieval-augmented image captioning method that prompts LLMs with object names retrieved from an external visual–name memory (EVCap). We build an ever-changing object-knowledge memory using objects' visuals and names, enabling us to (i) update the memory at a minimal...
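The retrieval step described here can be pictured as a nearest-neighbor lookup over the visual–name memory followed by prompt assembly. The sketch below is a simplified stand-in: the function names, the top-k value, the prompt template, and the toy memory are assumptions, not EVCap's code.

```python
# Minimal sketch of the retrieval step described above (not EVCap's actual code):
# look up the closest entries in a visual-name memory and put their names in a prompt.
import numpy as np

def retrieve_object_names(query_emb, memory_embs, memory_names, top_k=3):
    """query_emb: (d,) image embedding; memory_embs: (M, d); memory_names: list of M names."""
    q = query_emb / (np.linalg.norm(query_emb) + 1e-8)
    m = memory_embs / (np.linalg.norm(memory_embs, axis=1, keepdims=True) + 1e-8)
    sims = m @ q                                   # cosine similarity to every memory entry
    idx = np.argsort(-sims)[:top_k]
    return [memory_names[i] for i in idx]

def build_prompt(names):
    return f"Objects in the image: {', '.join(names)}. Describe the image in one sentence."

# toy memory with assumed entries
memory_embs = np.random.randn(1000, 256).astype(np.float32)
memory_names = [f"object_{i}" for i in range(1000)]
names = retrieve_object_names(np.random.randn(256).astype(np.float32), memory_embs, memory_names)
print(build_prompt(names))
```

Because the memory is just an array of (visual embedding, name) pairs, new objects can be appended without retraining, which is the property the abstract emphasizes.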
MIRA-CAP: Memory-Integrated Retrieval-Augmented Captioning for State-of-the-Art Image and Video Captioning. Generating accurate and contextually rich captions for images and videos is essential for various ...
This paper proposes a hybrid model that combines a pre-trained model with a retrieval-based memory mechanism to tackle the Personalized Image Captioning problem. Our method involves two main phases: (1) constructing a User Memory (UM), and (2) generating image descriptions using the pre-trained ...
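A minimal sketch of the two phases, under assumed interfaces: (1) build a per-user memory from that user's past captions, (2) retrieve the most relevant entries to condition the pre-trained captioner. The class name, the text embedder, and the retrieval score are hypothetical, not the paper's implementation.

```python
# Minimal sketch of the two phases described above, under assumed interfaces:
# (1) build a per-user memory of past captions, (2) retrieve entries to condition generation.
import numpy as np

class UserMemory:
    def __init__(self, embed_fn):
        self.embed_fn = embed_fn          # any text encoder: str -> (d,) vector (assumption)
        self.texts, self.embs = [], []

    def add(self, caption: str):          # phase 1: store the user's past captions
        self.texts.append(caption)
        self.embs.append(self.embed_fn(caption))

    def retrieve(self, image_emb, top_k=2):  # phase 2: fetch the most relevant memories
        embs = np.stack(self.embs)
        sims = embs @ image_emb / (np.linalg.norm(embs, axis=1) * np.linalg.norm(image_emb) + 1e-8)
        return [self.texts[i] for i in np.argsort(-sims)[:top_k]]

# toy usage with a random stand-in encoder
rng = np.random.default_rng(0)
um = UserMemory(lambda s: rng.standard_normal(64))
um.add("my corgi playing at the beach")
um.add("sunset over the lake near our cabin")
print(um.retrieve(rng.standard_normal(64)))  # retrieved captions would condition the pre-trained model
```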
We propose a Memory Rehearsal augmented recurrent Attention-based Captioning (MRAC) approach to achieve continual image captioning under domain shifts. MRAC employs an attention mechanism to focus a portion of each layer's activations on task t, restricts the evolution of model weights by ...
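One way to read "focus a portion of each layer's activations on task t" is a learned per-task gate over a layer's units, in the spirit of hard attention to the task. The sketch below illustrates that reading only; the class name, the gate sharpness, and the layer shape are assumptions, not the MRAC implementation.

```python
# Minimal sketch (not the MRAC implementation): a learned per-task gate that
# devotes a portion of a layer's activations to task t and leaves the rest free.
import torch
import torch.nn as nn

class TaskGatedLayer(nn.Module):
    def __init__(self, in_dim: int, out_dim: int, n_tasks: int, s: float = 10.0):
        super().__init__()
        self.fc = nn.Linear(in_dim, out_dim)
        self.task_emb = nn.Embedding(n_tasks, out_dim)  # one gate vector per task
        self.s = s                                      # gate sharpness (assumed hyper-parameter)

    def forward(self, x, task_id: int):
        t = torch.full((x.size(0),), task_id, dtype=torch.long, device=x.device)
        gate = torch.sigmoid(self.s * self.task_emb(t))  # near-binary mask over units
        return gate * torch.relu(self.fc(x))             # only the gated units carry task t

layer = TaskGatedLayer(128, 256, n_tasks=4)
h = layer(torch.randn(8, 128), task_id=1)  # (8, 256); units masked off for task t stay free for later tasks
```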