Memory-Augmented Image Captioning. Zhengcong Fei. National Conference on Artificial Intelligence.
We introduce a highly effective retrieval-augmented image captioning method that prompts LLMs with object names retrieved from External Visual-name memory (EVCap). We build an ever-changing object knowledge memory using objects' visuals and names, enabling us to (i) update the memory at a minimal...
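The retrieval step described in this abstract can be illustrated with a minimal sketch. It is not the EVCap implementation: the class `VisualNameMemory`, the helpers `cosine` and `build_prompt`, the three-dimensional toy features, and the prompt wording are all hypothetical, and cosine similarity is assumed as the ranking function. The sketch only shows the general idea of storing (visual feature, object name) pairs, extending the store by appending, and turning the retrieved names into an LLM prompt.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class VisualNameMemory:
    """Hypothetical toy memory of (visual feature, object name) pairs."""

    def __init__(self):
        self.features = []
        self.names = []

    def add(self, feature, name):
        # Updating the memory is a plain append; nothing is retrained.
        self.features.append(list(feature))
        self.names.append(name)

    def retrieve(self, query, k=2):
        # Rank stored entries by similarity to the query feature, best first.
        ranked = sorted(
            range(len(self.names)),
            key=lambda i: cosine(query, self.features[i]),
            reverse=True,
        )
        # Keep each object name once, preserving the ranking order.
        unique_names = []
        for i in ranked:
            if self.names[i] not in unique_names:
                unique_names.append(self.names[i])
        return unique_names[:k]


def build_prompt(names):
    # Assumed prompt template; the actual wording used with the LLM may differ.
    return "Objects in the image: " + ", ".join(names) + ". Describe the image."


memory = VisualNameMemory()
memory.add([1.0, 0.0, 0.0], "dog")
memory.add([0.0, 1.0, 0.0], "frisbee")
memory.add([0.0, 0.0, 1.0], "car")

# Toy query feature standing in for an encoded image region.
prompt = build_prompt(memory.retrieve([0.9, 0.4, 0.0], k=2))
print(prompt)
```

In this toy setup, adding a new object only requires another `memory.add(...)` call, which mirrors the abstract's point that the memory can be updated cheaply.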
Attention models used for problems such as image captioning typically depend on the image under consideration, as well as the previous sequence of words that come before the word currently being generated. While these types of models have produced impressive results, they are not able to model ...
Dataset. For the long-term video understanding task, we conduct experiments including (LVU) and two standard video summarization datasets (Breakfast, COIN). For the video question answering task, we conduct experiments including MSRVTT, MSVD, and ActivityNet. For the video captioning task, we also conduct...
In this paper, we propose a retrieval-enhanced adversarial training with dynamic memory-augmented attention for image paragraph captioning (RAMP), which makes full use of the R-best retrieved candidate captions to enhance the image paragraph captioning via adversarial training. Concretely, RAMP treats ...