Cross-modal retrieval
Recipe retrieval, which focuses on retrieving a textual recipe given a text or an image as the query, has received great attention in the research community. However, cooking is an interestin
The present work is focused on using the information present in each modality to create a joint embedding space to perform cross-modal retrieval. This idea has been exploited especially using text and image joint embeddings [9,14,16], but also between other kinds of data, for example creating...
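As a minimal sketch of how retrieval in a joint embedding space can work, the snippet below ranks text embeddings against an image query by cosine similarity. The function names (`l2_normalize`, `cosine_similarity`, `retrieve`) and the toy three-dimensional vectors are hypothetical illustrations. In practice the embeddings would come from trained encoders for each modality, which are not shown here.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)  # leave an all-zero vector unchanged
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def retrieve(query_embedding, gallery_embeddings, top_k=3):
    """Return the top_k (index, score) pairs, most similar gallery item first."""
    scored = [
        (i, cosine_similarity(query_embedding, g))
        for i, g in enumerate(gallery_embeddings)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy data: one image query and three text items,
# all assumed to already live in the same joint space.
image_query = [0.9, 0.1, 0.0]
text_gallery = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
]

ranking = retrieve(image_query, text_gallery, top_k=2)
print([index for index, _ in ranking])
```

Because both modalities share one space, the same `retrieve` call works in either direction (image to text or text to image) by swapping which side supplies the query.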
TextVR | 2023 | YouTube | 10,500 | Video+Text | 15 | Weak | Cross-modal video retrieval with text reading comprehension
EgoCVR | 2024 | Ego4D | 2,295 | Video+Text | 3.9~8.1 | Weak | Egocentric dataset for fine-grained composed video retrieval

Anomaly Detection
Table 19
Dataset | Year | Source | # Videos | Modality | Avg. len...
TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval (ECCV 2020) [Paper][Homepage]
108,965 queries on 21,793 videos from 6 TV shows of diverse genres, where each query is associated with a tight temporal alignment.

Video Domain Adaptation
EPIC-Kitchens: Multi-Modal Domain Adaptatio...
Video captioning based on both egocentric and exocentric views of robot vision for human-robot interaction. Int. J. Soc. Robot., 15 (4) (2023), pp. 631-641, 10.1007/s12369-021-00842-1
8. Z. Parekh, J. Baldridge, D. Cer, A. Waters, Y. Yang Cris...
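* Cross-lingual Adaptation for Recipe Retrieval with Mixup
* Link: arxiv.org/abs/2205.0389
* Authors: Bin Zhu, Chong-Wah Ngo, Jingjing Chen, Wing-Kwong Chan
* Other: Accepted by ICMR2022
* Abstract: In recent years, cross-modal recipe retrieval has attracted research attention thanks to the availability of large-scale paired data for training. However, if not impossible, obtaining most of the...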
To tackle these challenges, we first develop a unified multi-modal model that jointly predicts event boundaries and captions as a single sequence of tokens, as explained in Section 3.1 and Figure 2. Second, we design a pretraining strategy that effectively leverage...
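One way to picture "event boundaries and captions as a single sequence of tokens" is sketched below. This is only an illustration under stated assumptions, not the authors' tokenization: the `<time=k>` token format, the 100-bin quantization, the whitespace word splitting, and the names `time_to_token` and `serialize_events` are all hypothetical.

```python
def time_to_token(t, duration, num_bins=100):
    """Quantize a timestamp in seconds into one of num_bins discrete time tokens."""
    if duration <= 0:
        raise ValueError("duration must be positive")
    fraction = min(max(t / duration, 0.0), 1.0)  # clamp to [0, 1]
    bin_index = min(int(fraction * num_bins), num_bins - 1)
    return f"<time={bin_index}>"


def serialize_events(events, duration, num_bins=100):
    """Flatten (start, end, caption) events into one token list.

    Events are ordered by start time; each contributes a start token,
    an end token, and then its caption words.
    """
    tokens = []
    for start, end, caption in sorted(events, key=lambda e: e[0]):
        tokens.append(time_to_token(start, duration, num_bins))
        tokens.append(time_to_token(end, duration, num_bins))
        tokens.extend(caption.split())
    return tokens


# Hypothetical annotations for a 120-second video.
events = [
    (30.0, 60.0, "stir the sauce"),
    (0.0, 15.0, "chop onions"),
]
sequence = serialize_events(events, duration=120.0)
print(sequence)
```

With boundaries and text sharing one vocabulary in this way, a single sequence model can be trained to emit both, rather than needing a separate localization head.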
2025-03-10 | Blind Video Super-Resolution based on Implicit Kernels | Qiang Zhu et al. | 2503.07856 | null
2025-03-08 | Removing Multiple Hybrid Adverse Weather in Video via a Unified Model | Yecong Wan et al. | 2503.06200 | null
2025-03-08 | DiffVSR: Revealing an Effective Recipe for Taming Robust Video Supe...
Cross Modal Retrieval with Querybank Normalisation | 2021
34 | CLIP4Clip | 43.4 | 70.2 | 80.6 | 2.0 | 17.5 | CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval | 2021
35 | ALPRO | 35.9 | 67.5 | 78.8 | 3 | Align and Prompt: Video-and-Language Pre-training with Entity Prompts ...
19. The system of claim 17, wherein the system is coupled to one or more remote display devices via a packet-based network and wherein the multimedia stream generator provides the encoded multimedia data stream to one or more display devices as a packet-based transmission.
20. A television ...