UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection. Ye Liu (1), Siyuan Li (2), Yang Wu (2)*, Chang Wen Chen (1,4), Ying Shan (2), Xiaohu Qie (3). (1) Department of Computing, The Hong Kong Polytechnic University; (2) ARC Lab, Ten...
In this paper, we present the first unified framework, named Unified Multi-modal Transformers (UMT), capable of realizing such joint optimization while also being easily degenerated for solving individual problems. As far as we are aware, this is the first scheme to integrate multi-modal (...
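To make the "joint optimization" reading concrete, one way such a unified model can be trained (a minimal sketch with hypothetical loss names and weights, not the paper's exact formulation) is a weighted sum of a moment-retrieval term and a highlight-detection term; zeroing one weight degenerates training to a single task:

```python
import torch
import torch.nn.functional as F

def joint_loss(moment_logits, moment_targets, saliency_pred, saliency_targets,
               w_mr=1.0, w_hd=1.0):
    """Hypothetical joint objective: weighted sum of a moment-retrieval term
    and a highlight-detection (saliency) term. Setting w_mr or w_hd to zero
    reduces the objective to a single task."""
    # Moment retrieval, simplified to per-clip classification of whether a
    # clip lies inside the target moment (instead of boundary regression).
    loss_mr = F.binary_cross_entropy_with_logits(moment_logits, moment_targets)
    # Highlight detection: regress per-clip saliency scores.
    loss_hd = F.mse_loss(saliency_pred, saliency_targets)
    return w_mr * loss_mr + w_hd * loss_hd

# Usage with random tensors (batch of 2 videos, 64 clips each).
if __name__ == "__main__":
    logits = torch.randn(2, 64)
    targets = torch.randint(0, 2, (2, 64)).float()
    sal_pred = torch.randn(2, 64)
    sal_gt = torch.rand(2, 64)
    print(joint_loss(logits, targets, sal_pred, sal_gt))
```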
Moment retrieval results on Charades-STA (leaderboard rank in parentheses):

| Model | R@1 IoU=0.5 | R@1 IoU=0.7 | R@5 IoU=0.5 | R@5 IoU=0.7 |
|---|---|---|---|---|
| UMT (VA) | 48.31 (#22) | 29.25 (#21) | 88.79 (#3) | 56.08 (#4) |
| UMT (VO) | 49.35 (#21) | 26.16 (#23) | ... | ... |
Unified Multi-modal Transformers: This repository maintains the official implementation of the paper UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection.
multi-stage fusion module (CMFM) and a bi-directional multi-scale decoder (BMD). Similar to the stage doctrine of color vision in the human visual system (HVS), the proposed CMFM aims to explore important feature representations at the feature response stage and integrate them into cross-modal ...
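As a rough illustration of a single cross-modal fusion stage (a minimal sketch under the assumption that fusion is attention-based; the module and parameter names below are hypothetical, not the authors' actual CMFM), each modality can attend to the other and merge the result through a residual connection:

```python
import torch
import torch.nn as nn

class CrossModalFusionStage(nn.Module):
    """Hypothetical single fusion stage: each modality attends to the other,
    and the attended features are merged back via residual connections."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn_a2b = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_b2a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_b = nn.LayerNorm(dim)

    def forward(self, feat_a, feat_b):
        # Modality A queries modality B, and vice versa.
        fused_a, _ = self.attn_a2b(feat_a, feat_b, feat_b)
        fused_b, _ = self.attn_b2a(feat_b, feat_a, feat_a)
        return self.norm_a(feat_a + fused_a), self.norm_b(feat_b + fused_b)

# Usage: two modality streams with 32 tokens of dimension 256 each.
if __name__ == "__main__":
    a, b = torch.randn(1, 32, 256), torch.randn(1, 32, 256)
    out_a, out_b = CrossModalFusionStage()(a, b)
    print(out_a.shape, out_b.shape)
```

Several such stages can be stacked to form a multi-stage fusion pipeline, with each stage refining the cross-modal representation of the previous one.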
We adopted the standard attention analysis strategy for vision transformers. For each layer in the transformer, we averaged the attention weights across the multiple heads (as we used multi-head self-attention in IRENE) to obtain a single attention matrix. To account for residual connections, we added an identity matrix to this attention matrix and re-normalized the weights.
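The procedure described here corresponds to attention rollout. A minimal sketch, assuming per-layer attention tensors of shape [heads, tokens, tokens] and using illustrative function names (not IRENE's actual code):

```python
import torch

def attention_rollout(per_layer_attn):
    """per_layer_attn: list of [heads, tokens, tokens] attention tensors,
    one per transformer layer. Returns a [tokens, tokens] rollout matrix."""
    num_tokens = per_layer_attn[0].shape[-1]
    rollout = torch.eye(num_tokens)
    for attn in per_layer_attn:
        # Average the attention weights across heads.
        attn = attn.mean(dim=0)
        # Add an identity matrix to account for residual connections,
        # then re-normalize each row so the weights sum to one.
        attn = attn + torch.eye(num_tokens)
        attn = attn / attn.sum(dim=-1, keepdim=True)
        # Multiply through the layers to propagate attention to the input tokens.
        rollout = attn @ rollout
    return rollout

# Usage: 12 layers, 8 heads, 197 tokens (e.g. a ViT-Base with a [CLS] token).
if __name__ == "__main__":
    layers = [torch.softmax(torch.randn(8, 197, 197), dim=-1) for _ in range(12)]
    print(attention_rollout(layers).shape)  # torch.Size([197, 197])
```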
The effectiveness of Vision Transformers (ViTs) diminishes considerably in multi-modal face anti-spoofing (FAS) under missing-modality scenarios. Existing approaches rely on modality-invariant features to alleviate this issue but ignore modality-specific features. To address this, we propose a M ...