This work introduces CLIP-It, a single framework for addressing both generic and query-focused video summarization, typically approached separately in the literature. We propose a language-guided multimodal transformer that learns to score frames in a video based on their importance relative to one ...
# CLIP-It! Language-Guided Video Summarization

Medhini Narasimhan, Anna Rohrbach, Trevor Darrell

https://arxiv.org/abs/2107.00650

## Installation

```
cd docker
docker build -t clipit .
```

## Usage

```python
import torch
from clip_it import CLIP_IT

device = torch.device('cuda')
clip_model_name = 'ViT-B/32'
num_...
```
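The core idea of language-guided scoring can be sketched independently of the full model: embed the frames and the query into a shared space, then rank frames by similarity to the query. A minimal sketch, using cosine similarity over random stand-in tensors in place of real CLIP encoder outputs (the 512-d size matches ViT-B/32, but the helper name `score_frames` and the toy data are illustrative assumptions, not the repo's API):

```python
import torch
import torch.nn.functional as F


def score_frames(frame_embeds: torch.Tensor, text_embed: torch.Tensor) -> torch.Tensor:
    """Score frames by cosine similarity to a language query.

    frame_embeds: (num_frames, dim) image embeddings, one per frame
    text_embed:   (dim,) text embedding of the query or generated caption
    Returns a (num_frames,) tensor of importance scores in [-1, 1].
    """
    frame_embeds = F.normalize(frame_embeds, dim=-1)
    text_embed = F.normalize(text_embed, dim=-1)
    return frame_embeds @ text_embed


# Stand-in embeddings; in practice these come from CLIP's image and text encoders.
torch.manual_seed(0)
frames = torch.randn(8, 512)   # 8 frames, 512-d embeddings
query = torch.randn(512)
scores = score_frames(frames, query)
top_k = scores.topk(3).indices  # indices of the 3 highest-scoring frames
```

The full model replaces this single dot product with a multimodal transformer that attends over all frames jointly, so a frame's score depends on the rest of the video, not just its pairwise match to the query.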