Benefiting from CMCGAN, we develop a dynamic multimodal classification network to handle the modality-missing problem. Extensive experiments validate that CMCGAN achieves state-of-the-art cross-modal visual-audio generation results. Furthermore, it is shown that the ...
The proposed MT-Net includes three progressive sub-networks: 1) feature learning, 2) cross-modal mapping, and 3) audio generation. First, the feature learning sub-network learns semantic features from images and audio, covering both image feature learning and audio feature learning. Second, ...
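A minimal sketch of this three-stage pipeline is shown below, assuming a PyTorch implementation. The module names, layer sizes, and spectrogram output shape are illustrative assumptions, not the paper's actual MT-Net configuration.

```python
# Sketch of the three progressive stages: feature learning -> cross-modal mapping -> audio generation.
import torch
import torch.nn as nn

class ImageEncoder(nn.Module):
    """Feature learning (visual branch): image -> semantic feature vector."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )

    def forward(self, img):
        return self.net(img)

class CrossModalMapper(nn.Module):
    """Cross-modal mapping: visual feature -> audio feature space."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU(),
                                 nn.Linear(feat_dim, feat_dim))

    def forward(self, img_feat):
        return self.net(img_feat)

class AudioGenerator(nn.Module):
    """Audio generation: mapped feature -> a mel-spectrogram (assumed output form)."""
    def __init__(self, feat_dim=256, n_mels=80, n_frames=64):
        super().__init__()
        self.n_mels, self.n_frames = n_mels, n_frames
        self.net = nn.Linear(feat_dim, n_mels * n_frames)

    def forward(self, audio_feat):
        return self.net(audio_feat).view(-1, self.n_mels, self.n_frames)

# Stages chained end to end: image -> image feature -> audio feature -> audio.
encoder, mapper, generator = ImageEncoder(), CrossModalMapper(), AudioGenerator()
images = torch.randn(4, 3, 128, 128)
spectrograms = generator(mapper(encoder(images)))  # shape (4, 80, 64)
```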
Recently, deep neural networks have emerged as a powerful architecture for capturing the nonlinear distributions of high-dimensional multimedia data such as images, video, text, and audio, and this extends naturally to multi-modal data. How can we make full use of multimedia data? This leads to an importa...
Finally, we also assess the cross-modal querying performance of the proposed model as well as the influence of full and partial training on the results. For the sake of reproducibility, our code is published. Downloading Pre-Trained Weights ...
Although most known examples of cross-modal interactions in audio-visual speech perception involve a dominant visual signal that modifies the apparent audio signal heard by the observer, there may also be cases where an audio signal can alter the visual image seen by the observer. In this ex...
This open lecture from 语音之家 (Speech Home) invites Wenwu Wang to give a talk on Audio-Text Cross Modal Translation. Lecture overview — Topic: Audio-Text Cross Modal Translation. Time: April 4, 2023, 16:00-17:00. Speaker introduction: Wenwu Wang is a Professor in Signal Processing and Machine Learning, and a Co-Director of the Machine Audition...
to be mapped together and compared directly for cross-modal search and retrieval. We also show that these jointly-learnt embeddings outperform solo embeddings of any one modality. Thus, our results break ground for a cross-modal Audio Search Engine that permits searching through ad-...
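The retrieval step in such a joint embedding space reduces to direct similarity comparison. Below is a minimal sketch assuming cosine similarity and a shared 128-dimensional space; the encoders and dimensions are placeholders, not the authors' actual models.

```python
# Cross-modal retrieval in a jointly-learnt embedding space: rank gallery items
# (e.g. audio clips) against a query embedding from another modality.
import torch
import torch.nn.functional as F

def retrieve(query_emb, gallery_embs, top_k=5):
    q = F.normalize(query_emb, dim=-1)      # (d,)
    g = F.normalize(gallery_embs, dim=-1)   # (N, d)
    scores = g @ q                          # cosine similarity per gallery item
    return torch.topk(scores, k=top_k)

# Example with random stand-in embeddings.
text_query = torch.randn(128)
audio_gallery = torch.randn(1000, 128)
values, indices = retrieve(text_query, audio_gallery)
```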
Course abstract: Cross-modal translation of audio and text has emerged as an important research area in artificial intelligence, sitting at the intersection of audio signal processing and natural language processing. Generating a meaningful description for an audio clip is known as automated audio captioning,...
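To make the captioning setup concrete, here is a minimal encoder-decoder sketch in PyTorch: an audio encoder conditions an autoregressive text decoder. The layer choices, vocabulary size, and feature shapes are illustrative assumptions only, not a specific published system.

```python
# Minimal automated audio captioning skeleton: audio frames in, caption token logits out.
import torch
import torch.nn as nn

class AudioCaptioner(nn.Module):
    def __init__(self, n_mels=64, hidden=256, vocab_size=5000):
        super().__init__()
        self.encoder = nn.GRU(n_mels, hidden, batch_first=True)   # audio frames -> states
        self.embed = nn.Embedding(vocab_size, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)   # caption tokens -> states
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, mel, tokens):
        # mel: (B, T_audio, n_mels); tokens: (B, T_text) teacher-forced caption prefix
        _, h = self.encoder(mel)                 # final audio state conditions the decoder
        dec_out, _ = self.decoder(self.embed(tokens), h)
        return self.out(dec_out)                 # (B, T_text, vocab_size) logits

model = AudioCaptioner()
logits = model(torch.randn(2, 500, 64), torch.randint(0, 5000, (2, 12)))
```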
Highlight: For the multimodal speech separation task, this paper proposes to explicitly model the correspondence between the separated speech and the visual representation, thereby improving separation quality. Specifically, the paper introduces a cross-modal correspondence loss. The motivation is that the separated speech of the target speaker should correspond to the output of that speaker's visual module; if this correspondence can be modeled, it helps the model extract better visual representations...
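One way to realize such a cross-modal correspondence loss is sketched below: embeddings of the separated speech and of the target speaker's visual stream should agree for matched pairs and disagree for mismatched ones. This contrastive formulation and the temperature value are assumptions for illustration and not necessarily the paper's exact loss.

```python
# Contrastive correspondence between separated-speech and visual embeddings.
import torch
import torch.nn.functional as F

def correspondence_loss(audio_emb, visual_emb, temperature=0.1):
    """audio_emb, visual_emb: (B, d) embeddings of the same B speakers, paired by index."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(visual_emb, dim=-1)
    logits = a @ v.t() / temperature                  # (B, B) pairwise similarities
    targets = torch.arange(a.size(0), device=a.device)
    # Matched audio-visual pairs lie on the diagonal; treat retrieval in both
    # directions as a classification problem.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = correspondence_loss(torch.randn(8, 256), torch.randn(8, 256))
```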
Little research focuses on cross-modal correlation learning in which the temporal structures of different data modalities, such as audio and lyrics, are taken into account. Motivated by the inherently temporal structure of music, we aim to learn the deep sequential correlation ...
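A minimal sketch of this idea follows: each modality is summarized by a recurrent encoder that respects its temporal structure, and the two sequence-level embeddings are pushed to agree. The encoders, dimensions, and the simple cosine objective are illustrative assumptions rather than the paper's method.

```python
# Sequential encoders for audio and lyrics with a simple correlation surrogate.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SeqEncoder(nn.Module):
    """Recurrent encoder that keeps the temporal structure of its input sequence."""
    def __init__(self, in_dim, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(in_dim, hidden, batch_first=True)

    def forward(self, x):          # x: (B, T, in_dim)
        _, h = self.rnn(x)
        return h[-1]               # (B, hidden) sequence summary

audio_enc = SeqEncoder(in_dim=64)    # e.g. mel-spectrogram frames
lyric_enc = SeqEncoder(in_dim=300)   # e.g. word embeddings

audio_seq = torch.randn(4, 200, 64)
lyric_seq = torch.randn(4, 50, 300)

a, l = audio_enc(audio_seq), lyric_enc(lyric_seq)
# Correlation surrogate: maximize cosine similarity of paired audio/lyrics clips.
loss = 1 - F.cosine_similarity(a, l).mean()
```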