Audio-Visual Person Recognition: An Evaluation of Data Fusion Strategies - Chibelushi, Mason, et al. - 1997. C.C. Chibelushi, F. Deravi, J.S. Mason, "Audio-visual person recognition: an...
[36] Y. Mroueh, E. Marcheret, and V. Goel. Deep multimodal learning for audio-visual speech recognition. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2130–2134. IEEE, 2015. [37] K. Noda, Y. Yamaguchi, K. Nakadai, H. G. Okuno, an...
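Highlight: This is a multimodal paper from AAAI 2022. The proposed method targets audio-visual speech recognition as well as multimodal synthesis and conversion, that is, the "manipulation" in the title. Compared with conventional methods, its distinguishing feature is a unified multimodal, multi-task model that, once trained, can perform several modality tasks at the same time. During training, the multimodal representation is separated by modality into modality-specific speaker representations...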
IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 9, NO. 2, FEBRUARY 2007. Audio-Visual Affect Recognition. Zhihong Zeng, Jilin Tu, Ming Liu, Thomas S. Huang, Brian Pianfetti, Dan Roth, and Stephen Levinson. Abstract: The ability of a computer to detect and appro...
Audio-visual speech recognition (AVSR) systems are thought to be among the most promising solutions for reliable speech recognition, particularly when the audio is corrupted by noise. However, careful selection of sensory features is crucial for attaining high recognition performance. In the machine...
Multi-modal speaker recognition has received a lot of attention in recent years due to the growing security demands in real applications. In this paper, we present an efficient audio-visual speaker recognition method that fuses face and audio via multi-modal correlated neural networks. Within our...
While we present an audio-visual recognition task as an application of our approach, our framework is flexible and can therefore work with any multimodal dataset, or with any existing deep networks that share the common underlying semantics. In this work-in-progress report, we aim to ...
Support vector machines have also been used in single-modality audio or visual speech recognition, but never in a multimodal audio-visual system. We propose such a hybrid SVM-HMM speech recognizer, and we show how the multimodal approach leads to better performance than that obtained with any of ...
Audio-visual recognition (AVR) has been considered a solution for speech recognition tasks when the audio is corrupted, as well as a visual recognition method for speaker verification in multi-speaker scenarios. The approach of AVR systems is to leverage the extracted information from one...
With this infrastructure, a bimodal big-data emotion recognition system is proposed, where the modalities are speech and face video. Experimental results show that the proposed approach achieves 83.10% emotion recognition accuracy using bimodal inputs. To show the suitability and ...