[19] C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutional two-stream network fusion for video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016. [20] G. Galatas, G. Potamianos, and F. Makedon. Audio-visual speech recognition...
Mason, "Audio-visual person recognition: an evaluation of data fusion strategies", European Conference on Security and Detection, pp. 26-30, 1997. C. C. Chibelushi, J. S. D. Mason, and F. Deravi, "Audio-Visual Person Recognition: An Evaluation of Data Fusion Strategies", Proceedings of the European ...
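Highlight: This is a multimodal paper from AAAI 2022. The proposed method targets audio-visual speech recognition as well as multimodal synthesis and conversion, which is the "manipulation" in the title. Compared with traditional methods, the distinguishing feature of this paper is a unified multimodal multi-task model that, once trained, can perform several modality tasks at the same time. During training, the multimodal representations are separated by modality into modality-related speaker representations...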
Multi-modal speaker recognition has received a lot of attention in recent years due to the growing security demands in real applications. In this paper, we present an efficient audio-visual speaker recognition method by fusing face and audio via multi-modal correlated neural networks. Within our...
Traditional speech recognition systems use Gaussian mixture models to obtain the likelihoods of individual phonemes, which are then used as state emission probabilities in hidden Markov models representing the words. In hybrid systems, the Gaussian mixtures are replaced by more discriminant classifiers, ...
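The GMM-HMM arrangement described in the snippet above can be illustrated with a minimal sketch. It assumes diagonal-covariance mixtures, a tiny hand-specified HMM, and log-space arithmetic; the function names (`log_sum_exp`, `gmm_log_likelihood`, `forward_log_likelihood`) and the data layout are hypothetical choices for illustration, not taken from the cited paper. Each state's GMM supplies the emission log-likelihood of a feature frame, and the forward algorithm combines these with transition probabilities to score a frame sequence.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a list of log-values."""
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))


def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of feature vector x under a diagonal-covariance GMM.

    weights: list of K mixture weights (summing to 1)
    means, variances: K lists, each with one entry per feature dimension
    """
    component_logs = []
    for w, mu, var in zip(weights, means, variances):
        log_p = math.log(w)
        for xi, mi, vi in zip(x, mu, var):
            log_p += -0.5 * (math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi)
        component_logs.append(log_p)
    return log_sum_exp(component_logs)


def forward_log_likelihood(frames, log_init, log_trans, state_gmms):
    """Forward algorithm: log p(frames) for an HMM with GMM emissions.

    log_init[s]: log initial probability of state s
    log_trans[r][s]: log probability of moving from state r to state s
    state_gmms[s]: (weights, means, variances) tuple for state s
    """
    n_states = len(state_gmms)
    # Initialise with the first frame's emission likelihoods.
    alpha = [
        log_init[s] + gmm_log_likelihood(frames[0], *state_gmms[s])
        for s in range(n_states)
    ]
    # Recurse over the remaining frames.
    for x in frames[1:]:
        alpha = [
            log_sum_exp([alpha[r] + log_trans[r][s] for r in range(n_states)])
            + gmm_log_likelihood(x, *state_gmms[s])
            for s in range(n_states)
        ]
    return log_sum_exp(alpha)


# Toy usage: two states with 1-D features, one Gaussian per state.
gmms = [([1.0], [[0.0]], [[1.0]]), ([1.0], [[3.0]], [[1.0]])]
log_half = math.log(0.5)
score = forward_log_likelihood(
    [[0.1], [2.9]],
    [log_half, log_half],
    [[log_half, log_half], [log_half, log_half]],
    gmms,
)
```

In a hybrid system of the kind the snippet goes on to mention, `gmm_log_likelihood` would be the piece swapped out for a more discriminative classifier, while the forward recursion stays the same.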
Individual modality recognition performances indicate that anger and sadness have comparable accuracies for facial and vocal modalities, while happiness seems to be more accurately transmitted by facial expressions than voice. The neutral state has the lowest performance, possibly due to the vague ...
visual-facial affect recognition; multi-criteria decision making. In this paper, we present and discuss a novel approach for the integration of audio-lingual and... M Virvou, GA Tsihrintzis, E Alepis, ... - International Journal on Artificial Intelligence Tools. Cited by: 4. Published: 2012. EMOTION RE...
While we present an audio-visual recognition task as an application of our approach, our framework is flexible and thus can work with any multimodal dataset, or with any already-existing deep networks that share the common underlying semantics. In this work in progress report, we aim to ...
With this proposed infrastructure, a bimodal big-data emotion recognition system is proposed, where the modalities consist of speech and face video. Experimental results show that the proposed approach achieves 83.10% emotion recognition accuracy using bimodal inputs. To show the suitability and ...
Audio-visual recognition (AVR) has been considered as a solution for speech recognition tasks when the audio is corrupted, as well as a visual recognition method used for speaker verification in multi-speaker scenarios. The approach of AVR systems is to leverage the extracted information from one...