Audio-visual speech recognition. The problems of audio-visual speech recognition (AVSR) and lip reading are closely related. Mroueh et al. [36] use a feed-forward deep neural network (DNN) for phoneme classification on a large non-public audio-visual dataset. Combining HMMs with hand-crafted or pre-trained visual features has proven popular: [48] encodes input images with deep bottleneck features (DBF); [20] uses DCT features; [38] classifies phonemes with a pre-trained CNN...
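As a concrete illustration of the hand-crafted visual front-ends mentioned above, the sketch below computes DCT-based mouth-region features of the kind used with HMM back-ends ([20]-style). The function name, coefficient count, and ROI size are assumptions for illustration, not the cited papers' exact setup.

```python
# Sketch of a DCT-based visual front-end for AVSR. Assumes a pre-cropped
# grayscale mouth region (ROI) is available for each video frame.
import numpy as np
from scipy.fft import dctn

def dct_features(mouth_roi: np.ndarray, n_coeffs: int = 8) -> np.ndarray:
    """Return the top-left n_coeffs x n_coeffs block of 2D DCT coefficients,
    flattened into a feature vector (low-frequency mouth-shape information)."""
    coeffs = dctn(mouth_roi.astype(np.float64), norm="ortho")
    return coeffs[:n_coeffs, :n_coeffs].ravel()

# Example: a 64x64 mouth crop yields an 8x8 = 64-dimensional visual feature
# vector per frame, which would then feed an HMM observation model.
frame = np.random.rand(64, 64)
feat = dct_features(frame)
print(feat.shape)  # (64,)
```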
(2) we investigate to what extent lip reading is complementary to audio speech recognition, especially when the audio signal is noisy; (3) we introduce and publicly release a new dataset for audio-visual speech recognition, LRS2-BBC, consisting of thousands of natural sentences from British ...
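Measuring how complementary the visual stream is typically involves corrupting the clean audio at controlled signal-to-noise ratios. Below is a minimal sketch of such additive-noise mixing; the function and the 0 dB example are illustrative assumptions, not the released evaluation protocol.

```python
# Minimal sketch of additive-noise corruption at a target SNR, the usual way
# noisy-audio conditions are simulated when testing how much lip reading helps.
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the clean/noise power ratio equals snr_db, then add it."""
    noise = noise[: len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + scale * noise

# Example: corrupt one second of 16 kHz audio with babble noise at 0 dB SNR.
clean = np.random.randn(16000)
babble = np.random.randn(16000)
noisy = mix_at_snr(clean, babble, snr_db=0.0)
```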
and S. Hayamizu. Audio-visual speech recognition using deep bottleneck features and high-performance lipreading. In 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), pages 575–582. IEEE, 2015. ...
An audio-visual speech recognition (AVSR) system is thought to be one of the most promising solutions for reliable speech recognition, particularly when the audio is corrupted by noise. However, careful selection of sensory features is crucial for attaining high recognition performance. In the ...
Audio-visual speech recognition is a natural and robust approach to improving human-robot interaction in noisy environments. Although multi-stream Dynami... Z. Yue, W. Hui, J. Qiang, International Journal of Advanced Robotic Systems, 2012 (cited by 5). Audio-Visual Speech Recognition Using Co...
The objective of this paper is speaker recognition under noisy and unconstrained conditions. We make two key contributions. First, we introduce a very large-scale audio-visual speaker recognition dataset collected from open-source media. Using a fully automated pipeline, we curate VoxCeleb2 which con...
Based on multimodal deep belief networks, several applications such as audio-visual emotion recognition [25], audio-visual speech recognition [27], and information trustworthiness estimation [100] have been reported. Also, based on multimodal deep Boltzmann machines, several solutions used for human ...
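The cited multimodal models all rest on learning a joint representation over the audio and visual streams. The sketch below illustrates that shared-representation idea with a simple feed-forward fusion module in PyTorch; the class name, feature dimensions, and layer sizes are assumptions, and the original works train deep belief networks / Boltzmann machines rather than this supervised module.

```python
# Illustrative sketch of joint audio-visual representation learning:
# modality-specific encoders feed a shared ("joint") hidden layer.
import torch
import torch.nn as nn

class JointAVRepresentation(nn.Module):
    def __init__(self, audio_dim=40, visual_dim=64, hidden=128, joint=64):
        super().__init__()
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, hidden), nn.ReLU())
        self.visual_enc = nn.Sequential(nn.Linear(visual_dim, hidden), nn.ReLU())
        # The joint layer sees both modalities and learns a shared representation.
        self.joint = nn.Sequential(nn.Linear(2 * hidden, joint), nn.ReLU())

    def forward(self, audio_feat, visual_feat):
        a = self.audio_enc(audio_feat)
        v = self.visual_enc(visual_feat)
        return self.joint(torch.cat([a, v], dim=-1))

# Example: one frame of MFCC-like audio features and DCT-like visual features.
model = JointAVRepresentation()
z = model(torch.randn(1, 40), torch.randn(1, 64))
print(z.shape)  # torch.Size([1, 64])
```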
For instance, we show that we can leverage the speech data to fine-tune the network trained for video recognition, given an initial parallel audio-video dataset covering the same semantics. Our approach learns the analogy-preserving embeddings between the abstract representations learned from ...
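One way to realize such alignment on a parallel audio-video set is a contrastive objective that pulls paired embeddings together and pushes mismatched pairs apart. The sketch below is an illustrative stand-in for that idea, assuming L2-normalized embeddings and a margin hyperparameter; it is not the paper's exact loss.

```python
# Contrastive alignment of paired audio and video embeddings (illustrative).
import torch
import torch.nn.functional as F

def alignment_loss(audio_emb, video_emb, margin=0.5):
    """Pull paired (audio_i, video_i) embeddings together and push
    mismatched pairs at least `margin` apart in cosine distance."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    video_emb = F.normalize(video_emb, dim=-1)
    pos = 1.0 - (audio_emb * video_emb).sum(dim=-1)   # distance of matched pairs
    neg = 1.0 - audio_emb @ video_emb.t()             # all-pairs distances
    mask = ~torch.eye(len(audio_emb), dtype=torch.bool)
    hinge = F.relu(margin - neg[mask])                # penalize close mismatches
    return pos.mean() + hinge.mean()

# Example: a batch of 8 paired 64-dimensional embeddings.
loss = alignment_loss(torch.randn(8, 64), torch.randn(8, 64))
```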
Mitsufuji, “PhaseNet: Discretized Phase Modeling with Deep Neural Networks for Audio Source Separation,” in Interspeech 2018, Sep. 2, 2018. [12] T. Afouras, J. S. Chung, and A. Zisserman, “The Conversation: Deep Audio-Visual Speech Enhancement,” in Interspeech 2018, Sep. 2, 2018....