视听语音识别(Audio-visual speech recognition)视听语音识别(AVSR)和唇读的问题紧密相关。Mroueh等[36]使用前馈深度神经网络(DNN)在大型非公共视听数据集上进行音素分类。事实证明,将HMM与手工制作或预先训练的视觉功能结合使用很普遍——[48]使用DBF编码输入图像;[20]使用DCT;[38]使用经过预训练的CNN对音素进行分类...
and S. Hayamizu. Audio-visual speech recognition using deep bottleneck features and high-performance lipreading. In 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), pages 575–582. IEEE, 2015. ...
(2) we investigate to what extent lip reading is complementary to audio speech recognition, especially when the audio signal is noisy; (3) we introduce and publicly release a new dataset for audio-visual speech recognition, LRS2-BBC, consisting of thousands of natural sentences from British ...
Audio-visual speech recognition (AVSR) system is thought to be one of the most promising solutions for reliable speech recognition, particularly when the audio is corrupted by noise. However, cautious selection of sensory features is crucial for attaining high recognition perfor- mance. In the ...
Audio-visual speech recognition (AVSR) system is thought to be one of the most promising solutions for reliable speech recognition, particularly when the audio is corrupted by noise. However, cautious selection of sensory features is crucial for attainin
This paper develops an Audio-Visual Speech Recognition (AVSR) method, by (1) exploring high-performance visual features, (2) applying audio and visual deep bottleneck features to improve AVSR performance, and (3) investigating effectiveness of voice activity detection in a visual modality. In our...
Audio-visual learning, aimed at exploiting the relationship between audio and visual modalities, has drawn considerable attention since deep learning start
For instance, we show that we can leverage the speech data to fine-tune the network trained for video recognition, given an initial set of audio-video parallel dataset within the same semantics. Our approach learns the analogy-preserving embeddings between the abstract representations learned from ...
The objective of this paper is speaker recognition under noisy and unconstrained conditions. We make two key contributions. First, we introduce a very large-scale audio-visual speaker recognition dataset collected from open-source media. Using a fully automated pipeline, we curate VoxCeleb2 which con...
Audio-Visual Joint Analysis. Recent years witness the rapid growth in audio-visual joint learning tasks such as audio-visual speech recognition [13, 12], learning audio- visual correspondence [3, 4, 6], localization [51, 40], syn- chronization [14, 35, 29], audio to visual generation [...