Keywords: Temporal correspondence, Visual search, Attention. Hearing synchronous sounds may facilitate visual search for concurrently changing visual targets. Evidence for this audiovisual attentional facilitation effect mainly comes from studies using artificial stimuli with relatively simple temporal dynamics, indicating ...
In addition, we investigated whether this correspondence extends to touch, i.e., whether children also match auditory pitch to the spatial motion of touch (audio-tactile condition) and the spatial motion of visual objects to touch (visuo-tactile condition). In two experiments, two different ...
2.3. Audio-Visual Consistency Learning
Owing to the intrinsic structure of a video stream, the audio is naturally paired and synchronized with the visual component, which means that the audio-visual correspondence can be effectively exploited to provide direct supervision ...
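This pairing is the basis of audio-visual correspondence (AVC) objectives. As a minimal sketch, and not the method of any specific paper excerpted here, a contrastive loss can treat the audio and visual streams of the same clip as a positive pair and all other in-batch pairings as negatives; the encoder outputs and the temperature value below are illustrative assumptions:

```python
# Minimal sketch of an audio-visual correspondence (AVC) objective:
# paired audio/visual clips from the same video are positives, all
# other pairings in the batch are negatives (InfoNCE-style contrast).
# The temperature value is an illustrative assumption.
import torch
import torch.nn.functional as F

def avc_contrastive_loss(audio_emb: torch.Tensor,
                         visual_emb: torch.Tensor,
                         temperature: float = 0.07) -> torch.Tensor:
    """audio_emb, visual_emb: (batch, dim) embeddings of synced clips."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(visual_emb, dim=-1)
    logits = a @ v.t() / temperature                     # (batch, batch) similarities
    targets = torch.arange(a.size(0), device=a.device)   # diagonal entries are positives
    # Symmetric cross-entropy: audio-to-video and video-to-audio retrieval.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

Minimizing this loss pulls the audio and visual embeddings of the same clip together while pushing apart mismatched pairs, yielding the direct supervision signal described above without any manual labels.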
... with a predetermined taxonomy of semantic concepts, where there is a high chance of audio-visual correspondence. This severely limits the utility of online videos for self-supervised learning, which raises the question: How can ...
We demonstrate that our approach finds videos with high audio-visual correspondence and show that self-supervised models trained on our data achieve performance competitive with models trained on existing manually curated datasets. The most significant benefit ...
We present Audio-Visual Feature Fusion (AVFF), a two-stage cross-modal learning method that explicitly captures the correspondence between the audio and visual modalities for improved deepfake detection. The first stage pursues representation learning via self-supervision on real videos to capture ...
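As a hypothetical illustration of the second-stage idea, and not the actual AVFF architecture, a lightweight head could fuse the self-supervised audio and visual embeddings and score a clip as real or fake; the layer sizes and concatenation-based fusion are assumptions:

```python
# Hypothetical second-stage sketch: fuse frozen self-supervised audio
# and visual embeddings and classify real vs. fake. Layer sizes and
# fusion by concatenation are assumptions, not the AVFF design.
import torch
import torch.nn as nn

class FusionDeepfakeHead(nn.Module):
    def __init__(self, audio_dim: int = 512, visual_dim: int = 512):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(audio_dim + visual_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),  # one logit per clip; >0 scored as fake by convention
        )

    def forward(self, audio_emb: torch.Tensor, visual_emb: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([audio_emb, visual_emb], dim=-1)  # simple concat fusion
        return self.classifier(fused)
```

The point of such a head is that cross-modal mismatch learned in stage one (e.g., lip motion inconsistent with speech audio) becomes a discriminative feature for the fake/real decision in stage two.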
audio-visual separation and localization, audio-visual correspondence learning, audio-visual generation, and audio-visual representation learning. State-of-the-art methods, as well as the remaining challenges of each subfield, are further discussed. Finally, we summarize the commonly used datasets and ...