In our case, the system must connect the acoustic events in the audio stream to the semantically correlated elements in the visual modality. While learning semantic correlations between images and sounds is easy and intuitive for humans, it becomes a challenging task for ...