They employ self-supervision to train both a visual encoder and an image generator which is part of a GAN conditioned on the representations obtained from the visual encoder. They then train an audio encoder using contrastive loss to align the audio embedding to the anchored visual latent space....