Zero-shot Singing Voice Conversion. Shahan Nercessian. International Symposium/Conference on Music Information Retrieval (ISMIR).
The currently released model supports zero-shot voice conversion 🔊, zero-shot real-time voice conversion 🗣️, and zero-shot singing voice conversion 🎶. Without any training, it can clone a voice given a reference speech clip of 1~30 seconds. We support further fine-tuning on custom ...
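As a rough illustration of the "1~30 second reference" workflow described above, here is a minimal sketch of preparing such a reference clip (mono mixdown, resampling, length clamping) before handing it to whatever conversion model is used. The sample rate, the length bounds as hard limits, and the helper name are assumptions for illustration, not values prescribed by the released model.

```python
import torch
import torchaudio

def prepare_reference(path: str, target_sr: int = 22050,
                      min_sec: float = 1.0, max_sec: float = 30.0) -> torch.Tensor:
    """Hypothetical helper: load a reference recording, resample it, and clamp it
    to a 1~30 second window. Defaults are illustrative, not model-specific."""
    wav, sr = torchaudio.load(path)              # (channels, samples)
    wav = wav.mean(dim=0, keepdim=True)          # mix down to mono
    if sr != target_sr:
        wav = torchaudio.functional.resample(wav, sr, target_sr)
    if wav.shape[-1] < int(min_sec * target_sr):
        raise ValueError("reference clip is shorter than 1 second")
    return wav[..., : int(max_sec * target_sr)]  # truncate overly long references
```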
Zero-shot means that at inference time the proposed model needs only one utterance each from the source speaker and the reference speaker to perform voice conversion, even when a speaker does not appear in the training set (a so-called unseen speaker). Dataset: We used the CSTR VCTK ...
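The one-utterance-each definition above maps to a simple inference interface: a content encoder on the source utterance, a speaker encoder on the single reference utterance, and a decoder that combines the two. The sketch below is a generic, hypothetical interface illustrating that split, not the architecture of any specific cited system.

```python
import torch
import torch.nn as nn

class ZeroShotVC(nn.Module):
    """Hypothetical zero-shot VC interface: one source utterance supplies the
    content, one reference utterance supplies the (possibly unseen) speaker."""
    def __init__(self, content_encoder: nn.Module,
                 speaker_encoder: nn.Module, decoder: nn.Module):
        super().__init__()
        self.content_encoder = content_encoder   # speaker-independent content features
        self.speaker_encoder = speaker_encoder   # speaker embedding from a single utterance
        self.decoder = decoder                   # reconstructs audio/mel from both

    @torch.no_grad()
    def convert(self, source_wav: torch.Tensor, reference_wav: torch.Tensor) -> torch.Tensor:
        content = self.content_encoder(source_wav)     # what is being said or sung
        speaker = self.speaker_encoder(reference_wav)  # who should say or sing it
        return self.decoder(content, speaker)          # converted utterance
```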
In the field of singing voice conversion, SoVitsSvc is not the only project; there are many others, which will not be listed here. The project has been officially discontinued and archived, but other enthusiasts have created their own forks ...
DiffSinger achieved state-of-the-art performance on the SVS task by generating high-quality singing voices with strong adaptation performance. DDDM-VC [10] significantly improved speech representation disentanglement [3] and voice conversion performance with a disentangled denoising diffusion model and ...
GitHub - Plachtaa/seed-vc: State-of-the-art zero-shot voice conversion & singing voice conversion with in-context learning
Non-parallel many-to-many voice conversion and zero-shot voice conversion remain under-explored areas. Deep style transfer algorithms, such as generative adversarial networks (GANs) and conditional variational autoencoders (CVAEs), are being applied as new solutions in this field. However, ...
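To make the CVAE idea mentioned in that excerpt concrete, the following is a minimal, generic sketch: the encoder produces a (ideally speaker-agnostic) latent, the decoder is conditioned on a speaker embedding, and swapping the speaker ID at inference performs the conversion. Layer sizes, dimensions, and the class itself are illustrative assumptions, not any particular published model.

```python
import torch
import torch.nn as nn

class ConditionalVAE(nn.Module):
    """Generic CVAE sketch for non-parallel VC (illustrative sizes only)."""
    def __init__(self, feat_dim=80, latent_dim=64, n_speakers=100, spk_dim=32):
        super().__init__()
        self.spk_emb = nn.Embedding(n_speakers, spk_dim)
        self.encoder = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, latent_dim)
        self.to_logvar = nn.Linear(256, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + spk_dim, 256), nn.ReLU(), nn.Linear(256, feat_dim))

    def forward(self, mel: torch.Tensor, speaker_id: torch.Tensor):
        # mel: (batch, frames, feat_dim); speaker_id: (batch,)
        h = self.encoder(mel)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization
        spk = self.spk_emb(speaker_id).unsqueeze(1).expand(-1, z.size(1), -1)
        recon = self.decoder(torch.cat([z, spk], dim=-1))
        return recon, mu, logvar   # reconstruction plus terms for the KL loss
```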
We further extend our approach to zero-shot singing voice conversion by incorporating fundamental frequency (F0) conditioning, resulting in performance comparable to current state-of-the-art methods. Our findings highlight the effectiveness of Seed-VC in overcoming core challenges, paving the way for...
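F0 conditioning as mentioned in that abstract generally requires extracting a pitch contour from the source singing and, in many practical pipelines, rescaling it toward the target singer's register before feeding it to the model. The sketch below uses librosa's pYIN tracker; the median-ratio shift is a common heuristic assumed here, not necessarily the conditioning scheme of the cited work.

```python
import numpy as np
import librosa

def shifted_f0(source_wav: np.ndarray, reference_wav: np.ndarray, sr: int = 22050) -> np.ndarray:
    """Extract the source F0 contour with pYIN and rescale it toward the
    reference singer's median pitch (heuristic, for illustration only)."""
    f0_src, _, _ = librosa.pyin(source_wav, fmin=librosa.note_to_hz("C2"),
                                fmax=librosa.note_to_hz("C6"), sr=sr)
    f0_ref, _, _ = librosa.pyin(reference_wav, fmin=librosa.note_to_hz("C2"),
                                fmax=librosa.note_to_hz("C6"), sr=sr)
    src_med = np.nanmedian(f0_src)        # NaN frames are unvoiced and ignored
    ref_med = np.nanmedian(f0_ref)
    return f0_src * (ref_med / src_med)   # contour preserved, register moved to the target
```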
voice prompts. Additionally, (3) latent mixup is used to enhance in-context learning by combining various speaker features. This approach improves speaker similarity and naturalness in zero-shot VC by applying mixup to latent representations. Experimental results demonstrate that VoicePrompter outperforms...
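The latent mixup described in that excerpt amounts to interpolating between latent speaker representations during training. Below is a minimal sketch of mixup applied to two such latents; the Beta-distributed mixing coefficient follows standard mixup practice and is an assumption here, not necessarily VoicePrompter's exact formulation.

```python
import torch

def latent_mixup(z_a: torch.Tensor, z_b: torch.Tensor, alpha: float = 0.4) -> torch.Tensor:
    """Mixup on latent speaker representations:
    z_mix = lam * z_a + (1 - lam) * z_b, with lam ~ Beta(alpha, alpha)."""
    lam = torch.distributions.Beta(alpha, alpha).sample().to(z_a.device)
    return lam * z_a + (1.0 - lam) * z_b
```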
Keywords: zero-shot; VQ-VAE; connectionist temporal classification. Vector quantized variational autoencoder (VQ-VAE) has recently become an increasingly popular method in non-parallel zero-shot voice conversion (VC). The reason is that VQ-VAE is capable of disentangling the content and the speaker ...
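The disentangling behaviour credited to VQ-VAE above comes from its discrete bottleneck: each content frame is snapped to the nearest codebook entry, which tends to keep phonetic content and discard speaker identity. Here is a minimal, generic sketch of that quantization step with a straight-through gradient; it is not the full VQ-VAE (codebook and commitment losses are omitted) nor any specific paper's implementation.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Minimal VQ bottleneck: nearest-codebook lookup with a straight-through
    estimator so gradients flow back to the encoder."""
    def __init__(self, num_codes: int = 512, dim: int = 64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z: torch.Tensor):
        # z: (batch, frames, dim)
        flat = z.reshape(-1, z.size(-1))
        dists = torch.cdist(flat, self.codebook.weight)        # (batch*frames, num_codes)
        indices = dists.argmin(dim=-1).reshape(z.shape[:-1])   # discrete content codes
        quantized = self.codebook(indices)
        # straight-through: forward uses the codes, backward copies gradients past them
        quantized = z + (quantized - z).detach()
        return quantized, indices
```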