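SKQVC: One-Shot Voice Conversion by K-Means Quantization with Self-Supervised Speech Representations. 2024.11.26. Keywords: VC, self-supervised learning, disentanglement, quantization. Publisher: Pohang University of Science and Technology (POSTECH), Korea. Demo page: Demo. Quick read: the method repeatedly applies residuals and bottleneck layers to achieve disentanglement, separating speech content, audio details, and speaker characteristics. Abstract: One-shot voice conversion (VC)...
One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization. Paper link. Related code. Source: National Taiwan University, Prof. Hung-yi Lee's group. Abstract: Much prior work on timbre conversion has relied on parallel corpora and can already convert one voice timbre into many other...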
Recently, one-shot voice conversion has gradually become a hot topic for its potentially wide range of applications, as it can convert the voice of any source speaker to that of any target speaker, even when both the source speaker and the target speaker are unseen during ...
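Voice conversion (VC) aims to preserve the linguistic features of speech while converting its non-linguistic features, which include accent, timbre, speaking style, and so on. This paper targets timbre conversion. Research on VC falls roughly into two categories: supervised learning methods based on parallel data, and unsupervised methods based on non-parallel data (the approach taken in this paper). The advantage of the supervised approach is that, with a sufficient corpus...
Voice conversion (VC) is a technique that converts a source speaker's voice into a target style, such as speaker identity [1], prosody [2], or emotion [3], while keeping the linguistic content unchanged. In this paper, we focus on speaker identity conversion under the one-shot condition, that is, with only a single utterance from the target speaker given as reference. A typical one-shot VC method disentangles content information and speaker infor... from the source and target speech, respectively...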
This paper proposes a novel one-shot voice conversion (VC) method called DS-ESR-StyleGAN-VC, which introduces several innovations to address the challenges faced by StarGAN-VC. First, we adopt an ESR network in the generator to extract deep features, effectively solving the problem of semantic co...
In this work, we propose a variant of STARGAN for many-to-many voice conversion (VC) conditioned on the d-vectors for short-duration (2-15 seconds) speech. We make several modifications to the STARGAN training and employ new network architectures. We employ a transformer encoder in the ...
Recently, voice conversion (VC) without parallel data has been successfully adapted to the multi-target scenario, in which a single model is trained to convert the input voice to many different speakers. However, such a model suffers from the limitation that it can only convert the voice to the speaker...
Voice conversion (VC) is a task that transforms the source speaker's timbre, accent, and tones in audio into another speaker's while preserving the linguistic content. It remains a challenging task, especially in a one-shot setting. Auto-encoder-based VC methods disentangle the speaker and the...
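Several entries above describe disentangling speaker and content information, and one names instance normalization as the mechanism. The sketch below illustrates only that basic idea on toy numbers. It assumes a feature map stored as a plain list of channels, each a list of frame values. The function names `instance_norm` and `adaptive_instance_norm` and all input values are hypothetical, and in a real model these operations would sit inside learned encoder and decoder layers rather than act on raw features.

```python
import math


def instance_norm(features, eps=1e-5):
    """Normalize each channel of a (channels x frames) feature map over time.

    Returns (normalized, means, stds). The per-channel mean and standard
    deviation are utterance-level statistics; subtracting and dividing them
    out removes global, time-invariant information from each channel.
    """
    normalized, means, stds = [], [], []
    for channel in features:
        mean = sum(channel) / len(channel)
        var = sum((x - mean) ** 2 for x in channel) / len(channel)
        std = math.sqrt(var + eps)
        normalized.append([(x - mean) / std for x in channel])
        means.append(mean)
        stds.append(std)
    return normalized, means, stds


def adaptive_instance_norm(content, target_means, target_stds):
    """Re-apply per-channel statistics taken from a target utterance."""
    return [
        [x * std + mean for x in channel]
        for channel, mean, std in zip(content, target_means, target_stds)
    ]


# Toy feature maps (hypothetical values): 2 channels x 4 frames each.
source = [[1.0, 2.0, 3.0, 4.0], [10.0, 10.5, 11.0, 11.5]]
target = [[-2.0, 0.0, 2.0, 4.0], [0.0, 1.0, 0.0, 1.0]]

# Strip the source utterance's global statistics, keep its temporal pattern.
content, _, _ = instance_norm(source)

# Take global statistics from the single target utterance (the "one shot").
_, t_means, t_stds = instance_norm(target)

# Combine: source temporal pattern carrying the target's channel statistics.
converted = adaptive_instance_norm(content, t_means, t_stds)
```

After this step each channel of `converted` keeps the frame-to-frame shape of `source` but has the mean and spread of the corresponding `target` channel, which is the intuition behind treating channel statistics as a stand-in for speaker identity.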