The convolved feature map is then fed to the 5-layer fully connected network, which outputs a score in the range [0, 1] that measures the style similarity of the two input voice segments.

Figure 4 Discriminator configuration.

VSUGAN training

Data preparation

The data set ...