Although the JS divergence effectively solves the asymmetry problem of the KL divergence, when the two distributions do not overlap at all their JS divergence is a constant ($\log 2$), so the gradient is 0, which makes it impossible to update the model. As shown in Equation (38), the Wasserstein ...
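To make the vanishing-gradient argument concrete, the following short derivation (a sketch using the standard definitions of the JS and KL divergences, with the mixture $M = \frac{1}{2}(P_r + P_g)$ assumed rather than quoted from the surrounding equations) shows why the JS divergence is constant for non-overlapping distributions:

\[
\mathrm{JS}(P_r \,\|\, P_g) = \tfrac{1}{2}\,\mathrm{KL}(P_r \,\|\, M) + \tfrac{1}{2}\,\mathrm{KL}(P_g \,\|\, M), \qquad M = \tfrac{1}{2}(P_r + P_g).
\]

If the supports of $P_r$ and $P_g$ are disjoint, then on the support of $P_r$ we have $M = \tfrac{1}{2}P_r$, so

\[
\mathrm{KL}(P_r \,\|\, M) = \int P_r(x)\,\log\frac{P_r(x)}{P_r(x)/2}\,dx = \log 2,
\]

and symmetrically $\mathrm{KL}(P_g \,\|\, M) = \log 2$. Hence $\mathrm{JS}(P_r \,\|\, P_g) = \log 2$ no matter how far apart the two supports are, so its gradient with respect to the generator's parameters is zero everywhere in this regime.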