Knowledge Distillation (KD) aims at improving the performance of a low-capacity student model by inheriting knowledge from a high-capacity teacher model. Previous KD methods typically train a student by minimizing a task-related loss and the KD loss simultaneously, using a pre-defined loss
...