The work of [26] offers both theoretical and empirical justifications for using the focal loss [18] to calibrate networks. In particular, it shows that the focal loss implicitly minimizes the Kullback-Leibler (KL) divergence between a uniform distributi...