TextBoxes++ [5] adopts irregular 1 × 5 convolutional filters instead of the standard 3 × 3 filters and leverages recognition results to refine the detection results. ITN [6], E2E-MLT [11] and FOTS [18] are end-to-end text instance networks. Liu et al. applied feature pyramid ...