A common limitation of CNN-based lightweight architectures is their limited capacity to extract global features. To obtain global information quickly, ViT [25] brings transformer models, originally developed for natural language processing tasks, to the vision domain, particularly image ...