to_2tuple, trunc_normal_
from timm.models.registry import register_model
from timm.models.vision_transformer import default_cfgs, _cfg

__all__ = ['ceit_tiny_patch16_224', 'ceit_small_patch16_224', 'ceit_base_patch16_224', 'ceit_tiny_patch16_384', 'ceit_small_patch16_384...
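Transformers are good at learning long-range dependencies, while convolutions are good at capturing local features. Combining the two, the paper makes three improvements: a new tokenization scheme (image2token); an improved encoder network (Locally-enhanced Feed-Forward, LeFF); and a layer-wise class token attention layer added after all the transformer layers to obtain a better whole-image embedding.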
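To make the LeFF idea more concrete, here is a minimal PyTorch sketch of a locally-enhanced feed-forward block. It is an illustration only, not the repository's implementation: the class name `LeFFSketch`, the default sizes, and the choice of GELU without normalization are assumptions made for this example. It also assumes the first token is the class token and that the remaining patch tokens form a square grid.

```python
import math

import torch
import torch.nn as nn


class LeFFSketch(nn.Module):
    """Hypothetical sketch of a locally-enhanced feed-forward block."""

    def __init__(self, dim=192, expand_ratio=4, kernel_size=3):
        super().__init__()
        hidden = dim * expand_ratio
        # Token-wise expansion, as in a standard transformer MLP.
        self.expand = nn.Sequential(nn.Linear(dim, hidden), nn.GELU())
        # Depthwise conv (groups == channels) mixes neighbouring patches.
        # kernel_size is assumed odd so that padding keeps the grid size.
        self.dwconv = nn.Sequential(
            nn.Conv2d(hidden, hidden, kernel_size,
                      padding=kernel_size // 2, groups=hidden),
            nn.GELU(),
        )
        # Token-wise projection back to the embedding dimension.
        self.project = nn.Linear(hidden, dim)

    def forward(self, x):
        # x: (batch, 1 + num_patches, dim); token 0 is the class token.
        cls_token, patches = x[:, :1], x[:, 1:]
        b, n, _ = patches.shape
        side = math.isqrt(n)
        assert side * side == n, "patch tokens must form a square grid"

        h = self.expand(patches)                          # (b, n, hidden)
        h = h.transpose(1, 2).reshape(b, -1, side, side)  # back to a 2D map
        h = self.dwconv(h)                                # local mixing
        h = h.flatten(2).transpose(1, 2)                  # (b, n, hidden)
        h = self.project(h)                               # (b, n, dim)

        # The class token skips the convolutional path unchanged.
        return torch.cat([cls_token, h], dim=1)
```

The only difference from a plain transformer feed-forward layer is the depthwise convolution between the two linear layers, which is where the local (convolutional) inductive bias enters; the input and output shapes are identical, so the block can stand in for the usual MLP.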