Each colored square represents a 16 × 16 patch token encoded by UNI, with heatmap color corresponding to the attention weight of that patch token to the global [CLS] token of the penultimate layer in UNI. We show MHSA visualizations for resized and center-cropped ROIs at 224², 448²...
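As a quick check on the size of the attention heatmap grid, the sketch below counts the 16 × 16 patch tokens in a square ROI. It assumes ROI side lengths of 224 and 448 pixels, non-overlapping patches, and a side length divisible by the patch size. The function name `patch_grid` is hypothetical and is not taken from UNI's code.

```python
def patch_grid(side, patch=16):
    """Return (tokens per side, total patch tokens) for a square image.

    Assumes non-overlapping patches, so the side must divide evenly.
    """
    if side % patch != 0:
        raise ValueError("side must be divisible by the patch size")
    per_side = side // patch
    return per_side, per_side * per_side


for side in (224, 448):
    per_side, total = patch_grid(side)
    print(f"{side} x {side} ROI: {per_side} x {per_side} grid, {total} patch tokens")
```

Under these assumptions, doubling the ROI side length quadruples the number of patch tokens, and therefore the number of colored squares in the heatmap.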
Specifically, the size of convolution kernels is set as 3×3, the number of convolution kernels increases layer by layer according to [128, 256, 512, 1024], and the size of kernels in max pooling layers is set as 2×2. The linear classification module is ...
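To make the layer widths concrete, the following sketch traces the feature-map size and the parameter count of each convolution layer. Several details are assumptions and are not stated in the excerpt: a 3-channel 64 × 64 input, stride-1 convolutions with padding that preserves spatial size, one 2×2 max pooling layer with stride 2 after each convolution, and a bias term per kernel. The helper names `conv_params` and `trace_backbone` are hypothetical.

```python
def conv_params(c_in, c_out, k=3):
    """Weights plus biases of one k x k convolution layer."""
    return k * k * c_in * c_out + c_out


def trace_backbone(in_channels=3, side=64, widths=(128, 256, 512, 1024)):
    """Return (out_channels, spatial side after pooling, conv params) per layer.

    Assumes size-preserving convolutions, each followed by 2x2 max pooling
    with stride 2 (an assumption, not confirmed by the source).
    """
    rows = []
    c_in = in_channels
    for c_out in widths:
        params = conv_params(c_in, c_out)
        side //= 2  # 2x2 pooling with stride 2 halves width and height
        rows.append((c_out, side, params))
        c_in = c_out
    return rows


for c_out, side, params in trace_backbone():
    print(f"{c_out:>4} kernels -> {side:>2} x {side:<2} map, {params:,} parameters")
```

With these assumed settings, most of the parameters sit in the last convolution layer, because the count grows with the product of input and output channels.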
Pooling layer. It performs a subsampling operation along the spatial dimensions (width and height). Fully Connected (FC) layer. This layer computes the score for each class. Unlike all other layers, the neurons in this layer are connected to all of the previous layers’ neuro...
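The subsampling step can be illustrated with a small pure-Python sketch. It assumes max pooling with a square window and a stride equal to the window size, which is one common choice; the excerpt does not say which pooling variant is used. The function name `max_pool2d` is chosen here for illustration.

```python
def max_pool2d(x, k=2):
    """Max-pool a 2D list with a k x k window and stride k.

    Rows and columns that do not fill a complete window are dropped.
    """
    h, w = len(x), len(x[0])
    return [
        [
            max(x[i + di][j + dj] for di in range(k) for dj in range(k))
            for j in range(0, w - k + 1, k)
        ]
        for i in range(0, h - k + 1, k)
    ]


feature_map = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
pooled = max_pool2d(feature_map)
print(pooled)  # width and height are both halved
```

Each output value summarizes one 2 × 2 window, so a 4 × 4 input becomes a 2 × 2 output.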
Our network ends with a global average pooling (GAP) layer and a fully connected (FC) layer, followed by a softmax layer with four classes. Additionally, two types of shortcut connections are inserted, where a solid line denotes an identity shortcut and a dotted line denotes that a 1 ×...
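The classification head described here (GAP, then FC, then softmax over four classes) can be sketched in plain Python as follows. The feature maps, weights, and biases are made-up toy values for illustration only, and the function names are hypothetical; this is not the authors' implementation.

```python
import math


def global_average_pool(fmap):
    """Average each channel (a 2D list) down to a single value."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmap]


def fully_connected(x, weights, bias):
    """One output per class: dot product of a weight row with x, plus bias."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


# Toy input: 2 channels of size 2 x 2 (illustrative values only).
fmap = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [0.0, 4.0]],
]
# Toy FC parameters: 4 classes x 2 pooled features.
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]
bias = [0.0, 0.0, 0.0, 0.0]

pooled = global_average_pool(fmap)
logits = fully_connected(pooled, weights, bias)
probs = softmax(logits)
predicted = probs.index(max(probs))
print(pooled, logits, predicted)
```

Because GAP reduces each channel to one number, the FC layer needs only one weight per channel per class, regardless of the spatial size of the final feature maps.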