The final fully connected layer projects the feature channels output by the transformer encoder to a dimension equal to the number of pixels in each masked image patch. The predicted pixel values of each masked patch are then filled back into the patch's original position, thus completing the ...
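
The projection and fill-back steps described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `predict_and_fill`, the row-major patch indexing, the random weights, and the choice of output width (`patch * patch * channels`) are all assumptions made for illustration.

```python
import numpy as np

def predict_and_fill(tokens, weight, bias, image, mask_idx, patch):
    """Project encoder tokens to pixel values and write them into the image.

    tokens   : (num_masked, dim) encoder outputs for the masked patches
    weight   : (dim, patch * patch * channels) fully connected weights (assumed shape)
    bias     : (patch * patch * channels,) fully connected bias
    image    : (H, W, channels) image whose masked patches are to be filled
    mask_idx : row-major indices of the masked patches (assumed convention)
    """
    height, width, channels = image.shape
    patches_per_row = width // patch

    # Fully connected layer: feature channels -> pixels of one patch.
    pixels = tokens @ weight + bias

    restored = image.copy()
    for vec, idx in zip(pixels, mask_idx):
        row, col = divmod(int(idx), patches_per_row)
        # Fill the predicted patch back into its original position.
        restored[row * patch:(row + 1) * patch,
                 col * patch:(col + 1) * patch, :] = vec.reshape(patch, patch, channels)
    return restored

# Toy usage: an 8x8 RGB image split into four 4x4 patches, two of them masked.
rng = np.random.default_rng(0)
dim, patch, channels = 8, 4, 3
image = np.zeros((8, 8, channels))
mask_idx = [1, 2]
tokens = rng.normal(size=(len(mask_idx), dim))
weight = rng.normal(size=(dim, patch * patch * channels))
bias = np.zeros(patch * patch * channels)
restored = predict_and_fill(tokens, weight, bias, image, mask_idx, patch)
```

In this toy setup, patches 0 and 3 are left untouched, while patches 1 and 2 receive the projected values; in a trained model the weights would be learned rather than random.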