The aggregation is done in the image plane and consists of a convolution followed by a max-pool, which allows it to pass information across the boundary. You can use it with the following code (e.g. NesT-T):

import torch
from vit_pytorch.nest import NesT

nest = NesT( image_size = 224, patch_size...
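To make the aggregation step concrete, here is a dependency-free toy sketch of "convolution, then max-pool" on a single-channel image plane. It is not the library's implementation. The 3x3 box filter stands in for a learned convolution, and the kernel size, stride, and padding are assumptions chosen for illustration, as are the function names `conv3x3_mean`, `maxpool3x3_stride2`, and `aggregate`. The point it shows is that a value sitting just inside one block becomes visible in the pooled output of the neighbouring block.

```python
def conv3x3_mean(img):
    """3x3 box filter with zero padding (stand-in for a learned convolution)."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            total = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    y, x = i + di, j + dj
                    if 0 <= y < h and 0 <= x < w:
                        total += img[y][x]
            out[i][j] = total / 9.0
    return out


def maxpool3x3_stride2(img):
    """3x3 max-pool with stride 2; out-of-range positions are ignored."""
    h, w = len(img), len(img[0])
    oh, ow = (h + 1) // 2, (w + 1) // 2
    out = [[0.0] * ow for _ in range(oh)]
    for i in range(oh):
        for j in range(ow):
            vals = []
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    y, x = 2 * i + di, 2 * j + dj
                    if 0 <= y < h and 0 <= x < w:
                        vals.append(img[y][x])
            out[i][j] = max(vals)
    return out


def aggregate(img):
    """Convolution followed by max-pool, halving the spatial resolution."""
    return maxpool3x3_stride2(conv3x3_mean(img))


# 8x8 plane split into a left block (columns 0-3) and a right block (columns 4-7).
# The only non-zero pixel sits in the right block, next to the boundary.
img = [[0.0] * 8 for _ in range(8)]
img[3][4] = 1.0

pooled_only = maxpool3x3_stride2(img)
out = aggregate(img)

# Output cell [1][1] pools input columns 1-3, i.e. only the left block.
print("pool only :", pooled_only[1][1])
print("conv+pool :", out[1][1])
```

In this toy setup the pooled cell that covers only left-block pixels stays at zero without the convolution and becomes non-zero with it, which is the sense in which the convolution carries information across the boundary before the resolution is reduced.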
The advantage of convolution is that we can use methods such as maximum-likelihood estimation with backpropagation to learn its parameters, which leads to meaningful representations that are akin to memories (just as we do in convolutional nets). This is exactly akin to the LNP model with its convolut...
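As a rough illustration of learning convolution parameters by maximum likelihood, the sketch below fits a short causal filter in an LNP-style model. It assumes LNP means linear-nonlinear-Poisson, with an exponential nonlinearity and a three-tap filter; these choices, the noise-free targets, and all names (`drive`, `mean_nll`, `mean_grad`, `k_true`, `k_hat`) are hypothetical and not taken from the text. The gradient is written out by hand to keep the example dependency-free; backpropagation in an autodiff framework would compute the same quantity automatically.

```python
import math
import random


def drive(kernel, x, t):
    """Causal convolution output at time t: sum_j kernel[j] * x[t - j]."""
    return sum(kj * x[t - j] for j, kj in enumerate(kernel))


def mean_nll(kernel, x, y):
    """Mean Poisson negative log-likelihood, dropping terms constant in kernel."""
    ts = range(len(kernel) - 1, len(x))
    total = 0.0
    for t in ts:
        d = drive(kernel, x, t)
        total += math.exp(d) - y[t] * d  # rate - count * log(rate)
    return total / len(ts)


def mean_grad(kernel, x, y):
    """Gradient of mean_nll with respect to each filter tap."""
    n = len(kernel)
    ts = range(n - 1, len(x))
    g = [0.0] * n
    for t in ts:
        err = math.exp(drive(kernel, x, t)) - y[t]  # predicted rate - observed
        for j in range(n):
            g[j] += err * x[t - j]
    return [gj / len(ts) for gj in g]


random.seed(0)
x = [random.uniform(-1.0, 1.0) for _ in range(300)]  # stimulus
k_true = [0.5, -0.3, 0.2]                            # hypothetical filter

# Assumption: use the expected counts as targets so the run is deterministic.
y = [0.0, 0.0] + [math.exp(drive(k_true, x, t)) for t in range(2, len(x))]

k_hat = [0.0, 0.0, 0.0]
nll_start = mean_nll(k_hat, x, y)
for _ in range(500):
    g = mean_grad(k_hat, x, y)
    k_hat = [k - 0.5 * gj for k, gj in zip(k_hat, g)]  # plain gradient descent
nll_end = mean_nll(k_hat, x, y)

print("estimated filter:", [round(k, 3) for k in k_hat])
```

Because the Poisson log-likelihood with an exponential nonlinearity is convex in the filter taps, plain gradient descent is enough here; with real spike counts the targets would be noisy and the estimate would only approximate the generating filter.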