It is possible that the network globalizes information in the last layer to ensure that the CLS token has access to the entire image. However, because the transformer treats the CLS token the same as every other patch token, this globalization appears to happen across all tokens rather than being targeted at the CLS token alone.
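A minimal NumPy sketch of this point: in standard ViT-style self-attention, the CLS token is simply prepended to the patch sequence and processed by the same attention rule as every other token, so nothing singles it out architecturally. All sizes and the identity Q/K/V projections below are illustrative assumptions, not details from the source.

```python
import numpy as np

# Hypothetical sizes for illustration (not from the source).
num_patches, dim = 4, 8
rng = np.random.default_rng(0)

# Patch tokens plus a CLS token prepended at position 0.
patches = rng.standard_normal((num_patches, dim))
cls = rng.standard_normal((1, dim))
tokens = np.concatenate([cls, patches], axis=0)  # (1 + num_patches, dim)

# Single-head self-attention with identity projections for simplicity:
# every token, CLS included, attends to all tokens with the same rule.
scores = tokens @ tokens.T / np.sqrt(dim)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ tokens

# Row 0 (the CLS output) is computed exactly like every patch row,
# so global information reaches it only through the attention weights.
print(out.shape)  # (5, 8)
```

Because the CLS row is produced by the same computation as the patch rows, any layer that globalizes information for the CLS token necessarily globalizes it for all tokens.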