(3) We propose a multi-head cross-sinusoidal threshold attention mechanism that combines convolution kernel spectra with spatial patch tokens and uses sine functions to bound the magnitude of the dot products of Q, K, and V. This ensures that the attention output values fall within the effective range of the...
The RF can be calculated by the following formula:

$$rf = k + (k - 1)(r - 1) \tag{1}$$

where $k$ represents the filter size and $r$ is the dilation rate. Compared with purely convolutional networks, dilated convolution can capture richer information....
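As a quick numerical check of Eq. (1), the following minimal Python sketch evaluates the receptive field of a single dilated filter for a few dilation rates. The function name `dilated_receptive_field` and the example values of $k$ and $r$ are illustrative assumptions, not taken from the paper.

```python
def dilated_receptive_field(k: int, r: int) -> int:
    """Receptive field of one dilated filter, per Eq. (1): rf = k + (k - 1) * (r - 1).

    k: filter size (number of taps along one axis)
    r: dilation rate (r = 1 is an ordinary, undilated convolution)
    """
    if k < 1 or r < 1:
        raise ValueError("k and r must both be positive integers")
    return k + (k - 1) * (r - 1)


# Illustrative values only: a 3-tap filter at increasing dilation rates.
for rate in (1, 2, 4):
    print(f"k=3, r={rate}: rf={dilated_receptive_field(3, rate)}")
# k=3, r=1: rf=3
# k=3, r=2: rf=5
# k=3, r=4: rf=9
```

With $r = 1$ the formula reduces to $rf = k$, the ordinary convolution case. Increasing $r$ widens the span covered by the filter without adding parameters, since the number of taps stays at $k$.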