The vanilla feed-forward network consists of two linear transformation layers, where the hidden dimension D′ between two layers is expanded to learn a richer feature representation. In contrast, we intro- duce a depthwise 3D convolution (with BN and ...