Although some methods have been proposed to address this problem by simplifying the computation of standard softmax self-attention, such as sparse attention and low-rank approximation, they usually come at the cost of reduced accuracy and offer only limited speedup. In this work, the authors save memory access cost by reducing memory-inefficient layers. Recent studies show that memory-inefficient operations are located mainly in MHSA rather than in FFN layers. However, ...
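To make this distinction concrete, the following is a minimal sketch (generic PyTorch with illustrative sizes N, d, and head count; not the paper's implementation) of why the memory-inefficient intermediate lives in MHSA: standard softmax attention materialises an N × N attention map per head, while the FFN's intermediate activation grows only linearly with N.

```python
# Illustrative comparison of intermediate tensor sizes in MHSA vs. FFN.
# Shapes and hyperparameters are assumptions for the sketch, not from the paper.
import torch

N, d, heads, ffn_ratio = 1024, 256, 8, 4
x = torch.randn(1, N, d)

# Standard softmax self-attention: the softmax input/output has shape
# (heads, N, N), i.e. memory grows quadratically with sequence length N.
q = k = v = x.view(1, N, heads, d // heads).transpose(1, 2)
attn = torch.softmax(q @ k.transpose(-2, -1) / (d // heads) ** 0.5, dim=-1)
out = (attn @ v).transpose(1, 2).reshape(1, N, d)

# FFN: the intermediate activation has shape (N, ffn_ratio * d), linear in N.
ffn_hidden = torch.relu(x @ torch.randn(d, ffn_ratio * d))

print("attention map elements:", attn.numel())       # heads * N * N
print("FFN hidden elements:   ", ffn_hidden.numel())  # N * ffn_ratio * d
```

With these assumed sizes the attention maps already hold 8 × 1024² elements, roughly 8× more than the FFN hidden state, which is why restructuring MHSA is the main lever for reducing memory access cost.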
Several efficient transformers have been proposed recently, and they fall into two camps: 1) efficient self-attention and 2) efficient architecture design. Efficient self-attention methods reduce the cost of softmax attention via sparse attention [34, 57, 61, 75] or...
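As a concrete instance of the first camp, below is a hedged sketch of local windowed sparse attention, in which each query attends only to a fixed-size neighbourhood so the attention map shrinks from N × N to N × w. The function name, window size, and tensor shapes are illustrative assumptions, not any specific cited method.

```python
# Sketch of local (windowed) sparse attention; illustrative only.
import torch
import torch.nn.functional as F

def local_window_attention(q, k, v, window: int):
    """q, k, v: (N, d). Each query i attends to keys in [i-window, i+window]."""
    N, d = q.shape
    out = torch.empty_like(q)
    for i in range(N):
        lo, hi = max(0, i - window), min(N, i + window + 1)
        scores = (q[i] @ k[lo:hi].T) / d ** 0.5          # (hi - lo,)
        out[i] = F.softmax(scores, dim=-1) @ v[lo:hi]    # (d,)
    return out

N, d = 512, 64
q, k, v = (torch.randn(N, d) for _ in range(3))
y = local_window_attention(q, k, v, window=16)
print(y.shape)  # torch.Size([512, 64])
```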
The various attention module designs are denoted by QKV and KVQ as defined in Section 3.2. The relatively low mIoU of 60.3 may be due to the reduced input and batch size needed to compensate for the high memory consumption observed when SBP and QKV are combined in the same model. ...
For example, BitFit [52] trains only the bias terms of the network, whereas other works finetune only parameters of the attention or MLP layers [16, 51]. Another common theme is to reparameterise learned parameters, for example as the product of low-rank matrices [...
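The two themes can be sketched as follows (a hedged illustration on a generic nn.Linear backbone; the LowRankLinear class, the rank, and the sizes are hypothetical and not taken from the cited works): bias-only fine-tuning in the style of BitFit, and a low-rank reparameterisation in which a frozen weight is augmented by the product of two small matrices.

```python
# Illustrative sketches of two parameter-efficient fine-tuning themes.
import torch
import torch.nn as nn

# 1) BitFit-style: freeze everything except the bias terms.
model = nn.Sequential(nn.Linear(768, 768), nn.GELU(), nn.Linear(768, 768))
for name, p in model.named_parameters():
    p.requires_grad = name.endswith("bias")

# 2) Low-rank reparameterisation: keep W frozen and learn an update B @ A.
class LowRankLinear(nn.Module):  # hypothetical helper, not from [16, 51]
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x):
        return self.base(x) + x @ (self.B @ self.A).T

layer = LowRankLinear(nn.Linear(768, 768), rank=8)
print(layer(torch.randn(4, 768)).shape)  # torch.Size([4, 768])
```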
communities, and the resulting community structure is characterized by high positive and low negative connectivity within each community. It should be noted that this community structure was based on an unbiased weighted connectivity matrix, i.e., we did not impose an arbitrary threshold on the ...
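For illustration only, the following sketch (random synthetic data and an assumed three-community partition, not the study's connectivity matrix) shows how within-community positive and negative connectivity can be read directly off a signed, weighted matrix without applying any threshold.

```python
# Within-community positive/negative connectivity from an unthresholded,
# signed, weighted matrix; data and partition are synthetic placeholders.
import numpy as np

rng = np.random.default_rng(0)
n = 60
W = rng.normal(0.0, 0.3, size=(n, n))
W = (W + W.T) / 2                       # symmetric weighted matrix, signed
np.fill_diagonal(W, 0.0)
labels = np.repeat([0, 1, 2], n // 3)   # assumed 3-community partition

for c in np.unique(labels):
    block = W[np.ix_(labels == c, labels == c)]
    pos = block[block > 0].mean()
    neg = block[block < 0].mean()
    print(f"community {c}: mean positive {pos:.3f}, mean negative {neg:.3f}")
```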
2.2. Trade Computation for Memory in Backbone

Checkpointing [5] is an effective approach to reduce memory consumption, especially for the intermediate results of low-cost operations. It only stores feature maps of some high-cost operations, such as convolut...
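A minimal sketch of this trade-off using generic PyTorch gradient checkpointing (not necessarily the configuration of [5]): the checkpointed block discards its intermediate feature maps during the forward pass and recomputes them during the backward pass.

```python
# Trading computation for memory with checkpointing; illustrative block only.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
)

x = torch.randn(2, 64, 56, 56, requires_grad=True)

# Standard forward: all intermediate feature maps are kept for backward.
y_full = block(x)

# Checkpointed forward: intermediates are dropped and the block is re-run
# during backward to regenerate them.
y_ckpt = checkpoint(block, x, use_reentrant=False)

y_ckpt.sum().backward()
print(x.grad.shape)  # torch.Size([2, 64, 56, 56])
```

In practice, checkpointing is usually applied to the parts of the backbone that are cheap to recompute, so that the extra forward passes cost little while the stored activations shrink substantially.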