The OpenCL Matrix Multiplication design example contains a high-performance implementation of the fundamental matrix multiplication operation and demonstrates optimizations.
EDIT: One thing I also want to add is that you can try experimenting with loop unrolling. Loop unrolling (as long as the iterations are data independent) essentially creates multiple instances of the computation inside the for loop. However, realize that this could impact memory access efficiency since...
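To make the unrolling idea concrete, here is a minimal sketch in plain C (not the poster's kernel, and not OpenCL-specific). The function names `dot_rolled` and `dot_unrolled4` and the unroll factor of 4 are assumptions chosen for illustration. The unrolled version keeps four independent accumulators, which is what "multiple instances of the computation" amounts to when the iterations do not depend on each other.

```c
#include <stddef.h>

/* Baseline: one multiply-add per loop iteration. */
float dot_rolled(const float *a, const float *b, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}

/* Hypothetical unroll-by-4 variant: the loop body is replicated four
 * times, each copy with its own accumulator so the copies carry no
 * dependency on one another. */
float dot_unrolled4(const float *a, const float *b, size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;

    /* Main loop: handles four elements per iteration. */
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }

    float sum = (s0 + s1) + (s2 + s3);

    /* Remainder loop: handles the last n % 4 elements. */
    for (; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

Note that the unrolled version adds the products in a different order, so floating-point results can differ slightly from the rolled loop for general inputs. Each iteration also touches four elements of each array at once, which is one way the memory access pattern changes when you unroll.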
unrolling fusion Meanwhile, to make the conversion from Relay to TE easier, TVM provides a library of pre-defined templates for common tensor operators such as conv2d and transpose: the Tensor Operator Inventory (TOPI). Step 4: Use the auto-tuning module to search for the best schedule. Quoting the original: "A schedule specifies the low-level loop optimizations for an operator ...
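As a rough illustration of what "low-level loop optimizations" means here, the following plain C sketch computes the same matrix product with two different loop structures. It does not use TVM or its API. The names `matmul_naive` and `matmul_tiled`, the row-major flat layout, and the choice of tiling as the example optimization are all assumptions made for illustration. The point is only that the loop structure changes while the computed result stays the same.

```c
#include <stddef.h>

/* Reference loop nest: C = A * B for n x n row-major matrices. */
void matmul_naive(const float *a, const float *b, float *c, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (size_t k = 0; k < n; ++k) {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}

/* Same computation with every loop split into tiles of size `tile`
 * (hypothetical parameter, must be greater than 0). Only the iteration
 * order differs from matmul_naive. */
void matmul_tiled(const float *a, const float *b, float *c,
                  size_t n, size_t tile) {
    for (size_t i = 0; i < n * n; ++i) {
        c[i] = 0.0f;
    }
    for (size_t ii = 0; ii < n; ii += tile) {
        for (size_t jj = 0; jj < n; jj += tile) {
            for (size_t kk = 0; kk < n; kk += tile) {
                /* Work on one tile; the bounds checks handle edges
                 * when n is not a multiple of tile. */
                for (size_t i = ii; i < ii + tile && i < n; ++i) {
                    for (size_t j = jj; j < jj + tile && j < n; ++j) {
                        float sum = c[i * n + j];
                        for (size_t k = kk; k < kk + tile && k < n; ++k) {
                            sum += a[i * n + k] * b[k * n + j];
                        }
                        c[i * n + j] = sum;
                    }
                }
            }
        }
    }
}
```

In this sketch the tile size is picked by hand. An auto-tuner automates that choice by trying candidate loop structures and parameters and keeping the fastest one.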