Efficient GPU Kernels for N:M-Sparse Weights in Deep Learning. Bin Lin, Ningxin Zheng, Lei Wang, Shijie Cao, Lingxiao Ma, Quanlu Zhang, Yi Zhu, Ting Cao, Jilong Xue, Yuqing Yang, Fan Yang. Sixth Conference on Machine Learning and Systems (MLSys'23), June 2023.
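For context, an N:M-sparse weight tensor keeps at most N nonzeros in every group of M consecutive weights (e.g., the 2:4 pattern accelerated by NVIDIA sparse tensor cores). Below is a minimal NumPy sketch of magnitude-based N:M pruning; the group layout and selection rule are illustrative, not the paper's exact procedure:

```python
import numpy as np

def nm_prune(w: np.ndarray, n: int = 2, m: int = 4) -> np.ndarray:
    """Zero out all but the n largest-magnitude weights in every group of m.

    Illustrative magnitude-based N:M pruning; real kernels also need a
    compact storage format (values + per-group indices) to realize speedups.
    """
    assert w.size % m == 0, "weight count must be divisible by the group size m"
    groups = w.reshape(-1, m)                      # one row per group of m weights
    # indices of the (m - n) smallest-magnitude entries in each group
    drop = np.argsort(np.abs(groups), axis=1)[:, : m - n]
    pruned = groups.copy()
    np.put_along_axis(pruned, drop, 0.0, axis=1)   # zero the dropped entries
    return pruned.reshape(w.shape)

w = np.random.randn(4, 8).astype(np.float32)
w_24 = nm_prune(w, n=2, m=4)                       # every 4 consecutive weights keep 2
assert (np.count_nonzero(w_24.reshape(-1, 4), axis=1) <= 2).all()
```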
We’re releasing highly optimized GPU kernels for an underexplored class of neural network architectures: networks with block-sparse weights. Depending on the chosen sparsity, these kernels can run orders of magnitude faster than cuBLAS or cuSPARSE. ...
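For intuition, a block-sparse weight matrix is nonzero only inside a subset of fixed-size tiles, so a kernel can skip whole tiles of work. A hedged NumPy sketch of the computation such a kernel performs (the block size, mask layout, and function name are illustrative, not the released kernels' API):

```python
import numpy as np

def block_sparse_matmul(x, blocks, mask, bs):
    """y = x @ W for a block-sparse W given as packed nonzero blocks.

    mask:   (R, C) boolean grid of which bs x bs blocks of W are nonzero
    blocks: (nnz_blocks, bs, bs) values of the nonzero blocks, in row-major
            order of the True entries of mask
    """
    R, C = mask.shape
    y = np.zeros((x.shape[0], C * bs), dtype=x.dtype)
    b = 0
    for r in range(R):
        for c in range(C):
            if mask[r, c]:
                # only stored blocks contribute; empty blocks are skipped entirely
                y[:, c * bs:(c + 1) * bs] += x[:, r * bs:(r + 1) * bs] @ blocks[b]
                b += 1
    return y

rng = np.random.default_rng(0)
bs, R, C = 16, 4, 4
mask = rng.random((R, C)) < 0.25                    # ~75% of blocks empty
blocks = rng.standard_normal((mask.sum(), bs, bs)).astype(np.float32)
x = rng.standard_normal((8, R * bs)).astype(np.float32)
y = block_sparse_matmul(x, blocks, mask, bs)
```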
Our ESIMD optimizations target the Intel Data Center GPU Max 1550. Evaluated on the test dataset used by previous work, our implementation outperforms state-of-the-art CUDA implementations on the latest NVIDIA hardware by up to a factor of 6.14. Additionally, our proposed ...
# Sparse GPU Kernels for Deep Learning

This repo accompanies the paper [Sparse GPU Kernels for Deep Learning](https://arxiv....).
3. Fine-tuning kernels (beta). We release multiple kernels for sparse-attention-aware fine-tuning; see seen_attn/kernel/varlen for details. These compress the sequence dimension for Q, K, and V, similar to the current SeerAttention prefill:

    k = repeat_kv_varlen(k, self.num_key_value_groups)
    v = repeat_kv_varlen(v, self.num_key_value_groups)  # presumably symmetric with k
    ...
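The repeat_kv_varlen call expands grouped-query K/V heads so each query head has a matching K/V head. Below is a hedged PyTorch sketch of the usual fixed-length version of this helper; the name repeat_kv and the shape convention are assumptions, and the repo's varlen variant operates on packed variable-length sequences instead:

```python
import torch

def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
    """Expand K/V heads for grouped-query attention.

    x: (batch, num_kv_heads, seq_len, head_dim)
    Returns (batch, num_kv_heads * n_rep, seq_len, head_dim), repeating each
    KV head n_rep times so it lines up with the query heads in its group.
    Sketch of the common fixed-length helper, not the repo's varlen kernel.
    """
    if n_rep == 1:
        return x
    b, kv, s, d = x.shape
    return x[:, :, None, :, :].expand(b, kv, n_rep, s, d).reshape(b, kv * n_rep, s, d)
```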
Normally, implementing sparse attention involves slicing the query and key matrices into blocks, so to ease experimentation we implemented a set of block-sparse kernels that perform these operations efficiently on the GPU. We open-source these kernels and provide example sparse attention functions...
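Concretely, block-sparse attention computes scores only for (query block, key block) pairs present in a block layout. A hedged PyTorch reference sketch that masks instead of skipping (block size and layout are illustrative; the real kernels never materialize the masked-out blocks):

```python
import torch

def block_sparse_attention(q, k, v, layout, bs):
    """Reference block-sparse attention: scores are computed as if dense,
    then blocks absent from `layout` are set to -inf before the softmax.
    A real block-sparse kernel skips those blocks instead of masking them.

    q, k, v: (seq_len, head_dim); layout: (seq_len//bs, seq_len//bs) bool
    """
    scale = q.shape[-1] ** -0.5
    scores = (q @ k.T) * scale                              # (seq, seq)
    # expand the block-level layout to an element-level mask
    mask = layout.repeat_interleave(bs, 0).repeat_interleave(bs, 1)
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

seq, dim, bs = 64, 32, 16
q, k, v = (torch.randn(seq, dim) for _ in range(3))
layout = torch.tril(torch.ones(seq // bs, seq // bs, dtype=torch.bool))  # causal blocks
out = block_sparse_attention(q, k, v, layout, bs)
```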
PIT: Optimization of Dynamic Sparse Deep Learning Models via Permutation Invariant Transformation. Ningxin Zheng, Huiqiang Jiang, Quanlu Zhang, Zhenhua Han, Lingxiao Ma, Yuqing Yang, Fan Yang, Chengruidong Zhang, Lili Qiu, Mao Yang, Lidong Zhou. ...
Sputnik is a library of sparse linear algebra kernels and utilities for deep learning.

## Build

Sputnik uses the CMake build system. Sputnik depends on the CUDA toolkit (v10.1+) and supports SM70+. The only additional dependency for the library is google/glog. To build the library, enter the ...
Sparse-dense matrix multiplication (SDMM) operations are useful in a deep learning context. But traditional CPU and GPU instruction set architectures require symmetric inputs of the same density, which limits the performance advantage that can be gained by exploiting the sparsity of a sparse input...
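For a concrete picture of the asymmetry, SDMM multiplies a compressed sparse operand (e.g., CSR) by a dense operand, so only the stored nonzeros contribute work. A small SciPy sketch (sizes and density are illustrative):

```python
import numpy as np
from scipy.sparse import random as sparse_random

# Sparse operand in CSR: only the stored nonzeros participate in the product.
a_sparse = sparse_random(512, 512, density=0.05, format="csr", dtype=np.float32)
b_dense = np.random.randn(512, 128).astype(np.float32)

c = a_sparse @ b_dense          # SDMM: ~5% of the multiply-adds of a dense GEMM
assert c.shape == (512, 128)

# Same result as densifying first, without the wasted work on zeros.
np.testing.assert_allclose(c, a_sparse.toarray() @ b_dense, rtol=1e-3)
```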
FIG. 4E illustrates an encoding of the positions for the weight values in the four 3×3 convolution kernels shown in FIG. 4D, in accordance with one embodiment; FIG. 4F shows a block diagram for determining the (r,s) weight coordinates, in accordance with one embodiment; FIG. 4G shows ...