Matrix multiplication; Parallel algorithms; Strassen's; Winograd's algorithm. Several implementations of matrix multiplication (MMUL) in Fortran and VAX assembly language are discussed. On a VAX-11/780 computer, the most efficient MMUL is achieved through vector-scalar-multiply-and-add (VSMA) operations, ...
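To make the VSMA formulation concrete, here is a minimal Python sketch of matrix multiplication built entirely from vector-scalar-multiply-and-add updates. The function names `vsma` and `mmul_vsma` are illustrative assumptions and do not come from the paper, whose Fortran and VAX assembly code is not shown in the snippet.

```python
def vsma(y, a, x):
    """In-place vector-scalar-multiply-and-add: y <- y + a * x."""
    for i in range(len(y)):
        y[i] += a * x[i]


def mmul_vsma(A, B):
    """Return C = A * B for row-major lists of lists.

    Each row of C is accumulated as a sum of scaled rows of B:
    C[i, :] += A[i][k] * B[k, :], which is one VSMA per (i, k) pair.
    """
    n, p = len(A), len(B)
    if any(len(row) != p for row in A):
        raise ValueError("inner dimensions do not match")
    m = len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for k in range(p):
            vsma(C[i], A[i][k], B[k])
    return C
```

The inner loop streams through contiguous rows of `B` and `C`, which is the access pattern that makes this ordering attractive on vector-friendly hardware.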
Abstract. The fast matrix multiplication algorithm by Strassen is used to obtain the triangular factorization of a permutation of any nonsingular matrix of order n in operations, and, hence, the inverse of any nonsingular matrix in 1. Introduction. Strassen [3] has given an algorithm using noncom...
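For reference, Strassen's multiplication step itself can be sketched as follows. This is only the standard seven-product recursion for square matrices whose order is a power of two, not the triangular factorization described in the abstract; the helper names are assumptions made for illustration.

```python
def _add(X, Y):
    return [[a + b for a, b in zip(r, s)] for r, s in zip(X, Y)]


def _sub(X, Y):
    return [[a - b for a, b in zip(r, s)] for r, s in zip(X, Y)]


def _split(M):
    """Split a square matrix of even order into four quadrants."""
    h = len(M) // 2
    return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
            [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])


def strassen(A, B):
    """Multiply two n x n matrices (n a power of two) with 7 recursive products."""
    n = len(A)
    if n == 0 or n & (n - 1):
        raise ValueError("order must be a positive power of two")
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    A11, A12, A21, A22 = _split(A)
    B11, B12, B21, B22 = _split(B)
    M1 = strassen(_add(A11, A22), _add(B11, B22))
    M2 = strassen(_add(A21, A22), B11)
    M3 = strassen(A11, _sub(B12, B22))
    M4 = strassen(A22, _sub(B21, B11))
    M5 = strassen(_add(A11, A12), B22)
    M6 = strassen(_sub(A21, A11), _add(B11, B12))
    M7 = strassen(_sub(A12, A22), _add(B21, B22))
    C11 = _add(_sub(_add(M1, M4), M5), M7)
    C12 = _add(M3, M5)
    C21 = _add(M2, M4)
    C22 = _add(_add(_sub(M1, M2), M3), M6)
    top = [r1 + r2 for r1, r2 in zip(C11, C12)]
    bottom = [r1 + r2 for r1, r2 in zip(C21, C22)]
    return top + bottom
```

Practical implementations usually switch to the classical algorithm below some cutoff order, since the recursion overhead dominates for small blocks.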
For $m \le n^{1.14}$, the new algorithm performs an almost optimal number of only $n^{2+o(1)}$ operations. For $m \le n^{1.68}$, the new algorithm is also faster than the best known matrix multiplication algorithm for dense matrices, which uses $O(n^{2.38})$ algebraic operations. The ...
Estimation of the weighted mean and covariance matrix using an online algorithm (Clarke, 1971). Computation of central moments up to fourth order using an online algorithm (Spicer, 1972). Fast computation of Hadamard product using unrolled loops. ...
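As an illustration of what an online weighted estimator looks like, the sketch below uses a common incremental (Welford/West-style) update. It is an assumption that this resembles the cited procedure; the class name `OnlineWeightedStats` and the choice of the weight-normalized (population) covariance are hypothetical and not taken from the source.

```python
class OnlineWeightedStats:
    """Single-pass weighted mean and covariance for fixed-length vectors."""

    def __init__(self, dim):
        self.weight_sum = 0.0
        self.mean = [0.0] * dim
        # Running weighted scatter matrix: sum of w * (x - mean)(x - mean)^T.
        self.scatter = [[0.0] * dim for _ in range(dim)]

    def update(self, x, w=1.0):
        if w <= 0:
            raise ValueError("weight must be positive")
        self.weight_sum += w
        # Deviation from the mean before this observation is absorbed.
        delta = [xi - mi for xi, mi in zip(x, self.mean)]
        ratio = w / self.weight_sum
        self.mean = [mi + ratio * di for mi, di in zip(self.mean, delta)]
        # Deviation from the updated mean.
        resid = [xi - mi for xi, mi in zip(x, self.mean)]
        for i, di in enumerate(delta):
            for j, rj in enumerate(resid):
                self.scatter[i][j] += w * di * rj

    def covariance(self):
        """Weight-normalized covariance (divides by the total weight)."""
        return [[s / self.weight_sum for s in row] for row in self.scatter]
```

Each observation is processed once and then discarded, so memory use depends only on the dimension, not on the number of samples.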
fast algorithm; wave scattering; numerical method; integral equations. A new technique is presented for accelerating the fast multipole method, allowing rapid solution of surface integral equations for wave-scattering problems. A nonnested, ray-propagation approach is used to compute a matrix-vector multiply in O...
identify opportunities on commodity hardware for a single-wide (≥8b) multiply to compute a dot product of two vectors with multiple narrow (<8b) elements; propose ULPPACK, an efficient implementation of sub-8-bit GEMM (General Matrix Multiplication) computation, by leveraging efficient packing and...
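The general idea of getting a dot product out of one wide multiply can be sketched in a few lines. This is an illustration of the packing principle only, under the assumption of unsigned 3-bit elements and 8-bit fields; it is not ULPPACK's actual implementation, and the names `packed_dot2` and `packed_dot` are hypothetical.

```python
SHIFT = 8                  # field width in bits (assumed for this sketch)
MASK = (1 << SHIFT) - 1


def packed_dot2(a, b):
    """Dot product of two 2-element vectors using a single multiply.

    Packing a as (a0 | a1 << SHIFT) and b in reversed order as
    (b1 | b0 << SHIFT) gives a product whose middle field holds
    a0*b0 + a1*b1. With 3-bit inputs no field overflows 8 bits.
    """
    a0, a1 = a
    b0, b1 = b
    packed_a = a0 | (a1 << SHIFT)
    packed_b = b1 | (b0 << SHIFT)
    product = packed_a * packed_b      # the one wide multiply
    return (product >> SHIFT) & MASK


def packed_dot(a, b):
    """Dot product of equal, even-length vectors of unsigned 3-bit values."""
    if len(a) != len(b) or len(a) % 2:
        raise ValueError("vectors must have equal, even length")
    if any(not 0 <= v < 8 for v in list(a) + list(b)):
        raise ValueError("elements must fit in 3 unsigned bits")
    total = 0
    for i in range(0, len(a), 2):
        total += packed_dot2(a[i:i + 2], b[i:i + 2])
    return total
```

The field width has to leave enough headroom for the largest possible partial sum, which is why narrower elements allow more of them to share one multiply.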
Fast Matrix Multiply with Multi-Dimensional Support. MTIMESX is a fast general purpose matrix and scalar multiply routine that has the following features: supports multi-dimensional (nD, n>2) arrays directly; supports Transpose, Conjugate Transpose, and Conjugate pre-operations; ...
This MATLAB function computes the discrete Fourier transform (DFT) of X using a fast Fourier transform (FFT) algorithm.
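To show how an FFT computes the same result as the DFT definition, here is a minimal recursive radix-2 Cooley-Tukey sketch in Python. It is not MATLAB's implementation, which handles arbitrary lengths and dimensions; the name `fft_radix2` and the power-of-two length restriction are assumptions of this sketch.

```python
import cmath


def dft(x):
    """Direct O(n^2) evaluation of X[k] = sum_j x[j] * exp(-2*pi*i*j*k/n)."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]


def fft_radix2(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a positive power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        # Twiddle factor applied to the odd-indexed half.
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out
```

Both functions use the same sign convention as the forward transform, so their outputs agree up to floating-point rounding.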
We propose a data-parallel CG algorithm to run on multiple GPUs and a CPU located on the same board. Rows of the matrix and corresponding vector entries are distributed amongst GPUs. Since MxV takes most of the iteration time, we assign nonzeros of the matrix equally to each GPU, so ...
Valiant showed that Boolean matrix multiplication (BMM) can be used for CFG parsing. We prove a dual result: CFG parsers running in time $O(|G||w|^{3 - \epsilon})$ on a grammar $G$ and a string $w$ can be used to multiply $m \times m$ Boolean matrices in time $O(m^{3 -...