Abstract: "We present new recursive serial and parallel algorithms for QR factorization of an m by n matrix. They improve performance. The recursion leads to an automatic variable blocking and it also replaces a level 2 part in a standard block algorithm by level 3 operations. However, there are significant additional costs for creating and performing the updates which prohibits the efficient use of the recursion for large n. We present a quantitative analysis of these extra costs. This analysis leads us to introduce a hybrid recursive algorithm that outperforms the LAPACK algorithm DGEQRF by around 20% for large square matrices and up to almost a factor of 3 for tall thin matrices. Uniprocessor performance results are presented for two IBM RS/6000 SP nodes; a 120 MHz IBM POWER2 node and one processor of a 4-way 332 MHz IBM PPC604e SMP node. The hybrid recursive algorithm reaches over 90% of the theoretical peak performance of the POWER2 node. Compared to standard type block algorithms, the recursive approach also shows a significant advantage in the automatic tuning obtained from its automatic variable blocking. A successful parallel implementation on a 4-way 332 MHz IBM PPC604e SMP node based on dynamic load balancing is presented. For 2, 3, 4 processors it shows speedups up to 1.97, 2.99, and 3.97."
· 1996
Abstract: "This paper describes the design, implementation, and evaluation of a parallel algorithm for the Cholesky factorization of banded matrices. The algorithm is part of IBM's Parallel Engineering and Scientific Subroutine Library version 1.2 and is compatible with ScaLAPACK's banded solver. Analysis, as well as experiments on an IBM SP2 distributed-memory parallel computer, show that the algorithm efficiently factors banded matrices with wide bandwidth. For example, a 31-node SP2 factors a large matrix more than 16 times faster than a single node would factor it using the best sequential algorithm, and more than 20 times faster than a single node would using LAPACK's DPBTRF. The algorithm uses novel ideas in the area of distributed dense matrix computations. These include the use of a dynamic schedule for a blocked systolic-like algorithm, separation of the dynamic scheduler and numerical computation, and the separation of the input and output data layouts from the layout the algorithm uses internally. The algorithm also uses known techniques such as blocking to improve its communication-to-computation ratio and its data- cache behavior."
Abstract: "We describe some modifications of the LAPACK dense linear algebra algorithms using recursion. Recursion leads to automatic variable blocking. LAPACK's level two versions transform into level three codes by using recursion. The new recursive codes are written in FORTRAN 77 which does not support recursion as a language feature. Gaussian elimination with partial pivoting and Cholesky factorization are considered. Very clear algorithms emerge using recursion. The recursive codes do exactly the same computation as the LAPACK codes. And, a single recursive code replaces both the level 2 and level 3 versions of the corresponding LAPACK codes. We present an analysis of the recursive algorithm both in terms of flop count and storage usage. The matrix operands are more 'squarish' using recursion. The total area of the submatrices used in the recursive algorithm is less than the total area used by the LAPACK level 3 right/left looking algorithms. We quantify by how much. We also quantify how the flops are computed. Also, we exhibit that the algorithms have high performance on RISC type processors. In fact, except for small matrices the recursive version outperforms the level 3 LAPACK versions of DGETRF and DPOTRF on a RS/6000 workstation. For the level 2 versions the performance gain approaches a factor of 3. We also demonstate that a change to LAPACK's DLASWP routine can improve performance of both the recursive version and DGETRF by more than 15%."
· 1995
Abstract: "A 3-dimensional (3-D) matrix multiplication algorithm for massively parallel processing systems is presented. The P processors are configured as a 'virtual' processing cube with dimensions p1, p2, and p3 proportional to the matrices' dimensions -- M, N, and K. Each processor performs a single local matrix multiplication of size M/p1 x N/p2 x K/p3, on one of the sub-cubes in the computational space. Before the local computation can be carried out, each sub-cube needs to receive a single sub-matrix of A and B. After the single matrix multiplication has completed, K/p3 sub-matrices of this product have to be sent to their respective destination processors. The 3-D parallel matrix multiplication approach has a factor P[superscript 1/6] less communication than the 2-D parallel algorithms. This algorithm has been implemented on IBM Powerparallel SP-2 systems (up to 216 nodes) and has yielded close to the peak performance of the machine. The algorithm has been combined with Winograd's variant of Strassen's algorithm to achieve 'super-linear' speed- up. The performance achieved exceeds the theoretical peak of the system."
Abstract: "Portability in Java requires that all architectures produce the same result for a particular floating-point calculation. Conceptually, the best definition for this result is the correctly rounded value of the infinitely precise result. Although this is likely an unattainable goal in general, it should be striven for when attainable with reasonable performance. Although many architectures implement IEEE double-precision arithmetic in hardware, providing the correctly rounded result for the basic operations (+, -, *, /) with good performance, most do not provide hardware support for extended precision arithmetic. The performance and accuracy of the resulting software implementations can be poor. Many machines provide a double-precision fused multiply-add (FMA) instruction that computes d=a*b+c with d correctly rounded. Using the FMA instruction, higher performance algorithms for extended-precision arithmetic can be devised. We present examples of such algorithms for the basic operations, +, -, *, /. The algorithms always compute the exact result for each of the operations. This contributes to the Java goal of portability by providing a large spectrum of architectures on which it is possible to efficiently implement correctly rounded extended precision floating point. The FMA is also valuable in large-scale computation of linear algebra problems, where the accumulation of dot products is a basic operation. Performing this accumulation in extended precision greatly increases the size of problem that may be solved before roundoff error becomes a limiting factor. It also produces a much faster way, over just using IEEE arithmetic, of doing the more accurate computation. Given two integers a and b, there exist unique integers q and r such that a =bq + r with 0 [