# Computational performance

A key operation in linear algebra is the multiplication of two dense matrices, or more generally, the operation

)

for a scalar *a* and matrices A, B, and C of orders (say) *mxk, кxn,* and *mxn. *This can be performed very effectively if B and C are stored by columns (the Fortran way) and there is room in the *i* cache for A and a few columns of B and C. C is computed column by column. For each column, 2m reals are moved into the cache, (2k + 1)m floating-point operations are performed, and m real results are stored to memory and are no longer required in the cache. If к is reasonably large, say 32, the data movement time will be small compared with the arithmetic time. There is no need for the cache to hold B or C—indeed there is no limit on the size of n. The system will bring the columns into cache as they are needed and may indeed be able to overlap the computation and data movement. This is known as streaming.

For the highest computational performance, it is necessary to exploit any parallelism available in the computer. We will make use of it at various levels. For the first, the computation is arranged to involve fully independent subproblems for which code can execute independently. We might permute a sparse set of equations to the form

which allows us to begin by working independently on the four subproblems

where ^^{5}=1 A|_{5} = A55 and ^^{5}=1 b| = Ъб .

Such a block structure can occur though the use of a nested dissection ordering, as discussed in Chapter 9, or when domain decomposition is used in the numerical solution of partial differential equations. Parallelism can also be exploited at the level of the still sparse submatrices, A**j*. Finally, we show in Chapters 11-13 how dense kernels can be used at the heart of a sparse code. For example, consider the matrix multiplication problem of the previous paragraph. If the problem is large enough, the data may be redistributed across the processes so that they share the work and each works on submatrices for which streaming occurs.

When discussing the merits of various algorithms throughout the remainder of this book, we will use the formula (1.4.1) where appropriate. K represents the theoretical peak performance of the computer. Attaining the peak performance would require maximally efficient use of the memory hierarchy, which is rarely possible for sparse problems. The data structures in a sparse problem are unlikely to use the pipelines, buffers, and memory hierarchies without inefficiencies. For this reason, we compare performance on different computer architectures and different problems using cases from the Florida Sparse Matrix Collection (see Section 1.7) and elsewhere. In the case of dense vectors and matrices, the outlook is more predictable. Here the basic manipulations have been tuned to the various architectures using the BLAS (Basic Linear Algebra Subprograms). The Level 1 BLAS (Lawson, Hanson, Kincaid, and Krogh 1979) are for vector operations, the Level 2 BLAS (Dongarra, Du Croz, Hammarling, and Hanson 1988) are for matrix-vector operations, and the Level 3 BLAS (Dongarra, Du Croz, Duff, and Hammarling 1990) are for matrix-matrix operations. Optimized versions of these subroutines are available from vendors. Of particular importance are the Level 3 BLAS because they involve many arithmetic operations for each value that is moved through caches to the registers. Although the BLAS are crucial in the efficient implementation of dense matrix computation, they also play an important role in sparse matrix computation because it is often possible to group much of the computation into dense blocks.