Switching to full form
From our considerations so far, it is clear that complicated data structures and programming are needed to take advantage of sparsity. On most computer architectures an inner loop involving indirect addressing of the form x(row_index(kk)), common in sparse codes (see, for example, Figure 10.5.1), will run slower than the corresponding loop with direct addressing (as used for dense matrices). Some example speeds are shown in Table 10.7.1. Note that the indirect dot product computation a = xTy is faster than the x = x + ay computation, probably because it involves less writing of values to memory.
For very sparse systems, the problems of cache misses or lack of vectorization are minor compared with the saving from exploiting sparsity, but equally clearly it would not be sensible to solve a dense system with a sparse code. However, this is exactly what happens towards the end of Gaussian elimination when fill-in causes the reduced submatrices to become increasingly dense. We illustrate this in Table 10.7.2, where we show examples of the densities of reduced submatrices of various orders. The optimal point at which to switch is dependent on the computer being used. For example, parallel architectures are likely to favour using the dense code, making it desirable to switch at a lower density.
Towards the end of the elimination, we are thus dealing with matrices that would be more efficiently factorized by dense matrix code. This suggests using a hybrid approach in which a switch is made from the use of sparse code to dense code at a point determined by the order and density of the reduced submatrix and by the computing environment. In addition, in a parallel environment, we can use dense codes that are tuned for such architectures.
Dembart and Erisman (1973) discussed such hybrid codes and indicated the possible savings. The first general-purpose sparse code to employ such a switch was the SLMATH package of IBM (1976); it switched at 100% density in the
Table 10.7.2 Examples of the densities of the reduced submatrices in Gaussian elimination. A dense matrix has density 1.0.
goodwin
|
rim 22.6 1 015.0 |
onetonel
|
shyy161
|
|
Order of reduced submatrix |
||||
4000 |
0.05 |
0.16 |
0.02 |
0.07 |
2000 |
0.24 |
0.49 |
0.58 |
0.37 |
1000 |
0.56 |
0.55 |
0.84 |
0.66 |
500 |
0.71 |
0.53 |
0.88 |
0.98 |
200 |
0.73 |
0.60 |
0.94 |
0.97 |
100 |
0.89 |
0.68 |
0.99 |
0.94 |
50 |
0.92 |
0.84 |
1.00 |
0.95 |
20 |
0.99 |
0.99 |
1.00 |
1.00 |
Table 10.7.3 Times (in seconds on a Dual-socket 3.10 GHz Xeon E5-2687W, using the Intel compiler with O2 optimization) for the ANALYSE, FACTORIZE, and SOLVE phases of MA48 when the switch to dense form is made at different densities.
Matrix |
goodwin |
rim |
onetonel |
shyy161 |
||||||||
Density |
A |
F |
S |
A |
F |
S |
A |
F |
S |
A |
F |
S |
0.10 |
1.36 |
0.30 |
0.008 |
8.3 |
2.87 |
0.027 |
0.96 |
0.30 |
0.007 |
2.89 |
0.88 |
0.015 |
0.25 |
1.87 |
0.24 |
0.005 |
15.6 |
2.47 |
0.022 |
4.36 |
0.25 |
0.007 |
4.51 |
0.76 |
0.012 |
0.50 |
2.20 |
0.28 |
0.004 |
20.4 |
3.44 |
0.022 |
5.94 |
0.42 |
0.006 |
6.98 |
1.15 |
0.012 |
0.75 |
2.37 |
0.34 |
0.005 |
22.1 |
4.15 |
0.023 |
7.23 |
0.61 |
0.006 |
10.90 |
2.21 |
0.013 |
1.0 |
2.37 |
0.34 |
0.004 |
22.1 |
4.14 |
0.022 |
8.52 |
0.99 |
0.007 |
11.44 |
2.40 |
0.014 |
Table 10.7.4 Storage requirements (millions of real variables) when the switch to dense form is made at different densities.
Matrix Density |
goodwin |
rim |
onetone1 |
shyy161 |
0.10 |
8.19 |
41.0 |
9.57 |
20.0 |
0.25 |
3.76 |
20.5 |
7.23 |
12.0 |
0.50 |
2.77 |
15.5 |
4.66 |
8.8 |
0.75 |
2.60 |
14.4 |
3.95 |
7.6 |
1.00 |
2.60 |
14.4 |
3.81 |
7.4 |
to dense code. This means that ANALYSE times increase monotonically as the threshold is increased. Indeed, the very low threshold of 0.1 can substantially reduce the ANALYSE time, as is illustrated by the results for rim, onetonel, and shyy161 in Table 10.7.3. If only a single matrix is to be factorized, the reduction in ANALYSE time can more than compensate for the increase in factorization time.
A multifrontal approach (see Section 12.2) or a similar technique always works with dense submatrices, so needs no special switch. Such a method is advantageous on a computer with high overheads for indirect addressing, but we are not able to draw general conclusions.
- [1] ANALYSE entry only. The HSL code MA28 was modified to include a switch in allphases (Duff 1984c) and this was included in the design of the MA48 code. Weillustrate the benefit of performing this switch on the ANALYSE, FACTORIZE, andSOLVE phases of MA48 in Table 10.7.3. Extensive experiments by Duff and Reid(1996a) showed that the choice of switch density is not critical although a valueof 0.5 seemed best from the tests they conducted and has been set as the defaultvalue for the MA48 code. Our recent tests, using bigger matrices, indicate that alower density is usually better if factorization time is of critical importance. Thisis illustrated by the results for rim, onetonel, and shyy161 in Table 10.7.3. Of course, when the switch is made at a low density, more storage is requiredand this is indicated in Table 10.7.4. It illustrates the potential storage penaltyto be paid for switching to dense form at lower densities. We found that thefactorization time almost always increases above 0.5 and the storage overheadat 0.5 is modest so this seems a reasonable default choice. Because we do not compute numerical factors during the MA48 ANALYSE(Section 10.4), computation ceases in this phase as soon as the switch is made
- [2] ANALYSE entry only. The HSL code MA28 was modified to include a switch in allphases (Duff 1984c) and this was included in the design of the MA48 code. Weillustrate the benefit of performing this switch on the ANALYSE, FACTORIZE, andSOLVE phases of MA48 in Table 10.7.3. Extensive experiments by Duff and Reid(1996a) showed that the choice of switch density is not critical although a valueof 0.5 seemed best from the tests they conducted and has been set as the defaultvalue for the MA48 code. Our recent tests, using bigger matrices, indicate that alower density is usually better if factorization time is of critical importance. Thisis illustrated by the results for rim, onetonel, and shyy161 in Table 10.7.3. Of course, when the switch is made at a low density, more storage is requiredand this is indicated in Table 10.7.4. It illustrates the potential storage penaltyto be paid for switching to dense form at lower densities. We found that thefactorization time almost always increases above 0.5 and the storage overheadat 0.5 is modest so this seems a reasonable default choice. Because we do not compute numerical factors during the MA48 ANALYSE(Section 10.4), computation ceases in this phase as soon as the switch is made