The /pdense folder contains BLAS3 and LAPACK-like algorithms that work upon /dense containers like Matrix and LowerMatrix, but are internally multithreaded for improved performance. Beyond their importance as building blocks for /multifrontal solvers, these functions are also motivated by the need to manipulate the large, dense Schur complement matrices that arise in substructuring and block preconditioners. Asymptotically speaking, factoring a D-dimensional sparse matrix and factoring a (D-1)-dimensional Schur complement (like a Dirichlet-to-Neumann map on a boundary) carry the same space/time cost. To multithread one of these calculations but neglect the other would lead to an imbalance in the overall solution process.
All of these functions are layout compatible with their sequential equivalents in /dense, so they can be used essentially as drop-in replacements. Often, additional parameters (algorithmic blocksize, number of applied threads) can be passed in through a trailing pdense::Options argument. A bit of caution is warranted, however, because some of the algorithms for linear system solution (psytrf() and phetrf() in particular) use weaker pivoting strategies to enhance parallelism, and the resulting decomposition might require backwards refinement to yield accurate results. Some experimentation is recommended.
BLAS3 routines are good candidates to parallelize via multiple threads. Because they perform O(n^3) work upon O(n^2) data, large problem instances are CPU-bound. Within /pdense is a complete thread-parallel BLAS3 implementation that is signature compatible (and even layout compatible) with the sequential routines inside /dense.
The following example (tutorial/pdense/pgemm.cpp) performs a runtime comparison between sequential gemm() and parallel pgemm(), and then verifies that the difference between the two is negligible. Note that sequential routines and parallel routines do not always yield exactly the same answer! Although both routines perform the same operations, they may perform them in a different order, and so can yield slightly different results due to the intrinsic non-associativity of floating point arithmetic.
These routines are structured such that almost all their work occurs inside gemm(), plus some small amount (asymptotically speaking) of non-gemm() work. They should all perform well as long as your underlying BLAS implements a high-quality single-threaded gemm(). For tuning purposes, the algorithmic blocksize can be varied via a pdense::Options structure (see Appendix).
LAPACK-like routines.
Although /pdense provides a feature-complete BLAS3 layer, the support for LAPACK-like routines is more limited. Only "single-sided" decompositions related to linear system solution are provided.
The following parallel LAPACK-like routines are available:
- ppotrf(): Cholesky decomposition of a symmetric/Hermitian positive definite matrix.
- pgetrf(): LU decomposition of a general matrix, with partial pivoting.
- psytrf(): L*D*L' decomposition of a symmetric matrix, with tile pivoting.
- phetrf(): L*D*L' decomposition of a Hermitian matrix, with tile pivoting.
The following example (tutorial/pdense/psytrf.cpp) performs a runtime comparison between sequential and parallel L*D*L' decomposition (sytrf() versus psytrf()). The parallel version is faster, but also less accurate because it uses a weaker tile-pivoting strategy. Special caution is warranted when using psytrf() and phetrf(), especially in single precision. Backwards refinement can also be used to improve the solution (see the /iterative tutorial). The Cholesky and LU decompositions (ppotrf() and pgetrf(), respectively) do not have these deficiencies. The former is naturally very stable, and the latter implements the same partial-pivoting strategy as sequential LAPACK (thus favoring stability over speed/parallelism).
---- sequential ----
factor A = L*D*L' time: 11.1376 sec
solve A*X=B time: 20.9472 sec
|A*X-B| = 2.06205e-09
---- parallel ----
factor A = L*D*L' time: 1.04306 sec
solve A*X = B time: 2.46214 sec
|A*X-B| = 1.17805e-09
Each of these /pdense routines is layout compatible with the corresponding /dense routine. You may also freely mix thread-parallel and sequential routines for different phases of the linear solution process. For instance, you may factor A=L*U in parallel using pgetrf() and then backsolve x=A\b using the sequential trsm(). This would be appropriate if A is large but x only has one column (backsolving just one column is memory-bound, so adding additional threads may actually slow it down due to cache eviction issues).
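For a sense of why a one-column backsolve is memory-bound: it sweeps the entire O(n^2) factor once while performing only O(n^2) flops. A minimal standalone forward-substitution sketch (plain C++, a hypothetical helper, not the library's trsm()):

```cpp
#include <vector>

// Solve L*y = b for a dense lower-triangular L (row-major, n x n).
// Every entry of L is read exactly once, so the loop streams O(n^2)
// data while doing only O(n^2) flops: memory traffic dominates.
std::vector<double> forward_substitute(const std::vector<double>& L,
                                       const std::vector<double>& b)
{
  std::size_t n = b.size();
  std::vector<double> y(n);
  for (std::size_t i = 0; i < n; ++i)
  {
    double s = b[i];
    for (std::size_t j = 0; j < i; ++j) s -= L[i*n + j] * y[j];
    y[i] = s / L[i*n + i];
  }
  return y;
}
```

With a flop-to-byte ratio this low, extra threads mostly contend for the same memory bandwidth, which is the effect described above.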
Appendix: Optional parameters.
Most of the routines in /pdense can take a pdense::Options structure at the end of their signature. It generally controls threading behavior. The following properties are available for user tuning:
nthreads: specifies how many threads should be used, defaults to getenv("MYRAMATH_NUM_THREADS")
progress: pointer to a ProgressMeter for visual feedback during long calculations, defaults to nullptr (no progress metering)
blocksize: sets algorithmic blocksize for tuning purposes, defaults to 256
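The nthreads default above is read from the environment. A sketch of that kind of lookup (a hypothetical standalone helper, not the library's actual code):

```cpp
#include <cstdlib>

// Read a positive thread count from an environment variable,
// falling back to a default when it is unset or malformed.
int thread_count_from_env(const char* name, int fallback)
{
  const char* s = std::getenv(name);
  if (s == nullptr) return fallback;
  int n = std::atoi(s);
  return (n > 0) ? n : fallback;
}
```

Setting MYRAMATH_NUM_THREADS once in the environment gives a process-wide default, while an explicit nthreads in a pdense::Options overrides it for a single call.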
Note that pdense::Options can be used to choose thread-parallelism and algorithmic blocking on a call-by-call basis. The following example (tutorial/pdense/options.cpp) performs a runtime comparison between getrf() and pgetrf(), using 2 threads to factor and backsolve. Each call provides visual feedback on std::cout, by injecting a user-defined ProgressMeter.