The /pdense folder contains BLAS3 and LAPACK-like algorithms that work upon /dense containers like Matrix and LowerMatrix, but are internally multithreaded for improved performance. Beyond their importance as building blocks for /multifrontal solvers, these functions are also motivated by the need to manipulate the large, dense Schur complement matrices that arise in substructuring and block preconditioners. Asymptotically speaking, factoring a D-dimensional sparse matrix and factoring a (D-1)-dimensional Schur complement (like a Dirichlet-to-Neumann map on a boundary) carry the same space/time cost. To multithread one of these calculations but neglect the other would lead to an imbalance in the overall solution process.
All of these functions are layout compatible with their sequential equivalents in /dense, so they can be used essentially as drop-in replacements. Often, additional parameters (algorithmic blocksize, number of applied threads) can be passed in through a trailing pdense::Options argument. A bit of caution is warranted, however, because some of the algorithms for linear system solution (psytrf() and phetrf() in particular) use weaker pivoting strategies to enhance parallelism, and the resulting decomposition might require backwards refinement to yield accurate results. Some experimentation is recommended.
BLAS3 routines are good candidates to parallelize via multiple threads. Because they perform O(n^3) work upon O(n^2) data, large problem instances are CPU-bound. Within /pdense is a complete thread-parallel BLAS3 implementation that is signature compatible (and even layout compatible) with the sequential routines inside /dense.
The following example (tutorial/pdense/pgemm.cpp) performs a runtime comparison between sequential gemm() and parallel pgemm(), and then verifies that the difference between the two is negligible. Note that sequential routines and parallel routines do not always yield exactly the same answer! Although both routines perform the same operations, they may perform them in a different order, and so can yield slightly different results due to the intrinsic non-associativity of floating point arithmetic.
These routines are structured such that almost all their work occurs inside gemm(), plus some small amount (asymptotically speaking) of non-gemm() work. They should all perform well as long as your underlying BLAS implements a high-quality single-threaded gemm(). For tuning purposes, the algorithmic blocksize can be varied via a pdense::Options structure (see Appendix).
LAPACK-like routines.
Although /pdense provides a feature-complete BLAS3 layer, the support for LAPACK-like routines is more limited. Only "single-sided" decompositions related to linear system solution are provided.
The following parallel LAPACK-like routines are available:
- ppotrf(): Cholesky decomposition of a symmetric/Hermitian positive definite matrix.
- pgetrf(): LU decomposition of a general matrix, with partial pivoting.
- psytrf(): L*D*L' decomposition of a symmetric matrix, with tile pivoting.
- phetrf(): L*D*L' decomposition of a Hermitian matrix, with tile pivoting.
The following example (tutorial/pdense/psytrf.cpp) performs a runtime comparison between sequential and parallel L*D*L' decomposition (sytrf() versus psytrf()). The parallel version is faster, but also less accurate because it uses a weaker tile-pivoting strategy. Special caution is warranted when using psytrf() and phetrf(), especially in single precision. Backwards refinement can also be used to improve the solution (see the /iterative tutorial). The Cholesky and LU decompositions (ppotrf() and pgetrf(), respectively) do not have these deficiencies. The former is naturally very stable, and the latter implements the same partial-pivoting strategy as sequential LAPACK (thus favoring stability over speed/parallelism).
---- sequential ----
factor A = L*D*L' time: 11.1376 sec
solve A*X=B time: 20.9472 sec
|A*X-B| = 2.06205e-09
---- parallel ----
factor A = L*D*L' time: 1.04306 sec
solve A*X = B time: 2.46214 sec
|A*X-B| = 1.17805e-09
Each of these /pdense routines is layout compatible with the corresponding /dense routine. You may also freely mix thread-parallel and sequential routines for different phases of the linear solution process. For instance, you may factor A=L*U in parallel using pgetrf() and then backsolve x=A\b using the sequential trsm(). This would be appropriate if A is large but x only has one column (backsolving just one column is memory-bound, so adding additional threads may actually slow it down due to cache eviction issues).
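For a sense of why a one-column backsolve is memory-bound: it sweeps the entire O(n^2) factor once while performing only O(n^2) flops. A minimal standalone forward-substitution sketch (plain C++, a hypothetical helper, not the library's trsm()):

```cpp
#include <vector>

// Solve L*y = b for a dense lower-triangular L (row-major, n x n).
// Every entry of L is read exactly once, so the loop streams O(n^2)
// data while doing only O(n^2) flops: memory traffic dominates.
std::vector<double> forward_substitute(const std::vector<double>& L,
                                       const std::vector<double>& b)
{
  std::size_t n = b.size();
  std::vector<double> y(n);
  for (std::size_t i = 0; i < n; ++i)
  {
    double s = b[i];
    for (std::size_t j = 0; j < i; ++j) s -= L[i*n + j] * y[j];
    y[i] = s / L[i*n + i];
  }
  return y;
}
```

With a flop-to-byte ratio this low, extra threads mostly contend for the same memory bandwidth, which is the effect described above.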
Appendix: Optional parameters.
Most of the routines in /pdense can take a pdense::Options structure at the end of their signature. It generally controls threading behavior. The following properties are available for user tuning:
nthreads: specifies how many threads should be used, defaults to getenv("MYRAMATH_NUM_THREADS")
progress: pointer to a ProgressMeter for visual feedback during long calculations, defaults to nullptr (no progress metering)
blocksize: sets algorithmic blocksize for tuning purposes, defaults to 256
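The nthreads default above is read from the environment. A sketch of that kind of lookup (a hypothetical standalone helper, not the library's actual code):

```cpp
#include <cstdlib>

// Read a positive thread count from an environment variable,
// falling back to a default when it is unset or malformed.
int thread_count_from_env(const char* name, int fallback)
{
  const char* s = std::getenv(name);
  if (s == nullptr) return fallback;
  int n = std::atoi(s);
  return (n > 0) ? n : fallback;
}
```

Setting MYRAMATH_NUM_THREADS once in the environment gives a process-wide default, while an explicit nthreads in a pdense::Options overrides it for a single call.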
Note that pdense::Options can be used to choose thread-parallelism and algorithmic blocking on a call-by-call basis. The following example (tutorial/pdense/options.cpp) performs a runtime comparison between getrf() and pgetrf(), using 2 threads to factor and backsolve. Each call provides visual feedback on std::cout, by injecting a user-defined ProgressMeter.