MyraMath
Tutorial for /pdense routines

Overview

The /pdense folder contains BLAS3 and LAPACK-like algorithms that work upon /dense containers like Matrix and LowerMatrix, but are internally multithreaded for improved performance. Beyond their importance as building blocks for /multifrontal solvers, these functions are also motivated by the need to manipulate the large, dense Schur complement matrices that arise in substructuring and block preconditioners. Asymptotically speaking, factoring a D-dimensional sparse matrix and factoring a (D-1)-dimensional dense Schur complement (like a Dirichlet-to-Neumann map on a boundary) carry the same space/time cost. Multithreading one of these calculations but neglecting the other would leave an imbalance in the overall solution process.

All of these functions are layout compatible with their sequential equivalents in /dense, so they can be used essentially as drop-in replacements. Often, additional parameters (algorithmic blocksize, number of applied threads) can be passed in through a trailing pdense::Options argument. A bit of caution is warranted, however, because some of the algorithms for linear system solution (psytrf() and phetrf() in particular) use weaker pivoting strategies to enhance parallelism, and the resulting decomposition might require backwards refinement to yield accurate results. Some experimentation is recommended.
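For instance, a sequential gemm_inplace() call can be swapped for pgemm_inplace() without changing the surrounding code. The following minimal sketch illustrates the pattern (the Matrix.h and Options.h include paths are assumptions, and the option values are purely illustrative):

#include <myramath/dense/Matrix.h>   // path assumed
#include <myramath/dense/gemm.h>
#include <myramath/pdense/pgemm.h>
#include <myramath/pdense/Options.h> // path assumed

using namespace myra;

void multiply_example()
  {
  auto C = Matrix<double>::random(512,512);
  auto A = Matrix<double>::random(512,512);
  auto B = Matrix<double>::random(512,512);
  // Sequential: C += A*B
  gemm_inplace(C,A,'N',B,'N',1.0,1.0);
  // Thread-parallel drop-in replacement, identical arguments:
  pgemm_inplace(C,A,'N',B,'N',1.0,1.0);
  // Optional trailing pdense::Options to tune thread count and blocksize:
  auto options = pdense::Options::create().set_nthreads(2).set_blocksize(256);
  pgemm_inplace(C,A,'N',B,'N',1.0,1.0,options);
  }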

Below are the sections of this page:

BLAS3-like routines.
LAPACK-like routines.
Appendix: Optional parameters.

BLAS3-like routines.

BLAS3 routines are good candidates to parallelize via multiple threads. Because they perform O(n^3) work upon O(n^2) data, large problem instances are CPU-bound. Within /pdense is a complete thread-parallel BLAS3 implementation that is signature compatible (and even layout compatible) with the sequential routines inside /dense.

The following example (tutorial/pdense/pgemm.cpp) performs a runtime comparison between sequential gemm() and parallel pgemm(), then verifies that the difference between the two results is negligible. Note that sequential routines and parallel routines do not always yield exactly the same answer! Although both routines perform the same arithmetic operations, they may perform them in a different order, and floating point addition is not associative, so the results can differ slightly.

#include <myramath/dense/Matrix.h>    // path assumed
#include <myramath/dense/frobenius.h> // path assumed
#include <myramath/dense/gemm.h>
#include <myramath/pdense/pgemm.h>
#include <myramath/utility/Timer.h>   // path assumed, provides ticktock()

#include <tests/myratest.h>

#include <iostream>

using namespace myra;

ADD_TEST("pdense_pgemm","[pdense]") // int main()
  {
  int I = 2048;
  int J = 2048;
  int K = 2048;
  // Inputs for gemm()
  auto C1 = Matrix<double>::random(I,J);
  auto A1 = Matrix<double>::random(I,K);
  auto B1 = Matrix<double>::random(K,J);
  // Inputs for pgemm()
  auto C2 = C1;
  auto A2 = A1;
  auto B2 = B1;
  // Measure time to gemm() versus time to pgemm()
  double t1 = ticktock([&](){gemm_inplace(C1,A1,'N',B1,'N',1.0,1.0);});
  double t2 = ticktock([&](){pgemm_inplace(C2,A2,'N',B2,'N',1.0,1.0);});
  // Compare runtime.
  std::cout << "C1 += A1*B1, gemm(): " << t1 << " sec" << std::endl;
  std::cout << "C2 += A2*B2, pgemm(): " << t2 << " sec" << std::endl;
  std::cout << "speedup = " << t1/t2 << std::endl;
  // Compare accuracy.
  std::cout << "|C1-C2| = " << frobenius(C1-C2) << std::endl;
  }
C1 += A1*B1, gemm(): 6.9724 sec
C2 += A2*B2, pgemm(): 0.437025 sec
speedup = 15.9542
|C1-C2| = 0
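In this particular run the sequential and parallel products agree exactly, but that is not guaranteed in general. The tiny standalone program below (unrelated to MyraMath) shows how merely re-associating a floating point sum changes the rounded result:

#include <cstdio>

int main()
  {
  double a = 0.1, b = 0.2, c = 0.3;
  // Floating point addition is commutative but not associative,
  // so different summation orders can round differently.
  std::printf("%.17g\n", (a + b) + c); // prints 0.60000000000000009
  std::printf("%.17g\n", a + (b + c)); // prints 0.59999999999999998
  return 0;
  }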

A full complement of parallel BLAS3-like routines is available, mirroring the sequential routines in /dense.

These routines are structured such that almost all their work occurs inside gemm(), plus some small amount (asymptotically speaking) of non-gemm() work. They should all perform well as long as your underlying BLAS implements a high-quality single-threaded gemm(). For tuning purposes, the algorithmic blocksize can be varied via a pdense::Options structure (see Appendix).
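For example, one could time pgemm_inplace() at a few candidate blocksizes to pick a good value for a particular machine. Below is a hedged sketch (the include paths for Matrix.h, Options.h and Timer.h are assumptions, and the candidate blocksizes are illustrative):

#include <myramath/dense/Matrix.h>   // path assumed
#include <myramath/pdense/pgemm.h>
#include <myramath/pdense/Options.h> // path assumed
#include <myramath/utility/Timer.h>  // path assumed, provides ticktock()

#include <iostream>

using namespace myra;

void tune_blocksize()
  {
  auto A = Matrix<double>::random(2048,2048);
  auto B = Matrix<double>::random(2048,2048);
  const int blocksizes[] = {128, 256, 512};
  for (int blocksize : blocksizes)
    {
    auto C = Matrix<double>::random(2048,2048);
    auto options = pdense::Options::create().set_blocksize(blocksize);
    double t = ticktock([&](){pgemm_inplace(C,A,'N',B,'N',1.0,1.0,options);});
    std::cout << "blocksize " << blocksize << ": " << t << " sec" << std::endl;
    }
  }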

LAPACK-like routines.

Although /pdense provides a feature-complete BLAS3 layer, the support for LAPACK-like routines is more limited. Only "single-sided" decompositions related to linear system solution are provided.

The following parallel LAPACK-like routines are available: ppotrf() (Cholesky), pgetrf() (LU with partial pivoting), and psytrf()/phetrf() (symmetric/hermitian indefinite LDL').

The following example (tutorial/pdense/psytrf.cpp) performs a runtime comparison between sequential and parallel L*D*L' decomposition (sytrf() versus psytrf()). The parallel version is faster, but also less accurate because it uses a weaker tile-pivoting strategy. Special caution is warranted when using psytrf() and phetrf(), especially in single precision. Backwards refinement can also be used to improve the solution (see the /iterative tutorial). The Cholesky and LU decompositions (ppotrf() and pgetrf(), respectively) do not have these deficiencies. The former is naturally very stable, and the latter implements the same partial-pivoting strategy as sequential LAPACK (thus favoring stability over speed/parallelism).

#include <myramath/dense/Matrix.h>      // path assumed
#include <myramath/dense/LowerMatrix.h> // path assumed
#include <myramath/dense/transpose.h>   // path assumed
#include <myramath/dense/frobenius.h>   // path assumed
#include <myramath/dense/tril.h>
#include <myramath/dense/trsm.h>
#include <myramath/dense/gemm.h>
#include <myramath/dense/sytrf.h>
#include <myramath/dense/swaps.h>
#include <myramath/utility/Timer.h>     // path assumed

#include <myramath/pdense/pgemm.h>
#include <myramath/pdense/ptrsm.h>
#include <myramath/pdense/psytrf.h>

#include <tests/myratest.h>

#include <iostream>

using namespace myra;

ADD_TEST("pdense_psytrf","[pdense]") // int main()
  {
  // Construct inputs A and B.
  typedef double Precision;
  int I = 4096;
  int J = 512;
  auto A = Matrix<Precision>::random(I,I);
  A = A + transpose(A);
  auto B = Matrix<Precision>::random(I,J);
  // Perform sequential sytrf()..
  std::cout << "---- sequential ----" << std::endl;
  LowerMatrix<Precision> L1(tril(A));
  auto B1 = B;
  Timer timer1;
  auto result1 = sytrf_inplace(L1);
  double tfactor1 = timer1.elapsed_time();
  // .. use it to solve A*X1=B, overwriting B1 with X1
  swap_rows(result1.P_swaps,B1);
  trsm_inplace('L','N',L1,B1);
  result1.R.solve(B1,'L','N');
  swap_rows(result1.Q_swaps,B1);
  B1.bottom(result1.n_minus) *= -1.0;
  iswap_rows(result1.Q_swaps,B1);
  result1.R.solve(B1,'L','T');
  trsm_inplace('L','T',L1,B1);
  iswap_rows(result1.P_swaps,B1);
  double tsolve1 = timer1.elapsed_time();
  // .. examine residual R1 = B-A*X1
  auto R1 = B;
  gemm_inplace(R1,A,'N',B1,'N',-1.0,+1.0);
  std::cout << " factor A = L*D*L' time: " << tfactor1 << " sec" << std::endl;
  std::cout << " solve A*X=B time: " << tsolve1 << " sec" << std::endl;
  std::cout << " |A*X-B| = " << frobenius(R1)/frobenius(B) << std::endl;
  std::cout << std::endl;
  // Perform parallel psytrf()..
  std::cout << "---- parallel ----" << std::endl;
  LowerMatrix<Precision> L2(tril(A));
  auto B2 = B;
  Timer timer2;
  auto result2 = psytrf_inplace(L2);
  double tfactor2 = timer2.elapsed_time();
  // .. use it to solve A*X2=B, overwriting B2 with X2
  swap_rows(result2.P_swaps,B2);
  ptrsm_inplace('L','N',L2,B2);
  result2.R.solve(B2,'L','N');
  swap_rows(result2.Q_swaps,B2);
  B2.bottom(result2.n_minus) *= -1.0;
  iswap_rows(result2.Q_swaps,B2);
  result2.R.solve(B2,'L','T');
  ptrsm_inplace('L','T',L2,B2);
  iswap_rows(result2.P_swaps,B2);
  double tsolve2 = timer2.elapsed_time();
  // .. examine residual R2 = B-A*X2
  auto R2 = B;
  pgemm_inplace(R2,A,'N',B2,'N',-1.0,+1.0);
  std::cout << " factor A = L*D*L' time: " << tfactor2 << " sec" << std::endl;
  std::cout << " solve A*X = B time: " << tsolve2 << " sec" << std::endl;
  std::cout << " |A*X-B| = " << frobenius(R2)/frobenius(B);
  std::cout << std::endl;
  }
---- sequential ----
factor A = L*D*L' time: 11.1376 sec
solve A*X=B time: 20.9472 sec
|A*X-B| = 2.06205e-09
---- parallel ----
factor A = L*D*L' time: 1.04306 sec
solve A*X = B time: 2.46214 sec
|A*X-B| = 1.17805e-09
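When the tile-pivoted factorization is not accurate enough, a step of backwards refinement can usually recover the lost digits. The fragment below is a minimal sketch that reuses the names and solve sequence from the parallel half of the listing above (one refinement step shown; badly conditioned systems may need more, see the /iterative tutorial):

// Wrap the psytrf() solve sequence from the listing into a reusable lambda.
auto apply_Ainv = [&](Matrix<Precision>& Y)
  {
  swap_rows(result2.P_swaps,Y);
  ptrsm_inplace('L','N',L2,Y);
  result2.R.solve(Y,'L','N');
  swap_rows(result2.Q_swaps,Y);
  Y.bottom(result2.n_minus) *= -1.0;
  iswap_rows(result2.Q_swaps,Y);
  result2.R.solve(Y,'L','T');
  ptrsm_inplace('L','T',L2,Y);
  iswap_rows(result2.P_swaps,Y);
  };
// One refinement step: D = A\(B - A*X), then X := X + D.
auto D = B;                                // X currently lives in B2
pgemm_inplace(D,A,'N',B2,'N',-1.0,+1.0);   // D = B - A*X
apply_Ainv(D);                             // D = A\(B - A*X)
B2 = B2 + D;                               // improved X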

Each of these /pdense routines is layout compatible with the corresponding /dense routine. You may also freely mix thread-parallel and sequential routines for different phases of the linear solution process. For instance, you may factor A=L*U in parallel using pgetrf() and then backsolve x=A\b using the sequential trsm(). This would be appropriate if A is large but x only has one column (backsolving just one column is memory-bound, so adding additional threads may actually slow it down due to cache eviction issues).
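A minimal sketch of that mixed pattern, assuming a square Matrix A and a single Ix1 right hand side b (the factor/backsolve calls mirror the getrf()/trsm() listing in the Appendix below; the header pdense/pgetrf.h provides pgetrf_panel()):

auto LU = A;                              // deep copy, overwritten by the factorization
auto P  = pgetrf_panel(LU);               // thread-parallel A = P'*L*U
auto x  = b;                              // single-column right hand side
swap_rows(P,x);                           // apply row pivots
trsm_inplace('L','L','N',LU,x,'U',1.0);   // sequential unit-lower backsolve
trsm_inplace('L','U','N',LU,x,'N',1.0);   // sequential upper backsolve, x now holds A\b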

Appendix: Optional parameters.

Most of the routines in /pdense accept a pdense::Options structure at the end of their signature; it controls threading and blocking behavior. The following properties are available for user tuning: the number of worker threads (set_nthreads()), the algorithmic blocksize (set_blocksize()), and a ProgressMeter callback for progress reporting (set_progress()).

Note that pdense::Options can be used to choose thread-parallelism and algorithmic blocking on a call-by-call basis. The following example (tutorial/pdense/options.cpp) performs a runtime comparison between getrf() and pgetrf(), using 2 threads to factor and backsolve. Each parallel call provides visual feedback on std::cout by injecting a user-defined ProgressMeter.

#include <myramath/dense/Matrix.h>              // path assumed
#include <myramath/dense/frobenius.h>           // path assumed
#include <myramath/dense/trsm.h>
#include <myramath/dense/gemm.h>
#include <myramath/dense/getrf.h>
#include <myramath/dense/swaps.h>
#include <myramath/utility/Timer.h>             // path assumed
#include <myramath/utility/UserProgressMeter.h> // path assumed

#include <myramath/pdense/ptrsm.h>
#include <myramath/pdense/pgemm.h>
#include <myramath/pdense/pgetrf.h>
#include <myramath/pdense/Options.h>            // path assumed

#include <tests/myratest.h>

#include <cstdint>
#include <iostream>
#include <string>

using namespace myra;
typedef pdense::Options Options;

namespace {

// Sequential A*X=B solve.
void test_sequential(const Matrix<double>& A, const Matrix<double>& B)
  {
  // Make deep copies of original A and B, will be overwritten with LU and X.
  Timer total;
  auto LU = A;
  auto X = B;
  Timer timer;
  std::cout << "---- sequential ----" << std::endl;
  // Factor A = L*U
  auto P = getrf_inplace(LU);
  std::cout << " factor A = L*U: " << timer.reset() << " sec" << std::endl;
  // Solve A*X=B
  swap_rows(P,X);
  trsm_inplace('L','L','N',LU,X,'U',1.0);
  trsm_inplace('L','U','N',LU,X,'N',1.0);
  std::cout << " solve A*X = B: " << timer.reset() << " sec" << std::endl;
  // Form residual R = B-A*X
  auto R = B;
  gemm_inplace(R,A,'N',X,'N',-1.0,+1.0);
  std::cout << " form R = B-A*X: " << timer.reset() << " sec" << std::endl;
  std::cout << " |R|/|B| = " << frobenius(R)/frobenius(B) << std::endl;
  std::cout << " total: " << total.reset() << std::endl;
  std::cout << std::endl;
  }

// Example ProgressMeter.
class MyProgressMeter
  {
  public:

  // Emits a little blurb.
  void begin(const std::string& name, uint64_t total)
    {
    std::cout << " (";
    std::cout.flush();
    t = total;
    w = 0;
    tenths = 0;
    }

  // Prints something every 10%
  void increment(uint64_t delta)
    {
    w += delta;
    while (w > tenths*t/10)
      {
      std::cout << tenths*10 << "% ";
      std::cout.flush();
      ++tenths;
      }
    }

  // Prints 100%
  void end()
    { std::cout << "100%)" << std::endl; }

  // Return false.
  bool kill()
    { return false; }

  private:

  uint64_t t;      // total "work" of this calculation
  uint64_t w;      // accumulates "work-delta"
  uint64_t tenths; // counts how many "tenths" have completed

  };

// Parallel A*X=B solve.
void test_parallel(const Matrix<double>& A, const Matrix<double>& B)
  {
  Timer total;
  // Make deep copies of original A and B, will be overwritten with LU and X.
  auto LU = A;
  auto X = B;
  Timer timer;
  // Build non-default options to inject into all /pdense routines.
  ProgressMeter progress = make_UserProgressMeter(MyProgressMeter());
  Options options = Options::create().set_nthreads(2).set_blocksize(256).set_progress(progress);
  std::cout << "---- parallel ----" << std::endl;
  // Factor A = L*U
  auto P = pgetrf_panel(LU,options);
  std::cout << " factor A = L*U: " << timer.reset() << " sec" << std::endl;
  // Solve A*X=B
  swap_rows(P,X);
  ptrsm_inplace('L','L','N',LU,X,'U',1.0,options);
  ptrsm_inplace('L','U','N',LU,X,'N',1.0,options);
  std::cout << " solve A*X = B: " << timer.reset() << " sec" << std::endl;
  // Form residual R = B-A*X
  auto R = B;
  pgemm_inplace(R,A,'N',X,'N',-1.0,+1.0,options);
  std::cout << " form R = B-A*X: " << timer.reset() << " sec" << std::endl;
  std::cout << " |R|/|B| = " << frobenius(R)/frobenius(B) << std::endl;
  std::cout << " total: " << total.reset() << std::endl;
  std::cout << std::endl;
  }

} // namespace

ADD_TEST("pdense_options","[pdense]") // int main()
  {
  int I = 2048;
  int J = 512;
  auto A = Matrix<double>::random(I,I);
  auto B = Matrix<double>::random(I,J);
  test_sequential(A,B);
  test_parallel(A,B);
  }
---- sequential ----
factor A = L*U: 2.04412 sec
solve A*X = B: 3.28619 sec
form R = B-A*X: 1.8061 sec
|R|/|B| = 8.77886e-12
total: 7.16741
---- parallel ----
(0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%)
factor A = L*U: 1.31708 sec
(0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%)
(0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%)
solve A*X = B: 0.944054 sec
(0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%)
form R = B-A*X: 0.811046 sec
|R|/|B| = 8.60419e-12
total: 3.09918

If no pdense::Options are explicitly provided, suitable defaults are used instead.

Go back to API Tutorials