Overview

The /jobgraph folder contains tools for writing thread-parallel code based upon MyraMath's JobGraph concept. A JobGraph is a container of executable code statements, along with their captured referential environment, much like a closure or a C++11 lambda function. However, in contrast to a C++11 lambda, a JobGraph executes in parallel over multiple threads, not serially.

All of MyraMath's parallel algorithms (like the BLAS3/LAPACK routines in /pdense and the Solver's in /multifrontal) are implemented in terms of JobGraph's. Users can compose their own higher-level JobGraph's by combining the low-level ones already defined in Myramath (see ParallelJobGraph, SequentialJobGraph and FusedJobGraph). Alternatively, users can create their own low-level JobGraph's from the bottom-up using UserJobGraph and LambdaJobGraph. Tools for debugging and visualizing JobGraph's are also available.

Below are links to the sections of this page:

What is a JobGraph?

The JobGraph class (and the /jobgraph ecosystem) is a small toolkit for building thread-parallel C++ programs. Like OpenMP, JobGraph is a shared-memory programming model designed for multicore machines. OpenMP is typically used to parallelize embarrassingly parallel for-loops of homogeneous code using "bolted-on" #pragma's (a SIMD-like model). In contrast, JobGraph is task-based, where heterogeneous fragments of code (Job's) are processed asynchronously according to their declared dependencies (a MIMD-like model). JobGraph is more intrusive, but can be applied to a wider variety of workloads and "scales up" more naturally using a language of composition.

Consider the example psuedocode below, which performs A=L*L' (Cholesky) decomposition of a 4x4 block matrix:

 +-------------------------------------------------------+-----------------------------------------+
 | Serial program                                        | OpenMP program                          |
 +-------------------------------------------------------+-----------------------------------------+
 | for k=[0 to 4)                                        | for k=[0 to 4)                          |
 |   {                                                   |   {                                     |
 |   factor A(k,k) into L(k,k)*L(k,k)'     [potrf(k)]    |   factor A(k,k) into L(k,k)*L(k,k)'     |
 |   for i=[k+1 to 4)                                    |   #pragma omp parallel for              |
 |     solve L(i,k)*L(k,k) = A(i,k)        [trsm(k,i)]   |   for i=[k+1 to 4)                      |
 |   for j=[k+1 to 4)                                    |     solve L(i,k)*L(k,k) = A(i,k)        |
 |     {                                                 |   for j=[k+1 to 4)                      |
 |     downdate A(j,j) -= L(j,k)*L(j,k)'   [syrk(k,j)]   |     {                                   |
 |     for i=[j+1 to 4)                                  |     downdate A(j,j) -= L(j,k)*L(j,k)'   |
 |       downdate A(i,j) -= L(i,k)*L(j,k)' [gemm(k,i,j)] |     # pragma omp parallel for           |
 |     }                                                 |     for i=[j+1 to 4)                    |
 |   }                                                   |       downdate A(i,j) -= L(i,k)*L(j,k)' |
 |                                                       |     }                                   |
 |                                                       |   }                                     |
 +-------------------------------------------------------+-----------------------------------------+

In its current form, this code is rather resistant to bolt-on parallelization using #pragma omp parallel for directives. Each iteration of the outermost loop (k) carries many dependencies from the previous iteration, disqualifying it from parallelization. And the body of the k-loop has a heterogeneous payload of various sub-loops, but no single for-loop has an especially large iteration count. Assuming all tasks take equal time, adding pragma's to the two embarrassingly parallel loops (the trsm() loop and the gemm() loop) will yield the following speedup:

+------+----------------+----------------------------------+---------------------------------------------------------------------+
| Time | Serial program | OpenMP program, #pragma-for SIMD | JobGraph program, task-based MIMD                                   |
+------+----------------+----------------------------------+---------------------------------------------------------------------+
|   1  |  potrf(0)      |  potrf(0)                        |  potrf(0)                                                           |
|   2  |  trsm(0,1)     |  trsm(0,1) trsm(0,2) trsm(0,3)   |  trsm(0,1) trsm(0,2) trsm(0,3)                                      |
|   3  |  trsm(0,2)     |  syrk(0,1)                       |  syrk(0,3) gemm(0,3,1) syrk(0,1) gemm(0,3,2) gemm(0,2,1) syrk(0,2)  |
|   4  |  trsm(0,3)     |  gemm(0,2,1) gemm(0,3,1)         |  potrf(1)                                                           |
|   5  |  syrk(0,1)     |  syrk(0,2)                       |  trsm(1,2) trsm(1,3)                                                |
|   6  |  gemm(0,2,1)   |  gemm(0,3,2)                     |  syrk(1,3) gemm(1,3,2) syrk(1,2)                                    |
|   7  |  gemm(0,3,1)   |  syrk(0,3)                       |  potrf(2)                                                           |
|   8  |  syrk(0,2)     |  potrf(1)                        |  trsm(2,3)                                                          |
|   9  |  gemm(0,3,2)   |  trsm(1,2) trsm(1,3)             |  syrk(2,3)                                                          |
|  10  |  syrk(0,3)     |  syrk(1,2)                       |  potrf(3)                                                           |
|  11  |  potrf(1)      |  gemm(1,3,2)                     |                                                                     |
|  12  |  trsm(1,2)     |  syrk(1,3)                       |                                                                     |
|  13  |  trsm(1,3)     |  potrf(2)                        |                                                                     |
|  14  |  syrk(1,2)     |  trsm(2,3)                       |                                                                     |
|  15  |  gemm(1,3,2)   |  syrk(2,3)                       |                                                                     |
|  16  |  syrk(1,3)     |  potrf(3)                        |                                                                     |
|  17  |  potrf(2)      |                                  |                                                                     |
|  18  |  trsm(2,3)     |                                  |                                                                     |
|  19  |  syrk(2,3)     |                                  |                                                                     |
|  20  |  potrf(3)      |                                  |                                                                     |
+------+----------------+----------------------------------+---------------------------------------------------------------------+

The total time to complete the algorithm was reduced from 20 units (left column) to 16 units (middle column), not great but perhaps commensurate with the level of programming effort. A particularly painful artifact of the current implementation is that the potrf(1) task, which only depends upon three other tasks (potrf(0), trsm(0,1), and syrk(0,1)), is only performed at time t=8 even with OpenMP parallelization. Delays like this lead to a needlessly long "tail" of computation. It would be preferable to start that task much sooner, but the code (in its current form) does a poor job of expressing the minimal critical path of dependencies to execute potrf(1).

This discussion is admittedly unfair to OpenMP, as there are smarter ways to parallelize this algorithm by reorganizing it and using more advanced OpenMP directives (the tasking model). JobGraph is exactly that sort of reorganization. A careful tabulation of all the read/write dependencies of the algorithm into a directed acyclic graph yields the following result:

Click here to view fullsize image.

If every Job was executed as soon as its critical path allowed it, considerable time savings could be accumulated (right column). Since potrf(0) has no prior dependencies, execution would begin() there immediately at t=1. All of the trsm(0,*) Job's can then run at t=2, just like the #pragma case. However, the JobGraph approach pulls ahead at t=3, as all of the syrk(0,*) and gemm(0,*,*) Job's can be executed in parallel (despite their heterogeneity), a concept that is difficult to express using bolt-on #pragma's. This results in higher resource utilization (up to six active processing units instead of three) and enables potrf(1) to execute as early as t=4. Further gains accumulate for k=1 and k=2. As the potrf(3) has no subsequent dependencies, execution will end() there at t=10, in roughly half the time as the original serial algorithm.

Executing existing JobGraph's

All of MyraMath's parallel algorithms are implemented in terms of JobGraph. For example, pdense/pgemm.h defines an "immediate" routine, pgemm_inplace(), that calculates C := alpha*A*B + beta*C in parallel, but it also defines a "deferred" version, pgemm_jobgraph(). Calling pgemm_jobgraph() requires the same arguments as pgemm_inplace(), but it doesn't actually calculate anything. Instead, it captures the arguments to the function (like a C++11 lambda) and returns a JobGraph that can "fulfill" the calculation at some point in the future, when you call execute() upon it.

All of the algorithms in /pdense follow the same pattern: exposing an "immediate" version and a "deferred" version that does nothing but return a JobGraph, denoted with a _jobgraph() suffix in the function name. The following _jobgraph() routines are exposed:

pgemm_jobgraph(), performs general matrix-matrix multiplication
ptrmm_jobgraph(), multiplies by a triangular matrix
ptrsm_jobgraph(), backsolves by a triangular matrix
psyrk_jobgraph(), performs a symmetric rank-k update
psyr2k_jobgraph(), performs a symmetric rank-2k update
pherk_jobgraph(), performs a hermitian rank-k update
pher2k_jobgraph(), performs a hermitian rank-2k update
phemm_jobgraph(), multiplies by a hermitian matrix
psymm_jobgraph(), multiplies by a symmetric matrix
ppotrf_jobgraph(), performs L*L' decomposition of a symmetric positive matrix
psytrf_jobgraph(), performs L*D*L' decomposition of a symmetric matrix
phetrf_jobgraph(), performs L*D*L' decomposition of a hermitian matrix
pgetrf_jobgraph(), performs L*U decompositions of a general matrix

The parallel algorithms in /multifrontal are typically member functions of a Solver class, not free-standing non-member functions. Each Solver offers both an "immediate" version and "deferred" _jobgraph() version of the following algorithms:

Solver.factor_jobgraph(), decomposes a sparse A into triangular factors
Solver.solve_jobgraph(), performs sparse backsolution
Solver.schur_jobgraph(), forms a sparse B'*inv(A)*B schur complement
Solver.inverse_jobgraph(), samples inverse(A)
Solver.partialsolve_jobgraph(), performs sparse backsolution

Consult the source level documentation for each of these algorithms for further details.

Composing JobGraph's

On their own, the _jobgraph() versions of all these algorithms might seem superfluous. In particular, why bother capturing the JobGraph of an algorithm only to execute() it later, when you could just call the "immediate" version at that time instead? The answer is that these atomic JobGraph's can be composed together into larger ones, in such a way that they expose more opportunities for parallelism and therefore run faster.

The main tool for this kind of composition is ParallelJobGraph (and the parallelize() helper function). It's deceptively simple: given a collection of independent JobGraph's (in the sense that they do not have any read/write hazards among their captured arguments, embarrassingly parallel), ParallelJobGraph combines them into a larger one. Consider tutorial/jobgraph/ParallelJobGraph.cpp, listed below, which computes the Cholesky decomposition of two Matrix's (A1 and A2) at the same time:

 #include <myramath/utility/Timer.h>
 
 #include <myramath/jobgraph/ParallelJobGraph.h>
 #include <myramath/jobgraph/execute.h>
 #include <myramath/jobgraph/graphviz.h>
 
 #include <myramath/dense/Matrix.h>
 #include <myramath/dense/LowerMatrix.h>
 
 #include <myramath/pdense/psyrk.h>
 #include <myramath/pdense/ppotrf.h>
 
 #include <tests/myratest.h>
 
 #include <iostream>
 
 using namespace myra;
 
 ADD_TEST("jobgraph_ParallelJobGraph","[jobgraph]") // int main()
   {
   // Make two random symmetric positive Matrix's, A1 and A2.
   int N1 = 1152;
   int N2 = 1523;
   auto G1 = Matrix<double>::random(N1,N1);
   auto G2 = Matrix<double>::random(N2,N2);
   auto A1 = psyrk(G1);
   auto A2 = psyrk(G2);
   // Measure the time to complete two "immediate" ppotrf_inplace() calls.
   auto A1_immediate = A1;
   auto A2_immediate = A2;
   Timer timer1;
   ppotrf_inplace(A1_immediate);
   ppotrf_inplace(A2_immediate);
   double t1 = timer1.elapsed_time();
   std::cout << "t1 (immediate) = " << t1 << std::endl;
   // Measure the time required by a ParallelJobGraph that performs the same task.
   auto A1_jobgraph = A1;
   auto A2_jobgraph = A2;
   Timer timer2;
   execute( parallelize( ppotrf_jobgraph(A1), ppotrf_jobgraph(A2) ) );
   double t2 = timer2.elapsed_time();
   std::cout << "t2 (jobgraph) = " << t2 << std::endl;
   // Evaluate speedup of the ParallelJobGraph approach.
   std::cout << "speedup = " << t1/t2 << std::endl;
   }

t1 (immediate) = 0.677039
t2 (jobgraph) = 0.510029
speedup = 1.32745

Composing JobGraph's like this can yield meaningful speedups (say 10-20%) with little programming effort required. The figure below depicts the resulting ParallelJobGraph and helps explain why it can be faster: by aggregating the two ppotrf_jobgraph() calls into one, multiple processing units (at least two) can consistently be put to use, even as each individual factorization draws to a close.

Click here to view fullsize image.

The complementary class to ParallelJobGraph is SequentialJobGraph (and the sequentialize() helper function). This class is used to compose two (or more) JobGraph's that (unfortunately) have some sort of data hazard. Perhaps the outputs of one JobGraph are the inputs to the next? Or multiple JobGraph's would "race" to update the same output quantity? Both of these cases would require the use of a SequentialJobGraph to enforce the dependency. Consider tutorial/jobgraph/SequentialJobGraph.cpp, which builds on the previous example:

 #include <myramath/utility/Timer.h>
 
 #include <myramath/jobgraph/ParallelJobGraph.h>
 #include <myramath/jobgraph/SequentialJobGraph.h>
 #include <myramath/jobgraph/execute.h>
 #include <myramath/jobgraph/graphviz.h>
 
 #include <myramath/dense/Matrix.h>
 #include <myramath/dense/LowerMatrix.h>
 #include <myramath/dense/frobenius.h>
 
 #include <myramath/pdense/psyrk.h>
 #include <myramath/pdense/ppotrf.h>
 #include <myramath/pdense/ptrsm.h>
 #include <myramath/pdense/psymm.h>
 
 #include <tests/myratest.h>
 
 #include <iostream>
 
 using namespace myra;
 
 ADD_TEST("jobgraph_SequentialJobGraph","[jobgraph]") // int main()
   {
   // Make two random symmetric positive Matrix's, A1 and A2.
   int N1 = 1152;
   int N2 = 1523;
   auto G1 = Matrix<double>::random(N1,N1);
   auto G2 = Matrix<double>::random(N2,N2);
   auto A1 = psyrk(G1);
   auto A2 = psyrk(G2);
   auto L1 = A1;
   auto L2 = A2;
   // Make random forcing data B1 and B2.
   int J1 = 322;
   int J2 = 507;
   auto B1 = Matrix<double>::random(N1,J1);
   auto B2 = Matrix<double>::random(N2,J2);
   // Form workspaces R1 and R2 to measure residuals.
   auto R1 = B1;
   auto R2 = B2;
   double one = 1.0;
   // Build SequentialJobGraph's that factor/solve/check each 
   // Ai*Xi=Bi problem, then combine them into a ParallelJobGraph.
   JobGraph g = 
     parallelize(
       sequentialize(
         ppotrf_jobgraph(L1),                  // factor A1 = L1*L1'
         ptrsm_jobgraph('L','N',L1,B1),        // update B1 = L1\B1
         ptrsm_jobgraph('L','T',L1,B1),        // update B1 = L1'\B1
         psymm_jobgraph('L',R1,A1,B1,-one,one) // form R1 = A1*X1-B1
         ),
       sequentialize(
         ppotrf_jobgraph(L2),                  // factor A2 = L2*L2'
         ptrsm_jobgraph('L','N',L2,B2),        // update B2 = L2\B2
         ptrsm_jobgraph('L','T',L2,B2),        // update B2 = L2'\B2
         psymm_jobgraph('L',R2,A2,B2,-one,one) // form R2 = A2*X2-B2
         )
       );
   execute(g);
   // Verify A1*X1-B1=0, and A2*X2-B2=0
   double error1 = frobenius(R1);
   double error2 = frobenius(R2);
   std::cout << "|A1*X1-B1| = " << error1 << std::endl;
   std::cout << "|A2*X2-B2| = " << error2 << std::endl;
   }

|A1*X1-B1| = 8.17575e-07

|A2*X2-B2| = 2.53889e-06

Instead of just factoring A1 and A2, this example parallelizes a pair of "factor/solve/check" problems. The two problems are independent, but the various substages require sequence points (you must factor A1 before you can solve X1=A1\B1, and you must solve for X1 before you can form the residual R1=A1*X1-B1). SequentialJobGraph and/or sequentialize() can be used to enforce these sequence points, but still returns JobGraph's suitable for parallelize()'ing later. Conceptually, sequentialize()'ing two JobGraph's g1 and g2 is straightforward: just add a dependency (an edge) from every end() Job of g1 into every begin() Job of g2. The figure below depicts the final JobGraph:

Click here to view fullsize image.

The last compositional tool to discuss is FusedJobGraph. The tools introduced so far have been "black-or-white" in their approach to describing dependencies between subgraphs. At one extreme, ParallelJobGraph is used to indicate two JobGraph's g1 and g2 are embarassingly parallel (none of the Job's in g2 depend upon any Job in g1). On the other hand, SequentialJobGraph is used to indicate total dependency (none of the Job's in g2 can begin until every Job in g1 is finished). To cover the "gray" area between these two extremes, there is FusedJobGraph.

To motivate FusedJobGraph, consider the psuedocode snippets below. They encode the four substages to solving A*X=B for symmetric positive A: (i) Cholesky factorization for A=L*L', (ii) forward solution to find X=L\B, (iii) back solution to find X=L'\B, and (iv) multiplication to form the residual R=B-A*X.

 +---------------------------------------+--------------------------------------+---------------------------------------+--------------------------------------+
 | (i) Cholesky factor A=L*L'            | (ii) Forward solve X=L\B             | (iii) Back solve X=L'\B               | (iv) Form R = B-A*X                  |
 +---------------------------------------+--------------------------------------+---------------------------------------+--------------------------------------+
 | for k=[0 to N)                        | for j=[0 to J)                       | for j=[0 to J)                        | for i=[0 to N)                       |
 |   {                                   |   for k=[0 to N)                     |   for k=(N downto 0]                  |   for j=[0 to J)                     |
 |   factor A(k,k) = L(k,k)*L(k,k)'      |     {                                |     {                                 |     for k=[0 to N)                   |
 |   for i=[k+1 to N)                    |     solve L(k,k)*X(k,j) = B(k,j)     |     solve L(k,k)'*X(k,j) = B(k,j)     |       update R(i,j) -= A(i,k)*X(k,j) |
 |     solve L(i,k)*L(k,k)' = A(i,k)     |     for i=[k+1 to N)                 |     for i=[0 to k)                    |                                      |
 |   for j=[k+1 to N)                    |       update B(i,j) -= L(i,k)*X(k,j) |       update B(i,j) -= L(k,i)'*X(k,j) |                                      |
 |     {                                 |     }                                |     }                                 |                                      |
 |     update A(j,j) -= L(j,k)*L(j,k)'   |                                      |                                       |                                      |
 |     for i=[j+1 to N)                  |                                      |                                       |                                      |
 |       update A(i,j) -= L(i,k)*L(j,k)' |                                      |                                       |                                      |
 |     }                                 |                                      |                                       |                                      |
 |   }                                   |                                      |                                       |                                      |
 +---------------------------------------+--------------------------------------+---------------------------------------+--------------------------------------+

In the previous section, it was pointed out that the substages (i) through (iv) are not embarrassingly parallel, so a SequentialJobGraph was used to sequence them. However, a closer inspection reveals that assuming this kind of "total dependency" is a pessimization. For example, the first "row" of the forward solution stage, X(0,:)=L(0,0)\B(0,:), could be started as soon as the first pivot factorization A(0,0)=L(0,0)*L(0,0)' is complete. The code listing below, tutorial/jobgraph/FusedJobGraph.cpp carefully encodes the minimum set of fine-grained dependencies between (i)-(iv) into a FusedJobGraph.

 #include <myramath/utility/Timer.h>
 
 #include <myramath/jobgraph/SequentialJobGraph.h>
 #include <myramath/jobgraph/ParallelJobGraph.h>
 #include <myramath/jobgraph/FusedJobGraph.h>
 #include <myramath/jobgraph/verify.h>
 #include <myramath/jobgraph/JobID.h>
 #include <myramath/jobgraph/execute.h>
 
 #include <myramath/dense/LowerMatrix.h>
 #include <myramath/dense/Matrix.h>
 #include <myramath/dense/frobenius.h>
 
 #include <myramath/pdense/Options.h>
 #include <myramath/pdense/ppotrf.h>
 #include <myramath/pdense/psymm.h>
 #include <myramath/pdense/psyrk.h>
 #include <myramath/pdense/ptrsm.h>
 
 #include <tests/myratest.h>
 
 #include <iostream>
 
 using namespace myra;
 
 ADD_TEST("jobgraph_FusedJobGraph","[jobgraph]") // int main()
   {
   // Create symmetric positive A, and random forcing data B.
   int asize = 1024;
   int bsize = 1024;
   int blocksize = 256;
   auto G = Matrix<double>::random(asize,asize);
   auto B = Matrix<double>::random(asize,bsize);
   auto A = psyrk(G);
   double one = 1.0;
   typedef pdense::Options Options;
   Options option = Options::create().set_blocksize(blocksize);
   // Factor A1=L1*L1', solve L1*L1'*X1=B1 and measure R1=B1-A1*X1 using SequentialJobGraph (reference).
   auto L1 = A;
   auto X1 = B;
   auto R1 = B;
   SequentialJobGraph graph1;
   graph1.push_back( ppotrf_jobgraph(L1) );
   graph1.push_back( ptrsm_jobgraph('L','N',L1,X1) );
   graph1.push_back( ptrsm_jobgraph('L','T',L1,X1) );
   graph1.push_back( psymm_jobgraph('L',R1,A,X1,-one,one) );
   // Factor A2=L2*L2', solve L2*L2'*X2=B2 and measure R2=B2-A2*X2 using FusedJobGraph.
   auto L2 = A;
   auto X2 = B;
   auto R2 = B;  
   FusedJobGraph graph2;
   int g1 = graph2.insert( ppotrf_jobgraph(L2) );
   int g2 = graph2.insert( ptrsm_jobgraph('L','N',L2,X2) );
   int g3 = graph2.insert( ptrsm_jobgraph('L','T',L2,X2) );
   int g4 = graph2.insert( psymm_jobgraph('L',R2,A,X2,-one,one,'B') );
   // Fusing (g1,g2,g3,g4) together depends upon packing JobID's 
   // based on these algorithmic block sizes (NxN system, J rhs) ..
   int N = asize/blocksize;
   int J = bsize/blocksize;
   // .. packing schemes for (g1,g2,g3,g4)
   Pack<int,3> p1 = pack(N,N,N);
   Pack<int,3> p2 = pack(N,N,J);
   Pack<int,3> p3 = pack(N,N,J);
   Pack<int,3> p4 = pack(N,N,J);
   // Add fine-grained dependencies between g1->g2 ..
   for (int j = 0; j < J; ++j)
     for (int k = 0; k < N; ++k)
       {
       // .. must factor A(k,k)=L(k,k)*L'(k,k) before solving X(k,:)=L(k,k)\B(k,:)
       graph2.add_edge( g1,JobID(p1,pack(k,k,k)), g2,JobID(p2,pack(k,k,j)) );      
       for (int i = k+1; i < N; ++i)
         // .. must solve L(i,k)*L'(k,k)=A(i,k) before updating B(i,:)-=L(i,k)*X(k,j)
         graph2.add_edge( g1,JobID(p1,pack(k,i,k)), g2,JobID(p2,pack(k,i,j)) );
       }
   // Add fine-grained dependencies between g2->g3 ..
   for (int j = 0; j < J; ++j)
     // .. must forward solve L\X(end,:) before back solving L'\X(end,:)
     graph2.add_edge( g2,JobID(p2,pack(N-1,N-1,j)), g3,JobID(p3,pack(N-1,N-1,j)) );
   // Add fine-grained dependencies between g3->g4 ..
   for (int j = 0; j < J; ++j)
     for (int i = 0; i < N; ++i)
       for (int k = 0; k < N; ++k)
         // .. must back solve X(k,j) before accumulating R(i,j) -= A(i,k)*X(k,j)
         graph2.add_edge( g3,JobID(p3,pack(k,k,j)), g4,JobID(p4,pack(k,i,j)) );    
   // Execute graph1 and graph2, measure speedup.
   double t1 = ticktock([&graph1](){execute(graph1);});
   double t2 = ticktock([&graph2](){execute(graph2);});
   std::cout << "SequentialJobGraph time = " << t1 << "sec" << std::endl;
   std::cout << "FusedJobGraph time = " << t2 << "sec" << std::endl;  
   std::cout << "speedup = " << t1/t2 << std::endl;
   // Verify residuals of X1 and X2.
   std::cout << "|R1=A*X1-B| = " << frobenius(R1) << std::endl;
   std::cout << "|R2=A*X2-B| = " << frobenius(R2) << std::endl;
   }
 
 

SequentialJobGraph time = 1.8001sec
FusedJobGraph time = 1.30707sec
speedup = 1.3772
|R1=A*X1-B| = 9.83119e-09
|R2=A*X2-B| = 9.84249e-09

The resulting FusedJobGraph is shown below (left). The various substages are color-coded for ease of reading. For sake of comparison, a SequentialJobGraph that performs the exact same work is shown below, too (right). As demonstrated by the output of the test program above, the FusedJobGraph runs about 20-30% faster than the SequentialJobGraph, because removing these "total dependency" pessimizations shrinks the critical path of execution considerably.

Click here to view fullsize image.

For better or worse, FusedJobGraph is a specialist's tool. Like the other compositional JobGraph's, FusedJobGraph is nothing more than a container for other subgraphs. However, the user is required to correctly encode all of the fine-grained inter-subgraph dependencies explicitly, a task that demands fairly deep knowledge of the underlying algorithms and how they are mapped onto JobGraph's. That said, MyraMath does use this technique internally (/multifrontal algorithms are basically fused collections of /pdense algorithms), so it is exposed for the advanced user.

User-defined JobGraph's

To complement the top-down compositional tools described above, there is also a UserJobGraph, useful for constructing arbitrary JobGraph's from the bottom-up. Arbitrary code (Job's) in the form of C++11 lambda's can be injected using UserJobGraph.insert(), then UserJobGraph.add_edge() is used to specify dependencies. Once finished, the graph can either be execute()'d, or further composed using parallelize() or sequentialize(), just like any other JobGraph. The code example below, tutorial/jobgraph/UserJobGraph.cpp shows the idea in action, implementing a single level of the Strassen algorithm to parallelize Matrix*Matrix multiplication:

 #include <myramath/utility/Timer.h>
 
 #include <myramath/expression/Expression.h>
 #include <myramath/expression/arithmetic.h>
 
 #include <myramath/jobgraph/UserJobGraph.h>
 #include <myramath/jobgraph/execute.h>
 
 #include <myramath/dense/Matrix.h>
 #include <myramath/dense/gemm.h>
 #include <myramath/dense/frobenius.h>
 #include <myramath/dense/expr.h>
 
 #include <tests/myratest.h>
 
 #include <iostream>
 
 using namespace myra;
 
 ADD_TEST("jobgraph_UserJobGraph","[jobgraph]") // int main()
   {
   // Define random Matrix's A and B, and zero-initialize C to accumulate A*B. Each is size 2Nx2N.
   int N = 1024;
   auto A = Matrix<double>::random(2*N,2*N);
   auto B = Matrix<double>::random(2*N,2*N);
   auto C = Matrix<double>::zeros(2*N,2*N);
   auto AB = Matrix<double>::zeros(2*N,2*N);
   // Measure time to form reference A*B using gemm()
   Timer timer;
   gemm_inplace(AB,A,'N',B,'N');
   double t0 = timer.reset();
   // Partition A into NxN sub-Range's [A11 A12; A21 A22]
   typedef MatrixRange<double> Range;
   Range A11 = A.top(N).left(N);
   Range A12 = A.top(N).right(N);
   Range A21 = A.bottom(N).left(N);
   Range A22 = A.bottom(N).right(N);
   // Partition B.
   Range B11 = B.top(N).left(N);
   Range B12 = B.top(N).right(N);
   Range B21 = B.bottom(N).left(N);
   Range B22 = B.bottom(N).right(N);
   // Partition C.
   Range C11 = C.top(N).left(N);
   Range C12 = C.top(N).right(N);
   Range C21 = C.bottom(N).left(N);
   Range C22 = C.bottom(N).right(N);
   // Strassen algorithm requires 7 workspace Matrix's of size NxN ..
   std::vector< Matrix<double> > M;
   for (int i = 0; i < 7; ++i)
     M.push_back( Matrix<double>::zeros(N,N) );
   // Insert 7 lambda's into a UserJobGraph, each will calculate one of the M[i]'s ..
   UserJobGraph G;
   uint64_t j0 = G.insert([&](){ gemm_inplace(M[0],A11+A22,'N',B11+B22,'N'); });
   uint64_t j1 = G.insert([&](){ gemm_inplace(M[1],A21+A22,'N',B11,'N'); });
   uint64_t j2 = G.insert([&](){ gemm_inplace(M[2],A11,'N',B12-B22,'N'); });
   uint64_t j3 = G.insert([&](){ gemm_inplace(M[3],A22,'N',B21-B11,'N'); });
   uint64_t j4 = G.insert([&](){ gemm_inplace(M[4],A11+A12,'N',B22,'N'); });
   uint64_t j5 = G.insert([&](){ gemm_inplace(M[5],A21-A11,'N',B11+B12,'N'); });
   uint64_t j6 = G.insert([&](){ gemm_inplace(M[6],A12-A22,'N',B21+B22,'N'); });
   // Insert 4 more lambda's into G, each populates one block of C ..
   uint64_t j11 = G.insert([&](){ C11.assign( expr(M[0])+expr(M[3])-expr(M[4])+expr(M[6]) ); });
   uint64_t j12 = G.insert([&](){ C12.assign( expr(M[2])+expr(M[4]) ); });
   uint64_t j21 = G.insert([&](){ C21.assign( expr(M[1])+expr(M[3]) ); });
   uint64_t j22 = G.insert([&](){ C22.assign( expr(M[0])-expr(M[1])+expr(M[2])+expr(M[5]) ); });
   // Enforce required dependencies between Job's ..
   // .. C11 depends upon M[0], M[3], M[4] and M[6]
   G.add_edge(j0,j11);
   G.add_edge(j3,j11);  
   G.add_edge(j4,j11);
   G.add_edge(j6,j11);
   // .. C12 depends upon M[2] and M[4]
   G.add_edge(j2,j12);
   G.add_edge(j4,j12);
   // .. C21 depends upon M[1] and M[3]
   G.add_edge(j1,j21);
   G.add_edge(j3,j21);
   // .. C22 depends upon M[0], M[1], M[2] and M[5]
   G.add_edge(j0,j22);
   G.add_edge(j1,j22);
   G.add_edge(j2,j22);
   G.add_edge(j5,j22);
   // Execute G, measure speedup versus calling gemm() directly.
   execute(G);
   double t1 = timer.reset();
   std::cout << "time for gemm(A,B) = " << t0 << std::endl;
   std::cout << "time for UserJobGraph = " << t1 << std::endl;  
   std::cout << "speedup = " << t0 / t1 << std::endl;
   // Verify that C = A*B.
   std::cout << "|C-A*B| = " << frobenius(C-AB)/frobenius(AB) << std::endl;
   }

time for gemm(A,B) = 6.73238
time for UserJobGraph = 5.51832
speedup = 1.22001
|C-A*B| = 2.62102e-15

The JobGraph produced by this example is shown below:

LambdaJobGraph (and the make_LambdaJobGraph() helper function) are close relatives of UserJobGraph. They provide syntactic sugar for a UserJobGraph that contains exactly one Job (again, in the form of a C++11 lambda). The code example below, tutorial/jobgraph/LambdaJobGraph.cpp, shows them in action, by stringing together a few messages:

 #include <myramath/jobgraph/LambdaJobGraph.h>
 #include <myramath/jobgraph/SequentialJobGraph.h>
 #include <myramath/jobgraph/execute.h>
 
 #include <tests/myratest.h>
 
 #include <iostream>
 
 using namespace myra;
 
 ADD_TEST("jobgraph_LambdaJobGraph","[jobgraph]") // int main()
   {
   SequentialJobGraph g;
   g.push_back( make_LambdaJobGraph([](){ std::cout << "Hello "; }) );
   g.push_back( make_LambdaJobGraph([](){ std::cout << "many "; }) );
   g.push_back( make_LambdaJobGraph([](){ std::cout << "worlds! "; }) );
   g.push_back( make_LambdaJobGraph([](){ std::cout << std::endl; }) );  
   execute(g);
   }
 

Hello many worlds!

The JobGraph produced by this example is shown below:

Debugging Tools

Debugging parallel algorithms can be problematic, mainly because their non-deterministic order of execution impedes reproducibility. JobGraph can be unforgiving in this regard, but there are a few tools available to aid debugging. The first is verify(const JobGraph& g), which can probe a JobGraph g for some common bugs (the presence of a cycle of dependencies, check reflexivity of parent/child relationships between Job's, etc). If verify(g) reports any problems, it's likely that execute(g) will deadlock, crash or otherwise misbehave. Unfortunately the converse isn't necessarily true. That is, even if verify(g) reports no problems, that's not sufficient to guarantee correctness/termination of execute(g), because g could have failed to encode/report every dependency between Job's. Put another way, execute() might reveal new bugs that verify() cannot detect.

The next debugging tool that's worth mentioning is graphviz(const JobGraph& g). It traverses all the Job's in g, and encodes the dependencies among them as a DAG in .dot format. This the native input format for Graphviz, a powerful graph drawing tool that can typeset/visualize the JobGraph into a variety of output formats (.pdf, .png, many more). All the JobGraph figures on this page were generated using graphviz(g). In practice, graphviz(g) is particularly useful when one is trying to reason about g in order to incorporate it into a higher-level FusedJobGraph.

When verify(g) reports no problems, and graphviz(g) looks as expected, but execute(g) misbehaves in some way (for instance, it terminates but computes the wrong answer), it might be worth trying execute_serial(g). This has the exact same signature/semantics as execute(g), but always uses a single thread. It is particularly useful for detecting "write" races, where two Job's modify the same output but there is no dependency/sequence enforced between them. Under such circumstances, execute_serial(g) will give the right answer (because it will tacitly eliminate the race) but execute(g) will not.

A final piece of advice: when developing new JobGraph's, try to separate their topological aspects from their computational aspects as much as possible. For instance, bake the topological aspects (parent/child relationships) into a base class that implements only dummy printf()'s for the computational payload, then implement the real computational routines via overrides within a derived class, later. This allows the topological part (the dependencies) to be tested independently and quickly using verify(g), graphviz(g). You can even execute(g) or execute_serial(g) with the payload of printf() statements, to examine a few permissible sequences of execution. All of the existing JobGraph's maintain this topological/computational separation.

Go back to API Tutorials