%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%
%W  pargap4.tex            ParGAP documentation            Gene Cooperman
%%
%H  $Id: pargap4.tex,v 1.5 2001/07/28 23:49:06 gap Exp $
%%
%Y  Copyright (C) 1999-2001  Gene Cooperman
%Y    See included file, COPYING, for conditions for copying
%%

\Chapter{MasterSlave Tutorial}

\atindex{tutorial!TOP-C}{@tutorial!TOP-C}
\atindex{tutorial!MasterSlave()}{@tutorial!MasterSlave()}
This chapter assumes the background  knowledge  in  section~"Basic  TOP-C
(Master-Slave) commands". {\ParGAP} must be invoked through a script like
`pargap.sh', generated in {\ParGAP}'s `bin' directory during installation.  In
addition, there must be a `procgroup' file in the current directory  when
{\ParGAP} is called, or alternatively, `-p4pg <FULL_PATH>/procgroup'  may
be included on the `pargap.sh' command line. See Section~"Running ParGAP"
for the details. You should find a sample `procgroup' file in {\ParGAP}'s
`bin'   directory   after   installation   of    {\ParGAP}.    See    the
section~"Invoking ParGAP with Remote Slaves" for further details. Many of
the examples of this section  can  be  found  in  {\ParGAP}'s  `examples'
directory.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\Section{A simple example}

A master-slave computation is invoked when a {\ParGAP} program issues the
command `MasterSlave()'. This command is an example  of  what  is  called
``collective communication'' in MPI (although the command is not part  of
MPI). It is also sometimes called SPMD (Single Program,  Multiple  Data),
since all processes see the same code, although different  processes  may
execute different parts of the code. The `MasterSlave()' command must  be
invoked on all  processes  before  execution  can  begin.  The  following
trivial example does this. (Note that the final `\\' on a  line  that  is
still inside a string allows continuation of a string to the next  line.)
We illustrate these principles first in their simplest form,  making  all
variables global variables.  Later,  we  introduce  additional  {\ParGAP}
utilities that allow one to write in better style.

\beginexample
#Shared Data: none
#TaskInput:   counter
#TaskOutput:  counter^2 (square of counter)
#Task:        compute counter^2 from counter
ParEval( "result := []" );
ParEval( "counter := 0; \
          SubmitTaskInput := function() \
            counter := counter + 1; \
            if counter <= 10 then return counter; else return NOTASK; fi; \
          end;");
ParEval( "DoTask := x->x^2" );
ParEval( "CheckTaskResultVers1 := function( input, output ) \
            result[input] := [input, output]; \
            return NO_ACTION; \
          end;" );
ParEval( "MasterSlave( SubmitTaskInput, DoTask, CheckTaskResultVers1 )" );
Print(result);
\endexample

By default, `ParTrace = true' (see~"ParTrace"), causing the execution  to
display each input, `x', as it is passed from the master to a slave,  and
each output, `x^2', as it is passed from the slave back  to  the  master.
This behavior can be turned off by setting `ParTrace := false;'.  The
fourth argument of `MasterSlave()', `Print', is a dummy argument that is
never invoked in this example.

Note that the result list is filled in only on the master process, and was
never defined  or  modified  on  the  slave  processes.  To  remedy  this
situation, we introduce the concept of *shared data*, a globally  shared,
application-defined data structure. A  central  principle  of  the  TOP-C
model in {\ParGAP} is that any routine may ``read'' the shared data,  but
it  may  be   modified   only   by   the   application-defined   routine,
`<UpdateSharedData>()'. Hence,  if  we  wanted  the  result  list  to  be
recorded on all processes (perhaps as  a  lookup  table),  we  would  now
write:

\beginexample
ParBroadcast( PrintToString( "result := ", result ) );
\endexample

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\Section{ParSquare}

Until now, we have been using global variables.  It is better style to
use local variables, where possible.  We rewrite the above routine in
the improved style:

\beginexample
#Shared Data: result [ result is shared among all processes ]
#TaskInput:   counter
#TaskOutput:  counter^2 (square of counter)
#Task:        compute counter^2 from counter
#UpdateSharedData:  record [counter, counter^2] at index, counter, of result
MSSquare := function( range ) # expects as input a range, like [1..10]
  local counter, result,
      SubmitTaskInput, DoTask, CheckTaskResultVers2, UpdateSharedData;
  counter := range[1] - 1; # Initialize counter for use in SubmitTaskInput()
  result := [];
  SubmitTaskInput := function()
    counter := counter + 1;
    if counter <= range[Length(range)] then return counter;
    else return NOTASK;
    fi;
  end;
  DoTask := x->x^2;
  CheckTaskResultVers2 := function( input, output )
    return UPDATE_ACTION;
  end;
  UpdateSharedData := function( input, output )
    result[input] := [input, output];
  end;
  MasterSlave( SubmitTaskInput, DoTask, CheckTaskResultVers2, UpdateSharedData );
  return result;
end;

#ParSquare() is the main calling function;  It must define MSSquare on
#  all slaves before calling it in parallel.
ParSquare := function( range ) # expects as input a range, like [1..10]
  ParEval( PrintToString( "MSSquare := ", MSSquare ) );
  return ParEval( PrintToString( "MSSquare( ", range, ")" ) );
end;
\endexample
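
For concreteness, a call might then look as follows (a hypothetical
session;  with the default `ParTrace = true' each task is additionally
traced as it passes between the master and the slaves):

\beginexample
squares := ParSquare( [1..10] );;
squares[7];     # expected to be [ 7, 49 ]
\endexample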

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\Section{ParInstallTOPCGlobalFunction() and TaskInputIterator()
         (ParSquare revisited)}

This example can be written more compactly by using some of the
convenience functions provided by {\ParGAP}.  Specifically, we would
rewrite this as:

\indextt{ParInstallTOPCGlobalFunction!example}
\atindex{example!ParInstallTOPCGlobalFunction}%
{@example!\noexpand`ParInstallTOPCGlobalFunction'}
\indextt{TaskInputIterator!example}
\atindex{example!TaskInputIterator}{@example!\noexpand`TaskInputIterator'}
\beginexample
ParInstallTOPCGlobalFunction( "ParSquare", function( range )
  local result;
  result := [];
  MasterSlave( TaskInputIterator( range ),
               x->x^2,
               function( input, output ) return UPDATE_ACTION; end, 
               function( input, output ) result[input] := [input, output]; end
             );
  return result;
end );
\endexample

The usage above demonstrates the use of two utilities.

% \>ParInstallTOPCGlobalFunction( <gvar>, <function> ) F
\>ParInstallTOPCGlobalFunction( <string>, <function> )!{definition} F

This defines the global variable named by <string> (call it <gvar>) as a
function on the master and on the slaves.  On each slave, the definition
of <gvar> is given by <function>.  However, on the master, <gvar> is
defined as a function that first calls <gvar> on all slaves with the
arguments originally passed to <gvar>, and then calls <function> on the
master with the original arguments.  This is exactly the behavior that is
wanted in order to compress an invocation of `MasterSlave()' so that the
right things happen both on the master and on the slaves.  This is exactly
what we saw in the previous definition of `ParSquare' above.

\>TaskInputIterator( <collection> ) F

This function covers a common case of `SubmitTaskInput()':  it wraps a
{\GAP} iterator over <collection> (see~"ref:iterators" in the Reference
Manual) and returns a function that yields the successive elements,
returning `NOTASK' once the collection is exhausted.  Its meaning is best
understood from its definition:

\begintt
TaskInputIterator := function( collection )
  local iter;
  iter := Iterator( collection );
  return function()
           if IsDoneIterator(iter) then return NOTASK;
           else return NextIterator(iter);
           fi;
         end;
end;
\endtt
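
For example (a small hypothetical illustration;  `getNext' is just a name
chosen here), successive calls return the elements of the collection and
then `NOTASK':

\beginexample
getNext := TaskInputIterator( [1..3] );;
getNext();    # returns 1
getNext();    # returns 2
getNext();    # returns 3
getNext();    # returns NOTASK
\endexample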

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\Section{ParMultMat}

Let us now write a matrix-matrix
multiplication routine in this style.  Since matrix multiplication for
dimension $n$ requires $n^3$ operations, we can afford to spend $n^2$ time
doing any sequential work.  (A finer analysis would also consider the
number of slaves, $k$, resulting in up to $k*n^2$ time to send all
messages, depending on the MPI broadcast algorithm.)  So, a sequential
matrix multiplication program might be written as follows.  (The style
emphasizes clarity over efficiency.)

\atindex{example!SeqMultMat}{@example!\noexpand`SeqMultMat'}
\beginexample
SeqMultMat := function(m1, m2)               # sequential code
  local i, j, k, n, m2t, sum, result;
  n := Length(m1);
  result := [];
  m2t := TransposedMat(m2);
  for i in [1..n] do
    result[i] := [];
    for j in [1..n] do
      sum := 0;
      for k in [1..n] do
        sum := sum + m1[i][k]*m2t[j][k];
      od;
      result[i][j] := sum;
    od;
  od;
  return result;
end;
\endexample

We choose to define the task as the computation of a single row of the
result matrix.  This corresponds to the body of the outermost loop.

\atindex{example!ParMultMat}{@example!\noexpand`ParMultMat'}
\beginexample
#Shared Data: m1, m2t, result (three matrices)
#TaskInput:   i (row index of result matrix)
#TaskOutput:  result[i] (row of result matrix)
#Task:        Compute result[i] from i, m1, and m2
#UpdateSharedData:  Given result[i] and i, modify result on all processes.

ParInstallTOPCGlobalFunction( "ParMultMat", function(m1, m2)
  local i, n, m2t, result, DoTask, CheckTaskResult, UpdateSharedData;
  n := Length(m1);
  result := [];
  m2t := TransposedMat(m2);

  DoTask := function(i) # i is task input
    local j, k, sum;
    result[i] := [];
    for j in [1..n] do
      sum := 0;
      for k in [1..n] do
        sum := sum + m1[i][k]*m2t[j][k];
      od;
      result[i][j] := sum;
    od;
    return result[i]; # return task output, row_i
  end;
  # CheckTaskResult executes only on the master
  CheckTaskResult := function(i, row_i) # task output is row_i
    return UPDATE_ACTION; # Pass on output and input to UpdateSharedData
  end;
  # UpdateSharedData executes on the master and on all slaves
  UpdateSharedData := function(i, row_i) # task output is row_i
    result[i] := row_i;
  end;
  # We're done defining the task.  Let's do it now.
  MasterSlave( TaskInputIterator([1..n]), DoTask, CheckTaskResult,
               UpdateSharedData );
  # result is defined on all processes;  return local copy of result
  return result;
end );
\endexample
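
A quick way to exercise this routine (a hypothetical test, using random
integer matrices) is to compare its result against {\GAP}'s built-in
matrix product:

\beginexample
m1 := RandomMat( 8, 8, Integers );;  m2 := RandomMat( 8, 8, Integers );;
ParMultMat( m1, m2 ) = m1 * m2;    # should return true
\endexample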

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\Section{DefaultCheckTaskResult (as illustrated by ParSemiEchelonMatrix)}

Now that the basic principles of the TOP-C model are clear, we
investigate an example that requires most of the basic features of
{\ParGAP}, including the use of `IsUpToDate()' and
`REDO_ACTION'.  Recall the standard idiom for
`<CheckTaskResult>()'.  These issues were discussed in
the section~"Efficient Parallelism in MasterSlave() using CheckTaskResult()".

\begintt
DefaultCheckTaskResult := function( taskOutput, taskInput )
  if taskOutput = false then return NO_ACTION;
  elif not IsUpToDate() then return REDO_ACTION;
  else return UPDATE_ACTION;
  fi;
end;
\endtt

The next example is a parallelization of the function
`SemiEchelonMat()' (a form of Gaussian elimination) in the {\GAP}
library, `lib/matrix.gi'.  Unlike the previous examples, parallelizing
Gaussian elimination efficiently is a non-trivial undertaking.  This
is because a naive parallelization has poor load balancing.  A slave
executing a task in the middle will have to <sift> a row vector
through many previous row vectors, while a slave executing a task
toward the beginning or end will have little work to do.  We will
begin with a naive parallelization based on the sequential code, and
then migrate the code in a natural manner toward a more efficient
form, by analyzing the inefficiencies and applying the TOP-C model.

The reader may wish to stop and read the original code in
`lib/matrix.gi' first.  The logic of `SemiEchelonMat()' is to examine
each row vector of an input matrix, in order, reduce it by a list of
basis vectors stored in `vectors', and then add the row to `vectors'.
Upon completion, the number of leading zeroes of the row vectors in
`vectors' may not increase monotonically, but each element of
`vectors' will have a unique number of leading zeroes.  Some rows of
the input matrix may reduce to the zero vector, in which case they are
not added to `vectors'.

For the reader's convenience, the original sequential code is
reproduced here.

\beginexample
SemiEchelonMat := function( mat )

    local zero,      # zero of the field of <mat>
          nrows,     # number of rows in <mat>
          ncols,     # number of columns in <mat>
          vectors,   # list of basis vectors
          heads,     # list of pivot positions in 'vectors'
          i,         # loop over rows
          j,         # loop over columns
          x,         # a current element
          nzheads,   # list of non-zero heads
          row;       # the row of current interest

    mat:= List( mat, ShallowCopy );
    nrows:= Length( mat );
    ncols:= Length( mat[1] );

    zero:= Zero( mat[1][1] );

    heads:= ListWithIdenticalEntries( ncols, 0 );
    nzheads := [];
    vectors := [];

    for i in [ 1 .. nrows ] do
        row := mat[i];
        # Reduce the row with the known basis vectors.
        for j in [ 1 .. Length(nzheads) ] do
            x := row[nzheads[j]];
            if x <> zero then
                AddRowVector( row, vectors[ j ], - x );
            fi;
        od;
        j := PositionNot( row, zero );
        if j <= ncols then
            # We found a new basis vector.
            MultRowVector(row, Inverse(row[j]));
            Add( vectors, row );
            Add( nzheads, j);
            heads[j]:= Length( vectors );
        fi;
    od;

    return rec( heads   := heads,
                vectors := vectors );
    end;
\endexample

\atindex{example!ParSemiEchelonMat}{@example!\noexpand`ParSemiEchelonMat'}
\atindex{example!parallel Gaussian elimination}%
{@example!parallel Gaussian elimination}
\atindex{Gaussian elimination!parallel}{@Gaussian elimination!parallel}
Although {\GAP}'s Gaussian elimination algorithm appears to be
awkward to parallelize (since the next row vector in `vectors' depends
on row reduction by all previous vectors), we will see how the original
code of `SemiEchelonMat()' can be modified in a natural manner to
produce useful parallelism.  This illustrates the general TOP-C
paradigm for ``naturally'' parallelizing a sequential algorithm.

\beginexample
#Shared Data: vectors (basis vectors), heads, nzheads, mat (matrix)
#TaskInput:   i (row index of matrix)
#TaskOutput:  List of (1) j and (2) row i of matrix, mat, reduced by vectors
#               j is the first non-zero element of row i
#Task:        Compute reduced row i from mat, vectors, heads
#UpdateSharedData:  Given i, j, reduced row i, add new basis vector
#               to vectors and update heads[j] to point to it

ParInstallTOPCGlobalFunction( "ParSemiEchelonMat", function( mat )
    local zero,      # zero of the field of <mat>
          nrows,     # number of rows in <mat>
          ncols,     # number of columns in <mat>
          vectors,   # list of basis vectors
          heads,     # list of pivot positions in 'vectors'
          i,         # loop over rows
          nzheads,   # list of non-zero heads
          DoTask, UpdateSharedData;

    mat:= List( mat, ShallowCopy );
    nrows:= Length( mat );
    ncols:= Length( mat[1] );

    zero:= Zero( mat[1][1] );

    heads:= ListWithIdenticalEntries( ncols, 0 );
    nzheads := [];
    vectors := [];

    DoTask := function( i ) # taskInput = i
      local j,         # loop over columns
            x,         # a current element
            row;       # the row of current interest
      row := mat[i];
      # Reduce the row with the known basis vectors.
      for j in [ 1 .. Length(nzheads) ] do
          x := row[nzheads[j]];
          if x <> zero then
              AddRowVector( row, vectors[ j ], - x );
          fi;
      od;
      j := PositionNot( row, zero );
      if j <= ncols then return [j, row]; # return taskOutput
      else return fail; fi;
    end;
    UpdateSharedData := function( i, taskOutput )
      local j, row;
      j := taskOutput[1];
      row := taskOutput[2];
      # We found a new basis vector.
      MultRowVector(row, Inverse(row[j]));
      Add( vectors, row );
      Add( nzheads, j);
      heads[j]:= Length( vectors );
    end;
    
    MasterSlave( TaskInputIterator( [1..nrows] ), DoTask, DefaultCheckTaskResult,
                  UpdateSharedData );

    return rec( heads   := heads,
                vectors := vectors );
end );
\endexample
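
As a quick sanity check (a hypothetical session;  note that the basis
returned may differ from that of the sequential `SemiEchelonMat()', since
rows are added to `vectors' in the order in which task results arrive):

\beginexample
mat := PseudoRandom( GL( 20, 5 ) );;   # a random invertible matrix over GF(5)
r := ParSemiEchelonMat( mat );;
Length( r.vectors ) = RankMat( mat );  # should return true (both are 20 here)
\endexample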

The next section describes how to make this code more efficient.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\Section{Caching slave task outputs (ParSemiEchelonMat revisited)}

The code above is inefficient unless `nrows >> ncols'.  This is
because if `nrows' is comparable to `ncols', it will be rare for
`DoTask()' to return `fail'.  If most slaves return a result distinct
from `fail', then `DefaultCheckTaskResult()' will return an
`UPDATE_ACTION' upon receiving the output from the first slave, and it
will return a `REDO_ACTION' to all other slaves, until those slaves
execute `UpdateSharedData()'.  The inefficiency arises because a
`REDO_ACTION' causes the original slave process to re-compute
`DoTask()' from the beginning.

\atindex{example!ParSemiEchelonMat revisited}%
{@example!\noexpand`ParSemiEchelonMat' revisited}
In the case of a `REDO_ACTION', we can fix this by taking advantage of
information that was already computed.  To accomplish this, a global
variable should be defined on all slaves:
\beginexample
ParEval("globalTaskOutput := [[-1]]");
\endexample
and the routine `DoTask()' in the previous example should be modified to:

\beginexample
  DoTask := function( i )
      local j,         # loop over columns
            x,         # a current element
            row;       # the row of current interest
    if i = globalTaskOutput[1] then
      # then this is a REDO_ACTION
      row := globalTaskOutput[2]; # recover last row value
    else row := mat[i];
    fi;
    # Reduce the row with the known basis vectors.
    for j in [ 1 .. Length(nzheads) ] do
      x := row[nzheads[j]];
      if x <> zero then
        AddRowVector( row, vectors[ j ], - x );
      fi;
    od;
    j := PositionNot( row, zero );
    # save row in case of new REDO_ACTION
    globalTaskOutput[1] := i;
    globalTaskOutput[2] := row;
    if j <= ncols then return [j, row]; # return taskOutput
    else return fail; fi;
  end;
\endexample

(A perceptive reader will have noticed that it was not necessary to
also save and restore `row' from `globalTaskOutput', since this can
be found again based on the saved variable value `i'.  However, the
additional cost is small, and it illustrates potentially greater
generality of the method.)

The next section describes how to make this code more efficient.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\Section{Agglomerating tasks for efficiency
         (ParSemiEchelonMat revisited again)}

\atindex{taskAgglom}{@\noexpand`taskAgglom'}\index{agglomerating tasks}
\indextt{TaskAgglomIndex!context}
\atindex{example!taskAgglom}{@example!\noexpand`taskAgglom'}
A more efficient parallelization would partition the matrix into sets
of adjacent rows, and send an entire set as a single `taskInput'.
This would minimize the communication overhead, since the time to send a
message grows only slowly with the message size, but linearly with the
number of messages.  To reduce the number of messages (and hence the total
latency overhead), one adds an extra parameter to `MasterSlave()' in order
to bundle, perhaps, up to 5 tasks at a time.

\beginexample
MasterSlave( TaskInputIterator( [1..n] ), DoTask, DefaultCheckTaskResult,
             UpdateSharedData, 5 );
\endexample

Now the task input will be a list of the next 5 tasks returned by
`SubmitTaskInput()', or in this case by `TaskInputIterator( [1..nrows] )'.
If fewer than 5 tasks are produced before `NOTASK' is returned, then
the task input will be correspondingly shorter.  If the first input
task is `NOTASK' (yielding a list of tasks of length 0), then this
will be interpreted as a traditional `NOTASK'.  The routine `DoTask()'
itself is unchanged:  `MasterSlave()' arranges to call it repeatedly,
once for each input task in the list, and collects the individual
outputs into a corresponding list of task outputs.

Hence, this new variation requires us to rewrite `UpdateSharedData()' in
the obvious manner, to handle lists of task inputs and task outputs.  Here
is one way to adapt the earlier code.

\>`TaskAgglomIndex' V

This global variable is provided for use inside `DoTask()'.  It allows
the application code to inquire about the index of the current input task
within the agglomerated list of tasks created when the `taskAgglom'
parameter is used.  The variable is most useful in the case of a
`REDO_ACTION' or `CONTINUATION_ACTION()', as illustrated below.

\atindex{example!ParSemiEchelonMat revisited again}%
{@example!\noexpand`ParSemiEchelonMat' revisited again}
\beginexample
ParEval("globalTaskOutput := [[-1]]");
ParEval("globalTaskOutputs := []");

#Shared Data: vectors (basis vectors), heads, mat (matrix)
#TaskInput:   i (row index of matrix)
#TaskOutput:  List of (1) j and (2) row i of matrix, mat, reduced by vectors
#               j is the first non-zero element of row i
#Task:        Compute reduced row i from mat, vectors, heads
#UpdateSharedData:  Given i, j, reduced row i, add new basis vector
#               to vectors and update heads[j] to point to it

ParInstallTOPCGlobalFunction( "ParSemiEchelonMat", function( mat )
  local zero,      # zero of the field of <mat>
        nrows,     # number of rows in <mat>
        ncols,     # number of columns in <mat>
        vectors,   # list of basis vectors
        heads,     # list of pivot positions in 'vectors'
        i,         # loop over rows
        nzheads,   # list of non-zero heads
        DoTask, UpdateSharedDataWithAgglom;

  mat:= List( mat, ShallowCopy );
  nrows:= Length( mat );
  ncols:= Length( mat[1] );

  zero:= Zero( mat[1][1] );

  heads:= ListWithIdenticalEntries( ncols, 0 );
  nzheads := [];
  vectors := [];

  DoTask := function( i )
      local j,         # loop over columns
            x,         # a current element
            row;       # the row of current interest
    if IsBound(globalTaskOutputs[TaskAgglomIndex])
        and i = globalTaskOutputs[TaskAgglomIndex][1] then
      # then this is a REDO_ACTION
      row := globalTaskOutputs[TaskAgglomIndex][2][2]; # recover last row value
    else row := mat[i];
    fi;
    # Reduce the row with the known basis vectors.
    for j in [ 1 .. Length(nzheads) ] do
      x := row[nzheads[j]];
      if x <> zero then
        AddRowVector( row, vectors[ j ], - x );
      fi;
    od;
    j := PositionNot( row, zero );

    # save [input, output] in case of new REDO_ACTION
    globalTaskOutputs[TaskAgglomIndex] := [ i, [j, row] ];

    if j <= ncols then return [j, row]; # return taskOutput
    else return fail; fi;
  end;
  
  # This version of UpdateSharedData() expects a list of taskOutput's
  UpdateSharedDataWithAgglom := function( listI, taskOutputs )
    local j, row, idx, tmp;
    for idx in [1..Length( taskOutputs )] do
      j := taskOutputs[idx][1];
      row := taskOutputs[idx][2];
      
      if idx > 1 then
        globalTaskOutputs[1] := [-1, [j, row] ];
        tmp := DoTask( -1 ); # Trick DoTask() into a REDO_ACTION
        if tmp <> fail then
          j := tmp[1];
          row := tmp[2];
        fi;
      fi;

      # We found a new basis vector.
      MultRowVector(row, Inverse(row[j]));
      Add( vectors, row );
      Add( nzheads, j);
      heads[j]:= Length( vectors );
    od;
  end;
    
  MasterSlave( TaskInputIterator( [1..nrows] ), DoTask, DefaultCheckTaskResult,
                UpdateSharedDataWithAgglom, 5 ); #taskAgglom set to 5 tasks

  return rec( heads   := heads,
              vectors := vectors );
end );
\endexample

Note that in this simple example, we were able to re-use most of the
code from the previous version, at the cost of adding an additional
global variable, `globalTaskOutputs'.  In fact, the last `DoTask()' is
backward compatible with the first version of the code, in which task
agglomeration is not used.  If we wanted to run the latest code
without agglomeration of tasks, it would suffice either to set
the `taskAgglom' parameter to 1, or else to remove it entirely and
replace `UpdateSharedDataWithAgglom()' by `UpdateSharedData()'.

It is useful to experiment with the above code by substituting a
variable, `taskAgglom', for the number 5, and trying it out with
remote slaves on your own network for different values of `taskAgglom'
and for different size matrices.  You can call `MasterSlaveStats()'
to see the effect of different parameters.  Suitable pseudo-random matrices
can be quickly generated via `mat := PseudoRandom( GL( 30, 5 ) );' and
similar commands.
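
For instance (a hypothetical session):

\beginexample
mat := PseudoRandom( GL( 30, 5 ) );;
ParSemiEchelonMat( mat );;
MasterSlaveStats();
\endexample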

The paper~\cite{Coo98} is suggested as further reading for a still more
efficient parallel implementation of `ParSemiEchelonMat' (a form of
Gaussian elimination) using the TOP-C model.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\Section{Raw MasterSlave (ParMultMat revisited)} 

\atindex{raw MasterSlave()!definition}{@raw \noexpand`MasterSlave()'!definition}
\indextt{MasterSlave()!raw}
\atindex{example!ParMultMat revisited}%
{@example!\noexpand`ParMultMat' revisited}
Finally, we give an example of a variation of `MasterSlave()', based
on a ``raw'' `MasterSlave()'.  These versions are designed for the
common case of legacy code that contains deeply nested loops.
The `taskInput' may be generated inside several nested loops, making
it awkward and error-prone to produce a function, `SubmitTaskInput()',
that will generate instances of `taskInput' in the appropriate
sequence.

Effectively, when we wish to return successive values from several deeply
nested loops, we are in the situation of programming the ``opposite of a
{\GAP} iterator'' (see~"ref:iterators" in the Reference Manual).  
We are already producing successive iterator values, and we wish to 
``stuff them back into some iterator''.  Until {\GAP} develops such a
language construct `:-)', the following example of a ``raw''
`MasterSlave()' demonstrates a solution.
Before studying this example, please review the sequential version,
`SeqMultMat()' near the beginning of section~"ParMultMat".

\atindex{raw MasterSlave()!example}{@raw \noexpand`MasterSlave()'!example}
\atindex{example!raw MasterSlave()}{@example!raw \noexpand`MasterSlave()'}
We make use of the following three new {\ParGAP} functions.

\>BeginRawMasterSlave( <DoTask>, <CheckTaskResult>, <UpdateSharedData> ) F
\>RawSubmitTaskInput( <taskInput> ) F
\>EndRawMasterSlave() F

Their use will be obvious in the next example.  This time, we
parallelize `SeqMultMat()' by defining the task as the computation of a
single entry in the result matrix.  Hence, the task will be the
computation of the appropriate inner product.  For dimension $n$,
$n^2$ tasks are now generated, and each task is generated inside a
doubly nested loop.

\beginexample
#Shared Data: m1, m2t, result (three matrices)
#TaskInput:   [i,j] (indices of entry in result matrix)
#TaskOutput:  result[i][j] (value of entry in result matrix)
#Task:        Compute inner product of row i of m1 with column j of m2
#               ( Note that column j of m2 is also row j of m2t, the transpose )
#UpdateSharedData:  Given result[i][j] and [i,j], modify result everywhere

ParInstallTOPCGlobalFunction( "ParRawMultMat", function(m1, m2)
  local i, j, k, n, m2t, sum, result, DoTask, CheckTaskResult, UpdateSharedData;
  n := Length(m1);
  m2t := TransposedMat(m2);
  result := List( [1..Length(m2t)], x -> [] );  # distinct empty rows (avoid aliasing a single list)

  DoTask := function( input )
    local i,j,k,sum;
    i:=input[1]; j:=input[2];
    sum := 0;
    for k in [1..n] do
      sum := sum + m1[i][k]*m2t[j][k];
    od;
    return sum;
  end;

  CheckTaskResult := function( input, output )
    return UPDATE_ACTION;
  end;

  UpdateSharedData := function( input, output )
    local i, j;
    i := input[1]; j := input[2];
    result[i][j] := output;
    # result[i,j] := sum;
  end;

  BeginRawMasterSlave( DoTask, CheckTaskResult, UpdateSharedData );
  for i in [1..n] do
    result[i] := [];
    for j in [1..n] do
      RawSubmitTaskInput( [i,j] );
      # sum := 0;
      # for k in [1..n] do
      #   sum := sum + m1[i][k]*m2t[j][k];
      # od;
      # result[i][j] := sum;
    od;
  od;
  EndRawMasterSlave();

  return result;
end );
\endexample
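
As with `ParMultMat' above, a quick hypothetical check compares the raw
version against the sequential routine:

\beginexample
m1 := RandomMat( 6, 6, Integers );;  m2 := RandomMat( 6, 6, Integers );;
ParRawMultMat( m1, m2 ) = SeqMultMat( m1, m2 );   # should return true
\endexample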

% Should also add section with continuation example.
% Should also add section with search example, and example with
%    coset enumeration (coset enum requires loading additional modified
%    C to check coset table and report work to do without changing things
%    since the coset table will be the shared data.
% Possibilities for search example include naive breadth-first search
%    parallelization, which is fine if have a long wavefront and each slave
%    takes a piece of the wavefront, or else CILK-style work-stealing
%    paradigm, maybe even with CILK 8-queens example.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%
%E