
   Description of the test
   ========================

   This test executes the parallel matrix-matrix multiplication
   using a two-dimensional grid of processes.

   The matrices are distributed using a two-dimensional
   heterogeneous block-cyclic distribution.
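
   As a rough illustration of what "heterogeneous" means here, the sketch
   below splits the rows (or columns) of one generalized block among
   processes in proportion to their speeds. The helper name split_by_speed
   and its rounding rule are assumptions made for this example; the actual
   partitioning in Load_balance.c may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of the test sources): split g rows of one
 * generalized block among n processes in proportion to their speeds.
 * share[i] receives the number of rows assigned to process i. */
void split_by_speed(int g, const double *speed, size_t n, int *share)
{
    double total = 0.0;
    int assigned = 0;
    size_t i;

    assert(g >= 0 && n > 0);
    for (i = 0; i < n; i++) {
        assert(speed[i] > 0.0);
        total += speed[i];
    }

    /* First pass: every process gets its proportional share, rounded down. */
    for (i = 0; i < n; i++) {
        share[i] = (int)(g * speed[i] / total);
        assigned += share[i];
    }

    /* Second pass: hand out the rows lost to rounding one at a time, each
     * to the process that would finish its enlarged share soonest. */
    while (assigned < g) {
        size_t best = 0;
        for (i = 1; i < n; i++) {
            if ((share[i] + 1) / speed[i] < (share[best] + 1) / speed[best])
                best = i;
        }
        share[best]++;
        assigned++;
    }
}
```

   With two processes of equal speed and g = 12, each process receives
   6 rows; with speeds 1 and 3 and g = 8, the shares are 2 and 6.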

   The test uses "HMPI_Timeof" to find the optimal generalized block size.
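
   The search for the best block size can be pictured as a loop over
   candidate sizes that keeps the one with the smallest estimated time.
   The sketch below is only an illustration: the callback estimate_fn
   stands in for the call to HMPI_Timeof (whose real signature is not
   reproduced here), toy_estimate is an invented cost function, and
   enumerating divisors of a block count nblocks is an assumption that
   merely matches the candidate sizes printed in the sample run.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

/* Hypothetical callback: estimated execution time for a generalized
 * block of grow x gcol. In the real test this role is played by
 * HMPI_Timeof applied to the performance model. */
typedef double (*estimate_fn)(int grow, int gcol);

/* Invented cost function for demonstration only; minimum at (4, 6). */
double toy_estimate(int grow, int gcol)
{
    double dr = grow - 4;
    double dc = gcol - 6;
    return dr * dr + dc * dc;
}

/* Try every pair (grow, gcol) of divisors of nblocks with grow >= p and
 * gcol >= q, and remember the pair with the smallest estimate.
 * Returns the smallest estimate found. */
double find_best_block(int nblocks, int p, int q, estimate_fn estimate,
                       int *best_grow, int *best_gcol)
{
    double best = DBL_MAX;
    double t;
    int grow, gcol;

    assert(nblocks > 0 && p > 0 && q > 0 && estimate != NULL);
    *best_grow = 0;
    *best_gcol = 0;

    for (grow = p; grow <= nblocks; grow++) {
        if (nblocks % grow != 0)
            continue;               /* not a divisor: skip */
        for (gcol = q; gcol <= nblocks; gcol++) {
            if (nblocks % gcol != 0)
                continue;
            t = estimate(grow, gcol);
            if (t < best) {         /* strict '<' keeps the first minimum */
                best = t;
                *best_grow = grow;
                *best_gcol = gcol;
            }
        }
    }
    return best;
}
```

   Calling find_best_block(48, 2, 2, toy_estimate, &grow, &gcol) selects
   grow = 4 and gcol = 6 for this toy cost function.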

   The algorithm is memory-efficient: each processor stores only the
   portions of matrices A, B, and C that belong to it.

   CONDITIONS
   ----------
   N must be a multiple of r.
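
   For example, N=384 and r=8 satisfy the condition (384 = 48 x 8). A
   check of this kind could be placed before the run starts; the function
   name check_sizes below is hypothetical and not part of the test sources.

```c
#include <stdio.h>

/* Hypothetical guard: returns 1 when N is a positive multiple of r,
 * otherwise prints a message to stderr and returns 0. */
int check_sizes(int N, int r)
{
    if (r <= 0 || N <= 0 || N % r != 0) {
        fprintf(stderr, "N=%d must be a positive multiple of r=%d\n", N, r);
        return 0;
    }
    return 1;
}
```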

   FILES
   -----
   ParallelAxB.mpc ----> Performance model definition
   mxm_i.h         ----> Header containing the function and variable declarations
   mxm_i.c         ----> Contains the parallel matrix-matrix multiplication algorithm
                         using heterogeneous block-cyclic distribution of matrices.
                         Uses HMPI_Timeof to find the optimal generalized block size
   mxm.c           ----> Contains the main function
   Load_balance.c  ----> Distributes matrices A, B, and C among the processes
                         (in proportion to the speeds of the processors)
   counter.h       ----> Contains the parameters
                         N=Size of the matrix to solve
                         r=granularity or communication-to-computation ratio (typical values: 16, 32)
                         p=Number of processes along the row
                         q=Number of processes along the column

   HOW TO RUN
   ----------
shell$ hmpicc ParallelAxB.mpc

shell$ hmpibcast mxm.c mxm_i.c mxm_i.h ParallelAxB.c counter.h Load_balance.c

shell$ hmpiload -o mxm mxm.c

shell$ hmpirun mxm
Processor performances refreshed
Performances are: 475944 475944 475944 475944 
=========row block size=2, column block size=2=============
TIMEOF = 0.269
===================================
=========row block size=2, column block size=3=============
TIMEOF = 0.269
===================================
=========row block size=2, column block size=4=============
TIMEOF = 0.538
===================================
=========row block size=2, column block size=6=============
TIMEOF = 0.605

...

Optimal generalised block size row = 2, Optimal generalised block size col = 12

Starting the matrix-matrix multiplication
N=384, p=2, q=2, r=8, grow=2, gcol=12, time(sec)=0.386351
