
   Description of the test
   ========================

   This test executes parallel matrix-matrix multiplication
   on a two-dimensional square grid of processes.
   The matrices are distributed using the two-dimensional
   heterogeneous block-cyclic distribution proposed by
   Kalinov and Lastovetsky in their paper:

   Kalinov A., and Lastovetsky A. (2001), "Heterogeneous Distribution of 
   Computations Solving Linear Algebra Problems on Networks of 
   Heterogeneous Computers", Journal of Parallel and Distributed Computing, 61, 4, pp. 520-535.
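
The core idea of a heterogeneous distribution is that each process receives
an amount of work proportional to its speed. The following is a minimal
one-dimensional sketch of that idea only. It is not the algorithm from the
paper, and it is not what the HPDL library in Load_balance.c does. The
function name split_by_speed and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of HMPI or HPDL: split `r` rows (or columns)
 * among `p` processes in proportion to their relative speeds.
 * On return, share[i] holds the number of units given to process i. */
void split_by_speed(const double *speed, size_t p, int r, int *share)
{
    assert(p > 0 && r >= 0);

    double total = 0.0;
    for (size_t i = 0; i < p; i++) {
        assert(speed[i] > 0.0);
        total += speed[i];
    }

    /* First pass: give every process the integer part of its ideal share. */
    int assigned = 0;
    for (size_t i = 0; i < p; i++) {
        share[i] = (int)(r * speed[i] / total);
        assigned += share[i];
    }

    /* Second pass: hand out the remaining units one at a time to the
     * process that would finish earliest after receiving one more unit. */
    while (assigned < r) {
        size_t best = 0;
        for (size_t i = 1; i < p; i++) {
            if ((share[i] + 1) / speed[i] < (share[best] + 1) / speed[best])
                best = i;
        }
        share[best]++;
        assigned++;
    }
}
```

With four processes of equal speed and r = 16, every process receives 4
units. With speeds 3 and 1 and r = 8, the faster process receives 6 units
and the slower one 2.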

   The function HMPI_Timeof is used to find the optimal generalized block size.
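
The selection step can be sketched as follows, assuming that one estimated
execution time is already available for each candidate block size. In the
real test those estimates come from HMPI_Timeof in mxm_i.c. Here they are
passed in as plain data, so no HMPI call is shown. The names struct estimate
and pick_block_size are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record: one candidate block size and its estimated time. */
struct estimate {
    int    bsize;  /* candidate generalized block size */
    double time;   /* estimated execution time in seconds */
};

/* Return the block size with the smallest estimated time.
 * On a tie, the first candidate in the array wins. */
int pick_block_size(const struct estimate *e, size_t n)
{
    assert(n > 0);

    size_t best = 0;
    for (size_t i = 1; i < n; i++) {
        if (e[i].time < e[best].time)
            best = i;
    }
    return e[best].bsize;
}
```

Fed with the five (bsize, time) pairs printed in the sample run under
HOW TO RUN, this function returns 4, which matches the optimal block size
reported there.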

   CONDITIONS
   ----------
   N must be a multiple of r.
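
A check of this condition might look like the sketch below. The function
names are hypothetical and do not appear in the test sources. N and r stand
for the parameters defined in counter.h.

```c
#include <assert.h>

/* Returns 1 when n is a positive exact multiple of r, 0 otherwise. */
int size_is_valid(int n, int r)
{
    return n > 0 && r > 0 && n % r == 0;
}

/* Number of r-sized units along one matrix dimension.
 * Aborts if the condition "N must be a multiple of r" is violated. */
int units_per_dimension(int n, int r)
{
    assert(size_is_valid(n, r));
    return n / r;
}
```

For example, N = 1024 with r = 32 is accepted and gives 32 units per
dimension, while N = 1000 with r = 32 is rejected.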
   
   FILES
   -----

   ParallelAxB.mpc ----> Performance model definition for the algorithm
                         of parallel matrix-matrix multiplication using 
                         2D heterogeneous block-cyclic distribution of matrices
   ParallelAxB.c   ----> Code generated from the performance model definition
   Load_balance.c  ----> Contains calls to the HPDL partitioning library, which partitions the
                         matrix given the speeds of the processors
   mxm_i.h         ----> Header containing the function and variable declarations
   mxm_i.c         ----> Contains the call to HMPI_Timeof, which determines the optimal
                         generalized block size
   mxm.c           ----> Contains the main function
   counter.h       ----> Contains the parameters:
                         N = size of the matrix to solve
                         r = granularity or communication-to-computation ratio (typical values: 16, 32)
                         p = number of processes along each row/column of the square grid

   HOW TO RUN
   ----------

   shell$ hmpicc ParallelAxB.mpc

   shell$ hmpibcast mxm.c mxm_i.c mxm_i.h ParallelAxB.c Load_balance.c counter.h

   shell$ hmpiload -o mxm mxm.c

   shell$ hmpirun mxm
Processor performances after HMPI_Recon are: 2840873 2840873 2840873 2840873
=========Block size=2=============
TIMEOF: time=0.072574, bsize=2
===================================
=========Block size=4=============
TIMEOF: time=0.060384, bsize=4
===================================
=========Block size=8=============
TIMEOF: time=0.194073, bsize=8
===================================
=========Block size=16=============
TIMEOF: time=0.726392, bsize=16
===================================
=========Block size=32=============
TIMEOF: time=2.891312, bsize=32
===================================


me=0, Optimal generalised block size = 4

