Difference between revisions of "Grid5000"
(71 intermediate revisions by 5 users not shown) | |||
Line 1: | Line 1: | ||
− | https://www.grid5000.fr/mediawiki/index.php/Grid5000:Home | + | https://www.grid5000.fr/mediawiki/index.php/Grid5000:Home |
+ | |||
+ | [https://www.grid5000.fr/mediawiki/index.php/Grid5000:UserCharter USAGE POLICY] - Very important: after booking nodes (oarsub ...), run the command: <source lang="">outofchart</source> This checks that you have not booked too many resources, which would get you in trouble with the Grid5000 admins. | ||
+ | |||
+ | <br> | ||
+ | |||
+ | == Login, job submission, deployment of image == | ||
+ | |||
+ | *Select sites and clusters for experiments, using information on the [https://www.grid5000.fr/mediawiki/index.php/Grid5000:Network#Grid.275000_Sites Grid5000 network] and the [https://www.grid5000.fr/mediawiki/index.php/Status Status page] | ||
+ | *Access is provided via the access nodes '''access.SITE.grid5000.fr''', marked [https://www.grid5000.fr/mediawiki/index.php/External_access here] as ''accessible from '''everywhere''' via ssh with '''keyboard-interactive''' authentication method''. Once you are on one of the sites, you can ssh directly to the frontend node of any other site: | ||
− | |||
− | |||
− | |||
<source lang="bash"> | <source lang="bash"> | ||
access_$ ssh frontend.SITE2 | access_$ ssh frontend.SITE2 | ||
− | </source> | + | </source> |
− | * There is no access to Internet from computing nodes (external IPs should be registered on proxy), therefore, download/update your stuff at the access nodes. Several revision control clients are available. | + | |
− | * Each site has a separate NFS, therefore, to run an application on several sites at once, you need to copy it '''scp, sftp, rsync''' between access or frontend nodes. | + | *There is no access to Internet from computing nodes (external IPs should be registered on proxy), therefore, download/update your stuff at the access nodes. Several revision control clients are available. |
− | * Jobs are run from the frondend nodes, using a [http://en.wikipedia.org/wiki/OpenPBS PBS | + | *Each site has a separate NFS, so to run an application on several sites at once you need to copy it between access or frontend nodes with '''scp, sftp, or rsync'''. | ||
− | ** '''oarstat''' - queue status | + | *Jobs are run from the frontend nodes, using a [http://en.wikipedia.org/wiki/OpenPBS PBS]-like system, [https://www.grid5000.fr/mediawiki/index.php/Cluster_experiment-OAR2 OAR]. Basic commands: | ||
− | ** '''oarsub''' - job submission | + | **'''oarstat''' - queue status |
− | ** '''oardel''' - job removal | + | **'''oarsub''' - job submission |
− | Interactive job on deployed images: | + | **'''oardel''' - job removal |
− | <source lang="bash"> | + | |
− | fontend_$ oarsub -I -t deploy -l [/cluster=N/]nodes=N,walltime=HH[:MM[:SS]] [-p 'PROPERTY="VALUE"'] | + | Interactive job on deployed images: <source lang="bash"> |
− | </source> | + | frontend_$ oarsub -I -t deploy -l [/cluster=N/]nodes=N,walltime=HH[:MM[:SS]] [-p 'PROPERTY="VALUE"'] | ||
− | Batch job on installed images: | + | </source> Batch job on installed images: <source lang="bash"> |
− | <source lang="bash"> | + | frontend_$ oarsub BATCH_FILE -t allow_classic_ssh -l [/cluster=N/]nodes=N,walltime=HH[:MM[:SS]] [-p 'PROPERTY="VALUE"'] | ||
− | fontend_$ oarsub BATCH_FILE -t allow_classic_ssh -l [/cluster=N/]nodes=N,walltime=HH[:MM[:SS]] [-p 'PROPERTY="VALUE"'] | + | </source> Specifying cluster name to reserve: <source lang="bash"> |
− | </source> | + | oarsub -r 'YYYY-MM-dd HH:mm:ss' -l nodes=2,walltime=1 -p "cluster='Genepi'" |
− | * The image to deploy can be created and loaded with help of a [http://wiki.systemimager.org/index.php/Main_Page Systemimager]-like system [https://www.grid5000.fr/mediawiki/index.php/Deploy_environment-OAR2 Kadeploy]. Creating: [https://www.grid5000.fr/mediawiki/index.php/Deploy_environment-OAR2#Tune_an_environment_to_build_another_one:_customize_authentification_parameters described here] | + | </source> If the resources are available two nodes from the cluster "Genepi" will be reserved for the specified time. |
− | Loading: | + | |
− | <source lang="bash"> | + | *The image to deploy can be created and loaded with help of a [http://wiki.systemimager.org/index.php/Main_Page Systemimager]-like system [https://www.grid5000.fr/mediawiki/index.php/Deploy_environment-OAR2 Kadeploy]. Creating: [https://www.grid5000.fr/mediawiki/index.php/Deploy_environment-OAR2#Tune_an_environment_to_build_another_one:_customize_authentification_parameters described here] |
− | fontend_$ kadeploy3 - | + | |
− | </source> | + | Loading: <source lang="bash"> |
− | A Linux distribution lenny-x64-nfs-2.1 with mc, subversion, autotools, doxygen, MPICH2, GSL, Boost, R, gnuplot, graphviz, X11, evince is available at Orsay /home/nancy/alastovetsky/grid5000. | + | frontend_$ kadeploy3 -a PATH_TO_PRIVATE_IMAGE_DESC -f $OAR_FILE_NODES | ||
− | * Running MPI applications is described [https://www.grid5000.fr/mediawiki/index.php/Run_MPI_On_Grid%275000 here] | + | </source> A Linux distribution lenny-x64-nfs-2.1 with mc, subversion, autotools, doxygen, MPICH2, GSL, Boost, R, gnuplot, graphviz, X11, evince is available at Orsay /home/nancy/alastovetsky/grid5000. |
+ | |||
+ | == Compiling and running MPI applications == | ||
+ | |||
+ | *Compilation should be done on one of the reserved nodes (e.g. ssh `head -n 1 $OAR_NODEFILE`) | ||
+ | *Running MPI applications is described [https://www.grid5000.fr/mediawiki/index.php/Run_MPI_On_Grid%275000 here] | ||
+ | **mpirun/mpiexec should be run from one of the reserved nodes (e.g. ssh `head -n 1 $OAR_NODEFILE`) | ||
+ | |||
+ | == Setting up new deploy image == | ||
+ | |||
+ | List available images | ||
+ | |||
+ | kaenv3 -l | ||
+ | |||
+ | Then book node and launch: | ||
+ | |||
+ | oarsub -I -t deploy -l nodes=1,walltime=12 | ||
+ | kadeploy3 -e squeeze-x64-big -f $OAR_FILE_NODES -k | ||
+ | ssh root@`head -n 1 $OAR_NODEFILE` | ||
+ | |||
+ | default password: grid5000 | ||
+ | |||
+ | edit /etc/apt/sources.list | ||
+ | |||
+ | apt-get update | ||
+ | apt-get upgrade | ||
+ | |||
+ | apt-get install libtool autoconf automake mc colorgcc ctags libboost-serialization-dev libboost-graph-dev libatlas-base-dev gfortran vim gdb valgrind screen subversion iperf bc gsl-bin libgsl0-dev | ||
+ | |||
+ | Possibly also install (for using extrae): | ||
+ | |||
+ | apt-get install libxml2-dev binutils-dev libunwind7-dev | ||
+ | |||
+ | <br> Compiled from sources by us: | ||
+ | |||
+ | *<strike>gsl-1.14 (download: ftp://ftp.gnu.org/gnu/gsl/) </strike> ''Now with squeeze it is in repository.'' | ||
+ | |||
+ | <strike>./configure && make && make install</strike> | ||
+ | |||
+ | *mpich2 (download: http://www.mcs.anl.gov/research/projects/mpich2/downloads/index.php?s=downloads) | ||
+ | |||
+ | ./configure --enable-shared --enable-sharedlibs=gcc --with-pm=mpd | ||
+ | make && make install | ||
+ | |||
+ | Mpich2 installed to: | ||
+ | |||
+ | Installing MPE2 include files to /usr/local/include | ||
+ | Installing MPE2 libraries to /usr/local/lib | ||
+ | Installing MPE2 utility programs to /usr/local/bin | ||
+ | Installing MPE2 configuration files to /usr/local/etc | ||
+ | Installing MPE2 system utility programs to /usr/local/sbin | ||
+ | Installing MPE2 man to /usr/local/share/man | ||
+ | Installing MPE2 html to /usr/local/share/doc/ | ||
+ | Installed MPE2 in /usr/local | ||
+ | |||
+ | *hwloc (and lstopo) (download: http://www.open-mpi.org/software/hwloc/v1.2/) | ||
+ | |||
+ | compile from sources. To get xml support install libxml2-dev and pkg-config | ||
+ | |||
+ | apt-get install libxml2-dev pkg-config | ||
+ | tar -xzvf hwloc-1.1.1.tar.gz | ||
+ | cd hwloc-1.1.1 | ||
+ | ./configure && make && make install | ||
+ | |||
+ | Change root password. | ||
+ | |||
+ | rm sources from root dir. | ||
+ | |||
+ | Edit the "message of the day" | ||
+ | |||
+ | vi /etc/motd.tail | ||
+ | |||
+ | echo 90 > /proc/sys/vm/overcommit_ratio | ||
+ | echo 2 > /proc/sys/vm/overcommit_memory | ||
+ | date >> release | ||
+ | |||
+ | Cleanup | ||
+ | |||
+ | apt-get clean | ||
+ | rm /etc/udev/rules.d/*-persistent-net.rules | ||
+ | |||
+ | Make image | ||
+ | |||
+ | ssh root@'''node''' tgz-g5k > $HOME/grid5000/'''imagename'''.tgz | ||
+ | |||
+ | make appropriate .env file. | ||
+ | |||
+ | kaenv3 -p lenny-x64-nfs -u deploy > lenny-x64-custom-2.3.env | ||
+ | |||
+ | <br> | ||
+ | |||
+ | == GotoBLAS2 == | ||
+ | |||
+ | http://www.tacc.utexas.edu/tacc-projects/gotoblas2 When compiling gotoblas on a node without direct internet access get this error: <source lang="">wget http://www.netlib.org/lapack/lapack-3.1.1.tgz | ||
+ | --2011-05-19 03:11:03-- http://www.netlib.org/lapack/lapack-3.1.1.tgz | ||
+ | Resolving www.netlib.org... 160.36.58.108 | ||
+ | Connecting to www.netlib.org|160.36.58.108|:80... failed: Connection timed out. | ||
+ | Retrying. | ||
+ | |||
+ | --2011-05-19 03:14:13-- (try: 2) http://www.netlib.org/lapack/lapack-3.1.1.tgz | ||
+ | Connecting to www.netlib.org|160.36.58.108|:80... failed: Connection timed out. | ||
+ | Retrying. | ||
+ | ...</source> | ||
+ | |||
+ | Fix by downloading http://www.netlib.org/lapack/lapack-3.1.1.tgz to the GotoBLAS2 source directory and editing this line in the Makefile | ||
+ | |||
+ | 184c184 | ||
+ | < -wget http://www.netlib.org/lapack/lapack-3.1.1.tgz | ||
+ | --- | ||
+ | > # -wget http://www.netlib.org/lapack/lapack-3.1.1.tgz | ||
+ | |||
+ | <br> GotoBLAS needs to be compiled individually for each unique machine, i.e. each cluster. Add the following to .bashrc: | ||
+ | |||
+ | export CLUSTER=`hostname |sed 's/\([a-z]*\).*/\1/'` | ||
+ | export LD_LIBRARY_PATH=$HOME/lib/$CLUSTER:$HOME/lib:/usr/local/lib:$LD_LIBRARY_PATH | ||
+ | export LIBRARY_PATH=$HOME/lib/$CLUSTER:$HOME/lib:/usr/local/lib:$LIBRARY_PATH | ||
+ | |||
+ | Run the following script once on each cluster: | ||
+ | |||
+ | <source lang="bash">#! /bin/bash | ||
+ | echo "Compiling gotoblas for cluster: $CLUSTER" | ||
+ | cd $HOME/src | ||
+ | if [ ! -d "$CLUSTER" ]; then | ||
+ | mkdir $CLUSTER | ||
+ | fi | ||
+ | cd $CLUSTER | ||
+ | tar -xzf ../Goto*.tar.gz | ||
+ | cd Goto* | ||
+ | make &> m.log | ||
+ | |||
+ | |||
+ | if [ ! -d "$HOME/lib/$CLUSTER" ]; then | ||
+ | mkdir $HOME/lib/$CLUSTER | ||
+ | fi | ||
+ | |||
+ | cp libgoto2.so $HOME/lib/$CLUSTER | ||
+ | |||
+ | echo results | ||
+ | ls -d $HOME/src/$CLUSTER | ||
+ | ls $HOME/src/$CLUSTER | ||
+ | |||
+ | ls -d $HOME/lib/$CLUSTER | ||
+ | ls $HOME/lib/$CLUSTER</source> | ||
+ | |||
+ | note: for newer processors this may fail. If it is a NEHALEM processor try: | ||
+ | |||
+ | make clean | ||
+ | make TARGET=NEHALEM | ||
+ | |||
+ | == Paging and the OOM-Killer == | ||
+ | |||
+ | When running memory-exhaustion experiments, problems can occur with over-commit. See [[HCL cluster#Paging_and_the_OOM-Killer]] for more detail. | ||
+ | |||
+ | == Example of experiment setup across several sites == | ||
+ | |||
+ | Sources of all files mentioned below are available at: [[Grid5000:sources]]. | ||
+ | |||
+ | Pick one head node as the main head node (I use grenoble, but any will do). Setup sources | ||
+ | |||
+ | cd dave/fupermod-1.1.0 | ||
+ | make clean | ||
+ | ./configure --with-cblas=goto --prefix=/usr/local/ | ||
+ | |||
+ | Reserve 2 nodes from each cluster on a 3-cluster site: | ||
+ | |||
+ | oarsub -r "2011-07-25 11:01:01" -t deploy -l cluster=3/nodes=2,walltime=11:59:00 | ||
+ | |||
+ | Automate with: | ||
+ | |||
+ | for a in 2 3 4; do for i in `cat sites.$a`; do echo $a $i; ssh $i "oarsub -r '2011-07-25 11:01:01' -t deploy -l cluster=$a/nodes=2,walltime=11:59:00"; done; done | ||
+ | |||
+ | Then on each site, where xxx is site name: | ||
+ | |||
+ | kadeploy3 -a $HOME/grid5000/lenny-dave.env -f $OAR_NODE_FILE --output-ok-nodes deployed.xxx | ||
+ | |||
+ | Gather deployed files to a head node: | ||
+ | |||
+ | for i in `cat ~/sites `; do echo $i; scp $i:deployed* . ; done | ||
+ | cat deployed.* > deployed.all | ||
+ | |||
+ | Copy cluster specific libs to each deployed node /usr/local/lib dir with script | ||
+ | |||
+ | copy_local_libs.sh deployed.all | ||
+ | |||
+ | Copy the source files to the root dir of each deployed node. Then run make install on each (note: ssh -f does this in parallel) | ||
+ | |||
+ | for i in `cat ~/deployed.all`; do echo $i; rsync -aP ~/dave/fupermod-1.1.0 root@$i: ; done | ||
+ | for i in `cat ~/deployed.all`; do echo $i; ssh -f root@$i "cd fupermod-1.1.0 ; make all install" ; done | ||
+ | |||
+ | ssh to the first node | ||
+ | |||
+ | ssh `head -n1 deployed.all` | ||
+ | n=$(cat deployed.all |wc -l) | ||
+ | mpdboot --totalnum=$n --file=$HOME/deployed.all | ||
+ | mpdtrace | ||
+ | |||
+ | cd dave/data/ | ||
+ | mpirun -n $n /usr/local/bin/partitioner -l /usr/local/lib/libmxm_col.so -a0 -D10000 -o N=100 | ||
+ | |||
+ | Cleanup after: | ||
+ | |||
+ | for i in `cat ~/sites `; do echo $i; ssh $i rm deployed.* ; done | ||
+ | |||
+ | == Check network speed == | ||
+ | |||
+ | apt-get install iperf | ||
+ | |||
+ | == Choose which network interface to use == | ||
+ | |||
+ | mpirun --mca btl self,openib ... | ||
+ | |||
+ | or | ||
+ | |||
+ | mpirun --mca btl self,tcp ... | ||
+ | |||
+ | == Installing Gadget-2.0.7 == | ||
+ | |||
+ | # apt-get install hdf5-openmpi-dev sfftw-dev | ||
+ | $ tar -xzvf gadget2.tar.gz | ||
+ | $ cd Gadget-2.0.7/Gadget2 | ||
+ | $ make CFLAGS="-DH5_USE_16_API" | ||
+ | $ make clean; make | ||
+ | |||
+ | == Installing Wrekavoc == | ||
+ | |||
+ | Download from http://wrekavoc.gforge.inria.fr/ | ||
+ | |||
+ | # apt-get install libxml2-dev pkg-config | ||
+ | # tar -xzvf wrekavoc-1.1.tar.gz | ||
+ | # cd wrekavoc-1.1/ | ||
+ | # ./configure | ||
+ | # make | ||
+ | # ./src/burn 50 | ||
+ | |||
+ | == Installing Extrae == | ||
+ | |||
+ | (on grid5000 wheezy big) | ||
+ | |||
+ | First install [http://www.dyninst.org/ Dyninst] | ||
+ | |||
+ | # apt-get install libelf-dev libdwarf-dev | ||
+ | # tar -xzvf DyninstAPI-8.1.2.tgz | ||
+ | # cd DyninstAPI-8.1.2 | ||
+ | # ./configure --with-libdwarf-static | ||
+ | # make | ||
+ | # make install | ||
+ | |||
+ | Then Extrae | ||
+ | |||
+ | # apt-get install | ||
+ | # ./configure --with-mpi=/usr --with-mpi-libs=/usr/lib --with-papi=/usr/local --with-unwind=/usr --with-dyninst=/usr/local --with-dwarf=/usr |
Latest revision as of 19:16, 22 July 2013
https://www.grid5000.fr/mediawiki/index.php/Grid5000:Home
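USAGE POLICY - Very important: after booking nodes (oarsub ...), run the command outofchart. This checks that you have not booked too many resources, which would get you in trouble with the Grid5000 admins.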
Contents
- 1 Login, job submission, deployment of image
- 2 Compiling and running MPI applications
- 3 Setting up new deploy image
- 4 GotoBLAS2
- 5 Paging and the OOM-Killer
- 6 Example of experiment setup across several sites
- 7 Check network speed
- 8 Choose which network interface to use
- 9 Installing Gadget-2.0.7
- 10 Installing Wrekavoc
- 11 Installing Extrae
Login, job submission, deployment of image
- Select sites and clusters for experiments, using information on the Grid5000 network and the Status page
- Access is provided via the access nodes access.SITE.grid5000.fr, marked here as accessible from everywhere via ssh with the keyboard-interactive authentication method. Once you are on one of the sites, you can ssh directly to the frontend node of any other site:
access_$ ssh frontend.SITE2
- Computing nodes have no Internet access (external IPs must be registered on the proxy), so download and update your files on the access nodes. Several revision control clients are available.
- Each site has a separate NFS, so to run an application on several sites at once you need to copy it between access or frontend nodes with scp, sftp, or rsync.
- Jobs are run from the frontend nodes, using a PBS-like system, OAR. Basic commands:
- oarstat - queue status
- oarsub - job submission
- oardel - job removal
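Interactive job on deployed images:
frontend_$ oarsub -I -t deploy -l [/cluster=N/]nodes=N,walltime=HH[:MM[:SS]] [-p 'PROPERTY="VALUE"']
Batch job on installed images:
frontend_$ oarsub BATCH_FILE -t allow_classic_ssh -l [/cluster=N/]nodes=N,walltime=HH[:MM[:SS]] [-p 'PROPERTY="VALUE"']
Specifying cluster name to reserve:
oarsub -r 'YYYY-MM-dd HH:mm:ss' -l nodes=2,walltime=1 -p "cluster='Genepi'"
If the resources are available, two nodes from the cluster "Genepi" will be reserved for the specified time.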
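- The image to deploy can be created and loaded with the help of a Systemimager-like system, Kadeploy. Creating: described here
Loading:
frontend_$ kadeploy3 -a PATH_TO_PRIVATE_IMAGE_DESC -f $OAR_FILE_NODES
A Linux distribution lenny-x64-nfs-2.1 with mc, subversion, autotools, doxygen, MPICH2, GSL, Boost, R, gnuplot, graphviz, X11, evince is available at Orsay /home/nancy/alastovetsky/grid5000.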
Compiling and running MPI applications
- Compilation should be done on one of the reserved nodes (e.g. ssh `head -n 1 $OAR_NODEFILE`)
- Running MPI applications is described here
- mpirun/mpiexec should be run from one of the reserved nodes (e.g. ssh `head -n 1 $OAR_NODEFILE`)
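The `head -n 1 $OAR_NODEFILE` idiom above can be wrapped in a small helper. This is only a sketch: the function name `first_node` is hypothetical, and it assumes the node file lists one hostname per line, as the usage above suggests.

```shell
# Print the first hostname listed in a node file.
# Defaults to $OAR_NODEFILE; an optional argument names another node list.
first_node() {
    head -n 1 "${1:-$OAR_NODEFILE}"
}
```

Inside a reservation it could be used as `ssh "$(first_node)"` before compiling or calling mpirun.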
Setting up new deploy image
List available images:
kaenv3 -l
Then book a node and launch:
oarsub -I -t deploy -l nodes=1,walltime=12
kadeploy3 -e squeeze-x64-big -f $OAR_FILE_NODES -k
ssh root@`head -n 1 $OAR_NODEFILE`
Default password: grid5000
Edit /etc/apt/sources.list, then run:
apt-get update
apt-get upgrade
apt-get install libtool autoconf automake mc colorgcc ctags libboost-serialization-dev libboost-graph-dev libatlas-base-dev gfortran vim gdb valgrind screen subversion iperf bc gsl-bin libgsl0-dev
Possibly also install (for using extrae):
apt-get install libxml2-dev binutils-dev libunwind7-dev
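Compiled from sources by us:
- gsl-1.14 (download: ftp://ftp.gnu.org/gnu/gsl/): no longer needed, since with squeeze it is in the repository. It was previously built with ./configure && make && make install.
- mpich2 (download: http://www.mcs.anl.gov/research/projects/mpich2/downloads/index.php?s=downloads)
./configure --enable-shared --enable-sharedlibs=gcc --with-pm=mpd
make && make install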
Mpich2 installed to:
Installing MPE2 include files to /usr/local/include
Installing MPE2 libraries to /usr/local/lib
Installing MPE2 utility programs to /usr/local/bin
Installing MPE2 configuration files to /usr/local/etc
Installing MPE2 system utility programs to /usr/local/sbin
Installing MPE2 man to /usr/local/share/man
Installing MPE2 html to /usr/local/share/doc/
Installed MPE2 in /usr/local
- hwloc (and lstopo) (download: http://www.open-mpi.org/software/hwloc/v1.2/)
Compile from sources. To get XML support, install libxml2-dev and pkg-config:
apt-get install libxml2-dev pkg-config
tar -xzvf hwloc-1.1.1.tar.gz
cd hwloc-1.1.1
./configure && make && make install
Change the root password.
Remove the sources from the root dir.
Edit the "message of the day"
vi /etc/motd.tail
echo 90 > /proc/sys/vm/overcommit_ratio
echo 2 > /proc/sys/vm/overcommit_memory
date >> release
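With overcommit_memory set to 2, the kernel refuses allocations beyond a commit limit of roughly swap plus overcommit_ratio percent of RAM (huge pages aside). The sketch below only illustrates that arithmetic; `commit_limit_kb` is a hypothetical helper name and the figures in the comment are made-up example values.

```shell
# Approximate commit limit in kB under strict overcommit (vm.overcommit_memory=2):
#   limit = swap + ram * ratio / 100   (huge pages are ignored in this sketch)
# Arguments: RAM in kB, swap in kB, overcommit_ratio in percent.
commit_limit_kb() {
    ram_kb=$1
    swap_kb=$2
    ratio=$3
    echo $(( swap_kb + ram_kb * ratio / 100 ))
}

# Example call (hypothetical sizes): commit_limit_kb 1000000 200000 90
```

On a running node the result can be compared with the CommitLimit field in /proc/meminfo.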
Cleanup
apt-get clean
rm /etc/udev/rules.d/*-persistent-net.rules
Make image
ssh root@node tgz-g5k > $HOME/grid5000/imagename.tgz
Make an appropriate .env file:
kaenv3 -p lenny-x64-nfs -u deploy > lenny-x64-custom-2.3.env
GotoBLAS2
http://www.tacc.utexas.edu/tacc-projects/gotoblas2
When compiling GotoBLAS on a node without direct internet access, you get this error:
wget http://www.netlib.org/lapack/lapack-3.1.1.tgz
--2011-05-19 03:11:03-- http://www.netlib.org/lapack/lapack-3.1.1.tgz
Resolving www.netlib.org... 160.36.58.108
Connecting to www.netlib.org|160.36.58.108|:80... failed: Connection timed out.
Retrying.
--2011-05-19 03:14:13-- (try: 2) http://www.netlib.org/lapack/lapack-3.1.1.tgz
Connecting to www.netlib.org|160.36.58.108|:80... failed: Connection timed out.
Retrying.
...
Fix by downloading http://www.netlib.org/lapack/lapack-3.1.1.tgz to the GotoBLAS2 source directory and commenting out this line in the Makefile:
184c184
< -wget http://www.netlib.org/lapack/lapack-3.1.1.tgz
---
> # -wget http://www.netlib.org/lapack/lapack-3.1.1.tgz
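The Makefile edit can also be scripted. The following is a sketch rather than the method used on this page: `comment_out_wget` is a hypothetical helper, and it assumes the wget line looks like the one in the diff above, possibly preceded by a tab.

```shell
# Print a copy of the given Makefile with the lapack wget line commented out.
# Leading whitespace (the recipe tab) is preserved in front of the '#'.
comment_out_wget() {
    sed 's|^\([[:space:]]*\)\(-wget .*lapack-3\.1\.1\.tgz\)|\1# \2|' "$1"
}
```

For example, `comment_out_wget Makefile > Makefile.new && mv Makefile.new Makefile` replaces the file once the output has been checked.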
GotoBLAS needs to be compiled individually for each unique machine, i.e. each cluster. Add the following to .bashrc:
export CLUSTER=`hostname |sed 's/\([a-z]*\).*/\1/'`
export LD_LIBRARY_PATH=$HOME/lib/$CLUSTER:$HOME/lib:/usr/local/lib:$LD_LIBRARY_PATH
export LIBRARY_PATH=$HOME/lib/$CLUSTER:$HOME/lib:/usr/local/lib:$LIBRARY_PATH
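To see what the sed expression in the CLUSTER line extracts, the same expression can be applied to a sample name. This is an illustrative sketch: `cluster_of` is a hypothetical function, and the hostname in the comment is an assumed example of a node name that begins with its cluster name.

```shell
# Extract the leading run of lowercase letters from a hostname,
# using the same sed expression as the .bashrc line above.
cluster_of() {
    printf '%s\n' "$1" | sed 's/\([a-z]*\).*/\1/'
}

# Example (assumed hostname format): cluster_of genepi-12.grenoble.grid5000.fr
```

A name that does not start with a lowercase letter yields an empty string, so CLUSTER would be empty on such a host.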
Run the following script once on each cluster:
#! /bin/bash
# Build GotoBLAS2 for the current cluster and copy the library to $HOME/lib/$CLUSTER.
echo "Compiling gotoblas for cluster: $CLUSTER"
cd "$HOME/src" || exit 1
# mkdir -p creates the directory only if it does not exist yet
mkdir -p "$CLUSTER"
cd "$CLUSTER" || exit 1
tar -xzf ../Goto*.tar.gz
cd Goto* || exit 1
# Build output goes to m.log
make &> m.log
mkdir -p "$HOME/lib/$CLUSTER"
cp libgoto2.so "$HOME/lib/$CLUSTER"
echo results
ls -d "$HOME/src/$CLUSTER"
ls "$HOME/src/$CLUSTER"
ls -d "$HOME/lib/$CLUSTER"
ls "$HOME/lib/$CLUSTER"
Note: for newer processors this may fail. If it is a Nehalem processor, try:
make clean
make TARGET=NEHALEM
Paging and the OOM-Killer
When running memory-exhaustion experiments, problems can occur with over-commit. See HCL cluster#Paging_and_the_OOM-Killer for more detail.
Example of experiment setup across several sites
Sources of all files mentioned below are available at: Grid5000:sources.
Pick one head node as the main head node (I use grenoble, but any will do). Set up the sources:
cd dave/fupermod-1.1.0
make clean
./configure --with-cblas=goto --prefix=/usr/local/
Reserve 2 nodes from each cluster on a 3-cluster site:
oarsub -r "2011-07-25 11:01:01" -t deploy -l cluster=3/nodes=2,walltime=11:59:00
Automate with:
for a in 2 3 4; do for i in `cat sites.$a`; do echo $a $i; ssh $i "oarsub -r '2011-07-25 11:01:01' -t deploy -l cluster=$a/nodes=2,walltime=11:59:00"; done; done
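Before submitting reservations on every site, it can help to print the commands first. The sketch below is a dry run only: `print_reservations` is a hypothetical function, and it assumes the files sites.2, sites.3 and sites.4 hold one frontend hostname per line. Because ssh joins its arguments into one string that the remote shell parses again, the whole oarsub command is printed as a single quoted argument so the date stays in one piece.

```shell
# Dry run: print one reservation command per site instead of executing it.
# $1 = directory holding sites.2, sites.3, sites.4; $2 = start date and time.
print_reservations() (
    dir=$1
    start=$2
    for a in 2 3 4; do
        # Skip cluster counts that have no site list
        [ -f "$dir/sites.$a" ] || continue
        while IFS= read -r site; do
            [ -n "$site" ] || continue
            printf "ssh %s \"oarsub -r '%s' -t deploy -l cluster=%s/nodes=2,walltime=11:59:00\"\n" \
                "$site" "$start" "$a"
        done < "$dir/sites.$a"
    done
)
```

Running `print_reservations . "2011-07-25 11:01:01"` lists the commands for review before any of them is run.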
Then on each site, where xxx is site name:
kadeploy3 -a $HOME/grid5000/lenny-dave.env -f $OAR_NODE_FILE --output-ok-nodes deployed.xxx
Gather deployed files to a head node:
for i in `cat ~/sites `; do echo $i; scp $i:deployed* . ; done
cat deployed.* > deployed.all
Copy cluster-specific libs to the /usr/local/lib dir of each deployed node with the script:
copy_local_libs.sh deployed.all
Copy the source files to the root dir of each deployed node. Then run make install on each (note: ssh -f does this in parallel)
for i in `cat ~/deployed.all`; do echo $i; rsync -aP ~/dave/fupermod-1.1.0 root@$i: ; done
for i in `cat ~/deployed.all`; do echo $i; ssh -f root@$i "cd fupermod-1.1.0 ; make all install" ; done
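If a deployed.all from an earlier run is still present, the glob deployed.* also matches it. A possible way to merge the per-site lists while skipping that file is sketched here; `merge_deployed` is a hypothetical helper, and it assumes the per-site files are named deployed.SITE as above.

```shell
# Merge per-site node lists (deployed.<site>) found in a directory into one
# de-duplicated, sorted list on stdout, skipping any existing deployed.all.
merge_deployed() (
    cd "$1" || exit 1
    for f in deployed.*; do
        # An unmatched glob stays literal and is skipped by the -f check
        [ -f "$f" ] || continue
        [ "$f" = "deployed.all" ] && continue
        cat "$f"
    done | sort -u
)
```

Since deployed.all is never read, `merge_deployed . > deployed.all` can be rerun without duplicating entries.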
ssh to the first node:
ssh `head -n1 deployed.all`
n=$(cat deployed.all |wc -l)
mpdboot --totalnum=$n --file=$HOME/deployed.all
mpdtrace
cd dave/data/
mpirun -n $n /usr/local/bin/partitioner -l /usr/local/lib/libmxm_col.so -a0 -D10000 -o N=100
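The node count above comes straight from `wc -l`, which also counts blank or repeated lines. If the merged list might contain either, a variant along these lines counts distinct hostnames instead; `count_nodes` is a hypothetical name, not part of the original setup.

```shell
# Count the distinct, non-empty hostnames in a node list file.
count_nodes() {
    grep -v '^[[:space:]]*$' "$1" | sort -u | wc -l | tr -d ' '
}
```

It would be used as `n=$(count_nodes deployed.all)` in place of the `cat ... | wc -l` line.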
Cleanup after:
for i in `cat ~/sites `; do echo $i; ssh $i rm deployed.* ; done
Check network speed
apt-get install iperf
Choose which network interface to use
mpirun --mca btl self,openib ...
or
mpirun --mca btl self,tcp ...
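The choice between the two commands can be made by a script. The sketch below is an assumption-laden illustration: `pick_btl` is a hypothetical helper that checks whether Linux exposes any InfiniBand device under /sys/class/infiniband. It does not verify that the installed MPI was built with openib support.

```shell
# Choose a BTL list: openib if the InfiniBand device directory has entries,
# otherwise tcp. $1 optionally overrides the directory that is checked.
pick_btl() {
    ib_dir=${1:-/sys/class/infiniband}
    if [ -d "$ib_dir" ] && [ -n "$(ls -A "$ib_dir" 2>/dev/null)" ]; then
        echo "self,openib"
    else
        echo "self,tcp"
    fi
}
```

A run would then look like `mpirun --mca btl "$(pick_btl)" ...`.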
Installing Gadget-2.0.7
# apt-get install hdf5-openmpi-dev sfftw-dev
$ tar -xzvf gadget2.tar.gz
$ cd Gadget-2.0.7/Gadget2
$ make CFLAGS="-DH5_USE_16_API"
$ make clean; make
Installing Wrekavoc
Download from http://wrekavoc.gforge.inria.fr/
# apt-get install libxml2-dev pkg-config
# tar -xzvf wrekavoc-1.1.tar.gz
# cd wrekavoc-1.1/
# ./configure
# make
# ./src/burn 50
Installing Extrae
(on grid5000 wheezy big)
First install Dyninst
# apt-get install libelf-dev libdwarf-dev
# tar -xzvf DyninstAPI-8.1.2.tgz
# cd DyninstAPI-8.1.2
# ./configure --with-libdwarf-static
# make
# make install
Then Extrae
# apt-get install
# ./configure --with-mpi=/usr --with-mpi-libs=/usr/lib --with-papi=/usr/local --with-unwind=/usr --with-dyninst=/usr/local --with-dwarf=/usr