Grid5000


https://www.grid5000.fr/mediawiki/index.php/Grid5000:Home

USAGE POLICY

Login, job submission, deployment of images

  • Select sites and clusters for experiments, using information on the Grid5000 network and the Status page
  • Access is provided via the access nodes access.SITE.grid5000.fr, which are reachable from everywhere via ssh with the keyboard-interactive authentication method. Once you are on one of the sites, you can ssh directly to the frontend node of any other site:
access_$ ssh frontend.SITE2
  • There is no Internet access from the computing nodes (external IPs must be registered on the proxy), so download/update your files on the access nodes. Several revision control clients are available there.
  • Each site has a separate NFS, so to run an application on several sites at once you need to copy it (scp, sftp, rsync) between the access or frontend nodes.
  • Jobs are run from the frontend nodes, using OAR, a PBS-like batch system. Basic commands:
    • oarstat - queue status
    • oarsub - job submission
    • oardel - job removal
Interactive job on deployed images:
frontend_$ oarsub -I -t deploy -l [/cluster=N/]nodes=N,walltime=HH[:MM[:SS]] [-p 'PROPERTY="VALUE"']
Batch job on installed images:
frontend_$ oarsub BATCH_FILE -t allow_classic_ssh -l [/cluster=N/]nodes=N,walltime=HH[:MM[:SS]] [-p 'PROPERTY="VALUE"']
Loading:
frontend_$ kadeploy3 -a PATH_TO_PRIVATE_IMAGE_DESC -f $OAR_FILE_NODES
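
For example, a complete interactive deploy session with illustrative values (the cluster name genepi and the image description path are placeholders; substitute your own) might look like:

frontend_$ oarsub -I -t deploy -l /cluster=1/nodes=4,walltime=2:00:00 -p 'cluster="genepi"'
frontend_$ kadeploy3 -a $HOME/grid5000/lenny-x64-custom-2.3.env -f $OAR_FILE_NODES
frontend_$ ssh root@`head -n 1 $OAR_FILE_NODES`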
A Linux image lenny-x64-nfs-2.1 with mc, subversion, autotools, doxygen, MPICH2, GSL, Boost, R, gnuplot, graphviz, X11, and evince is available at Orsay in /home/nancy/alastovetsky/grid5000.

Compiling and running MPI applications

  • Compilation should be done on one of the reserved nodes (e.g. ssh `head -n 1 $OAR_NODEFILE`)
  • Running MPI applications is described in the Grid5000 documentation
    • mpirun/mpiexec should be run from one of the reserved nodes (e.g. ssh `head -n 1 $OAR_NODEFILE`); a launch sketch follows below
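
Since $OAR_NODEFILE lists one line per reserved core and is only set inside the job shell, a minimal launch sketch (assuming the MPICH2/MPD install described below; ./app and the process count are placeholders) is:

frontend_$ uniq $OAR_NODEFILE > ~/mpd.hosts       # one line per host; $HOME is on NFS
frontend_$ ssh `head -n 1 ~/mpd.hosts`
node_$ mpdboot -n `wc -l < ~/mpd.hosts` -f ~/mpd.hosts
node_$ mpiexec -n 8 ./app                         # 8 = total number of MPI processes
node_$ mpdallexit                                 # shut down the MPD ring when done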

Setting up new deploy image

oarsub -I -t deploy -l nodes=1,walltime=12
kadeploy3 -e lenny-x64-nfs -f $OAR_FILE_NODES -k
ssh root@`head -n 1 $OAR_NODEFILE`

edit /etc/apt/sources.list

apt-get update
apt-get upgrade
apt-get install libtool autoconf automake mc colorgcc ctags libboost-serialization-dev libboost-graph-dev libatlas-base-dev gfortran vim gdb valgrind screen subversion


Compiled from sources by us:

# generic packages
./configure && make && make install
# MPICH2: build shared libraries and use the MPD process manager
./configure --enable-shared --enable-sharedlibs=gcc --with-pm=mpd
make && make install

MPICH2 (including MPE2) is installed to /usr/local:

Installing MPE2 include files to /usr/local/include
Installing MPE2 libraries to /usr/local/lib
Installing MPE2 utility programs to /usr/local/bin
Installing MPE2 configuration files to /usr/local/etc
Installing MPE2 system utility programs to /usr/local/sbin
Installing MPE2 man to /usr/local/share/man
Installing MPE2 html to /usr/local/share/doc/
Installed MPE2 in /usr/local

hwloc is compiled from sources. To get XML support, install libxml2-dev and pkg-config:

apt-get install libxml2-dev pkg-config
tar -xzvf hwloc-1.1.1.tar.gz
cd hwloc-1.1.1
./configure && make && make install
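
To check the result, lstopo (installed with hwloc) prints the machine topology; with libxml2 present it can also export it to XML:

lstopo
lstopo topology.xml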

Cleanup

apt-get clean
# remove the recorded NIC/MAC mappings so new ones are generated on other nodes
rm /etc/udev/rules.d/*-persistent-net.rules

Make image

ssh root@node tgz-g5k > $HOME/grid5000/imagename.tgz

Make an appropriate .env file, starting from the description of the reference environment:

kaenv3 -p lenny-x64-nfs -u deploy > lenny-x64-custom-2.3.env
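
Then point the dumped description at the new tarball. The exact fields follow whatever kaenv3 printed; the lines below only illustrate the kind of edits needed:

name : lenny-x64-custom
version : 2.3
tarball : /home/SITE/USER/grid5000/imagename.tgz|tgz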


GotoBLAS2

When compiling GotoBLAS2 on a node without direct Internet access, you get this error:
wget http://www.netlib.org/lapack/lapack-3.1.1.tgz
--2011-05-19 03:11:03--  http://www.netlib.org/lapack/lapack-3.1.1.tgz
Resolving www.netlib.org... 160.36.58.108
Connecting to www.netlib.org|160.36.58.108|:80... failed: Connection timed out.
Retrying.

--2011-05-19 03:14:13--  (try: 2)  http://www.netlib.org/lapack/lapack-3.1.1.tgz
Connecting to www.netlib.org|160.36.58.108|:80... failed: Connection timed out.
Retrying.
...

Fix this by downloading http://www.netlib.org/lapack/lapack-3.1.1.tgz to the GotoBLAS2 source directory and commenting out this line in the Makefile:

184c184
< 	-wget http://www.netlib.org/lapack/lapack-3.1.1.tgz
---
> #	-wget http://www.netlib.org/lapack/lapack-3.1.1.tgz


GotoBLAS2 needs to be compiled individually for each unique machine, i.e. for each cluster. Add the following to .bashrc (the sed extracts the cluster name from the hostname, e.g. genepi-12 yields genepi):

export CLUSTER=`hostname | sed 's/\([a-z]*\).*/\1/'`
export LD_LIBRARY_PATH=$HOME/lib/$CLUSTER:$HOME/lib:/usr/local/lib:$LD_LIBRARY_PATH
export LIBRARY_PATH=$HOME/lib/$CLUSTER:$HOME/lib:/usr/local/lib:$LIBRARY_PATH

Run the following script once on each cluster:

#!/bin/bash
echo "Compiling GotoBLAS2 for cluster: $CLUSTER"

# build in a per-cluster source directory
cd $HOME/src
mkdir -p $CLUSTER
cd $CLUSTER
tar -xzf ../Goto*.tar.gz
cd Goto*
make &> m.log

# keep the resulting library in a per-cluster lib directory
mkdir -p $HOME/lib/$CLUSTER
cp libgoto2.so $HOME/lib/$CLUSTER

echo "Results:"
ls -d $HOME/src/$CLUSTER
ls $HOME/src/$CLUSTER
ls -d $HOME/lib/$CLUSTER
ls $HOME/lib/$CLUSTER
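
The script must be run once per cluster. A hypothetical driver, assuming first.nodes holds one reserved hostname per cluster and the script is saved as $HOME/src/build_goto.sh, could be:

for node in `cat first.nodes`; do
        echo $node
        # set CLUSTER explicitly, since a non-interactive ssh shell may not read .bashrc
        ssh $node 'export CLUSTER=`hostname | sed "s/\([a-z]*\).*/\1/"`; bash $HOME/src/build_goto.sh'
done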

Paging and the OOM-Killer

When running experiments that exhaust the available memory, problems can occur with over-commit. See HCL_cluster#Paging_and_the_OOM-Killer for more detail.

Example of experiment setup across several sites

Note: all these steps should probably be scripted into one command in the future.

Pick one head node as the main head node (I use grenoble, but any will do). Set up the sources:

cd dave/fupermod-1.1.0
make clean
./configure --with-cblas=goto --prefix=/usr/local/

Reserve 2 nodes from each cluster on a site with 3 clusters:

oarsub -r "2011-07-25 11:01:01" -t deploy  -l cluster=3/nodes=2,walltime=11:59:00

Automate this across sites with the following loop (sites.N lists the frontends of sites that have N clusters):

for a in 2 3 4; do for i in `cat sites.$a`; do echo $a $i; ssh $i oarsub -r "2011-07-25 11:01:01" -t deploy -l cluster=$a/nodes=2,walltime=11:59:00; done; done

Then on each site:

kadeploy3 -a $HOME/grid5000/lenny-dave.env  -f $OAR_NODE_FILE --output-ok-nodes deployed.rennes

or:

for i in `cat sites`; do echo $i; ssh $i kadeploy3 -a $HOME/grid5000/lenny-dave.env  -f $OAR_NODE_FILE --output-ok-nodes deployed.$i; done

Gather deployed files to a head node:

for i in `cat ~/sites `; do echo $i; scp $i:deployed* . ; done
cat deployed.* > deployed.all

Copy the cluster-specific libraries to the /usr/local/lib directory of each deployed node with the script (a sketch follows below):

copy_local_libs.sh deployed.all
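
copy_local_libs.sh itself is not shown here; a hypothetical sketch of what it does, reusing the sed pattern from .bashrc:

#!/bin/bash
# usage: copy_local_libs.sh deployed.all
for node in `cat $1`; do
        cluster=`echo $node | sed 's/\([a-z]*\).*/\1/'`
        echo "$node ($cluster)"
        scp $HOME/lib/$cluster/libgoto2.so root@$node:/usr/local/lib/
done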

Copy the source files to the root home directory of each deployed node, then run make install on each (note: ssh -f does this in parallel):

for i in `cat ~/deployed.all`; do echo $i; rsync -aP ~/dave/fupermod-1.1.0 root@$i: ; done
for i in `cat ~/deployed.all`; do echo $i; ssh -f root@$i "cd fupermod-1.1.0 ; make all install" ; done
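
A quick check that the build landed everywhere (the grep pattern is illustrative, assuming the installed library names contain "fupermod"):

for i in `cat ~/deployed.all`; do echo $i; ssh root@$i 'ls /usr/local/lib | grep fupermod'; done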