HCL cluster/hcl node install configuration log — revision of 2013-09-26 by Davepc (section: Upgrade September 2013)
<hr />
<div>HCL nodes are installed from a clone of a root node, <code>hcl07</code>. The general installation of the root node is documented here. The cloning process introduces a number of complications; solutions to these are also explained.<br />
<br />
See also [[HCL_cluster/heterogeneous.ucd.ie_install_log]]<br />
<br />
=General Installation=<br />
Partition the disk with a 1GB swap partition at the end of the disk, sized equal to the maximum memory installed on any cluster node. The root file system occupies the remainder of the disk, formatted EXT4.<br />
<br />
Install long list of packages.<br />
<br />
==Networking==<br />
Configure network interface as follows:<br />
<br />
<source lang="text"># This file describes the network interfaces available on your system<br />
# and how to activate them. For more information, see interfaces(5).<br />
<br />
# The loopback network interface<br />
auto lo eth0 eth1<br />
<br />
iface lo inet loopback<br />
<br />
# The primary network interface<br />
allow-hotplug eth0<br />
iface eth0 inet dhcp<br />
<br />
allow-hotplug eth1<br />
iface eth1 inet dhcp<br />
</source><br />
<br />
===Routing Tables===<br />
Add an executable script named <code>00routes</code> to the directory <code>/etc/network/if-up.d</code>. This script is called after the interfaces listed as <code>auto</code> are brought up on boot (or on a networking service restart). Note: the DHCP process for eth1 is backgrounded so that the startup of other services can continue. Our routing script adds routes on this interface even though it is not yet fully up; the script prints some errors, but the routing entries remain nonetheless. The script should read as follows:<br />
<br />
<source lang="bash"><br />
#!/bin/sh<br />
<br />
# Static Routes<br />
<br />
# route ganglia broadcast traffic via eth0<br />
route add -host 239.2.11.72 dev eth0<br />
<br />
# all traffic to heterogeneous gate goes through eth0<br />
route add -host 192.168.20.254 dev eth0<br />
<br />
# all subnet traffic goes through specific interface<br />
route add -net 192.168.20.0 netmask 255.255.255.0 dev eth0<br />
route add -net 192.168.21.0 netmask 255.255.255.0 dev eth1<br />
</source><br />
<br />
The naming of the script is important: scripts in <code>/etc/network/if-up.d</code> are executed in alphabetical order, and we want our routes in place before the other scripts run.<br />
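As a quick sanity check after boot, the routing entries can be verified against the output of <code>route -n</code>. The helper below is our own sketch (the <code>check_routes</code> name and the piped usage are not part of the original setup); it only checks the two subnet routes from <code>00routes</code>.<br />
<br />
<source lang="bash">
# check_routes: read `route -n` output on stdin and verify the two
# subnet routes from 00routes are present on the right interfaces.
check_routes() {
    awk '$1 == "192.168.20.0" && $8 == "eth0" { a = 1 }
         $1 == "192.168.21.0" && $8 == "eth1" { b = 1 }
         END { exit !(a && b) }'
}

# Usage on a node:
#   route -n | check_routes || echo "static routes missing"
</source>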
<br />
===Hosts===<br />
Change the hosts file so that it does not list the node's hostname; otherwise the stale hostname would confuse nodes cloned from this image.<br />
<source lang="text"><br />
127.0.0.1 localhost<br />
<br />
# The following lines are desirable for IPv6 capable hosts<br />
::1 localhost ip6-localhost ip6-loopback<br />
fe00::0 ip6-localnet<br />
ff00::0 ip6-mcastprefix<br />
ff02::1 ip6-allnodes<br />
ff02::2 ip6-allrouters<br />
</source><br />
<br />
==Ganglia==<br />
Install the ganglia-monitor package. <br />
<br />
Configure ganglia monitor by editing <code>/etc/ganglia/gmond.conf</code> so that it contains:<br />
<br />
<source lang="text">cluster {<br />
name = "HCL Cluster"<br />
owner = "University College Dublin"<br />
latlong = "unspecified"<br />
url = "http://hcl.ucd.ie/"<br />
}<br />
</source><br />
And ...<br />
<source lang="text"><br />
/* Feel free to specify as many udp_send_channels as you like. Gmond<br />
used to only support having a single channel */<br />
udp_send_channel {<br />
mcast_join = 239.2.11.72<br />
port = 8649<br />
ttl = 1<br />
}<br />
<br />
/* You can specify as many udp_recv_channels as you like as well. */<br />
udp_recv_channel {<br />
mcast_join = 239.2.11.72<br />
port = 8649<br />
bind = 239.2.11.72<br />
}<br />
</source><br />
After all packages are installed, execute:<br />
<source lang="text"><br />
service ganglia-monitor restart<br />
</source><br />
<br />
==NIS Client==<br />
<br />
Install nis package.<br />
Set <code>/etc/defaultdomain</code> to contain <code>heterogeneous.ucd.ie</code><br />
Make sure the NIS server has an entry in <code>/etc/hosts</code>; DNS may not be active when the NIS client starts, and we want to ensure that it connects to the server successfully.<br />
192.168.21.254 heterogeneous.ucd.ie heterogeneous<br />
Make sure the file <code>/etc/nsswitch.conf</code> contains:<br />
passwd: compat<br />
group: compat<br />
shadow: compat<br />
Append to the file: <code>/etc/passwd</code> the line <code>+::::::</code><br />
Append to the file: <code>/etc/group</code> the line <code>+:::</code><br />
Append to the file: <code>/etc/shadow</code> the line <code>+::::::::</code><br />
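The three append steps can be made idempotent (safe to re-run on a node that already has the entries) with a small helper; <code>append_once</code> is our own name for this sketch, not part of the original procedure.<br />
<br />
<source lang="bash">
# append_once FILE LINE: append LINE to FILE unless an identical
# line is already present (grep -x: whole line, -F: fixed string).
append_once() {
    grep -qxF -- "$2" "$1" || printf '%s\n' "$2" >> "$1"
}

# Usage (as root on a node):
#   append_once /etc/passwd '+::::::'
#   append_once /etc/group  '+:::'
#   append_once /etc/shadow '+::::::::'
</source>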
<br />
Edit <code>/etc/yp.conf</code> (this wasn't needed before; it is now, with the RAID server):<br />
ypserver 192.168.20.254<br />
<br />
Start the nis service:<br />
service nis start<br />
Check that nis is operating correctly by running the following command:<br />
ypcat passwd<br />
<br />
==NFS==<br />
apt-get install nfs-common portmap<br />
Add the line to <code>/etc/fstab</code><br />
192.168.20.254:/home /home nfs soft,retrans=6 0 0<br />
Set in <code>/etc/default/nfs-common</code> (this wasn't needed before; it is now, with the RAID server):<br />
NEED_IDMAPD=yes<br />
Then:<br />
service nfs-common restart<br />
mount /home<br />
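If the steps are re-run, the fstab line should only be added when <code>/home</code> is not already listed. The helper below is a sketch under our own naming (<code>add_fstab_entry</code> is hypothetical); it keys on field 2 of fstab, the mount point.<br />
<br />
<source lang="bash">
# add_fstab_entry FSTAB ENTRY MOUNTPOINT: append ENTRY unless the
# mount point already has a line in FSTAB (field 2 is the mount point).
add_fstab_entry() {
    awk -v m="$3" '$2 == m { found = 1 } END { exit !found }' "$1" \
        || printf '%s\n' "$2" >> "$1"
}

# Usage:
#   add_fstab_entry /etc/fstab \
#       '192.168.20.254:/home /home nfs soft,retrans=6 0 0' /home
</source>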
<br />
==Torque PBS==<br />
First install PBS on the headnode, [[New_heterogeneous.ucd.ie_install_log#Packages_for_nodes|explained here]]. Then:<br />
./torque-package-mom-linux-i686.sh --install<br />
./torque-package-clients-linux-i686.sh --install<br />
update-rc.d pbs_mom defaults<br />
service pbs_mom start<br />
<br />
==NTP==<br />
Install NTP software:<br />
apt-get install ntp<br />
Edit the configuration so that <code>server heterogeneous.ucd.ie</code> is the sole server entry: comment out any other servers, then restart the NTP service.<br />
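The edit can be scripted: comment out every existing <code>server</code> line and append ours. <code>set_sole_ntp_server</code> is our own helper name; the config path assumed is the stock Debian <code>/etc/ntp.conf</code>.<br />
<br />
<source lang="bash">
# set_sole_ntp_server CONF SERVER: comment out all existing `server`
# lines in CONF, then append SERVER as the only active one.
set_sole_ntp_server() {
    sed -i 's/^server[[:space:]]/#&/' "$1"
    printf 'server %s\n' "$2" >> "$1"
}

# Usage:
#   set_sole_ntp_server /etc/ntp.conf heterogeneous.ucd.ie
#   service ntp restart
</source>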
<br />
=Complications=<br />
<br />
==Hostnames==<br />
Debian does not pull the hostname from the DHCP server. Without intervention, cloned nodes will keep the hostname stored on the root node's image. A bug report on setting the hostname via DHCP can be found [https://bugs.launchpad.net/ubuntu/+source/dhcp3/+bug/90388 here].<br />
<br />
The solution we will use is to add the file <code>/etc/dhcp3/dhclient-exit-hooks.d/hostname</code> with the contents:<br />
<br />
<source lang="bash"><br />
# dhclient exit hooks are sourced by /bin/sh, so use POSIX [ ] rather than [[ ]]<br />
if [ -n "$new_host_name" ]; then<br />
    echo "$new_host_name" > /etc/hostname<br />
    /bin/hostname "$new_host_name"<br />
fi<br />
</source><br />
<br />
The effect is to set the machine's hostname after an interface is configured by dhclient (the DHCP client). Note that the hostname will be set by the last interface configured via DHCP; in the current configuration that is <code>eth0</code>. If an interface is later reconfigured with dhclient, the hostname is reset to the name belonging to that interface.<br />
<br />
Further, the current hostnames for the second interface (<code>eth1</code>) on the nodes are '''invalid'''. They follow the format hcl??_eth1.ucd.ie; however, the '_' character is not permitted in hostnames, and attempting to set such a hostname fails.<br />
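The restriction can be checked before assigning names. <code>is_valid_hostname</code> below is our own sketch of the RFC 952/1123 rule (dot-separated labels of letters, digits and hyphens, with no leading or trailing hyphen); it does not enforce length limits.<br />
<br />
<source lang="bash">
# is_valid_hostname NAME: succeed only if NAME is made of valid
# labels (letters, digits, hyphens; no leading/trailing hyphen).
# Underscores are rejected, which is the case described above.
is_valid_hostname() {
    printf '%s' "$1" | grep -Eq \
        '^[A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?(\.[A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?)*$'
}

# Usage:
#   is_valid_hostname hcl01.ucd.ie        # succeeds
#   is_valid_hostname hcl01_eth1.ucd.ie   # fails: '_' not allowed
</source>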
<br />
==udev and Network Interfaces==<br />
<br />
The udev system attempts to keep network interface names consistent regardless of hardware changes. This may be useful for laptops with wireless cards that are plugged in and out, but it causes problems when installing our root node image across all machines in the cluster. A description of the problem can be read [http://www.ducea.com/2008/09/01/remove-debian-udev-persistent-net-rules/ here].<br />
<br />
The solution is to remove the udev rules for persistent network interfaces and disable the generator script for these rules. On the root cloning node, do the following:<br />
<br />
#Remove the file <code>/etc/udev/rules.d/70-persistent-net.rules</code>.<br />
#Add to the top of the file <code>/lib/udev/rules.d/75-persistent-net-generator.rules</code> the following lines:<br />
<source lang="text"># skip generation of persistent network interfaces<br />
ACTION=="*", GOTO="persistent_net_generator_end"</source><br />
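The two steps can be scripted; <code>disable_persistent_net</code> below is our own sketch, assuming the stock Debian rule file paths shown in the usage comments.<br />
<br />
<source lang="bash">
# disable_persistent_net RULES: prepend the skip lines to the
# generator rules file so no per-MAC interface names are recorded.
disable_persistent_net() {
    tmp=$(mktemp)
    {
        printf '# skip generation of persistent network interfaces\n'
        printf 'ACTION=="*", GOTO="persistent_net_generator_end"\n'
        cat "$1"
    } > "$tmp"
    mv "$tmp" "$1"
}

# Usage on the root cloning node:
#   rm -f /etc/udev/rules.d/70-persistent-net.rules
#   disable_persistent_net /lib/udev/rules.d/75-persistent-net-generator.rules
</source>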
<br />
==Sysstat==<br />
[http://pagesperso-orange.fr/sebastien.godard/ Sysstat] is a useful suite of tools for measuring the performance of different system components. Unfortunately it adds some unhelpful cron entries for collecting a historical set of system performance data. Though these cron entries point to disabled scripts, we will disable them nonetheless. Edit the file <code>/etc/cron.d/sysstat</code> and comment out all lines.<br />
<source lang="text"><br />
# The first element of the path is a directory where the debian-sa1<br />
# script is located<br />
#PATH=/usr/lib/sysstat:/usr/sbin:/usr/sbin:/usr/bin:/sbin:/bin<br />
<br />
# Activity reports every 10 minutes everyday<br />
#5-55/10 * * * * root command -v debian-sa1 > /dev/null && debian-sa1 1 1<br />
<br />
# Additional run at 23:59 to rotate the statistics file<br />
#59 23 * * * root command -v debian-sa1 > /dev/null && debian-sa1 60 2<br />
</source><br />
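Commenting out the file can be done with one sed call; this is our own sketch (it prefixes '#' to every line that is not already a comment or blank).<br />
<br />
<source lang="bash">
# comment_out_all FILE: put '#' in front of every non-blank,
# non-comment line so no cron jobs remain active.
comment_out_all() {
    sed -i 's/^[^#]/#&/' "$1"
}

# Usage:
#   comment_out_all /etc/cron.d/sysstat
</source>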
<br />
=Upgrade September 2013=<br />
The upgrade was motivated by needing a kernel newer than 2.6.35 for PAPI.<br />
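Whether a node's kernel meets the requirement can be checked with GNU sort's version ordering; <code>kernel_newer_than</code> is our own helper name, not part of the original log.<br />
<br />
<source lang="bash">
# kernel_newer_than MIN VER: succeed if version VER is strictly
# newer than MIN, using GNU sort's version ordering (-V).
kernel_newer_than() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -1)" = "$1" ]
}

# Usage:
#   kernel_newer_than 2.6.35 "$(uname -r | cut -d- -f1)" && echo "ok for PAPI"
</source>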
<br />
Before:<br />
uname -a<br />
Linux hcl03.heterogeneous.ucd.ie 2.6.32-5-686 #1 SMP Wed Jan 12 04:01:41 UTC 2011 i686 GNU/Linux<br />
After:<br />
Linux hcl03.heterogeneous.ucd.ie 3.2.0-4-686-pae #1 SMP Debian 3.2.46-1+deb7u1 i686 GNU/Linux<br />
Nodes upgraded (11): hcl01 hcl03 hcl05 hcl06 hcl10 hcl11 hcl12 hcl13 hcl14 hcl15 hcl16<br />
Nodes not upgraded because they were unbootable before the upgrade (5): hcl02 hcl04 hcl07 hcl08 hcl09<br />
<br />
<br />
vi /etc/apt/sources.list<br />
replace squeeze with wheezy<br />
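The manual vi edit is equivalent to a one-line sed; the sketch below uses GNU sed's in-place editing and word boundaries, and <code>switch_codename</code> is our own name.<br />
<br />
<source lang="bash">
# switch_codename FILE OLD NEW: replace every whole-word occurrence
# of the OLD release codename with NEW in an apt sources file.
switch_codename() {
    sed -i "s/\b$2\b/$3/g" "$1"
}

# Usage:
#   switch_codename /etc/apt/sources.list squeeze wheezy
</source>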
<br />
In parallel on all nodes with Cluster SSH (cssh):<br />
apt-get update<br />
apt-get dist-upgrade<br />
reboot<br />
Had to manually boot hcl12 and hcl13 because of a temperature warning.<br />
apt-get autoremove<br />
<br />
NFS was not starting on boot, so the following was added to /etc/rc.local:<br />
mount /home<br />
Changed /etc/fstab line to:<br />
192.168.20.254:/home /home nfs rw,rsize=4096,wsize=4096,hard,intr 0 0<br />
<br />
==Questions:==<br />
Q:Configuration file `/etc/pam.d/sshd' modified (by you or by a script) since installation.<br />
A:keep your currently-installed version<br />
<br />
Q:A new version of configuration file /etc/default/nfs-common is available<br />
A:Keep installed version<br />
<br />
Q:A new version of configuration file /etc/default/grub is available<br />
A: Install the package manager's version.<br />
<br />
==Problems==<br />
During every package installation, the following warnings appeared:<br />
ldconfig: Can't link /opt/lib//opt/lib/libf77blas.so to libf77blas.so<br />
ldconfig: Can't link /opt/lib//opt/lib/libcblas.so to libcblas.so<br />
ldconfig: Can't link /opt/lib//opt/lib/libatlas.so to libatlas.so<br />
<br />
==Installation of PAPI - failed==<br />
get papi-5.2.0.tar.gz<br />
tar -xzvf papi-5.2.0.tar.gz<br />
cd papi-5.2.0/src/<br />
./configure --prefix=/usr/local<br />
make<br />
make test<br />
The <code>make test</code> step failed: the hardware is too old and would require a patched kernel. Abandoned installing PAPI on HCL for now.<br />
<br />
==Installation of Extrae==<br />
apt-get install <br />
apt-get install libunwind7-dev<br />
apt-get install binutils-dev<br />
get extrae-2.4.0.tar.bz2<br />
tar -xjvf extrae-2.4.0.tar.bz2<br />
cd extrae-2.4.0<br />
./configure --with-mpi=/usr/lib/openmpi --with-unwind=/usr --without-dyninst --without-papi --prefix=/usr/local<br />
Note: this turns off a lot of useful Extrae features because of missing packages.<br />
make <br />
make install</div>
<hr />
<div>HCL Nodes will be installed from a clone of a root node, <code>hcl07</code>. The general installation of the root is documented here. There are a number of complications as a result of the cloning process. Solutions to these complications are also explained.<br />
<br />
See also [[HCL_cluster/heterogeneous.ucd.ie_install_log]]<br />
<br />
=General Installation=<br />
Partition filesystem with swap at the end of the disk, size 1GB, equal to maximum of the installed memory on cluster nodes. Root file system occupies the remainder of the disk, EXT4 format.<br />
<br />
Install long list of packages.<br />
<br />
==Networking==<br />
Configure network interface as follows:<br />
<br />
<source lang="text"># This file describes the network interfaces available on your system<br />
# and how to activate them. For more information, see interfaces(5).<br />
<br />
# The loopback network interface<br />
auto lo eth0 eth1<br />
<br />
iface lo inet loopback<br />
<br />
# The primary network interface<br />
allow-hotplug eth0<br />
iface eth0 inet dhcp<br />
<br />
allow-hotplug eth1<br />
iface eth1 inet dhcp<br />
</source><br />
<br />
===Routing Tables===<br />
Add an executable script named <code>00routes</code> to the directory <code>/etc/network/if-up.d</code>. This script will be called after the interfaces listed as <code>auto</code> are brought up on boot (or networking service restart). Note: the DHCP process for eth1 is backgrounded so that the startup of other services can continue. Our routing script adds routes on this interface even though it is not yet fully up. The script outputs some errors but the routing entries remain none-the-less. The script should read as follows:<br />
<br />
<source lang="bash"><br />
#!/bin/sh<br />
<br />
# Static Routes<br />
<br />
# route ganglia broadcast t<br />
route add -host 239.2.11.72 dev eth0<br />
<br />
# all traffic to heterogeneous gate goes through eth0<br />
route add -host 192.168.20.254 dev eth0<br />
<br />
# all subnet traffic goes through specific interface<br />
route add -net 192.168.20.0 netmask 255.255.255.0 dev eth0<br />
route add -net 192.168.21.0 netmask 255.255.255.0 dev eth1<br />
</source><br />
<br />
The naming of the script is important, we want our routes in place before other scripts in the <code>/etc/network/if-up.d</code> directory are executed, the order in which they are executed is alphabetical.<br />
<br />
===Hosts===<br />
Change the hosts file so that it does not list the node's hostname, otherwise this would confuse nodes that are cloned from this image.<br />
<source lang="text"><br />
127.0.0.1 localhost<br />
<br />
# The following lines are desirable for IPv6 capable hosts<br />
::1 localhost ip6-localhost ip6-loopback<br />
fe00::0 ip6-localnet<br />
ff00::0 ip6-mcastprefix<br />
ff02::1 ip6-allnodes<br />
ff02::2 ip6-allrouters<br />
</source><br />
<br />
==Ganglia==<br />
Install the ganglia-monitor package. <br />
<br />
Configure ganglia monitor by editing <code>/etc/ganglia/gmond.conf</code> so that it contains:<br />
<br />
<source lang="text">cluster {<br />
name = "HCL Cluster"<br />
owner = "University College Dublin"<br />
latlong = "unspecified"<br />
url = "http://hcl.ucd.ie/"<br />
}<br />
</source><br />
And ...<br />
<source lang="text"><br />
/* Feel free to specify as many udp_send_channels as you like. Gmond<br />
used to only support having a single channel */<br />
udp_send_channel {<br />
mcast_join = 239.2.11.72<br />
port = 8649<br />
ttl = 1<br />
}<br />
<br />
/* You can specify as many udp_recv_channels as you like as well. */<br />
udp_recv_channel {<br />
mcast_join = 239.2.11.72<br />
port = 8649<br />
bind = 239.2.11.72<br />
}<br />
</source><br />
After all packages are complete execute:<br />
<source lang="text"><br />
service ganglia-monitor restart<br />
</source><br />
<br />
==NIS Client==<br />
<br />
Install nis package.<br />
Set <code>/etc/defaultdomain</code> to contain <code>heterogeneous.ucd.ie</code><br />
Make sure the NIS Server has an entry in <code>/etc/hosts</code>, this is because DNS may not be active when the NIS client is starting and we want to ensure that it connects to the server successful.<br />
192.168.21.254 heterogeneous.ucd.ie heterogeneous<br />
Make sure the file <code>/etc/nsswitch.conf</code> contains:<br />
passwd: compat<br />
group: compat<br />
shadow: compat<br />
Append to the file: <code>/etc/passwd</code> the line <code>+::::::</code><br />
Append to the file: <code>/etc/group</code> the line <code>+:::</code><br />
Append to the file: <code>/etc/shadow</code> the line <code>+::::::::</code><br />
<br />
Edit <code>/etc/yp.conf</code> (this wasn't need before, it is now with raid server)<br />
ypserver 192.168.20.254<br />
<br />
Start the nis service:<br />
service nis start<br />
Check that nis is operating correctly by running the following command:<br />
ypcat passwd<br />
<br />
==NFS==<br />
apt-get install nfs-common portmap<br />
Add the line to <code>/etc/fstab</code><br />
192.168.20.254:/home /home nfs soft,retrans=6 0 0<br />
Set in <code>/etc/default/nfs-common</code> (this wasn't need before, it is now with raid server)<br />
NEED_IDMAPD=yes<br />
Then:<br />
service nfs-common restart<br />
mount /home<br />
<br />
==Torque PBS==<br />
First install PBS on headnode [[New_heterogeneous.ucd.ie_install_log#Packages_for_nodes|explained here]]. Then:<br />
./torque-package-mom-linux-i686.sh --install<br />
./torque-package-clients-linux-i686.sh --install<br />
update-rc.d pbs_mom defaults<br />
service pbs_mom start<br />
<br />
==NTP==<br />
Install NTP software:<br />
apt-get install ntp<br />
Edit configuration and make <code>server heterogeneous.ucd.ie</code> the sole server entry. Comment out any other servers. Restart the NTP service.<br />
<br />
=Complications=<br />
<br />
==Hostnames==<br />
Debian does not pull the hostname from the DHCP Server. Without intervention cloned nodes will keep the hostname stored on the image of the root node. A bug describing the setting of a hostname via DHCP is described [https://bugs.launchpad.net/ubuntu/+source/dhcp3/+bug/90388 here]<br />
<br />
The solution we will use is to add the file <code>/etc/dhcp3/dhclient-exit-hooks.d/hostname</code> with the contents:<br />
<br />
<source lang="bash"><br />
if [[ -n $new_host_name ]]; then<br />
echo "$new_host_name" > /etc/hostname<br />
/bin/hostname $new_host_name<br />
fi<br />
</source><br />
<br />
The effect of this is to set the hostname of the machine after an interface is configured using dhclient (DHCP Client). Note, the hostname of the machine will be set by the last interface that is configured via DHCP, in the current configuration that will be <code>eth0</code>. If an interface is reconfigured using dhclient, the hostname will be reset to the name belonging to that interface.<br />
<br />
Further, the current hostnames for the second interface on nodes <code>eth1</code> are '''invalid'''. They follow the format hcl??_eth1.ucd.ie, however the '_' character is not permitted in hostnames and attempting to set such a hostname fails.<br />
<br />
==udev and Network Interfaces==<br />
<br />
The udev system attempts to keep network interface names consistent regardless of changing hardware. This may be useful for laptops with wirless cards that a plugged in and out, but it causes problems when trying to install our root node image across all machines in the cluster. A description of the problem can be read [http://www.ducea.com/2008/09/01/remove-debian-udev-persistent-net-rules/ here].<br />
<br />
The solution is to remove the udev rules for persistent network interfaces, and disable the generator script for these rules. On the root cloning node do the following<br />
<br />
#remove the file <code>/etc/udev/rules.d/70-persistent-net.rules</code><br />
#and to the top of the file: <code>/lib/udev/rules.d/75-persistent-net-generator.rules</code>, the following lines:<br />
<source lang="text"># skip generation of persistent network interfaces<br />
ACTION=="*", GOTO="persistent_net_generator_end"</source><br />
<br />
==Sysstat==<br />
[http://pagesperso-orange.fr/sebastien.godard/ Sysstat] is a useful suite of tools for measuring performance of different system components. Unfortunately it adds some un-useful cron entries for collecting a historic set of system performance data. Though these cron entries point to disabled scripts, we will disable them none the less. Edit the file <code>/etc/cron.d/sysstat</code> and comment out all lines.<br />
<source lang="text"><br />
# The first element of the path is a directory where the debian-sa1<br />
# script is located<br />
#PATH=/usr/lib/sysstat:/usr/sbin:/usr/sbin:/usr/bin:/sbin:/bin<br />
<br />
# Activity reports every 10 minutes everyday<br />
#5-55/10 * * * * root command -v debian-sa1 > /dev/null && debian-sa1 1 1<br />
<br />
# Additional run at 23:59 to rotate the statistics file<br />
#59 23 * * * root command -v debian-sa1 > /dev/null && debian-sa1 60 2<br />
</source><br />
<br />
=Upgrade September 2013=<br />
Upgrade motivated by needing a kernel newer than 2.6.35 for PAPI<br />
<br />
Before:<br />
uname -a<br />
Linux hcl03.heterogeneous.ucd.ie 2.6.32-5-686 #1 SMP Wed Jan 12 04:01:41 UTC 2011 i686 GNU/Linux<br />
After:<br />
Linux hcl03.heterogeneous.ucd.ie 3.2.0-4-686-pae #1 SMP Debian 3.2.46-1+deb7u1 i686 GNU/Linux<br />
Nodes upgraded (11): hcl01 hcl03 hcl05 hcl06 hcl10 hcl11 hcl12 hcl13 hcl14 hcl15 hcl16<br />
Nodes not upgraded because unbootable before upgrade (5) hcl02 hcl04 hcl07 hcl08 hcl09<br />
<br />
<br />
vi /etc/apt/sources.list<br />
replace squeeze with wheezy<br />
<br />
In parallel on all with Cluster ssh (cssh)<br />
apt-get update<br />
apt-get dist-upgrade<br />
reboot<br />
Had to manually boot hcl12 hcl13 becuase of tempiture warning.<br />
apt-get autoremove<br />
<br />
Questions: <br />
Q:Configuration file `/etc/pam.d/sshd' modified (by you or by a script) since installation.<br />
A:keep your currently-installed version<br />
<br />
Q:A new version of configuration file /etc/default/nfs-common is available<br />
A:Keep installed version<br />
<br />
Q:A new version of configuration file /etc/default/grub is available<br />
A:Install package managers version.<br />
<br />
<br />
<br />
==Installation of PAPI - failed==<br />
get papi-5.2.0.tar.gz<br />
tar -xzvf papi-5.2.0.tar.gz<br />
cd papi-5.2.0/src/<br />
./configure --prefix=/usr/local<br />
make<br />
make test<br />
Make test failed. Hardware too old, would need to patch kernel. Abandoned installing Papi on HCL for now.<br />
<br />
==Installation of Extrae<br />
apt-get install <br />
apt-get install libunwind7-dev<br />
apt-get install binutils-dev<br />
get extrae-2.4.0.tar.bz2<br />
tar -xjvf extrae-2.4.0.tar.bz2<br />
cd extrae-2.4.0<br />
./configure --with-mpi=/usr/lib/openmpi --with-unwind=/usr --without-dyninst --without-papi --prefix=/usr/local<br />
Note - this is a lot of good features of extrae turned off because if missing packages.<br />
make <br />
make install</div>Davepchttps://hcl.ucd.ie/wiki/index.php?title=HCL_cluster/hcl_node_install_configuration_log&diff=824HCL cluster/hcl node install configuration log2013-09-26T19:42:26Z<p>Davepc: /* Upgrade September 2013 */</p>
<hr />
<div>HCL Nodes will be installed from a clone of a root node, <code>hcl07</code>. The general installation of the root is documented here. There are a number of complications as a result of the cloning process. Solutions to these complications are also explained.<br />
<br />
See also [[HCL_cluster/heterogeneous.ucd.ie_install_log]]<br />
<br />
=General Installation=<br />
Partition filesystem with swap at the end of the disk, size 1GB, equal to maximum of the installed memory on cluster nodes. Root file system occupies the remainder of the disk, EXT4 format.<br />
<br />
Install long list of packages.<br />
<br />
==Networking==<br />
Configure network interface as follows:<br />
<br />
<source lang="text"># This file describes the network interfaces available on your system<br />
# and how to activate them. For more information, see interfaces(5).<br />
<br />
# The loopback network interface<br />
auto lo eth0 eth1<br />
<br />
iface lo inet loopback<br />
<br />
# The primary network interface<br />
allow-hotplug eth0<br />
iface eth0 inet dhcp<br />
<br />
allow-hotplug eth1<br />
iface eth1 inet dhcp<br />
</source><br />
<br />
===Routing Tables===<br />
Add an executable script named <code>00routes</code> to the directory <code>/etc/network/if-up.d</code>. This script will be called after the interfaces listed as <code>auto</code> are brought up on boot (or networking service restart). Note: the DHCP process for eth1 is backgrounded so that the startup of other services can continue. Our routing script adds routes on this interface even though it is not yet fully up. The script outputs some errors but the routing entries remain none-the-less. The script should read as follows:<br />
<br />
<source lang="bash"><br />
#!/bin/sh<br />
<br />
# Static Routes<br />
<br />
# route ganglia broadcast t<br />
route add -host 239.2.11.72 dev eth0<br />
<br />
# all traffic to heterogeneous gate goes through eth0<br />
route add -host 192.168.20.254 dev eth0<br />
<br />
# all subnet traffic goes through specific interface<br />
route add -net 192.168.20.0 netmask 255.255.255.0 dev eth0<br />
route add -net 192.168.21.0 netmask 255.255.255.0 dev eth1<br />
</source><br />
<br />
The naming of the script is important, we want our routes in place before other scripts in the <code>/etc/network/if-up.d</code> directory are executed, the order in which they are executed is alphabetical.<br />
<br />
===Hosts===<br />
Change the hosts file so that it does not list the node's hostname, otherwise this would confuse nodes that are cloned from this image.<br />
<source lang="text"><br />
127.0.0.1 localhost<br />
<br />
# The following lines are desirable for IPv6 capable hosts<br />
::1 localhost ip6-localhost ip6-loopback<br />
fe00::0 ip6-localnet<br />
ff00::0 ip6-mcastprefix<br />
ff02::1 ip6-allnodes<br />
ff02::2 ip6-allrouters<br />
</source><br />
<br />
==Ganglia==<br />
Install the ganglia-monitor package. <br />
<br />
Configure ganglia monitor by editing <code>/etc/ganglia/gmond.conf</code> so that it contains:<br />
<br />
<source lang="text">cluster {<br />
name = "HCL Cluster"<br />
owner = "University College Dublin"<br />
latlong = "unspecified"<br />
url = "http://hcl.ucd.ie/"<br />
}<br />
</source><br />
And ...<br />
<source lang="text"><br />
/* Feel free to specify as many udp_send_channels as you like. Gmond<br />
used to only support having a single channel */<br />
udp_send_channel {<br />
mcast_join = 239.2.11.72<br />
port = 8649<br />
ttl = 1<br />
}<br />
<br />
/* You can specify as many udp_recv_channels as you like as well. */<br />
udp_recv_channel {<br />
mcast_join = 239.2.11.72<br />
port = 8649<br />
bind = 239.2.11.72<br />
}<br />
</source><br />
After all packages are complete execute:<br />
<source lang="text"><br />
service ganglia-monitor restart<br />
</source><br />
<br />
==NIS Client==<br />
<br />
Install nis package.<br />
Set <code>/etc/defaultdomain</code> to contain <code>heterogeneous.ucd.ie</code><br />
Make sure the NIS Server has an entry in <code>/etc/hosts</code>, this is because DNS may not be active when the NIS client is starting and we want to ensure that it connects to the server successful.<br />
192.168.21.254 heterogeneous.ucd.ie heterogeneous<br />
Make sure the file <code>/etc/nsswitch.conf</code> contains:<br />
passwd: compat<br />
group: compat<br />
shadow: compat<br />
Append to the file: <code>/etc/passwd</code> the line <code>+::::::</code><br />
Append to the file: <code>/etc/group</code> the line <code>+:::</code><br />
Append to the file: <code>/etc/shadow</code> the line <code>+::::::::</code><br />
<br />
Edit <code>/etc/yp.conf</code> (this wasn't need before, it is now with raid server)<br />
ypserver 192.168.20.254<br />
<br />
Start the nis service:<br />
service nis start<br />
Check that nis is operating correctly by running the following command:<br />
ypcat passwd<br />
<br />
==NFS==<br />
apt-get install nfs-common portmap<br />
Add the line to <code>/etc/fstab</code><br />
192.168.20.254:/home /home nfs soft,retrans=6 0 0<br />
Set in <code>/etc/default/nfs-common</code> (this wasn't need before, it is now with raid server)<br />
NEED_IDMAPD=yes<br />
Then:<br />
service nfs-common restart<br />
mount /home<br />
<br />
==Torque PBS==<br />
First install PBS on headnode [[New_heterogeneous.ucd.ie_install_log#Packages_for_nodes|explained here]]. Then:<br />
./torque-package-mom-linux-i686.sh --install<br />
./torque-package-clients-linux-i686.sh --install<br />
update-rc.d pbs_mom defaults<br />
service pbs_mom start<br />
<br />
==NTP==<br />
Install NTP software:<br />
apt-get install ntp<br />
Edit configuration and make <code>server heterogeneous.ucd.ie</code> the sole server entry. Comment out any other servers. Restart the NTP service.<br />
<br />
=Complications=<br />
<br />
==Hostnames==<br />
Debian does not pull the hostname from the DHCP Server. Without intervention cloned nodes will keep the hostname stored on the image of the root node. A bug describing the setting of a hostname via DHCP is described [https://bugs.launchpad.net/ubuntu/+source/dhcp3/+bug/90388 here]<br />
<br />
The solution we will use is to add the file <code>/etc/dhcp3/dhclient-exit-hooks.d/hostname</code> with the contents:<br />
<br />
<source lang="bash"><br />
if [[ -n $new_host_name ]]; then<br />
echo "$new_host_name" > /etc/hostname<br />
/bin/hostname $new_host_name<br />
fi<br />
</source><br />
<br />
The effect of this is to set the hostname of the machine after an interface is configured using dhclient (DHCP Client). Note, the hostname of the machine will be set by the last interface that is configured via DHCP, in the current configuration that will be <code>eth0</code>. If an interface is reconfigured using dhclient, the hostname will be reset to the name belonging to that interface.<br />
<br />
Further, the current hostnames for the second interface on nodes <code>eth1</code> are '''invalid'''. They follow the format hcl??_eth1.ucd.ie, however the '_' character is not permitted in hostnames and attempting to set such a hostname fails.<br />
<br />
==udev and Network Interfaces==<br />
<br />
The udev system attempts to keep network interface names consistent regardless of changing hardware. This may be useful for laptops with wirless cards that a plugged in and out, but it causes problems when trying to install our root node image across all machines in the cluster. A description of the problem can be read [http://www.ducea.com/2008/09/01/remove-debian-udev-persistent-net-rules/ here].<br />
<br />
The solution is to remove the udev rules for persistent network interfaces, and disable the generator script for these rules. On the root cloning node do the following<br />
<br />
#remove the file <code>/etc/udev/rules.d/70-persistent-net.rules</code><br />
#and to the top of the file: <code>/lib/udev/rules.d/75-persistent-net-generator.rules</code>, the following lines:<br />
<source lang="text"># skip generation of persistent network interfaces<br />
ACTION=="*", GOTO="persistent_net_generator_end"</source><br />
<br />
==Sysstat==<br />
[http://pagesperso-orange.fr/sebastien.godard/ Sysstat] is a useful suite of tools for measuring performance of different system components. Unfortunately it adds some un-useful cron entries for collecting a historic set of system performance data. Though these cron entries point to disabled scripts, we will disable them none the less. Edit the file <code>/etc/cron.d/sysstat</code> and comment out all lines.<br />
<source lang="text"><br />
# The first element of the path is a directory where the debian-sa1<br />
# script is located<br />
#PATH=/usr/lib/sysstat:/usr/sbin:/usr/sbin:/usr/bin:/sbin:/bin<br />
<br />
# Activity reports every 10 minutes everyday<br />
#5-55/10 * * * * root command -v debian-sa1 > /dev/null && debian-sa1 1 1<br />
<br />
# Additional run at 23:59 to rotate the statistics file<br />
#59 23 * * * root command -v debian-sa1 > /dev/null && debian-sa1 60 2<br />
</source><br />
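Commenting out every active line can be done mechanically. A sketch, run here on a temporary copy rather than the real <code>/etc/cron.d/sysstat</code>, assuming GNU sed:<br />

```shell
# Temporary stand-in for /etc/cron.d/sysstat (illustrative content).
CRON=$(mktemp)
printf '%s\n' \
  'PATH=/usr/lib/sysstat:/usr/sbin:/usr/bin:/sbin:/bin' \
  '5-55/10 * * * * root command -v debian-sa1 > /dev/null && debian-sa1 1 1' \
  '# a line that is already a comment' > "$CRON"

# Prefix '#' to every line that does not already start with one;
# blank lines and existing comments are left untouched.
sed -i 's/^\([^#]\)/#\1/' "$CRON"
cat "$CRON"
```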
<br />
=Upgrade September 2013=<br />
Upgrade motivated by the need for a kernel newer than 2.6.35 for PAPI.<br />
<br />
Before:<br />
uname -a<br />
Linux hcl03.heterogeneous.ucd.ie 2.6.32-5-686 #1 SMP Wed Jan 12 04:01:41 UTC 2011 i686 GNU/Linux<br />
After:<br />
Linux hcl03.heterogeneous.ucd.ie 3.2.0-4-686-pae #1 SMP Debian 3.2.46-1+deb7u1 i686 GNU/Linux<br />
Nodes upgraded (11): hcl01 hcl03 hcl05 hcl06 hcl10 hcl11 hcl12 hcl13 hcl14 hcl15 hcl16<br />
Nodes not upgraded because they were unbootable before the upgrade (5): hcl02 hcl04 hcl07 hcl08 hcl09<br />
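Whether a given kernel release satisfies PAPI's requirement can be checked with a version-aware sort. The <code>kernel_ok</code> helper below is an illustrative sketch (GNU <code>sort -V</code> assumed):<br />

```shell
# True when the kernel release string is strictly newer than 2.6.35,
# the cutoff PAPI needs.
kernel_ok() {
  [ "$1" != "2.6.35" ] &&
  [ "$(printf '%s\n' "$1" 2.6.35 | sort -V | head -n 1)" = "2.6.35" ]
}

kernel_ok 2.6.32-5-686    || echo "2.6.32: too old for PAPI"
kernel_ok 3.2.0-4-686-pae && echo "3.2.0: new enough for PAPI"
# On a live node: kernel_ok "$(uname -r)"
```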
<br />
<br />
vi /etc/apt/sources.list<br />
replace squeeze with wheezy<br />
<br />
In parallel on all nodes with Cluster SSH (cssh):<br />
apt-get update<br />
apt-get dist-upgrade<br />
reboot<br />
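The sources.list edit can also be done non-interactively instead of with vi. A sketch on a temporary copy, assuming GNU sed (the mirror URLs are illustrative, not the ones on the nodes):<br />

```shell
# Temporary stand-in for /etc/apt/sources.list (illustrative entries).
SOURCES=$(mktemp)
printf '%s\n' \
  'deb http://ftp.ie.debian.org/debian/ squeeze main' \
  'deb-src http://ftp.ie.debian.org/debian/ squeeze main' > "$SOURCES"

# Replace every whole-word occurrence of the old release name,
# as done by hand before the dist-upgrade.
sed -i 's/\bsqueeze\b/wheezy/g' "$SOURCES"
cat "$SOURCES"
```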
<br />
Questions: <br />
Q: Configuration file <code>/etc/pam.d/sshd</code> modified (by you or by a script) since installation.<br />
A: Keep the currently-installed version.<br />
<br />
Q: A new version of configuration file <code>/etc/default/nfs-common</code> is available.<br />
A: Keep the currently-installed version.<br />
<br />
Q: A new version of configuration file <code>/etc/default/grub</code> is available.<br />
A: Install the package maintainer's version.<br />
<br />
<br />
<br />
==Installation of PAPI - failed==<br />
get papi-5.2.0.tar.gz<br />
tar -xzvf papi-5.2.0.tar.gz<br />
cd papi-5.2.0/src/<br />
./configure --prefix=/usr/local<br />
make<br />
make test<br />
<code>make test</code> failed: the hardware is too old and would require a patched kernel. Abandoned installing PAPI on HCL for now.<br />
<br />
==Installation of Extrae==<br />
apt-get install libunwind7-dev<br />
apt-get install binutils-dev<br />
get extrae-2.4.0.tar.bz2<br />
tar -xjvf extrae-2.4.0.tar.bz2<br />
cd extrae-2.4.0<br />
./configure --with-mpi=/usr/lib/openmpi --with-unwind=/usr --without-dyninst --without-papi --prefix=/usr/local<br />
Note: this turns off a lot of good Extrae features because of missing packages.<br />
make <br />
make install</div>Davepchttps://hcl.ucd.ie/wiki/index.php?title=HCL_cluster/hcl_node_install_configuration_log&diff=823HCL cluster/hcl node install configuration log2013-09-26T19:38:25Z<p>Davepc: /* Upgrade September 2013 */</p>
<hr />
<div>HCL Nodes will be installed from a clone of a root node, <code>hcl07</code>. The general installation of the root is documented here. There are a number of complications as a result of the cloning process. Solutions to these complications are also explained.<br />
<br />
See also [[HCL_cluster/heterogeneous.ucd.ie_install_log]]<br />
<br />
=General Installation=<br />
Partition filesystem with swap at the end of the disk, size 1GB, equal to maximum of the installed memory on cluster nodes. Root file system occupies the remainder of the disk, EXT4 format.<br />
<br />
Install long list of packages.<br />
<br />
==Networking==<br />
Configure network interface as follows:<br />
<br />
<source lang="text"># This file describes the network interfaces available on your system<br />
# and how to activate them. For more information, see interfaces(5).<br />
<br />
# The loopback network interface<br />
auto lo eth0 eth1<br />
<br />
iface lo inet loopback<br />
<br />
# The primary network interface<br />
allow-hotplug eth0<br />
iface eth0 inet dhcp<br />
<br />
allow-hotplug eth1<br />
iface eth1 inet dhcp<br />
</source><br />
<br />
===Routing Tables===<br />
Add an executable script named <code>00routes</code> to the directory <code>/etc/network/if-up.d</code>. This script will be called after the interfaces listed as <code>auto</code> are brought up on boot (or networking service restart). Note: the DHCP process for eth1 is backgrounded so that the startup of other services can continue. Our routing script adds routes on this interface even though it is not yet fully up. The script outputs some errors but the routing entries remain none-the-less. The script should read as follows:<br />
<br />
<source lang="bash"><br />
#!/bin/sh<br />
<br />
# Static Routes<br />
<br />
# route ganglia broadcast t<br />
route add -host 239.2.11.72 dev eth0<br />
<br />
# all traffic to heterogeneous gate goes through eth0<br />
route add -host 192.168.20.254 dev eth0<br />
<br />
# all subnet traffic goes through specific interface<br />
route add -net 192.168.20.0 netmask 255.255.255.0 dev eth0<br />
route add -net 192.168.21.0 netmask 255.255.255.0 dev eth1<br />
</source><br />
<br />
The naming of the script is important, we want our routes in place before other scripts in the <code>/etc/network/if-up.d</code> directory are executed, the order in which they are executed is alphabetical.<br />
<br />
===Hosts===<br />
Change the hosts file so that it does not list the node's hostname, otherwise this would confuse nodes that are cloned from this image.<br />
<source lang="text"><br />
127.0.0.1 localhost<br />
<br />
# The following lines are desirable for IPv6 capable hosts<br />
::1 localhost ip6-localhost ip6-loopback<br />
fe00::0 ip6-localnet<br />
ff00::0 ip6-mcastprefix<br />
ff02::1 ip6-allnodes<br />
ff02::2 ip6-allrouters<br />
</source><br />
<br />
==Ganglia==<br />
Install the ganglia-monitor package. <br />
<br />
Configure ganglia monitor by editing <code>/etc/ganglia/gmond.conf</code> so that it contains:<br />
<br />
<source lang="text">cluster {<br />
name = "HCL Cluster"<br />
owner = "University College Dublin"<br />
latlong = "unspecified"<br />
url = "http://hcl.ucd.ie/"<br />
}<br />
</source><br />
And ...<br />
<source lang="text"><br />
/* Feel free to specify as many udp_send_channels as you like. Gmond<br />
used to only support having a single channel */<br />
udp_send_channel {<br />
mcast_join = 239.2.11.72<br />
port = 8649<br />
ttl = 1<br />
}<br />
<br />
/* You can specify as many udp_recv_channels as you like as well. */<br />
udp_recv_channel {<br />
mcast_join = 239.2.11.72<br />
port = 8649<br />
bind = 239.2.11.72<br />
}<br />
</source><br />
After all packages are complete execute:<br />
<source lang="text"><br />
service ganglia-monitor restart<br />
</source><br />
<br />
==NIS Client==<br />
<br />
Install nis package.<br />
Set <code>/etc/defaultdomain</code> to contain <code>heterogeneous.ucd.ie</code><br />
Make sure the NIS Server has an entry in <code>/etc/hosts</code>, this is because DNS may not be active when the NIS client is starting and we want to ensure that it connects to the server successful.<br />
192.168.21.254 heterogeneous.ucd.ie heterogeneous<br />
Make sure the file <code>/etc/nsswitch.conf</code> contains:<br />
passwd: compat<br />
group: compat<br />
shadow: compat<br />
Append to the file: <code>/etc/passwd</code> the line <code>+::::::</code><br />
Append to the file: <code>/etc/group</code> the line <code>+:::</code><br />
Append to the file: <code>/etc/shadow</code> the line <code>+::::::::</code><br />
<br />
Edit <code>/etc/yp.conf</code> (this wasn't need before, it is now with raid server)<br />
ypserver 192.168.20.254<br />
<br />
Start the nis service:<br />
service nis start<br />
Check that nis is operating correctly by running the following command:<br />
ypcat passwd<br />
<br />
==NFS==<br />
apt-get install nfs-common portmap<br />
Add the line to <code>/etc/fstab</code><br />
192.168.20.254:/home /home nfs soft,retrans=6 0 0<br />
Set in <code>/etc/default/nfs-common</code> (this wasn't need before, it is now with raid server)<br />
NEED_IDMAPD=yes<br />
Then:<br />
service nfs-common restart<br />
mount /home<br />
<br />
==Torque PBS==<br />
First install PBS on headnode [[New_heterogeneous.ucd.ie_install_log#Packages_for_nodes|explained here]]. Then:<br />
./torque-package-mom-linux-i686.sh --install<br />
./torque-package-clients-linux-i686.sh --install<br />
update-rc.d pbs_mom defaults<br />
service pbs_mom start<br />
<br />
==NTP==<br />
Install NTP software:<br />
apt-get install ntp<br />
Edit configuration and make <code>server heterogeneous.ucd.ie</code> the sole server entry. Comment out any other servers. Restart the NTP service.<br />
<br />
=Complications=<br />
<br />
==Hostnames==<br />
Debian does not pull the hostname from the DHCP Server. Without intervention cloned nodes will keep the hostname stored on the image of the root node. A bug describing the setting of a hostname via DHCP is described [https://bugs.launchpad.net/ubuntu/+source/dhcp3/+bug/90388 here]<br />
<br />
The solution we will use is to add the file <code>/etc/dhcp3/dhclient-exit-hooks.d/hostname</code> with the contents:<br />
<br />
<source lang="bash"><br />
if [[ -n $new_host_name ]]; then<br />
echo "$new_host_name" > /etc/hostname<br />
/bin/hostname $new_host_name<br />
fi<br />
</source><br />
<br />
The effect of this is to set the hostname of the machine after an interface is configured using dhclient (DHCP Client). Note, the hostname of the machine will be set by the last interface that is configured via DHCP, in the current configuration that will be <code>eth0</code>. If an interface is reconfigured using dhclient, the hostname will be reset to the name belonging to that interface.<br />
<br />
Further, the current hostnames for the second interface on nodes <code>eth1</code> are '''invalid'''. They follow the format hcl??_eth1.ucd.ie, however the '_' character is not permitted in hostnames and attempting to set such a hostname fails.<br />
<br />
==udev and Network Interfaces==<br />
<br />
The udev system attempts to keep network interface names consistent regardless of changing hardware. This may be useful for laptops with wirless cards that a plugged in and out, but it causes problems when trying to install our root node image across all machines in the cluster. A description of the problem can be read [http://www.ducea.com/2008/09/01/remove-debian-udev-persistent-net-rules/ here].<br />
<br />
The solution is to remove the udev rules for persistent network interfaces, and disable the generator script for these rules. On the root cloning node do the following<br />
<br />
#remove the file <code>/etc/udev/rules.d/70-persistent-net.rules</code><br />
#and to the top of the file: <code>/lib/udev/rules.d/75-persistent-net-generator.rules</code>, the following lines:<br />
<source lang="text"># skip generation of persistent network interfaces<br />
ACTION=="*", GOTO="persistent_net_generator_end"</source><br />
<br />
==Sysstat==<br />
[http://pagesperso-orange.fr/sebastien.godard/ Sysstat] is a useful suite of tools for measuring performance of different system components. Unfortunately it adds some un-useful cron entries for collecting a historic set of system performance data. Though these cron entries point to disabled scripts, we will disable them none the less. Edit the file <code>/etc/cron.d/sysstat</code> and comment out all lines.<br />
<source lang="text"><br />
# The first element of the path is a directory where the debian-sa1<br />
# script is located<br />
#PATH=/usr/lib/sysstat:/usr/sbin:/usr/sbin:/usr/bin:/sbin:/bin<br />
<br />
# Activity reports every 10 minutes everyday<br />
#5-55/10 * * * * root command -v debian-sa1 > /dev/null && debian-sa1 1 1<br />
<br />
# Additional run at 23:59 to rotate the statistics file<br />
#59 23 * * * root command -v debian-sa1 > /dev/null && debian-sa1 60 2<br />
</source><br />
<br />
=Upgrade September 2013=<br />
Upgrade motivated by needing a kernel newer than 2.6.35 for PAPI<br />
<br />
Before:<br />
uname -a<br />
Linux hcl03.heterogeneous.ucd.ie 2.6.32-5-686 #1 SMP Wed Jan 12 04:01:41 UTC 2011 i686 GNU/Linux<br />
After:<br />
Linux hcl03.heterogeneous.ucd.ie 3.2.0-4-686-pae #1 SMP Debian 3.2.46-1+deb7u1 i686 GNU/Linux<br />
Nodes upgraded (11): hcl01 hcl03 hcl05 hcl06 hcl10 hcl11 hcl12 hcl13 hcl14 hcl15 hcl16<br />
Nodes not upgraded because unbootable before upgrade (5) hcl02 hcl04 hcl07 hcl08 hcl09<br />
<br />
<br />
vi /etc/apt/sources.list<br />
replace squeeze with wheezy<br />
<br />
In parallel on all with Cluster ssh (cssh)<br />
apt-get update<br />
apt-get dist-upgrade<br />
reboot<br />
<br />
Questions: <br />
Q:Configuration file `/etc/pam.d/sshd' modified (by you or by a script) since installation.<br />
A:keep your currently-installed version<br />
<br />
Q:A new version of configuration file /etc/default/nfs-common is available<br />
A:Keep installed version<br />
<br />
Q:A new version of configuration file /etc/default/grub is available<br />
A:Install package managers version.<br />
<br />
<br />
<br />
==Installation of PAPI==<br />
get papi-5.2.0.tar.gz<br />
tar -xzvf papi-5.2.0.tar.gz<br />
cd papi-5.2.0/src/<br />
./configure --prefix=/usr/local<br />
make<br />
make test<br />
make install<br />
make test<br />
Make test failed. Hardware too old, would need to patch kernel. Installing Papi on HCL abandoned for now.<br />
<br />
==Installation of Extrae<br />
apt-get install <br />
apt-get install libunwind7-dev<br />
apt-get install binutils-dev<br />
get extrae-2.4.0.tar.bz2<br />
tar -xjvf extrae-2.4.0.tar.bz2<br />
cd extrae-2.4.0<br />
./configure --with-mpi=/usr/lib/openmpi --with-unwind=/usr --without-dyninst --without-papi<br />
Note - this is a lot of good features of extrae turned off because if missing packages.</div>Davepchttps://hcl.ucd.ie/wiki/index.php?title=HCL_cluster/hcl_node_install_configuration_log&diff=822HCL cluster/hcl node install configuration log2013-09-26T18:15:31Z<p>Davepc: /* Upgrade September 2013 */</p>
<hr />
<div>HCL Nodes will be installed from a clone of a root node, <code>hcl07</code>. The general installation of the root is documented here. There are a number of complications as a result of the cloning process. Solutions to these complications are also explained.<br />
<br />
See also [[HCL_cluster/heterogeneous.ucd.ie_install_log]]<br />
<br />
=General Installation=<br />
Partition filesystem with swap at the end of the disk, size 1GB, equal to maximum of the installed memory on cluster nodes. Root file system occupies the remainder of the disk, EXT4 format.<br />
<br />
Install long list of packages.<br />
<br />
==Networking==<br />
Configure network interface as follows:<br />
<br />
<source lang="text"># This file describes the network interfaces available on your system<br />
# and how to activate them. For more information, see interfaces(5).<br />
<br />
# The loopback network interface<br />
auto lo eth0 eth1<br />
<br />
iface lo inet loopback<br />
<br />
# The primary network interface<br />
allow-hotplug eth0<br />
iface eth0 inet dhcp<br />
<br />
allow-hotplug eth1<br />
iface eth1 inet dhcp<br />
</source><br />
<br />
===Routing Tables===<br />
Add an executable script named <code>00routes</code> to the directory <code>/etc/network/if-up.d</code>. This script will be called after the interfaces listed as <code>auto</code> are brought up on boot (or networking service restart). Note: the DHCP process for eth1 is backgrounded so that the startup of other services can continue. Our routing script adds routes on this interface even though it is not yet fully up. The script outputs some errors but the routing entries remain none-the-less. The script should read as follows:<br />
<br />
<source lang="bash"><br />
#!/bin/sh<br />
<br />
# Static Routes<br />
<br />
# route ganglia broadcast t<br />
route add -host 239.2.11.72 dev eth0<br />
<br />
# all traffic to heterogeneous gate goes through eth0<br />
route add -host 192.168.20.254 dev eth0<br />
<br />
# all subnet traffic goes through specific interface<br />
route add -net 192.168.20.0 netmask 255.255.255.0 dev eth0<br />
route add -net 192.168.21.0 netmask 255.255.255.0 dev eth1<br />
</source><br />
<br />
The naming of the script is important, we want our routes in place before other scripts in the <code>/etc/network/if-up.d</code> directory are executed, the order in which they are executed is alphabetical.<br />
<br />
===Hosts===<br />
Change the hosts file so that it does not list the node's hostname, otherwise this would confuse nodes that are cloned from this image.<br />
<source lang="text"><br />
127.0.0.1 localhost<br />
<br />
# The following lines are desirable for IPv6 capable hosts<br />
::1 localhost ip6-localhost ip6-loopback<br />
fe00::0 ip6-localnet<br />
ff00::0 ip6-mcastprefix<br />
ff02::1 ip6-allnodes<br />
ff02::2 ip6-allrouters<br />
</source><br />
<br />
==Ganglia==<br />
Install the ganglia-monitor package. <br />
<br />
Configure ganglia monitor by editing <code>/etc/ganglia/gmond.conf</code> so that it contains:<br />
<br />
<source lang="text">cluster {<br />
name = "HCL Cluster"<br />
owner = "University College Dublin"<br />
latlong = "unspecified"<br />
url = "http://hcl.ucd.ie/"<br />
}<br />
</source><br />
And ...<br />
<source lang="text"><br />
/* Feel free to specify as many udp_send_channels as you like. Gmond<br />
used to only support having a single channel */<br />
udp_send_channel {<br />
mcast_join = 239.2.11.72<br />
port = 8649<br />
ttl = 1<br />
}<br />
<br />
/* You can specify as many udp_recv_channels as you like as well. */<br />
udp_recv_channel {<br />
mcast_join = 239.2.11.72<br />
port = 8649<br />
bind = 239.2.11.72<br />
}<br />
</source><br />
After all packages are complete execute:<br />
<source lang="text"><br />
service ganglia-monitor restart<br />
</source><br />
<br />
==NIS Client==<br />
<br />
Install nis package.<br />
Set <code>/etc/defaultdomain</code> to contain <code>heterogeneous.ucd.ie</code><br />
Make sure the NIS Server has an entry in <code>/etc/hosts</code>, this is because DNS may not be active when the NIS client is starting and we want to ensure that it connects to the server successful.<br />
192.168.21.254 heterogeneous.ucd.ie heterogeneous<br />
Make sure the file <code>/etc/nsswitch.conf</code> contains:<br />
passwd: compat<br />
group: compat<br />
shadow: compat<br />
Append to the file: <code>/etc/passwd</code> the line <code>+::::::</code><br />
Append to the file: <code>/etc/group</code> the line <code>+:::</code><br />
Append to the file: <code>/etc/shadow</code> the line <code>+::::::::</code><br />
<br />
Edit <code>/etc/yp.conf</code> (this wasn't need before, it is now with raid server)<br />
ypserver 192.168.20.254<br />
<br />
Start the nis service:<br />
service nis start<br />
Check that nis is operating correctly by running the following command:<br />
ypcat passwd<br />
<br />
==NFS==<br />
apt-get install nfs-common portmap<br />
Add the line to <code>/etc/fstab</code><br />
192.168.20.254:/home /home nfs soft,retrans=6 0 0<br />
Set in <code>/etc/default/nfs-common</code> (this wasn't need before, it is now with raid server)<br />
NEED_IDMAPD=yes<br />
Then:<br />
service nfs-common restart<br />
mount /home<br />
<br />
==Torque PBS==<br />
First install PBS on headnode [[New_heterogeneous.ucd.ie_install_log#Packages_for_nodes|explained here]]. Then:<br />
./torque-package-mom-linux-i686.sh --install<br />
./torque-package-clients-linux-i686.sh --install<br />
update-rc.d pbs_mom defaults<br />
service pbs_mom start<br />
<br />
==NTP==<br />
Install NTP software:<br />
apt-get install ntp<br />
Edit configuration and make <code>server heterogeneous.ucd.ie</code> the sole server entry. Comment out any other servers. Restart the NTP service.<br />
<br />
=Complications=<br />
<br />
==Hostnames==<br />
Debian does not pull the hostname from the DHCP Server. Without intervention cloned nodes will keep the hostname stored on the image of the root node. A bug describing the setting of a hostname via DHCP is described [https://bugs.launchpad.net/ubuntu/+source/dhcp3/+bug/90388 here]<br />
<br />
The solution we will use is to add the file <code>/etc/dhcp3/dhclient-exit-hooks.d/hostname</code> with the contents:<br />
<br />
<source lang="bash"><br />
if [[ -n $new_host_name ]]; then<br />
echo "$new_host_name" > /etc/hostname<br />
/bin/hostname $new_host_name<br />
fi<br />
</source><br />
<br />
The effect of this is to set the hostname of the machine after an interface is configured using dhclient (DHCP Client). Note, the hostname of the machine will be set by the last interface that is configured via DHCP, in the current configuration that will be <code>eth0</code>. If an interface is reconfigured using dhclient, the hostname will be reset to the name belonging to that interface.<br />
<br />
Further, the current hostnames for the second interface on nodes <code>eth1</code> are '''invalid'''. They follow the format hcl??_eth1.ucd.ie, however the '_' character is not permitted in hostnames and attempting to set such a hostname fails.<br />
<br />
==udev and Network Interfaces==<br />
<br />
The udev system attempts to keep network interface names consistent regardless of changing hardware. This may be useful for laptops with wirless cards that a plugged in and out, but it causes problems when trying to install our root node image across all machines in the cluster. A description of the problem can be read [http://www.ducea.com/2008/09/01/remove-debian-udev-persistent-net-rules/ here].<br />
<br />
The solution is to remove the udev rules for persistent network interfaces, and disable the generator script for these rules. On the root cloning node do the following<br />
<br />
#remove the file <code>/etc/udev/rules.d/70-persistent-net.rules</code><br />
#and to the top of the file: <code>/lib/udev/rules.d/75-persistent-net-generator.rules</code>, the following lines:<br />
<source lang="text"># skip generation of persistent network interfaces<br />
ACTION=="*", GOTO="persistent_net_generator_end"</source><br />
<br />
==Sysstat==<br />
[http://pagesperso-orange.fr/sebastien.godard/ Sysstat] is a useful suite of tools for measuring performance of different system components. Unfortunately it adds some un-useful cron entries for collecting a historic set of system performance data. Though these cron entries point to disabled scripts, we will disable them none the less. Edit the file <code>/etc/cron.d/sysstat</code> and comment out all lines.<br />
<source lang="text"><br />
# The first element of the path is a directory where the debian-sa1<br />
# script is located<br />
#PATH=/usr/lib/sysstat:/usr/sbin:/usr/sbin:/usr/bin:/sbin:/bin<br />
<br />
# Activity reports every 10 minutes everyday<br />
#5-55/10 * * * * root command -v debian-sa1 > /dev/null && debian-sa1 1 1<br />
<br />
# Additional run at 23:59 to rotate the statistics file<br />
#59 23 * * * root command -v debian-sa1 > /dev/null && debian-sa1 60 2<br />
</source><br />
<br />
=Upgrade September 2013=<br />
Upgrade motivated by needing a kernel newer than 2.6.35 for PAPI<br />
<br />
Before:<br />
uname -a<br />
Linux hcl03.heterogeneous.ucd.ie 2.6.32-5-686 #1 SMP Wed Jan 12 04:01:41 UTC 2011 i686 GNU/Linux<br />
After:<br />
Linux hcl03.heterogeneous.ucd.ie 3.2.0-4-686-pae #1 SMP Debian 3.2.46-1+deb7u1 i686 GNU/Linux<br />
Nodes upgraded (11): hcl01 hcl03 hcl05 hcl06 hcl10 hcl11 hcl12 hcl13 hcl14 hcl15 hcl16<br />
Nodes not upgraded because unbootable before upgrade (5) hcl02 hcl04 hcl07 hcl08 hcl09<br />
<br />
<br />
vi /etc/apt/sources.list<br />
replace squeeze with wheezy<br />
<br />
In parallel on all with Cluster ssh (cssh)<br />
apt-get update<br />
apt-get dist-upgrade<br />
reboot<br />
<br />
Questions: <br />
Q:Configuration file `/etc/pam.d/sshd' modified (by you or by a script) since installation.<br />
A:keep your currently-installed version<br />
<br />
Q:A new version of configuration file /etc/default/nfs-common is available<br />
A:Keep installed version<br />
<br />
Q:A new version of configuration file /etc/default/grub is available<br />
A:Install package managers version.<br />
<br />
<br />
<br />
==Installation of PAPI==<br />
get papi-5.2.0.tar.gz<br />
tar -xzvf papi-5.2.0.tar.gz<br />
cd papi-5.2.0/src/<br />
./configure --prefix=/usr/local<br />
make<br />
make test<br />
make install<br />
<br />
==Installation of Extrae<br />
get extrae-2.4.0.tar.bz2<br />
tar -xjvf extrae-2.4.0.tar.bz2<br />
cd extrae-2.4.0<br />
<br />
./configure --with-mpi=/usr/lib/openmpi --with-unwind=/usr --without-dyninst</div>Davepchttps://hcl.ucd.ie/wiki/index.php?title=HCL_cluster/hcl_node_install_configuration_log&diff=821HCL cluster/hcl node install configuration log2013-09-26T18:08:23Z<p>Davepc: /* Upgrade September 2013 */</p>
<hr />
<div>HCL Nodes will be installed from a clone of a root node, <code>hcl07</code>. The general installation of the root is documented here. There are a number of complications as a result of the cloning process. Solutions to these complications are also explained.<br />
<br />
See also [[HCL_cluster/heterogeneous.ucd.ie_install_log]]<br />
<br />
=General Installation=<br />
Partition filesystem with swap at the end of the disk, size 1GB, equal to maximum of the installed memory on cluster nodes. Root file system occupies the remainder of the disk, EXT4 format.<br />
<br />
Install long list of packages.<br />
<br />
==Networking==<br />
Configure network interface as follows:<br />
<br />
<source lang="text"># This file describes the network interfaces available on your system<br />
# and how to activate them. For more information, see interfaces(5).<br />
<br />
# The loopback network interface<br />
auto lo eth0 eth1<br />
<br />
iface lo inet loopback<br />
<br />
# The primary network interface<br />
allow-hotplug eth0<br />
iface eth0 inet dhcp<br />
<br />
allow-hotplug eth1<br />
iface eth1 inet dhcp<br />
</source><br />
<br />
===Routing Tables===<br />
Add an executable script named <code>00routes</code> to the directory <code>/etc/network/if-up.d</code>. This script will be called after the interfaces listed as <code>auto</code> are brought up on boot (or networking service restart). Note: the DHCP process for eth1 is backgrounded so that the startup of other services can continue. Our routing script adds routes on this interface even though it is not yet fully up. The script outputs some errors but the routing entries remain none-the-less. The script should read as follows:<br />
<br />
<source lang="bash"><br />
#!/bin/sh<br />
<br />
# Static Routes<br />
<br />
# route ganglia broadcast t<br />
route add -host 239.2.11.72 dev eth0<br />
<br />
# all traffic to heterogeneous gate goes through eth0<br />
route add -host 192.168.20.254 dev eth0<br />
<br />
# all subnet traffic goes through specific interface<br />
route add -net 192.168.20.0 netmask 255.255.255.0 dev eth0<br />
route add -net 192.168.21.0 netmask 255.255.255.0 dev eth1<br />
</source><br />
<br />
The naming of the script is important, we want our routes in place before other scripts in the <code>/etc/network/if-up.d</code> directory are executed, the order in which they are executed is alphabetical.<br />
<br />
===Hosts===<br />
Change the hosts file so that it does not list the node's hostname, otherwise this would confuse nodes that are cloned from this image.<br />
<source lang="text"><br />
127.0.0.1 localhost<br />
<br />
# The following lines are desirable for IPv6 capable hosts<br />
::1 localhost ip6-localhost ip6-loopback<br />
fe00::0 ip6-localnet<br />
ff00::0 ip6-mcastprefix<br />
ff02::1 ip6-allnodes<br />
ff02::2 ip6-allrouters<br />
</source><br />
<br />
==Ganglia==<br />
Install the ganglia-monitor package. <br />
<br />
Configure ganglia monitor by editing <code>/etc/ganglia/gmond.conf</code> so that it contains:<br />
<br />
<source lang="text">cluster {<br />
name = "HCL Cluster"<br />
owner = "University College Dublin"<br />
latlong = "unspecified"<br />
url = "http://hcl.ucd.ie/"<br />
}<br />
</source><br />
And ...<br />
<source lang="text"><br />
/* Feel free to specify as many udp_send_channels as you like. Gmond<br />
used to only support having a single channel */<br />
udp_send_channel {<br />
mcast_join = 239.2.11.72<br />
port = 8649<br />
ttl = 1<br />
}<br />
<br />
/* You can specify as many udp_recv_channels as you like as well. */<br />
udp_recv_channel {<br />
mcast_join = 239.2.11.72<br />
port = 8649<br />
bind = 239.2.11.72<br />
}<br />
</source><br />
After all packages are complete execute:<br />
<source lang="text"><br />
service ganglia-monitor restart<br />
</source><br />
<br />
==NIS Client==<br />
<br />
Install nis package.<br />
Set <code>/etc/defaultdomain</code> to contain <code>heterogeneous.ucd.ie</code><br />
Make sure the NIS Server has an entry in <code>/etc/hosts</code>, this is because DNS may not be active when the NIS client is starting and we want to ensure that it connects to the server successful.<br />
192.168.21.254 heterogeneous.ucd.ie heterogeneous<br />
Make sure the file <code>/etc/nsswitch.conf</code> contains:<br />
passwd: compat<br />
group: compat<br />
shadow: compat<br />
Append to the file: <code>/etc/passwd</code> the line <code>+::::::</code><br />
Append to the file: <code>/etc/group</code> the line <code>+:::</code><br />
Append to the file: <code>/etc/shadow</code> the line <code>+::::::::</code><br />
<br />
Edit <code>/etc/yp.conf</code> (this wasn't need before, it is now with raid server)<br />
ypserver 192.168.20.254<br />
<br />
Start the nis service:<br />
service nis start<br />
Check that nis is operating correctly by running the following command:<br />
ypcat passwd<br />
<br />
==NFS==<br />
apt-get install nfs-common portmap<br />
Add the line to <code>/etc/fstab</code><br />
192.168.20.254:/home /home nfs soft,retrans=6 0 0<br />
Set in <code>/etc/default/nfs-common</code> (this wasn't need before, it is now with raid server)<br />
NEED_IDMAPD=yes<br />
Then:<br />
service nfs-common restart<br />
mount /home<br />
<br />
==Torque PBS==<br />
First install PBS on headnode [[New_heterogeneous.ucd.ie_install_log#Packages_for_nodes|explained here]]. Then:<br />
./torque-package-mom-linux-i686.sh --install<br />
./torque-package-clients-linux-i686.sh --install<br />
update-rc.d pbs_mom defaults<br />
service pbs_mom start<br />
<br />
==NTP==<br />
Install NTP software:<br />
apt-get install ntp<br />
Edit configuration and make <code>server heterogeneous.ucd.ie</code> the sole server entry. Comment out any other servers. Restart the NTP service.<br />
<br />
=Complications=<br />
<br />
==Hostnames==<br />
Debian does not pull the hostname from the DHCP Server. Without intervention cloned nodes will keep the hostname stored on the image of the root node. A bug describing the setting of a hostname via DHCP is described [https://bugs.launchpad.net/ubuntu/+source/dhcp3/+bug/90388 here]<br />
<br />
The solution we will use is to add the file <code>/etc/dhcp3/dhclient-exit-hooks.d/hostname</code> with the contents:<br />
<br />
<source lang="bash"><br />
if [[ -n $new_host_name ]]; then<br />
echo "$new_host_name" > /etc/hostname<br />
/bin/hostname $new_host_name<br />
fi<br />
</source><br />
<br />
The effect of this is to set the hostname of the machine after an interface is configured using dhclient (DHCP Client). Note, the hostname of the machine will be set by the last interface that is configured via DHCP, in the current configuration that will be <code>eth0</code>. If an interface is reconfigured using dhclient, the hostname will be reset to the name belonging to that interface.<br />
<br />
Further, the current hostnames for the second interface on nodes <code>eth1</code> are '''invalid'''. They follow the format hcl??_eth1.ucd.ie, however the '_' character is not permitted in hostnames and attempting to set such a hostname fails.<br />
<br />
==udev and Network Interfaces==<br />
<br />
The udev system attempts to keep network interface names consistent regardless of changing hardware. This may be useful for laptops with wirless cards that a plugged in and out, but it causes problems when trying to install our root node image across all machines in the cluster. A description of the problem can be read [http://www.ducea.com/2008/09/01/remove-debian-udev-persistent-net-rules/ here].<br />
<br />
The solution is to remove the udev rules for persistent network interfaces and disable the generator script for these rules. On the root cloning node, do the following:<br />
<br />
#remove the file <code>/etc/udev/rules.d/70-persistent-net.rules</code><br />
#add to the top of the file <code>/lib/udev/rules.d/75-persistent-net-generator.rules</code> the following lines:<br />
<source lang="text"># skip generation of persistent network interfaces<br />
ACTION=="*", GOTO="persistent_net_generator_end"</source><br />
<br />
==Sysstat==<br />
[http://pagesperso-orange.fr/sebastien.godard/ Sysstat] is a useful suite of tools for measuring the performance of different system components. Unfortunately, it adds some unhelpful cron entries that collect a historical set of system performance data. Though these cron entries point to disabled scripts, we will disable them nonetheless. Edit the file <code>/etc/cron.d/sysstat</code> and comment out all lines.<br />
<source lang="text"><br />
# The first element of the path is a directory where the debian-sa1<br />
# script is located<br />
#PATH=/usr/lib/sysstat:/usr/sbin:/usr/sbin:/usr/bin:/sbin:/bin<br />
<br />
# Activity reports every 10 minutes everyday<br />
#5-55/10 * * * * root command -v debian-sa1 > /dev/null && debian-sa1 1 1<br />
<br />
# Additional run at 23:59 to rotate the statistics file<br />
#59 23 * * * root command -v debian-sa1 > /dev/null && debian-sa1 60 2<br />
</source><br />
<br />
=Upgrade September 2013=<br />
Upgrade motivated by needing a kernel newer than 2.6.35 for PAPI.<br />
<br />
Before:<br />
uname -a<br />
Linux hcl03.heterogeneous.ucd.ie 2.6.32-5-686 #1 SMP Wed Jan 12 04:01:41 UTC 2011 i686 GNU/Linux<br />
After:<br />
Linux hcl03.heterogeneous.ucd.ie 3.2.0-4-686-pae #1 SMP Debian 3.2.46-1+deb7u1 i686 GNU/Linux<br />
Nodes upgraded (11): hcl01 hcl03 hcl05 hcl06 hcl10 hcl11 hcl12 hcl13 hcl14 hcl15 hcl16<br />
Nodes not upgraded because unbootable before upgrade (5): hcl02 hcl04 hcl07 hcl08 hcl09<br />
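To verify that a node's kernel satisfies the PAPI requirement, the version comparison can be scripted with <code>sort -V</code> (an illustrative sketch, assuming GNU coreutils):<br />

```shell
# Is the running kernel newer than PAPI's 2.6.35 minimum?
required=2.6.35
current=$(uname -r | cut -d- -f1)   # e.g. "3.2.0" from "3.2.0-4-686-pae"
newest=$(printf '%s\n%s\n' "$required" "$current" | sort -V | tail -n1)
if [ "$newest" = "$current" ] && [ "$current" != "$required" ]; then
    echo "kernel $current: new enough for PAPI"
else
    echo "kernel $current: too old for PAPI, upgrade needed"
fi
```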
<br />
<br />
vi /etc/apt/sources.list<br />
replace squeeze with wheezy<br />
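With many nodes, the edit is easier to script than to repeat in vi. A sketch that rehearses the substitution on a scratch copy first (on a real node the target would be <code>/etc/apt/sources.list</code>; the mirror line shown is illustrative):<br />

```shell
# Rehearse the squeeze -> wheezy substitution on a scratch file.
printf 'deb http://ftp.ie.debian.org/debian squeeze main\n' > /tmp/sources.list.test
sed -i 's/squeeze/wheezy/g' /tmp/sources.list.test
cat /tmp/sources.list.test
```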
<br />
In parallel on all with Cluster ssh (cssh)<br />
apt-get update<br />
apt-get dist-upgrade<br />
reboot<br />
<br />
<br />
==Installation of PAPI==<br />
get papi-5.2.0.tar.gz<br />
tar -xzvf papi-5.2.0.tar.gz<br />
cd papi-5.2.0/src/<br />
./configure --prefix=/usr/local<br />
make<br />
make test<br />
make install<br />
<br />
==Installation of Extrae==<br />
get extrae-2.4.0.tar.bz2<br />
tar -xjvf extrae-2.4.0.tar.bz2<br />
cd extrae-2.4.0<br />
./configure</div>
<hr />
<div>HCL Nodes will be installed from a clone of a root node, <code>hcl07</code>. The general installation of the root is documented here. There are a number of complications as a result of the cloning process. Solutions to these complications are also explained.<br />
<br />
See also [[HCL_cluster/heterogeneous.ucd.ie_install_log]]<br />
<br />
=General Installation=<br />
Partition filesystem with swap at the end of the disk, size 1GB, equal to maximum of the installed memory on cluster nodes. Root file system occupies the remainder of the disk, EXT4 format.<br />
<br />
Install long list of packages.<br />
<br />
==Networking==<br />
Configure network interface as follows:<br />
<br />
<source lang="text"># This file describes the network interfaces available on your system<br />
# and how to activate them. For more information, see interfaces(5).<br />
<br />
# The loopback network interface<br />
auto lo eth0 eth1<br />
<br />
iface lo inet loopback<br />
<br />
# The primary network interface<br />
allow-hotplug eth0<br />
iface eth0 inet dhcp<br />
<br />
allow-hotplug eth1<br />
iface eth1 inet dhcp<br />
</source><br />
<br />
===Routing Tables===<br />
Add an executable script named <code>00routes</code> to the directory <code>/etc/network/if-up.d</code>. This script will be called after the interfaces listed as <code>auto</code> are brought up on boot (or networking service restart). Note: the DHCP process for eth1 is backgrounded so that the startup of other services can continue. Our routing script adds routes on this interface even though it is not yet fully up. The script outputs some errors but the routing entries remain none-the-less. The script should read as follows:<br />
<br />
<source lang="bash"><br />
#!/bin/sh<br />
<br />
# Static Routes<br />
<br />
# route ganglia broadcast t<br />
route add -host 239.2.11.72 dev eth0<br />
<br />
# all traffic to heterogeneous gate goes through eth0<br />
route add -host 192.168.20.254 dev eth0<br />
<br />
# all subnet traffic goes through specific interface<br />
route add -net 192.168.20.0 netmask 255.255.255.0 dev eth0<br />
route add -net 192.168.21.0 netmask 255.255.255.0 dev eth1<br />
</source><br />
<br />
The naming of the script is important, we want our routes in place before other scripts in the <code>/etc/network/if-up.d</code> directory are executed, the order in which they are executed is alphabetical.<br />
<br />
===Hosts===<br />
Change the hosts file so that it does not list the node's hostname, otherwise this would confuse nodes that are cloned from this image.<br />
<source lang="text"><br />
127.0.0.1 localhost<br />
<br />
# The following lines are desirable for IPv6 capable hosts<br />
::1 localhost ip6-localhost ip6-loopback<br />
fe00::0 ip6-localnet<br />
ff00::0 ip6-mcastprefix<br />
ff02::1 ip6-allnodes<br />
ff02::2 ip6-allrouters<br />
</source><br />
<br />
==Ganglia==<br />
Install the ganglia-monitor package. <br />
<br />
Configure ganglia monitor by editing <code>/etc/ganglia/gmond.conf</code> so that it contains:<br />
<br />
<source lang="text">cluster {<br />
name = "HCL Cluster"<br />
owner = "University College Dublin"<br />
latlong = "unspecified"<br />
url = "http://hcl.ucd.ie/"<br />
}<br />
</source><br />
And ...<br />
<source lang="text"><br />
/* Feel free to specify as many udp_send_channels as you like. Gmond<br />
used to only support having a single channel */<br />
udp_send_channel {<br />
mcast_join = 239.2.11.72<br />
port = 8649<br />
ttl = 1<br />
}<br />
<br />
/* You can specify as many udp_recv_channels as you like as well. */<br />
udp_recv_channel {<br />
mcast_join = 239.2.11.72<br />
port = 8649<br />
bind = 239.2.11.72<br />
}<br />
</source><br />
After all packages are complete execute:<br />
<source lang="text"><br />
service ganglia-monitor restart<br />
</source><br />
<br />
==NIS Client==<br />
<br />
Install nis package.<br />
Set <code>/etc/defaultdomain</code> to contain <code>heterogeneous.ucd.ie</code><br />
Make sure the NIS Server has an entry in <code>/etc/hosts</code>, this is because DNS may not be active when the NIS client is starting and we want to ensure that it connects to the server successful.<br />
192.168.21.254 heterogeneous.ucd.ie heterogeneous<br />
Make sure the file <code>/etc/nsswitch.conf</code> contains:<br />
passwd: compat<br />
group: compat<br />
shadow: compat<br />
Append to the file: <code>/etc/passwd</code> the line <code>+::::::</code><br />
Append to the file: <code>/etc/group</code> the line <code>+:::</code><br />
Append to the file: <code>/etc/shadow</code> the line <code>+::::::::</code><br />
<br />
Edit <code>/etc/yp.conf</code> (this wasn't need before, it is now with raid server)<br />
ypserver 192.168.20.254<br />
<br />
Start the nis service:<br />
service nis start<br />
Check that nis is operating correctly by running the following command:<br />
ypcat passwd<br />
<br />
==NFS==<br />
apt-get install nfs-common portmap<br />
Add the line to <code>/etc/fstab</code><br />
192.168.20.254:/home /home nfs soft,retrans=6 0 0<br />
Set in <code>/etc/default/nfs-common</code> (this wasn't need before, it is now with raid server)<br />
NEED_IDMAPD=yes<br />
Then:<br />
service nfs-common restart<br />
mount /home<br />
<br />
==Torque PBS==<br />
First install PBS on headnode [[New_heterogeneous.ucd.ie_install_log#Packages_for_nodes|explained here]]. Then:<br />
./torque-package-mom-linux-i686.sh --install<br />
./torque-package-clients-linux-i686.sh --install<br />
update-rc.d pbs_mom defaults<br />
service pbs_mom start<br />
<br />
==NTP==<br />
Install NTP software:<br />
apt-get install ntp<br />
Edit configuration and make <code>server heterogeneous.ucd.ie</code> the sole server entry. Comment out any other servers. Restart the NTP service.<br />
<br />
=Complications=<br />
<br />
==Hostnames==<br />
Debian does not pull the hostname from the DHCP Server. Without intervention cloned nodes will keep the hostname stored on the image of the root node. A bug describing the setting of a hostname via DHCP is described [https://bugs.launchpad.net/ubuntu/+source/dhcp3/+bug/90388 here]<br />
<br />
The solution we will use is to add the file <code>/etc/dhcp3/dhclient-exit-hooks.d/hostname</code> with the contents:<br />
<br />
<source lang="bash"><br />
if [[ -n $new_host_name ]]; then<br />
echo "$new_host_name" > /etc/hostname<br />
/bin/hostname $new_host_name<br />
fi<br />
</source><br />
<br />
The effect of this is to set the hostname of the machine after an interface is configured using dhclient (DHCP Client). Note, the hostname of the machine will be set by the last interface that is configured via DHCP, in the current configuration that will be <code>eth0</code>. If an interface is reconfigured using dhclient, the hostname will be reset to the name belonging to that interface.<br />
<br />
Further, the current hostnames for the second interface on nodes <code>eth1</code> are '''invalid'''. They follow the format hcl??_eth1.ucd.ie, however the '_' character is not permitted in hostnames and attempting to set such a hostname fails.<br />
<br />
==udev and Network Interfaces==<br />
<br />
The udev system attempts to keep network interface names consistent regardless of changing hardware. This may be useful for laptops with wirless cards that a plugged in and out, but it causes problems when trying to install our root node image across all machines in the cluster. A description of the problem can be read [http://www.ducea.com/2008/09/01/remove-debian-udev-persistent-net-rules/ here].<br />
<br />
The solution is to remove the udev rules for persistent network interfaces, and disable the generator script for these rules. On the root cloning node do the following<br />
<br />
#remove the file <code>/etc/udev/rules.d/70-persistent-net.rules</code><br />
#and to the top of the file: <code>/lib/udev/rules.d/75-persistent-net-generator.rules</code>, the following lines:<br />
<source lang="text"># skip generation of persistent network interfaces<br />
ACTION=="*", GOTO="persistent_net_generator_end"</source><br />
<br />
==Sysstat==<br />
[http://pagesperso-orange.fr/sebastien.godard/ Sysstat] is a useful suite of tools for measuring performance of different system components. Unfortunately it adds some un-useful cron entries for collecting a historic set of system performance data. Though these cron entries point to disabled scripts, we will disable them none the less. Edit the file <code>/etc/cron.d/sysstat</code> and comment out all lines.<br />
<source lang="text"><br />
# The first element of the path is a directory where the debian-sa1<br />
# script is located<br />
#PATH=/usr/lib/sysstat:/usr/sbin:/usr/sbin:/usr/bin:/sbin:/bin<br />
<br />
# Activity reports every 10 minutes everyday<br />
#5-55/10 * * * * root command -v debian-sa1 > /dev/null && debian-sa1 1 1<br />
<br />
# Additional run at 23:59 to rotate the statistics file<br />
#59 23 * * * root command -v debian-sa1 > /dev/null && debian-sa1 60 2<br />
</source><br />
<br />
=Upgrade September 2013=<br />
Upgrade motivated by needing a kernel newer than 2.6.35 for PAPI<br />
<br />
Linux hcl03.heterogeneous.ucd.ie 2.6.32-5-686 #1 SMP Wed Jan 12 04:01:41 UTC 2011 i686 GNU/Linux<br />
<br />
Nodes upgraded (11): hcl01 hcl03 hcl05 hcl06 hcl10 hcl11 hcl12 hcl13 hcl14 hcl15 hcl16<br />
Nodes not upgraded because unbootable before upgrade (5) hcl02 hcl04 hcl07 hcl08 hcl09<br />
<br />
<br />
vi /etc/apt/sources.list<br />
replace squeeze with wheezy<br />
<br />
In parallel on all with Cluster ssh (cssh)<br />
apt-get update<br />
apt-get dist-upgrade<br />
reboot<br />
<br />
==Installation of PAPI==<br />
get papi-5.2.0.tar.gz<br />
tar -xzvf papi-5.2.0.tar.gz<br />
cd papi-5.2.0/src/<br />
./configure --prefix=/usr/local<br />
make<br />
make test<br />
make install<br />
<br />
==Installation of Extrae<br />
get extrae-2.4.0.tar.bz2<br />
tar -xjvf extrae-2.4.0.tar.bz2<br />
cd extrae-2.4.0<br />
./configure</div>Davepchttps://hcl.ucd.ie/wiki/index.php?title=HCL_cluster/hcl_node_install_configuration_log&diff=819HCL cluster/hcl node install configuration log2013-09-26T18:03:48Z<p>Davepc: /* Upgrade September 2013 */</p>
<hr />
<div>HCL Nodes will be installed from a clone of a root node, <code>hcl07</code>. The general installation of the root is documented here. There are a number of complications as a result of the cloning process. Solutions to these complications are also explained.<br />
<br />
See also [[HCL_cluster/heterogeneous.ucd.ie_install_log]]<br />
<br />
=General Installation=<br />
Partition filesystem with swap at the end of the disk, size 1GB, equal to maximum of the installed memory on cluster nodes. Root file system occupies the remainder of the disk, EXT4 format.<br />
<br />
Install long list of packages.<br />
<br />
==Networking==<br />
Configure network interface as follows:<br />
<br />
<source lang="text"># This file describes the network interfaces available on your system<br />
# and how to activate them. For more information, see interfaces(5).<br />
<br />
# The loopback network interface<br />
auto lo eth0 eth1<br />
<br />
iface lo inet loopback<br />
<br />
# The primary network interface<br />
allow-hotplug eth0<br />
iface eth0 inet dhcp<br />
<br />
allow-hotplug eth1<br />
iface eth1 inet dhcp<br />
</source><br />
<br />
===Routing Tables===<br />
Add an executable script named <code>00routes</code> to the directory <code>/etc/network/if-up.d</code>. This script will be called after the interfaces listed as <code>auto</code> are brought up on boot (or networking service restart). Note: the DHCP process for eth1 is backgrounded so that the startup of other services can continue. Our routing script adds routes on this interface even though it is not yet fully up. The script outputs some errors but the routing entries remain none-the-less. The script should read as follows:<br />
<br />
<source lang="bash"><br />
#!/bin/sh<br />
<br />
# Static Routes<br />
<br />
# route ganglia broadcast t<br />
route add -host 239.2.11.72 dev eth0<br />
<br />
# all traffic to heterogeneous gate goes through eth0<br />
route add -host 192.168.20.254 dev eth0<br />
<br />
# all subnet traffic goes through specific interface<br />
route add -net 192.168.20.0 netmask 255.255.255.0 dev eth0<br />
route add -net 192.168.21.0 netmask 255.255.255.0 dev eth1<br />
</source><br />
<br />
The naming of the script is important, we want our routes in place before other scripts in the <code>/etc/network/if-up.d</code> directory are executed, the order in which they are executed is alphabetical.<br />
<br />
===Hosts===<br />
Change the hosts file so that it does not list the node's hostname, otherwise this would confuse nodes that are cloned from this image.<br />
<source lang="text"><br />
127.0.0.1 localhost<br />
<br />
# The following lines are desirable for IPv6 capable hosts<br />
::1 localhost ip6-localhost ip6-loopback<br />
fe00::0 ip6-localnet<br />
ff00::0 ip6-mcastprefix<br />
ff02::1 ip6-allnodes<br />
ff02::2 ip6-allrouters<br />
</source><br />
<br />
==Ganglia==<br />
Install the ganglia-monitor package. <br />
<br />
Configure ganglia monitor by editing <code>/etc/ganglia/gmond.conf</code> so that it contains:<br />
<br />
<source lang="text">cluster {<br />
name = "HCL Cluster"<br />
owner = "University College Dublin"<br />
latlong = "unspecified"<br />
url = "http://hcl.ucd.ie/"<br />
}<br />
</source><br />
And ...<br />
<source lang="text"><br />
/* Feel free to specify as many udp_send_channels as you like. Gmond<br />
used to only support having a single channel */<br />
udp_send_channel {<br />
mcast_join = 239.2.11.72<br />
port = 8649<br />
ttl = 1<br />
}<br />
<br />
/* You can specify as many udp_recv_channels as you like as well. */<br />
udp_recv_channel {<br />
mcast_join = 239.2.11.72<br />
port = 8649<br />
bind = 239.2.11.72<br />
}<br />
</source><br />
After all packages are complete execute:<br />
<source lang="text"><br />
service ganglia-monitor restart<br />
</source><br />
<br />
==NIS Client==<br />
<br />
Install nis package.<br />
Set <code>/etc/defaultdomain</code> to contain <code>heterogeneous.ucd.ie</code><br />
Make sure the NIS Server has an entry in <code>/etc/hosts</code>, this is because DNS may not be active when the NIS client is starting and we want to ensure that it connects to the server successful.<br />
192.168.21.254 heterogeneous.ucd.ie heterogeneous<br />
Make sure the file <code>/etc/nsswitch.conf</code> contains:<br />
passwd: compat<br />
group: compat<br />
shadow: compat<br />
Append to the file: <code>/etc/passwd</code> the line <code>+::::::</code><br />
Append to the file: <code>/etc/group</code> the line <code>+:::</code><br />
Append to the file: <code>/etc/shadow</code> the line <code>+::::::::</code><br />
<br />
Edit <code>/etc/yp.conf</code> (this wasn't need before, it is now with raid server)<br />
ypserver 192.168.20.254<br />
<br />
Start the nis service:<br />
service nis start<br />
Check that nis is operating correctly by running the following command:<br />
ypcat passwd<br />
<br />
==NFS==<br />
apt-get install nfs-common portmap<br />
Add the line to <code>/etc/fstab</code><br />
192.168.20.254:/home /home nfs soft,retrans=6 0 0<br />
Set in <code>/etc/default/nfs-common</code> (this wasn't need before, it is now with raid server)<br />
NEED_IDMAPD=yes<br />
Then:<br />
service nfs-common restart<br />
mount /home<br />
<br />
==Torque PBS==<br />
First install PBS on headnode [[New_heterogeneous.ucd.ie_install_log#Packages_for_nodes|explained here]]. Then:<br />
./torque-package-mom-linux-i686.sh --install<br />
./torque-package-clients-linux-i686.sh --install<br />
update-rc.d pbs_mom defaults<br />
service pbs_mom start<br />
<br />
==NTP==<br />
Install NTP software:<br />
apt-get install ntp<br />
Edit configuration and make <code>server heterogeneous.ucd.ie</code> the sole server entry. Comment out any other servers. Restart the NTP service.<br />
<br />
=Complications=<br />
<br />
==Hostnames==<br />
Debian does not pull the hostname from the DHCP Server. Without intervention cloned nodes will keep the hostname stored on the image of the root node. A bug describing the setting of a hostname via DHCP is described [https://bugs.launchpad.net/ubuntu/+source/dhcp3/+bug/90388 here]<br />
<br />
The solution we will use is to add the file <code>/etc/dhcp3/dhclient-exit-hooks.d/hostname</code> with the contents:<br />
<br />
<source lang="bash"><br />
if [[ -n $new_host_name ]]; then<br />
echo "$new_host_name" > /etc/hostname<br />
/bin/hostname $new_host_name<br />
fi<br />
</source><br />
<br />
The effect of this is to set the hostname of the machine after an interface is configured using dhclient (DHCP Client). Note, the hostname of the machine will be set by the last interface that is configured via DHCP, in the current configuration that will be <code>eth0</code>. If an interface is reconfigured using dhclient, the hostname will be reset to the name belonging to that interface.<br />
<br />
Further, the current hostnames for the second interface on nodes <code>eth1</code> are '''invalid'''. They follow the format hcl??_eth1.ucd.ie, however the '_' character is not permitted in hostnames and attempting to set such a hostname fails.<br />
<br />
==udev and Network Interfaces==<br />
<br />
The udev system attempts to keep network interface names consistent regardless of changing hardware. This may be useful for laptops with wirless cards that a plugged in and out, but it causes problems when trying to install our root node image across all machines in the cluster. A description of the problem can be read [http://www.ducea.com/2008/09/01/remove-debian-udev-persistent-net-rules/ here].<br />
<br />
The solution is to remove the udev rules for persistent network interfaces, and disable the generator script for these rules. On the root cloning node do the following<br />
<br />
#remove the file <code>/etc/udev/rules.d/70-persistent-net.rules</code><br />
#and to the top of the file: <code>/lib/udev/rules.d/75-persistent-net-generator.rules</code>, the following lines:<br />
<source lang="text"># skip generation of persistent network interfaces<br />
ACTION=="*", GOTO="persistent_net_generator_end"</source><br />
<br />
==Sysstat==<br />
[http://pagesperso-orange.fr/sebastien.godard/ Sysstat] is a useful suite of tools for measuring performance of different system components. Unfortunately it adds some un-useful cron entries for collecting a historic set of system performance data. Though these cron entries point to disabled scripts, we will disable them none the less. Edit the file <code>/etc/cron.d/sysstat</code> and comment out all lines.<br />
<source lang="text"><br />
# The first element of the path is a directory where the debian-sa1<br />
# script is located<br />
#PATH=/usr/lib/sysstat:/usr/sbin:/usr/sbin:/usr/bin:/sbin:/bin<br />
<br />
# Activity reports every 10 minutes everyday<br />
#5-55/10 * * * * root command -v debian-sa1 > /dev/null && debian-sa1 1 1<br />
<br />
# Additional run at 23:59 to rotate the statistics file<br />
#59 23 * * * root command -v debian-sa1 > /dev/null && debian-sa1 60 2<br />
</source><br />
<br />
=Upgrade September 2013=<br />
Upgrade motivated by needing a kernel newer than 2.6.35 for PAPI<br />
<br />
Nodes upgraded (11): hcl01 hcl03 hcl05 hcl06 hcl10 hcl11 hcl12 hcl13 hcl14 hcl15 hcl16<br />
Nodes not upgraded because unbootable before upgrade (5) hcl02 hcl04 hcl07 hcl08 hcl09<br />
<br />
<br />
vi /etc/apt/sources.list<br />
replace squeeze with wheezy<br />
<br />
In parallel on all with Cluster ssh (cssh)<br />
apt-get update<br />
apt-get dist-upgrade<br />
<br />
<br />
==Installation of PAPI==<br />
get papi-5.2.0.tar.gz<br />
tar -xzvf papi-5.2.0.tar.gz<br />
cd papi-5.2.0/src/<br />
./configure --prefix=/usr/local<br />
make<br />
make test<br />
make install<br />
<br />
==Installation of Extrae<br />
get extrae-2.4.0.tar.bz2<br />
tar -xjvf extrae-2.4.0.tar.bz2<br />
cd extrae-2.4.0<br />
./configure</div>Davepchttps://hcl.ucd.ie/wiki/index.php?title=HCL_cluster/hcl_node_install_configuration_log&diff=818HCL cluster/hcl node install configuration log2013-09-26T17:50:34Z<p>Davepc: </p>
<hr />
<div>HCL Nodes will be installed from a clone of a root node, <code>hcl07</code>. The general installation of the root is documented here. There are a number of complications as a result of the cloning process. Solutions to these complications are also explained.<br />
<br />
See also [[HCL_cluster/heterogeneous.ucd.ie_install_log]]<br />
<br />
=General Installation=<br />
Partition filesystem with swap at the end of the disk, size 1GB, equal to maximum of the installed memory on cluster nodes. Root file system occupies the remainder of the disk, EXT4 format.<br />
<br />
Install long list of packages.<br />
<br />
==Networking==<br />
Configure network interface as follows:<br />
<br />
<source lang="text"># This file describes the network interfaces available on your system<br />
# and how to activate them. For more information, see interfaces(5).<br />
<br />
# The loopback network interface<br />
auto lo eth0 eth1<br />
<br />
iface lo inet loopback<br />
<br />
# The primary network interface<br />
allow-hotplug eth0<br />
iface eth0 inet dhcp<br />
<br />
allow-hotplug eth1<br />
iface eth1 inet dhcp<br />
</source><br />
<br />
===Routing Tables===<br />
Add an executable script named <code>00routes</code> to the directory <code>/etc/network/if-up.d</code>. This script will be called after the interfaces listed as <code>auto</code> are brought up on boot (or networking service restart). Note: the DHCP process for eth1 is backgrounded so that the startup of other services can continue. Our routing script adds routes on this interface even though it is not yet fully up. The script outputs some errors but the routing entries remain none-the-less. The script should read as follows:<br />
<br />
<source lang="bash"><br />
#!/bin/sh<br />
<br />
# Static Routes<br />
<br />
# route ganglia broadcast t<br />
route add -host 239.2.11.72 dev eth0<br />
<br />
# all traffic to heterogeneous gate goes through eth0<br />
route add -host 192.168.20.254 dev eth0<br />
<br />
# all subnet traffic goes through specific interface<br />
route add -net 192.168.20.0 netmask 255.255.255.0 dev eth0<br />
route add -net 192.168.21.0 netmask 255.255.255.0 dev eth1<br />
</source><br />
<br />
The naming of the script is important, we want our routes in place before other scripts in the <code>/etc/network/if-up.d</code> directory are executed, the order in which they are executed is alphabetical.<br />
<br />
===Hosts===<br />
Change the hosts file so that it does not list the node's hostname, otherwise this would confuse nodes that are cloned from this image.<br />
<source lang="text"><br />
127.0.0.1 localhost<br />
<br />
# The following lines are desirable for IPv6 capable hosts<br />
::1 localhost ip6-localhost ip6-loopback<br />
fe00::0 ip6-localnet<br />
ff00::0 ip6-mcastprefix<br />
ff02::1 ip6-allnodes<br />
ff02::2 ip6-allrouters<br />
</source><br />
<br />
==Ganglia==<br />
Install the ganglia-monitor package. <br />
<br />
Configure ganglia monitor by editing <code>/etc/ganglia/gmond.conf</code> so that it contains:<br />
<br />
<source lang="text">cluster {<br />
name = "HCL Cluster"<br />
owner = "University College Dublin"<br />
latlong = "unspecified"<br />
url = "http://hcl.ucd.ie/"<br />
}<br />
</source><br />
And ...<br />
<source lang="text"><br />
/* Feel free to specify as many udp_send_channels as you like. Gmond<br />
used to only support having a single channel */<br />
udp_send_channel {<br />
mcast_join = 239.2.11.72<br />
port = 8649<br />
ttl = 1<br />
}<br />
<br />
/* You can specify as many udp_recv_channels as you like as well. */<br />
udp_recv_channel {<br />
mcast_join = 239.2.11.72<br />
port = 8649<br />
bind = 239.2.11.72<br />
}<br />
</source><br />
After all packages are complete execute:<br />
<source lang="text"><br />
service ganglia-monitor restart<br />
</source><br />
<br />
==NIS Client==<br />
<br />
Install nis package.<br />
Set <code>/etc/defaultdomain</code> to contain <code>heterogeneous.ucd.ie</code><br />
Make sure the NIS Server has an entry in <code>/etc/hosts</code>, this is because DNS may not be active when the NIS client is starting and we want to ensure that it connects to the server successful.<br />
192.168.21.254 heterogeneous.ucd.ie heterogeneous<br />
Make sure the file <code>/etc/nsswitch.conf</code> contains:<br />
passwd: compat<br />
group: compat<br />
shadow: compat<br />
Append to the file: <code>/etc/passwd</code> the line <code>+::::::</code><br />
Append to the file: <code>/etc/group</code> the line <code>+:::</code><br />
Append to the file: <code>/etc/shadow</code> the line <code>+::::::::</code><br />
<br />
Edit <code>/etc/yp.conf</code> (this wasn't need before, it is now with raid server)<br />
ypserver 192.168.20.254<br />
<br />
Start the nis service:<br />
service nis start<br />
Check that nis is operating correctly by running the following command:<br />
ypcat passwd<br />
<br />
==NFS==<br />
apt-get install nfs-common portmap<br />
Add the line to <code>/etc/fstab</code><br />
192.168.20.254:/home /home nfs soft,retrans=6 0 0<br />
Set in <code>/etc/default/nfs-common</code> (this wasn't need before, it is now with raid server)<br />
NEED_IDMAPD=yes<br />
Then:<br />
service nfs-common restart<br />
mount /home<br />
<br />
==Torque PBS==<br />
First install PBS on headnode [[New_heterogeneous.ucd.ie_install_log#Packages_for_nodes|explained here]]. Then:<br />
./torque-package-mom-linux-i686.sh --install<br />
./torque-package-clients-linux-i686.sh --install<br />
update-rc.d pbs_mom defaults<br />
service pbs_mom start<br />
<br />
==NTP==<br />
Install NTP software:<br />
apt-get install ntp<br />
Edit configuration and make <code>server heterogeneous.ucd.ie</code> the sole server entry. Comment out any other servers. Restart the NTP service.<br />
<br />
=Complications=<br />
<br />
==Hostnames==<br />
Debian does not pull the hostname from the DHCP Server. Without intervention cloned nodes will keep the hostname stored on the image of the root node. A bug describing the setting of a hostname via DHCP is described [https://bugs.launchpad.net/ubuntu/+source/dhcp3/+bug/90388 here]<br />
<br />
The solution we will use is to add the file <code>/etc/dhcp3/dhclient-exit-hooks.d/hostname</code> with the contents:<br />
<br />
<source lang="bash"><br />
if [[ -n $new_host_name ]]; then<br />
echo "$new_host_name" > /etc/hostname<br />
/bin/hostname $new_host_name<br />
fi<br />
</source><br />
<br />
The effect of this is to set the hostname of the machine after an interface is configured using dhclient (DHCP Client). Note, the hostname of the machine will be set by the last interface that is configured via DHCP, in the current configuration that will be <code>eth0</code>. If an interface is reconfigured using dhclient, the hostname will be reset to the name belonging to that interface.<br />
<br />
Further, the current hostnames for the second interface on nodes <code>eth1</code> are '''invalid'''. They follow the format hcl??_eth1.ucd.ie, however the '_' character is not permitted in hostnames and attempting to set such a hostname fails.<br />
<br />
==udev and Network Interfaces==<br />
<br />
The udev system attempts to keep network interface names consistent regardless of changing hardware. This may be useful for laptops with wirless cards that a plugged in and out, but it causes problems when trying to install our root node image across all machines in the cluster. A description of the problem can be read [http://www.ducea.com/2008/09/01/remove-debian-udev-persistent-net-rules/ here].<br />
<br />
The solution is to remove the udev rules for persistent network interfaces, and disable the generator script for these rules. On the root cloning node do the following<br />
<br />
#remove the file <code>/etc/udev/rules.d/70-persistent-net.rules</code><br />
#and to the top of the file: <code>/lib/udev/rules.d/75-persistent-net-generator.rules</code>, the following lines:<br />
<source lang="text"># skip generation of persistent network interfaces<br />
ACTION=="*", GOTO="persistent_net_generator_end"</source><br />
<br />
==Sysstat==<br />
[http://pagesperso-orange.fr/sebastien.godard/ Sysstat] is a useful suite of tools for measuring performance of different system components. Unfortunately it adds some un-useful cron entries for collecting a historic set of system performance data. Though these cron entries point to disabled scripts, we will disable them none the less. Edit the file <code>/etc/cron.d/sysstat</code> and comment out all lines.<br />
<source lang="text"><br />
# The first element of the path is a directory where the debian-sa1<br />
# script is located<br />
#PATH=/usr/lib/sysstat:/usr/sbin:/usr/sbin:/usr/bin:/sbin:/bin<br />
<br />
# Activity reports every 10 minutes everyday<br />
#5-55/10 * * * * root command -v debian-sa1 > /dev/null && debian-sa1 1 1<br />
<br />
# Additional run at 23:59 to rotate the statistics file<br />
#59 23 * * * root command -v debian-sa1 > /dev/null && debian-sa1 60 2<br />
</source><br />
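Commenting out every active line can be done with one sed expression; a sketch, rehearsed here on a scratch file (on a real node the file is <code>/etc/cron.d/sysstat</code>):<br />

```shell
# Prefix '#' to every line that is not already a comment (blank lines untouched).
f=$(mktemp)
cat > "$f" <<'EOF'
PATH=/usr/lib/sysstat:/usr/sbin:/usr/bin:/sbin:/bin

5-55/10 * * * * root command -v debian-sa1 > /dev/null && debian-sa1 1 1
EOF
sed -i 's/^\([^#]\)/#\1/' "$f"
cat "$f"
```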
<br />
=Upgrade September 2013=<br />
Upgrade motivated by the need for a kernel newer than 2.6.35.<br />
Nodes upgraded (11): hcl01 hcl03 hcl05 hcl06 hcl10 hcl11 hcl12 hcl13 hcl14 hcl15 hcl16<br />
Nodes not upgraded because they were unbootable before the upgrade (5): hcl02 hcl04 hcl07 hcl08 hcl09<br />
<br />
<br />
vi /etc/apt/sources.list<br />
replace squeeze with wheezy<br />
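The edit can also be done non-interactively; a sketch, rehearsed here on a scratch copy (the real file is <code>/etc/apt/sources.list</code>, and the mirror URL below is illustrative):<br />

```shell
# Switch every squeeze reference to wheezy in one pass.
f=$(mktemp)
echo 'deb http://ftp.ie.debian.org/debian/ squeeze main' > "$f"
sed -i 's/squeeze/wheezy/g' "$f"
cat "$f"   # -> deb http://ftp.ie.debian.org/debian/ wheezy main
```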
<br />
In parallel on all nodes with Cluster SSH (cssh):<br />
apt-get update<br />
apt-get dist-upgrade</div>Davepchttps://hcl.ucd.ie/wiki/index.php?title=HCL_cluster/hcl_node_install_configuration_log&diff=817HCL cluster/hcl node install configuration log2013-09-26T16:43:18Z<p>Davepc: /* Complications */</p>
<hr />
<div>HCL Nodes will be installed from a clone of a root node, <code>hcl07</code>. The general installation of the root is documented here. There are a number of complications as a result of the cloning process. Solutions to these complications are also explained.<br />
<br />
See also [[HCL_cluster/heterogeneous.ucd.ie_install_log]]<br />
<br />
=General Installation=<br />
Partition the filesystem with swap at the end of the disk, sized 1 GB, equal to the maximum installed memory on the cluster nodes. The root file system occupies the remainder of the disk, in EXT4 format.<br />
<br />
Install long list of packages.<br />
<br />
==Networking==<br />
Configure network interface as follows:<br />
<br />
<source lang="text"># This file describes the network interfaces available on your system<br />
# and how to activate them. For more information, see interfaces(5).<br />
<br />
# The loopback network interface<br />
auto lo eth0 eth1<br />
<br />
iface lo inet loopback<br />
<br />
# The primary network interface<br />
allow-hotplug eth0<br />
iface eth0 inet dhcp<br />
<br />
allow-hotplug eth1<br />
iface eth1 inet dhcp<br />
</source><br />
<br />
===Routing Tables===<br />
Add an executable script named <code>00routes</code> to the directory <code>/etc/network/if-up.d</code>. This script will be called after the interfaces listed as <code>auto</code> are brought up on boot (or on a networking service restart). Note: the DHCP process for eth1 is backgrounded so that the startup of other services can continue. Our routing script adds routes on this interface even though it is not yet fully up. The script outputs some errors, but the routing entries remain nonetheless. The script should read as follows:<br />
<br />
<source lang="bash"><br />
#!/bin/sh<br />
<br />
# Static Routes<br />
<br />
# route ganglia broadcast t<br />
route add -host 239.2.11.72 dev eth0<br />
<br />
# all traffic to heterogeneous gate goes through eth0<br />
route add -host 192.168.20.254 dev eth0<br />
<br />
# all subnet traffic goes through specific interface<br />
route add -net 192.168.20.0 netmask 255.255.255.0 dev eth0<br />
route add -net 192.168.21.0 netmask 255.255.255.0 dev eth1<br />
</source><br />
<br />
The naming of the script is important: we want our routes in place before the other scripts in the <code>/etc/network/if-up.d</code> directory run, and the scripts are executed in alphabetical order.<br />
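The ordering claim is easy to check: scripts in this directory are run via run-parts, which executes them in lexical order, so the leading zeroes in <code>00routes</code> make it sort first. A quick illustration on a scratch directory (the neighbouring script names are illustrative):<br />

```shell
# File names sort lexically; "00routes" precedes any letter-named script.
d=$(mktemp -d)
touch "$d/00routes" "$d/avahi-daemon" "$d/mountnfs" "$d/upstart"
ls "$d" | head -n 1   # -> 00routes
```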
<br />
===Hosts===<br />
Change the hosts file so that it does not list the node's hostname; otherwise, this would confuse nodes cloned from this image.<br />
<source lang="text"><br />
127.0.0.1 localhost<br />
<br />
# The following lines are desirable for IPv6 capable hosts<br />
::1 localhost ip6-localhost ip6-loopback<br />
fe00::0 ip6-localnet<br />
ff00::0 ip6-mcastprefix<br />
ff02::1 ip6-allnodes<br />
ff02::2 ip6-allrouters<br />
</source><br />
<br />
==Ganglia==<br />
Install the ganglia-monitor package. <br />
<br />
Configure ganglia monitor by editing <code>/etc/ganglia/gmond.conf</code> so that it contains:<br />
<br />
<source lang="text">cluster {<br />
name = "HCL Cluster"<br />
owner = "University College Dublin"<br />
latlong = "unspecified"<br />
url = "http://hcl.ucd.ie/"<br />
}<br />
</source><br />
And ...<br />
<source lang="text"><br />
/* Feel free to specify as many udp_send_channels as you like. Gmond<br />
used to only support having a single channel */<br />
udp_send_channel {<br />
mcast_join = 239.2.11.72<br />
port = 8649<br />
ttl = 1<br />
}<br />
<br />
/* You can specify as many udp_recv_channels as you like as well. */<br />
udp_recv_channel {<br />
mcast_join = 239.2.11.72<br />
port = 8649<br />
bind = 239.2.11.72<br />
}<br />
</source><br />
After all packages are installed, execute:<br />
<source lang="text"><br />
service ganglia-monitor restart<br />
</source><br />
<br />
==NIS Client==<br />
<br />
Install nis package.<br />
Set <code>/etc/defaultdomain</code> to contain <code>heterogeneous.ucd.ie</code><br />
Make sure the NIS server has an entry in <code>/etc/hosts</code>; DNS may not be active when the NIS client is starting, and we want to ensure that it connects to the server successfully.<br />
192.168.21.254 heterogeneous.ucd.ie heterogeneous<br />
Make sure the file <code>/etc/nsswitch.conf</code> contains:<br />
passwd: compat<br />
group: compat<br />
shadow: compat<br />
Append to the file: <code>/etc/passwd</code> the line <code>+::::::</code><br />
Append to the file: <code>/etc/group</code> the line <code>+:::</code><br />
Append to the file: <code>/etc/shadow</code> the line <code>+::::::::</code><br />
<br />
Edit <code>/etc/yp.conf</code> (this wasn't needed before; it is now, with the RAID server)<br />
ypserver 192.168.20.254<br />
<br />
Start the nis service:<br />
service nis start<br />
Check that nis is operating correctly by running the following command:<br />
ypcat passwd<br />
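The three append edits above are easy to get wrong when repeated across many nodes; a small helper that appends each NIS merge marker only once is sketched below, rehearsed on scratch copies (the real files are <code>/etc/passwd</code>, <code>/etc/group</code> and <code>/etc/shadow</code>; the sample entries are illustrative):<br />

```shell
# Append a line to a file only if it is not already present (idempotent).
append_once() {
    grep -qxF "$2" "$1" || printf '%s\n' "$2" >> "$1"
}

etc=$(mktemp -d)
echo 'root:x:0:0:root:/root:/bin/bash' > "$etc/passwd"
echo 'root:x:0:' > "$etc/group"
echo 'root:*:15000:0:99999:7:::' > "$etc/shadow"

append_once "$etc/passwd" '+::::::'
append_once "$etc/group"  '+:::'
append_once "$etc/shadow" '+::::::::'
append_once "$etc/passwd" '+::::::'   # second call is a no-op

tail -n 1 "$etc/passwd"   # -> +::::::
```

Running the helper twice leaves a single marker line, so the script can be re-run safely on a node that was already configured.<br />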
<br />
==NFS==<br />
apt-get install nfs-common portmap<br />
Add the line to <code>/etc/fstab</code><br />
192.168.20.254:/home /home nfs soft,retrans=6 0 0<br />
Set in <code>/etc/default/nfs-common</code> (this wasn't needed before; it is now, with the RAID server)<br />
NEED_IDMAPD=yes<br />
Then:<br />
service nfs-common restart<br />
mount /home<br />
<br />
==Torque PBS==<br />
First install PBS on headnode [[New_heterogeneous.ucd.ie_install_log#Packages_for_nodes|explained here]]. Then:<br />
./torque-package-mom-linux-i686.sh --install<br />
./torque-package-clients-linux-i686.sh --install<br />
update-rc.d pbs_mom defaults<br />
service pbs_mom start<br />
<br />
==NTP==<br />
Install NTP software:<br />
apt-get install ntp<br />
Edit configuration and make <code>server heterogeneous.ucd.ie</code> the sole server entry. Comment out any other servers. Restart the NTP service.<br />
<br />
=Complications=<br />
<br />
==Hostnames==<br />
Debian does not pull the hostname from the DHCP server. Without intervention, cloned nodes will keep the hostname stored on the image of the root node. A bug report on setting the hostname via DHCP can be found [https://bugs.launchpad.net/ubuntu/+source/dhcp3/+bug/90388 here]<br />
<br />
The solution we will use is to add the file <code>/etc/dhcp3/dhclient-exit-hooks.d/hostname</code> with the contents:<br />
<br />
<source lang="bash"><br />
# dhclient exit hooks are sourced by /bin/sh, so use POSIX test syntax<br />
if [ -n "$new_host_name" ]; then<br />
echo "$new_host_name" > /etc/hostname<br />
/bin/hostname "$new_host_name"<br />
fi<br />
</source><br />
<br />
The effect of this is to set the hostname of the machine after an interface is configured using dhclient (DHCP Client). Note, the hostname of the machine will be set by the last interface that is configured via DHCP, in the current configuration that will be <code>eth0</code>. If an interface is reconfigured using dhclient, the hostname will be reset to the name belonging to that interface.<br />
<br />
Further, the current hostnames for the second interface (<code>eth1</code>) on the nodes are '''invalid'''. They follow the format hcl??_eth1.ucd.ie; however, the '_' character is not permitted in hostnames, and attempting to set such a hostname fails.<br />
<br />
==udev and Network Interfaces==<br />
<br />
The udev system attempts to keep network interface names consistent regardless of changing hardware. This may be useful for laptops with wireless cards that are plugged in and out, but it causes problems when trying to install our root node image across all machines in the cluster. A description of the problem can be read [http://www.ducea.com/2008/09/01/remove-debian-udev-persistent-net-rules/ here].<br />
<br />
The solution is to remove the udev rules for persistent network interfaces, and disable the generator script for these rules. On the root cloning node do the following<br />
<br />
#Remove the file <code>/etc/udev/rules.d/70-persistent-net.rules</code>.<br />
#Add the following lines to the top of the file <code>/lib/udev/rules.d/75-persistent-net-generator.rules</code>:<br />
<source lang="text"># skip generation of persistent network interfaces<br />
ACTION=="*", GOTO="persistent_net_generator_end"</source><br />
<br />
==Sysstat==<br />
[http://pagesperso-orange.fr/sebastien.godard/ Sysstat] is a useful suite of tools for measuring the performance of different system components. Unfortunately, it adds some unhelpful cron entries for collecting a historical set of system performance data. Though these cron entries point to disabled scripts, we will disable them nonetheless. Edit the file <code>/etc/cron.d/sysstat</code> and comment out all lines.<br />
<source lang="text"><br />
# The first element of the path is a directory where the debian-sa1<br />
# script is located<br />
#PATH=/usr/lib/sysstat:/usr/sbin:/usr/sbin:/usr/bin:/sbin:/bin<br />
<br />
# Activity reports every 10 minutes everyday<br />
#5-55/10 * * * * root command -v debian-sa1 > /dev/null && debian-sa1 1 1<br />
<br />
# Additional run at 23:59 to rotate the statistics file<br />
#59 23 * * * root command -v debian-sa1 > /dev/null && debian-sa1 60 2<br />
</source><br />
<br />
=Upgrade September 2013=<br />
Upgrade motivated by the need for a kernel newer than 2.6.35.<br />
<br />
apt-get update</div>Davepchttps://hcl.ucd.ie/wiki/index.php?title=Memory_size,_overcommit,_limit&diff=816Memory size, overcommit, limit2013-09-09T13:59:20Z<p>Davepc: </p>
<hr />
<div>== Paging and the OOM-Killer ==<br />
<br />
Due to the nature of the experiments our group runs, we often induce heavy paging and complete exhaustion of available memory on certain nodes. Linux has a pair of strategies to deal with heavy memory use. The first is overcommitting: a process is allowed to allocate or fork even when there is no more memory available. You can see some interesting numbers here:[http://opsmonkey.blogspot.com/2007/01/linux-memory-overcommit.html]. The assumption is that processes may not use all the memory that they allocate, and that failing on allocation is worse than failing later, when the memory is actually required. More processes can be supported by allowing them to allocate memory (provided they do not use it all). The second part of the strategy is the Out-of-Memory killer (OOM killer). When memory has been exhausted and a process tries to use some 'overcommitted' part of memory, the OOM killer is invoked. Its job is to rank all processes in terms of their memory use, priority, privilege and some other parameters, and then select a process to kill based on the ranks. <br />
<br />
The argument for using overcommit plus the OOM killer is that, rather than failing to allocate memory for some random unlucky process, which would probably terminate as a result, the kernel can instead allow the unlucky process to continue executing and then make a somewhat informed decision on which process to kill. Unfortunately, the behaviour of the OOM killer sometimes causes problems that grind the machine to a complete halt, particularly when it decides to kill system processes. There is a good discussion of the OOM killer here: [http://lwn.net/Articles/104179/] <br />
<br />
For this reason overcommit has been disabled on the HCL cluster. <br />
<br />
cat /proc/sys/vm/overcommit_memory <br />
2<br />
cat /proc/sys/vm/overcommit_ratio <br />
100<br />
<br />
To restore to default overcommit <br />
<br />
# echo 0 &gt; /proc/sys/vm/overcommit_memory<br />
# echo 50 &gt; /proc/sys/vm/overcommit_ratio<br />
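With <code>overcommit_memory=2</code> the kernel enforces a hard commit limit of swap plus <code>overcommit_ratio</code> percent of physical RAM, so the settings above give swap + 100% of RAM. A sketch of that arithmetic, with illustrative sizes matching the 1 GB swap partitions on the nodes:<br />

```shell
# CommitLimit = swap + RAM * overcommit_ratio / 100  (all values in kB here)
ram_kb=1048576      # 1 GiB of RAM (illustrative)
swap_kb=1048576     # 1 GiB swap partition
ratio=100           # as in /proc/sys/vm/overcommit_ratio above
echo $(( swap_kb + ram_kb * ratio / 100 ))   # -> 2097152 kB, i.e. 2 GiB
```

The kernel reports the enforced value as <code>CommitLimit</code> in <code>/proc/meminfo</code>.<br />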
<br />
<br> <br />
<br />
Run the following two programs from [http://www.win.tue.nl/~aeb/linux/lk/lk-9.html www.win.tue.nl/~aeb/linux/lk/lk-9.html] to test the results. <br />
<br />
Demo program 1: allocate memory without using it. <br />
<br />
<source lang="c">#include <stdio.h><br />
#include <stdlib.h><br />
<br />
int main (void) {<br />
int n = 0;<br />
<br />
while (1) {<br />
if (malloc(1<<20) == NULL) {<br />
printf("malloc failure after %d MiB\n", n);<br />
return 0;<br />
}<br />
printf ("got %d MiB\n", ++n);<br />
}<br />
}</source><br> <br />
<br />
Demo program 2: allocate memory and actually touch it all. <br />
<br />
<source lang="c">#include <stdio.h><br />
#include <string.h><br />
#include <stdlib.h><br />
<br />
int main (void) {<br />
int n = 0;<br />
char *p;<br />
<br />
while (1) {<br />
if ((p = malloc(1<<20)) == NULL) {<br />
printf("malloc failure after %d MiB\n", n);<br />
return 0;<br />
}<br />
memset (p, 0, (1<<20));<br />
printf ("got %d MiB\n", ++n);<br />
}<br />
}</source> <br />
<br />
== Manually Limit the Memory on the OS level ==<br />
<br />
as root edit /etc/default/grub <br />
<br />
GRUB_CMDLINE_LINUX_DEFAULT="quiet mem=128M"<br />
<br />
then run the command <br />
<br />
update-grub<br />
reboot</div>Davepchttps://hcl.ucd.ie/wiki/index.php?title=Main_Page&diff=814Main Page2013-09-09T13:54:37Z<p>Davepc: /* Hardware */</p>
<hr />
<div>This site is set up for sharing ideas, findings and experience in heterogeneous computing. Please, log in and create new or edit existing pages. How to format wiki-pages read [[Help:Editing|here]].<br />
<br />
== HCL software for heterogeneous computing ==<br />
* Extensions for [[MPI]]: [http://hcl.ucd.ie/project/mpC mpC] [http://hcl.ucd.ie/project/HeteroMPI HeteroMPI] [http://hcl.ucd.ie/project/libELC libELC]<br />
* Extensions for [http://en.wikipedia.org/wiki/GridRPC GridRPC]: [http://hcl.ucd.ie/project/SmartGridSolve SmartGridSolve] [http://hcl.ucd.ie/project/NI-Connect NI-Connect]<br />
* Computation benchmarking, modeling, dynamic load balancing: [http://hcl.ucd.ie/project/fupermod FuPerMod] [http://hcl.ucd.ie/project/pmm PMM]<br />
* Communication benchmarking, modeling, optimization: [http://hcl.ucd.ie/project/cpm CPM] [http://hcl.ucd.ie/project/mpiblib MPIBlib]<br />
<br />
== Heterogeneous mathematical software ==<br />
* [http://hcl.ucd.ie/project/HeteroScaLAPACK HeteroScaLAPACK]<br />
* [http://hcl.ucd.ie/project/Hydropad Hydropad]<br />
<br />
== Operating systems == <br />
* [[Linux]]<br />
* [[Windows]]<br />
<br />
== Development tools ==<br />
* [[C/C++]], [[Python]], [[UML]], [[FORTRAN]]<br />
* [[Autotools]]<br />
* [[GDB]], [[OProfile]], [[Valgrind]]<br />
* [[Doxygen]]<br />
* [[ChangeLog]], [[Subversion]]<br />
* [[Eclipse]]<br />
* [[Bash Scripts]]<br />
<br />
== [[Libraries]] ==<br />
* [[GNU C Library]]<br />
* [[MPI]]<br />
* [[STL]], [[Boost]]<br />
* [[GSL]]<br />
* [[BLAS LAPACK ScaLAPACK]]<br />
* [[NLOPT]]<br />
* [[BitTorrent (B. Cohen's version)]]<br />
* [[CUDA SDK]]<br />
<br />
== Data processing ==<br />
* [[gnuplot]], [[pgfplot]], [[matplotlib]]<br />
* [[Graphviz]]<br />
* [[Octave]], [[R]]<br />
* [[G3DViewer]]<br />
<br />
== Paper & Presentation Tools ==<br />
* [[Dia]], [[PGF/Tikz]], [[pgfplot]]<br />
* [[LaTeX]], [[Beamer]]<br />
* [[BibTeX]], [[JabRef]]<br />
<br />
== Hardware ==<br />
* [[HCL cluster]]<br />
* [[Other UCD Resources]]<br />
* [[UTK multicores + GPU]]<br />
* [[Grid5000]]<br />
* [[BlueGene/P]]<br />
* [[Desktop Backup]]<br />
* [[Memory size, overcommit, limit]]<br />
<br />
[[SSH|How to connect to cluster via SSH]]<br />
<br />
[[hwloc|How to find information about the hardware]]<br />
<br />
== Mathematics ==<br />
* [http://en.wikipedia.org/wiki/Confidence_interval Confidence interval (Statistics)], [http://en.wikipedia.org/wiki/Student's_t-distribution Student's t-distribution] (implemented in [[GSL]])<br />
* [http://en.wikipedia.org/wiki/Linear_regression Linear regression] (implemented in [[GSL]])<br />
* [http://en.wikipedia.org/wiki/Binomial_tree#Binomial_tree Binomial tree] (use [[Graphviz]] to visualize trees)<br />
* [http://en.wikipedia.org/wiki/Spline_interpolation Spline interpolation], [http://en.wikipedia.org/wiki/B-spline Spline approximation] (implemented in [[GSL]])</div>Davepchttps://hcl.ucd.ie/wiki/index.php?title=HCL_cluster&diff=813HCL cluster2013-09-09T13:53:24Z<p>Davepc: </p>
<hr />
<div>== General Information ==<br />
[[Image:Cluster.jpg|right|thumbnail||HCL Cluster]]<br />
[[Image:network.jpg|right|thumbnail||Layout of the Cluster]]<br />
The HCL cluster is heterogeneous in both computing hardware and networking capability.<br />
<br />
Nodes are from Dell, IBM, and HP, with Celeron, Pentium 4, Xeon, and AMD processors ranging in speed from 1.8 to 3.6 GHz. Accordingly, architectures and parameters such as front side bus, cache, and main memory all vary.<br />
<br />
The operating system used is Debian “squeeze” with Linux kernel 2.6.32.<br />
<br />
The network hardware consists of two Cisco 24+4 port Gigabit switches. Each node has two Gigabit Ethernet ports: each eth0 is connected to the first switch, and each eth1 is connected to the second switch. The switches are also connected to each other. The bandwidth of each port can be configured to any value between 8 Kb/s and 1 Gb/s, which allows testing on a very large number of network topologies. As the bandwidth of the link connecting the two switches can also be configured, the cluster can act as two separate clusters connected by a single link.<br />
<br />
The diagram shows a schematic of the cluster.<br />
<br />
=== Detailed Cluster Specification ===<br />
* [[HCL Cluster Specifications]]<br />
* [[Old HCL Cluster Specifications]] (pre May 2010)<br />
<br />
=== Documentation ===<br />
* [[media:PE750.tgz|Dell Poweredge 750 Documentation]]<br />
* [[media:SC1425.tgz|Dell Poweredge SC1425 Documentation]]<br />
* [[media:X306.pdf|IBM x-Series 306 Documentation]]<br />
* [[media:E326.pdf|IBM e-Series 326 Documentation ]]<br />
* [[media:Proliant100SeriesGuide.pdf|HP Proliant DL-140 G2 Documentation]]<br />
* [[media:ProliantDL320G3Guide.pdf|HP Proliant DL-320 G3 Documentation]]<br />
* [[media:Cisco3560Specs.pdf|Cisco Catalyst 3560 Specifications]]<br />
* [[media:Cisco3560Guide.pdf|Cisco Catalyst 3560 User Guide]]<br />
* [[HCL Cluster Network]]<br />
<br />
== Cluster Administration ==<br />
<br />
If PBS jobs do not start after a reboot of heterogeneous.ucd.ie, it may be necessary to start maui manually:<br />
/usr/local/maui/sbin/maui<br />
<br />
===Useful Tools===<br />
<code>root</code> on <code>heterogeneous.ucd.ie</code> has a number of [http://expect.nist.gov/ Expect] scripts to automate administration on the cluster (in <code>/root/scripts</code>). <code>root_ssh</code> will automatically log into a host, provide the root password and either return a shell to the user or execute a command that is passed as a second argument. Command syntax is as follows:<br />
<br />
<source lang="text"><br />
# root_ssh<br />
usage: root_ssh [user@]<host> [command]<br />
</source><br />
<br />
Example usage, to log in and execute a command on each node in the cluster (note: the file <code>/etc/dsh/machines.list</code> contains the hostnames of all compute nodes of the cluster):<br />
# for i in `seq -w 1 16`; do root_ssh hcl$i ps ax \| grep pbs; done<br />
<br />
The above is sequential. To run parallel jobs, for example: <code>apt-get update && apt-get -y upgrade</code>, try the following trick with [http://www.gnu.org/software/screen/ screen]:<br />
# for i in `seq -w 1 16`; do screen -L -d -m root_ssh hcl$i apt-get update \&\& apt-get -y upgrade; done<br />
You can check the screenlog.* files for errors and delete them when you are happy. Sometimes all logs are sent to screenlog.0; it is not clear why.<br />
<br />
== Software packages available on HCL Cluster 2.0 ==<br />
<br />
With a fresh installation of the operating system on the HCL cluster, the following packages are available:<br />
* autoconf<br />
* automake<br />
* gcc<br />
* ctags<br />
* cg-vg<br />
* fftw2<br />
* git<br />
* gfortran<br />
* gnuplot<br />
* libtool<br />
* netperf<br />
* octave3.2<br />
* qhull<br />
* subversion<br />
* valgrind<br />
* gsl-dev<br />
* vim<br />
* python<br />
* mc<br />
* openmpi-bin <br />
* openmpi-dev<br />
* evince<br />
* libboost-graph-dev<br />
* libboost-serialization-dev<br />
* libatlas-base-dev<br />
* r-cran-strucchange<br />
* graphviz<br />
* doxygen<br />
* colorgcc<br />
<br />
[[HCL_cluster/hcl_node_install_configuration_log|new hcl node install & configuration log]]<br />
<br />
[[HCL_cluster/heterogeneous.ucd.ie_install_log|new heterogeneous.ucd.ie install log]]<br />
<br />
===APT===<br />
To do unattended updates on cluster machines you need to specify some environment variables and switches to apt-get:<br />
<br />
export DEBIAN_FRONTEND=noninteractive<br />
apt-get -q -y upgrade<br />
<br />
NOTE: on hcl01 and hcl02 any updates to grub will force a prompt, despite the switches above. This happens because there are two disks on these machines and grub asks which it should install itself on.<br />
<br />
== Access and Security ==<br />
All access and security for the cluster is handled by the gateway machine (heterogeneous.ucd.ie). This machine is not considered a compute node and should not be used as such. The only new incoming connections allowed are ssh; other incoming packets, such as http, that are responding to requests from inside the cluster (established or related) are also allowed. Incoming ssh packets are only accepted if they originate from designated IP addresses, which must be registered UCD IPs. csserver.ucd.ie is allowed, as is hclgate.ucd.ie, on which all users have accounts. Thus, to gain access to the cluster, you can ssh from csserver, hclgate or another allowed machine to heterogeneous. From there you can ssh to any of the nodes (hcl01-hcl16) on which you are running a PBS job.<br />
<br />
Access from outside the UCD network is only allowed once you have gained entry to a server that allows outside connections (such as csserver.ucd.ie)<br />
<br />
=== Creating new user accounts ===<br />
As root on heterogeneous run:<br />
adduser <username><br />
make -C /var/yp<br />
<br />
=== Access to the nodes is controlled by Torque PBS.===<br />
Use qsub to submit a job, -I is for an interactive session, walltime is time required.<br />
qsub -I -l walltime=1:00:00   # reserve 1 node for 1 hour<br />
qsub -l nodes=hcl01+hcl07,walltime=1:00 myscript.sh<br />
<br />
Example Script:<br />
#!/bin/sh<br />
#General Script<br />
#<br />
#<br />
#These commands set up the Grid Environment for your job:<br />
#PBS -N JOBNAME<br />
#PBS -l walltime=48:00:00<br />
#PBS -l nodes=16<br />
#PBS -m abe<br />
#PBS -k eo<br />
#PBS -V<br />
echo foo<br />
<br />
To see the queue<br />
qstat -n<br />
showq<br />
<br />
To remove your job <br />
qdel JOBNUM<br />
<br />
More info: [http://www.clusterresources.com/products/torque/docs/]<br />
<br />
== Some networking issues on HCL cluster (unsolved) ==<br />
<br />
"/sbin/route" should give:<br />
<br />
Kernel IP routing table<br />
Destination Gateway Genmask Flags Metric Ref Use Iface<br />
239.2.11.72 * 255.255.255.255 UH 0 0 0 eth0<br />
heterogeneous.u * 255.255.255.255 UH 0 0 0 eth0<br />
192.168.21.0 * 255.255.255.0 U 0 0 0 eth1<br />
192.168.20.0 * 255.255.255.0 U 0 0 0 eth0<br />
192.168.20.0 * 255.255.254.0 U 0 0 0 eth0<br />
192.168.20.0 * 255.255.254.0 U 0 0 0 eth1<br />
default heterogeneous.u 0.0.0.0 UG 0 0 0 eth0<br />
<br />
<br />
For unclear reasons, many machines are sometimes missing the entry:<br />
<br />
192.168.21.0 * 255.255.255.0 U 0 0 0 eth1<br />
<br />
For Open MPI, this leads to an inability to complete a sockets "connect" call to any 192.*.21.* address (it hangs up).<br />
In this case, you can <br />
<br />
* switch off eth1 (see also [http://hcl.ucd.ie/wiki/index.php/OpenMPI] ):<br />
<br />
mpirun --mca btl_tcp_if_exclude lo,eth1 ...<br />
<br />
or<br />
<br />
* you can restore the above table on all nodes by running "sh /etc/network/if-up.d/00routes" as root<br />
<br />
It is not yet clear why connections to the "21" addresses fail without this entry. We would expect the following rule to be matched in this case (because of the mask):<br />
192.168.20.0 * 255.255.254.0 U 0 0 0 eth0<br />
<br />
The packets then leave via the eth0 network interface and should travel over switch 1 to switch 2 and on to the eth1 interface of the corresponding node.<br />
<br />
<br />
* If one attempts a ping from one node A, via its eth0 interface, to the address of another node's (B) eth1 interface, the following is observed:<br />
** outgoing ping packets appear only on the eth0 interface of the first node A.<br />
** incoming ping packets appear only on eth1 interface of the second node B.<br />
** outgoing ping response packets appear on the eth0 interface of the second node B, never on the eth1 interface despite pinging the eth1 address specifically.<br />
What explains this? With the routing tables as they are above, or in the damaged case, the ping may arrive at the correct interface, but the response from B is routed to A-eth0 via B-eth0. Further, after a number of ping packets have been sent in sequence (50 to 100), pings from A, though the -i eth0 switch is specified, begin to appear on both A-eth0 and A-eth1. This behaviour is unexpected, but does not affect the return path of the ping response packet.<br />
<br />
<br />
In order to get a symmetric behaviour, where a packet leaves A-eth0, travels via the switch bridge to B-eth1 and returns back from B-eth1 to A-eth0, one must ensure the routing table of B contains no eth0 entries.<br />
<br />
== Paging and the OOM-Killer ==<br />
Please read the [[Virtual Memory Overcommit]] page for details. For the reasons given there, overcommit has been disabled on the cluster.<br />
<br />
cat /proc/sys/vm/overcommit_memory <br />
2<br />
cat /proc/sys/vm/overcommit_ratio <br />
100<br />
<br />
To restore to default overcommit<br />
# echo 0 > /proc/sys/vm/overcommit_memory<br />
# echo 50 > /proc/sys/vm/overcommit_ratio<br />
<br />
== Manually Limit the Memory on the OS level ==<br />
Memory can be manually limited via GRUB. This should be reset when you are finished.<br />
If you are running memory-exhaustive experiments, first check that this has not been adjusted by someone else.<br />
See [[Memory size, overcommit, limit]] for more detail.</div>Davepchttps://hcl.ucd.ie/wiki/index.php?title=Memory_size,_overcommit,_limit&diff=811Memory size, overcommit, limit2013-09-09T13:50:10Z<p>Davepc: moved Virtual memory overcommit to Memory size, overcommit, limit</p>
<hr />
<div>== Paging and the OOM-Killer ==<br />
Due to the nature of the experiments our group runs, we often induce heavy paging and complete exhaustion of available memory on certain nodes. Linux has a pair of strategies to deal with heavy memory use. The first is overcommitting: a process is allowed to allocate or fork even when there is no more memory available. You can see some interesting numbers here:[http://opsmonkey.blogspot.com/2007/01/linux-memory-overcommit.html]. The assumption is that processes may not use all the memory that they allocate, and that failing on allocation is worse than failing later, when the memory is actually required. More processes can be supported by allowing them to allocate memory (provided they do not use it all). The second part of the strategy is the Out-of-Memory killer (OOM killer). When memory has been exhausted and a process tries to use some 'overcommitted' part of memory, the OOM killer is invoked. Its job is to rank all processes in terms of their memory use, priority, privilege and some other parameters, and then select a process to kill based on the ranks.<br />
<br />
The argument for using overcommit plus the OOM killer is that, rather than failing to allocate memory for some random unlucky process, which would probably terminate as a result, the kernel can instead allow the unlucky process to continue executing and then make a somewhat informed decision on which process to kill. Unfortunately, the behaviour of the OOM killer sometimes causes problems that grind the machine to a complete halt, particularly when it decides to kill system processes. There is a good discussion of the OOM killer here: [http://lwn.net/Articles/104179/]<br />
<br />
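The ranking can be inspected at runtime through procfs; a small sketch (standard Linux paths, usable on any node):

```shell
# Print the kernel's OOM "badness" score for this shell and for init;
# a higher score means the process is more likely to be killed.
cat /proc/self/oom_score
cat /proc/1/oom_score
```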
For this reason overcommit has been disabled on the HCL cluster.<br />
cat /proc/sys/vm/overcommit_memory <br />
2<br />
cat /proc/sys/vm/overcommit_ratio <br />
100<br />
<br />
To restore to default overcommit<br />
# echo 0 > /proc/sys/vm/overcommit_memory<br />
# echo 50 > /proc/sys/vm/overcommit_ratio<br />
<br />
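The settings above were applied via /proc and are lost on reboot. A sketch of making them persistent, assuming the stock Debian sysctl handling:

```text
# Append to /etc/sysctl.conf, then apply with `sysctl -p` as root
vm.overcommit_memory = 2
vm.overcommit_ratio = 100
```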
== Manually Limit the Memory on the OS level ==<br />
<br />
As root, edit /etc/default/grub:<br />
 GRUB_CMDLINE_LINUX_DEFAULT="quiet mem=128M"<br />
then run the command:<br />
update-grub<br />
reboot</div>Davepchttps://hcl.ucd.ie/wiki/index.php?title=Virtual_memory_overcommit&diff=812Virtual memory overcommit2013-09-09T13:50:10Z<p>Davepc: moved Virtual memory overcommit to Memory size, overcommit, limit</p>
<hr />
<div>#REDIRECT [[Memory size, overcommit, limit]]</div>Davepchttps://hcl.ucd.ie/wiki/index.php?title=Memory_size,_overcommit,_limit&diff=810Memory size, overcommit, limit2013-09-09T13:29:46Z<p>Davepc: Created page with "== Paging and the OOM-Killer == Due to the nature of experiments our group runs, we often induce heavy paging and complete exhaustion of available memory on certain nodes. Linux …"</p>
<hr />
<div>== Paging and the OOM-Killer ==<br />
Due to the nature of experiments our group runs, we often induce heavy paging and complete exhaustion of available memory on certain nodes. Linux has a pair of strategies to deal with heavy memory use. The first is overcommitting: a process is allowed to allocate or fork even when there is no more memory available. You can see some interesting numbers here:[http://opsmonkey.blogspot.com/2007/01/linux-memory-overcommit.html]. The assumption is that processes may not use all the memory they allocate, and that failing at allocation time is worse than failing later, when the memory is actually used. More processes can be supported by allowing them to allocate memory (provided they do not use it all). The second part of the strategy is the Out-of-Memory killer (OOM killer). When memory has been exhausted and a process tries to use some 'overcommitted' part of memory, the OOM killer is invoked. Its job is to rank all processes in terms of their memory use, priority, privilege and some other parameters, and then select a process to kill based on those ranks.<br />
<br />
The argument for using overcommit plus the OOM killer is that rather than failing to allocate memory for some random unlucky process, which would probably terminate as a result, the kernel can instead allow the unlucky process to continue executing and then make a somewhat informed decision on which process to kill. Unfortunately, the behaviour of the OOM killer sometimes causes problems which grind the machine to a complete halt, particularly when it decides to kill system processes. There is a good discussion of the OOM killer here: [http://lwn.net/Articles/104179/]<br />
<br />
For this reason overcommit has been disabled on the HCL cluster.<br />
cat /proc/sys/vm/overcommit_memory <br />
2<br />
cat /proc/sys/vm/overcommit_ratio <br />
100<br />
<br />
To restore to default overcommit<br />
# echo 0 > /proc/sys/vm/overcommit_memory<br />
# echo 50 > /proc/sys/vm/overcommit_ratio<br />
<br />
== Manually Limit the Memory on the OS level ==<br />
<br />
As root, edit /etc/default/grub:<br />
 GRUB_CMDLINE_LINUX_DEFAULT="quiet mem=128M"<br />
then run the command:<br />
update-grub<br />
reboot</div>Davepchttps://hcl.ucd.ie/wiki/index.php?title=HCL_cluster&diff=809HCL cluster2013-09-09T13:28:22Z<p>Davepc: </p>
<hr />
<div>== General Information ==<br />
[[Image:Cluster.jpg|right|thumbnail||HCL Cluster]]<br />
[[Image:network.jpg|right|thumbnail||Layout of the Cluster]]<br />
The HCL cluster is heterogeneous in both computing hardware and network capability.<br />
<br />
Nodes are from Dell, IBM, and HP, with Celeron, Pentium 4, Xeon, and AMD processors ranging in speed from 1.8 to 3.6 GHz. Accordingly, architectures and parameters such as front-side bus, cache, and main memory all vary.<br />
<br />
The operating system used is Debian “squeeze” with Linux kernel 2.6.32.<br />
<br />
The network hardware consists of two Cisco 24+4 port Gigabit switches. Each node has two Gigabit ethernet ports - each eth0 is connected to the first switch, and each eth1 is connected to the second switch. The switches are also connected to each other. The bandwidth of each port can be configured to any value between 8Kb/s and 1Gb/s, which allows testing on a very large number of network topologies. As the bandwidth of the link connecting the two switches can also be configured, the cluster can act as two separate clusters connected by a single link.<br />
<br />
The diagram shows a schematic of the cluster.<br />
<br />
=== Detailed Cluster Specification ===<br />
* [[HCL Cluster Specifications]]<br />
* [[Old HCL Cluster Specifications]] (pre May 2010)<br />
<br />
=== Documentation ===<br />
* [[media:PE750.tgz|Dell Poweredge 750 Documentation]]<br />
* [[media:SC1425.tgz|Dell Poweredge SC1425 Documentation]]<br />
* [[media:X306.pdf|IBM x-Series 306 Documentation]]<br />
* [[media:E326.pdf|IBM e-Series 326 Documentation ]]<br />
* [[media:Proliant100SeriesGuide.pdf|HP Proliant DL-140 G2 Documentation]]<br />
* [[media:ProliantDL320G3Guide.pdf|HP Proliant DL-320 G3 Documentation]]<br />
* [[media:Cisco3560Specs.pdf|Cisco Catalyst 3560 Specifications]]<br />
* [[media:Cisco3560Guide.pdf|Cisco Catalyst 3560 User Guide]]<br />
* [[HCL Cluster Network]]<br />
<br />
== Cluster Administration ==<br />
<br />
If PBS jobs do not start after a reboot of heterogeneous.ucd.ie, it may be necessary to start maui manually:<br />
/usr/local/maui/sbin/maui<br />
<br />
===Useful Tools===<br />
<code>root</code> on <code>heterogeneous.ucd.ie</code> has a number of [http://expect.nist.gov/ Expect] scripts to automate administration on the cluster (in <code>/root/scripts</code>). <code>root_ssh</code> will automatically log into a host, provide the root password and either return a shell to the user or execute a command that is passed as a second argument. Command syntax is as follows:<br />
<br />
<source lang="text"><br />
# root_ssh<br />
usage: root_ssh [user@]<host> [command]<br />
</source><br />
<br />
Example usage, to log in and execute a command on each node in the cluster (note: the file <code>/etc/dsh/machines.list</code> contains the hostnames of all compute nodes of the cluster):<br />
# for i in `seq -w 1 16`; do root_ssh hcl$i ps ax \| grep pbs; done<br />
<br />
The above is sequential. To run parallel jobs, for example: <code>apt-get update && apt-get -y upgrade</code>, try the following trick with [http://www.gnu.org/software/screen/ screen]:<br />
# for i in `seq -w 1 16`; do screen -L -d -m root_ssh hcl$i apt-get update \&\& apt-get -y upgrade; done<br />
You can check the screenlog.* files for errors and delete them when you are satisfied. Sometimes all logs are sent to screenlog.0; the reason is unclear.<br />
<br />
== Software packages available on HCL Cluster 2.0 ==<br />
<br />
With a fresh installation of the operating system on the HCL cluster, the following packages are available:<br />
* autoconf<br />
* automake<br />
* gcc<br />
* ctags<br />
* cg-vg<br />
* fftw2<br />
* git<br />
* gfortran<br />
* gnuplot<br />
* libtool<br />
* netperf<br />
* octave3.2<br />
* qhull<br />
* subversion<br />
* valgrind<br />
* gsl-dev<br />
* vim<br />
* python<br />
* mc<br />
* openmpi-bin <br />
* openmpi-dev<br />
* evince<br />
* libboost-graph-dev<br />
* libboost-serialization-dev<br />
* libatlas-base-dev<br />
* r-cran-strucchange<br />
* graphviz<br />
* doxygen<br />
* colorgcc<br />
<br />
[[HCL_cluster/hcl_node_install_configuration_log|new hcl node install & configuration log]]<br />
<br />
[[HCL_cluster/heterogeneous.ucd.ie_install_log|new heterogeneous.ucd.ie install log]]<br />
<br />
===APT===<br />
To do unattended updates on cluster machines you need to specify some environment variables and switches to apt-get:<br />
<br />
 export DEBIAN_FRONTEND=noninteractive<br />
 apt-get -q -y upgrade<br />
<br />
NOTE: on hcl01 and hcl02 any updates to grub will force a prompt, despite the switches above. This happens because there are two disks on these machines and grub asks which it should install itself on.<br />
<br />
== Access and Security ==<br />
All access and security for the cluster is handled by the gateway machine (heterogeneous.ucd.ie). This machine is not considered a compute node and should not be used as such. The only new incoming connections allowed are ssh; other incoming packets, such as http, that are responses to requests from inside the cluster (established or related) are also allowed. Incoming ssh packets are only accepted if they originate from designated IP addresses. These IPs must be registered UCD IPs. csserver.ucd.ie is allowed, as is hclgate.ucd.ie, on which all users have accounts. Thus, to gain access to the cluster you can ssh from csserver, hclgate or other allowed machines to heterogeneous. From there you can ssh to any of the nodes (hcl01-hcl16) on which you are running a pbs job.<br />
<br />
Access from outside the UCD network is only allowed once you have gained entry to a server that allows outside connections (such as csserver.ucd.ie).<br />
<br />
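The hop through an allowed machine can be automated in your ssh client configuration. A sketch only (assumes OpenSSH 5.4+ for the -W option and that your username is the same on both hosts; adjust as needed):

```text
# ~/.ssh/config on your machine outside UCD (hypothetical example)
Host heterogeneous
    HostName heterogeneous.ucd.ie
    ProxyCommand ssh -W %h:%p csserver.ucd.ie
```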
=== Creating new user accounts ===<br />
As root on heterogeneous run:<br />
adduser <username><br />
make -C /var/yp<br />
<br />
=== Access to the nodes is controlled by Torque PBS.===<br />
Use qsub to submit a job: -I requests an interactive session, and walltime is the time required.<br />
 qsub -I -l walltime=1:00 # Reserve 1 node for 1 hour<br />
qsub -l nodes=hcl01+hcl07,walltime=1:00 myscript.sh<br />
<br />
Example Script:<br />
#!/bin/sh<br />
#General Script<br />
#<br />
#<br />
#These commands set up the Grid Environment for your job:<br />
#PBS -N JOBNAME<br />
#PBS -l walltime=48:00:00<br />
#PBS -l nodes=16<br />
#PBS -m abe<br />
#PBS -k eo<br />
#PBS -V<br />
echo foo<br />
<br />
To see the queue:<br />
qstat -n<br />
showq<br />
<br />
To remove your job <br />
qdel JOBNUM<br />
<br />
More info: [http://www.clusterresources.com/products/torque/docs/]<br />
<br />
== Some networking issues on HCL cluster (unsolved) ==<br />
<br />
"/sbin/route" should give:<br />
<br />
Kernel IP routing table<br />
Destination Gateway Genmask Flags Metric Ref Use Iface<br />
239.2.11.72 * 255.255.255.255 UH 0 0 0 eth0<br />
heterogeneous.u * 255.255.255.255 UH 0 0 0 eth0<br />
192.168.21.0 * 255.255.255.0 U 0 0 0 eth1<br />
192.168.20.0 * 255.255.255.0 U 0 0 0 eth0<br />
192.168.20.0 * 255.255.254.0 U 0 0 0 eth0<br />
192.168.20.0 * 255.255.254.0 U 0 0 0 eth1<br />
default heterogeneous.u 0.0.0.0 UG 0 0 0 eth0<br />
<br />
<br />
For reasons that remain unclear, many machines sometimes lose the entry:<br />
<br />
192.168.21.0 * 255.255.255.0 U 0 0 0 eth1<br />
<br />
For Open MPI, this leads to a hang when making a system sockets "connect" call to any 192.*.21.* address.<br />
In this case, you can either<br />
<br />
* switch off eth1 (see also [http://hcl.ucd.ie/wiki/index.php/OpenMPI] ):<br />
<br />
mpirun --mca btl_tcp_if_exclude lo,eth1 ...<br />
<br />
or<br />
<br />
* you can restore the above table on all nodes by running "sh /etc/network/if-up.d/00routes" as root<br />
<br />
It is not yet clear why, without this entry, connections to the "21" addresses fail. We expect that in this case the following rule should be matched (because of the mask):<br />
192.168.20.0 * 255.255.254.0 U 0 0 0 eth0<br />
<br />
The packets then leave over the eth0 network interface and should go over switch 1 to switch 2 and on to the eth1 interface of the corresponding node.<br />
<br />
<br />
* If one attempts a ping from one node A, via its eth0 interface, to the address of another node's (B) eth1 interface, the following is observed:<br />
** outgoing ping packets appear only on the eth0 interface of the first node A.<br />
** incoming ping packets appear only on eth1 interface of the second node B.<br />
** outgoing ping response packets appear on the eth0 interface of the second node B, never on the eth1 interface despite pinging the eth1 address specifically.<br />
What explains this? With the routing tables as they are above, or in the damaged case, the ping may arrive at the correct interface, but the response from B is routed to A-eth0 via B-eth0. Further, after a number of ping packets have been sent in sequence (50 to 100), pings from A, though the -I eth0 switch is specified, begin to appear on both A-eth0 and A-eth1. This behaviour is unexpected, but does not affect the return path of the ping response packet.<br />
<br />
<br />
In order to get a symmetric behaviour, where a packet leaves A-eth0, travels via the switch bridge to B-eth1 and returns back from B-eth1 to A-eth0, one must ensure the routing table of B contains no eth0 entries.<br />
<br />
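A quick check for the missing entry can be scripted. This is a sketch only, assuming the iproute2 <code>ip</code> tool and the 192.168.21.0/24 subnet shown in the routing table above:

```shell
#!/bin/sh
# Report whether the eth1 subnet route from the table above is present.
if ip route show 2>/dev/null | grep -q '192\.168\.21\.0/24 dev eth1'; then
    echo "route present"
else
    echo "route missing - rerun /etc/network/if-up.d/00routes as root"
fi
```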
== Paging and the OOM-Killer ==<br />
Please read [[virtual memory overcommit]] for details.<br />
Due to the nature of experiments run on the cluster, we often induce heavy paging and complete exhaustion of available memory on certain nodes. Linux has a pair of strategies to deal with heavy memory use. The first is overcommitting: a process is allowed to allocate or fork even when there is no more memory available. You can see some interesting numbers here:[http://opsmonkey.blogspot.com/2007/01/linux-memory-overcommit.html]. The assumption is that processes may not use all the memory they allocate, and that failing at allocation time is worse than failing later, when the memory is actually used. More processes can be supported by allowing them to allocate memory (provided they do not use it all). The second part of the strategy is the Out-of-Memory killer (OOM killer). When memory has been exhausted and a process tries to use some 'overcommitted' part of memory, the OOM killer is invoked. Its job is to rank all processes in terms of their memory use, priority, privilege and some other parameters, and then select a process to kill based on those ranks.<br />
<br />
The argument for using overcommit plus the OOM killer is that rather than failing to allocate memory for some random unlucky process, which would probably terminate as a result, the kernel can instead allow the unlucky process to continue executing and then make a somewhat informed decision on which process to kill. Unfortunately, the behaviour of the OOM killer sometimes causes problems which grind the machine to a complete halt, particularly when it decides to kill system processes. There is a good discussion of the OOM killer here: [http://lwn.net/Articles/104179/]<br />
<br />
For this reason overcommit has been disabled on the cluster.<br />
cat /proc/sys/vm/overcommit_memory <br />
2<br />
cat /proc/sys/vm/overcommit_ratio <br />
100<br />
<br />
To restore to default overcommit<br />
# echo 0 > /proc/sys/vm/overcommit_memory<br />
# echo 50 > /proc/sys/vm/overcommit_ratio<br />
<br />
== Manually Limit the Memory on the OS level ==<br />
<br />
As root, edit /etc/default/grub:<br />
 GRUB_CMDLINE_LINUX_DEFAULT="quiet mem=128M"<br />
then run the command:<br />
update-grub<br />
reboot</div>Davepchttps://hcl.ucd.ie/wiki/index.php?title=Grid5000&diff=808Grid50002013-07-22T19:16:24Z<p>Davepc: </p>
<hr />
<div>https://www.grid5000.fr/mediawiki/index.php/Grid5000:Home <br />
<br />
[https://www.grid5000.fr/mediawiki/index.php/Grid5000:UserCharter USAGE POLICY]&nbsp; - Very important: after booking nodes (oarsub ...), run the command:&nbsp;<source lang="">outofchart</source>&nbsp;This checks that you have not booked too many resources, which would get you in trouble with the Grid5000 admins. <br />
<br />
<br> <br />
<br />
== Login, job submission, deployment of image ==<br />
<br />
*Select sites and clusters for experiments, using information on the [https://www.grid5000.fr/mediawiki/index.php/Grid5000:Network#Grid.275000_Sites Grid5000 network] and the [https://www.grid5000.fr/mediawiki/index.php/Status Status page] <br />
*Access is provided via access nodes '''access.SITE.grid5000.fr''' marked [https://www.grid5000.fr/mediawiki/index.php/External_access here] as ''accessible from '''everywhere''' via ssh with '''keyboard-interactive''' authentication method''. As soon as you are on one of the sites, you can ssh directly to the frontend node of any other site:<br />
<br />
<source lang="bash"><br />
access_$ ssh frontend.SITE2<br />
</source> <br />
<br />
*There is no Internet access from the computing nodes (external IPs must be registered on the proxy); therefore, download/update your code at the access nodes. Several revision control clients are available. <br />
*Each site has a separate NFS, therefore, to run an application on several sites at once, you need to copy it '''scp, sftp, rsync''' between access or frontend nodes. <br />
*Jobs are run from the frontend nodes, using a [http://en.wikipedia.org/wiki/OpenPBS PBS]-like system, [https://www.grid5000.fr/mediawiki/index.php/Cluster_experiment-OAR2 OAR]. Basic commands: <br />
**'''oarstat''' - queue status <br />
**'''oarsub''' - job submission <br />
**'''oardel''' - job removal<br />
<br />
Interactive job on deployed images: <source lang="bash"><br />
frontend_$ oarsub -I -t deploy -l [/cluster=N/]nodes=N,walltime=HH[:MM[:SS]] [-p 'PROPERTY="VALUE"']<br />
</source> Batch job on installed images: <source lang="bash"><br />
frontend_$ oarsub BATCH_FILE -t allow_classic_ssh -l [/cluster=N/]nodes=N,walltime=HH[:MM[:SS]] [-p 'PROPERTY="VALUE"']<br />
</source> Specifying cluster name to reserve: <source lang="bash"><br />
oarsub -r 'YYYY-MM-dd HH:mm:ss' -l nodes=2,walltime=1 -p "cluster='Genepi'"<br />
</source> If the resources are available, two nodes from the cluster "Genepi" will be reserved for the specified time. <br />
<br />
*The image to deploy can be created and loaded with help of a [http://wiki.systemimager.org/index.php/Main_Page Systemimager]-like system [https://www.grid5000.fr/mediawiki/index.php/Deploy_environment-OAR2 Kadeploy]. Creating: [https://www.grid5000.fr/mediawiki/index.php/Deploy_environment-OAR2#Tune_an_environment_to_build_another_one:_customize_authentification_parameters described here]<br />
<br />
Loading: <source lang="bash"><br />
frontend_$ kadeploy3 -a PATH_TO_PRIVATE_IMAGE_DESC -f $OAR_FILE_NODES <br />
</source> A Linux distribution lenny-x64-nfs-2.1 with mc, subversion, autotools, doxygen, MPICH2, GSL, Boost, R, gnuplot, graphviz, X11, evince is available at Orsay /home/nancy/alastovetsky/grid5000. <br />
<br />
== Compiling and running MPI applications ==<br />
<br />
*Compilation should be done on one of the reserved nodes (e.g. ssh `head -n 1 $OAR_NODEFILE`) <br />
*Running MPI applications is described [https://www.grid5000.fr/mediawiki/index.php/Run_MPI_On_Grid%275000 here] <br />
**mpirun/mpiexec should be run from one of the reserved nodes (e.g. ssh `head -n 1 $OAR_NODEFILE`)<br />
<br />
== Setting up new deploy image ==<br />
<br />
List available images <br />
<br />
kaenv3 -l<br />
<br />
Then book a node and launch: <br />
<br />
oarsub -I -t deploy -l nodes=1,walltime=12<br />
kadeploy3 -e squeeze-x64-big -f $OAR_FILE_NODES -k<br />
ssh root@`head -n 1 $OAR_NODEFILE`<br />
<br />
default password: grid5000 <br />
<br />
edit /etc/apt/sources.list <br />
<br />
apt-get update<br />
apt-get upgrade<br />
<br />
apt-get install libtool autoconf automake mc colorgcc ctags libboost-serialization-dev libboost-graph-dev libatlas-base-dev gfortran vim gdb valgrind screen subversion iperf bc gsl-bin libgsl0-dev<br />
<br />
Possibly also install (for using extrae): <br />
<br />
apt-get install libxml2-dev binutils-dev libunwind7-dev<br />
<br />
<br> Compiled from source by us: <br />
<br />
*<strike>gsl-1.14 (download: ftp://ftp.gnu.org/gnu/gsl/)&nbsp;</strike> ''Now with squeeze it is in repository.''<br />
<br />
<strike>./configure &amp;&amp; make &amp;&amp; make install</strike><br />
<br />
*mpich2 (download: http://www.mcs.anl.gov/research/projects/mpich2/downloads/index.php?s=downloads)<br />
<br />
./configure --enable-shared --enable-sharedlibs=gcc --with-pm=mpd<br />
make &amp;&amp; make install<br />
<br />
Mpich2 installed to: <br />
<br />
Installing MPE2 include files to /usr/local/include<br />
Installing MPE2 libraries to /usr/local/lib<br />
Installing MPE2 utility programs to /usr/local/bin<br />
Installing MPE2 configuration files to /usr/local/etc<br />
Installing MPE2 system utility programs to /usr/local/sbin<br />
Installing MPE2 man to /usr/local/share/man<br />
Installing MPE2 html to /usr/local/share/doc/<br />
Installed MPE2 in /usr/local<br />
<br />
*hwloc (and lstopo) (download: http://www.open-mpi.org/software/hwloc/v1.2/)<br />
<br />
Compile from source. To get xml support, install libxml2-dev and pkg-config: <br />
<br />
apt-get install libxml2-dev pkg-config<br />
tar -xzvf hwloc-1.1.1.tar.gz<br />
cd hwloc-1.1.1<br />
./configure &amp;&amp; make &amp;&amp; make install<br />
<br />
Change root password. <br />
<br />
rm sources from root dir. <br />
<br />
Edit the "message of the day" <br />
<br />
vi /etc/motd.tail<br />
<br />
echo 90 &gt; /proc/sys/vm/overcommit_ratio<br />
echo 2 &gt; /proc/sys/vm/overcommit_memory<br />
date &gt;&gt; release<br />
<br />
Cleanup <br />
<br />
apt-get clean<br />
rm /etc/udev/rules.d/*-persistent-net.rules<br />
<br />
Make image <br />
<br />
ssh root@'''node''' tgz-g5k &gt; $HOME/grid5000/'''imagename'''.tgz<br />
<br />
make appropriate .env file. <br />
<br />
kaenv3 -p lenny-x64-nfs -u deploy &gt; lenny-x64-custom-2.3.env<br />
<br />
<br> <br />
<br />
== GotoBLAS2 ==<br />
<br />
http://www.tacc.utexas.edu/tacc-projects/gotoblas2 When compiling GotoBLAS on a node without direct Internet access, you get this error: <source lang="">wget http://www.netlib.org/lapack/lapack-3.1.1.tgz<br />
--2011-05-19 03:11:03-- http://www.netlib.org/lapack/lapack-3.1.1.tgz<br />
Resolving www.netlib.org... 160.36.58.108<br />
Connecting to www.netlib.org|160.36.58.108|:80... failed: Connection timed out.<br />
Retrying.<br />
<br />
--2011-05-19 03:14:13-- (try: 2) http://www.netlib.org/lapack/lapack-3.1.1.tgz<br />
Connecting to www.netlib.org|160.36.58.108|:80... failed: Connection timed out.<br />
Retrying.<br />
...</source> <br />
<br />
Fix by downloading http://www.netlib.org/lapack/lapack-3.1.1.tgz to the GotoBLAS2 source directory and editing this line in the Makefile <br />
<br />
184c184<br />
&lt; -wget http://www.netlib.org/lapack/lapack-3.1.1.tgz<br />
---<br />
&gt; # -wget http://www.netlib.org/lapack/lapack-3.1.1.tgz<br />
<br />
<br> GotoBLAS needs to be compiled individually for each unique machine, i.e. each cluster. Add the following to .bashrc: <br />
<br />
export CLUSTER=`hostname |sed 's/\([a-z]*\).*/\1/'`<br />
LD_LIBRARY_PATH=$HOME/lib/$CLUSTER:$HOME/lib:/usr/local/lib:$LD_LIBRARY_PATH<br />
export LIBRARY_PATH=$HOME/lib/$CLUSTER:$HOME/lib:/usr/local/lib:$LIBRARY_PATH<br />
<br />
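The sed expression keeps only the leading letters of the hostname, which is how the per-cluster directory name is derived. For example (sample grid5000-style hostnames):

```shell
# Show what $CLUSTER evaluates to for a few sample hostnames;
# node names are typically <cluster-name><separator><number>.
for h in genepi-12 griffon-3 hcl07; do
    echo "$h" | sed 's/\([a-z]*\).*/\1/'
done
# prints: genepi, griffon, hcl (one per line)
```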
Run the following script once on each cluster: <br />
<br />
<source lang="bash">#! /bin/bash<br />
echo "Compiling gotoblas for cluster: $CLUSTER"<br />
cd $HOME/src<br />
if [ ! -d "$CLUSTER" ]; then<br />
mkdir $CLUSTER<br />
fi<br />
cd $CLUSTER<br />
tar -xzf ../Goto*.tar.gz<br />
cd Goto*<br />
make &> m.log<br />
<br />
<br />
if [ ! -d "$HOME/lib/$CLUSTER" ]; then<br />
mkdir $HOME/lib/$CLUSTER<br />
fi<br />
<br />
cp libgoto2.so $HOME/lib/$CLUSTER<br />
<br />
echo results<br />
ls -d $HOME/src/$CLUSTER<br />
ls $HOME/src/$CLUSTER<br />
<br />
ls -d $HOME/lib/$CLUSTER<br />
ls $HOME/lib/$CLUSTER</source> <br />
<br />
Note: for newer processors this may fail. If it is a Nehalem processor, try: <br />
<br />
make clean<br />
make TARGET=NEHALEM<br />
<br />
== Paging and the OOM-Killer ==<br />
<br />
When doing exhaustion of available memory experiments, problems can occur with over-commit. See [[HCL cluster#Paging_and_the_OOM-Killer]] for more detail. <br />
<br />
== Example of experiment setup across several sites ==<br />
<br />
Sources of all files mentioned below is available at: [[Grid5000:sources]]. <br />
<br />
Pick one head node as the main head node (I use grenoble, but any will do). Set up the sources: <br />
<br />
cd dave/fupermod-1.1.0<br />
make clean<br />
./configure --with-cblas=goto --prefix=/usr/local/<br />
<br />
Reserve 2 nodes from all clusters on a 3 cluster site: <br />
<br />
oarsub -r "2011-07-25 11:01:01" -t deploy -l cluster=3/nodes=2,walltime=11:59:00<br />
<br />
Automate with: <br />
<br />
for a in 2 3 4; do for i in `cat sites.$a`; do echo $a $i; ssh $i oarsub -r "2011-07-25 11:01:01" -t deploy -l cluster=$a/nodes=2,walltime=11:59:00; done; done<br />
<br />
Then on each site, where xxx is site name: <br />
<br />
kadeploy3 -a $HOME/grid5000/lenny-dave.env -f $OAR_NODE_FILE --output-ok-nodes deployed.xxx<br />
<br />
Gather deployed files to a head node: <br />
<br />
for i in `cat ~/sites `; do echo $i; scp $i:deployed* .&nbsp;; done<br />
cat deployed.* &gt; deployed.all<br />
<br />
Copy cluster specific libs to each deployed node /usr/local/lib dir with script <br />
<br />
copy_local_libs.sh deployed.all<br />
<br />
Copy source files to the root dir of each deployed node. Then make install on each (note: ssh -f does this in parallel) <br />
<br />
for i in `cat ~/deployed.all`; do echo $i; rsync -aP ~/dave/fupermod-1.1.0 root@$i:&nbsp;; done<br />
for i in `cat ~/deployed.all`; do echo $i; ssh -f root@$i "cd fupermod-1.1.0&nbsp;; make all install"&nbsp;; done<br />
<br />
ssh to the first node <br />
<br />
ssh `head -n1 deployed.all`<br />
n=$(cat deployed.all |wc -l)<br />
mpdboot --totalnum=$n --file=$HOME/deployed.all<br />
mpdtrace<br />
<br />
cd dave/data/<br />
mpirun -n $n /usr/local/bin/partitioner -l /usr/local/lib/libmxm_col.so -a0 -D10000 -o N=100<br />
<br />
Cleanup after: <br />
<br />
for i in `cat ~/sites `; do echo $i; ssh $i rm deployed.*&nbsp;; done<br />
<br />
== Check network speed ==<br />
<br />
apt-get install iperf<br />
<br />
== Choose which network interface to use ==<br />
<br />
mpirun --mca btl self,openib ...<br />
<br />
or <br />
<br />
mpirun --mca btl self,tcp ...<br />
<br />
== Installing Gadget-2.0.7 ==<br />
<br />
# apt-get install hdf5-openmpi-dev sfftw-dev<br />
$ tar -xzvf gadget2.tar.gz<br />
$ cd Gadget-2.0.7/Gadget2<br />
$ make CFLAGS="-DH5_USE_16_API"<br />
$ make clean; make<br />
<br />
== Installing Wrekavoc ==<br />
<br />
Download from http://wrekavoc.gforge.inria.fr/ <br />
<br />
# apt-get install libxml2-dev pkg-config<br />
# tar -xzvf wrekavoc-1.1.tar.gz <br />
# cd wrekavoc-1.1/<br />
# ./configure <br />
# make<br />
# ./src/burn 50<br />
<br />
== Installing Extrae ==<br />
<br />
(on grid5000 wheezy big) <br />
<br />
First install [http://www.dyninst.org/ Dyninst] <br />
<br />
# apt-get install libelf-dev libdwarf-dev <br />
# tar -xzvf DyninstAPI-8.1.2.tgz <br />
# cd DyninstAPI-8.1.2<br />
# ./configure --with-libdwarf-static<br />
# make<br />
# make install<br />
<br />
Then Extrae <br />
<br />
# apt-get install<br />
# ./configure --with-mpi=/usr --with-mpi-libs=/usr/lib --with-papi=/usr/local --with-unwind=/usr --with-dyninst=/usr/local --with-dwarf=/usr</div>Davepchttps://hcl.ucd.ie/wiki/index.php?title=Grid5000&diff=807Grid50002013-07-09T16:32:30Z<p>Davepc: </p>
<hr />
<div>https://www.grid5000.fr/mediawiki/index.php/Grid5000:Home <br />
<br />
[https://www.grid5000.fr/mediawiki/index.php/Grid5000:UserCharter USAGE POLICY]&nbsp; - Very important: after booking nodes (oarsub ...), run the command:&nbsp;<source lang="">outofchart</source>&nbsp;This checks that you have not booked too many resources, which would get you in trouble with the Grid5000 admins. <br />
<br />
<br> <br />
<br />
== Login, job submission, deployment of image ==<br />
<br />
*Select sites and clusters for experiments, using information on the [https://www.grid5000.fr/mediawiki/index.php/Grid5000:Network#Grid.275000_Sites Grid5000 network] and the [https://www.grid5000.fr/mediawiki/index.php/Status Status page] <br />
*Access is provided via access nodes '''access.SITE.grid5000.fr''' marked [https://www.grid5000.fr/mediawiki/index.php/External_access here] as ''accessible from '''everywhere''' via ssh with '''keyboard-interactive''' authentication method''. As soon as you are on one of the sites, you can ssh directly to the frontend node of any other site:<br />
<br />
<source lang="bash"><br />
access_$ ssh frontend.SITE2<br />
</source> <br />
<br />
*There is no Internet access from the computing nodes (external IPs must be registered on the proxy); therefore, download/update your code at the access nodes. Several revision control clients are available. <br />
*Each site has a separate NFS, therefore, to run an application on several sites at once, you need to copy it '''scp, sftp, rsync''' between access or frontend nodes. <br />
*Jobs are run from the frontend nodes, using a [http://en.wikipedia.org/wiki/OpenPBS PBS]-like system, [https://www.grid5000.fr/mediawiki/index.php/Cluster_experiment-OAR2 OAR]. Basic commands: <br />
**'''oarstat''' - queue status <br />
**'''oarsub''' - job submission <br />
**'''oardel''' - job removal<br />
<br />
Interactive job on deployed images: <source lang="bash"><br />
frontend_$ oarsub -I -t deploy -l [/cluster=N/]nodes=N,walltime=HH[:MM[:SS]] [-p 'PROPERTY="VALUE"']<br />
</source> Batch job on installed images: <source lang="bash"><br />
frontend_$ oarsub BATCH_FILE -t allow_classic_ssh -l [/cluster=N/]nodes=N,walltime=HH[:MM[:SS]] [-p 'PROPERTY="VALUE"']<br />
</source> Specifying cluster name to reserve: <source lang="bash"><br />
oarsub -r 'YYYY-MM-dd HH:mm:ss' -l nodes=2,walltime=1 -p "cluster='Genepi'"<br />
</source> If the resources are available, two nodes from the cluster "Genepi" will be reserved for the specified time. <br />
<br />
*The image to deploy can be created and loaded with help of a [http://wiki.systemimager.org/index.php/Main_Page Systemimager]-like system [https://www.grid5000.fr/mediawiki/index.php/Deploy_environment-OAR2 Kadeploy]. Creating: [https://www.grid5000.fr/mediawiki/index.php/Deploy_environment-OAR2#Tune_an_environment_to_build_another_one:_customize_authentification_parameters described here]<br />
<br />
Loading: <source lang="bash"><br />
frontend_$ kadeploy3 -a PATH_TO_PRIVATE_IMAGE_DESC -f $OAR_FILE_NODES <br />
</source> A Linux distribution lenny-x64-nfs-2.1 with mc, subversion, autotools, doxygen, MPICH2, GSL, Boost, R, gnuplot, graphviz, X11, evince is available at Orsay /home/nancy/alastovetsky/grid5000. <br />
<br />
== Compiling and running MPI applications ==<br />
<br />
*Compilation should be done on one of the reserved nodes (e.g. ssh `head -n 1 $OAR_NODEFILE`) <br />
*Running MPI applications is described [https://www.grid5000.fr/mediawiki/index.php/Run_MPI_On_Grid%275000 here] <br />
**mpirun/mpiexec should be run from one of the reserved nodes (e.g. ssh `head -n 1 $OAR_NODEFILE`)<br />
<br />
== Setting up new deploy image ==<br />
<br />
List available images <br />
<br />
kaenv3 -l<br />
<br />
Then book a node and launch: <br />
<br />
oarsub -I -t deploy -l nodes=1,walltime=12<br />
kadeploy3 -e squeeze-x64-big -f $OAR_FILE_NODES -k<br />
ssh root@`head -n 1 $OAR_NODEFILE`<br />
<br />
default password: grid5000 <br />
<br />
edit /etc/apt/sources.list <br />
<br />
apt-get update<br />
apt-get upgrade<br />
<br />
apt-get install libtool autoconf automake mc colorgcc ctags libboost-serialization-dev libboost-graph-dev libatlas-base-dev gfortran vim gdb valgrind screen subversion iperf bc gsl-bin libgsl0-dev<br />
<br />
Possibly also install (for using extrae): <br />
<br />
apt-get install libxml2-dev binutils-dev libunwind7-dev<br />
<br />
<br> Compiled from source by us: <br />
<br />
*<strike>gsl-1.14 (download: ftp://ftp.gnu.org/gnu/gsl/)&nbsp;</strike> ''Now with squeeze it is in repository.''<br />
<br />
<strike>./configure &amp;&amp; make &amp;&amp; make install</strike><br />
<br />
*mpich2 (download: http://www.mcs.anl.gov/research/projects/mpich2/downloads/index.php?s=downloads)<br />
<br />
./configure --enable-shared --enable-sharedlibs=gcc --with-pm=mpd<br />
make &amp;&amp; make install<br />
<br />
MPICH2 installed to: <br />
<br />
Installing MPE2 include files to /usr/local/include<br />
Installing MPE2 libraries to /usr/local/lib<br />
Installing MPE2 utility programs to /usr/local/bin<br />
Installing MPE2 configuration files to /usr/local/etc<br />
Installing MPE2 system utility programs to /usr/local/sbin<br />
Installing MPE2 man to /usr/local/share/man<br />
Installing MPE2 html to /usr/local/share/doc/<br />
Installed MPE2 in /usr/local<br />
<br />
*hwloc (and lstopo) (download: http://www.open-mpi.org/software/hwloc/v1.2/)<br />
<br />
Compile from sources. To get XML support, install libxml2-dev and pkg-config: <br />
<br />
apt-get install libxml2-dev pkg-config<br />
tar -xzvf hwloc-1.1.1.tar.gz<br />
cd hwloc-1.1.1<br />
./configure &amp;&amp; make &amp;&amp; make install<br />
<br />
Change root password. <br />
<br />
Remove sources from the root dir. <br />
<br />
Edit the "message of the day" <br />
<br />
vi /etc/motd.tail<br />
<br />
echo 90 &gt; /proc/sys/vm/overcommit_ratio<br />
echo 2 &gt; /proc/sys/vm/overcommit_memory<br />
date &gt;&gt; release<br />
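The two <code>echo</code> lines above switch the kernel to strict overcommit accounting: with <code>vm.overcommit_memory=2</code>, allocations are refused once committed memory reaches swap plus <code>overcommit_ratio</code>% of RAM. A minimal sketch of the resulting limit, using assumed sizes (1&nbsp;GB swap, 8&nbsp;GB RAM) rather than values read from a real node: <br />

```bash
# Strict overcommit: CommitLimit = SwapTotal + (overcommit_ratio/100) * MemTotal
# Assumed sizes for illustration: 1 GB swap, 8 GB RAM, ratio 90.
swap_kb=$((1 * 1024 * 1024))
mem_kb=$((8 * 1024 * 1024))
ratio=90
commit_limit_kb=$((swap_kb + mem_kb * ratio / 100))
echo "CommitLimit: ${commit_limit_kb} kB"   # cf. CommitLimit in /proc/meminfo
```

On a live node the real numbers can be read from /proc/meminfo instead of being hard-coded. <br />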
<br />
Cleanup <br />
<br />
apt-get clean<br />
rm /etc/udev/rules.d/*-persistent-net.rules<br />
<br />
Make image <br />
<br />
ssh root@'''node''' tgz-g5k &gt; $HOME/grid5000/'''imagename'''.tgz<br />
<br />
Make an appropriate .env file: <br />
<br />
kaenv3 -p lenny-x64-nfs -u deploy &gt; lenny-x64-custom-2.3.env<br />
<br />
<br> <br />
<br />
== GotoBLAS2 ==<br />
<br />
http://www.tacc.utexas.edu/tacc-projects/gotoblas2 When compiling GotoBLAS2 on a node without direct internet access, you get this error: <source lang="text">wget http://www.netlib.org/lapack/lapack-3.1.1.tgz<br />
--2011-05-19 03:11:03-- http://www.netlib.org/lapack/lapack-3.1.1.tgz<br />
Resolving www.netlib.org... 160.36.58.108<br />
Connecting to www.netlib.org|160.36.58.108|:80... failed: Connection timed out.<br />
Retrying.<br />
<br />
--2011-05-19 03:14:13-- (try: 2) http://www.netlib.org/lapack/lapack-3.1.1.tgz<br />
Connecting to www.netlib.org|160.36.58.108|:80... failed: Connection timed out.<br />
Retrying.<br />
...</source> <br />
<br />
Fix by downloading http://www.netlib.org/lapack/lapack-3.1.1.tgz into the GotoBLAS2 source directory and commenting out this line in the Makefile: <br />
<br />
184c184<br />
&lt; -wget http://www.netlib.org/lapack/lapack-3.1.1.tgz<br />
---<br />
&gt; # -wget http://www.netlib.org/lapack/lapack-3.1.1.tgz<br />
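The same edit can be scripted with sed; this sketch is exercised on a hypothetical one-line stand-in file, not the real GotoBLAS2 Makefile: <br />

```bash
# Comment out the lapack download line so the pre-downloaded tarball is used.
# A temporary stand-in file substitutes for the real Makefile here.
mf=$(mktemp)
printf '%s\n' '-wget http://www.netlib.org/lapack/lapack-3.1.1.tgz' > "$mf"
sed -i 's|^\([[:space:]]*\)-wget |\1# -wget |' "$mf"
cat "$mf"   # prints: # -wget http://www.netlib.org/lapack/lapack-3.1.1.tgz
```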
<br />
<br> GotoBLAS needs to be compiled individually for each unique machine type, i.e. for each cluster. Add the following to .bashrc: <br />
<br />
export CLUSTER=`hostname |sed 's/\([a-z]*\).*/\1/'`<br />
LD_LIBRARY_PATH=$HOME/lib/$CLUSTER:$HOME/lib:/usr/local/lib:$LD_LIBRARY_PATH<br />
export LIBRARY_PATH=$HOME/lib/$CLUSTER:$HOME/lib:/usr/local/lib:$LIBRARY_PATH<br />
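The sed expression keeps only the leading run of lowercase letters of the hostname, which in Grid'5000 naming is the cluster name. A quick check with hypothetical hostnames: <br />

```bash
# Hostnames look like <cluster>-<n>.<site>.grid5000.fr; the capture group
# grabs the leading lowercase letters (the cluster name) and drops the rest.
# The hostnames below are made-up examples.
for h in genepi-12.grenoble.grid5000.fr gdx-7.orsay.grid5000.fr; do
  echo "$h" | sed 's/\([a-z]*\).*/\1/'
done
```

This yields one library directory per cluster (e.g. $HOME/lib/genepi). <br />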
<br />
Run the following script once on each cluster: <br />
<br />
<source lang="bash">#! /bin/bash<br />
echo "Compiling gotoblas for cluster: $CLUSTER"<br />
cd $HOME/src<br />
if [ ! -d "$CLUSTER" ]; then<br />
mkdir $CLUSTER<br />
fi<br />
cd $CLUSTER<br />
tar -xzf ../Goto*.tar.gz<br />
cd Goto*<br />
make &> m.log<br />
<br />
<br />
if [ ! -d "$HOME/lib/$CLUSTER" ]; then<br />
mkdir $HOME/lib/$CLUSTER<br />
fi<br />
<br />
cp libgoto2.so $HOME/lib/$CLUSTER<br />
<br />
echo results<br />
ls -d $HOME/src/$CLUSTER<br />
ls $HOME/src/$CLUSTER<br />
<br />
ls -d $HOME/lib/$CLUSTER<br />
ls $HOME/lib/$CLUSTER</source> <br />
<br />
Note: for newer processors this may fail. If it is a Nehalem processor, try: <br />
<br />
make clean<br />
make TARGET=NEHALEM<br />
<br />
== Paging and the OOM-Killer ==<br />
<br />
When running experiments that exhaust available memory, problems can occur with over-commit. See [[HCL cluster#Paging_and_the_OOM-Killer]] for more detail. <br />
<br />
== Example of experiment setup across several sites ==<br />
<br />
Sources of all files mentioned below are available at: [[Grid5000:sources]]. <br />
<br />
Pick one head node as the main head node (I use grenoble, but any will do). Set up sources: <br />
<br />
cd dave/fupermod-1.1.0<br />
make clean<br />
./configure --with-cblas=goto --prefix=/usr/local/<br />
<br />
Reserve 2 nodes from each of the 3 clusters on a site: <br />
<br />
oarsub -r "2011-07-25 11:01:01" -t deploy -l cluster=3/nodes=2,walltime=11:59:00<br />
<br />
Automate with: <br />
<br />
for a in 2 3 4; do for i in `cat sites.$a`; do echo $a $i; ssh $i oarsub -r "2011-07-25 11:01:01" -t deploy -l cluster=$a/nodes=2,walltime=11:59:00; done; done<br />
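One caveat with this loop: ssh concatenates its arguments into a single string that the remote shell re-parses, so the quoted date reaches the remote oarsub as two separate words. A sketch of building the remote command with the quoting preserved (reservation values are hypothetical): <br />

```bash
# Wrap the date in single quotes inside the remote command string, then pass
# the whole string as one ssh argument: ssh "$site" "$remote_cmd".
start="2011-07-25 11:01:01"
remote_cmd="oarsub -r '$start' -t deploy -l cluster=2/nodes=2,walltime=11:59:00"
echo "$remote_cmd"
```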
<br />
Then on each site, where xxx is site name: <br />
<br />
kadeploy3 -a $HOME/grid5000/lenny-dave.env -f $OAR_NODE_FILE --output-ok-nodes deployed.xxx<br />
<br />
Gather deployed files to a head node: <br />
<br />
for i in `cat ~/sites `; do echo $i; scp $i:deployed* .&nbsp;; done<br />
cat deployed.* &gt; deployed.all<br />
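mpdboot's --totalnum later comes from the line count of deployed.all, so any node listed in two site files would skew it. A defensive variant dedupes while merging, shown here in a scratch directory with made-up node names: <br />

```bash
# Merge per-site node lists, dropping duplicate entries.
# Scratch directory and node names are for illustration only.
tmp=$(mktemp -d)
printf 'node-1\nnode-2\n' > "$tmp/deployed.grenoble"
printf 'node-2\nnode-3\n' > "$tmp/deployed.orsay"
sort -u "$tmp"/deployed.* > "$tmp/deployed.all"
wc -l < "$tmp/deployed.all"   # 3 unique nodes, not 4
```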
<br />
Copy cluster-specific libs to each deployed node's /usr/local/lib dir with the script: <br />
<br />
copy_local_libs.sh deployed.all<br />
<br />
Copy source files to the root dir of each deployed node. Then make install on each (note: ssh -f does this in parallel): <br />
<br />
for i in `cat ~/deployed.all`; do echo $i; rsync -aP ~/dave/fupermod-1.1.0 root@$i:&nbsp;; done<br />
for i in `cat ~/deployed.all`; do echo $i; ssh -f root@$i "cd fupermod-1.1.0&nbsp;; make all install"&nbsp;; done<br />
<br />
ssh to the first node <br />
<br />
ssh `head -n1 deployed.all`<br />
n=$(cat deployed.all |wc -l)<br />
mpdboot --totalnum=$n --file=$HOME/deployed.all<br />
mpdtrace<br />
<br />
cd dave/data/<br />
mpirun -n $n /usr/local/bin/partitioner -l /usr/local/lib/libmxm_col.so -a0 -D10000 -o N=100<br />
<br />
Cleanup after: <br />
<br />
for i in `cat ~/sites `; do echo $i; ssh $i rm deployed.*&nbsp;; done<br />
<br />
== Check network speed ==<br />
<br />
apt-get install iperf<br />
<br />
== Choose which network interface to use ==<br />
<br />
mpirun --mca btl self,openib ...<br />
<br />
or <br />
<br />
mpirun --mca btl self,tcp ...<br />
<br />
== Installing Gadget-2.0.7 ==<br />
<br />
# apt-get install hdf5-openmpi-dev sfftw-dev<br />
$ tar -xzvf gadget2.tar.gz<br />
$ cd Gadget-2.0.7/Gadget2<br />
$ make CFLAGS="-DH5_USE_16_API"<br />
$ make clean; make<br />
<br />
== Installing Wrekavoc ==<br />
<br />
Download from http://wrekavoc.gforge.inria.fr/ <br />
<br />
# apt-get install libxml2-dev pkg-config<br />
# tar -xzvf wrekavoc-1.1.tar.gz <br />
# cd wrekavoc-1.1/<br />
# ./configure <br />
# make<br />
# ./src/burn 50<br />
<br />
== Installing Extrae ==<br />
<br />
(on grid5000 wheezy big) <br />
<br />
First install [http://www.dyninst.org/ Dyninst] <br />
<br />
# apt-get install libelf-dev libdwarf-dev <br />
# tar -xzvf DyninstAPI-8.1.2.tgz <br />
# cd DyninstAPI-8.1.2<br />
# ./configure --with-libdwarf-static<br />
# make<br />
# make install<br />
<br />
Then Extrae <br />
<br />
# apt-get install<br />
# ./configure --with-mpi=/usr --with-mpi-libs=/usr/lib --with-papi=/usr/local --with-unwind=/usr</div>
<hr />
<div>https://www.grid5000.fr/mediawiki/index.php/Grid5000:Home <br />
<br />
[https://www.grid5000.fr/mediawiki/index.php/Grid5000:UserCharter USAGE POLICY]&nbsp; - Very important, after booking nodes (oarsub ...) run the command:&nbsp;<source lang="">outofchart</source>&nbsp;This will check that you haven't booked too many resources and therefore get in trouble with grid5000 admin.<br />
<br />
<br />
<br />
== Login, job submission, deployment of image ==<br />
<br />
*Select sites and clusters for experiments, using information on the [https://www.grid5000.fr/mediawiki/index.php/Grid5000:Network#Grid.275000_Sites Grid5000 network] and the [https://www.grid5000.fr/mediawiki/index.php/Status Status page] <br />
*Access is provided via access nodes '''access.SITE.grid5000.fr''' marked [https://www.grid5000.fr/mediawiki/index.php/External_access here] as ''accessible from '''everywhere''' via ssh with '''keyboard-interactive''' authentication method''. As soon as you are on one of the sites, you can directly ssh frontend node of any other site:<br />
<br />
<source lang="bash"><br />
access_$ ssh frontend.SITE2<br />
</source> <br />
<br />
*There is no access to Internet from computing nodes (external IPs should be registered on proxy), therefore, download/update your stuff at the access nodes. Several revision control clients are available. <br />
*Each site has a separate NFS; therefore, to run an application on several sites at once, you need to copy it (with '''scp''', '''sftp''', or '''rsync''') between access or frontend nodes. <br />
*Jobs are run from the frontend nodes, using a [http://en.wikipedia.org/wiki/OpenPBS PBS]-like system, [https://www.grid5000.fr/mediawiki/index.php/Cluster_experiment-OAR2 OAR]. Basic commands: <br />
**'''oarstat''' - queue status <br />
**'''oarsub''' - job submission <br />
**'''oardel''' - job removal<br />
<br />
Interactive job on deployed images: <source lang="bash"><br />
frontend_$ oarsub -I -t deploy -l [/cluster=N/]nodes=N,walltime=HH[:MM[:SS]] [-p 'PROPERTY="VALUE"']<br />
</source> Batch job on installed images: <source lang="bash"><br />
frontend_$ oarsub BATCH_FILE -t allow_classic_ssh -l [/cluster=N/]nodes=N,walltime=HH[:MM[:SS]] [-p 'PROPERTY="VALUE"']<br />
</source> <br />
Specifying cluster name to reserve: <source lang="bash"><br />
oarsub -r 'YYYY-MM-dd HH:mm:ss' -l nodes=2,walltime=1 -p "cluster='Genepi'"<br />
</source> If the resources are available, two nodes from the cluster "Genepi" will be reserved for the specified time.<br />
<br />
*The image to deploy can be created and loaded with the help of a [http://wiki.systemimager.org/index.php/Main_Page Systemimager]-like system, [https://www.grid5000.fr/mediawiki/index.php/Deploy_environment-OAR2 Kadeploy]. Creating an image is [https://www.grid5000.fr/mediawiki/index.php/Deploy_environment-OAR2#Tune_an_environment_to_build_another_one:_customize_authentification_parameters described here]<br />
<br />
Loading: <source lang="bash"><br />
frontend_$ kadeploy3 -a PATH_TO_PRIVATE_IMAGE_DESC -f $OAR_FILE_NODES <br />
</source> A Linux distribution lenny-x64-nfs-2.1 with mc, subversion, autotools, doxygen, MPICH2, GSL, Boost, R, gnuplot, graphviz, X11, and evince is available at Orsay in /home/nancy/alastovetsky/grid5000.<br />
<br />
== Compiling and running MPI applications ==<br />
<br />
*Compilation should be done on one of the reserved nodes (e.g. ssh `head -n 1 $OAR_NODEFILE`) <br />
*Running MPI applications is described [https://www.grid5000.fr/mediawiki/index.php/Run_MPI_On_Grid%275000 here] <br />
**mpirun/mpiexec should be run from one of the reserved nodes (e.g. ssh `head -n 1 $OAR_NODEFILE`)<br />
<br />
== Setting up new deploy image ==<br />
<br />
List available images <br />
<br />
kaenv3 -l<br />
<br />
Then book a node and launch: <br />
<br />
oarsub -I -t deploy -l nodes=1,walltime=12<br />
kadeploy3 -e squeeze-x64-big -f $OAR_FILE_NODES -k<br />
ssh root@`head -n 1 $OAR_NODEFILE`<br />
<br />
default password: grid5000 <br />
<br />
edit /etc/apt/sources.list <br />
<br />
apt-get update<br />
apt-get upgrade<br />
<br />
apt-get install libtool autoconf automake mc colorgcc ctags libboost-serialization-dev libboost-graph-dev libatlas-base-dev gfortran vim gdb valgrind screen subversion iperf bc gsl-bin libgsl0-dev<br />
<br />
Possibly also install (for using extrae): <br />
<br />
apt-get install libxml2-dev binutils-dev libunwind7-dev<br />
<br />
<br> Compiled from sources by us: <br />
<br />
*<strike>gsl-1.14 (download: ftp://ftp.gnu.org/gnu/gsl/)&nbsp;</strike> ''Now with squeeze it is in repository.''<br />
<br />
<strike>./configure &amp;&amp; make &amp;&amp; make install</strike><br />
<br />
*mpich2 (download: http://www.mcs.anl.gov/research/projects/mpich2/downloads/index.php?s=downloads)<br />
<br />
./configure --enable-shared --enable-sharedlibs=gcc --with-pm=mpd<br />
make &amp;&amp; make install<br />
<br />
Mpich2 installed to: <br />
<br />
Installing MPE2 include files to /usr/local/include<br />
Installing MPE2 libraries to /usr/local/lib<br />
Installing MPE2 utility programs to /usr/local/bin<br />
Installing MPE2 configuration files to /usr/local/etc<br />
Installing MPE2 system utility programs to /usr/local/sbin<br />
Installing MPE2 man to /usr/local/share/man<br />
Installing MPE2 html to /usr/local/share/doc/<br />
Installed MPE2 in /usr/local<br />
<br />
*hwloc (and lstopo) (download: http://www.open-mpi.org/software/hwloc/v1.2/)<br />
<br />
Compile from sources. To get XML support, install libxml2-dev and pkg-config: <br />
<br />
apt-get install libxml2-dev pkg-config<br />
tar -xzvf hwloc-1.1.1.tar.gz<br />
cd hwloc-1.1.1<br />
./configure &amp;&amp; make &amp;&amp; make install<br />
<br />
Change root password. <br />
<br />
rm sources from root dir. <br />
<br />
Edit the "message of the day" <br />
<br />
vi /etc/motd.tail<br />
<br />
echo 90 &gt; /proc/sys/vm/overcommit_ratio<br />
echo 2 &gt; /proc/sys/vm/overcommit_memory<br />
date &gt;&gt; release<br />
<br />
Cleanup <br />
<br />
apt-get clean<br />
rm /etc/udev/rules.d/*-persistent-net.rules<br />
<br />
Make image <br />
<br />
ssh root@'''node''' tgz-g5k &gt; $HOME/grid5000/'''imagename'''.tgz<br />
<br />
Make an appropriate .env file: <br />
<br />
kaenv3 -p lenny-x64-nfs -u deploy &gt; lenny-x64-custom-2.3.env<br />
<br />
<br> <br />
<br />
== GotoBLAS2 ==<br />
http://www.tacc.utexas.edu/tacc-projects/gotoblas2<br />
When compiling GotoBLAS on a node without direct Internet access, you get this error: <source lang="">wget http://www.netlib.org/lapack/lapack-3.1.1.tgz<br />
--2011-05-19 03:11:03-- http://www.netlib.org/lapack/lapack-3.1.1.tgz<br />
Resolving www.netlib.org... 160.36.58.108<br />
Connecting to www.netlib.org|160.36.58.108|:80... failed: Connection timed out.<br />
Retrying.<br />
<br />
--2011-05-19 03:14:13-- (try: 2) http://www.netlib.org/lapack/lapack-3.1.1.tgz<br />
Connecting to www.netlib.org|160.36.58.108|:80... failed: Connection timed out.<br />
Retrying.<br />
...</source> <br />
<br />
Fix it by downloading http://www.netlib.org/lapack/lapack-3.1.1.tgz to the GotoBLAS2 source directory and editing this line in the Makefile: <br />
<br />
184c184<br />
&lt; -wget http://www.netlib.org/lapack/lapack-3.1.1.tgz<br />
---<br />
&gt; # -wget http://www.netlib.org/lapack/lapack-3.1.1.tgz<br />
<br />
<br> GotoBLAS needs to be compiled individually for each unique machine type, i.e. each cluster. Add the following to .bashrc: <br />
<br />
export CLUSTER=`hostname |sed 's/\([a-z]*\).*/\1/'`<br />
LD_LIBRARY_PATH=$HOME/lib/$CLUSTER:$HOME/lib:/usr/local/lib:$LD_LIBRARY_PATH<br />
export LIBRARY_PATH=$HOME/lib/$CLUSTER:$HOME/lib:/usr/local/lib:$LIBRARY_PATH<br />
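The sed expression keeps only the leading alphabetic part of the hostname, which on Grid5000 is the cluster name. A quick sanity check (the hostnames here are just illustrative):<br />

```shell
# The leading letters of a Grid5000 hostname identify the cluster.
echo "adonis-3.grenoble.grid5000.fr" | sed 's/\([a-z]*\).*/\1/'   # -> adonis
echo "genepi-12.grenoble.grid5000.fr" | sed 's/\([a-z]*\).*/\1/'  # -> genepi
```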
<br />
Run the following script once on each cluster: <br />
<br />
<source lang="bash">#! /bin/bash<br />
echo "Compiling gotoblas for cluster: $CLUSTER"<br />
cd $HOME/src<br />
if [ ! -d "$CLUSTER" ]; then<br />
mkdir $CLUSTER<br />
fi<br />
cd $CLUSTER<br />
tar -xzf ../Goto*.tar.gz<br />
cd Goto*<br />
make &> m.log<br />
<br />
<br />
if [ ! -d "$HOME/lib/$CLUSTER" ]; then<br />
mkdir $HOME/lib/$CLUSTER<br />
fi<br />
<br />
cp libgoto2.so $HOME/lib/$CLUSTER<br />
<br />
echo results<br />
ls -d $HOME/src/$CLUSTER<br />
ls $HOME/src/$CLUSTER<br />
<br />
ls -d $HOME/lib/$CLUSTER<br />
ls $HOME/lib/$CLUSTER</source> <br />
<br />
Note: for newer processors this may fail. If it is a Nehalem processor, try: <br />
<br />
make clean<br />
make TARGET=NEHALEM<br />
<br />
== Paging and the OOM-Killer ==<br />
<br />
When doing exhaustion of available memory experiments, problems can occur with over-commit. See [[HCL cluster#Paging_and_the_OOM-Killer]] for more detail. <br />
<br />
== Example of experiment setup across several sites ==<br />
<br />
Sources of all the files mentioned below are available at: [[Grid5000:sources]]. <br />
<br />
Pick one head node as the main head node (I use Grenoble, but any will do). Set up the sources: <br />
<br />
cd dave/fupermod-1.1.0<br />
make clean<br />
./configure --with-cblas=goto --prefix=/usr/local/<br />
<br />
Reserve 2 nodes from each of the clusters of a 3-cluster site: <br />
<br />
oarsub -r "2011-07-25 11:01:01" -t deploy -l cluster=3/nodes=2,walltime=11:59:00<br />
<br />
Automate with: <br />
<br />
for a in 2 3 4; do for i in `cat sites.$a`; do echo $a $i; ssh $i oarsub -r "2011-07-25 11:01:01" -t deploy -l cluster=$a/nodes=2,walltime=11:59:00; done; done<br />
<br />
Then on each site, where xxx is site name: <br />
<br />
kadeploy3 -a $HOME/grid5000/lenny-dave.env -f $OAR_NODE_FILE --output-ok-nodes deployed.xxx<br />
<br />
Gather deployed files to a head node: <br />
<br />
for i in `cat ~/sites `; do echo $i; scp $i:deployed* .&nbsp;; done<br />
cat deployed.* &gt; deployed.all<br />
<br />
Copy the cluster-specific libs to each deployed node's /usr/local/lib directory with the script: <br />
<br />
copy_local_libs.sh deployed.all<br />
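The script itself is not reproduced on this page. A hypothetical sketch of what copy_local_libs.sh might do, assuming the library layout from the .bashrc above (the actual script is in [[Grid5000:sources]]):<br />

```shell
# Hypothetical sketch of copy_local_libs.sh (real script: [[Grid5000:sources]]).
# For each deployed node, derive the cluster name from the hostname and
# push the matching GotoBLAS build to the node's /usr/local/lib.
# Usage: copy_local_libs deployed.all
copy_local_libs() {
    nodefile=$1
    while read -r node; do
        # cluster name = leading alphabetic part of the hostname
        cluster=$(echo "$node" | sed 's/\([a-z]*\).*/\1/')
        echo "copying $HOME/lib/$cluster/libgoto2.so to $node"
        scp "$HOME/lib/$cluster/libgoto2.so" "root@$node:/usr/local/lib/"
    done < "$nodefile"
}
```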
<br />
Copy the source files to the root dir of each deployed node. Then make install on each (note: ssh -f does this in parallel): <br />
<br />
for i in `cat ~/deployed.all`; do echo $i; rsync -aP ~/dave/fupermod-1.1.0 root@$i:&nbsp;; done<br />
for i in `cat ~/deployed.all`; do echo $i; ssh -f root@$i "cd fupermod-1.1.0&nbsp;; make all install"&nbsp;; done<br />
<br />
ssh to the first node <br />
<br />
ssh `head -n1 deployed.all`<br />
n=$(cat deployed.all |wc -l)<br />
mpdboot --totalnum=$n --file=$HOME/deployed.all<br />
mpdtrace<br />
<br />
cd dave/data/<br />
mpirun -n $n /usr/local/bin/partitioner -l /usr/local/lib/libmxm_col.so -a0 -D10000 -o N=100<br />
<br />
Cleanup after: <br />
<br />
for i in `cat ~/sites `; do echo $i; ssh $i rm deployed.*&nbsp;; done<br />
<br />
== Check network speed ==<br />
<br />
apt-get install iperf<br />
<br />
== Choose which network interface to use ==<br />
<br />
mpirun --mca btl self,openib ...<br />
<br />
or <br />
<br />
mpirun --mca btl self,tcp ...<br />
<br />
== Installing Gadget-2.0.7 ==<br />
# apt-get install hdf5-openmpi-dev sfftw-dev<br />
$ tar -xzvf gadget2.tar.gz<br />
$ cd Gadget-2.0.7/Gadget2<br />
$ make CFLAGS="-DH5_USE_16_API"<br />
$ make clean; make</div>Davepchttps://hcl.ucd.ie/wiki/index.php?title=Grid5000&diff=803Grid50002013-07-03T20:56:22Z<p>Davepc: </p>
<hr />
<div>https://www.grid5000.fr/mediawiki/index.php/Grid5000:Home <br />
<br />
[https://www.grid5000.fr/mediawiki/index.php/Grid5000:UserCharter USAGE POLICY]&nbsp; - Very important: after booking nodes (oarsub ...), run the command:&nbsp;<source lang="">outofchart</source>&nbsp;This checks that you haven't booked too many resources, which would get you in trouble with the Grid5000 admins.<br />
<br />
<br />
<br />
== Login, job submission, deployment of image ==<br />
<br />
*Select sites and clusters for experiments, using information on the [https://www.grid5000.fr/mediawiki/index.php/Grid5000:Network#Grid.275000_Sites Grid5000 network] and the [https://www.grid5000.fr/mediawiki/index.php/Status Status page] <br />
*Access is provided via access nodes '''access.SITE.grid5000.fr''' marked [https://www.grid5000.fr/mediawiki/index.php/External_access here] as ''accessible from '''everywhere''' via ssh with '''keyboard-interactive''' authentication method''. As soon as you are on one of the sites, you can ssh directly to the frontend node of any other site:<br />
<br />
<source lang="bash"><br />
access_$ ssh frontend.SITE2<br />
</source> <br />
<br />
*There is no Internet access from computing nodes (external IPs must be registered on the proxy); therefore, download/update your files on the access nodes. Several revision control clients are available. <br />
*Each site has a separate NFS; therefore, to run an application on several sites at once, you need to copy it (with '''scp''', '''sftp''', or '''rsync''') between access or frontend nodes. <br />
*Jobs are run from the frontend nodes, using a [http://en.wikipedia.org/wiki/OpenPBS PBS]-like system, [https://www.grid5000.fr/mediawiki/index.php/Cluster_experiment-OAR2 OAR]. Basic commands: <br />
**'''oarstat''' - queue status <br />
**'''oarsub''' - job submission <br />
**'''oardel''' - job removal<br />
<br />
Interactive job on deployed images: <source lang="bash"><br />
frontend_$ oarsub -I -t deploy -l [/cluster=N/]nodes=N,walltime=HH[:MM[:SS]] [-p 'PROPERTY="VALUE"']<br />
</source> Batch job on installed images: <source lang="bash"><br />
frontend_$ oarsub BATCH_FILE -t allow_classic_ssh -l [/cluster=N/]nodes=N,walltime=HH[:MM[:SS]] [-p 'PROPERTY="VALUE"']<br />
</source> <br />
Specifying cluster name to reserve: <source lang="bash"><br />
oarsub -r 'YYYY-MM-dd HH:mm:ss' -l nodes=2,walltime=1 -p "cluster='Genepi'"<br />
</source> If the resources are available, two nodes from the cluster "Genepi" will be reserved for the specified time.<br />
<br />
*The image to deploy can be created and loaded with help of a [http://wiki.systemimager.org/index.php/Main_Page Systemimager]-like system [https://www.grid5000.fr/mediawiki/index.php/Deploy_environment-OAR2 Kadeploy]. Creating: [https://www.grid5000.fr/mediawiki/index.php/Deploy_environment-OAR2#Tune_an_environment_to_build_another_one:_customize_authentification_parameters described here]<br />
<br />
Loading: <source lang="bash"><br />
frontend_$ kadeploy3 -a PATH_TO_PRIVATE_IMAGE_DESC -f $OAR_FILE_NODES <br />
</source> A Linux distribution lenny-x64-nfs-2.1 with mc, subversion, autotools, doxygen, MPICH2, GSL, Boost, R, gnuplot, graphviz, X11, evince is available at Orsay /home/nancy/alastovetsky/grid5000.<br />
<br />
== Compiling and running MPI applications ==<br />
<br />
*Compilation should be done on one of the reserved nodes (e.g. ssh `head -n 1 $OAR_NODEFILE`) <br />
*Running MPI applications is described [https://www.grid5000.fr/mediawiki/index.php/Run_MPI_On_Grid%275000 here] <br />
**mpirun/mpiexec should be run from one of the reserved nodes (e.g. ssh `head -n 1 $OAR_NODEFILE`)<br />
<br />
== Setting up new deploy image ==<br />
<br />
List available images <br />
<br />
kaenv3 -l<br />
<br />
Then book node and launch: <br />
<br />
oarsub -I -t deploy -l nodes=1,walltime=12<br />
kadeploy3 -e squeeze-x64-big -f $OAR_FILE_NODES -k<br />
ssh root@`head -n 1 $OAR_NODEFILE`<br />
<br />
default password: grid5000 <br />
<br />
edit /etc/apt/sources.list <br />
<br />
apt-get update<br />
apt-get upgrade<br />
<br />
apt-get install libtool autoconf automake mc colorgcc ctags libboost-serialization-dev libboost-graph-dev <br />
libatlas-base-dev gfortran vim gdb valgrind screen subversion iperf bc gsl-bin libgsl0-dev<br />
<br />
Possibly also install (for using extrae): <br />
<br />
libxml2-dev binutils-dev libunwind7-dev<br />
<br />
<br> Compiled from sources by us: <br />
<br />
*<strike>gsl-1.14 (download: ftp://ftp.gnu.org/gnu/gsl/)&nbsp;</strike> ''Now with squeeze it is in repository.''<br />
<br />
<strike>./configure &amp;&amp; make &amp;&amp; make install</strike><br />
<br />
*mpich2 (download: http://www.mcs.anl.gov/research/projects/mpich2/downloads/index.php?s=downloads)<br />
<br />
./configure --enable-shared --enable-sharedlibs=gcc --with-pm=mpd<br />
make &amp;&amp; make install<br />
<br />
Mpich2 installed to: <br />
<br />
Installing MPE2 include files to /usr/local/include<br />
Installing MPE2 libraries to /usr/local/lib<br />
Installing MPE2 utility programs to /usr/local/bin<br />
Installing MPE2 configuration files to /usr/local/etc<br />
Installing MPE2 system utility programs to /usr/local/sbin<br />
Installing MPE2 man to /usr/local/share/man<br />
Installing MPE2 html to /usr/local/share/doc/<br />
Installed MPE2 in /usr/local<br />
<br />
*hwloc (and lstopo) (download: http://www.open-mpi.org/software/hwloc/v1.2/)<br />
<br />
compile from sources. To get xml support install libxml2-dev and pkg-config <br />
<br />
apt-get install libxml2-dev pkg-config<br />
tar -xzvf hwloc-1.1.1.tar.gz<br />
cd hwloc-1.1.1<br />
./configure &amp;&amp; make &amp;&amp; make install<br />
<br />
Change root password. <br />
<br />
rm sources from root dir. <br />
<br />
Edit the "message of the day" <br />
<br />
vi /etc/motd.tail<br />
<br />
echo 90 &gt; /proc/sys/vm/overcommit_ratio<br />
echo 2 &gt; /proc/sys/vm/overcommit_memory<br />
date &gt;&gt; release<br />
<br />
Cleanup <br />
<br />
apt-get clean<br />
rm /etc/udev/rules.d/*-persistent-net.rules<br />
<br />
Make image <br />
<br />
ssh root@'''node''' tgz-g5k &gt; $HOME/grid5000/'''imagename'''.tgz<br />
<br />
make appropriate .env file. <br />
<br />
kaenv3 -p lenny-x64-nfs -u deploy &gt; lenny-x64-custom-2.3.env<br />
<br />
<br> <br />
<br />
== GotoBLAS2 ==<br />
http://www.tacc.utexas.edu/tacc-projects/gotoblas2<br />
When compiling GotoBLAS on a node without direct Internet access, you get this error: <source lang="">wget http://www.netlib.org/lapack/lapack-3.1.1.tgz<br />
--2011-05-19 03:11:03-- http://www.netlib.org/lapack/lapack-3.1.1.tgz<br />
Resolving www.netlib.org... 160.36.58.108<br />
Connecting to www.netlib.org|160.36.58.108|:80... failed: Connection timed out.<br />
Retrying.<br />
<br />
--2011-05-19 03:14:13-- (try: 2) http://www.netlib.org/lapack/lapack-3.1.1.tgz<br />
Connecting to www.netlib.org|160.36.58.108|:80... failed: Connection timed out.<br />
Retrying.<br />
...</source> <br />
<br />
Fix by downloading http://www.netlib.org/lapack/lapack-3.1.1.tgz to the GotoBLAS2 source directory and editing this line in the Makefile <br />
<br />
184c184<br />
&lt; -wget http://www.netlib.org/lapack/lapack-3.1.1.tgz<br />
---<br />
&gt; # -wget http://www.netlib.org/lapack/lapack-3.1.1.tgz<br />
<br />
<br> GotoBLAS needs to be compiled individually for each unique machine type, i.e. each cluster. Add the following to .bashrc: <br />
<br />
export CLUSTER=`hostname |sed 's/\([a-z]*\).*/\1/'`<br />
LD_LIBRARY_PATH=$HOME/lib/$CLUSTER:$HOME/lib:/usr/local/lib:$LD_LIBRARY_PATH<br />
export LIBRARY_PATH=$HOME/lib/$CLUSTER:$HOME/lib:/usr/local/lib:$LIBRARY_PATH<br />
<br />
Run the following script once on each cluster: <br />
<br />
<source lang="bash">#! /bin/bash<br />
echo "Compiling gotoblas for cluster: $CLUSTER"<br />
cd $HOME/src<br />
if [ ! -d "$CLUSTER" ]; then<br />
mkdir $CLUSTER<br />
fi<br />
cd $CLUSTER<br />
tar -xzf ../Goto*.tar.gz<br />
cd Goto*<br />
make &> m.log<br />
<br />
<br />
if [ ! -d "$HOME/lib/$CLUSTER" ]; then<br />
mkdir $HOME/lib/$CLUSTER<br />
fi<br />
<br />
cp libgoto2.so $HOME/lib/$CLUSTER<br />
<br />
echo results<br />
ls -d $HOME/src/$CLUSTER<br />
ls $HOME/src/$CLUSTER<br />
<br />
ls -d $HOME/lib/$CLUSTER<br />
ls $HOME/lib/$CLUSTER</source> <br />
<br />
note: for newer processors this may fail. If it is a NEHALEM processor try: <br />
<br />
make clean<br />
make TARGET=NEHALEM<br />
<br />
== Paging and the OOM-Killer ==<br />
<br />
When doing exhaustion of available memory experiments, problems can occur with over-commit. See [[HCL cluster#Paging_and_the_OOM-Killer]] for more detail. <br />
<br />
== Example of experiment setup across several sites ==<br />
<br />
Sources of all the files mentioned below are available at: [[Grid5000:sources]]. <br />
<br />
Pick one head node as the main head node (I use grenoble, but any will do). Setup sources <br />
<br />
cd dave/fupermod-1.1.0<br />
make clean<br />
./configure --with-cblas=goto --prefix=/usr/local/<br />
<br />
Reserve 2 nodes from all clusters on a 3 cluster site: <br />
<br />
oarsub -r "2011-07-25 11:01:01" -t deploy -l cluster=3/nodes=2,walltime=11:59:00<br />
<br />
Automate with: <br />
<br />
for a in 2 3 4; do for i in `cat sites.$a`; do echo $a $i; ssh $i oarsub -r "2011-07-25 11:01:01" -t deploy -l cluster=$a/nodes=2,walltime=11:59:00; done; done<br />
<br />
Then on each site, where xxx is site name: <br />
<br />
kadeploy3 -a $HOME/grid5000/lenny-dave.env -f $OAR_NODE_FILE --output-ok-nodes deployed.xxx<br />
<br />
Gather deployed files to a head node: <br />
<br />
for i in `cat ~/sites `; do echo $i; scp $i:deployed* .&nbsp;; done<br />
cat deployed.* &gt; deployed.all<br />
<br />
Copy cluster specific libs to each deployed node /usr/local/lib dir with script <br />
<br />
copy_local_libs.sh deployed.all<br />
<br />
Copy the source files to the root dir of each deployed node. Then make install on each (note: ssh -f does this in parallel): <br />
<br />
for i in `cat ~/deployed.all`; do echo $i; rsync -aP ~/dave/fupermod-1.1.0 root@$i:&nbsp;; done<br />
for i in `cat ~/deployed.all`; do echo $i; ssh -f root@$i "cd fupermod-1.1.0&nbsp;; make all install"&nbsp;; done<br />
<br />
ssh to the first node <br />
<br />
ssh `head -n1 deployed.all`<br />
n=$(cat deployed.all |wc -l)<br />
mpdboot --totalnum=$n --file=$HOME/deployed.all<br />
mpdtrace<br />
<br />
cd dave/data/<br />
mpirun -n $n /usr/local/bin/partitioner -l /usr/local/lib/libmxm_col.so -a0 -D10000 -o N=100<br />
<br />
Cleanup after: <br />
<br />
for i in `cat ~/sites `; do echo $i; ssh $i rm deployed.*&nbsp;; done<br />
<br />
== Check network speed ==<br />
<br />
apt-get install iperf<br />
<br />
== Choose which network interface to use ==<br />
<br />
mpirun --mca btl self,openib ...<br />
<br />
or <br />
<br />
mpirun --mca btl self,tcp ...<br />
<br />
== Installing Gadget-2.0.7 ==<br />
# apt-get install hdf5-openmpi-dev sfftw-dev<br />
$ tar -xzvf gadget2.tar.gz<br />
$ cd Gadget-2.0.7/Gadget2<br />
$ make CFLAGS="-DH5_USE_16_API"<br />
$ make clean; make</div>Davepchttps://hcl.ucd.ie/wiki/index.php?title=C/C%2B%2B&diff=802C/C++2013-04-09T17:47:34Z<p>Davepc: </p>
<hr />
<div>== Coding ==<br />
* C++ programming style is preferable. For example, in variable declarations, pointers and references should have their reference symbol next to the type rather than to the name. Variables should be initialized where they are declared, and should be declared where they are used. For more details, see [http://google-styleguide.googlecode.com/svn/trunk/cppguide.xml Google C++ Style Guide]<br />
* [http://en.wikipedia.org/wiki/Indent_style#Variant:_1TBS One-true-brace indent style]<br />
* [http://en.wikipedia.org/wiki/Pragma_once Coding header files]<br />
* Learn from examples and use coding approaches from third-party software<br />
<br />
== Commenting ==<br />
* Place [[Doxygen]] comments in header files (before declarations of namespaces/classes/structs/typedefs/macros) and main source files (for documenting tools and tests)<br />
* Use double forward slash for short comments in the code<br />
<br />
== C++ ==<br />
* [http://developers.sun.com/solaris/articles/mixing.html Mixing C/C++]<br />
* Provide main API in C<br />
* Use plain C unless you need flexible data structures or [[STL]]/[[Boost]] functionality<br />
* [http://en.wikipedia.org/wiki/Template_metaprogramming Template C++] is preferable from the point of view of runtime performance<br />
* Mind the life cycle of objects: [http://en.wikipedia.org/wiki/Default_constructor Default constructor] [http://en.wikipedia.org/wiki/Copy_constructor Copy constructor], [http://en.wikipedia.org/wiki/Destructor_(computer_science) Destructor]<br />
* [http://www.gnu.org/software/hello/manual/automake/Libtool-Convenience-Libraries.html Force C++ linking]<br />
<br />
== Tips &amp; Tricks ==<br />
<br />
*[http://www.gnu.org/s/libc/manual/html_node/Date-and-Time.html#Date-and-Time Timing in C] <br />
*Don't use non-standard functions, like [http://en.wikipedia.org/wiki/Itoa itoa] <br />
*[http://www.gnu.org/software/libc/manual/html_node/Program-Arguments.html Handling program arguments] (avoid <code>argp</code> since it is not supported on many platforms) <br />
*[http://en.wikipedia.org/wiki/Dynamic_loading Dynamic loading of shared libraries] <br />
*Avoid [http://en.wikipedia.org/wiki/Variable-length_array variable-length arrays]. First, GCC allocates them on the stack. Second, the status of this feature in GCC is BROKEN. Therefore, never do this:<br />
<br />
<source lang="C"><br />
int size;<br />
MPI_Comm_size(MPI_COMM_WORLD, &size);<br />
char names[size][MPI_MAX_PROCESSOR_NAME];<br />
</source> <br />
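A heap-allocated buffer is the safe replacement for the VLA above. A minimal sketch; MPI_MAX_PROCESSOR_NAME is redefined here with a stand-in value only so the fragment compiles without mpi.h:<br />

<source lang="C"><br />
</source>

```c
#include <stdlib.h>

/* Stand-in for the constant from mpi.h, so this sketch compiles alone. */
#define MPI_MAX_PROCESSOR_NAME 256

/* Allocate size rows of MPI_MAX_PROCESSOR_NAME chars on the heap;
   row i starts at names + i * MPI_MAX_PROCESSOR_NAME.
   Returns NULL on allocation failure. */
char *alloc_names(int size) {
    return malloc((size_t)size * MPI_MAX_PROCESSOR_NAME);
}
```

Check the return value for NULL and release the buffer with free() when done.<br />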
<br />
*Implement delays in the execution of the program with the help of [http://linux.die.net/man/2/nanosleep nanosleep]. Compared to sleep and usleep, nanosleep has the advantages of not affecting any signals, being standardized by POSIX, and providing higher timing resolution; it also makes it easier to continue a sleep that has been interrupted by a signal.<br />
<br />
*Indenting in fupermod is done in the [http://google-styleguide.googlecode.com/svn/trunk/cppguide.xml?showone=Spaces_vs._Tabs#Spaces_vs._Tabs Google code style]: two literal spaces, no tabs. To set vim to do this, put the following in .vimrc:<br />
set autoindent<br />
set expandtab<br />
set tabstop=2<br />
set shiftwidth=2<br />
set softtabstop=2<br />
<br />
*To indent all .c and .h files with vim use the following ([http://stackoverflow.com/questions/3218528/indenting-in-vim-with-all-the-files-in-folder explained here]):<br />
:args ./*/*.[ch] | argdo execute "normal gg=G" | update<br />
or use the Unix command <br />
$ indent<br />
<br />
== Color GCC ==<br />
Colours the output of GCC so that errors and warnings are easy to spot.<br />
sudo apt-get install colorgcc<br />
ln -s /usr/bin/colorgcc ~/bin/gcc<br />
*Make sure ~/bin is in path _before_ gcc. (Add ~/bin to path in ~/.profile)</div>Davepchttps://hcl.ucd.ie/wiki/index.php?title=Desktop_Backup&diff=793Desktop Backup2012-12-06T18:35:41Z<p>Davepc: </p>
<hr />
<div>Members of the HCL group may back up their desktops to the heterogeneous server in the following directory:<br />
<br />
heterogeneous:/home/desktops/&lt;user&gt;<br />
<br />
This can easily be done with rsync as follows:<br />
<br />
rsync -axv /home/&lt;your_desktop_username&gt;/ &lt;user&gt;@heterogeneous:/home/desktops/&lt;user&gt;/ --exclude-from=.bkup_excludes<br />
<br />
and create the file .bkup_excludes listing the files and directories you would like to exclude, for example your downloads folder, Internet cache, etc. An example of Dave's excludes file: <br />
<br />
.Skype/<br />
.Trash-1000/<br />
.adobe/<br />
.cache/<br />
.config/chromium<br />
.dropbox/<br />
.mozilla/<br />
.ssh/<br />
.svn/<br />
.thumbnails/<br />
.thunderbird/<br />
Downloads/<br />
Dropbox/<br />
backups/</div>Davepchttps://hcl.ucd.ie/wiki/index.php?title=Desktop_Backup&diff=792Desktop Backup2012-12-06T18:32:56Z<p>Davepc: </p>
<hr />
<div>Members of the HCL group may back up their desktops to the heterogeneous server in the following directory:<br />
<br />
heterogeneous:/home/desktops/&lt;user&gt;<br />
<br />
This can easily be done with rsync as follows:<br />
<br />
rsync -axv /home/&lt;your_desktop_username&gt;/ &lt;user&gt;@heterogeneous:/home/desktops/&lt;user&gt;/ --exclude-from=.bkup_excludes<br />
<br />
and create the file .bkup_excludes listing the files and directories you would like to exclude, for example your downloads folder. An example of Dave's excludes file: <br />
<br />
.Skype/<br />
.Trash-1000/<br />
.adobe/<br />
.cache/<br />
.config/chromium<br />
.dropbox/<br />
.mozilla/<br />
.ssh/<br />
.svn/<br />
.thumbnails/<br />
.thunderbird/<br />
Downloads/<br />
Dropbox/<br />
backups/</div>Davepchttps://hcl.ucd.ie/wiki/index.php?title=Desktop_Backup&diff=791Desktop Backup2012-12-06T18:32:33Z<p>Davepc: </p>
<hr />
<div>Members of the HCL group may back up their desktops to the heterogeneous server in the following directory: <br />
<br />
heterogeneous:/home/desktops/&lt;user&gt;<br />
<br />
This can easily be done with rsync as follows: <br />
<br />
rsync -axv /home/&lt;your_desktop_username&gt;/ &lt;user&gt;@heterogeneous:/home/desktops/&lt;user&gt;/ --exclude-from=.bkup_excludes<br />
<br />
and create the file .bkup_excludes listing the files and directories you would like to exclude, for example your downloads folder. An example of Dave's excludes file: <br />
<br />
.Skype/<br />
.Trash-1000/<br />
.adobe/<br />
.cache/<br />
.config/chromium<br />
.dropbox/<br />
.mozilla/<br />
.ssh/<br />
.svn/<br />
.thumbnails/<br />
.thunderbird/<br />
Downloads/<br />
Dropbox/<br />
backups/</div>Davepchttps://hcl.ucd.ie/wiki/index.php?title=Desktop_Backup&diff=790Desktop Backup2012-12-06T18:30:32Z<p>Davepc: Created page with "Members of the HCL group may backup their desktops to heterogeneous server in the following directory /home/desktops/&lt;user&gt; This can easly be done with rsync as follow…"</p>
<hr />
<div>Members of the HCL group may back up their desktops to the heterogeneous server in the following directory <br />
<br />
/home/desktops/&lt;user&gt;<br />
<br />
This can easily be done with rsync as follows <br />
<br />
rsync -axv /home/&lt;your_desktop_username&gt;/ &lt;user&gt;@heterogeneous:/home/desktops/&lt;user&gt;/ --exclude-from=.bkup_excludes<br />
<br />
and create the file .bkup_excludes listing the files and directories you would like to exclude, for example your downloads folder. An example of Dave's excludes file:<br />
<br />
<source lang="text">.Skype/<br />
.Trash-1000/<br />
.adobe/<br />
.cache/<br />
.config/chromium<br />
.dropbox/<br />
.mozilla/<br />
.ssh/<br />
.svn/<br />
.thumbnails/<br />
.thunderbird/<br />
Downloads/<br />
Dropbox/<br />
backups/</source></div>Davepchttps://hcl.ucd.ie/wiki/index.php?title=Main_Page&diff=789Main Page2012-12-06T18:21:56Z<p>Davepc: /* Hardware */</p>
<hr />
<div>This site is set up for sharing ideas, findings, and experience in heterogeneous computing. Please log in and create new pages or edit existing ones. To learn how to format wiki pages, read [[Help:Editing|here]].<br />
<br />
== HCL software for heterogeneous computing ==<br />
* Extensions for [[MPI]]: [http://hcl.ucd.ie/project/mpC mpC] [http://hcl.ucd.ie/project/HeteroMPI HeteroMPI] [http://hcl.ucd.ie/project/libELC libELC]<br />
* Extensions for [http://en.wikipedia.org/wiki/GridRPC GridRPC]: [http://hcl.ucd.ie/project/SmartGridSolve SmartGridSolve] [http://hcl.ucd.ie/project/NI-Connect NI-Connect]<br />
* Computation benchmarking, modeling, dynamic load balancing: [http://hcl.ucd.ie/project/fupermod FuPerMod] [http://hcl.ucd.ie/project/pmm PMM]<br />
* Communication benchmarking, modeling, optimization: [http://hcl.ucd.ie/project/cpm CPM] [http://hcl.ucd.ie/project/mpiblib MPIBlib]<br />
<br />
== Heterogeneous mathematical software ==<br />
* [http://hcl.ucd.ie/project/HeteroScaLAPACK HeteroScaLAPACK]<br />
* [http://hcl.ucd.ie/project/Hydropad Hydropad]<br />
<br />
== Operating systems == <br />
* [[Linux]]<br />
* [[Windows]]<br />
<br />
== Development tools ==<br />
* [[C/C++]], [[Python]], [[UML]], [[FORTRAN]]<br />
* [[Autotools]]<br />
* [[GDB]], [[OProfile]], [[Valgrind]]<br />
* [[Doxygen]]<br />
* [[ChangeLog]], [[Subversion]]<br />
* [[Eclipse]]<br />
* [[Bash Scripts]]<br />
<br />
== [[Libraries]] ==<br />
* [[GNU C Library]]<br />
* [[MPI]]<br />
* [[STL]], [[Boost]]<br />
* [[GSL]]<br />
* [[BLAS LAPACK ScaLAPACK]]<br />
* [[NLOPT]]<br />
* [[BitTorrent (B. Cohen's version)]]<br />
* [[CUDA SDK]]<br />
<br />
== Data processing ==<br />
* [[gnuplot]], [[pgfplot]], [[matplotlib]]<br />
* [[Graphviz]]<br />
* [[Octave]], [[R]]<br />
* [[G3DViewer]]<br />
<br />
== Paper & Presentation Tools ==<br />
* [[Dia]], [[PGF/Tikz]], [[pgfplot]]<br />
* [[LaTeX]], [[Beamer]]<br />
* [[BibTeX]], [[JabRef]]<br />
<br />
== Hardware ==<br />
* [[HCL cluster]]<br />
* [[Other UCD Resources]]<br />
* [[UTK multicores + GPU]]<br />
* [[Grid5000]]<br />
* [[Desktop Backup]]<br />
<br />
[[SSH|How to connect to cluster via SSH]]<br />
<br />
[[hwloc|How to find information about the hardware]]<br />
<br />
== Mathematics ==<br />
* [http://en.wikipedia.org/wiki/Confidence_interval Confidence interval (Statistics)], [http://en.wikipedia.org/wiki/Student's_t-distribution Student's t-distribution] (implemented in [[GSL]])<br />
* [http://en.wikipedia.org/wiki/Linear_regression Linear regression] (implemented in [[GSL]])<br />
* [http://en.wikipedia.org/wiki/Binomial_tree#Binomial_tree Binomial tree] (use [[Graphviz]] to visualize trees)<br />
* [http://en.wikipedia.org/wiki/Spline_interpolation Spline interpolation], [http://en.wikipedia.org/wiki/B-spline Spline approximation] (implemented in [[GSL]])</div>Davepchttps://hcl.ucd.ie/wiki/index.php?title=PGF/Tikz&diff=786PGF/Tikz2012-11-07T17:46:31Z<p>Davepc: </p>
<hr />
<div>* [http://en.wikipedia.org/wiki/PGF/TikZ Tikz on Wikipedia]<br />
* [http://www.ctan.org/tex-archive/graphics/pgf/base/doc/generic/pgf/pgfmanual.pdf PGF manual]<br />
* [http://www.texample.net/tikz/examples/ Examples]<br />
<br />
= Write a figure =<br />
<br />
The preamble of the LaTeX file must contain: <source lang="latex">\usepackage{tikz}</source> Optional libraries can be added like this: <source lang="latex">\usetikzlibrary{calc}</source> <br />
<br />
To start a figure, the code must be placed inside the tikzpicture environment like this: <source lang="latex">\begin{tikzpicture} ... TikZ code here... \end{tikzpicture}</source> <br />
<br />
= Example =<br />
<br />
<source lang="latex"><br />
% Author: Quintin Jean-Noël<br />
% <http://moais.imag.fr/membres/jean-noel.quintin/><br />
\documentclass{article}<br />
\usepackage{tikz}<br />
\usetikzlibrary[topaths]<br />
% A counter, since TikZ is not clever enough (yet) to handle<br />
% arbitrary angle systems.<br />
\newcount\mycount<br />
\begin{document}<br />
\begin{tikzpicture}[transform shape]<br />
% Multiplication with floats is not possible, so the loop is split<br />
% in two.<br />
\foreach \number in {1,...,8}{<br />
% Compute angle:<br />
\mycount=\number<br />
\advance\mycount by -1<br />
\multiply\mycount by 45<br />
\advance\mycount by 0 <br />
\node[draw,circle,inner sep=0.125cm] (N-\number) at (\the\mycount:5.4cm) {};<br />
} <br />
\foreach \number in {9,...,16}{<br />
% Compute angle:<br />
\mycount=\number<br />
\advance\mycount by -1<br />
\multiply\mycount by 45<br />
\advance\mycount by 22.5<br />
\node[draw,circle,inner sep=0.125cm] (N-\number) at (\the\mycount:5.4cm) {};<br />
} <br />
\foreach \number in {1,...,15}{<br />
\mycount=\number<br />
\advance\mycount by 1 <br />
\foreach \numbera in {\the\mycount,...,16}{<br />
\path (N-\number) edge[->,bend right=3] (N-\numbera) edge[<-,bend<br />
left=3] (N-\numbera);<br />
} <br />
}<br />
\end{tikzpicture}<br />
\end{document}<br />
</source> <br />
<br />
= See also =<br />
<br />
*[[Pgfplot]]</div>Davepchttps://hcl.ucd.ie/wiki/index.php?title=Subversion&diff=784Subversion2012-11-01T16:27:36Z<p>Davepc: </p>
<hr />
<div>http://svnbook.red-bean.com/ <br />
<br />
*Subversion clients work with <code>.svn</code> directories - don't remove them. <br />
*Mind the version of the client (currently, 1.5, 1.6).<br />
<br />
== Repositories ==<br />
*http://gforge.ucd.ie/softwaremap/tag_cloud.php?tag=heterogeneous+computing<br />
<br />
== To submit ==<br />
<br />
*Software sources: models, code, resource files <br />
*Documentation sources: texts, diagrams, data <br />
*Configuration files <br />
*Test sources: code, input data<br />
<br />
== Not to submit ==<br />
<br />
*Binaries: object files, libraries, executables <br />
*Built documentation: html, pdf <br />
*Personal settings: Eclipse projects, ... <br />
*Test output<br />
<br />
= Subversion for Users =<br />
<br />
A good cross-platform client: [http://www.rapidsvn.org/index.php/Documentation RapidSVN], combined with [http://meldmerge.org/ Meld], a visual diff and merge tool.<br />
<br />
== RapidSVN, Gforge &amp; passwords ==<br />
<br />
'''Problem:''' RapidSVN doesn't directly support svn over ssh and so doesn't remember ssh passwords. And&nbsp;<strike>gforge.ucd.ie appears not to support passwordless authentication with a public key</strike>. There was a bug in the cron job that updated the keys. <br />
<br />
'''Solution:''' Log in and fix the permissions:<br />
<br />
&nbsp; ssh gforge.ucd.ie<br />
&nbsp; chmod 700 .ssh<br />
<br />
Then add your public key (<code>.ssh/id_rsa.pub</code>) to http://gforge.ucd.ie/<br />
<br />
(and wait for the cron job to add it, or add it manually) <br />
<br />
<strike>'''Solution:''' Use sshpass to remember password.</strike> <br />
<br />
<strike>Note: this method involves having your gforge password in plain text, and so is a potential security risk - it should be different to other passwords etc.</strike> <br />
<br />
<strike>Install sshpass &gt;=1.05 (note: Ubuntu 11.10 uses version 1.04, which just hangs - so install from sources or use Ubuntu 12.04)</strike> <br />
<br />
<strike>edit ~/.subversion/config, in&nbsp;[tunnels] section add the line:</strike> <br />
<br />
<strike>gforge = sshpass -f{path to file holding password} ssh -o PubkeyAuthentication=no -o ControlMaster=no <br />
</strike><br />
<br />
<strike>Then check out with:</strike> <br />
<br />
<strike>svn checkout svn+gforge://&lt;user&gt;@gforge.ucd.ie/var/lib/gforge/chroot/scmrepos/svn/fupermod/trunk fupermod <br />
</strike><br />
<br />
<strike>(where previously it was:&nbsp;svn checkout svn+ssh)</strike> <br />
<br />
<strike>To change an existing working copy</strike> <br />
<br />
<strike>svn switch --relocate svn+ssh://&lt;user&gt;@gforge.ucd.ie/&lt;old url&gt; svn+gforge://&lt;user&gt;@gforge.ucd.ie/&lt;new url&gt;</strike></div>Davepchttps://hcl.ucd.ie/wiki/index.php?title=C/C%2B%2B&diff=781C/C++2012-11-01T13:12:30Z<p>Davepc: /* Tips &amp; Tricks */</p>
<hr />
<div>== Coding ==<br />
* C++ programming style is preferable. For example, in variable declarations, pointers and references should have their reference symbol next to the type rather than to the name. Variables should be initialized where they are declared, and should be declared where they are used. For more details, see [http://google-styleguide.googlecode.com/svn/trunk/cppguide.xml Google C++ Style Guide]<br />
* [http://en.wikipedia.org/wiki/Indent_style#Variant:_1TBS One-true-brace indent style]<br />
* [http://en.wikipedia.org/wiki/Pragma_once Coding header files]<br />
* Learn from examples and use coding approaches from third-party software<br />
<br />
== Commenting ==<br />
* Place [[Doxygen]] comments in header files (before declarations of namespaces/classes/structs/typedefs/macros) and main source files (for documenting tools and tests)<br />
* Use double forward slash for short comments in the code<br />
<br />
== C++ ==<br />
* [http://developers.sun.com/solaris/articles/mixing.html Mixing C/C++]<br />
* Provide main API in C<br />
* Use plain C unless you need flexible data structures or [[STL]]/[[Boost]] functionality<br />
* [http://en.wikipedia.org/wiki/Template_metaprogramming Template C++] is preferable from the point of view of runtime performance<br />
* Mind the life cycle of objects: [http://en.wikipedia.org/wiki/Default_constructor Default constructor], [http://en.wikipedia.org/wiki/Copy_constructor Copy constructor], [http://en.wikipedia.org/wiki/Destructor_(computer_science) Destructor]<br />
* [http://www.gnu.org/software/hello/manual/automake/Libtool-Convenience-Libraries.html Force C++ linking]<br />
<br />
== Tips &amp; Tricks ==<br />
<br />
*[http://www.gnu.org/s/libc/manual/html_node/Date-and-Time.html#Date-and-Time Timing in C] <br />
*Don't use non-standard functions, like [http://en.wikipedia.org/wiki/Itoa itoa] <br />
*[http://www.gnu.org/software/libc/manual/html_node/Program-Arguments.html Handling program arguments] (avoid <code>argp</code> since it is not supported on many platforms) <br />
*[http://en.wikipedia.org/wiki/Dynamic_loading Dynamic loading of shared libraries] <br />
*Avoid [http://en.wikipedia.org/wiki/Variable-length_array variable-length arrays]. First, GCC allocates them on the stack. Second, the status of this feature in GCC is BROKEN. Therefore, never do this:<br />
<br />
<source lang="C"><br />
int size;<br />
MPI_Comm_size(MPI_COMM_WORLD, &size);<br />
char names[size][MPI_MAX_PROCESSOR_NAME];<br />
</source> <br />
<br />
*Implement delays in the execution of the program with the help of [http://linux.die.net/man/2/nanosleep nanosleep]. Compared to sleep and usleep, nanosleep does not affect any signals, is standardized by POSIX, provides higher timing resolution, and makes it easier to resume a sleep that has been interrupted by a signal.<br />
<br />
*Indenting in fupermod is done in the [http://google-styleguide.googlecode.com/svn/trunk/cppguide.xml?showone=Spaces_vs._Tabs#Spaces_vs._Tabs Google code style]: two literal spaces, no tabs. To make vim do this, put the following in .vimrc:<br />
set autoindent<br />
set expandtab<br />
set tabstop=2<br />
set shiftwidth=2<br />
set softtabstop=2<br />
<br />
*To indent all .c and .h files with vim use the following ([http://stackoverflow.com/questions/3218528/indenting-in-vim-with-all-the-files-in-folder explained here]):<br />
:args ./*/*.[ch] | argdo execute "normal gg=G" | update<br />
or use the Unix command <br />
$ indent</div>Davepchttps://hcl.ucd.ie/wiki/index.php?title=Gnuplot&diff=779Gnuplot2012-10-11T01:40:26Z<p>Davepc: </p>
<hr />
<div>[http://www.gnuplot.info/documentation.html Official gnuplot documentation] <br />
<br />
[http://gnuplot.sourceforge.net/demo/ Demo scripts for gnuplot] <br />
<br />
[http://t16web.lanl.gov/Kawano/gnuplot/index-e.html GNUPLOT: not so Frequently Asked Questions] <br />
<br />
When plotting "points" data files from fupermod, you will need [http://gnuplot.sourceforge.net/docs_4.2/node172.html this]: set datafile missing "."<br />
<br />
Insert a multiplication symbol with {/Symbol \264}; see [http://gnuplot-tricks.blogspot.ie/2009/05/gnuplot-tricks-many-say-that-it-is.html here] and [http://quark.phys.s.u-tokyo.ac.jp/~kawanai/file/guide.pdf here].<br />
<br />
=== Error message "';' expected" ===<br />
That syntax (linetype specification with just a number, but no keyword) <br />
has been deprecated for several years now. It had never been an <br />
officially documented feature anyway, and was removed ages ago. Have a <br />
look at "help plot style" to see how it's done. [http://groups.google.com/group/comp.graphics.apps.gnuplot/browse_thread/thread/00cb432c02560cf3 More]<br />
<br />
Deprecated:<br />
plot with lines 1<br />
<br />
Should be:<br />
plot with lines ls 1</div>Davepchttps://hcl.ucd.ie/wiki/index.php?title=Grid5000&diff=777Grid50002012-09-19T10:52:30Z<p>Davepc: /* GotoBLAS2 */</p>
<hr />
<div>https://www.grid5000.fr/mediawiki/index.php/Grid5000:Home <br />
<br />
[https://www.grid5000.fr/mediawiki/index.php/Grid5000:UserCharter USAGE POLICY]&nbsp;- Very important: after booking nodes (oarsub ...), run the command&nbsp;<source lang="">outofchart</source>&nbsp;This checks that you haven't booked too many resources, which would get you in trouble with the Grid5000 admins.<br />
<br />
<br />
<br />
== Login, job submission, deployment of image ==<br />
<br />
*Select sites and clusters for experiments, using information on the [https://www.grid5000.fr/mediawiki/index.php/Grid5000:Network#Grid.275000_Sites Grid5000 network] and the [https://www.grid5000.fr/mediawiki/index.php/Status Status page] <br />
*Access is provided via access nodes '''access.SITE.grid5000.fr''' marked [https://www.grid5000.fr/mediawiki/index.php/External_access here] as ''accessible from '''everywhere''' via ssh with '''keyboard-interactive''' authentication method''. As soon as you are on one of the sites, you can directly ssh frontend node of any other site:<br />
<br />
<source lang="bash"><br />
access_$ ssh frontend.SITE2<br />
</source> <br />
<br />
*There is no access to Internet from computing nodes (external IPs should be registered on proxy), therefore, download/update your stuff at the access nodes. Several revision control clients are available. <br />
*Each site has a separate NFS, therefore, to run an application on several sites at once, you need to copy it '''scp, sftp, rsync''' between access or frontend nodes. <br />
*Jobs are run from the frontend nodes, using a [http://en.wikipedia.org/wiki/OpenPBS PBS]-like system [https://www.grid5000.fr/mediawiki/index.php/Cluster_experiment-OAR2 OAR]. Basic commands: <br />
**'''oarstat''' - queue status <br />
**'''oarsub''' - job submission <br />
**'''oardel''' - job removal<br />
<br />
Interactive job on deployed images: <source lang="bash"><br />
frontend_$ oarsub -I -t deploy -l [/cluster=N/]nodes=N,walltime=HH[:MM[:SS]] [-p 'PROPERTY="VALUE"']<br />
</source> Batch job on installed images: <source lang="bash"><br />
frontend_$ oarsub BATCH_FILE -t allow_classic_ssh -l [/cluster=N/]nodes=N,walltime=HH[:MM[:SS]] [-p 'PROPERTY="VALUE"']<br />
</source> <br />
Specifying cluster name to reserve: <source lang="bash"><br />
oarsub -r 'YYYY-MM-dd HH:mm:ss' -l nodes=2,walltime=1 -p "cluster='Genepi'"<br />
</source> If the resources are available two nodes from the cluster "Genepi" will be reserved for the specified time.<br />
<br />
*The image to deploy can be created and loaded with help of a [http://wiki.systemimager.org/index.php/Main_Page Systemimager]-like system [https://www.grid5000.fr/mediawiki/index.php/Deploy_environment-OAR2 Kadeploy]. Creating: [https://www.grid5000.fr/mediawiki/index.php/Deploy_environment-OAR2#Tune_an_environment_to_build_another_one:_customize_authentification_parameters described here]<br />
<br />
Loading: <source lang="bash"><br />
frontend_$ kadeploy3 -a PATH_TO_PRIVATE_IMAGE_DESC -f $OAR_FILE_NODES <br />
</source> A Linux distribution lenny-x64-nfs-2.1 with mc, subversion, autotools, doxygen, MPICH2, GSL, Boost, R, gnuplot, graphviz, X11, evince is available at Orsay /home/nancy/alastovetsky/grid5000.<br />
<br />
== Compiling and running MPI applications ==<br />
<br />
*Compilation should be done on one of the reserved nodes (e.g. ssh `head -n 1 $OAR_NODEFILE`) <br />
*Running MPI applications is described [https://www.grid5000.fr/mediawiki/index.php/Run_MPI_On_Grid%275000 here] <br />
**mpirun/mpiexec should be run from one of the reserved nodes (e.g. ssh `head -n 1 $OAR_NODEFILE`)<br />
<br />
== Setting up new deploy image ==<br />
<br />
List available images <br />
<br />
kaenv3 -l<br />
<br />
Then book node and launch: <br />
<br />
oarsub -I -t deploy -l nodes=1,walltime=12<br />
kadeploy3 -e squeeze-x64-big -f $OAR_FILE_NODES -k<br />
ssh root@`head -n 1 $OAR_NODEFILE`<br />
<br />
default password: grid5000 <br />
<br />
edit /etc/apt/sources.list <br />
<br />
apt-get update<br />
apt-get upgrade<br />
<br />
apt-get install libtool autoconf automake mc colorgcc ctags libboost-serialization-dev libboost-graph-dev <br />
libatlas-base-dev gfortran vim gdb valgrind screen subversion iperf bc gsl-bin libgsl0-dev<br />
<br />
Possibly also install (for using extrae): <br />
<br />
libxml2-dev binutils-dev libunwind7-dev<br />
<br />
<br> Compiled from sources by us: <br />
<br />
*<strike>gsl-1.14 (download: ftp://ftp.gnu.org/gnu/gsl/)&nbsp;</strike> ''Now with squeeze it is in repository.''<br />
<br />
<strike>./configure &amp;&amp; make &amp;&amp; make install</strike><br />
<br />
*mpich2 (download: http://www.mcs.anl.gov/research/projects/mpich2/downloads/index.php?s=downloads)<br />
<br />
./configure --enable-shared --enable-sharedlibs=gcc --with-pm=mpd<br />
make &amp;&amp; make install<br />
<br />
Mpich2 installed to: <br />
<br />
Installing MPE2 include files to /usr/local/include<br />
Installing MPE2 libraries to /usr/local/lib<br />
Installing MPE2 utility programs to /usr/local/bin<br />
Installing MPE2 configuration files to /usr/local/etc<br />
Installing MPE2 system utility programs to /usr/local/sbin<br />
Installing MPE2 man to /usr/local/share/man<br />
Installing MPE2 html to /usr/local/share/doc/<br />
Installed MPE2 in /usr/local<br />
<br />
*hwloc (and lstopo) (download: http://www.open-mpi.org/software/hwloc/v1.2/)<br />
<br />
compile from sources. To get xml support install libxml2-dev and pkg-config <br />
<br />
apt-get install libxml2-dev pkg-config<br />
tar -xzvf hwloc-1.1.1.tar.gz<br />
cd hwloc-1.1.1<br />
./configure &amp;&amp; make &amp;&amp; make install<br />
<br />
Change root password. <br />
<br />
rm sources from root dir. <br />
<br />
Edit the "message of the day" <br />
<br />
vi /etc/motd.tail<br />
<br />
echo 90 &gt; /proc/sys/vm/overcommit_ratio<br />
echo 2 &gt; /proc/sys/vm/overcommit_memory<br />
date &gt;&gt; release<br />
<br />
Cleanup <br />
<br />
apt-get clean<br />
rm /etc/udev/rules.d/*-persistent-net.rules<br />
<br />
Make image <br />
<br />
ssh root@'''node''' tgz-g5k &gt; $HOME/grid5000/'''imagename'''.tgz<br />
<br />
make appropriate .env file. <br />
<br />
kaenv3 -p lenny-x64-nfs -u deploy &gt; lenny-x64-custom-2.3.env<br />
<br />
<br> <br />
<br />
== GotoBLAS2 ==<br />
http://www.tacc.utexas.edu/tacc-projects/gotoblas2<br />
When compiling GotoBLAS on a node without direct internet access, you get this error: <source lang="">wget http://www.netlib.org/lapack/lapack-3.1.1.tgz<br />
--2011-05-19 03:11:03-- http://www.netlib.org/lapack/lapack-3.1.1.tgz<br />
Resolving www.netlib.org... 160.36.58.108<br />
Connecting to www.netlib.org|160.36.58.108|:80... failed: Connection timed out.<br />
Retrying.<br />
<br />
--2011-05-19 03:14:13-- (try: 2) http://www.netlib.org/lapack/lapack-3.1.1.tgz<br />
Connecting to www.netlib.org|160.36.58.108|:80... failed: Connection timed out.<br />
Retrying.<br />
...</source> <br />
<br />
Fix by downloading http://www.netlib.org/lapack/lapack-3.1.1.tgz to the GotoBLAS2 source directory and editing this line in the Makefile <br />
<br />
184c184<br />
&lt; -wget http://www.netlib.org/lapack/lapack-3.1.1.tgz<br />
---<br />
&gt; # -wget http://www.netlib.org/lapack/lapack-3.1.1.tgz<br />
<br />
<br> GotoBLAS needs to be compiled individually for each unique machine - i.e. each cluster. Add the following to .bashrc: <br />
<br />
export CLUSTER=`hostname |sed 's/\([a-z]*\).*/\1/'`<br />
export LD_LIBRARY_PATH=$HOME/lib/$CLUSTER:$HOME/lib:/usr/local/lib:$LD_LIBRARY_PATH<br />
export LIBRARY_PATH=$HOME/lib/$CLUSTER:$HOME/lib:/usr/local/lib:$LIBRARY_PATH<br />
<br />
Run the following script once on each cluster: <br />
<br />
<source lang="bash">#! /bin/bash<br />
echo "Compiling gotoblas for cluster: $CLUSTER"<br />
cd $HOME/src<br />
if [ ! -d "$CLUSTER" ]; then<br />
mkdir $CLUSTER<br />
fi<br />
cd $CLUSTER<br />
tar -xzf ../Goto*.tar.gz<br />
cd Goto*<br />
make &> m.log<br />
<br />
<br />
if [ ! -d "$HOME/lib/$CLUSTER" ]; then<br />
mkdir $HOME/lib/$CLUSTER<br />
fi<br />
<br />
cp libgoto2.so $HOME/lib/$CLUSTER<br />
<br />
echo results<br />
ls -d $HOME/src/$CLUSTER<br />
ls $HOME/src/$CLUSTER<br />
<br />
ls -d $HOME/lib/$CLUSTER<br />
ls $HOME/lib/$CLUSTER</source> <br />
<br />
Note: on newer processors this may fail. If it is a NEHALEM processor, try: <br />
<br />
make clean<br />
make TARGET=NEHALEM<br />
<br />
== Paging and the OOM-Killer ==<br />
<br />
When doing exhaustion of available memory experiments, problems can occur with over-commit. See [[HCL cluster#Paging_and_the_OOM-Killer]] for more detail. <br />
<br />
== Example of experiment setup across several sites ==<br />
<br />
Sources of all files mentioned below is available at: [[Grid5000:sources]]. <br />
<br />
Pick one head node as the main head node (I use grenoble, but any will do). Setup sources <br />
<br />
cd dave/fupermod-1.1.0<br />
make clean<br />
./configure --with-cblas=goto --prefix=/usr/local/<br />
<br />
Reserve 2 nodes from all clusters on a 3 cluster site: <br />
<br />
oarsub -r "2011-07-25 11:01:01" -t deploy -l cluster=3/nodes=2,walltime=11:59:00<br />
<br />
Automate with: <br />
<br />
for a in 2 3 4; do for i in `cat sites.$a`; do echo $a $i; ssh $i oarsub -r "2011-07-25 11:01:01" -t deploy -l cluster=$a/nodes=2,walltime=11:59:00; done; done<br />
<br />
Then on each site, where xxx is site name: <br />
<br />
kadeploy3 -a $HOME/grid5000/lenny-dave.env -f $OAR_NODE_FILE --output-ok-nodes deployed.xxx<br />
<br />
Gather deployed files to a head node: <br />
<br />
for i in `cat ~/sites `; do echo $i; scp $i:deployed* .&nbsp;; done<br />
cat deployed.* &gt; deployed.all<br />
<br />
Copy cluster-specific libs to each deployed node's /usr/local/lib dir with the script: <br />
<br />
copy_local_libs.sh deployed.all<br />
<br />
Copy the source files to the root dir of each deployed node, then make install on each (note: ssh -f runs these in parallel): <br />
<br />
for i in `cat ~/deployed.all`; do echo $i; rsync -aP ~/dave/fupermod-1.1.0 root@$i:&nbsp;; done<br />
for i in `cat ~/deployed.all`; do echo $i; ssh -f root@$i "cd fupermod-1.1.0&nbsp;; make all install"&nbsp;; done<br />
<br />
ssh to the first node <br />
<br />
ssh `head -n1 deployed.all`<br />
n=$(cat deployed.all |wc -l)<br />
mpdboot --totalnum=$n --file=$HOME/deployed.all<br />
mpdtrace<br />
<br />
cd dave/data/<br />
mpirun -n $n /usr/local/bin/partitioner -l /usr/local/lib/libmxm_col.so -a0 -D10000 -o N=100<br />
<br />
Cleanup after: <br />
<br />
for i in `cat ~/sites `; do echo $i; ssh $i rm deployed.*&nbsp;; done<br />
<br />
== Check network speed ==<br />
<br />
apt-get install iperf<br />
<br />
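A minimal point-to-point check might then look like the following (the hostname and duration are placeholders; the helper function is purely illustrative):<br />
<br />
```shell
# Illustrative helper: build the iperf client command line for a given
# server host and test duration in seconds (defaults to 10).
iperf_client_cmd() {
    printf 'iperf -c %s -t %s\n' "$1" "${2:-10}"
}

# On the server node run:   iperf -s
# On the client node run the command printed by, e.g.:
iperf_client_cmd hcl01 10   # prints: iperf -c hcl01 -t 10
```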
== Choose which network interface to use ==<br />
<br />
mpirun --mca btl self,openib ...<br />
<br />
or <br />
<br />
mpirun --mca btl self,tcp ...</div>Davepchttps://hcl.ucd.ie/wiki/index.php?title=GDB&diff=762GDB2012-07-23T14:52:09Z<p>Davepc: </p>
<hr />
<div>Debugging with GDB<br />
<br />
Compile the programme with -g -O0 (or ./configure --enable-debug)<br />
<br />
For a serial programme (or MPI running with 1 process):<br />
gdb ./programme_name<br />
<br />
To debug an MPI application in parallel,<br />
add this line somewhere in the code:<br />
if (!rank) getc(stdin); MPI_Barrier(MPI_COMM_WORLD);<br />
<br />
Then run the code and it will hang on that line.<br />
<br />
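While the programme is blocked on getc(), you can attach gdb to it from another terminal. A small illustrative helper (the pgrep usage and binary name are assumptions, not from the original page):<br />
<br />
```shell
# Build the command that attaches gdb to the newest process whose name
# matches $1 (pgrep -n = newest match). Run the printed command in a
# second terminal while the MPI job is blocked on getc().
gdb_attach_cmd() {
    printf 'gdb -p "$(pgrep -n %s)"\n' "$1"
}

# e.g. for the programme compiled above:
gdb_attach_cmd programme_name   # prints: gdb -p "$(pgrep -n programme_name)"
```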
If you get the gdb error:<br />
fupermod_....c: No such file or directory.<br />
<br />
Run the following gdb command to add to the list of source directories to be searched:<br />
directory ~/fupermod/</div>Davepchttps://hcl.ucd.ie/wiki/index.php?title=MPI&diff=761MPI2012-07-16T23:01:54Z<p>Davepc: </p>
<hr />
<div>== Documentation ==<br />
* http://www.mpi-forum.org/docs/docs.html<br />
<br />
== Implementations ==<br />
* [[LAM]]<br />
* [[MPICH]]<br />
* [[OpenMPI]]<br />
* [[MPICH2]]<br />
<br />
== Manual installation ==<br />
Install in separate subfolder <code>$HOME/SUBDIR</code>, because you may need some MPI implementations (see [[Libraries]])<br />
<br />
== Tips & Tricks ==<br />
* For safe consecutive communications create new context, for example:<br />
<source lang="C"><br />
int communication_operation(MPI_Comm comm) {<br />
MPI_Comm newcomm;<br />
MPI_Comm_dup(comm, &newcomm);<br />
... // work with newcomm<br />
MPI_Comm_free(&newcomm);<br />
}<br />
</source><br />
Mind the overhead of <code>MPI_Comm_dup</code> and <code>MPI_Comm_free</code>.<br />
<br />
* If you are having trouble with the multi-homed nature of the HCL Cluster, check [http://www.open-mpi.org/faq/?category=tcp#tcp-selection here]<br />
<br />
== Debugging ==<br />
* Add the following code:<br />
<source lang="C"><br />
int rank;<br />
MPI_Comm_rank(MPI_COMM_WORLD, &rank);<br />
if (!rank)<br />
getc(stdin);<br />
MPI_Barrier(MPI_COMM_WORLD);<br />
</source><br />
* Compile your code with <code>-g</code> option<br />
* Run parallel application<br />
* Attach to process(es) from [[GDB]]<br />
** MPICH-1 runs a background process for each application process: 0, 0b, 1, 1b, ..., therefore, attach to the first ones.<br />
<br />
== Profiling ==<br />
<br />
[http://www.bsc.es/computer-sciences/performance-tools/paraver/general-overview Paraver] by Barcelona Supercomputing Center is "a flexible performance visualization and analysis tool". <br />
<br />
[http://www.bsc.es/computer-sciences/performance-tools/downloads Download here] and [http://www.bsc.es/computer-sciences/performance-tools/documentation tutorials here]. <br />
<br />
Use Extrae to create trace files. <br />
<br />
Extrae was configured and installed on Grid5000 with: <br />
<br />
./configure --prefix=$HOME --with-papi=$HOME --with-mpi=/usr --enable-openmp --with-unwind=$HOME --without-dyninst<br />
make; make install<br />
<br />
Create trace.sh (modified from example extrae file): <br />
<br />
<source lang="bash">#!/bin/bash<br />
export EXTRAE_HOME=$HOME<br />
export EXTRAE_CONFIG_FILE=$HOME/bin/extrae.xml<br />
export LD_LIBRARY_PATH=${EXTRAE_HOME}/lib:@sub_MPI_HOME@/lib:@sub_PAPI_HOME@/lib:@sub_UNWIND_HOME@/lib:$LD_LIBRARY_PATH<br />
export LD_PRELOAD=${EXTRAE_HOME}/lib/libmpitrace.so<br />
<br />
## Run the desired program<br />
$*</source> <br />
<br />
Using the standard extrae.xml supplied with the package. <br />
<br />
mpirun -np 3 ~/bin/trace.sh ./executable<br />
<br />
Files created: ''TRACE.mpits, TRACExxxxxx.mpit'' <br />
<br />
On head node run: <br />
<br />
mpi2prv -f TRACE.mpits -e ./executable -o output_tracefile.prv<br />
<br />
On the local machine, open ''output_tracefile.prv'' with Paraver</div>Davepchttps://hcl.ucd.ie/wiki/index.php?title=HCL_cluster&diff=713HCL cluster2012-07-11T23:13:03Z<p>Davepc: /* Software packages available on HCL Cluster 2.0 */</p>
<hr />
<div>== General Information ==<br />
[[Image:Cluster.jpg|right|thumbnail||HCL Cluster]]<br />
[[Image:network.jpg|right|thumbnail||Layout of the Cluster]]<br />
The HCL cluster is heterogeneous in both computing hardware and network capability.<br />
<br />
Nodes are from Dell, IBM, and HP, with Celeron, Pentium 4, Xeon, and AMD processors ranging in speed from 1.8 to 3.6 GHz. Accordingly, architectures and parameters such as front-side bus, cache, and main memory all vary.<br />
<br />
Operating System used is Debian “squeeze” with Linux kernel 2.6.32.<br />
<br />
The network hardware consists of two Cisco 24+4 port Gigabit switches. Each node has two Gigabit Ethernet ports: each eth0 is connected to the first switch, and each eth1 to the second. The switches are also connected to each other. The bandwidth of each port can be configured to any value between 8Kb/s and 1Gb/s, which allows testing on a very large number of network topologies. As the bandwidth on the link connecting the two switches can also be configured, the cluster can act as two separate clusters connected via one link.<br />
<br />
The diagram shows a schematic of the cluster.<br />
<br />
=== Detailed Cluster Specification ===<br />
* [[HCL Cluster Specifications]]<br />
* [[Old HCL Cluster Specifications]] (pre May 2010)<br />
<br />
=== Documentation ===<br />
* [[media:PE750.tgz|Dell Poweredge 750 Documentation]]<br />
* [[media:SC1425.tgz|Dell Poweredge SC1425 Documentation]]<br />
* [[media:X306.pdf|IBM x-Series 306 Documentation]]<br />
* [[media:E326.pdf|IBM e-Series 326 Documentation ]]<br />
* [[media:Proliant100SeriesGuide.pdf|HP Proliant DL-140 G2 Documentation]]<br />
* [[media:ProliantDL320G3Guide.pdf|HP Proliant DL-320 G3 Documentation]]<br />
* [[media:Cisco3560Specs.pdf|Cisco Catalyst 3560 Specifications]]<br />
* [[media:Cisco3560Guide.pdf|Cisco Catalyst 3560 User Guide]]<br />
* [[HCL Cluster Network]]<br />
<br />
== Cluster Administration ==<br />
<br />
If PBS jobs do not start after a reboot of heterogeneous.ucd.ie, it may be necessary to start maui manually:<br />
/usr/local/maui/sbin/maui<br />
<br />
===Useful Tools===<br />
<code>root</code> on <code>heterogeneous.ucd.ie</code> has a number of [http://expect.nist.gov/ Expect] scripts to automate administration on the cluster (in <code>/root/scripts</code>). <code>root_ssh</code> will automatically log into a host, provide the root password and either return a shell to the user or execute a command that is passed as a second argument. Command syntax is as follows:<br />
<br />
<source lang="text"><br />
# root_ssh<br />
usage: root_ssh [user@]<host> [command]<br />
</source><br />
<br />
Example usage, to login and execute a command on each node in the cluster (note the file <code>/etc/dsh/machines.list</code> contains the hostnames of all compute nodes of the cluster):<br />
# for i in `seq -w 1 16`; do root_ssh hcl$i ps ax \| grep pbs; done<br />
<br />
The above is sequential. To run parallel jobs, for example: <code>apt-get update && apt-get -y upgrade</code>, try the following trick with [http://www.gnu.org/software/screen/ screen]:<br />
# for i in `seq -w 1 16`; do screen -L -d -m root_ssh hcl$i apt-get update \&\& apt-get -y upgrade; done<br />
You can check the screenlog.* files for errors and delete them when you are happy. Sometimes all logs are sent to screenlog.0, not sure why.<br />
<br />
== Software packages available on HCL Cluster 2.0 ==<br />
<br />
With a fresh installation of the operating system on the HCL Cluster, the following packages are available:<br />
* autoconf<br />
* automake<br />
* gcc<br />
* ctags<br />
* cg-vg<br />
* fftw2<br />
* git<br />
* gfortran<br />
* gnuplot<br />
* libtool<br />
* netperf<br />
* octave3.2<br />
* qhull<br />
* subversion<br />
* valgrind<br />
* gsl-dev<br />
* vim<br />
* python<br />
* mc<br />
* openmpi-bin <br />
* openmpi-dev<br />
* evince<br />
* libboost-graph-dev<br />
* libboost-serialization-dev<br />
* libatlas-base-dev<br />
* r-cran-strucchange<br />
* graphviz<br />
* doxygen<br />
* colorgcc<br />
<br />
[[HCL_cluster/hcl_node_install_configuration_log|new hcl node install & configuration log]]<br />
<br />
[[HCL_cluster/heterogeneous.ucd.ie_install_log|new heterogeneous.ucd.ie install log]]<br />
<br />
===APT===<br />
To do unattended updates on cluster machines you need to specify some environment variables and switches to apt-get:<br />
<br />
 DEBIAN_FRONTEND=noninteractive apt-get -q -y upgrade<br />
<br />
NOTE: on hcl01 and hcl02 any updates to grub will force a prompt, despite the switches above. This happens because there are two disks on these machines and grub asks which it should install itself on.<br />
<br />
== Access and Security ==<br />
All access and security for the cluster is handled by the gateway machine (heterogeneous.ucd.ie). This machine is not considered a compute node and should not be used as such. The only new incoming connections allowed are ssh; other incoming packets, such as http responses to requests from inside the cluster (established or related), are also allowed. Incoming ssh packets are only accepted if they originate from designated IP addresses, which must be registered UCD IPs. csserver.ucd.ie is allowed, as is hclgate.ucd.ie, on which all users have accounts. Thus, to gain access to the cluster, ssh from csserver, hclgate or another allowed machine to heterogeneous. From there you can ssh to any of the nodes (hcl01-hcl16) on which you are running a PBS job.<br />
<br />
Access from outside the UCD network is only allowed once you have gained entry to a server that allows outside connections (such as csserver.ucd.ie)<br />
<br />
=== Creating new user accounts ===<br />
As root on heterogeneous run:<br />
adduser <username><br />
make -C /var/yp<br />
<br />
=== Access to the nodes is controlled by Torque PBS ===<br />
Use qsub to submit a job: -I requests an interactive session, and walltime is the time required.<br />
 qsub -I -l walltime=1:00 # Reserve 1 node for 1 hour<br />
qsub -l nodes=hcl01+hcl07,walltime=1:00 myscript.sh<br />
<br />
Example Script:<br />
#!/bin/sh<br />
#General Script<br />
#<br />
#<br />
#These commands set up the Grid Environment for your job:<br />
#PBS -N JOBNAME<br />
#PBS -l walltime=48:00:00<br />
#PBS -l nodes=16<br />
#PBS -m abe<br />
#PBS -k eo<br />
#PBS -V<br />
echo foo<br />
<br />
To see the queue:<br />
qstat -n<br />
showq<br />
<br />
To remove your job <br />
qdel JOBNUM<br />
<br />
More info: [http://www.clusterresources.com/products/torque/docs/]<br />
<br />
== Some networking issues on HCL cluster (unsolved) ==<br />
<br />
"/sbin/route" should give:<br />
<br />
Kernel IP routing table<br />
Destination Gateway Genmask Flags Metric Ref Use Iface<br />
239.2.11.72 * 255.255.255.255 UH 0 0 0 eth0<br />
heterogeneous.u * 255.255.255.255 UH 0 0 0 eth0<br />
192.168.21.0 * 255.255.255.0 U 0 0 0 eth1<br />
192.168.20.0 * 255.255.255.0 U 0 0 0 eth0<br />
192.168.20.0 * 255.255.254.0 U 0 0 0 eth0<br />
192.168.20.0 * 255.255.254.0 U 0 0 0 eth1<br />
default heterogeneous.u 0.0.0.0 UG 0 0 0 eth0<br />
<br />
<br />
For reasons that remain unclear, machines sometimes lose the entry:<br />
<br />
192.168.21.0 * 255.255.255.0 U 0 0 0 eth1<br />
<br />
For Open MPI, this means a system socket "connect" call to any 192.*.21.* address hangs.<br />
In this case, you can either:<br />
<br />
* switch off eth1 (see also [http://hcl.ucd.ie/wiki/index.php/OpenMPI] ):<br />
<br />
mpirun --mca btl_tcp_if_exclude lo,eth1 ...<br />
<br />
or<br />
<br />
* you can restore the above table on all nodes by running "sh /etc/network/if-up.d/00routes" as root<br />
<br />
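For the second option, the effect of the 00routes script on this particular entry can also be reproduced by hand. A configuration sketch (run as root on the affected node; the commands below are an assumption about what 00routes installs, based on the routing table shown above):<br />
<br />
```shell
# Restore the missing eth1 route by hand (equivalent to the table entry
# "192.168.21.0 ... eth1" above). Requires root on the affected node.
ip route add 192.168.21.0/24 dev eth1

# Or, with the older net-tools syntax:
# route add -net 192.168.21.0 netmask 255.255.255.0 dev eth1
```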
It is not yet clear why, without this entry, connections to the "21" addresses fail. We expect that in this case the following rule should be matched (because of the mask):<br />
192.168.20.0 * 255.255.254.0 U 0 0 0 eth0<br />
<br />
The packets then leave over the eth0 interface and should travel via switch 1 to switch 2 and on to the eth1 interface of the corresponding node.<br />
<br />
<br />
* If one attempts a ping from node A, via its eth0 interface, to the eth1 address of another node B, the following is observed:<br />
** outgoing ping packets appear only on the eth0 interface of the first node A.<br />
** incoming ping packets appear only on eth1 interface of the second node B.<br />
** outgoing ping response packets appear on the eth0 interface of the second node B, never on the eth1 interface despite pinging the eth1 address specifically.<br />
What explains this? With the routing tables as they are above, or in the damaged case, the ping may arrive at the correct interface, but the response from B is routed to A-eth0 via B-eth0. Further, after a number of ping packets have been sent in sequence (50 to 100), pings from A, though the -I eth0 switch is specified, begin to appear on both A-eth0 and A-eth1. This behaviour is unexpected, but does not affect the return path of the ping response packet.<br />
<br />
<br />
In order to get symmetric behaviour, where a packet leaves A-eth0, travels via the switch bridge to B-eth1 and returns from B-eth1 to A-eth0, one must ensure that the routing table of B contains no eth0 entries.<br />
<br />
== Paging and the OOM-Killer ==<br />
Due to the nature of experiments run on the cluster, we often induce heavy paging and complete exhaustion of available memory on certain nodes. Linux has a pair of strategies to deal with heavy memory use. The first is overcommitting: a process is allowed to allocate or fork even when there is no more memory available. You can see some interesting numbers here: [http://opsmonkey.blogspot.com/2007/01/linux-memory-overcommit.html]. The assumption is that processes may not use all the memory they allocate, and that failing on allocation is worse than failing later, when the memory is actually used. More processes can be supported by allowing them to allocate memory (provided they do not use it all). The second part of the strategy is the Out-of-Memory killer (OOM killer). When memory has been exhausted and a process tries to use some 'overcommitted' part of memory, the OOM killer is invoked. Its job is to rank all processes in terms of their memory use, priority, privilege and some other parameters, and then select a process to kill based on those ranks.<br />
<br />
The argument for using overcommit plus the OOM killer is that rather than failing to allocate memory for some random unlucky process, which as a result would probably terminate, the kernel can instead allow the unlucky process to continue executing and then make a somewhat-informed decision on which process to kill. Unfortunately, the behaviour of the OOM killer sometimes causes problems which grind the machine to a complete halt, particularly when it decides to kill system processes. There is a good discussion of the OOM killer here: [http://lwn.net/Articles/104179/]<br />
<br />
For this reason overcommit has been disabled on the cluster.<br />
cat /proc/sys/vm/overcommit_memory <br />
2<br />
cat /proc/sys/vm/overcommit_ratio <br />
100<br />
<br />
To restore the default overcommit behaviour:<br />
# echo 0 > /proc/sys/vm/overcommit_memory<br />
# echo 50 > /proc/sys/vm/overcommit_ratio<br />
<br />
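With overcommit_memory set to 2, the kernel caps allocations at CommitLimit = swap + ram &times; overcommit_ratio / 100 (this formula is documented in proc(5), not stated on this page). A small sketch of the arithmetic:<br />
<br />
```shell
# Sketch of the kernel's CommitLimit arithmetic under overcommit_memory=2
# (integer KB, as reported in /proc/meminfo):
#   CommitLimit = swap + ram * overcommit_ratio / 100
commit_limit_kb() {
    ram_kb=$1; swap_kb=$2; ratio=$3
    echo $(( swap_kb + ram_kb * ratio / 100 ))
}

# e.g. a node with 1 GB RAM, 1 GB swap, and the cluster's ratio of 100:
commit_limit_kb 1048576 1048576 100   # -> 2097152 KB (2 GB)
```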
== Manually Limit the Memory on the OS level ==<br />
<br />
as root edit /etc/default/grub<br />
GRUB_CMDLINE_LINUX_DEFAULT="quiet mem=128M"<br />
then run the command<br />
update-grub<br />
reboot</div>Davepchttps://hcl.ucd.ie/wiki/index.php?title=Subversion&diff=712Subversion2012-06-15T09:55:40Z<p>Davepc: /* RapidSVN, Gforge &amp; passwords */</p>
<hr />
<div>http://svnbook.red-bean.com/ <br />
<br />
*Subversion clients work with <code>.svn</code> directories - don't remove them. <br />
*Mind the version of the client (currently, 1.5, 1.6).<br />
<br />
== Repositories ==<br />
*http://gforge.ucd.ie/softwaremap/tag_cloud.php?tag=heterogeneous+computing<br />
<br />
== To submit ==<br />
<br />
*Software sources: models, code, resource files <br />
*Documentation sources: texts, diagrams, data <br />
*Configuration files <br />
*Test sources: code, input data<br />
<br />
== Not to submit ==<br />
<br />
*Binaries: object files, libraries, executables <br />
*Built documentation: html, pdf <br />
*Personal settings: Eclipse projects, ... <br />
*Test output<br />
<br />
= Subversion for Users =<br />
<br />
A good cross platform client: [http://www.rapidsvn.org/index.php/Documentation RapidSVN], combined with [http://meldmerge.org/ Meld] a visual diff and merge tool.<br />
<br />
== RapidSVN, Gforge &amp; passwords ==<br />
<br />
'''Problem:''' RapidSVN doesn't directly support svn over ssh and so doesn't remember ssh passwords. And gforge.ucd.ie appears not to support passwordless public-key authentication. <br />
<br />
'''Solution:''' Use sshpass to remember password. <br />
<br />
Note: this method involves having your gforge password in plain text, and so is a potential security risk - it should be different to other passwords etc. <br />
<br />
Install sshpass &gt;= 1.05 (note: Ubuntu 11.10 uses version 1.04, which just hangs, so install from source or use Ubuntu 12.04) <br />
<br />
Edit ~/.subversion/config; in the [tunnels] section, add the line: <br />
<br />
gforge = sshpass -f{path to file holding password} ssh -o PubkeyAuthentication=no -o ControlMaster=no <br />
<br />
Then check out with: <br />
<br />
svn checkout svn+gforge://&lt;user&gt;@gforge.ucd.ie/var/lib/gforge/chroot/scmrepos/svn/fupermod/trunk fupermod <br />
<br />
(where previously it was:&nbsp;svn checkout svn+ssh) <br />
<br />
To change an existing working copy <br />
<br />
svn switch --relocate svn+ssh://&lt;user&gt;@gforge.ucd.ie/&lt;old url&gt; svn+gforge://&lt;user&gt;@gforge.ucd.ie/&lt;new url&gt;</div>Davepchttps://hcl.ucd.ie/wiki/index.php?title=Bash_Scripts&diff=711Bash Scripts2012-05-30T14:06:35Z<p>Davepc: </p>
<hr />
<div>A collection of useful bash scripts here.<br />
<br />
Open in vim all .c files containing a string:<br />
 vim -p `grep STRING *.[c]| cut -f1 -d ":"| uniq`<br />
Equivalently, since grep -l prints each matching filename only once:<br />
 vim -p `grep -l STRING *.[c]`</div>Davepchttps://hcl.ucd.ie/wiki/index.php?title=User:Davepc&diff=710User:Davepc2012-05-30T14:05:30Z<p>Davepc: Created page with "[http://hcl.ucd.ie/user/david-clarke David Clarke] - HCL PhD student."</p>
<hr />
<div>[http://hcl.ucd.ie/user/david-clarke David Clarke] - HCL PhD student.</div>Davepchttps://hcl.ucd.ie/wiki/index.php?title=Talk:Pgfplot&diff=709Talk:Pgfplot2012-05-30T14:02:19Z<p>Davepc: Created page with "A link to further documentation would be nice here. Possibly a 3rd party site with good examples?"</p>
<hr />
<div>A link to further documentation would be nice here. Possibly a 3rd party site with good examples?</div>Davepc