HCL cluster/hcl node install configuration log
HCL Nodes will be installed from a clone of a root node, hcl07. The general installation of the root is documented here. There are a number of complications as a result of the cloning process. Solutions to these complications are also explained.
See also HCL_cluster/heterogeneous.ucd.ie_install_log
General Installation
Partition the disk with a 1GB swap partition at the end of the disk; this size equals the maximum installed memory on the cluster nodes. The root file system occupies the remainder of the disk and is formatted as EXT4.
Install long list of packages.
Networking
Configure network interface as follows:
# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5).
# The loopback network interface
auto lo eth0 eth1
iface lo inet loopback
# The primary network interface
allow-hotplug eth0
iface eth0 inet dhcp
allow-hotplug eth1
iface eth1 inet dhcp
Routing Tables
Add an executable script named 00routes to the directory /etc/network/if-up.d. This script is called after the interfaces listed as auto are brought up on boot (or on a networking service restart). Note: the DHCP process for eth1 is backgrounded so that the startup of other services can continue. Our routing script adds routes on this interface even though it is not yet fully up. The script outputs some errors, but the routing entries remain nonetheless. The script should read as follows:
#!/bin/sh
# Static Routes
# route ganglia broadcast t
route add -host 239.2.11.72 dev eth0
# all traffic to heterogeneous gate goes through eth0
route add -host 192.168.20.254 dev eth0
# all subnet traffic goes through specific interface
route add -net 192.168.20.0 netmask 255.255.255.0 dev eth0
route add -net 192.168.21.0 netmask 255.255.255.0 dev eth1
The naming of the script is important. We want our routes in place before the other scripts in the /etc/network/if-up.d directory are executed, and the scripts are executed in alphabetical order.
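The ordering argument can be checked with a small sketch. It assumes the hooks run in plain lexical order of their file names, as described above. The helper name first_hook and the other file names in the demonstration are hypothetical.

```shell
#!/bin/sh
# Print the name of the hook that would run first, assuming the scripts in
# the directory are executed in lexical (C locale) order of their names.
first_hook() {
    # $1: directory holding the hook scripts
    ls -1 "$1" | LC_ALL=C sort | head -n 1
}

# Demonstration on a scratch directory rather than /etc/network/if-up.d.
# The names other than 00routes are only examples.
demo_dir=$(mktemp -d)
touch "$demo_dir/00routes" "$demo_dir/mountnfs" "$demo_dir/openssh-server"
first_hook "$demo_dir"
rm -rf "$demo_dir"
```

On a node, calling first_hook /etc/network/if-up.d should report 00routes if the naming has the intended effect.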
Hosts
Change the hosts file so that it does not list the node's hostname, otherwise this would confuse nodes that are cloned from this image.
127.0.0.1       localhost
# The following lines are desirable for IPv6 capable hosts
::1     localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
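Before cloning, it may be worth confirming that the root node's hostname really is absent from the hosts file. The following is a minimal sketch; the function name hosts_mentions is hypothetical and it only inspects non-comment lines.

```shell
#!/bin/sh
# Succeed (exit 0) if the hosts file names the given host on any
# non-comment line; fail (exit 1) otherwise.
hosts_mentions() {
    # $1: path to a hosts file, $2: hostname to look for
    awk -v name="$2" '
        /^[[:space:]]*#/ { next }
        { for (i = 2; i <= NF; i++) if ($i == name) found = 1 }
        END { exit (found ? 0 : 1) }
    ' "$1"
}

# Example use on the root node (read-only check):
#   if hosts_mentions /etc/hosts "$(hostname)"; then
#       echo "hostname still listed in /etc/hosts"
#   fi
```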
Ganglia
Install the ganglia-monitor package.
Configure ganglia monitor by editing /etc/ganglia/gmond.conf so that it contains:
cluster {
  name = "HCL Cluster"
  owner = "University College Dublin"
  latlong = "unspecified"
  url = "http://hcl.ucd.ie/"
}
And ...
/* Feel free to specify as many udp_send_channels as you like.  Gmond
   used to only support having a single channel */
udp_send_channel {
  mcast_join = 239.2.11.72
  port = 8649
  ttl = 1
}
/* You can specify as many udp_recv_channels as you like as well. */
udp_recv_channel {
  mcast_join = 239.2.11.72
  port = 8649
  bind = 239.2.11.72
}
After all packages are complete execute:
service ganglia-monitor restart
NIS Client
Install nis package.
Set /etc/defaultdomain to contain heterogeneous.ucd.ie
Make sure the NIS server has an entry in /etc/hosts. DNS may not be active when the NIS client is starting, and we want to ensure that it connects to the server successfully.
192.168.21.254 heterogeneous.ucd.ie heterogeneous
Make sure the file /etc/nsswitch.conf contains:
passwd: compat
group: compat
shadow: compat
Append to the file /etc/passwd the line +::::::
Append to the file /etc/group the line +:::
Append to the file /etc/shadow the line +::::::::
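Because the root image may be prepared more than once, appending these lines blindly could duplicate them. A minimal sketch of an idempotent append follows; the function name append_once is hypothetical, and the three calls in the comment are the same lines listed above.

```shell
#!/bin/sh
# Append a line to a file only if that exact line is not already present.
append_once() {
    # $1: file to modify, $2: exact line that must be present
    grep -qxF -e "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}

# On the root node (as root) this would be used as:
#   append_once /etc/passwd '+::::::'
#   append_once /etc/group  '+:::'
#   append_once /etc/shadow '+::::::::'
```

The -x and -F options make grep compare whole lines literally, so the '+' and ':' characters are not treated as pattern syntax.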
Edit /etc/yp.conf (this wasn't needed before; it is now with the raid server)
ypserver 192.168.20.254
Start the nis service:
service nis start
Check that nis is operating correctly by running the following command:
ypcat passwd
NFS
apt-get install nfs-common portmap
Add the line to /etc/fstab
192.168.20.254:/home /home nfs soft,retrans=6 0 0
Set in /etc/default/nfs-common (this wasn't needed before; it is now with the raid server)
NEED_IDMAPD=yes
Then:
service nfs-common restart
mount /home
Torque PBS
First install PBS on the headnode, as explained here. Then:
./torque-package-mom-linux-i686.sh --install
./torque-package-clients-linux-i686.sh --install
update-rc.d pbs_mom defaults
service pbs_mom start
NTP
Install NTP software:
apt-get install ntp
Edit configuration and make server heterogeneous.ucd.ie the sole server entry. Comment out any other servers. Restart the NTP service.
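The edit can be sketched as a small script. This is only an illustration: the function name set_sole_ntp_server is hypothetical, and the path /etc/ntp.conf in the usage comment is an assumption, since the configuration file is not named above.

```shell
#!/bin/sh
# Comment out every active "server" line in an NTP configuration file and
# append a single server entry, so that it is the only active one.
set_sole_ntp_server() {
    # $1: NTP configuration file, $2: server name to keep
    tmp=$(mktemp) || return 1
    {
        # Disable all existing server entries by prefixing '#'.
        sed 's/^\([[:space:]]*server[[:space:]]\)/#\1/' "$1"
        # Add the one wanted entry at the end of the file.
        printf 'server %s\n' "$2"
    } > "$tmp" && cat "$tmp" > "$1"
    status=$?
    rm -f "$tmp"
    return $status
}

# Assumed usage on a node (as root), followed by a service restart:
#   set_sole_ntp_server /etc/ntp.conf heterogeneous.ucd.ie
#   service ntp restart
```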
Complications
Hostnames
Debian does not pull the hostname from the DHCP server. Without intervention, cloned nodes will keep the hostname stored on the image of the root node. A bug report about setting the hostname via DHCP is described here.
The solution we will use is to add the file /etc/dhcp3/dhclient-exit-hooks.d/hostname with the contents:
if [[ -n $new_host_name ]]; then
  echo "$new_host_name" > /etc/hostname
  /bin/hostname "$new_host_name"
fi
The effect of this is to set the hostname of the machine after an interface is configured using dhclient (DHCP Client). Note that the hostname of the machine will be set by the last interface that is configured via DHCP; in the current configuration that will be eth0. If an interface is reconfigured using dhclient, the hostname will be reset to the name belonging to that interface.
Further, the current hostnames for the second interface (eth1) on the nodes are invalid. They follow the format hcl??_eth1.ucd.ie; however, the '_' character is not permitted in hostnames, and attempting to set such a hostname fails.
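The character restriction can be illustrated with a simplified check. This is a sketch, not part of the installed hook: the function name is_valid_hostname is hypothetical, and it only tests the allowed character set and the position of dots and hyphens.

```shell
#!/bin/sh
# Succeed if the name uses only letters, digits, '.' and '-', has no empty
# labels, and no label starts or ends with '-'. Fail otherwise.
is_valid_hostname() {
    case $1 in
        ''|*[!A-Za-z0-9.-]*) return 1 ;;  # empty, or a disallowed character such as '_'
        .*|*.|*..*)          return 1 ;;  # leading, trailing or doubled dot
        -*|*-|*.-*|*-.*)     return 1 ;;  # label beginning or ending with '-'
    esac
    return 0
}

# The eth1 naming scheme fails the check because of the underscore:
if is_valid_hostname "hcl01_eth1.ucd.ie"; then
    echo "accepted"
else
    echo "rejected"
fi
```

A guard of this kind could be placed in the exit hook so that an invalid name from eth1 is skipped instead of producing an error, though that is a design suggestion and not the current configuration.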
udev and Network Interfaces
The udev system attempts to keep network interface names consistent regardless of changing hardware. This may be useful for laptops with wireless cards that are plugged in and out, but it causes problems when trying to install our root node image across all machines in the cluster. A description of the problem can be read here.
The solution is to remove the udev rules for persistent network interfaces, and disable the generator script for these rules. On the root cloning node do the following:
- remove the file /etc/udev/rules.d/70-persistent-net.rules
- add to the top of the file /lib/udev/rules.d/75-persistent-net-generator.rules the following lines:
# skip generation of persistent network interfaces
ACTION=="*",                            GOTO="persistent_net_generator_end"
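The two steps can be scripted so that they are safe to repeat. The following is a minimal sketch under the assumption that the two lines above are exactly what should be prepended; the function name disable_net_generator is hypothetical.

```shell
#!/bin/sh
# Prepend the "skip" rule to a udev generator rules file, unless the marker
# comment is already the first line of that file.
disable_net_generator() {
    # $1: path to the generator rules file
    rules=$1
    marker='# skip generation of persistent network interfaces'
    [ "$(head -n 1 "$rules")" = "$marker" ] && return 0
    tmp=$(mktemp) || return 1
    {
        printf '%s\n' "$marker"
        printf '%s\n' 'ACTION=="*",                            GOTO="persistent_net_generator_end"'
        cat "$rules"
    } > "$tmp" && cat "$tmp" > "$rules"   # rewrite in place, keeping file mode
    status=$?
    rm -f "$tmp"
    return $status
}

# On the root cloning node (as root) the steps would then be:
#   rm -f /etc/udev/rules.d/70-persistent-net.rules
#   disable_net_generator /lib/udev/rules.d/75-persistent-net-generator.rules
```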
Sysstat
Sysstat is a useful suite of tools for measuring the performance of different system components. Unfortunately, it adds some unhelpful cron entries for collecting a historic set of system performance data. Though these cron entries point to disabled scripts, we will disable them nonetheless. Edit the file /etc/cron.d/sysstat and comment out all lines:
# The first element of the path is a directory where the debian-sa1
# script is located
#PATH=/usr/lib/sysstat:/usr/sbin:/usr/sbin:/usr/bin:/sbin:/bin
# Activity reports every 10 minutes everyday
#5-55/10 * * * * root command -v debian-sa1 > /dev/null && debian-sa1 1 1
# Additional run at 23:59 to rotate the statistics file
#59 23 * * * root command -v debian-sa1 > /dev/null && debian-sa1 60 2
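Commenting out every line can also be done non-interactively. The sketch below is one way to do it; the function name comment_out_all is hypothetical, and it leaves blank lines and existing comments untouched.

```shell
#!/bin/sh
# Prefix '#' to every non-blank line of a file that is not already a comment.
comment_out_all() {
    # $1: file to edit in place
    tmp=$(mktemp) || return 1
    sed 's/^\([^#]\)/#\1/' "$1" > "$tmp" && cat "$tmp" > "$1"
    status=$?
    rm -f "$tmp"
    return $status
}

# On a node (as root):
#   comment_out_all /etc/cron.d/sysstat
```

Running it a second time changes nothing, because every remaining non-blank line already starts with '#'.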
Upgrade September 2013
The upgrade was motivated by needing a kernel newer than 2.6.35 for PAPI.
Before:
uname -a
Linux hcl03.heterogeneous.ucd.ie 2.6.32-5-686 #1 SMP Wed Jan 12 04:01:41 UTC 2011 i686 GNU/Linux
After:
Linux hcl03.heterogeneous.ucd.ie 3.2.0-4-686-pae #1 SMP Debian 3.2.46-1+deb7u1 i686 GNU/Linux
Nodes upgraded (11): hcl01 hcl03 hcl05 hcl06 hcl10 hcl11 hcl12 hcl13 hcl14 hcl15 hcl16
Nodes not upgraded because unbootable before upgrade (5): hcl02 hcl04 hcl07 hcl08 hcl09
vi /etc/apt/sources.list
replace squeeze with wheezy
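When many nodes are involved, the same replacement can be done without an editor. The following is a minimal sketch; the function name retarget_release is hypothetical, and it performs a plain text substitution, so suite names such as squeeze/updates or squeeze-updates are rewritten as well.

```shell
#!/bin/sh
# Replace every occurrence of one release codename with another in an
# APT sources file.
retarget_release() {
    # $1: sources file, $2: old codename, $3: new codename
    tmp=$(mktemp) || return 1
    sed "s/$2/$3/g" "$1" > "$tmp" && cat "$tmp" > "$1"
    status=$?
    rm -f "$tmp"
    return $status
}

# On each node (as root):
#   retarget_release /etc/apt/sources.list squeeze wheezy
```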
In parallel on all nodes with Cluster SSH (cssh):
apt-get update
apt-get dist-upgrade
reboot
Had to manually boot hcl12 and hcl13 because of a temperature warning.
apt-get autoremove
NFS not starting on boot, so added to /etc/rc.local
mount /home
Changed /etc/fstab line to:
192.168.20.254:/home /home nfs rw,rsize=4096,wsize=4096,hard,intr 0 0
Questions:
Q: Configuration file `/etc/pam.d/sshd' modified (by you or by a script) since installation.
A: Keep your currently-installed version.
Q: A new version of configuration file /etc/default/nfs-common is available.
A: Keep installed version.
Q: A new version of configuration file /etc/default/grub is available.
A: Install the package maintainer's version.
Problems
On every package getting the following warnings:
ldconfig: Can't link /opt/lib//opt/lib/libf77blas.so to libf77blas.so
ldconfig: Can't link /opt/lib//opt/lib/libcblas.so to libcblas.so
ldconfig: Can't link /opt/lib//opt/lib/libatlas.so to libatlas.so
Installation of PAPI - failed
get papi-5.2.0.tar.gz
tar -xzvf papi-5.2.0.tar.gz
cd papi-5.2.0/src/
./configure --prefix=/usr/local
make
make test
The make test step failed. The hardware is too old and the kernel would need to be patched. Abandoned installing PAPI on HCL for now.
Installation of Extrae
apt-get install libunwind7-dev
apt-get install binutils-dev
get extrae-2.4.0.tar.bz2
tar -xjvf extrae-2.4.0.tar.bz2
cd extrae-2.4.0
./configure --with-mpi=/usr/lib/openmpi --with-unwind=/usr --without-dyninst --without-papi --prefix=/usr/local
Note: this configuration turns off a lot of good Extrae features because of missing packages.
make
make install
