HCL cluster/hcl node install configuration log


HCL Nodes will be installed from a clone of a root node, hcl07. The general installation of the root is documented here. There are a number of complications as a result of the cloning process. Solutions to these complications are also explained.

See also HCL_cluster/heterogeneous.ucd.ie_install_log

General Installation

Partition the disk with a 1GB swap partition at the end of the disk; this size equals the largest amount of memory installed on any cluster node. The root file system occupies the remainder of the disk and is formatted as ext4.

Install long list of packages.

Networking

Configure network interface as follows:

# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5).

# The loopback network interface
auto lo eth0 eth1

iface lo inet loopback

# The primary network interface
allow-hotplug eth0
iface eth0 inet dhcp

allow-hotplug eth1
iface eth1 inet dhcp

Routing Tables

Add an executable script named 00routes to the directory /etc/network/if-up.d. This script is called after the interfaces listed as auto are brought up on boot (or when the networking service is restarted). Note: the DHCP process for eth1 is backgrounded so that the startup of other services can continue. Our routing script adds routes on this interface even though it is not yet fully up. The script outputs some errors, but the routing entries remain nonetheless. The script should read as follows:

#!/bin/sh

# Static Routes

# route ganglia broadcast t
route add -host 239.2.11.72 dev eth0

# all traffic to heterogeneous gate goes through eth0
route add -host 192.168.20.254 dev eth0

# all subnet traffic goes through specific interface
route add -net 192.168.20.0 netmask 255.255.255.0 dev eth0
route add -net 192.168.21.0 netmask 255.255.255.0 dev eth1

The naming of the script is important. The scripts in the /etc/network/if-up.d directory are executed in alphabetical order, and we want our routes in place before any of the others run.
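The ordering can be checked with a small helper. This is a minimal sketch, assuming the hooks are executed through run-parts, which documents lexical ordering in the C/POSIX locale; the function name hook_order is hypothetical and used only for illustration.

```shell
#!/bin/sh
# hook_order DIR: print the names of the regular files in DIR, one per line,
# sorted in C/POSIX lexical order (the order run-parts documents).
# hook_order is a hypothetical helper name, not part of any package.
hook_order() {
    for f in "$1"/*; do
        # Skip subdirectories and the unexpanded glob of an empty directory.
        [ -f "$f" ] && printf '%s\n' "${f##*/}"
    done | LC_ALL=C sort
}
```

On a node, `hook_order /etc/network/if-up.d | head -n 1` should name 00routes, provided no other hook name sorts earlier. `run-parts --test /etc/network/if-up.d` lists the scripts that would actually be run, without running them.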

Hosts

Change the hosts file so that it does not list the node's hostname, otherwise this would confuse nodes that are cloned from this image.

127.0.0.1       localhost

# The following lines are desirable for IPv6 capable hosts
::1     localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
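Before cloning, it may be worth verifying that the hostname really is absent from the hosts file. The following is a sketch only; the function name hosts_lists_name is hypothetical, and it checks the name columns of each line while ignoring comments.

```shell
#!/bin/sh
# hosts_lists_name FILE NAME: succeed (exit 0) if NAME appears as a host name
# or alias on any line of FILE, fail (exit 1) otherwise.
# hosts_lists_name is a hypothetical helper name used for illustration.
hosts_lists_name() {
    awk -v name="$2" '
        { sub(/#.*/, "") }                                   # drop comments
        { for (i = 2; i <= NF; i++) if ($i == name) found = 1 }  # field 1 is the address
        END { exit found ? 0 : 1 }
    ' "$1"
}
```

For example, `hosts_lists_name /etc/hosts hcl07 && echo "hcl07 is still listed"` prints the warning only if the root node's name is still present.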

Ganglia

Install the ganglia-monitor package.

Configure ganglia monitor by editing /etc/ganglia/gmond.conf so that it contains:

cluster {
  name = "HCL Cluster"
  owner = "University College Dublin"
  latlong = "unspecified"
  url = "http://hcl.ucd.ie/"
}

And ...

/* Feel free to specify as many udp_send_channels as you like.  Gmond
   used to only support having a single channel */
udp_send_channel {
  mcast_join = 239.2.11.72
  port = 8649
  ttl = 1
}

/* You can specify as many udp_recv_channels as you like as well. */
udp_recv_channel {
  mcast_join = 239.2.11.72
  port = 8649
  bind = 239.2.11.72
}

Once all packages have finished installing, execute:

service ganglia-monitor restart

NIS Client

Install the nis package. Set /etc/defaultdomain to contain heterogeneous.ucd.ie. Make sure the NIS server has an entry in /etc/hosts: DNS may not be active when the NIS client starts, and we want to ensure that it connects to the server successfully.

192.168.21.254   heterogeneous.ucd.ie heterogeneous

Make sure the file /etc/nsswitch.conf contains:

passwd:         compat
group:          compat
shadow:         compat

Append to the file /etc/passwd the line +::::::

Append to the file /etc/group the line +:::

Append to the file /etc/shadow the line +::::::::
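Because these lines must appear exactly once, a helper that skips the append when the line is already present can make the step safe to repeat. This is a sketch under that assumption; append_once is a hypothetical name, and the target file is assumed to end with a newline.

```shell
#!/bin/sh
# append_once FILE LINE: append LINE to FILE unless an identical line exists.
# append_once is a hypothetical helper name used for illustration.
append_once() {
    file=$1 line=$2
    # -x matches whole lines only, -F treats LINE as a fixed string, not a regex.
    grep -qxF -e "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Intended use, as root:
#   append_once /etc/passwd '+::::::'
#   append_once /etc/group  '+:::'
#   append_once /etc/shadow '+::::::::'
```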

Edit /etc/yp.conf (this was not needed before; it is now, with the RAID server):

ypserver 192.168.20.254

Start the nis service:

service nis start

Check that nis is operating correctly by running the following command:

ypcat passwd

NFS

apt-get install nfs-common portmap

Add the following line to /etc/fstab:

192.168.20.254:/home	/home			nfs	soft,retrans=6		0 0

Set the following in /etc/default/nfs-common (this was not needed before; it is now, with the RAID server):

NEED_IDMAPD=yes

Then:

service nfs-common restart
mount /home

Torque PBS

First install PBS on the headnode, as explained here. Then:

./torque-package-mom-linux-i686.sh --install
./torque-package-clients-linux-i686.sh --install
update-rc.d pbs_mom defaults
service pbs_mom start

NTP

Install NTP software:

apt-get install ntp

Edit configuration and make server heterogeneous.ucd.ie the sole server entry. Comment out any other servers. Restart the NTP service.
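The edit can be scripted roughly as follows. This is a sketch, assuming the configuration file is /etc/ntp.conf (the usual location for the Debian ntp package) and that upstream servers are declared with lines beginning with "server"; the function name set_sole_ntp_server is hypothetical.

```shell
#!/bin/sh
# set_sole_ntp_server CONF SERVER: comment out every active "server" line in
# CONF and append a single "server SERVER" line.
# set_sole_ntp_server is a hypothetical helper name used for illustration.
set_sole_ntp_server() {
    conf=$1 server=$2
    tmp=$(mktemp) || return 1
    # Prefix each active server line with '#'; other lines pass through unchanged.
    sed 's/^[[:space:]]*server[[:space:]]/#&/' "$conf" > "$tmp" || return 1
    printf 'server %s\n' "$server" >> "$tmp"
    # Write back with cat so the original file keeps its owner and mode.
    cat "$tmp" > "$conf" && rm -f "$tmp"
}

# Intended use, as root (path is an assumption):
#   set_sole_ntp_server /etc/ntp.conf heterogeneous.ucd.ie
#   service ntp restart
```

Running it a second time comments out the previously appended line and appends it again, so there is still only one active server entry, at the cost of an extra commented line.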

Complications

Hostnames

Debian does not pull the hostname from the DHCP server. Without intervention, cloned nodes will keep the hostname stored on the image of the root node. A bug report covering the setting of a hostname via DHCP can be found here.

The solution we will use is to add the file /etc/dhcp3/dhclient-exit-hooks.d/hostname with the contents:

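# Sourced by dhclient-script after an interface has been configured.
# $new_host_name holds the hostname offered by the DHCP server, if any.
if [ -n "$new_host_name" ]; then
  echo "$new_host_name" > /etc/hostname
  /bin/hostname "$new_host_name"
fi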

The effect of this is to set the hostname of the machine after an interface is configured using dhclient (DHCP client). Note that the hostname of the machine will be set by the last interface that is configured via DHCP; in the current configuration that will be eth0. If an interface is reconfigured using dhclient, the hostname will be reset to the name belonging to that interface.

Further, the current hostnames for the second interface (eth1) on the nodes are invalid. They follow the format hcl??_eth1.ucd.ie; however, the '_' character is not permitted in hostnames, and attempting to set such a hostname fails.
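The character restriction can be illustrated with a small check. This is a sketch only: valid_hostname is a hypothetical name, the pattern covers the usual letters, digits and hyphen rule for each dot-separated label, and it does not check length limits.

```shell
#!/bin/sh
# valid_hostname NAME: succeed if every dot-separated label of NAME consists
# of letters, digits and hyphens and neither starts nor ends with a hyphen.
# valid_hostname is a hypothetical helper name used for illustration.
valid_hostname() {
    label='[A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?'
    printf '%s\n' "$1" | grep -Eq "^${label}(\.${label})*\$"
}
```

With this check, a name such as hcl01_eth1.ucd.ie is rejected because of the underscore, while a hyphenated variant such as hcl01-eth1.ucd.ie (a hypothetical replacement, not a name in use on the cluster) is accepted.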

udev and Network Interfaces

The udev system attempts to keep network interface names consistent regardless of changing hardware. This may be useful for laptops with wireless cards that are plugged in and out, but it causes problems when trying to install our root node image across all machines in the cluster. A description of the problem can be read here.

The solution is to remove the udev rules for persistent network interfaces and to disable the generator script for these rules. On the root cloning node, do the following:

  1. remove the file /etc/udev/rules.d/70-persistent-net.rules
  2. add to the top of the file /lib/udev/rules.d/75-persistent-net-generator.rules the following lines:
# skip generation of persistent network interfaces
ACTION=="*",                            GOTO="persistent_net_generator_end"
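The second step can be scripted as below. This is a sketch under the assumption that the rules file is edited in place as root; disable_net_generator is a hypothetical name, and the guard on the first line is our addition so that repeating the step does not insert the lines twice.

```shell
#!/bin/sh
# disable_net_generator RULES: insert the two "skip" lines at the top of the
# generator rules file RULES, unless they are already there.
# disable_net_generator is a hypothetical helper name used for illustration.
disable_net_generator() {
    rules=$1
    marker='# skip generation of persistent network interfaces'
    # Nothing to do if the marker comment is already the first line.
    [ "$(head -n 1 "$rules")" = "$marker" ] && return 0
    tmp=$(mktemp) || return 1
    {
        printf '%s\n' "$marker"
        printf '%s\n' 'ACTION=="*",                            GOTO="persistent_net_generator_end"'
        cat "$rules"
    } > "$tmp" && cat "$tmp" > "$rules"
    status=$?
    rm -f "$tmp"
    return "$status"
}

# Intended use, as root:
#   rm -f /etc/udev/rules.d/70-persistent-net.rules
#   disable_net_generator /lib/udev/rules.d/75-persistent-net-generator.rules
```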

Sysstat

Sysstat is a useful suite of tools for measuring the performance of different system components. Unfortunately, it adds some unneeded cron entries for collecting a historic set of system performance data. Though these cron entries point to disabled scripts, we will disable them nonetheless. Edit the file /etc/cron.d/sysstat and comment out all lines:

# The first element of the path is a directory where the debian-sa1
# script is located
#PATH=/usr/lib/sysstat:/usr/sbin:/usr/sbin:/usr/bin:/sbin:/bin

# Activity reports every 10 minutes everyday
#5-55/10 * * * * root command -v debian-sa1 > /dev/null && debian-sa1 1 1

# Additional run at 23:59 to rotate the statistics file
#59 23 * * * root command -v debian-sa1 > /dev/null && debian-sa1 60 2
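Commenting out every line can also be done non-interactively. The following is a minimal sketch; comment_out_all is a hypothetical name, and the sed expression simply prefixes '#' to every non-empty line that does not already start with one.

```shell
#!/bin/sh
# comment_out_all FILE: prefix '#' to every non-empty line of FILE that does
# not already begin with '#'. Blank lines and existing comments are untouched.
# comment_out_all is a hypothetical helper name used for illustration.
comment_out_all() {
    file=$1
    tmp=$(mktemp) || return 1
    sed 's/^[^#]/#&/' "$file" > "$tmp" && cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return "$status"
}

# Intended use, as root:
#   comment_out_all /etc/cron.d/sysstat
```

Because already-commented lines are skipped, running it more than once leaves the file unchanged after the first pass.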

Upgrade September 2013

Upgrade motivated by needing a kernel newer than 2.6.35 for PAPI

Before:

 uname -a
 Linux hcl03.heterogeneous.ucd.ie 2.6.32-5-686 #1 SMP Wed Jan 12 04:01:41 UTC 2011 i686 GNU/Linux

After:

 Linux hcl03.heterogeneous.ucd.ie 3.2.0-4-686-pae #1 SMP Debian 3.2.46-1+deb7u1 i686 GNU/Linux

Nodes upgraded (11): hcl01 hcl03 hcl05 hcl06 hcl10 hcl11 hcl12 hcl13 hcl14 hcl15 hcl16

Nodes not upgraded because unbootable before upgrade (5): hcl02 hcl04 hcl07 hcl08 hcl09


 vi  /etc/apt/sources.list

replace squeeze with wheezy
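When many nodes are involved, the same substitution can be made without an editor. This is a sketch, assuming the release names are plain words with no characters special to sed; retarget_release is a hypothetical name.

```shell
#!/bin/sh
# retarget_release FILE OLD NEW: replace every occurrence of the release name
# OLD with NEW in the APT sources file FILE.
# retarget_release is a hypothetical helper name used for illustration.
retarget_release() {
    file=$1 old=$2 new=$3
    tmp=$(mktemp) || return 1
    # Suites such as "squeeze/updates" are rewritten too, since the match
    # is on the release name wherever it appears in a line.
    sed "s/${old}/${new}/g" "$file" > "$tmp" && cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return "$status"
}

# Intended use, as root:
#   retarget_release /etc/apt/sources.list squeeze wheezy
```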

In parallel on all nodes, using Cluster SSH (cssh):

 apt-get update
 apt-get dist-upgrade
 reboot

Had to manually boot hcl12 and hcl13 because of a temperature warning.

 apt-get autoremove

NFS was not starting on boot, so the following was added to /etc/rc.local:

 mount /home

Changed /etc/fstab line to:

 192.168.20.254:/home  /home  nfs  rw,rsize=4096,wsize=4096,hard,intr  0 0

Questions:

 Q:Configuration file `/etc/pam.d/sshd' modified (by you or by a script) since installation.
 A:keep your currently-installed version
 Q:A new version of configuration file /etc/default/nfs-common is available
 A:Keep installed version
 
 Q:A new version of configuration file /etc/default/grub is available
 A:Install the package maintainer's version

Problems

Every package installation produced the following warnings:

 ldconfig: Can't link /opt/lib//opt/lib/libf77blas.so to libf77blas.so
 ldconfig: Can't link /opt/lib//opt/lib/libcblas.so to libcblas.so
 ldconfig: Can't link /opt/lib//opt/lib/libatlas.so to libatlas.so

Installation of PAPI - failed

get papi-5.2.0.tar.gz

 tar -xzvf papi-5.2.0.tar.gz
 cd papi-5.2.0/src/
 ./configure --prefix=/usr/local
 make
 make test

The make test step failed. The hardware is too old and the kernel would need to be patched. Abandoned installing PAPI on HCL for now.

Installation of Extrae

 apt-get install 
 apt-get install libunwind7-dev
 apt-get install binutils-dev

get extrae-2.4.0.tar.bz2

 tar -xjvf extrae-2.4.0.tar.bz2
 cd extrae-2.4.0
 ./configure --with-mpi=/usr/lib/openmpi --with-unwind=/usr --without-dyninst --without-papi --prefix=/usr/local

Note: this configuration turns off many useful features of Extrae because of missing packages.

 make 
 make install