Difference between revisions of "HCL cluster/hcl node install configuration log"

Latest revision as of 21:33, 26 September 2013

HCL Nodes will be installed from a clone of a root node, hcl07. The general installation of the root is documented here. There are a number of complications as a result of the cloning process. Solutions to these complications are also explained.

See also HCL_cluster/heterogeneous.ucd.ie_install_log

General Installation

Partition the disk with a 1 GB swap partition at the end, equal to the maximum memory installed on any cluster node. The root file system occupies the remainder of the disk and is formatted as EXT4.

Install long list of packages.

Networking

Configure the network interfaces in /etc/network/interfaces as follows:

# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5).

# The loopback network interface
auto lo eth0 eth1

iface lo inet loopback

# The primary network interface
allow-hotplug eth0
iface eth0 inet dhcp

allow-hotplug eth1
iface eth1 inet dhcp

Routing Tables

Add an executable script named 00routes to the directory /etc/network/if-up.d. This script is called after the interfaces listed as auto are brought up on boot (or when the networking service is restarted). Note: the DHCP process for eth1 is backgrounded so that the startup of other services can continue, so our routing script adds routes on this interface before it is fully up. The script prints some errors, but the routing entries remain nonetheless. The script should read as follows:

#!/bin/sh

# Static Routes

# route ganglia broadcast t
route add -host 239.2.11.72 dev eth0

# all traffic to heterogeneous gate goes through eth0
route add -host 192.168.20.254 dev eth0

# all subnet traffic goes through specific interface
route add -net 192.168.20.0 netmask 255.255.255.0 dev eth0
route add -net 192.168.21.0 netmask 255.255.255.0 dev eth1

The naming of the script is important: scripts in the /etc/network/if-up.d directory are executed in alphabetical order, and we want our routes in place before the other scripts run.
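One possible way to avoid the errors printed while eth1 is still coming up is to add only the routes that belong to the interface that triggered the hook. The sketch below is an assumption, not the script used on the cluster: ifupdown exports the interface name to if-up.d scripts in the IFACE variable, and the helper name routes_for_iface is hypothetical. It only prints the route commands, so it can be tried without root.

```shell
#!/bin/sh
# Hypothetical variant of 00routes: choose routes by the interface that just came up.
# ifupdown exports IFACE to scripts in /etc/network/if-up.d.

routes_for_iface() {
    case "$1" in
        eth0)
            echo "route add -host 239.2.11.72 dev eth0"
            echo "route add -host 192.168.20.254 dev eth0"
            echo "route add -net 192.168.20.0 netmask 255.255.255.0 dev eth0"
            ;;
        eth1)
            echo "route add -net 192.168.21.0 netmask 255.255.255.0 dev eth1"
            ;;
    esac
}

# Dry run: print the commands for the current interface (nothing if IFACE is unset).
routes_for_iface "${IFACE:-}"
```

Piping the output of routes_for_iface into sh would apply the routes; whether this ordering suits the other if-up.d scripts on the nodes has not been checked here.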

Hosts

Change the hosts file so that it does not list the node's hostname; otherwise the stale entry would confuse nodes that are cloned from this image.

127.0.0.1       localhost

# The following lines are desirable for IPv6 capable hosts
::1     localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
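Before taking the image, it may be worth confirming that the hosts file really has no entry for the root node. The following is a minimal sketch; the function name hosts_lists_name is hypothetical, and it ignores only whole-line comments.

```shell
#!/bin/sh
# Succeeds (exit 0) if the hosts file $1 mentions hostname $2 on a non-comment line.
hosts_lists_name() {
    grep -v '^[[:space:]]*#' "$1" | grep -qwF -- "$2"
}

# Example (not executed here):
#   hosts_lists_name /etc/hosts hcl07 && echo "remove the hcl07 entry before cloning"
```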

Ganglia

Install the ganglia-monitor package.

Configure ganglia monitor by editing /etc/ganglia/gmond.conf so that it contains:

cluster {
  name = "HCL Cluster"
  owner = "University College Dublin"
  latlong = "unspecified"
  url = "http://hcl.ucd.ie/"
}

And ...

/* Feel free to specify as many udp_send_channels as you like.  Gmond
   used to only support having a single channel */
udp_send_channel {
  mcast_join = 239.2.11.72
  port = 8649
  ttl = 1
}

/* You can specify as many udp_recv_channels as you like as well. */
udp_recv_channel {
  mcast_join = 239.2.11.72
  port = 8649
  bind = 239.2.11.72
}

After all packages are installed, execute:

service ganglia-monitor restart

NIS Client

Install the nis package.

Set /etc/defaultdomain to contain heterogeneous.ucd.ie

Make sure the NIS server has an entry in /etc/hosts. DNS may not be active when the NIS client starts, and we want to ensure that it connects to the server successfully.

192.168.21.254   heterogeneous.ucd.ie heterogeneous

Make sure the file /etc/nsswitch.conf contains:

passwd:         compat
group:          compat
shadow:         compat

Append to the file /etc/passwd the line +::::::

Append to the file /etc/group the line +:::

Append to the file /etc/shadow the line +::::::::
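If these steps are ever repeated on an image that has already been prepared, a plain append would duplicate the compat markers. The sketch below shows one way to append a line only when it is missing; append_once is a hypothetical helper, not part of the nis package.

```shell
#!/bin/sh
# Append line $2 to file $1 unless an identical line is already present.
append_once() {
    grep -qxF -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}

# Example (as root, not executed here):
#   append_once /etc/passwd '+::::::'
#   append_once /etc/group  '+:::'
#   append_once /etc/shadow '+::::::::'
```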

Edit /etc/yp.conf (this was not needed before; it is now, with the RAID server):

ypserver 192.168.20.254

Start the nis service:

service nis start

Check that nis is operating correctly by running the following command:

ypcat passwd

NFS

apt-get install nfs-common portmap

Add the following line to /etc/fstab:

192.168.20.254:/home	/home			nfs	soft,retrans=6		0 0

Set in /etc/default/nfs-common (this was not needed before; it is now, with the RAID server):

NEED_IDMAPD=yes

Then:

service nfs-common restart
mount /home
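To confirm that /home really came from the NFS server rather than the local disk, the mount table can be inspected. This is a small sketch under the assumption that /proc/mounts is available; the function name is_nfs_mounted is hypothetical, and the optional second argument exists only so the check can be tried against a sample file.

```shell
#!/bin/sh
# Succeeds if mount point $1 is listed with an nfs file system type.
# $2 (optional): mount table to read, defaults to /proc/mounts.
is_nfs_mounted() {
    awk -v mp="$1" '$2 == mp && $3 ~ /^nfs/ { found = 1 } END { exit !found }' \
        "${2:-/proc/mounts}"
}

# Example (not executed here):
#   is_nfs_mounted /home || mount /home
```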

Torque PBS

First install PBS on the head node, as explained in New_heterogeneous.ucd.ie_install_log (section Packages_for_nodes). Then:

./torque-package-mom-linux-i686.sh --install
./torque-package-clients-linux-i686.sh --install
update-rc.d pbs_mom defaults
service pbs_mom start

NTP

Install NTP software:

apt-get install ntp

Edit the NTP configuration and make server heterogeneous.ucd.ie the sole server entry, commenting out any other servers. Then restart the NTP service.
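The edit can be expressed as a filter that comments out every existing server line and appends the single desired entry. The sketch below is illustrative only: ntp_single_server is a hypothetical name, and the location of the configuration file (commonly /etc/ntp.conf on Debian) is an assumption.

```shell
#!/bin/sh
# Read an NTP configuration on stdin, comment out all "server" lines,
# and append one server entry for host $1. The result goes to stdout.
ntp_single_server() {
    sed 's/^[[:space:]]*server[[:space:]]/#&/'
    printf 'server %s\n' "$1"
}

# Example (not executed here):
#   ntp_single_server heterogeneous.ucd.ie < /etc/ntp.conf > /tmp/ntp.conf.new
```

Writing to a separate file first leaves the original configuration untouched until the result has been reviewed.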

Complications

Hostnames

Debian does not pull the hostname from the DHCP server. Without intervention, cloned nodes keep the hostname stored in the image of the root node. A bug report about setting the hostname via DHCP is described here.

The solution we will use is to add the file /etc/dhcp3/dhclient-exit-hooks.d/hostname with the contents:

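# Set the hostname supplied by the DHCP server, if any.
# Uses the POSIX test and quoted expansions so it also works under plain sh.
if [ -n "$new_host_name" ]; then
  echo "$new_host_name" > /etc/hostname
  /bin/hostname "$new_host_name"
fi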

The effect of this is to set the hostname of the machine after an interface is configured using dhclient (the DHCP client). Note that the hostname is set by the last interface that is configured via DHCP; in the current configuration that will be eth0. If an interface is reconfigured using dhclient, the hostname will be reset to the name belonging to that interface.

Further, the current hostnames for the second interface (eth1) on the nodes are invalid. They follow the format hcl??_eth1.ucd.ie; however, the '_' character is not permitted in hostnames, and attempting to set such a hostname fails.
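The underscore problem can be caught before a name is applied. The following sketch checks that every dot-separated label contains only letters, digits and hyphens and does not begin or end with a hyphen. The function name is_valid_hostname is hypothetical, and the check deliberately omits length limits.

```shell
#!/bin/sh
# Succeeds if $1 consists of dot-separated labels made of letters, digits
# and hyphens, with no label starting or ending in a hyphen.
is_valid_hostname() {
    printf '%s\n' "$1" | grep -Eq \
        '^[A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?(\.[A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?)*$'
}

# Example (not executed here):
#   is_valid_hostname "$new_host_name" || echo "refusing invalid hostname" >&2
```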

udev and Network Interfaces

The udev system attempts to keep network interface names consistent regardless of changing hardware. This may be useful for laptops with wireless cards that are plugged in and out, but it causes problems when trying to install our root node image across all machines in the cluster. A description of the problem can be read at http://www.ducea.com/2008/09/01/remove-debian-udev-persistent-net-rules/.

The solution is to remove the udev rules for persistent network interfaces and to disable the generator script for these rules. On the root cloning node, do the following:

  1. Remove the file /etc/udev/rules.d/70-persistent-net.rules
  2. Add to the top of the file /lib/udev/rules.d/75-persistent-net-generator.rules the following lines:
# skip generation of persistent network interfaces
ACTION=="*",                            GOTO="persistent_net_generator_end"
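Adding lines to the top of a file is easy to get wrong by hand, so the second step can be scripted. The sketch below uses a hypothetical helper, prepend_lines, that reads the new lines from standard input and rewrites the target in place so its ownership and permissions are kept.

```shell
#!/bin/sh
# Prepend the lines read from stdin to file $1.
prepend_lines() {
    tmp=$(mktemp) || return 1
    # Concatenate stdin followed by the existing contents, then copy back.
    cat - "$1" > "$tmp" && cat "$tmp" > "$1"
    rm -f "$tmp"
}

# Example (as root, not executed here):
#   printf '%s\n' \
#     '# skip generation of persistent network interfaces' \
#     'ACTION=="*", GOTO="persistent_net_generator_end"' |
#     prepend_lines /lib/udev/rules.d/75-persistent-net-generator.rules
```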

Sysstat

Sysstat (http://pagesperso-orange.fr/sebastien.godard/) is a useful suite of tools for measuring the performance of different system components. Unfortunately, it adds some unneeded cron entries for collecting a historic set of system performance data. Though these cron entries point to disabled scripts, we will disable them nonetheless. Edit the file /etc/cron.d/sysstat and comment out all lines.

# The first element of the path is a directory where the debian-sa1
# script is located
#PATH=/usr/lib/sysstat:/usr/sbin:/usr/sbin:/usr/bin:/sbin:/bin

# Activity reports every 10 minutes everyday
#5-55/10 * * * * root command -v debian-sa1 > /dev/null && debian-sa1 1 1

# Additional run at 23:59 to rotate the statistics file
#59 23 * * * root command -v debian-sa1 > /dev/null && debian-sa1 60 2
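Commenting out every line can also be done with a one-line filter. This is a sketch, with comment_out as a hypothetical name: it prefixes '#' to each non-empty line that is not already a comment and writes the result to standard output.

```shell
#!/bin/sh
# Prefix '#' to every non-empty line that does not already start with '#'.
comment_out() {
    sed 's/^\([^#]\)/#\1/'
}

# Example (not executed here):
#   comment_out < /etc/cron.d/sysstat > /tmp/sysstat.disabled
```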

Upgrade September 2013

The upgrade was motivated by the need for a kernel newer than 2.6.35 for PAPI.

Before:

 uname -a
 Linux hcl03.heterogeneous.ucd.ie 2.6.32-5-686 #1 SMP Wed Jan 12 04:01:41 UTC 2011 i686 GNU/Linux

After:

 Linux hcl03.heterogeneous.ucd.ie 3.2.0-4-686-pae #1 SMP Debian 3.2.46-1+deb7u1 i686 GNU/Linux

Nodes upgraded (11): hcl01 hcl03 hcl05 hcl06 hcl10 hcl11 hcl12 hcl13 hcl14 hcl15 hcl16

Nodes not upgraded because unbootable before upgrade (5): hcl02 hcl04 hcl07 hcl08 hcl09


 vi  /etc/apt/sources.list

Replace every occurrence of squeeze with wheezy.
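When the same edit has to be made on many nodes, a filter is less error-prone than editing by hand. The sketch below shows the substitution; the function name to_wheezy is hypothetical, and the mirror URL in the example comment is a placeholder rather than the one used on the cluster.

```shell
#!/bin/sh
# Rewrite an APT sources list read from stdin, replacing the squeeze
# release name with wheezy (suites such as squeeze/updates are covered too).
to_wheezy() {
    sed 's/squeeze/wheezy/g'
}

# Example (not executed here):
#   echo 'deb http://mirror.example.org/debian squeeze main' | to_wheezy
#   to_wheezy < /etc/apt/sources.list > /tmp/sources.list.wheezy
```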

Then, in parallel on all nodes using Cluster SSH (cssh):

 apt-get update
 apt-get dist-upgrade
 reboot

Had to manually boot hcl12 and hcl13 because of a temperature warning.

 apt-get autoremove

NFS was not starting on boot, so the following line was added to /etc/rc.local:

 mount /home

Changed /etc/fstab line to:

 192.168.20.254:/home  /home  nfs  rw,rsize=4096,wsize=4096,hard,intr  0 0

Questions:

 Q: Configuration file `/etc/pam.d/sshd' modified (by you or by a script) since installation.
 A: Keep your currently-installed version.

 Q: A new version of configuration file /etc/default/nfs-common is available.
 A: Keep the installed version.

 Q: A new version of configuration file /etc/default/grub is available.
 A: Install the package maintainer's version.

Problems

Every package installation produced the following warnings:

 ldconfig: Can't link /opt/lib//opt/lib/libf77blas.so to libf77blas.so
 ldconfig: Can't link /opt/lib//opt/lib/libcblas.so to libcblas.so
 ldconfig: Can't link /opt/lib//opt/lib/libatlas.so to libatlas.so

Installation of PAPI - failed

get papi-5.2.0.tar.gz

 tar -xzvf papi-5.2.0.tar.gz
 cd papi-5.2.0/src/
 ./configure --prefix=/usr/local
 make
 make test

make test failed: the hardware is too old, and the kernel would need to be patched. Installing PAPI on HCL has been abandoned for now.

Installation of Extrae

 apt-get install 
 apt-get install libunwind7-dev
 apt-get install binutils-dev

get extrae-2.4.0.tar.bz2

 tar -xjvf extrae-2.4.0.tar.bz2
 cd extrae-2.4.0
 ./configure --with-mpi=/usr/lib/openmpi --with-unwind=/usr --without-dyninst --without-papi --prefix=/usr/local

Note: this configuration turns off many useful features of Extrae because of missing packages.

 make 
 make install