Abstract | Modern computer clusters are becoming increasingly heterogeneous, not only in
their processing units but also in the underlying network. In grid
networks, it is common to combine optical fiber with Ethernet or InfiniBand networks.
These distributed resources have varying network properties, but even supercomputers
using vendor-specific interconnects are often heterogeneous in terms of both latency
and achievable bandwidth between different process pairs. Network heterogeneity is thus
a general problem whose magnitude varies from one domain to another.
The performance of MPI collective communication operations (e.g. broadcasts)
depends strongly on how well they account for the properties of such networks. The
advantages of collective communication that is aware of the network topology have been
clearly demonstrated in grid computing, and this aspect is increasingly important in
supercomputing. Providing network topology to collective communication
should not be the task of the application programmer; parallel programs need to be
written in a network-oblivious way. The Message Passing Interface, for example, was
designed so that applications need not supply any network topology. It is nevertheless
widely recognized that topology awareness is needed for optimal performance. Modern MPI
implementations can provide this feature transparently.
In this thesis, we investigate and solve a number of issues that arise when designing
efficient collective communication for complex platforms. We first focus on the technical
difficulties of running and configuring MPI in complex grid environments. Grids
are accessible and attractive to many researchers, but difficult to use for
message passing. We propose solutions to both the technical and the configuration problems.
We then develop a novel method for measuring performance, in particular
achievable bandwidth, at large scale in complex networks. The method is inspired
by the adaptive nature of peer-to-peer protocols such as BitTorrent. The resulting data
represent a simple performance model. We then apply data analysis techniques such as
clustering to recognize bandwidth clusters. We also design a hierarchical
clustering algorithm that reconstructs the network as a hierarchy, which
can be interpreted as a network topology. An alternative method reconstructs the
topology as a tree.
Overall, this process yields a generic technique for deriving topology from performance,
independent of the underlying network technology. To complete the process
of designing efficient communication middleware, we also describe how both performance
and topology can be used as input for performance- or topology-aware collective
communication. Topology-aware communication has been studied in the past, and we
outline some general hierarchical solutions. In addition, we use a flexible software tool
that separates performance models from general collective algorithms. |