Network-aware optimization of communications for parallel matrix multiplication on hierarchical HPC platforms

Publication Type: Journal Article
Year of Publication: 2016
Authors: Malik, T., V. Rychkov, and A. Lastovetsky
Journal Title: Concurrency and Computation: Practice and Experience
Volume: 28
Issue: 3
Pages: 802-821
Journal Date: 03/2016
Publisher: Wiley
Abstract: Communications on hierarchical heterogeneous high-performance computing platforms can be optimized based on topology and performance information. For MPI, a major programming tool for such platforms, a number of topology-aware and performance-aware implementations of collective operations have been proposed for optimal scheduling of messages. This approach improves application performance without requiring modification of the application source code. However, it is applicable only to collective operations and does not affect the parts of the application that are based on point-to-point exchanges. In this paper, we address the problem of efficient execution of data-parallel applications on interconnected clusters and present optimizations that improve the data partitioning by taking into account the entire communication flow of the application. This approach is also non-intrusive to the source code, but it is application-specific. For illustration, we use parallel matrix multiplication, where the matrices are partitioned into irregular two-dimensional rectangles assigned to different processors and arranged in columns, and the processors communicate over this partition vertically and horizontally. By rearranging the rectangles, we can minimize communications between different levels of the network hierarchy. Finding the optimal arrangement is NP-complete; therefore, we propose two heuristic approaches based on evaluation of the communication flow on the given network topology. We demonstrate the correctness and efficiency of the proposed approaches by experimental results on multicore nodes and interconnected heterogeneous clusters.
DOI: 10.1002/cpe.3609
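The abstract's core idea, rearranging rectangles in a column-based partition so that heavy exchanges stay within one level of the network hierarchy, can be illustrated with a toy cost model. The sketch below is a hypothetical simplification, not the paper's actual model or heuristics: it assumes a 2x2 grid of rectangles, two clusters with inter-cluster links 10x more expensive, and horizontal (row) exchanges carrying twice the volume of vertical (column) exchanges; it then finds the cheapest placement by exhaustive search, which is only feasible for tiny grids (the general problem is NP-complete, hence the paper's heuristics).

```python
from itertools import permutations

# Assumed model parameters (illustrative only).
INTER_CLUSTER_WEIGHT = 10.0   # per-unit cost over an inter-cluster link
INTRA_CLUSTER_WEIGHT = 1.0    # per-unit cost within a cluster
HORIZONTAL_VOLUME = 2.0       # assumed relative volume of a row exchange
VERTICAL_VOLUME = 1.0         # assumed relative volume of a column exchange

def link_weight(ca, cb):
    """Per-unit cost of moving data between two processors' clusters."""
    return INTRA_CLUSTER_WEIGHT if ca == cb else INTER_CLUSTER_WEIGHT

def comm_cost(arrangement, clusters, columns=2, rows=2):
    """Estimated total communication cost for one placement of processors
    on a columns x rows grid; `arrangement` lists processors column by
    column, top to bottom."""
    grid = [arrangement[c * rows:(c + 1) * rows] for c in range(columns)]
    cost = 0.0
    # Horizontal exchanges: each processor talks to processors in the
    # same row of every other column.
    for r in range(rows):
        for c1 in range(columns):
            for c2 in range(c1 + 1, columns):
                cost += HORIZONTAL_VOLUME * link_weight(
                    clusters[grid[c1][r]], clusters[grid[c2][r]])
    # Vertical exchanges: each processor talks to the other processors
    # in its own column.
    for c in range(columns):
        for r1 in range(rows):
            for r2 in range(r1 + 1, rows):
                cost += VERTICAL_VOLUME * link_weight(
                    clusters[grid[c][r1]], clusters[grid[c][r2]])
    return cost

def best_arrangement(procs, clusters):
    """Exhaustive search over placements (tiny grids only)."""
    return min(permutations(procs), key=lambda a: comm_cost(a, clusters))

# Processors 0,1 on cluster "A"; 2,3 on cluster "B".
clusters = {0: "A", 1: "A", 2: "B", 3: "B"}
naive = (0, 1, 2, 3)  # same-cluster processors share a column
best = best_arrangement([0, 1, 2, 3], clusters)
```

Under these assumed weights, the naive placement routes the heavy horizontal traffic over inter-cluster links (cost 42), while the search moves same-cluster processors into the same row so that only the lighter vertical traffic crosses clusters (cost 24), mirroring the paper's observation that rearranging rectangles can shift communication away from slow levels of the hierarchy.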