Optimization of Data-Parallel Scientific Applications on Highly Heterogeneous Modern HPC Platforms

Title: Optimization of Data-Parallel Scientific Applications on Highly Heterogeneous Modern HPC Platforms
Publication Type: Thesis
Year of Publication: 2014
Authors: Zhong, Z.
Thesis Type: PhD
Advisor: Lastovetsky, A.
Academic Department: School of Computer Science and Informatics
University: University College Dublin
Number of Pages: 142
Date Published: 08/2014
Abstract: Over the past decade, microprocessor design has shifted to a model in which the microprocessor has multiple homogeneous processing units, known as cores, as a result of heat dissipation and energy consumption issues. Meanwhile, the demand for heterogeneity in computing systems has increased in recent years, driven by the need for high-performance computing. The current trend in gaining high computing power is to incorporate specialized processing resources, such as manycore Graphics Processing Units (GPUs), into multicore systems, thus making a computing system heterogeneous. Maximum performance of data-parallel scientific applications on heterogeneous platforms can be achieved by balancing the load between heterogeneous processing elements. Data-parallel applications can be load balanced by partitioning data in proportion to the performance of the platform's computing devices. However, load balancing on such platforms is complicated by several factors, such as contention for shared system resources, non-uniform memory access (NUMA), limited GPU memory, and the low bandwidth of PCIe, which connects the host processor and the GPU. In this thesis, we present methods of performance modeling and performance measurement on dedicated multicore and multi-GPU systems. We model a multicore and multi-GPU system as a set of heterogeneous abstract processors determined by the configuration of the parallel application. Each abstract processor represents a processing unit made up of one processing element or a group of processing elements executing one computational kernel of the application. We group processing units by shared resources and measure the performance of the processing units in each group simultaneously, thereby taking into account the influence of resource contention. We investigate how resource contention, and process mapping on NUMA systems, affect the performance of processing units.
Using the proposed method for measuring performance, we build functional performance models of abstract processors and use these models to partition the data of data-parallel applications so that the workload is balanced. We evaluate the proposed methods with two typical data-parallel applications, namely parallel matrix multiplication and numerical simulation of lid-driven cavity flow. Experimental results demonstrate that data partitioning algorithms based on functional performance models built using the proposed methods are able to balance the workload of data-parallel applications on heterogeneous multicore and multi-GPU platforms.
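To make the idea of partitioning data with functional performance models concrete, here is a minimal sketch. It is not the partitioning algorithm from the thesis; it swaps in a simple greedy scheme and assumes each abstract processor is described by a hypothetical speed function (work units per second as a function of assigned problem size) whose predicted execution time never decreases as the size grows. The names `partition`, `cpu_speed`, and `gpu_speed`, and all numbers, are invented for illustration and are not measurements from the thesis.

```python
import heapq
from typing import Callable, List, Sequence


def partition(n: int, speeds: Sequence[Callable[[int], float]]) -> List[int]:
    """Split n work units so that predicted finish times are as even as possible.

    speeds[i](x) is a hypothetical speed function: the units per second that
    abstract processor i sustains on a problem of size x (x >= 1).
    Assumption: the predicted time x / speeds[i](x) does not decrease with x.
    """
    alloc = [0] * len(speeds)
    # Heap of (predicted time if processor i received one more unit, i).
    heap = [(1 / s(1), i) for i, s in enumerate(speeds)]
    heapq.heapify(heap)
    for _ in range(n):
        # Give the next unit to the processor that would finish earliest.
        _, i = heapq.heappop(heap)
        alloc[i] += 1
        nxt = alloc[i] + 1
        heapq.heappush(heap, (nxt / speeds[i](nxt), i))
    return alloc


# Illustrative speed functions (made-up numbers, not thesis data).
def cpu_speed(x: int) -> float:
    return 100.0  # constant speed, independent of problem size


def gpu_speed(x: int) -> float:
    # Fast while the data fits in GPU memory, slower once it does not.
    return 400.0 if x <= 600 else 150.0


if __name__ == "__main__":
    print(partition(1000, [cpu_speed, gpu_speed]))  # [400, 600]
```

With these made-up models, a constant-speed split of 1000 units would send 800 units to the GPU. The size-dependent model instead caps the GPU at the 600 units that fit in its memory and gives the remaining 400 to the CPU, which is the kind of effect a functional (size-dependent) model is meant to capture.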
ziming_thesis.pdf (1.06 MB)